# Model

Plan:

- Develop a model to predict wine quality
- Use drivers identified in explore to build predictive regression models
- Create and run a baseline model with sklearn's `DummyRegressor` to compare our results to
- Create and run `LinearRegression`, `LassoLars`, and polynomial regression models
- Use the insights from the highest-performing model (the one with the lowest test RMSE) to confirm our initial hypotheses and insights on the features that are the biggest drivers of wine quality

In [4]:
import numpy as np
import pandas as pd

from sklearn.preprocessing import PolynomialFeatures

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LassoLars

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.preprocessing import MinMaxScaler

from wrangle import split_data


## Preprocessing before Clustering

Features: `['alcohol', 'volatile acidity', 'chlorides']`

Scale features:
- MinMax

Split the data before scaling, so the scaler is fit on the training set only

In [5]:
df = pd.read_csv('wine_data.csv') 



In [6]:
features = ['alcohol', 'volatile acidity', 'chlorides']
target = ['quality']

train, validate, test = split_data(df[features + target + ['color']],
                                   validate_size=.15, test_size=.15, 
                                   stratify_col='color', random_state=123)

# drop color column
train = train.iloc[:,:-1]
validate = validate.iloc[:,:-1]
test = test.iloc[:,:-1]
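
The `split_data` helper comes from a local `wrangle` module that is not shown here. As a rough illustration only, a three-way stratified split with the same arguments could look like the sketch below. The name `split_data_sketch`, its body, and the toy data are assumptions, not the actual helper.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_data_sketch(df, validate_size=.15, test_size=.15,
                      stratify_col=None, random_state=None):
    """Hypothetical stand-in for wrangle.split_data (the real helper is not shown)."""
    # Convert fractions of the full dataset into row counts
    n_validate = round(len(df) * validate_size)
    n_test = round(len(df) * test_size)

    # First carve off the test set
    strat = df[stratify_col] if stratify_col else None
    train_validate, test = train_test_split(
        df, test_size=n_test, stratify=strat, random_state=random_state)

    # Then carve the validate set out of what remains
    strat = train_validate[stratify_col] if stratify_col else None
    train, validate = train_test_split(
        train_validate, test_size=n_validate, stratify=strat,
        random_state=random_state)

    return train, validate, test


# Toy data: 100 rows, two colors in a 3:1 ratio
toy = pd.DataFrame({'alcohol': range(100),
                    'color': ['white'] * 75 + ['red'] * 25})
toy_train, toy_validate, toy_test = split_data_sketch(
    toy, stratify_col='color', random_state=123)
print(len(toy_train), len(toy_validate), len(toy_test))  # 70 15 15
```

Treating both sizes as fractions of the full dataset is a design choice in this sketch; the real helper may define them differently.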

In [8]:
print(len(train), len(validate), len(test))
train.head()

3724 798 798


Unnamed: 0,alcohol,volatile acidity,chlorides,quality
1179,11.0,0.64,0.094,5
3674,12.1,0.26,0.025,7
1590,9.5,0.29,0.046,5
2743,10.1,0.22,0.054,6
1659,11.9,0.33,0.038,7


In [9]:
# features only
X_train = train[features]
X_validate = validate[features]
X_test = test[features]

# target only
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

In [10]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)
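
Note that `fit_transform` and `transform` return NumPy arrays, so the scaled splits lose their column names and index. If those are needed later, one option is to wrap the result back into a DataFrame. The sketch below uses toy data, and `scale_to_frame` is a hypothetical helper name, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy stand-ins for X_train / X_validate
X_train_toy = pd.DataFrame({'alcohol': [9.0, 10.0, 12.0],
                            'chlorides': [0.02, 0.05, 0.10]},
                           index=[11, 42, 7])
X_validate_toy = pd.DataFrame({'alcohol': [11.0, 13.0],
                               'chlorides': [0.04, 0.01]},
                              index=[3, 8])

toy_scaler = MinMaxScaler().fit(X_train_toy)  # learn min/max from train only


def scale_to_frame(scaler, X):
    """Hypothetical helper: transform X and keep its column names and index."""
    return pd.DataFrame(scaler.transform(X), columns=X.columns, index=X.index)


X_train_toy_scaled = scale_to_frame(toy_scaler, X_train_toy)
X_validate_toy_scaled = scale_to_frame(toy_scaler, X_validate_toy)

print(X_train_toy_scaled)
print(X_validate_toy_scaled)  # values can fall outside [0, 1] on unseen data
```

Because the scaler only sees the training minimum and maximum, validate and test values can land below 0 or above 1. This is expected and does not indicate leakage.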

## Modeling before Clustering

**Baseline Model**

In [12]:
dummy = DummyRegressor().fit(X_train_scaled, y_train)

In [13]:
train['baseline_pred'] = dummy.predict(X_train_scaled)

In [14]:
train.head()

Unnamed: 0,alcohol,volatile acidity,chlorides,quality,baseline_pred
1179,11.0,0.64,0.094,5,5.8029
3674,12.1,0.26,0.025,7,5.8029
1590,9.5,0.29,0.046,5,5.8029
2743,10.1,0.22,0.054,6,5.8029
1659,11.9,0.33,0.038,7,5.8029


Evaluate

In [15]:
# RMSE
mean_squared_error(train['quality'],
                   train['baseline_pred'],
                   squared=False)

0.886909313701455

In [16]:
# R2
r2_score(train['quality'],
         train['baseline_pred'])

0.0
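
These two numbers follow directly from how the baseline works: `DummyRegressor` with its default `strategy='mean'` predicts the training mean for every row, so its train RMSE equals the (population) standard deviation of `quality` and its train R2 is 0. The small check below uses toy target values as a stand-in. It takes the square root of the MSE with NumPy so that it does not depend on whether the installed scikit-learn accepts `squared=False`.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Toy target values standing in for wine quality
y_toy = np.array([5, 7, 5, 6, 7])
X_toy = np.zeros((len(y_toy), 1))  # DummyRegressor ignores the features

toy_dummy = DummyRegressor(strategy='mean').fit(X_toy, y_toy)
toy_pred = toy_dummy.predict(X_toy)

toy_rmse = np.sqrt(mean_squared_error(y_toy, toy_pred))
print(toy_pred[0], y_toy.mean())   # every prediction is the training mean
print(toy_rmse, y_toy.std())       # RMSE equals the population std of y
print(r2_score(y_toy, toy_pred))   # 0.0
```

Any model worth keeping therefore has to beat an RMSE equal to the spread of the target itself.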

**Linear Regression Model**

In [17]:
lm = LinearRegression().fit(X_train_scaled, y_train)

In [18]:
train['lm_pred'] = lm.predict(X_train_scaled)

Evaluate

In [19]:
# RMSE
mean_squared_error(train['quality'],
                   train['lm_pred'],
                   squared=False)

0.7564477570865422

In [20]:
# R2
r2_score(train['quality'],
         train['lm_pred'])

0.2725561981288629

In [21]:
pd.DataFrame(index=list(X_train.columns) + ['intercept'],
             columns=['coefficients'],
             data=np.append(lm.coef_ * scaler.scale_, lm.intercept_))

Unnamed: 0,coefficients
alcohol,0.33807
volatile acidity,-1.259275
chlorides,0.333208
intercept,5.255723


In [22]:
train

Unnamed: 0,alcohol,volatile acidity,chlorides,quality,baseline_pred,lm_pred
1179,11.0,0.64,0.094,5,5.8029,5.592061
3674,12.1,0.26,0.025,7,5.8029,6.419471
1590,9.5,0.29,0.046,5,5.8029,5.509709
2743,10.1,0.22,0.054,6,5.8029,5.803366
1659,11.9,0.33,0.038,7,5.8029,6.268040
...,...,...,...,...,...,...
1376,10.5,0.26,0.049,8,5.8029,5.886557
2341,9.4,0.26,0.201,6,5.8029,5.565327
1978,9.1,0.25,0.052,6,5.8029,5.426851
684,9.7,0.61,0.081,6,5.8029,5.186017


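Our model starts its prediction at 5.26 (the intercept, which is left in scaled space and so corresponds to every feature sitting at its training-set minimum) and:
- adds 0.34 for every 1 unit of alcohol
- subtracts 0.13 for every 0.1 units of volatile acidity
- adds 0.03 for every 0.1 units of chlorides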
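
The coefficient table multiplies `lm.coef_` by `scaler.scale_` to express the slopes in raw units, but it reports `lm.intercept_` unchanged. Since `MinMaxScaler` computes `X_scaled = X * scale_ + min_`, the intercept can be moved to raw units as well. The sketch below checks this on synthetic data; the data and variable names are illustrative assumptions. In the notebook `y_train` is a one-column DataFrame, so `lm.coef_` would be 2-D and need `.ravel()` first.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic features in roughly wine-like ranges, and an exactly linear target
X_syn = rng.uniform([8, 0.1, 0.01], [14, 1.0, 0.2], size=(50, 3))
y_syn = 3 + 0.3 * X_syn[:, 0] - 1.2 * X_syn[:, 1] + 0.5 * X_syn[:, 2]

syn_scaler = MinMaxScaler().fit(X_syn)
syn_lm = LinearRegression().fit(syn_scaler.transform(X_syn), y_syn)

# y_hat = intercept_ + coef_ . (X * scale_ + min_)
#       = (intercept_ + coef_ . min_) + (coef_ * scale_) . X
coef_raw = syn_lm.coef_ * syn_scaler.scale_
intercept_raw = syn_lm.intercept_ + syn_lm.coef_ @ syn_scaler.min_

# Predictions from the raw-unit equation match the fitted model
manual = intercept_raw + X_syn @ coef_raw
print(np.allclose(manual, syn_lm.predict(syn_scaler.transform(X_syn))))
print(coef_raw, intercept_raw)
```

With the intercept converted this way, the "starting point" would be the prediction when every raw feature is 0 rather than when every feature is at its training minimum.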

In [25]:
def run_lm_model(X_train_scaled, y_train, feature_names, scaler):
    """Fit a linear regression, print train RMSE and R2, and show coefficients."""
    lm = LinearRegression().fit(X_train_scaled, y_train)
    pred = lm.predict(X_train_scaled)

    # RMSE
    rmse = mean_squared_error(y_train, pred, squared=False)
    # R2
    r2 = r2_score(y_train, pred)

    print(f'RMSE = {rmse}\nR2 = {r2}')
    # X_train_scaled is a NumPy array with no .columns, so feature names are passed in
    display(pd.DataFrame(index=list(feature_names) + ['intercept'],
                         columns=['coefficients'],
                         data=np.append(lm.coef_ * scaler.scale_, lm.intercept_)))

    return rmse, r2
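
All metrics so far are computed on the training split only. To compare models fairly, the same metrics would normally also be computed on the validate split. A minimal sketch of that pattern follows; `score_model` is a hypothetical helper, and the synthetic arrays stand in for the scaled train and validate splits.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score


def score_model(model, X, y):
    """Hypothetical helper: return (RMSE, R2) of a fitted model on one split."""
    pred = model.predict(X)
    return np.sqrt(mean_squared_error(y, pred)), r2_score(y, pred)


# Synthetic stand-ins for the scaled train and validate splits
rng = np.random.default_rng(123)
X_all = rng.uniform(size=(200, 3))
y_all = 5 + X_all @ np.array([1.0, -0.7, 0.1]) + rng.normal(scale=0.5, size=200)
X_tr, y_tr, X_val, y_val = X_all[:150], y_all[:150], X_all[150:], y_all[150:]

demo_lm = LinearRegression().fit(X_tr, y_tr)
scores = {}
for name, (X, y) in {'train': (X_tr, y_tr), 'validate': (X_val, y_val)}.items():
    scores[name] = score_model(demo_lm, X, y)
    print(f'{name}: RMSE = {scores[name][0]:.3f}, R2 = {scores[name][1]:.3f}')
```

A large gap between train and validate RMSE would suggest overfitting; the test split is best kept untouched until a final model is chosen.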

## Preprocessing after Clustering

Features: `['alcohol', 'volatile acidity', 'chlorides', 'clusters_1']`


Encode clusters
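
The cells below do not yet include the cluster feature. As a sketch only, assuming `clusters_1` holds integer cluster labels, the column could be one-hot encoded as shown here. The toy frames and the `encode_clusters` helper are hypothetical.

```python
import pandas as pd

# Toy frames; `clusters_1` is assumed to hold integer cluster labels
train_toy = pd.DataFrame({'alcohol': [9.5, 11.0, 12.1, 10.1],
                          'clusters_1': [0, 2, 1, 0]})
validate_toy = pd.DataFrame({'alcohol': [10.5, 9.4],
                             'clusters_1': [1, 0]})


def encode_clusters(df, col, columns=None):
    """Hypothetical helper: one-hot encode `col`, optionally aligning to `columns`."""
    out = pd.get_dummies(df, columns=[col], dtype=int)
    if columns is not None:
        # Add any dummy column missing from this split, filled with 0
        out = out.reindex(columns=columns, fill_value=0)
    return out


train_enc = encode_clusters(train_toy, 'clusters_1')
validate_enc = encode_clusters(validate_toy, 'clusters_1',
                               columns=train_enc.columns)
print(validate_enc)
```

Aligning validate and test to the training columns keeps the feature matrix shape consistent even when a split happens to contain no rows from one of the clusters.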

In [6]:
features = ['alcohol', 'volatile acidity', 'chlorides']
target = ['quality']

train, validate, test = split_data(df[features + target + ['color']],
                                   validate_size=.15, test_size=.15, 
                                   stratify_col='color', random_state=123)

# drop color column
train = train.iloc[:,:-1]
validate = validate.iloc[:,:-1]
test = test.iloc[:,:-1]

In [8]:
print(len(train), len(validate), len(test))
train.head()

3724 798 798


Unnamed: 0,alcohol,volatile acidity,chlorides,quality
1179,11.0,0.64,0.094,5
3674,12.1,0.26,0.025,7
1590,9.5,0.29,0.046,5
2743,10.1,0.22,0.054,6
1659,11.9,0.33,0.038,7


In [9]:
# features only
X_train = train[features]
X_validate = validate[features]
X_test = test[features]

# target only
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

In [10]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

## Modeling after Clustering

**Baseline Model**

In [12]:
dummy = DummyRegressor().fit(X_train_scaled, y_train)

In [13]:
train['baseline_pred'] = dummy.predict(X_train_scaled)

In [14]:
train.head()

Unnamed: 0,alcohol,volatile acidity,chlorides,quality,baseline_pred
1179,11.0,0.64,0.094,5,5.8029
3674,12.1,0.26,0.025,7,5.8029
1590,9.5,0.29,0.046,5,5.8029
2743,10.1,0.22,0.054,6,5.8029
1659,11.9,0.33,0.038,7,5.8029


Evaluate

In [15]:
# RMSE
mean_squared_error(train['quality'],
                   train['baseline_pred'],
                   squared=False)

0.886909313701455

In [16]:
# R2
r2_score(train['quality'],
         train['baseline_pred'])

0.0

**Linear Regression Model**

In [17]:
lm = LinearRegression().fit(X_train_scaled, y_train)

In [18]:
train['lm_pred'] = lm.predict(X_train_scaled)

Evaluate

In [19]:
# RMSE
mean_squared_error(train['quality'],
                   train['lm_pred'],
                   squared=False)

0.7564477570865422

In [20]:
# R2
r2_score(train['quality'],
         train['lm_pred'])

0.2725561981288629

In [21]:
pd.DataFrame(index=list(X_train.columns) + ['intercept'],
             columns=['coefficients'],
             data=np.append(lm.coef_ * scaler.scale_, lm.intercept_))

Unnamed: 0,coefficients
alcohol,0.33807
volatile acidity,-1.259275
chlorides,0.333208
intercept,5.255723


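Our model starts its prediction at 5.26 (the intercept, which is left in scaled space and so corresponds to every feature sitting at its training-set minimum) and:
- adds 0.34 for every 1 unit of alcohol
- subtracts 0.13 for every 0.1 units of volatile acidity
- adds 0.03 for every 0.1 units of chlorides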
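
The plan also lists `LassoLars` and polynomial regression, both of which are imported above but not yet run. A minimal sketch of how they could be fit and scored is shown below. The synthetic data stands in for `X_train_scaled` and `y_train`, and `alpha=0.01` and `degree=2` are untuned, illustrative choices.

```python
import numpy as np
from sklearn.linear_model import LassoLars, LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-ins for X_train_scaled / y_train
rng = np.random.default_rng(123)
X_demo = rng.uniform(size=(200, 3))
y_demo = (5 + X_demo[:, 0] - 0.7 * X_demo[:, 1]
          + 0.5 * X_demo[:, 0] * X_demo[:, 1]
          + rng.normal(scale=0.3, size=200))

# LassoLars: linear model with an L1 penalty (alpha is an untuned guess)
lars = LassoLars(alpha=0.01).fit(X_demo, y_demo)

# Polynomial regression: expand features to degree 2, then fit plain OLS
poly = PolynomialFeatures(degree=2, include_bias=False)
X_demo_poly = poly.fit_transform(X_demo)
poly_lm = LinearRegression().fit(X_demo_poly, y_demo)

for name, model, feats in [('LassoLars', lars, X_demo),
                           ('Polynomial', poly_lm, X_demo_poly)]:
    rmse = np.sqrt(mean_squared_error(y_demo, model.predict(feats)))
    print(f'{name}: train RMSE = {rmse:.3f}')
```

On validate and test data the fitted `poly` object would be reused with `poly.transform`, in the same way the scaler is fit on train only.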