# Model

Plan:

- Develop a model to predict quality
- Use drivers identified in explore to build predictive regression models
- Create and run a baseline model with sklearn's `DummyRegressor` to compare our results to
- Create and run `LinearRegression`, `LassoLars`, and polynomial regression models
- Use the insights from the best-performing model (the one with the lowest test RMSE) to confirm our initial hypotheses and insights on the features that are the biggest drivers of wine quality
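
As a rough sketch of how the planned comparison might look, the snippet below fits a baseline, `LinearRegression`, `LassoLars`, and a degree-2 polynomial regression, then scores each on a held-out split. The synthetic data, the `alpha=0.01` setting, the polynomial degree, and the `rmse` helper are assumptions for illustration, not values from this project.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import PolynomialFeatures

# synthetic stand-in data (NOT the wine data): 3 features, noisy target
rng = np.random.default_rng(123)
X = rng.uniform(0, 1, size=(500, 3))
y = 5 + 2 * X[:, 0] - 1.5 * X[:, 1] + X[:, 2] ** 2 + rng.normal(0, 0.3, size=500)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.3, random_state=123)

def rmse(y_true, y_pred):
    # square root of the MSE; avoids relying on the `squared` keyword
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# polynomial regression = LinearRegression on expanded features
poly = PolynomialFeatures(degree=2, include_bias=False)
X_tr_poly = poly.fit_transform(X_tr)   # fit the expansion on train only
X_val_poly = poly.transform(X_val)

# each entry: (estimator, features to fit on, features to validate on)
models = {
    'baseline': (DummyRegressor(), X_tr, X_val),
    'linear': (LinearRegression(), X_tr, X_val),
    'lasso_lars': (LassoLars(alpha=0.01), X_tr, X_val),
    'poly_deg2': (LinearRegression(), X_tr_poly, X_val_poly),
}

results = {}
for name, (model, X_fit, X_check) in models.items():
    model.fit(X_fit, y_tr)
    results[name] = {'train_rmse': rmse(y_tr, model.predict(X_fit)),
                     'validate_rmse': rmse(y_val, model.predict(X_check))}

for name, scores in results.items():
    print(name, scores)
```

Keeping every model's scores in one dictionary makes it easy to pick the model with the lowest validate RMSE before touching the test split.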

In [11]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import PolynomialFeatures

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LassoLars

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.preprocessing import MinMaxScaler

from wrangle import split_data


## Preprocessing before Clustering

Features: `['alcohol', 'volatile acidity', 'chlorides']`

Encode Clusters

Scale features:
- MinMax

Before scaling, split data

In [12]:
df = pd.read_csv('wine_data_model.csv') 
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,red,clusters_1,clusters_2,clusters_3
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1,0,1,3
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1,0,1,3
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1,0,1,3
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1,1,3,3
4,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,1,0,1,3


Encode Clusters

In [13]:
df = pd.concat([df, 
                pd.get_dummies(df[['clusters_1','clusters_2','clusters_3']].astype(str))],
                axis=1)
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,...,clusters_1_2,clusters_1_3,clusters_2_0,clusters_2_1,clusters_2_2,clusters_2_3,clusters_3_0,clusters_3_1,clusters_3_2,clusters_3_3
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,...,0,0,0,1,0,0,0,0,0,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,...,0,0,0,1,0,0,0,0,0,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,...,0,0,0,1,0,0,0,0,0,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,...,0,0,0,0,0,1,0,0,0,1
4,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,...,0,0,0,1,0,0,0,0,0,1
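
The cast to `str` in the cell above matters: `pd.get_dummies` only expands object, string, or category columns, so integer cluster labels would otherwise pass through unchanged. A minimal sketch on a toy frame (the `demo` data is made up, and `dtype=int` is an optional choice to get 0/1 columns rather than booleans):

```python
import pandas as pd

# toy cluster labels, not the project data
demo = pd.DataFrame({'clusters_1': [0, 1, 3, 0]})

# integer column: returned as-is, no indicator columns are created
as_int = pd.get_dummies(demo)

# string column: one indicator column per distinct label
as_str = pd.get_dummies(demo.astype(str), dtype=int)

print(list(as_int.columns))
print(list(as_str.columns))
```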


Split data

In [14]:
train, validate, test = split_data(df,
                                   validate_size=.15, test_size=.15, random_state=123)

Scale data

In [15]:
# fit the scaler on train only, then apply the same transform to validate and test
scaler = MinMaxScaler()

train_sc = pd.concat([pd.DataFrame(data=scaler.fit_transform(train.drop(columns=['quality'])),
                                   columns=train.drop(columns=['quality']).columns),
                      train['quality'].reset_index(drop=True)],
                      axis=1)

validate_sc = pd.concat([pd.DataFrame(data=scaler.transform(validate.drop(columns=['quality'])),
                                   columns=validate.drop(columns=['quality']).columns),
                         validate['quality'].reset_index(drop=True)],
                         axis=1)

test_sc = pd.concat([pd.DataFrame(data=scaler.transform(test.drop(columns=['quality'])),
                                   columns=test.drop(columns=['quality']).columns),
                     test['quality'].reset_index(drop=True)],
                     axis=1)
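
Splitting before scaling means the scaler learns its minimum and maximum from train alone, so nothing about validate or test leaks into the transform. One consequence, sketched below on made-up `alcohol` values, is that scaled validate values can fall outside the [0, 1] range:

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# toy values, not the project data
train_demo = pd.DataFrame({'alcohol': [9.0, 10.0, 12.0]})
validate_demo = pd.DataFrame({'alcohol': [8.5, 11.0, 13.0]})

# the scaler learns min=9.0 and max=12.0 from train only
scaler_demo = MinMaxScaler().fit(train_demo)

# validate is transformed with the train min/max, so 8.5 maps below 0 and 13.0 above 1
validate_scaled = scaler_demo.transform(validate_demo)
print(validate_scaled.ravel())
```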

In [16]:
train_sc

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,...,clusters_1_3,clusters_2_0,clusters_2_1,clusters_2_2,clusters_2_3,clusters_3_0,clusters_3_1,clusters_3_2,clusters_3_3,quality
0,0.289256,0.160000,0.174699,0.029032,0.043406,0.107639,0.317972,0.252264,0.372093,0.171429,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,5
1,0.190083,0.060000,0.168675,0.125806,0.035058,0.156250,0.331797,0.266925,0.550388,0.137143,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,6
2,0.314050,0.146667,0.150602,0.119355,0.070117,0.072917,0.241935,0.366106,0.310078,0.137143,...,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,5
3,0.214876,0.153333,0.301205,0.167742,0.043406,0.142361,0.241935,0.206123,0.356589,0.160000,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,7
4,0.148760,0.120000,0.000000,0.309677,0.043406,0.041667,0.241935,0.260457,0.558140,0.120000,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,6
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
3719,0.347107,0.033333,0.150602,0.016129,0.035058,0.048611,0.184332,0.143596,0.201550,0.080000,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,8
3720,0.264463,0.046667,0.228916,0.051613,0.058431,0.111111,0.207373,0.245364,0.317829,0.080000,...,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,8
3721,0.099174,0.200000,0.006024,0.032258,0.060100,0.086806,0.124424,0.160845,0.759690,0.285714,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,6
3722,0.355372,0.126667,0.198795,0.022581,0.055092,0.086806,0.216590,0.153083,0.201550,0.108571,...,1.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,6


## Modeling before Clustering

In [17]:
features = ['alcohol', 'volatile acidity', 'chlorides']
target = ['quality']

In [18]:
# pick features
X_train = train_sc[features]
X_validate = validate_sc[features]
X_test = test_sc[features]

# select the target
y_train = train_sc[target]
y_vaildate = validate_sc[target]
y_test = test_sc[target]

**Baseline Model**

In [19]:
dummy = DummyRegressor().fit(X_train, y_train)

Evaluate

In [20]:
# RMSE
mean_squared_error(y_train, dummy.predict(X_train),
                   squared=False)

0.877247065526987

In [21]:
# R2
r2_score(y_train, dummy.predict(X_train))

0.0
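
These two baseline numbers follow directly from how `DummyRegressor` works: with its default `mean` strategy it predicts the training mean for every row, so on the training data its RMSE equals the standard deviation of the target and its R2 is exactly 0. A small sketch on made-up quality scores:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# toy quality scores, not the project data
y_demo = np.array([5, 6, 5, 7, 6, 8], dtype=float)
X_demo = np.zeros((len(y_demo), 1))   # features are ignored by the dummy model

dummy_demo = DummyRegressor(strategy='mean').fit(X_demo, y_demo)
pred = dummy_demo.predict(X_demo)     # every prediction is y_demo.mean()

rmse_demo = np.sqrt(mean_squared_error(y_demo, pred))
r2_demo = r2_score(y_demo, pred)

print(rmse_demo, np.std(y_demo))      # the two values coincide
print(r2_demo)
```

Any model worth keeping therefore has to beat the spread of `quality` itself.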

**Linear Regression Model**

In [22]:
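def run_lm_model(X, y):
    """Fit a LinearRegression on X, y and score it on that same data."""
    
    # fit the model (it is refit on whatever X, y are passed in)
    lm = LinearRegression().fit(X, y)
    
    # predict once and reuse the predictions for both metrics
    y_pred = lm.predict(X)
    
    # RMSE
    rmse = mean_squared_error(y, y_pred, squared=False)
    # R2
    r2 = r2_score(y, y_pred)
    
    print(f'RMSE = {rmse}\nR2 = {r2}')
    
    return rmse, r2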

In [23]:
run_lm_model(X_train, y_train)

RMSE = 0.7398947098676343
R2 = 0.2886292628795276


(0.7398947098676343, 0.2886292628795276)

In [24]:
run_lm_model(X_validate, y_vaildate)

RMSE = 0.744114472234354
R2 = 0.2669769407054128


(0.744114472234354, 0.2669769407054128)
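
`run_lm_model` refits the model on whatever data it receives, so the validate call above scores a model that was trained on validate itself. One way to measure out-of-sample error is to fit once on train and score that same fitted model on each split. The sketch below assumes a hypothetical helper named `fit_and_score` and uses synthetic data; it is not part of the notebook's results.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

def fit_and_score(X_fit, y_fit, splits):
    """Fit on (X_fit, y_fit) once, then score that one model on every split."""
    lm = LinearRegression().fit(X_fit, y_fit)
    scores = {}
    for name, (X, y) in splits.items():
        y_pred = lm.predict(X)
        scores[name] = {'rmse': float(np.sqrt(mean_squared_error(y, y_pred))),
                        'r2': float(r2_score(y, y_pred))}
    return lm, scores

# synthetic stand-in data (NOT the wine data)
rng = np.random.default_rng(123)
X_all = rng.uniform(0, 1, size=(300, 3))
y_all = 5 + 2 * X_all[:, 0] - X_all[:, 1] + rng.normal(0, 0.5, size=300)
X_tr_demo, y_tr_demo = X_all[:200], y_all[:200]
X_val_demo, y_val_demo = X_all[200:], y_all[200:]

lm_demo, scores_demo = fit_and_score(
    X_tr_demo, y_tr_demo,
    {'train': (X_tr_demo, y_tr_demo), 'validate': (X_val_demo, y_val_demo)})
print(scores_demo)
```

With the notebook's variables the call would look like `fit_and_score(X_train, y_train, {'train': (X_train, y_train), 'validate': (X_validate, y_vaildate)})`.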

## Modeling on first group of clusters

Cluster 3 in the first group of clusters shows much higher quality than the other clusters, so its indicator column `clusters_1_3` is added to the features.

In [25]:
features = ['alcohol', 'volatile acidity', 'chlorides','clusters_1_3']
target = ['quality']

In [26]:
# pick features
X_train = train_sc[features]
X_validate = validate_sc[features]
X_test = test_sc[features]

# select the target
y_train = train_sc[target]
y_vaildate = validate_sc[target]
y_test = test_sc[target]

**Linear Regression Model**

In [27]:
run_lm_model(X_train, y_train)

RMSE = 0.7395038708507858
R2 = 0.28938060763898354


(0.7395038708507858, 0.28938060763898354)

In [28]:
run_lm_model(X_validate, y_vaildate)

RMSE = 0.7440493843731463
R2 = 0.2671051704826418


(0.7440493843731463, 0.2671051704826418)

## Modeling on second group of clusters

Cluster 0 in the second group of clusters shows much higher quality than the other clusters, so its indicator column `clusters_2_0` is added to the features.

In [29]:
features = ['alcohol', 'volatile acidity', 'chlorides', 'clusters_2_0']
target = ['quality']

In [30]:
# pick features
X_train = train_sc[features]
X_validate = validate_sc[features]
X_test = test_sc[features]

# select the target
y_train = train_sc[target]
y_vaildate = validate_sc[target]
y_test = test_sc[target]

**Linear Regression Model**

In [31]:
run_lm_model(X_train, y_train)

RMSE = 0.7394397912775634
R2 = 0.2895037556474729


(0.7394397912775634, 0.2895037556474729)

In [32]:
run_lm_model(X_validate, y_vaildate)

RMSE = 0.7440616909123617
R2 = 0.26708092619526


(0.7440616909123617, 0.26708092619526)

## Modeling on third group of clusters

Cluster 2 in the third group of clusters shows much higher quality than the other clusters, so its indicator column `clusters_3_2` is added to the features.

In [33]:
features = ['alcohol', 'volatile acidity', 'chlorides', 'clusters_3_2']
target = ['quality']

In [34]:
# pick features
X_train = train_sc[features]
X_validate = validate_sc[features]
X_test = test_sc[features]

# select the target
y_train = train_sc[target]
y_vaildate = validate_sc[target]
y_test = test_sc[target]

**Linear Regression Model**

In [35]:
run_lm_model(X_train, y_train)

RMSE = 0.7392443941862726
R2 = 0.28987920354296515


(0.7392443941862726, 0.28987920354296515)

In [36]:
run_lm_model(X_validate, y_vaildate)

RMSE = 0.7439451470525476
R2 = 0.26731050535328305


(0.7439451470525476, 0.26731050535328305)
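
To put the four linear models side by side, the train RMSE values printed above can be collected into one table and compared with the baseline. The numbers below are copied from the cell outputs in this notebook; the row labels and column names are just illustrative choices.

```python
import pandas as pd

# train RMSE of the DummyRegressor baseline, as reported above
baseline_rmse = 0.877247065526987

# train RMSE of each linear model, as reported above
train_rmse = pd.Series({
    'no clusters': 0.7398947098676343,
    'clusters_1_3': 0.7395038708507858,
    'clusters_2_0': 0.7394397912775634,
    'clusters_3_2': 0.7392443941862726,
}, name='train_rmse')

summary = pd.DataFrame({
    'train_rmse': train_rmse,
    'gain_vs_baseline': baseline_rmse - train_rmse,
    'gain_vs_no_clusters': train_rmse['no clusters'] - train_rmse,
})

best = summary['train_rmse'].idxmin()
print(summary.round(4))
print('lowest train RMSE:', best)
```

Every linear model beats the baseline by a wide margin, while adding a cluster indicator lowers train RMSE by less than 0.001 compared with the three base features alone.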