# Model

Plan:

- Develop a model to predict quality
- Use drivers identified in explore to build predictive regression models
- Create and run a baseline model with sklearn's `DummyRegressor` to compare our results to
- Create and run models with and without clusters
- Use the insights from the best-performing model (the one with the lowest test RMSE) to confirm our initial hypotheses and insights on the features that are the biggest drivers of wine quality

In [1]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import PolynomialFeatures

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LassoLars

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.preprocessing import MinMaxScaler

from sklearn.cluster import KMeans

from wrangle import split_data

import wrangle as w

## Preprocessing before Clustering

Features: `['alcohol', 'volatile acidity', 'chlorides']`

Encode Clusters

Scale features:
- MinMax

Before scaling, split data
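
Splitting first matters because the scaler should learn its minimum and maximum from the training rows only. A minimal sketch on made-up numbers (the `train_toy` and `validate_toy` frames are assumptions for illustration, not the wine data):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data, hypothetical values chosen only for illustration
train_toy = pd.DataFrame({'alcohol': [9.0, 10.0, 11.0, 13.0]})
validate_toy = pd.DataFrame({'alcohol': [8.0, 12.0, 14.0]})

# Fit on train only, then reuse the same min/max for validate
scaler = MinMaxScaler().fit(train_toy)
train_scaled = scaler.transform(train_toy)
validate_scaled = scaler.transform(validate_toy)

# Train lands in [0, 1]; unseen rows may fall outside that range
print(train_scaled.ravel())
print(validate_scaled.ravel())
```

Values outside [0, 1] in validate or test are expected with this approach and do not indicate a bug.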

In [2]:
df = w.wrangle()
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,red
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1
5,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,1


- Split the data
- Scale
- Create clusters
- Encode clusters

In [3]:
def preprocess_data(df):
    """Split, MinMax-scale, cluster, and one-hot encode the cluster labels."""

    # split data before scaling so the scaler is fit on train only
    train, validate, test = split_data(df)

    # MinMax scale every feature; the target (quality) is left unscaled
    feature_cols = train.drop(columns=['quality']).columns
    scaler = MinMaxScaler().fit(train[feature_cols])

    def scale(subset):
        # rebuild the frame with a fresh 0..n-1 index and quality as the last column
        scaled = pd.DataFrame(data=scaler.transform(subset[feature_cols]),
                              columns=feature_cols)
        scaled['quality'] = subset['quality'].to_numpy()
        return scaled

    train, validate, test = scale(train), scale(validate), scale(test)

    # fit each KMeans on train only, then label all three subsets
    cluster_feats = {
        'clusters_1': ['fixed acidity', 'chlorides', 'alcohol'],
        'clusters_2': ['fixed acidity', 'alcohol'],
        'clusters_3': ['free sulfur dioxide', 'residual sugar', 'alcohol'],
    }
    for name, feats in cluster_feats.items():
        kmeans = KMeans(n_clusters=4, random_state=123).fit(train[feats])
        for subset in (train, validate, test):
            subset[name] = kmeans.predict(subset[feats])

    # encode clusters as dummy columns named like 'clusters_1_2'
    cluster_cols = list(cluster_feats)
    train, validate, test = (
        pd.concat([subset, pd.get_dummies(subset[cluster_cols].astype(str))], axis=1)
        for subset in (train, validate, test)
    )

    return train, validate, test

In [4]:
train, validate, test = preprocess_data(df)
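
Because `pd.get_dummies` is called on each split separately, a split that happens to contain no rows for some cluster would end up with fewer dummy columns than train. One possible guard, sketched on toy labels (the frames and the `reindex` step are assumptions, not part of the function above):

```python
import pandas as pd

# Toy cluster labels; the validate split is missing cluster 2 on purpose
train_labels = pd.DataFrame({'clusters_1': [0, 1, 2, 3]}).astype(str)
validate_labels = pd.DataFrame({'clusters_1': [0, 1, 3]}).astype(str)

train_dummies = pd.get_dummies(train_labels)
validate_dummies = pd.get_dummies(validate_labels)

# Align validate to train's columns, filling absent clusters with 0
validate_aligned = validate_dummies.reindex(columns=train_dummies.columns,
                                            fill_value=0)

print(list(validate_dummies.columns))
print(list(validate_aligned.columns))
```

With four clusters and splits of this size the problem may never occur, but the check is cheap.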



## Modeling before Clustering

**Baseline Model**

In [5]:
def run_baseline_model(train, test, features, target):
    
    # split X and y
    X_train = train[features]
    X_test = test[features]

    y_train = train[target]
    y_test = test[target]
    
    # run model
    dummy = DummyRegressor().fit(X_train, y_train)    
    
    # RMSE (square root of MSE; `squared=False` is deprecated in recent scikit-learn releases)
    train_rmse = np.sqrt(mean_squared_error(y_train, dummy.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, dummy.predict(X_test)))
    # R2
    train_r2 = r2_score(y_train, dummy.predict(X_train))
    test_r2 = r2_score(y_test, dummy.predict(X_test))
    
    print(f'Train:\tRMSE = {train_rmse}\tR2 = {train_r2}')
    print(f'Test:\tRMSE = {test_rmse}\tR2 = {test_r2}')
    
    return train_rmse, train_r2, test_rmse, test_r2

In [6]:
run_baseline_model(train, validate,
                   features=['alcohol', 'volatile acidity', 'chlorides'], target=['quality'])

Train:	RMSE = 0.877247065526987	R2 = 0.0
Test:	RMSE = 0.8709556563730244	R2 = -0.004223135332819483


(0.877247065526987, 0.0, 0.8709556563730244, -0.004223135332819483)
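
`DummyRegressor` with its default `strategy='mean'` always predicts the training mean, which is why the train R2 is 0 and the train RMSE equals the standard deviation of the target. A quick check on made-up quality scores (the values are assumptions for illustration):

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical targets; the features are ignored by DummyRegressor
y = np.array([5.0, 5.0, 6.0, 7.0, 5.0, 6.0])
X = np.zeros((len(y), 1))

dummy = DummyRegressor().fit(X, y)  # default strategy='mean'
pred = dummy.predict(X)

rmse = np.sqrt(mean_squared_error(y, pred))
print(rmse, np.std(y))    # the two values match (population std, ddof=0)
print(r2_score(y, pred))  # R2 on the data the mean was computed from
```

On a different split the mean no longer matches exactly, so R2 can dip slightly below 0, as in the second output row above.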

**Linear Regression Model**

In [7]:
def run_linear_model(train, test, features, target):
    
    # split X and y
    X_train = train[features]
    X_test = test[features]

    y_train = train[target]
    y_test = test[target]
    
    # run model
    lm = LinearRegression().fit(X_train, y_train)    
    
    # RMSE (square root of MSE; `squared=False` is deprecated in recent scikit-learn releases)
    train_rmse = np.sqrt(mean_squared_error(y_train, lm.predict(X_train)))
    test_rmse = np.sqrt(mean_squared_error(y_test, lm.predict(X_test)))
    # R2
    train_r2 = r2_score(y_train, lm.predict(X_train))
    test_r2 = r2_score(y_test, lm.predict(X_test))
    
    print(f'Train:\tRMSE = {train_rmse}\tR2 = {train_r2}')
    print(f'Test:\tRMSE = {test_rmse}\tR2 = {test_r2}')
    
    return train_rmse, train_r2, test_rmse, test_r2

In [8]:
run_linear_model(train, validate,
                   features=['alcohol', 'volatile acidity', 'chlorides'], target=['quality'])

Train:	RMSE = 0.7398947098676343	R2 = 0.2886292628795276
Test:	RMSE = 0.7464181576392369	R2 = 0.2624312195350209


(0.7398947098676343,
 0.2886292628795276,
 0.7464181576392369,
 0.2624312195350209)
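
To put the two results side by side, the reduction in error can be computed directly from the printed numbers. Note that the row labelled `Test:` is the validate split here, since `validate` is passed as the second argument. A small sketch (the variable names are chosen for illustration):

```python
# Validate RMSE values copied from the two cell outputs above
baseline_rmse = 0.8709556563730244
linear_rmse = 0.7464181576392369

# Fractional reduction in error relative to the baseline
improvement = (baseline_rmse - linear_rmse) / baseline_rmse
print(f'RMSE reduction vs. baseline: {improvement:.1%}')
```

The linear model cuts validate RMSE by roughly 14% relative to always predicting the mean.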

## Modeling on first group of clusters

Cluster 2 in the first group of clusters has a much higher mean quality, while the other clusters have similar mean qualities. So, to reduce noise, we only add cluster 2 (`clusters_1_2`) to the model.
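
A comparison like this can be made by averaging `quality` per cluster label on the training split. A sketch on a toy frame (the rows are invented; on the real data the same call would be applied to `train`):

```python
import pandas as pd

# Invented rows standing in for the training split
toy = pd.DataFrame({
    'clusters_1': [0, 0, 1, 1, 2, 2, 3, 3],
    'quality':    [5, 6, 5, 6, 7, 8, 5, 6],
})

# Mean quality per cluster, highest first
mean_quality = (toy.groupby('clusters_1')['quality']
                   .mean()
                   .sort_values(ascending=False))
print(mean_quality)

# Label of the cluster with the highest mean quality
print(mean_quality.idxmax())
```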

**Linear Regression Model**

In [9]:
run_linear_model(train, validate,
                 features=['alcohol', 'volatile acidity', 'chlorides','clusters_1_2'],
                 target=['quality'])

Train:	RMSE = 0.7394488750805455	R2 = 0.289486299060413
Test:	RMSE = 0.7465946379814651	R2 = 0.2620824022467747


(0.7394488750805455, 0.289486299060413, 0.7465946379814651, 0.2620824022467747)

## Modeling on second group of clusters

Cluster 2 in the second group of clusters has a much higher mean quality, so only its dummy column (`clusters_2_2`) is added to the model.

**Linear Regression Model**

In [10]:
run_linear_model(train, validate,
                 features=['alcohol', 'volatile acidity', 'chlorides', 'clusters_2_2'],
                 target=['quality'])

Train:	RMSE = 0.7394488750805455	R2 = 0.289486299060413
Test:	RMSE = 0.7465946379814651	R2 = 0.2620824022467747


(0.7394488750805455, 0.289486299060413, 0.7465946379814651, 0.2620824022467747)

## Modeling on third group of clusters

Cluster 1 in the third group of clusters has a much higher mean quality, so only its dummy column (`clusters_3_1`) is added to the model.

**Linear Regression Model**

In [11]:
run_linear_model(train, validate,
                 features=['alcohol', 'volatile acidity', 'chlorides', 'clusters_3_1'],
                 target=['quality'])

Train:	RMSE = 0.7391175409246308	R2 = 0.29012289402206115
Test:	RMSE = 0.7461053876933185	R2 = 0.2630492136515725


(0.7391175409246308,
 0.29012289402206115,
 0.7461053876933185,
 0.2630492136515725)
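
Collecting the validate RMSE values reported above makes the comparison explicit. The dictionary keys are labels chosen here for illustration, and the first two cluster models are listed with the identical numbers shown in their outputs:

```python
# Validate RMSE for each feature set, copied from the cell outputs above
validate_rmse = {
    'baseline': 0.8709556563730244,
    'no clusters': 0.7464181576392369,
    'clusters_1_2': 0.7465946379814651,
    'clusters_2_2': 0.7465946379814651,
    'clusters_3_1': 0.7461053876933185,
}

# Lower RMSE is better
best = min(validate_rmse, key=validate_rmse.get)
print(best)

# Gain of the best cluster model over the plain linear model
gain = validate_rmse['no clusters'] - validate_rmse[best]
print(f'{gain:.4f}')
```

The third cluster group gives the lowest validate RMSE, but its margin over the model without clusters is only about 0.0003.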