# Model

Plan:

- Develop a model to predict wine quality
- Use the drivers identified in the explore phase to build predictive regression models
- Create and run a baseline model with sklearn's `DummyRegressor` to compare our results to
- Create and run `LinearRegression`, `LassoLars`, and polynomial regression models
- Use the insights from the highest-performing model (the one with the lowest test RMSE) to confirm our initial hypotheses and insights on the features that are the biggest drivers of wine quality
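
The comparison described in the plan can be sketched as follows. This is a minimal illustration on synthetic data, not the notebook's wine data: the random features, the `alpha=0.01` setting, the degree-2 polynomial, and the `rmse` helper are all assumptions made for the example. Each candidate is fit on a training portion and ranked by RMSE on held-out rows, where lower is better.

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LassoLars
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic stand-in data (an assumption; this is not the wine dataset)
rng = np.random.default_rng(123)
X = rng.uniform(0, 1, size=(500, 3))
y = 5 + 2 * X[:, 0] - 1.5 * X[:, 1] + rng.normal(0, 0.5, size=500)

# Simple positional split into train and validate portions
X_train, X_validate = X[:350], X[350:]
y_train, y_validate = y[:350], y[350:]

# Candidate models named in the plan; hyperparameters are illustrative
models = {
    'baseline': DummyRegressor(strategy='mean'),
    'linear': LinearRegression(),
    'lasso_lars': LassoLars(alpha=0.01),
    'polynomial_deg2': make_pipeline(PolynomialFeatures(degree=2),
                                     LinearRegression()),
}

def rmse(y_true, y_pred):
    """Root mean squared error (hypothetical helper for this sketch)."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

# Fit on train, score on validate
scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = rmse(y_validate, model.predict(X_validate))

# The best model is the one with the LOWEST RMSE
best = min(scores, key=scores.get)
print(scores)
print('best:', best)
```

A model is only worth keeping if its RMSE is clearly below the baseline's.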

In [17]:
import pandas as pd
import numpy as np

from sklearn.preprocessing import PolynomialFeatures

from sklearn.dummy import DummyRegressor
from sklearn.linear_model import LinearRegression, LassoLars

from sklearn.metrics import mean_squared_error, r2_score

from sklearn.preprocessing import MinMaxScaler

from wrangle import split_data


## Preprocessing before Clustering

Features: `['alcohol', 'volatile acidity', 'chlorides']`

Scale features:
- MinMax

Before scaling, split data

In [2]:
df = pd.read_csv('wine_data_model.csv') 
df.head()

Unnamed: 0,fixed acidity,volatile acidity,citric acid,residual sugar,chlorides,free sulfur dioxide,total sulfur dioxide,density,pH,sulphates,alcohol,quality,red,clusters_1,clusters_2,clusters_3
0,7.4,0.7,0.0,1.9,0.076,11.0,34.0,0.9978,3.51,0.56,9.4,5,1,0,1,3
1,7.8,0.88,0.0,2.6,0.098,25.0,67.0,0.9968,3.2,0.68,9.8,5,1,0,1,3
2,7.8,0.76,0.04,2.3,0.092,15.0,54.0,0.997,3.26,0.65,9.8,5,1,0,1,3
3,11.2,0.28,0.56,1.9,0.075,17.0,60.0,0.998,3.16,0.58,9.8,6,1,1,3,3
4,7.4,0.66,0.0,1.8,0.075,13.0,40.0,0.9978,3.51,0.56,9.4,5,1,0,1,3


In [3]:
features = ['alcohol', 'volatile acidity', 'chlorides']
target = ['quality']

train, validate, test = split_data(df[features + target + ['red']],
                                   validate_size=.15, test_size=.15, 
                                   stratify_col='red', random_state=123)

# drop color column
train = train.iloc[:,:-1]
validate = validate.iloc[:,:-1]
test = test.iloc[:,:-1]
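
The `split_data` helper comes from a local `wrangle` module that is not shown here. The sketch below is one plausible way such a helper could work, assuming it wraps sklearn's `train_test_split` twice; the name `split_data_sketch` and its body are hypothetical, and the toy frame only mimics the row count implied by the split sizes reported in this notebook (3724 + 798 + 798 = 5320).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

def split_data_sketch(df, validate_size=0.15, test_size=0.15,
                      stratify_col=None, random_state=None):
    """Hypothetical stand-in for wrangle.split_data (real helper not shown)."""
    # Convert fractions of the full frame into absolute row counts
    n_test = round(len(df) * test_size)
    n_validate = round(len(df) * validate_size)

    # First carve off the test set, stratifying if requested
    strat = df[stratify_col] if stratify_col else None
    train_validate, test = train_test_split(
        df, test_size=n_test, stratify=strat, random_state=random_state)

    # Then carve the validate set out of what remains
    strat = train_validate[stratify_col] if stratify_col else None
    train, validate = train_test_split(
        train_validate, test_size=n_validate, stratify=strat,
        random_state=random_state)
    return train, validate, test

# Toy frame with made-up values and a binary 'red' column
toy = pd.DataFrame({'x': range(5320), 'red': [1] * 1320 + [0] * 4000})
train, validate, test = split_data_sketch(toy, stratify_col='red',
                                          random_state=123)
print(len(train), len(validate), len(test))
```

Stratifying on `red` keeps the share of red wines roughly equal across the three subsets.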

In [4]:
print(len(train), len(validate), len(test))
train.head()

3724 798 798


Unnamed: 0,alcohol,volatile acidity,chlorides,quality
3048,10.1,0.21,0.051,6
4302,10.4,0.26,0.038,6
1338,10.8,0.3,0.081,6
2634,10.5,0.2,0.053,5
4670,9.7,0.23,0.041,5


In [32]:
# features only
X_train = train[features]
X_validate = validate[features]
X_test = test[features]

# target only
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

In [21]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

## Modeling before Clustering

**Baseline Model**

In [7]:
dummy = DummyRegressor().fit(X_train_scaled, y_train)

In [8]:
train['baseline_pred'] = dummy.predict(X_train_scaled)

In [9]:
train.head()

Unnamed: 0,alcohol,volatile acidity,chlorides,quality,baseline_pred
3048,10.1,0.21,0.051,6,5.796992
4302,10.4,0.26,0.038,6,5.796992
1338,10.8,0.3,0.081,6,5.796992
2634,10.5,0.2,0.053,5,5.796992
4670,9.7,0.23,0.041,5,5.796992


Evaluate

In [10]:
# RMSE (explicit square root, so the result does not depend on the `squared` argument)
np.sqrt(mean_squared_error(train['quality'],
                           train['baseline_pred']))

0.8813204007103415

In [11]:
# R2
r2_score(train['quality'],
         train['baseline_pred'])

0.0
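
The two baseline numbers above are not a coincidence. A `DummyRegressor` with its default `mean` strategy always predicts the training mean, so on the training data its RMSE equals the population standard deviation of the target and its R2 is exactly 0. The sketch below checks this on a handful of made-up quality scores (the values are an assumption for illustration only).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import mean_squared_error, r2_score

# Made-up quality scores for illustration
y = np.array([5, 6, 6, 5, 7, 4, 6, 5], dtype=float)
X = np.zeros((len(y), 1))  # DummyRegressor ignores the features

baseline = DummyRegressor(strategy='mean').fit(X, y)
pred = baseline.predict(X)

rmse = np.sqrt(mean_squared_error(y, pred))
r2 = r2_score(y, pred)

# RMSE of the mean prediction equals the population std (ddof=0),
# and R2 of the mean prediction is 0 by definition
print(np.isclose(rmse, y.std()), np.isclose(r2, 0.0))
```

Any model with a training RMSE below the baseline's therefore explains some of the variance in quality, which is what a positive R2 reports.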

**Linear Regression Model**

In [30]:
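def run_lm_model(X, y, features):
    """Fit a linear regression on X, print RMSE and R2, and display the coefficients."""
    # fit the model and predict on the same data
    lm = LinearRegression().fit(X, y)
    y_pred = lm.predict(X)

    # RMSE (explicit square root, independent of the `squared` argument)
    rmse = np.sqrt(mean_squared_error(y, y_pred))
    # R2
    r2 = r2_score(y, y_pred)

    print(f'RMSE = {rmse}\nR2 = {r2}')
    # NOTE: multiplying by the global scaler's scale_ converts coefficients to
    # original feature units only when X was transformed by that scaler
    display(pd.DataFrame(index=features + ['intercept'],
                         columns=['coefficients'],
                         data=np.append(lm.coef_ * scaler.scale_, lm.intercept_)))

    return rmse, r2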

In [31]:
X_train

Unnamed: 0,alcohol,volatile acidity,chlorides
3048,10.1,0.21,0.051
4302,10.4,0.26,0.038
1338,10.8,0.30,0.081
2634,10.5,0.20,0.053
4670,9.7,0.23,0.041
...,...,...,...
689,10.8,0.37,0.038
1959,10.0,0.31,0.040
2665,12.8,0.36,0.025
3931,12.8,0.25,0.034


In [28]:
run_lm_model(X_train, y_train, features)

RMSE = 0.7453522878997776
R2 = 0.2847538458381519


Unnamed: 0,coefficients
alcohol,0.058536
volatile acidity,-0.788375
chlorides,0.384663
intercept,2.448307


(0.7453522878997776, 0.2847538458381519)

Our model starts its prediction at 5.26 and:
- adds .33 for every 1 unit of alcohol
- subtracts 1.25 for every .1 units of volatile acidity
- adds .03 for every .1 units of chlorides
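
Reading coefficients as "per unit of the feature" only works when they are expressed in original units. `run_lm_model` multiplies `lm.coef_` by `scaler.scale_`, which performs that conversion for a model fit on min-max scaled features. The sketch below works through the algebra on synthetic data (the data and variable names are assumptions for illustration) and also shows the matching intercept adjustment: because `MinMaxScaler` computes `X_scaled = X * scale_ + min_`, the intercept in original units is the scaled intercept plus the sum of `coef_ * min_`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler

# Synthetic features with ranges loosely resembling the three wine features
rng = np.random.default_rng(0)
X = rng.uniform([8, 0.1, 0.01], [15, 1.5, 0.6], size=(200, 3))
y = 2.5 + 0.3 * X[:, 0] - 1.2 * X[:, 1] + rng.normal(0, 0.1, size=200)

# Fit on min-max scaled features
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
lm_scaled = LinearRegression().fit(X_scaled, y)

# Convert scaled-space parameters back to original feature units
coef_original = lm_scaled.coef_ * scaler.scale_
intercept_original = lm_scaled.intercept_ + np.sum(lm_scaled.coef_ * scaler.min_)

# A model fit directly on the raw features recovers the same parameters
lm_raw = LinearRegression().fit(X, y)
print(np.allclose(coef_original, lm_raw.coef_))
print(np.isclose(intercept_original, lm_raw.intercept_))
```

The conversion is only meaningful when the model was fit on features transformed by the same scaler; applying `scale_` to coefficients from an unscaled fit shrinks them a second time.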

## Preprocessing after Clustering

Features: `['alcohol', 'volatile acidity', 'chlorides', 'clusters_1']`


Encode clusters

In [None]:
df = pd.concat([df, 
                pd.get_dummies(df[['clusters_1','clusters_2','clusters_3']].astype(str))],
                axis=1)
df.head()
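
To make the naming of the encoded columns concrete, the sketch below runs `pd.get_dummies` on a toy `clusters_1` column (the values are made up). Casting to `str` first makes pandas treat the cluster labels as categories, and `dtype=int` is an assumption added here so the indicators come out as 0/1 rather than booleans on recent pandas versions.

```python
import pandas as pd

# Toy cluster labels for illustration
toy = pd.DataFrame({'clusters_1': [0, 2, 2, 1, 3]})

# One indicator column per cluster label, named <column>_<label>
dummies = pd.get_dummies(toy[['clusters_1']].astype(str), dtype=int)
print(list(dummies.columns))

# Every row has exactly one indicator set to 1
print((dummies.sum(axis=1) == 1).all())
```

Because the indicators in each row sum to 1, the full set is redundant with a regression intercept; passing `drop_first=True` to `get_dummies` is one common way to remove that redundancy.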

Split data

In [56]:
features = ['alcohol', 'volatile acidity', 'chlorides',
            'clusters_1_0', 'clusters_1_1',
            'clusters_1_2', 'clusters_1_3']
target = ['quality']

train, validate, test = split_data(df[features + target + ['red']],
                                   validate_size=.15, test_size=.15, 
                                   stratify_col='red', random_state=123)

# drop color column
train = train.iloc[:,:-1]
validate = validate.iloc[:,:-1]
test = test.iloc[:,:-1]

In [57]:
print(len(train), len(validate), len(test))
train.head()

3724 798 798


Unnamed: 0,alcohol,volatile acidity,chlorides,clusters_1_0,clusters_1_1,clusters_1_2,clusters_1_3,quality
3048,10.1,0.21,0.051,1,0,0,0,6
4302,10.4,0.26,0.038,0,0,1,0,6
1338,10.8,0.3,0.081,0,0,1,0,6
2634,10.5,0.2,0.053,0,0,1,0,5
4670,9.7,0.23,0.041,1,0,0,0,5


In [58]:
# features only
X_train = train[features]
X_validate = validate[features]
X_test = test[features]

# target only
y_train = train[target]
y_validate = validate[target]
y_test = test[target]

In [59]:
scaler = MinMaxScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_validate_scaled = scaler.transform(X_validate)
X_test_scaled = scaler.transform(X_test)

## Modeling on first group of clusters

**Linear Regression Model**

In [73]:
run_lm_model(X_train, y_train, features)

RMSE = 0.74404479115878
R2 = 0.28726101401285986


Unnamed: 0,coefficients
alcohol,0.059491
volatile acidity,-0.808499
chlorides,0.126463
clusters_1_0,0.019167
clusters_1_1,0.039125
clusters_1_2,-0.076062
clusters_1_3,0.01777
intercept,2.416009


(0.74404479115878, 0.28726101401285986)