## Multiple Linear Regression
First we will use multiple linear regression to predict the rating based on the features such as region and roast.

In [4]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split

In [5]:
## first load the data
coffee = pd.read_csv('../data/one_hot_coffee.csv')
coffee = coffee.copy()

In [6]:
coffee.columns
coffee.describe()

Unnamed: 0,rating,region_africa_arabia,region_caribbean,region_central_america,region_hawaii,region_asia_pacific,region_south_america,type_espresso,type_organic,type_fair_trade,type_decaffeinated,type_pod_capsule,type_blend,type_estate,Dark,Light,Medium,Medium-Dark,Medium-Light,Very Dark
count,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0,4677.0
mean,90.45991,0.231986,0.005559,0.158435,0.018602,0.07291,0.082318,0.144323,0.089801,0.055591,0.009622,0.033782,0.085739,0.126149,0.054522,0.090229,0.29164,0.169767,0.356425,0.037417
std,3.898294,0.422145,0.07436,0.365187,0.135128,0.260016,0.274878,0.351455,0.285927,0.229155,0.097627,0.180688,0.280008,0.332053,0.227069,0.28654,0.454566,0.375468,0.478994,0.189802
min,60.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,89.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
50%,91.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
75%,93.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,0.0
max,97.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0


In [7]:
## next perform the train test split
coffee_train, coffee_test = train_test_split(coffee,
                                            shuffle=True,
                                            random_state=47,
                                            test_size = .2)

In [8]:
## make a baseline, which we will simply take to be the mean of the ratings
baseline = coffee['rating'].mean()
print(baseline)

90.4599101988454


In [9]:
## import the LinearRegression object, mse, and KFold
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

In [10]:
predictors = ['region_africa_arabia', 'region_caribbean',
       'region_central_america', 'region_hawaii', 'region_asia_pacific',
       'region_south_america', 'type_espresso', 'type_organic',
       'type_fair_trade', 'type_decaffeinated', 'type_pod_capsule',
       'type_blend', 'type_estate', 'Dark', 'Light', 'Medium', 'Medium-Dark', 'Medium-Light', 'Very Dark']

In [16]:
## perform cross validation for the linear regression model

splits = 5
kfold = KFold(n_splits=splits, shuffle=True, random_state=413)

reg = LinearRegression(copy_X = True)
mse = np.empty(splits)

i = 0

for train_index, test_index in kfold.split(coffee_train):
    coffee_train_train = coffee_train.iloc[train_index]
    coffee_holdout = coffee_train.iloc[test_index]

    reg.fit(coffee_train_train[predictors], 
            coffee_train_train['rating'])
    
    preds = reg.predict(coffee_train_train[predictors])
    
    mse[i] = mean_squared_error(coffee_train_train['rating'], preds)
    
    i = i+1


In [22]:
## look at the coefficients for the fit
reg.coef_

array([ 1.92517011e+00, -1.75777302e+00,  4.65018204e-01,  8.25901196e-01,
        3.09702150e-01,  1.84270168e-01,  1.66535174e+00, -2.60271673e-01,
        2.71352561e-01,  1.84258510e-01, -1.72178971e+00,  3.96620015e-01,
        6.27192633e-01, -8.82008795e+13, -8.82008795e+13, -8.82008795e+13,
       -8.82008795e+13, -8.82008795e+13, -8.82008795e+13])

In [19]:
## make the predictions for baseline
preds_baseline = baseline * np.ones(len(coffee_train))

In [25]:
## check the average mean squared error and compare to baseline
mse_baseline = mean_squared_error(coffee_train['rating'], preds_baseline)
print("The average cross validation mean squared error for multiple linear regression is", np.mean(mse))
print("The mean squared error for the baseline is", mse_baseline)

The average cross validation mean squared error for multiple linear regression is 9.924129624392668
The mean squared error for the baseline is 15.565458489578745


## Summary
We have made a baseline and basic model. Now we want to try lasso and ridge regression to determine which features are important and help us decide which interaction terms to include.