## Multiple Linear Regression Interaction Terms
In this notebook we add interaction terms between coffee type (espresso, pod/capsule) and roast level to our multiple linear regression (MLR) model.
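As a quick illustration of what an interaction term is here: for two one-hot columns, the product is 1 only when both indicators are 1. The sketch below uses a small made-up frame (the values are placeholders; only the column names mirror the ones used later in this notebook) and builds the product columns in one step.

```python
import pandas as pd

# Toy one-hot data; the values are invented for illustration only.
toy = pd.DataFrame({
    "type_espresso": [1, 1, 0, 0],
    "Dark":          [1, 0, 1, 0],
    "Light":         [0, 1, 0, 1],
})

# Each interaction column is the elementwise product of two indicators,
# so it equals 1 only for rows that are both espresso and that roast.
interactions = pd.DataFrame(
    {f"espresso {roast}": toy["type_espresso"] * toy[roast]
     for roast in ["Dark", "Light"]},
    index=toy.index,
)
toy = toy.join(interactions)
print(toy)
```

Letting the model fit a separate coefficient for these product columns allows the effect of roast level to differ between espresso and non-espresso coffees.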

## Loading the data and creating a train test split

In [15]:
#importing
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

In [16]:
## first load the data
coffee = pd.read_csv('../data/one_hot_coffee.csv')
coffee = coffee.copy()

In [17]:
## next perform the train test split
coffee_train, coffee_test = train_test_split(coffee,
                                            shuffle=True,
                                            random_state=47,
                                            test_size = .2)

## Creating a baseline

In [18]:
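## make a baseline: predict the mean rating for every coffee
## NOTE: this mean is taken over the full data set, test rows included;
## coffee_train['rating'].mean() would keep the test set out of the baseline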
baseline = coffee['rating'].mean()
print(baseline)

90.4599101988454


## Creating new interaction terms

In [19]:
roasts = ['Light','Medium-Light',  'Medium', 'Medium-Dark', 'Dark', 'Very Dark']

In [22]:
## Make Espresso interaction terms.
for i in roasts:
    coffee_train.loc[:,'espresso '+ i] = coffee_train['type_espresso'].copy() * coffee_train[i].copy()

In [23]:
## Make Pod interaction terms.
for i in roasts:
    coffee_train.loc[:,'pod '+ i] = coffee_train['type_pod_capsule'].copy() * coffee_train[i].copy()

In [24]:
# Create new predictor list.
predictors = ['region_africa_arabia', 'region_caribbean',
       'region_central_america', 'region_hawaii', 'region_asia_pacific',
       'region_south_america', 'type_espresso', 'type_organic',
       'type_fair_trade', 'type_decaffeinated', 'type_pod_capsule',
       'type_blend', 'type_estate', 'Dark', 'Light', 'Medium', 'Medium-Dark',
        'Medium-Light', 'Very Dark','espresso Dark',
       'espresso Light', 'espresso Medium', 'espresso Medium-Dark',
       'espresso Medium-Light', 'espresso Very Dark', 'pod Dark', 'pod Light',
       'pod Medium', 'pod Medium-Dark', 'pod Medium-Light', 'pod Very Dark']

## Creating MLR Model with interaction terms and running cross validation

In [25]:
## importing
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold

In [26]:
## kfold splits
splits = 5
kfold = KFold(n_splits=splits, shuffle=True, random_state=413)

# creating model object and tables to hold results
reg = LinearRegression(copy_X = True)
mse = np.empty(splits)
mae = np.empty(splits)

i = 0

for train_index, test_index in kfold.split(coffee_train):
    #train and holdouts
    coffee_train_train = coffee_train.iloc[train_index]
    coffee_holdout = coffee_train.iloc[test_index]

    #fitting the model
    reg.fit(coffee_train_train[predictors], 
            coffee_train_train['rating'])
    
    #running predictions on the holdout set
    preds = reg.predict(coffee_holdout[predictors])
    
    #recording the holdout errors
    mse[i] = mean_squared_error(coffee_holdout['rating'], preds)
    mae[i] = mean_absolute_error(coffee_holdout['rating'], preds)
    
    i = i+1


In [27]:
## make the predictions on the full training set
## (reg is still the model fit on the last fold's training rows)
preds = reg.predict(coffee_train[predictors])
preds_baseline = baseline * np.ones(len(coffee_train))

In [28]:
## check the mean squared error on the training set
## (stored as train_mse so the cross validation array mse is not overwritten)
train_mse = mean_squared_error(coffee_train['rating'], preds)
mse_baseline = mean_squared_error(coffee_train['rating'], preds_baseline)
print("The mean squared error for multiple linear regression with interaction terms is", train_mse)
print("The mean squared error for the baseline is", mse_baseline)

The mean squared error for multiple linear regression with interaction terms is 9.599984790313371
The mean squared error for the baseline is 15.565458489578745


In [29]:
## check the mean absolute error on the training set
## (stored as train_mae so the cross validation array mae is not overwritten)
train_mae = mean_absolute_error(coffee_train['rating'], preds)
mae_baseline = mean_absolute_error(coffee_train['rating'], preds_baseline)
print("The mean absolute error for multiple linear regression with interaction terms is", train_mae)
print("The mean absolute error for the baseline is", mae_baseline)

The mean absolute error for multiple linear regression with interaction terms is 2.0565192092680347
The mean absolute error for the baseline is 2.8500208925625063


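On the training set, our MLR model with interaction terms clearly beats the baseline (MSE of about 9.60 versus 15.57, MAE of about 2.06 versus 2.85) and performs slightly better than the original MLR model. These are training errors, so they are likely to be somewhat optimistic.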
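Since the numbers above are computed on the training rows, a holdout-based estimate is a useful companion. The following is a minimal sketch of averaging per-fold holdout errors, not a rerun of this notebook: it uses synthetic data, and the names `X`, `y`, `cv_mse`, and `cv_mae` are assumptions for illustration. In the notebook, `coffee_train[predictors]` and `coffee_train['rating']` would take the place of `X` and `y`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, mean_absolute_error
from sklearn.model_selection import KFold

# Synthetic stand-in data (hypothetical): five binary predictors and a
# rating-like target with unit-variance noise.
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(200, 5)).astype(float)
y = 90 + X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(0, 1, size=200)

splits = 5
kfold = KFold(n_splits=splits, shuffle=True, random_state=413)
cv_mse = np.empty(splits)
cv_mae = np.empty(splits)

for fold, (train_index, holdout_index) in enumerate(kfold.split(X)):
    reg = LinearRegression()
    reg.fit(X[train_index], y[train_index])
    # Score only on rows the model did not see while fitting.
    holdout_preds = reg.predict(X[holdout_index])
    cv_mse[fold] = mean_squared_error(y[holdout_index], holdout_preds)
    cv_mae[fold] = mean_absolute_error(y[holdout_index], holdout_preds)

print("Average cross validation MSE:", cv_mse.mean())
print("Average cross validation MAE:", cv_mae.mean())
```

Averaging over the folds gives a single number per metric that can be compared across candidate models before touching the test split.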

## Running the model on the test split

In [41]:
## Make Espresso interaction terms.
for i in roasts:
    coffee_test.loc[:,'espresso '+ i] = coffee_test['type_espresso'].copy() * coffee_test[i].copy()

## Make Pod interaction terms.
for i in roasts:
    coffee_test.loc[:,'pod '+ i] = coffee_test['type_pod_capsule'].copy() * coffee_test[i].copy()

In [42]:
reg.fit(coffee_train[predictors], coffee_train.rating)

LinearRegression()

In [43]:
pred = reg.predict(coffee_test[predictors])
test_mse = mean_squared_error(coffee_test.rating,pred)
test_mae = mean_absolute_error(coffee_test.rating,pred)

print("The test set mean squared error for multiple linear regression with interaction terms is", test_mse)
print("The test set mean absolute error for multiple linear regression with interaction terms is", test_mae)

The test set mean squared error for multiple linear regression with interaction terms is 8.815056009924586
The test set mean absolute error for multiple linear regression with interaction terms is 2.0397200122930292


## Saving test results

Now we save our test results. The code below was run once to append this model's MSE and MAE to the CSV file that holds the results for the previous models. It is commented out now so that rerunning the notebook does not append the same row again.
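An alternative to commenting the cell out is to make the append step safe to rerun. The sketch below is one possible approach, not the code that produced `testing_results.csv`: the helper `append_result_once` and the `Model` header name are hypothetical, and the demonstration writes placeholder numbers to a temporary file rather than to the real results file.

```python
import csv
import os
import tempfile

def append_result_once(path, model_name, mse, mae):
    """Append a results row unless model_name is already in the file.

    Returns True if a row was written, False if it was skipped.
    """
    file_exists = os.path.exists(path)
    if file_exists:
        with open(path, newline="") as f:
            existing = {row[0] for row in csv.reader(f) if row}
        if model_name in existing:
            return False
    with open(path, mode="a", newline="") as f:
        writer = csv.writer(f)
        if not file_exists:
            # Assumed header layout for a fresh file.
            writer.writerow(["Model", "MSE", "MAE"])
        writer.writerow([model_name, mse, mae])
    return True

# Demonstration on a temporary file with placeholder values.
demo_path = os.path.join(tempfile.mkdtemp(), "testing_results.csv")
first = append_result_once(demo_path, "MLR_Interaction", 1.0, 0.5)
second = append_result_once(demo_path, "MLR_Interaction", 1.0, 0.5)
print(first, second)
```

With a guard like this the cell can stay active, since a second run leaves the file unchanged.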

In [44]:
#import csv

In [45]:
#with open('testing_results.csv', mode='a') as coffee_file:
    #results_writer = csv.writer(coffee_file, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)

    #results_writer.writerow(['MLR_Interaction', test_mse, test_mae])

In [10]:
# import the complete testing results csv and save it as data frame, then save that data frame as an image for README
results = pd.read_csv('testing_results.csv')
final_results = pd.DataFrame(results)
# add a column for root mean square error to allow easier comparison between RMSE and MAE
final_results['RMSE'] = np.sqrt(final_results['MSE'])

import dataframe_image as dfi

dfi.export(final_results, 'testing_results.png')