## Multiple Linear Regression Interaction Terms
First we will add interaction terms to our multiple linear regression (MLR) model.

In [43]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from seaborn import set_style

set_style("whitegrid")

In [2]:
## first load the data
coffee = pd.read_csv('../data/one_hot_coffee.csv')
coffee = coffee.copy()

In [4]:
## next perform the train test split
coffee_train, coffee_test = train_test_split(coffee,
                                            shuffle=True,
                                            random_state=47,
                                            test_size = .2)

In [5]:
## make a baseline: the mean rating
## (note: this mean is taken over the full data set, not only the training set)
baseline = coffee['rating'].mean()
print(baseline)

90.4599101988454


In [6]:
## import the LinearRegression object
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

In [8]:
roasts = ['Dark', 'Light', 'Medium', 'Medium-Dark', 'Medium-Light', 'Very Dark']

Next we create the new interaction terms.

In [18]:
## Make espresso interaction terms:
## the product of the espresso indicator with each roast indicator.
for roast in roasts:
    coffee_train.loc[:, 'espresso ' + roast] = coffee_train['type_espresso'] * coffee_train[roast]

In [20]:
## Make pod interaction terms:
## the product of the pod/capsule indicator with each roast indicator.
for roast in roasts:
    coffee_train.loc[:, 'pod ' + roast] = coffee_train['type_pod_capsule'] * coffee_train[roast]
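
Assuming the type and roast columns are 0/1 indicators (as the one-hot file name suggests), each product column is 1 only for rows where both indicators are 1, for example coffees that are both espresso and Dark roast. Any data set the model later predicts on needs the same columns. The following is a minimal sketch of a reusable helper; `add_interactions` and the toy data frame are hypothetical, invented here for illustration, and are not part of the notebook.

```python
import pandas as pd

def add_interactions(df, type_col, prefix, levels):
    """Return a copy of df with one product column per level (hypothetical helper)."""
    out = df.copy()
    for level in levels:
        # product of two 0/1 indicators: 1 only when both are 1
        out[prefix + ' ' + level] = out[type_col] * out[level]
    return out

# toy data, made up for illustration
toy = pd.DataFrame({'type_espresso': [1, 1, 0, 0],
                    'Dark':          [1, 0, 1, 0],
                    'Light':         [0, 1, 0, 1]})

toy = add_interactions(toy, 'type_espresso', 'espresso', ['Dark', 'Light'])
print(toy[['espresso Dark', 'espresso Light']])
```

Calling one helper on both the training and the test frame keeps the column names identical, which `LinearRegression.predict` relies on.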

In [53]:
predictors = ['region_africa_arabia', 'region_caribbean',
       'region_central_america', 'region_hawaii', 'region_asia_pacific',
       'region_south_america', 'type_espresso', 'type_organic',
       'type_fair_trade', 'type_decaffeinated', 'type_pod_capsule',
       'type_blend', 'type_estate', 'Dark', 'Light', 'Medium', 'Medium-Dark',
        'Medium-Light', 'Very Dark','espresso Dark',
       'espresso Light', 'espresso Medium', 'espresso Medium-Dark',
       'espresso Medium-Light', 'espresso Very Dark', 'pod Dark', 'pod Light',
       'pod Medium', 'pod Medium-Dark', 'pod Medium-Light', 'pod Very Dark']

Now we fit the model using the new interaction terms.

In [54]:
reg = LinearRegression(copy_X = True)

reg.fit(coffee_train[predictors], 
        coffee_train['rating'])

LinearRegression()

In [55]:
## look at the coefficients for the fit
reg.coef_

array([ 1.87369947, -1.66313622,  0.46160642,  0.9248165 ,  0.25857735,
        0.09712685,  1.48531515, -0.08362772,  0.12860132,  0.22235085,
       -1.50272796,  0.51062645,  0.57258247, -3.54551651,  3.21655595,
        1.78287458, -1.22666402,  3.1991997 , -3.4264497 ,  2.33437768,
       -1.51671526, -0.00923135,  1.73690226, -1.09926156,  0.03924337,
        2.74098872, -2.34624698, -2.41595327,  0.17911238, -3.17123664,
        3.51060783])
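
A bare coefficient array is hard to read with 31 predictors. One option is to pair each coefficient with its column name. The sketch below does this on made-up toy data (the columns `a`, `b`, `a b` and the name `toy_reg` are hypothetical); the same `pd.Series` call should work with `reg.coef_` and `predictors`, since the coefficients come back in the order of the columns passed to `fit`.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# toy data, made up for illustration
rng = np.random.default_rng(0)
X = pd.DataFrame({'a': rng.integers(0, 2, 50),
                  'b': rng.integers(0, 2, 50)})
X['a b'] = X['a'] * X['b']  # interaction of the two indicators
y = 90 + 2 * X['a'] - X['b'] + 3 * X['a b'] + rng.normal(0, 0.1, 50)

toy_reg = LinearRegression().fit(X, y)

# pair each coefficient with the column it belongs to
coefs = pd.Series(toy_reg.coef_, index=X.columns)
print(coefs)
```

With an interaction term in the model, a single coefficient no longer tells the whole story: for a row with both `a` and `b` equal to 1, the contribution to the prediction is the sum of the `a`, `b`, and `a b` coefficients.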

In [56]:
## make the predictions
preds = reg.predict(coffee_train[predictors])
preds_baseline = baseline * np.ones(len(coffee_train))

In [57]:
## check the mean squared error on the training set
mse = mean_squared_error(coffee_train['rating'], preds)
mse_baseline = mean_squared_error(coffee_train['rating'], preds_baseline)
print("The mean squared error for multiple linear regression is", mse)
print("The mean squared error for the baseline is", mse_baseline)

The mean squared error for multiple linear regression is 9.57993448241745
The mean squared error for the baseline is 15.565458489578745
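
Both numbers above are training errors, so they say how well the model fits the data it has already seen. A rough sketch of a held-out comparison follows, using synthetic data and hypothetical names (`toy_model`, `train_mean`); it also takes the baseline mean from the training ratings only, which is an assumption about how one might set this up rather than what the notebook does.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# synthetic data, made up for illustration
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(200, 3)).astype(float)
y = 90 + X @ np.array([2.0, -1.0, 0.5]) + rng.normal(0, 1, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, shuffle=True, random_state=47, test_size=0.2)

# baseline uses only the training ratings
train_mean = y_train.mean()
toy_model = LinearRegression().fit(X_train, y_train)

# training errors: the fitted model can never do worse than the mean here
train_mse_model = mean_squared_error(y_train, toy_model.predict(X_train))
train_mse_base = mean_squared_error(y_train, np.full(len(y_train), train_mean))

# test errors: the comparison that reflects performance on unseen data
test_mse_model = mean_squared_error(y_test, toy_model.predict(X_test))
test_mse_base = mean_squared_error(y_test, np.full(len(y_test), train_mean))

print("train:", train_mse_model, train_mse_base)
print("test: ", test_mse_model, test_mse_base)
```

On the training set, least squares with an intercept always matches or beats the mean baseline, so only the test-set comparison indicates whether the extra interaction terms help on new data.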
