# Interaction Effect in Multiple Regression

Previously, we described how to build a multiple linear regression model for predicting a continuous outcome variable (y) from multiple predictor variables (x).

For example, to predict sales from the advertising budgets spent on youtube and facebook, the model equation is sales = b0 + b1*youtube + b2*facebook, where b0 is the intercept and b1 and b2 are the regression coefficients associated with the predictor variables youtube and facebook, respectively.

The above equation, also known as the additive model, captures only the main effects of the predictors. It assumes that the relationship between a given predictor variable and the outcome is independent of the other predictor variables (James et al. 2014; Bruce and Bruce 2017).

In our example, the additive model assumes that the effect of youtube advertising on sales is independent of the amount spent on facebook advertising.

This assumption might not be true. For example, spending money on facebook advertising may increase the effectiveness of youtube advertising on sales. In marketing, this is known as a synergy effect, and in statistics it is referred to as an interaction effect (James et al. 2014).

## Equation

The multiple linear regression equation with an interaction effect between two predictors (x1 and x2) can be written as follows:

y = b0 + b1*x1 + b2*x2 + b3*(x1*x2)

Considering our example, it becomes:

sales = b0 + b1*youtube + b2*facebook + b3*(youtube*facebook)

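This can also be written as:

sales = b0 + (b1 + b3*facebook)*youtube + b2*facebook

or as:

sales = b0 + b1*youtube + (b2 + b3*youtube)*facebook

In the first form, the slope of youtube is b1 + b3*facebook, so it changes with the facebook budget. The second form shows the same for facebook: its slope is b2 + b3*youtube, and b3 measures how much each slope shifts per unit of the other budget.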
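
To make the rearranged forms concrete, here is a minimal sketch with made-up coefficient values (the numbers in `coefs` and the function names are illustrative assumptions, not fitted estimates). It checks that, under the interaction model, the change in predicted sales per extra unit of youtube budget equals b1 + b3*facebook.

```python
def predict_sales(youtube, facebook, b0, b1, b2, b3):
    """Interaction model: sales = b0 + b1*youtube + b2*facebook + b3*(youtube*facebook)."""
    return b0 + b1 * youtube + b2 * facebook + b3 * youtube * facebook


def youtube_slope(facebook, b1, b3):
    """Change in predicted sales per extra unit of youtube budget."""
    return b1 + b3 * facebook


# Hypothetical coefficients, chosen only for illustration.
coefs = dict(b0=8.0, b1=0.02, b2=0.03, b3=0.001)

for fb in (0, 20, 40):
    # Difference in prediction when youtube goes from 100 to 101 at a fixed facebook budget.
    delta = predict_sales(101, fb, **coefs) - predict_sales(100, fb, **coefs)
    slope = youtube_slope(fb, coefs["b1"], coefs["b3"])
    print(f"facebook={fb:>2}: delta={delta:.4f}, b1 + b3*facebook={slope:.4f}")
```

With b3 set to zero, the slope would be the same in every row, which is the additive model.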

In [2]:
import pandas as pd
from statsmodels.regression import linear_model

# Load the marketing data; the unnamed first CSV column is used as the row index.
data = pd.read_csv("data/marketing.csv", index_col='Unnamed: 0')
X = data.drop('sales', axis=1)  # predictors: youtube, facebook, newspaper
y = data['sales']               # outcome

# Note: OLS does not add an intercept by itself, so this fit has no b0 term
# and the summary reports the uncentered R-squared.
model = linear_model.OLS(y, X).fit()
print(model.summary())

                                 OLS Regression Results                                
Dep. Variable:                  sales   R-squared (uncentered):                   0.982
Model:                            OLS   Adj. R-squared (uncentered):              0.982
Method:                 Least Squares   F-statistic:                              3566.
Date:                Sun, 14 Jun 2020   Prob (F-statistic):                   2.43e-171
Time:                        23:13:28   Log-Likelihood:                         -460.01
No. Observations:                 200   AIC:                                      926.0
Df Residuals:                     197   BIC:                                      935.9
Df Model:                           3                                                  
Covariance Type:            nonrobust                                                  
                 coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------
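
The summary above reports an uncentered R-squared because `OLS` in statsmodels does not add an intercept on its own. The sketch below uses NumPy least squares on synthetic data (the data, the true intercept of 5, and the helper names are assumptions for illustration, not the marketing dataset) to show how a column of ones brings b0 into the fit and how the centered and uncentered R-squared differ.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data with a true intercept of 5 (illustrative only).
n = 200
X = rng.uniform(0, 100, size=(n, 2))
y = 5.0 + 0.05 * X[:, 0] + 0.2 * X[:, 1] + rng.normal(0, 1, size=n)


def fit_ols(design, target):
    """Return least-squares coefficients for target ~ design."""
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    return beta


def r2_centered(target, fitted):
    """1 - SS_res / SS_tot, with SS_tot measured around the mean of the target."""
    ss_res = np.sum((target - fitted) ** 2)
    ss_tot = np.sum((target - target.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def r2_uncentered(target, fitted):
    """1 - SS_res / sum(target**2), the variant reported when no intercept is fitted."""
    ss_res = np.sum((target - fitted) ** 2)
    return 1.0 - ss_res / np.sum(target ** 2)


# Without an intercept: the fit is forced through the origin.
beta_no_const = fit_ols(X, y)
fitted_no_const = X @ beta_no_const

# With an intercept: prepend a column of ones, so the first coefficient is b0.
X_const = np.column_stack([np.ones(n), X])
beta_const = fit_ols(X_const, y)
fitted_const = X_const @ beta_const

print("no intercept, uncentered R2:", r2_uncentered(y, fitted_no_const))
print("no intercept, centered R2:  ", r2_centered(y, fitted_no_const))
print("with intercept, b0 estimate:", beta_const[0])
print("with intercept, centered R2:", r2_centered(y, fitted_const))
```

In statsmodels, the analogous step is to pass `sm.add_constant(X)` instead of `X` to `OLS`. The uncentered R-squared is not comparable with the centered R-squared that scikit-learn's `score` returns.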

In [3]:
from sklearn.preprocessing import PolynomialFeatures

# Generate the pairwise interaction terms (products of two distinct predictors).
# interaction_only=True skips squared terms; include_bias=False skips the constant column.
x_interaction = PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(X)

# Create a new DataFrame holding the original predictors plus the interaction terms.
interaction_df = pd.DataFrame(
    x_interaction,
    columns=['youtube', 'facebook', 'newspaper',
             'youtube:facebook', 'youtube:newspaper', 'facebook:newspaper'],
)
interaction_df

     youtube  facebook  newspaper  youtube:facebook  youtube:newspaper  facebook:newspaper
0     276.12     45.36      83.04        12524.8032         22929.0048           3766.6944
1      53.40     47.16      54.12         2518.3440          2890.0080           2552.2992
2      20.64     55.08      83.16         1136.8512          1716.4224           4580.4528
3     181.80     49.56      70.20         9010.0080         12762.3600           3479.1120
4     216.96     12.96      70.08         2811.8016         15204.5568            908.2368
...      ...       ...        ...               ...                ...                 ...
195    45.84      4.44      16.56          203.5296           759.1104             73.5264
196   113.04      5.88       9.72          664.6752          1098.7488             57.1536
197   212.40     11.16       7.68         2370.3840          1631.2320             85.7088
198   340.32     50.40      79.44        17152.1280         27035.0208           4003.7760
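
As a cross-check on what `PolynomialFeatures(2, interaction_only=True, include_bias=False)` returns, the sketch below rebuilds the pairwise products by hand for the first row of the table. The helper `add_interactions` is a hypothetical name introduced here, not part of scikit-learn.

```python
from itertools import combinations


def add_interactions(row, names):
    """Return the original features followed by the product of every pair,
    in the same order as the column names used for interaction_df."""
    values = dict(zip(names, row))
    for a, b in combinations(names, 2):
        values[f"{a}:{b}"] = values[a] * values[b]
    return values


names = ["youtube", "facebook", "newspaper"]
first_row = [276.12, 45.36, 83.04]  # row 0 of the table above

expanded = add_interactions(first_row, names)
for name, value in expanded.items():
    print(f"{name:20s}{value:12.4f}")
```

The three products should agree with the `youtube:facebook`, `youtube:newspaper`, and `facebook:newspaper` entries of row 0.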


In [4]:
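# Fit OLS on the main effects plus interaction terms (again without an intercept column).
# y is passed as a plain array, so the summary labels the outcome "y".
interaction_model = linear_model.OLS(y.values.reshape(-1, 1), interaction_df).fit()
print(interaction_model.summary())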

                                 OLS Regression Results                                
Dep. Variable:                      y   R-squared (uncentered):                   0.988
Model:                            OLS   Adj. R-squared (uncentered):              0.988
Method:                 Least Squares   F-statistic:                              2675.
Date:                Sun, 14 Jun 2020   Prob (F-statistic):                   1.41e-183
Time:                        23:13:35   Log-Likelihood:                         -418.51
No. Observations:                 200   AIC:                                      849.0
Df Residuals:                     194   BIC:                                      868.8
Df Model:                           6                                                  
Covariance Type:            nonrobust                                                  
                         coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------

In [5]:
# Keep only the coefficients that are significant at the 5% level.
interaction_model.pvalues[interaction_model.pvalues < 0.05]

youtube               7.792133e-53
facebook              5.942712e-26
newspaper             2.996430e-20
youtube:facebook      5.913430e-13
youtube:newspaper     5.610839e-08
facebook:newspaper    2.861813e-13
dtype: float64

In [6]:
print('Parameters: ', interaction_model.params)
# rsquared is the uncentered R-squared here, because the model has no intercept.
print('R2: ', interaction_model.rsquared)
print('Pvalues: ', interaction_model.pvalues)

Parameters:  youtube               0.043296
facebook              0.181139
newspaper             0.144349
youtube:facebook      0.000609
youtube:newspaper    -0.000279
facebook:newspaper   -0.002285
dtype: float64
R2:  0.9880589866646199
Pvalues:  youtube               7.792133e-53
facebook              5.942712e-26
newspaper             2.996430e-20
youtube:facebook      5.913430e-13
youtube:newspaper     5.610839e-08
facebook:newspaper    2.861813e-13
dtype: float64
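
One way to read these coefficients is through the slope they imply for each predictor. As a sketch, the snippet below plugs the printed estimates into the slope of youtube, which under this model is the youtube coefficient plus the `youtube:facebook` coefficient times facebook plus the `youtube:newspaper` coefficient times newspaper. The budget values passed in are arbitrary examples, and the function name is an assumption.

```python
# Estimates copied from the output above (model fitted without an intercept).
params = {
    "youtube": 0.043296,
    "facebook": 0.181139,
    "newspaper": 0.144349,
    "youtube:facebook": 0.000609,
    "youtube:newspaper": -0.000279,
    "facebook:newspaper": -0.002285,
}


def youtube_effect(facebook, newspaper, p=params):
    """Change in predicted sales per extra unit of youtube budget."""
    return (
        p["youtube"]
        + p["youtube:facebook"] * facebook
        + p["youtube:newspaper"] * newspaper
    )


# The slope rises with the facebook budget and falls slightly with newspaper.
for fb, news in [(0, 0), (30, 0), (30, 50)]:
    print(f"facebook={fb:>3}, newspaper={news:>3}: {youtube_effect(fb, news):.6f}")
```

The positive `youtube:facebook` estimate is the synergy effect described earlier: each extra unit of facebook budget raises the return on youtube spending.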


# Using the Linear Regression Technique

In [8]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as seabornInstance
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
%matplotlib inline

dataset = pd.read_csv("data/marketing.csv", index_col='Unnamed: 0')
print(dataset.describe())

# Forward-fill any missing values (fillna(method='ffill') is deprecated in recent pandas).
dataset = dataset.ffill()

# Predictors and outcome as NumPy arrays
X = dataset[['youtube', 'facebook', 'newspaper']].values
y = dataset[['sales']].values

# Hold out 20% of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)  # fits an intercept by default

# Intercept (b0) and slopes (b1, b2, b3)
print("To retrieve the intercept: " , regressor.intercept_)
print("For retrieving the slope: " , regressor.coef_)

# Compare predictions with the actual values on the test set
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
print(df.head(10))

          youtube    facebook   newspaper       sales
count  200.000000  200.000000  200.000000  200.000000
mean   176.451000   27.916800   36.664800   16.827000
std    103.025084   17.816171   26.134345    6.260948
min      0.840000    0.000000    0.360000    1.920000
25%     89.250000   11.970000   15.300000   12.450000
50%    179.700000   27.480000   30.900000   15.480000
75%    262.590000   43.830000   54.120000   20.880000
max    355.680000   59.520000  136.800000   32.400000
To retrieve the intercept:  [3.59387164]
For retrieving the slope:  [[ 0.04458402  0.19649703 -0.00278146]]
   Actual  Predicted
0   13.56  12.068875
1   10.08   8.942737
2   10.44   8.423649
3   30.48  28.896357
4   14.04  14.421435
5   10.44   7.845526
6    8.64  15.339443
7   15.84  18.131695
8   11.04  12.923688
9   19.92  19.612295


In [9]:
# Error metrics on the held-out test set
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# Note: this R^2 is computed on the training set, not the test set
print("R Score" , regressor.score(X_train, y_train))
# RMSE relative to the mean of the actual test values
print("An error rate: " , np.sqrt(metrics.mean_squared_error(y_test, y_pred)) / np.mean(y_test))

Mean Absolute Error: 1.6341376202508333
Mean Squared Error: 6.339050339687548
Root Mean Squared Error: 2.5177470761948166
R Score 0.9067114990146382
An error rate:  0.15843855491755185
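
For reference, the metrics printed above can be computed directly from their definitions. The sketch below does so with NumPy on the ten actual/predicted pairs shown earlier, so its numbers will differ from the full test-set values. The variable names are introduced here for illustration; `nrmse` corresponds to what the code labels an error rate.

```python
import numpy as np

# First ten test rows from the additive model's Actual/Predicted table.
actual = np.array([13.56, 10.08, 10.44, 30.48, 14.04,
                   10.44, 8.64, 15.84, 11.04, 19.92])
predicted = np.array([12.068875, 8.942737, 8.423649, 28.896357, 14.421435,
                      7.845526, 15.339443, 18.131695, 12.923688, 19.612295])

errors = actual - predicted

mae = np.mean(np.abs(errors))    # mean absolute error
mse = np.mean(errors ** 2)       # mean squared error
rmse = np.sqrt(mse)              # root mean squared error, in the units of sales
nrmse = rmse / np.mean(actual)   # RMSE relative to the mean of the actual values

# Coefficient of determination: 1 - SS_res / SS_tot (centered).
r2 = 1.0 - np.sum(errors ** 2) / np.sum((actual - actual.mean()) ** 2)

print("MAE:  ", mae)
print("MSE:  ", mse)
print("RMSE: ", rmse)
print("NRMSE:", nrmse)
print("R2:   ", r2)
```

Because squaring weights large misses more heavily, RMSE is never smaller than MAE; row 6 (actual 8.64, predicted about 15.34) accounts for most of the gap in this sample.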


In [10]:
from sklearn.preprocessing import PolynomialFeatures

# Generate the pairwise interaction terms from the predictor array X
x_interaction = PolynomialFeatures(2, interaction_only=True, include_bias=False).fit_transform(X)

# Create a new DataFrame with the interaction terms included
interaction_df = pd.DataFrame(
    x_interaction,
    columns=['youtube', 'facebook', 'newspaper',
             'youtube:facebook', 'youtube:newspaper', 'facebook:newspaper'],
)

# Same split settings as before, so both models are evaluated on the same test rows
X_train, X_test, y_train, y_test = train_test_split(interaction_df, y, test_size=0.2, random_state=0)
regressor = LinearRegression().fit(X_train, y_train)  # fits an intercept by default

# Intercept (b0) and slopes for the three main effects and three interactions
print("To retrieve the intercept: " , regressor.intercept_)
print("For retrieving the slope: " , regressor.coef_)

# Compare predictions with the actual values on the test set
y_pred = regressor.predict(X_test)
df = pd.DataFrame({'Actual': y_test.flatten(), 'Predicted': y_pred.flatten()})
print(df.head(10))

To retrieve the intercept:  [7.63297321]
For retrieving the slope:  [[ 2.02406259e-02  3.62723148e-02  1.75222906e-02  9.11329922e-04
  -5.41593417e-05 -1.73882564e-04]]
   Actual  Predicted
0   13.56  12.259797
1   10.08  10.375163
2   10.44  10.199836
3   30.48  31.969616
4   14.04  14.052761
5   10.44  10.118421
6    8.64  11.138852
7   15.84  16.086702
8   11.04  11.519283
9   19.92  19.485511


In [11]:
# Error metrics on the held-out test set
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))
print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
# Note: this R^2 is computed on the training set, not the test set
print("R Score" , regressor.score(X_train, y_train))
# RMSE relative to the mean of the actual test values
print("An error rate: " , np.sqrt(metrics.mean_squared_error(y_test, y_pred)) / np.mean(y_test))

Mean Absolute Error: 0.9026018467499058
Mean Squared Error: 2.319623301504984
Root Mean Squared Error: 1.5230309588137019
R Score 0.9738701026425148
An error rate:  0.09584236101023862
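
The two runs can be compared on the held-out test metrics printed above. As a small worked calculation (values copied from the two outputs; the variable names are introduced here), the relative reduction in test RMSE from adding the interaction terms is:

```python
# Test-set RMSE values copied from the two outputs above.
rmse_additive = 2.5177470761948166
rmse_interaction = 1.5230309588137019

# Fraction by which the interaction model lowers the test RMSE.
relative_reduction = (rmse_additive - rmse_interaction) / rmse_additive
print(f"RMSE reduction: {relative_reduction:.1%}")
```

Note that the "R Score" lines are computed on the training split, so they are less informative about generalization than the test-set errors used here.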
