## Baseline Model
As a baseline for this predictive task, we chose multiple linear regression. It is a conceptually simple model, and its coefficients are relatively easy to interpret compared with other predictive models. In addition, the same model can be applied to other target variables of interest as the research question develops.

Before fitting the multiple regression model, we have to preprocess the data. For continuous variables, this means centering and scaling (subtracting the mean and dividing by the standard deviation); for categorical variables, it means creating dummy variables.
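The two preprocessing steps can be illustrated on a toy table. This is a minimal sketch: the column names `volume` and `page` and all values are made up for illustration and do not come from the project data.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical values, not taken from the project dataset)
toy = pd.DataFrame({
    "volume": [100.0, 200.0, 300.0, 400.0],
    "page": ["home", "menu", "home", "locations"],
})

# Center and scale the continuous column: z = (x - mean) / std
scaler = StandardScaler()
toy["volume_scaled"] = scaler.fit_transform(toy[["volume"]]).ravel()

# One-hot encode the categorical column into dummy variables
dummies = pd.get_dummies(toy["page"], prefix="page")

print(toy["volume_scaled"].round(3).tolist())
print(dummies.columns.tolist())
```

After scaling, the column has mean 0 and unit (population) standard deviation, and each distinct category becomes its own indicator column.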

We also drop variables that hold arbitrary dates, such as the timestamp, because they are difficult to handle in a regression context.


In [1]:
# Import relevant packages
import re

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, explained_variance_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

%matplotlib inline

In [2]:
#import data from wrangling stage
df=pd.read_csv('data_step4.csv')

In [3]:
df.head()
df.drop(columns='Timestamp', inplace=True)


In [4]:
df.head()

Unnamed: 0.1,Unnamed: 0,Keyword,Position,Previous position,Search Volume,Keyword Difficulty,CPC,URL,Traffic,Competition,Number of Results
0,0,dunkin donuts,1,1,2740000,88.99,2.24,https://www.dunkindonuts.com/,2192000,0.09,35000000
1,1,dunkin donuts near me,1,1,823000,83.58,2.32,https://www.dunkindonuts.com/en/locations,658400,0.01,85
2,2,dunkin donuts menu,1,1,550000,86.33,1.78,https://www.dunkindonuts.com/en/menu,440000,0.02,30200000
3,3,donuts,1,1,823000,81.85,1.54,https://www.dunkindonuts.com/,386810,0.04,321000000
4,4,dd,1,1,301000,84.47,2.53,https://www.dunkindonuts.com/,240800,0.02,871000000


In [5]:
df.dtypes

Unnamed: 0              int64
Keyword                object
Position                int64
Previous position       int64
Search Volume           int64
Keyword Difficulty    float64
CPC                   float64
URL                    object
Traffic                 int64
Competition           float64
Number of Results       int64
dtype: object

Next, we set aside the Keyword column, which serves only as a row label. It is dropped from the preprocessed data because it is not entered into the regression model.

In [6]:
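# Keep the Keyword labels separately and remove them from the feature table
keyword = df['Keyword']
df1 = df.drop(columns='Keyword')
# 'Unnamed: 0' (a leftover CSV index) is still present here;
# it is removed from continuous_var in the next cell
df1.columns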

Index(['Unnamed: 0', 'Position', 'Previous position', 'Search Volume',
       'Keyword Difficulty', 'CPC', 'URL', 'Traffic', 'Competition',
       'Number of Results'],
      dtype='object')

In [7]:
continuous_var=df1.drop(columns='URL')

continuous_var.drop(columns='Unnamed: 0', inplace=True)
continuous_var.head()


Unnamed: 0,Position,Previous position,Search Volume,Keyword Difficulty,CPC,Traffic,Competition,Number of Results
0,1,1,2740000,88.99,2.24,2192000,0.09,35000000
1,1,1,823000,83.58,2.32,658400,0.01,85
2,1,1,550000,86.33,1.78,440000,0.02,30200000
3,1,1,823000,81.85,1.54,386810,0.04,321000000
4,1,1,301000,84.47,2.53,240800,0.02,871000000


In [8]:
cat_var=['URL']
continuous_var.head()

Unnamed: 0,Position,Previous position,Search Volume,Keyword Difficulty,CPC,Traffic,Competition,Number of Results
0,1,1,2740000,88.99,2.24,2192000,0.09,35000000
1,1,1,823000,83.58,2.32,658400,0.01,85
2,1,1,550000,86.33,1.78,440000,0.02,30200000
3,1,1,823000,81.85,1.54,386810,0.04,321000000
4,1,1,301000,84.47,2.53,240800,0.02,871000000


## Preprocessing

The continuous variables are centered and scaled, since many predictive models perform better when all variables have a similar mean and variance. We also rely on the data having no missing values, which was established earlier. Next, the independent variables are placed in one array (X) and the target variable (Position) in a second array (y). Note that the cell below scales Position together with the predictors, so the target is also expressed in standardized units.
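As a minimal sketch of these steps on synthetic data (the values are randomly generated, only the column names echo the dataset), the snippet below checks for missing values, separates the target, and scales the predictors while keeping their column names. Unlike the notebook cell that follows, this variant leaves the target unscaled; that is a choice made for the illustration, not the approach used in this analysis.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the continuous columns (hypothetical values)
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Position": rng.integers(1, 100, size=50),
    "Search Volume": rng.integers(10, 10_000, size=50),
    "CPC": rng.uniform(0.1, 5.0, size=50),
})

# Confirm that no column contains missing values before scaling
assert demo.isna().sum().sum() == 0

# Separate the target from the predictors
y_demo = demo["Position"]
X_demo = demo.drop(columns="Position")

# Scale the predictors, keeping column names and the original index
scaler = StandardScaler()
X_scaled = pd.DataFrame(
    scaler.fit_transform(X_demo),
    columns=X_demo.columns,
    index=X_demo.index,
)

print(X_scaled.describe().loc[["mean", "std"]].round(3))
```

Passing `columns=` when rebuilding the DataFrame avoids having to reassign the column names by hand afterwards.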

In [9]:
scaler = StandardScaler()
df_scaled = pd.DataFrame(scaler.fit_transform(continuous_var))
df_scaled.columns
df_scaled.head()





Unnamed: 0,0,1,2,3,4,5,6,7
0,-1.542663,-1.49188,50.418352,1.148294,0.43629,91.828251,-0.177323,-0.203623
1,-1.542663,-1.49188,15.113543,0.380464,0.456939,27.566427,-0.473826,-0.228116
2,-1.542663,-1.49188,10.085784,0.770766,0.317556,18.414899,-0.436763,-0.206982
3,-1.542663,-1.49188,15.113543,0.134928,0.255607,16.1861,-0.362637,-0.003482
4,-1.542663,-1.49188,5.500026,0.50678,0.511144,10.067902,-0.436763,0.381404


In [10]:
df_scaled.columns=[ 'Position', 'Previous position', 'Search Volume',
       'Keyword Difficulty', 'CPC', 'Traffic', 'Competition',
       'Number of Results']

In [11]:
df_scaled['URL'] = df1[cat_var]

In [12]:
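df_scaled.head()
# One-hot encode the non-numeric URL column; dummy_na=True adds an indicator for missing URLs
df_new = pd.get_dummies(df_scaled, dummy_na=True)
# Predictors (X) and target (y)
X = df_new.drop(columns='Position')
y = df_new['Position']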

## Training and Testing the Model
Next, we split the data into training and testing sets. The training set is used to fit the model (and to cross-validate any hyperparameters), and the testing set is used to validate the fitted model.
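A minimal sketch of this workflow on synthetic data is shown here. The data, the variable names, and the use of a pipeline are assumptions for illustration; the notebook itself scales the data before splitting. Wrapping the scaler and the regression in one pipeline means the scaling statistics are recomputed inside each cross-validation fold.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression data (hypothetical; stands in for X and y)
rng = np.random.default_rng(42)
X_demo = rng.normal(size=(200, 3))
y_demo = X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# Hold out a test set that is used only once, at the very end
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=42
)

# The pipeline refits the scaler inside each fold, so validation
# folds never influence the scaling statistics
model = make_pipeline(StandardScaler(), LinearRegression())
cv_scores = cross_val_score(model, X_tr, y_tr, cv=5, scoring="r2")

# Final check on the held-out test set
model.fit(X_tr, y_tr)
test_r2 = model.score(X_te, y_te)

print(cv_scores.mean(), test_r2)
```

Plain linear regression has no hyperparameters to tune, so here the cross-validation scores mainly indicate how stable the fit is across subsets of the training data.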

First, we fit a model with default parameters.


In [13]:
#Split into test/train
X_train, X_test, y_train, y_test= train_test_split(X, y, test_size=0.33, random_state=42)


In [14]:
# with sklearn
regr = LinearRegression()
regr.fit(X_train, y_train)


LinearRegression()

In [15]:
y_pred=regr.predict(X_test)

In [16]:
explained_variance_score(y_test, y_pred)

-6.675009881291173e+21

The hugely negative explained variance suggests that the one-hot-encoded URL column is destabilizing the model. Let's drop the URL column and run the model again.
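One plausible explanation, not verified against the data here, is that URL has many distinct values, so one-hot encoding creates almost as many predictors as there are rows and the least-squares fit becomes ill-conditioned. A quick way to check this kind of problem is to compare the number of distinct values with the number of rows. The sketch below uses a made-up table with `example.com` URLs.

```python
import pandas as pd

# Toy frame with a high-cardinality text column (hypothetical URLs)
toy = pd.DataFrame({
    "Traffic": [10, 20, 30, 40, 50, 60],
    "URL": [
        "https://example.com/",
        "https://example.com/a",
        "https://example.com/b",
        "https://example.com/c",
        "https://example.com/d",
        "https://example.com/",
    ],
})

n_rows = len(toy)
n_unique = toy["URL"].nunique()

# dummy_na=True adds one extra indicator column for missing values
encoded = pd.get_dummies(toy, columns=["URL"], dummy_na=True)
n_dummy_cols = encoded.shape[1] - 1  # exclude the Traffic column

print(f"{n_unique} unique URLs across {n_rows} rows")
print(f"{n_dummy_cols} dummy columns created")
```

When the dummy-column count is close to the row count, many indicators are non-zero for a single row only, and categories that appear only in the test set receive coefficients the training data never constrained.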


In [17]:
df_scaled.head()

Unnamed: 0,Position,Previous position,Search Volume,Keyword Difficulty,CPC,Traffic,Competition,Number of Results,URL
0,-1.542663,-1.49188,50.418352,1.148294,0.43629,91.828251,-0.177323,-0.203623,https://www.dunkindonuts.com/
1,-1.542663,-1.49188,15.113543,0.380464,0.456939,27.566427,-0.473826,-0.228116,https://www.dunkindonuts.com/en/locations
2,-1.542663,-1.49188,10.085784,0.770766,0.317556,18.414899,-0.436763,-0.206982,https://www.dunkindonuts.com/en/menu
3,-1.542663,-1.49188,15.113543,0.134928,0.255607,16.1861,-0.362637,-0.003482,https://www.dunkindonuts.com/
4,-1.542663,-1.49188,5.500026,0.50678,0.511144,10.067902,-0.436763,0.381404,https://www.dunkindonuts.com/


In [18]:
df_scaled.drop(columns='URL', inplace=True)

In [19]:
X2=df_scaled.drop(columns='Position')
y2=df_scaled['Position']

In [20]:
#Split into test/train
X_train2, X_test2, y_train2, y_test2= train_test_split(X2, y2, test_size=0.2, random_state=42)

In [21]:
# with sklearn
regr2 = LinearRegression()
regr2.fit(X_train2, y_train2)

print('Intercept: \n', regr2.intercept_)
print('Coefficients: \n', regr2.coef_)

Intercept: 
 -0.000978581584977256
Coefficients: 
 [ 0.96177721  0.01754652 -0.01799402 -0.00141528 -0.01720571  0.00117578
  0.01394046]
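The coefficients in `regr2.coef_` follow the column order of `X2`, so the first value (about 0.96) belongs to Previous position. To make this mapping explicit, the coefficients can be paired with their column names. The sketch below does this on synthetic data; the data and the reduced column set are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data with named columns (hypothetical; mirrors the X2 / y2 setup)
rng = np.random.default_rng(1)
X_demo = pd.DataFrame(
    rng.normal(size=(100, 3)),
    columns=["Previous position", "CPC", "Traffic"],
)
y_demo = 0.9 * X_demo["Previous position"] + rng.normal(scale=0.1, size=100)

model = LinearRegression().fit(X_demo, y_demo)

# Pair each coefficient with its column name and sort by magnitude
coef_table = pd.Series(model.coef_, index=X_demo.columns)
coef_table = coef_table.reindex(coef_table.abs().sort_values(ascending=False).index)

print(coef_table)
```

Because the predictors are standardized, the magnitudes in such a table are directly comparable across columns.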


In [22]:
y_pred2=regr2.predict(X_test2)

In [23]:
explained_variance_score(y_test2, y_pred2)

0.9101729847536906
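A caveat when reading this score: `explained_variance_score` does not penalize a constant offset in the predictions, so it can be worth reporting `r2_score` and an error metric alongside it. The sketch below shows the difference on a small made-up example in which the predictions track the target perfectly but are shifted by a constant.

```python
import numpy as np
from sklearn.metrics import explained_variance_score, mean_absolute_error, r2_score

y_true = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
# Predictions that follow the target exactly but are shifted by +2
y_shifted = y_true + 2.0

ev = explained_variance_score(y_true, y_shifted)  # ignores the constant shift
r2 = r2_score(y_true, y_shifted)                  # penalizes the shift
mae = mean_absolute_error(y_true, y_shifted)      # average absolute error

print(ev, r2, mae)
```

If the two scores are close on the real test set, the model's predictions carry little systematic bias.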