## Modelling Notebook

This notebook is for trying out the candidate models. No preprocessing is done here. Steps:

1. Read in `data/final_dataset.csv`, which was created in `Data Cleaning.ipynb`
2. Try various models and print appropriate metrics (accuracy, MSE, etc.)
3. Pick a final model and save it as `models/model.pkl`

## Importing and Loading

In [1]:
import pandas as pd

from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

import pickle

In [2]:
df = pd.read_csv('../data/final_dataset.csv')
df

Unnamed: 0,Weekly Sales,Holiday_Flag,Temperature,Fuel_Price,CPI,Unemployment,Day,Month,Year,Store_1,...,Store_36,Store_37,Store_38,Store_39,Store_40,Store_41,Store_42,Store_43,Store_44,Store_45
0,1643690.90,0.0,0.434149,0.050100,0.840500,0.405118,0.133333,0.090909,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,1641957.44,1.0,0.396967,0.038076,0.841941,0.405118,0.366667,0.090909,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,1611968.17,0.0,0.410861,0.021042,0.842405,0.405118,0.600000,0.090909,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,1409727.59,0.0,0.476419,0.044589,0.842707,0.405118,0.833333,0.090909,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,1554806.68,0.0,0.475147,0.076653,0.843008,0.405118,0.133333,0.181818,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6430,713173.95,0.0,0.654990,0.764028,0.651876,0.460514,0.900000,0.727273,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6431,733455.07,0.0,0.655088,0.758016,0.653427,0.458884,0.133333,0.818182,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6432,734464.36,0.0,0.553131,0.765531,0.654977,0.458884,0.366667,0.818182,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0
6433,718125.53,0.0,0.572701,0.750000,0.655013,0.458884,0.600000,0.818182,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0


## Splitting

### Input and Output

In [3]:
X = df.drop(columns=['Weekly Sales'])
Y = df['Weekly Sales']
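
Note that the frame displayed above still carries an `Unnamed: 0` column. This is typically the row index written out when a CSV is saved without `index=False`, and since only `Weekly Sales` is dropped, it stays in `X` as a feature. A minimal sketch of one way to exclude it, using a small made-up frame (`toy`, `X_toy`, `Y_toy` and their values are illustrative, not the real dataset):

```python
import pandas as pd

# Toy stand-in for the real data; the values are made up for illustration.
toy = pd.DataFrame({
    'Unnamed: 0': [0, 1, 2],
    'Weekly Sales': [100.0, 120.0, 90.0],
    'Holiday_Flag': [0.0, 1.0, 0.0],
})

# Drop the target and the leftover index column.
# errors='ignore' keeps the call safe if the index column is absent.
X_toy = toy.drop(columns=['Weekly Sales', 'Unnamed: 0'], errors='ignore')
Y_toy = toy['Weekly Sales']

print(list(X_toy.columns))
```

Alternatively, passing `index_col=0` to `pd.read_csv` treats that column as the index at load time. Either change would alter the metrics reported in this notebook, so the models would need to be re-evaluated.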

### Training and Testing

In [4]:
x_train, x_test, y_train, y_test = train_test_split(X, Y)
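
`train_test_split` shuffles the rows and, by default, holds out 25% of them for testing. Without a fixed seed, each run produces a different split and therefore slightly different metrics. A sketch of a reproducible split on made-up data follows; the seed `42` and the explicit `test_size=0.25` are illustrative choices, not values used in this notebook:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 20 rows, 2 features.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20)

# Two calls with the same seed should give identical splits.
split_a = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.25, random_state=42)

same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split, split_a[1].shape)
```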

In [5]:
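def train_eval(model, x_train, x_test, y_train, y_test):
    """Fit the model, print R2, MAE and RMSE on the test set, and return it."""
    model.fit(x_train, y_train)
    prd = model.predict(x_test)
    print(r2_score(y_test, prd))                    # R2 score
    print(mean_absolute_error(y_test, prd))         # mean absolute error (MAE)
    print(mean_squared_error(y_test, prd) ** 0.5)   # root mean squared error (RMSE)

    return model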
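
Because `train_eval` prints three unlabelled numbers (R2, MAE, RMSE, in that order), comparing models means reading across several cell outputs. One possible variant, sketched here on synthetic data, returns the metrics as a dictionary so they can be collected into a single table. The helper `evaluate` and the synthetic data are hypothetical and not part of this notebook:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor


def evaluate(model, x_train, x_test, y_train, y_test):
    """Fit the model and return its test-set metrics as a dict."""
    model.fit(x_train, y_train)
    prd = model.predict(x_test)
    return {
        'R2': r2_score(y_test, prd),
        'MAE': mean_absolute_error(y_test, prd),
        'RMSE': mean_squared_error(y_test, prd) ** 0.5,
    }


# Synthetic, nearly linear regression data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.random((200, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(0, 0.1, 200)

# train_test_split returns (x_train, x_test, y_train, y_test) in that order.
splits = train_test_split(X_demo, y_demo, random_state=0)

# One row per model, one column per metric.
results = pd.DataFrame({
    'LinearRegression': evaluate(LinearRegression(), *splits),
    'DecisionTree': evaluate(DecisionTreeRegressor(random_state=0), *splits),
}).T
print(results)
```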

## Evaluating

### Linear Regression

In [6]:
lr = train_eval(LinearRegression(), x_train, x_test, y_train, y_test)

0.9258699357549504
93616.15903666873
153715.9267459009


### Support Vector Machine

In [7]:
svm = train_eval(SVR(), x_train, x_test, y_train, y_test)

-0.02136723591663814
466014.7437398306
570574.8707950586


### Decision Tree

In [8]:
dt = train_eval(DecisionTreeRegressor(), x_train, x_test, y_train, y_test)

0.9132660899153436
76029.20351771287
166270.85966805601


## Saving best model

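The SVM (`SVR`) performs the worst by a wide margin: its errors are several times larger than those of the other two models, and its negative R² means it does worse than simply predicting the mean. Linear Regression and Decision Tree are harder to separate. Linear Regression has the higher MAE (about 93,600 vs 76,000) but the lower RMSE (about 153,700 vs 166,300) and a slightly higher R² (0.926 vs 0.913). Because RMSE penalises large errors more heavily than MAE, this suggests that the Decision Tree is closer on a typical prediction but makes a few larger misses, i.e. its errors have a higher variance. Since the difference in RMSE is small relative to the improvement in MAE, we consider the Decision Tree to be the better model for our purpose.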
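
The disagreement between MAE and RMSE is easier to see on a toy example. RMSE squares each error before averaging, so a handful of large misses raise it much more than they raise MAE. The error values below are made up purely to illustrate this and are not taken from either model:

```python
import numpy as np

errors_even = np.array([10.0, 10.0, 10.0, 10.0])   # consistently moderate errors
errors_spiky = np.array([1.0, 1.0, 1.0, 30.0])     # mostly small, one large miss


def mae(errors):
    """Mean absolute error of a vector of residuals."""
    return float(np.mean(np.abs(errors)))


def rmse(errors):
    """Root mean squared error of a vector of residuals."""
    return float(np.sqrt(np.mean(errors ** 2)))


# The spiky errors have the lower MAE but the higher RMSE.
print(f'even:  MAE={mae(errors_even):.2f}  RMSE={rmse(errors_even):.2f}')
print(f'spiky: MAE={mae(errors_spiky):.2f}  RMSE={rmse(errors_spiky):.2f}')
```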

In [9]:
filename = '../models/model.pkl'
# Use a context manager so the file is closed (and flushed) after writing
with open(filename, 'wb') as f:
    pickle.dump(dt, f)

We can now load the model and check that it is able to generate predictions.

In [10]:
# Use a context manager so the file is closed after reading
with open(filename, 'rb') as f:
    model = pickle.load(f)
model.predict(x_test)

array([ 314910.37, 2627910.75,  993097.43, ...,  513015.35,  516556.94,
        298947.51])
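
Beyond checking that `predict` runs, one could also confirm that the reloaded model reproduces the original model's predictions exactly. A self-contained sketch using synthetic data and a temporary file (the path and data are placeholders, not the notebook's `../models/model.pkl`):

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic data, for illustration only.
rng = np.random.default_rng(0)
X_demo = rng.random((100, 4))
y_demo = rng.random(100)

original = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Save to a temporary file and load it back.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, 'model.pkl')
    with open(path, 'wb') as f:
        pickle.dump(original, f)
    with open(path, 'rb') as f:
        restored = pickle.load(f)

# The restored model should give exactly the same predictions.
round_trip_ok = np.array_equal(original.predict(X_demo), restored.predict(X_demo))
print(round_trip_ok)
```

Pickled scikit-learn models are generally only safe to load with the same scikit-learn version that saved them, so it may be worth recording that version alongside `model.pkl`.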