We will use the Grocery Store data again, but this time we will split it into ‘Train’ and ‘Test’ pieces. Splitting data into train and test sets is a technique used to guard against overfitting and to estimate how your model might perform in the real world. It is also good practice for using a regression to predict point estimates for an entire data set.

**An aside**: when doing AI/ML modelling, you often use three data sets: Train, Validation, and Test. In this case, the model is built on the train data, the hyperparameters (for example, the number of hidden layers in a neural network) are tuned using the validation data, and the final model is evaluated once on the test data.
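
A three-way split like the one in the aside can be sketched with two calls to scikit-learn's train_test_split. This is only an illustration: the data below is synthetic, the 60/20/20 proportions are an assumption, and the `_demo` names are made up so they do not clash with the variables used later.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 100 rows, 3 features (not the grocery data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.normal(size=100)

# Step 1: hold out 20% of all rows as the final test set.
X_rest, X_test_demo, y_rest, y_test_demo = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0)

# Step 2: split the remaining 80% into train (60% of the total) and
# validation (20% of the total), since 0.25 * 0.8 = 0.2.
X_train_demo, X_val_demo, y_train_demo, y_val_demo = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0)

print(len(X_train_demo), len(X_val_demo), len(X_test_demo))
```

In this tutorial we only need the two-way split, so the validation step is skipped.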


# Tasks

## Create test and train data sets

In [None]:
#Import Grocery Data

import pandas as pd
import os
from os.path import curdir
path = os.path.join(curdir,'Data',"MMA_860_Grocery_Data.xlsx")
data = pd.read_excel(path,sheet_name=0,header=0,index_col="Obs")
data.info()

In [None]:
#X holds every column except Grocery_Bill (the predictors); y holds Grocery_Bill (the target)
X = data.drop(columns=['Grocery_Bill']).values
y = data['Grocery_Bill'].values

Depending on the size of your dataset, you can evaluate how large a test set is practical. The larger the dataset, the larger your test set can be. In this case, we will use 30% for testing and 70% for training, which is a reasonable rule of thumb. Create these datasets as train and test pieces (named X_train, y_train, X_test, and y_test in the code below).

Note: there are three important things to keep in mind:
1. Test and train sets must be mutually exclusive (i.e., no overlapping data)
2. Test and train sets must contain the same pattern of data (i.e., you should sample randomly)
3. If you have time series data, you should always test on the most recent data
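
For point 3, a chronological split replaces the random one. The sketch below assumes a hypothetical monthly series of ten observations (the grocery data is not a time series, so the `dates` and `values` arrays are invented for illustration). It keeps the oldest 70% of rows for training and the most recent 30% for testing.

```python
import numpy as np

# Hypothetical time-ordered data: 10 monthly observations.
dates = np.arange("2023-01", "2023-11", dtype="datetime64[M]")
values = np.arange(10, dtype=float)

# Sort by date, oldest first (already sorted here, shown for completeness).
order = np.argsort(dates)
dates, values = dates[order], values[order]

# The oldest 70% of rows train the model; the newest 30% test it.
cutoff = int(round(0.7 * len(values)))
train_dates, test_dates = dates[:cutoff], dates[cutoff:]
train_values, test_values = values[:cutoff], values[cutoff:]

# Every training date precedes every test date, so the two sets are
# mutually exclusive and the test set holds the most recent data.
print(train_dates.max() < test_dates.min())
```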

To sample randomly without replacement, you could use the following code. It will take a 70% train sample and a 30% test sample:

In [None]:
'''
Scikit learn has a built-in function for splitting data into training 
and testing datasets. Here we specify the X array, y array and train_size.
Setting a random_state makes our results reproducible.
'''
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X,y,train_size=0.7,random_state=0)
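
If you want to convince yourself that the split behaves as described, a quick check on toy data can help. The sketch below uses a synthetic array whose single feature is just the row number (an assumption made so each row is identifiable after shuffling), then checks the 70/30 sizes, the absence of overlap, and the effect of random_state.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows whose single feature equals the row number.
X_demo = np.arange(100).reshape(-1, 1)
y_demo = np.arange(100)

X_a, X_b, y_a, y_b = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0)

# 70% of the rows land in train, the remaining 30% in test.
print(X_a.shape, X_b.shape)

# No row appears in both sets (sampling is without replacement).
overlap = set(X_a.ravel()) & set(X_b.ravel())
print(len(overlap))

# Repeating the call with the same random_state gives the same split.
X_a2, X_b2, _, _ = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0)
print(np.array_equal(X_a, X_a2))
```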

In [None]:
# Train our model on the training data
from sklearn.linear_model import LinearRegression
reg = LinearRegression().fit(X_train, y_train)

## Predict Values

The ‘predict’ function in scikit-learn allows you to use the fitted linear regression model to predict values (in this case, the grocery bill). You can choose the dataset on which you would like to predict. The resulting array will contain one predicted value per row. The code looks like this:

In [None]:
#I have suppressed the output to the first 5 numbers only
#For the whole array, remove the appended '[0:5]'
print(reg.predict(X_test)[0:5])
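
To see how predictions line up with actual values, it can help to look at a tiny case where the answer is known. The sketch below is not the grocery model: it fits a separate, hypothetical `demo_reg` on four made-up points that follow y = 2x + 1 exactly, then compares predictions to the true values.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in data with an exact linear relationship y = 2x + 1.
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = np.array([1.0, 3.0, 5.0, 7.0])
demo_reg = LinearRegression().fit(X_demo, y_demo)

# predict() expects a 2-D array with one row per observation and
# returns a 1-D array with one prediction per row.
X_new = np.array([[4.0], [5.0]])
y_new = np.array([9.0, 11.0])
y_pred = demo_reg.predict(X_new)

# Residual = actual - predicted; close to zero here because the
# made-up data is exactly linear.
residuals = y_new - y_pred
print(np.round(y_pred, 6))
print(np.round(residuals, 6))
```

On real data the residuals will not be zero, and their size is what the statistics in the next section summarize.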

## Calculate Statistics

For test & train comparisons, you will now have to calculate some of the statistics we use to validate model accuracy: the $R^2$ for test data, RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error). To calculate these we will need to import some additional functions from sklearn.

In [None]:
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_error
from numpy import sqrt
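
For reference, the three statistics can be written as follows. The notation is chosen here for illustration: $y_i$ is an actual value, $\hat{y}_i$ the corresponding prediction, $\bar{y}$ the mean of the actual values, and $n$ the number of observations in the data set being scored.

$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2}$$

$$\text{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y}_i)^2}$$

$$\text{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| y_i - \hat{y}_i \right|$$

RMSE and MAE are in the same units as the grocery bill, while $R^2$ is unitless.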

### Train Data

In [None]:
print("R^2:",reg.score(X_train,y_train))
print("Root Mean Squared Error:",sqrt(
    mean_squared_error(y_train,reg.predict(X_train))))
print("Mean Absolute Error:",mean_absolute_error(
    y_train,reg.predict(X_train)))

### Test Data 

In [None]:
print("R^2:",reg.score(X_test,y_test))
print("Root Mean Squared Error:",sqrt(
    mean_squared_error(y_test,reg.predict(X_test))))
print("Mean Absolute Error:",mean_absolute_error(
    y_test,reg.predict(X_test)))
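
Because the train and test cells repeat the same three calculations, one option is to wrap them in a small helper. The sketch below is a suggestion only: `fit_metrics` is a hypothetical function name, and it is demonstrated on synthetic data with a separate `demo_reg` model rather than on the grocery data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error
from sklearn.model_selection import train_test_split


def fit_metrics(model, X_part, y_part):
    """Return R^2, RMSE and MAE of a fitted model on one data set."""
    pred = model.predict(X_part)
    return {
        "R^2": float(model.score(X_part, y_part)),
        "RMSE": float(np.sqrt(mean_squared_error(y_part, pred))),
        "MAE": float(mean_absolute_error(y_part, pred)),
    }


# Synthetic stand-in data: a linear signal plus noise.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
y_demo = (3.0 * X_demo[:, 0] - 2.0 * X_demo[:, 1]
          + rng.normal(scale=0.5, size=200))

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=0)
demo_reg = LinearRegression().fit(X_tr, y_tr)

# Print the same three statistics for each split.
for name, X_part, y_part in [("train", X_tr, y_tr), ("test", X_te, y_te)]:
    print(name, fit_metrics(demo_reg, X_part, y_part))
```

With the notebook's own variables, the equivalent calls would be fit_metrics(reg, X_train, y_train) and fit_metrics(reg, X_test, y_test).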

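We can compare the $R^2$ and RMSE values for the train and test sets directly. RMSE squares each error before averaging, so it penalizes large errors more heavily than MAE does. That makes it the more informative measure of fit when large misses are especially costly. Usually, we expect our model to perform somewhat worse on the test set, because it has never seen that data. In this case, model performance on the two sets is very similar, which is a good thing: it suggests the model is not overfitting.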
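
The difference between RMSE and MAE is easiest to see on a small made-up example. The two error arrays below are hypothetical: both have the same total absolute error, but one spreads it evenly and the other concentrates it in a single large miss.

```python
import numpy as np

# Two hypothetical sets of prediction errors with the same total absolute error.
even_errors = np.array([2.0, 2.0, 2.0, 2.0])    # four moderate misses
spiky_errors = np.array([0.0, 0.0, 0.0, 8.0])   # one large miss

for name, err in [("even", even_errors), ("spiky", spiky_errors)]:
    mae = np.mean(np.abs(err))          # average size of the errors
    rmse = np.sqrt(np.mean(err ** 2))   # squaring weights large errors more
    print(name, "MAE:", mae, "RMSE:", rmse)
```

Both sets have an MAE of 2, but the RMSE rises from 2 to 4 when the error is concentrated in one observation.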