Project 1: Sales Prediction Based on Advertising Cost

We first import the necessary modules and load the advertising data from a local CSV file.

In [None]:
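import pandas as pd

# Forward slashes avoid the invalid '\A' escape in the original path
# and work on Windows, macOS, and Linux alike.
df = pd.read_csv('../data/Advertising.csv')

# Features: advertising cost per channel; target: sales
X = df[['TV', 'radio', 'newspaper']].values
y = df['sales'].values
df.info()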

The data is then split into a training set and a testing set. About two-thirds of the data is used for training and the remaining third (test_size = 0.33) is held out for testing.

In [None]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X,y, 
                                    test_size = 0.33, random_state=1)
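
As a quick sanity check on the split proportions, the sketch below applies the same call to a synthetic array. The 200-row size, the random values, and the `*_demo` names are assumptions for illustration only, not the actual Advertising data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real data: 200 rows, 3 features
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = rng.normal(size=200)

Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.33, random_state=1)

# Every row ends up in exactly one of the two sets
n_train, n_test = len(Xd_train), len(Xd_test)
print(n_train, n_test, round(n_test / (n_train + n_test), 2))
```

Fixing random_state keeps the split reproducible between runs, which matters when comparing models on the same held-out rows.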

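A grid search with 5-fold cross-validation is used to select the hyperparameter $k$, the order of the polynomial regression model. The preprocessing steps and the regressor are wrapped in a Pipeline so that the scaler and the polynomial features are fitted only on the training folds, which avoids data leakage into the validation folds.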

In [None]:
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline


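# degree=2 is only a placeholder; the grid search overrides it
# with each value listed under "poly__degree".
steps = [('scaler', StandardScaler()),
         ('poly', PolynomialFeatures(degree=2,
                                     include_bias=False)),
         ('liReg', LinearRegression())]
parameters = {"poly__degree": [2, 3, 4, 5, 7, 9]}
pipeline = Pipeline(steps)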


poly_grid = GridSearchCV(pipeline, parameters,
                         cv=5,
                         scoring='neg_mean_squared_error',
                         verbose=1)

poly_grid.fit(X_train, y_train)
print('Best order is:', poly_grid.best_params_)
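
To see how all candidate orders compare, rather than only the winner, a fitted grid search exposes per-candidate scores in `cv_results_`. The sketch below is self-contained and runs on synthetic data with a smaller grid to keep it fast. The data and the names `X_demo`, `y_demo`, `demo_pipeline`, `demo_grid`, and `cv_table` are hypothetical. Because the scoring is `neg_mean_squared_error`, the sign is flipped to recover the mean cross-validated MSE.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic data (an assumption): a linear term plus one interaction term
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 10, size=(150, 3))
y_demo = (2.0 + 0.5 * X_demo[:, 0] + 0.3 * X_demo[:, 0] * X_demo[:, 1]
          + rng.normal(scale=0.5, size=150))

demo_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('poly', PolynomialFeatures(include_bias=False)),
    ('liReg', LinearRegression()),
])
demo_grid = GridSearchCV(demo_pipeline, {'poly__degree': [1, 2, 3, 4]},
                         cv=5, scoring='neg_mean_squared_error')
demo_grid.fit(X_demo, y_demo)

# One row per candidate degree, with the sign flipped back to plain MSE
cv_table = pd.DataFrame({
    'degree': [p['poly__degree'] for p in demo_grid.cv_results_['params']],
    'cv_mse': -demo_grid.cv_results_['mean_test_score'],
    'rank': demo_grid.cv_results_['rank_test_score'],
})
print(cv_table)
print('selected degree:', demo_grid.best_params_['poly__degree'])
```

Applied to the notebook's own search, the same table could be built from `poly_grid.cv_results_`. A sharp rise in `cv_mse` at high orders would suggest overfitting.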


The best model, which the grid search refits on the full training set, is then evaluated on both the training and testing data sets.

In [None]:
from sklearn.metrics import mean_absolute_error
from sklearn.metrics import mean_squared_error
from sklearn.metrics import r2_score

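# Evaluation on the testing data set
ytest_pred = poly_grid.predict(X_test)
mae = mean_absolute_error(y_test, ytest_pred)
# squared=True was the default (plain MSE); the argument is removed in
# recent scikit-learn releases, so it is omitted here.
mse = mean_squared_error(y_test, ytest_pred)
r2 = r2_score(y_test, ytest_pred)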


# Evaluation on the training data set
ytr_pred = poly_grid.predict(X_train)
maeT = mean_absolute_error(y_train, ytr_pred)
# Plain MSE is the default, so the removed squared argument is not needed
mseT = mean_squared_error(y_train, ytr_pred)
r2T = r2_score(y_train, ytr_pred)

# Collect all results in a table
result = pd.DataFrame({'mae': [maeT, mae], 
                        'mse': [mseT, mse], 
                        'r2': [r2T, r2]})
result.index = ['Training', 'Testing']
result
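
MSE is expressed in squared units of the target, so its square root (RMSE) is often easier to read next to MAE. The sketch below is not part of the original analysis. It uses made-up values and hypothetical names (`y_true`, `y_hat`, `*_demo`) to show the relationship between the three numbers.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up values, not results from the Advertising data
y_true = np.array([3.0, 5.0, 7.0])
y_hat = np.array([2.0, 5.0, 9.0])

mae_demo = mean_absolute_error(y_true, y_hat)   # (1 + 0 + 2) / 3
mse_demo = mean_squared_error(y_true, y_hat)    # (1 + 0 + 4) / 3
rmse_demo = np.sqrt(mse_demo)                   # same units as the target
print(mae_demo, mse_demo, rmse_demo)
```

In the notebook, `np.sqrt(mse)` and `np.sqrt(mseT)` would give the corresponding testing and training RMSE. A training error far below the testing error would point to overfitting at the selected order.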