# Measuring the performance of machine learning models

In this notebook we look at how to evaluate machine learning models in general.

We first load some modules that are, by now, standard:

In [1]:
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns

import statsmodels.api as sm
from sklearn import linear_model
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error, root_mean_squared_error

## Train-test split

We will first look at splitting the data into training and test sets. For this, we first need to import a scikit-learn function:

In [2]:
from sklearn.model_selection import train_test_split

For this example we will be using the Ames housing data again, which we load in:

In [25]:
ames = pd.read_csv("../Notebooks and data-7/AmesHousing.csv")

Before we split into train and test sets, we first have to split into X and y data:

In [5]:
X_ames = ames[["Lot Area", "Overall Cond", "Year Built", "Gr Liv Area", "TotRms AbvGrd"]]
X_ames = X_ames.join(pd.get_dummies(ames["Bldg Type"], drop_first=True, dtype = "int"))
X_ames.head()

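|   | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | 2fmCon | Duplex | Twnhs | TwnhsE |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 31770 | 5 | 1960 | 1656 | 7 | 0 | 0 | 0 | 0 |
| 1 | 11622 | 6 | 1961 | 896 | 5 | 0 | 0 | 0 | 0 |
| 2 | 14267 | 6 | 1958 | 1329 | 6 | 0 | 0 | 0 | 0 |
| 3 | 11160 | 5 | 1968 | 2110 | 8 | 0 | 0 | 0 | 0 |
| 4 | 13830 | 5 | 1997 | 1629 | 6 | 0 | 0 | 0 | 0 |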


In [6]:
y = ames["SalePrice"]
y.head()

0    215000
1    105000
2    172000
3    244000
4    189900
Name: SalePrice, dtype: int64

We can now split the data with a call to the function `train_test_split`, which has the following inputs and outputs:

Input:
* X: our entire X dataset
* y: our entire y dataset
* test_size: The fraction of the data put in the test dataset, given as a number between 0 and 1. How much we allocate for testing depends on the problem, but 30% (`test_size=0.3`) is a fairly common choice if there is enough data
* random_state: An integer used to seed the random number generator, so that we can reproduce the same split later if we want to

Output:
* X_train: Features of the training dataset
* X_test: Features of the test dataset
* y_train: Response variable of the training dataset
* y_test: Response variable (**ground truth**) of the test dataset
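To see what `test_size` and `random_state` do in practice, here is a minimal sketch on a small made-up dataset. The names `X_toy` and `y_toy` are hypothetical stand-ins and are not part of the Ames data.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 10 rows, two features and one response
X_toy = pd.DataFrame({"x1": np.arange(10), "x2": np.arange(10) * 2})
y_toy = pd.Series(np.arange(10) * 3, name="target")

# Two calls with the same random_state should give the same split
split_a = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)
split_b = train_test_split(X_toy, y_toy, test_size=0.3, random_state=123)
same_split = split_a[1].index.equals(split_b[1].index)

# test_size=0.3 sends 30% of the rows to the test set
n_train, n_test = len(split_a[0]), len(split_a[1])
print("Same split:", same_split)
print("Train rows:", n_train, "Test rows:", n_test)
```

Changing `random_state` to another integer would typically give a different, but equally valid, split.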

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X_ames, y, test_size=0.3, random_state=123)

Note how this function returns four objects: two DataFrames (X_train and X_test) and two Series (y_train and y_test). Note also how the indexes of X_train and y_train match (this is of course essential for supervised learning!), and the same goes for X_test and y_test.

In [8]:
X_train.head()

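|   | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | 2fmCon | Duplex | Twnhs | TwnhsE |
|---|---|---|---|---|---|---|---|---|---|
| 2278 | 43500 | 5 | 1953 | 2034 | 9 | 0 | 0 | 0 | 0 |
| 1379 | 7162 | 7 | 1966 | 904 | 6 | 0 | 0 | 0 | 0 |
| 2182 | 3675 | 5 | 2005 | 1072 | 5 | 0 | 0 | 0 | 1 |
| 1436 | 8998 | 5 | 2000 | 1652 | 6 | 0 | 0 | 0 | 0 |
| 1599 | 1477 | 9 | 1970 | 1092 | 6 | 0 | 0 | 1 | 0 |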


In [9]:
y_train.head()

2278    130000
1379    109900
2182    140000
1436    207500
1599     98000
Name: SalePrice, dtype: int64

In [10]:
X_test.head()

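|   | Lot Area | Overall Cond | Year Built | Gr Liv Area | TotRms AbvGrd | 2fmCon | Duplex | Twnhs | TwnhsE |
|---|---|---|---|---|---|---|---|---|---|
| 655 | 10410 | 7 | 1916 | 1656 | 8 | 0 | 0 | 0 | 0 |
| 645 | 10434 | 5 | 1955 | 1005 | 5 | 0 | 0 | 0 | 0 |
| 80 | 9672 | 5 | 1984 | 1097 | 6 | 0 | 0 | 0 | 0 |
| 1927 | 12155 | 3 | 1970 | 1657 | 7 | 0 | 0 | 0 | 0 |
| 2030 | 7758 | 4 | 1931 | 1818 | 7 | 0 | 0 | 0 | 0 |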


In [11]:
y_test.head()

655     135000
645     115000
80      152000
1927    163500
2030    169500
Name: SalePrice, dtype: int64
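Instead of comparing the first few rows by eye, the index alignment can also be checked programmatically. The sketch below does this on made-up data (`X_toy` and `y_toy` are hypothetical names), and the same checks should carry over to `X_train`, `y_train`, `X_test` and `y_test`.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data standing in for X_ames and y
X_toy = pd.DataFrame({"x1": np.arange(20), "x2": np.arange(20) ** 2})
y_toy = pd.Series(np.arange(20) * 10, name="target")

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.3, random_state=123
)

# Features and response must carry the same row labels, in the same order
aligned_train = X_tr.index.equals(y_tr.index)
aligned_test = X_te.index.equals(y_te.index)

# No row should end up in both the training and the test set
no_overlap = X_tr.index.intersection(X_te.index).empty

print("Train aligned:", aligned_train)
print("Test aligned:", aligned_test)
print("No overlap:", no_overlap)
```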

Now let us fit a multiple linear regression model on the training set only:

In [12]:
mlr_model = linear_model.LinearRegression()

In [13]:
mlr_model.fit(X_train, y_train)

We can now calculate our evaluation metrics on the training set:

In [14]:
y_pred_train = mlr_model.predict(X_train)

In [15]:
r2_score(y_train, y_pred_train)

0.6892984466795837

In [16]:
mean_absolute_error(y_train, y_pred_train)

29199.467780611936

In [17]:
root_mean_squared_error(y_train, y_pred_train)

45075.67260893577
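As a reminder of what these three numbers measure, the sketch below recomputes them by hand with NumPy and compares the results against scikit-learn. The prices are made up for illustration and are not taken from the Ames data. For the RMSE comparison it uses `np.sqrt(mean_squared_error(...))`, which should be equivalent to `root_mean_squared_error`.

```python
import numpy as np
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

# Made-up true and predicted prices (illustration only)
y_true = np.array([200000.0, 150000.0, 180000.0, 120000.0])
y_hat = np.array([190000.0, 160000.0, 175000.0, 130000.0])

residuals = y_true - y_hat

# MAE: average absolute error, in the units of y
mae_manual = np.mean(np.abs(residuals))

# RMSE: square root of the average squared error, also in the units of y
rmse_manual = np.sqrt(np.mean(residuals ** 2))

# R^2: one minus residual sum of squares over total sum of squares
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot

print(np.isclose(mae_manual, mean_absolute_error(y_true, y_hat)))
print(np.isclose(rmse_manual, np.sqrt(mean_squared_error(y_true, y_hat))))
print(np.isclose(r2_manual, r2_score(y_true, y_hat)))
```

Because RMSE squares the errors before averaging, it penalizes large misses more heavily than MAE, which is why the RMSE above is noticeably larger than the MAE.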

We can now also calculate our evaluation metrics on the test set to get a better estimate of how well our model generalizes, that is, how well it predicts on new, unseen cases:

In [18]:
y_pred_test = mlr_model.predict(X_test)

In [19]:
r2_score(y_test, y_pred_test)

0.7374809409227066

In [20]:
mean_absolute_error(y_test, y_pred_test)

27209.62611307495

In [21]:
root_mean_squared_error(y_test, y_pred_test)

39659.45423005805

Here our model performs better on the test set than on the training set, which is unusual; it is usually the other way around. With a single random split this can happen by chance, since the test set may simply contain houses that are easier to predict. Either way, our errors on the test set are no bigger than our errors on the training set (they are actually smaller!), so there is no sign of overfitting. That is not too surprising: a linear regression model with only nine features and a few thousand observations has little capacity to overfit.
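How much the gap between training and test performance depends on the particular split can be explored by repeating the split with different seeds. The sketch below does this on simulated data (the arrays `X_sim` and `y_sim` and the coefficients are assumptions made up for illustration, not the Ames data), fitting a fresh linear regression for each split.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Simulated data: 300 rows, 3 features, linear signal plus noise
rng = np.random.default_rng(0)
X_sim = rng.normal(size=(300, 3))
y_sim = X_sim @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=2.0, size=300)

gaps = []
for seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_sim, y_sim, test_size=0.3, random_state=seed
    )
    model = LinearRegression().fit(X_tr, y_tr)
    r2_train = r2_score(y_tr, model.predict(X_tr))
    r2_test = r2_score(y_te, model.predict(X_te))
    # Positive gap: the model scored better on the test set for this split
    gaps.append(r2_test - r2_train)

gaps = np.array(gaps)
print("Smallest gap:", gaps.min())
print("Largest gap:", gaps.max())
print("Splits where test R2 > train R2:", int((gaps > 0).sum()))
```

If the gap changes sign from seed to seed, a single split on its own says little about whether the test score is "really" higher or lower than the training score.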