Training And Test Sets

We have seen previously how to fit a model to a dataset. In this exercise, we'll be looking at how to check and confirm the validity and performance of our models by using training and testing sets. As usual, we begin by loading in and having a look at our data:

In [2]:
import pandas

data = pandas.read_csv("dog-training.csv", delimiter="\t")

print(data.shape)
data.head()

(50, 5)


Unnamed: 0,month_old_when_trained,mean_rescues_per_year,age_last_year,weight_last_year,rescues_last_year
0,68,21.1,9,14.5,35
1,53,14.9,5,14.0,30
2,41,20.5,6,17.7,34
3,3,19.4,1,13.7,29
4,4,24.9,4,18.4,30


We are interested in the relationship between a dog's weight and the number of rescues it performed in the previous year. Let's begin by plotting rescues_last_year as a function of weight_last_year:

In [3]:
import graphing
import statsmodels.formula.api as smf

# First, we define our formula using a special syntax
# This says that rescues_last_year is explained by weight_last_year
formula = "rescues_last_year ~ weight_last_year"

model = smf.ols(formula = formula, data = data).fit()

graphing.scatter_2D(data, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params.iloc[1] * x + model.params.iloc[0])
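
The trendline in the cell above is a straight line built from the fitted intercept and slope stored in model.params. As a rough illustration of where those two numbers come from, here is a minimal pure-Python sketch of one-feature least squares. The function name fit_line and the four data points are made up for this example; statsmodels does considerably more work internally.

```python
def fit_line(xs, ys):
    """Closed-form least squares for one feature: returns (intercept, slope)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope = covariance(x, y) / variance(x); the 1/n factors cancel.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Made-up points that lie exactly on y = 3 + 2x (not from the dog dataset).
weights = [1.0, 2.0, 3.0, 4.0]
rescues = [5.0, 7.0, 9.0, 11.0]

intercept, slope = fit_line(weights, rescues)
trendline = lambda x: slope * x + intercept

print(intercept, slope)  # 3.0 2.0
print(trendline(5.0))    # 13.0
```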


Train/test split

This time, instead of fitting a model to the entirety of our dataset, we're going to separate our dataset into two smaller partitions: a training set and a test set.
The training set is the larger of the two, usually made up of between 70% and 80% of the overall dataset, with the rest of the dataset making up the test set. By splitting our data, we're able to gauge the performance of our model when confronted with previously unseen data.

Notice that data in the test set is never used in training. For that reason, it's commonly referred to as unseen data, or data that is unknown to the model.

In [6]:
from sklearn.model_selection import train_test_split


# Obtain the label and feature from the original data
dataset = data[['rescues_last_year','weight_last_year']]

# Split the dataset in a 70/30 train/test ratio. Each row keeps its index label from the original dataset.
train, test = train_test_split(dataset, train_size=0.7, random_state=21)

print("Train")
print(train.head())
print(train.shape)

print("Test")
print(test.head())
print(test.shape)

Train
    rescues_last_year  weight_last_year
33                 30              19.4
0                  35              14.5
13                 36              19.5
28                 31              16.1
49                 37              23.0
(35, 2)
Test
    rescues_last_year  weight_last_year
7                  37              17.1
44                 25              15.4
43                 26              20.0
25                 32              22.2
14                 32              18.3
(15, 2)


We can see that these sets are different, and that the training set and test set contain 70% and 30% of the overall data (35 and 15 of the 50 rows), respectively.
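
One way to picture a 70/30 split is as shuffling the row positions and cutting the shuffled list in two, so that no row can land in both sets. The sketch below shows that idea using only the standard library. The helper name simple_train_test_split and the 50 placeholder rows are assumptions for illustration; this is not scikit-learn's implementation and will not reproduce the exact rows chosen above.

```python
import random


def simple_train_test_split(rows, train_size=0.7, seed=21):
    """Shuffle row positions, then cut the shuffled list at train_size."""
    positions = list(range(len(rows)))
    random.Random(seed).shuffle(positions)
    cut = round(len(rows) * train_size)  # number of training rows
    train = [rows[i] for i in positions[:cut]]
    test = [rows[i] for i in positions[cut:]]
    return train, test


# Hypothetical stand-in for 50 dataset rows: (row_id, value) pairs.
rows = [(i, i * 0.5) for i in range(50)]
train_rows, test_rows = simple_train_test_split(rows)

train_ids = {row_id for row_id, _ in train_rows}
test_ids = {row_id for row_id, _ in test_rows}

print(len(train_rows), len(test_rows))  # 35 15
print(train_ids.isdisjoint(test_ids))   # True: no row is in both sets
```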

Let's have a look at how the training set and test set are separated out:

In [8]:
# You don't need to understand this code well
# It's just used to create a scatter plot

# concatenate training and test so they can be graphed
plot_set = pandas.concat([train,test])
plot_set["Dataset"] = ["train"] * len(train) + ["test"] * len(test)

print(plot_set)

# Create graph
graphing.scatter_2D(plot_set, "weight_last_year", "rescues_last_year", "Dataset", trendline = lambda x: model.params.iloc[1] * x + model.params.iloc[0])

    rescues_last_year  weight_last_year Dataset
33                 30              19.4   train
0                  35              14.5   train
13                 36              19.5   train
28                 31              16.1   train
49                 37              23.0   train
40                 27               7.2   train
45                 29              16.9   train
48                 29              23.5   train
46                 29              21.0   train
29                 36              17.2   train
37                 30              12.6   train
1                  30              14.0   train
27                 38              16.5   train
24                 20              11.4   train
32                 39              21.2   train
3                  29              13.7   train
26                 31              14.6   train
16                 38              22.7   train
42                 30              19.5   train
6                  26              13.1 

Training Set

We begin by training our model using the training set, and testing its performance with the same training set:


In [9]:
import statsmodels.formula.api as smf
from sklearn.metrics import mean_squared_error as mse

# First, we define our formula using a special syntax
# This says that rescues_last_year is explained by weight_last_year
formula = "rescues_last_year ~ weight_last_year"

# Create and train the model
model = smf.ols(formula = formula, data = train).fit()

# Graph the result against the data
graphing.scatter_2D(train, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params.iloc[1] * x + model.params.iloc[0])

We can gauge our model's performance by calculating the mean squared error (MSE).

In [10]:
correct_labels = train['rescues_last_year']
predicted = model.predict(train['weight_last_year'])

MSE = mse(correct_labels, predicted)
print('MSE = %f ' % MSE)

MSE = 18.674546 
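
For reference, the mean squared error is just the average of the squared differences between each true label and its prediction. A minimal hand-rolled version, using made-up numbers rather than the dog dataset, might look like this:

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between labels and predictions."""
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")
    squared_errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(squared_errors) / len(squared_errors)


# Made-up labels and predictions, for illustration only.
actual = [30, 35, 36, 31]
predicted = [32.0, 33.0, 35.0, 31.0]

print(mean_squared_error(actual, predicted))  # (4 + 4 + 1 + 0) / 4 = 2.25
```

Because the errors are squared, a few large misses raise the MSE much more than many small ones.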


Test Set

Next, we test the same model's performance using the test set:

In [14]:
graphing.scatter_2D(test, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params.iloc[1] * x + model.params.iloc[0])



In [13]:
# Let's have a look at the MSE again.

correct_labels = test['rescues_last_year']
predicted = model.predict(test['weight_last_year'])

MSE = mse(correct_labels, predicted)
print('MSE = %f ' % MSE)

MSE = 24.352949 


We can see that the model performs much better on the known training data than on the unseen test data (remember that higher MSE values are worse).

This can be down to a number of factors, but first and foremost is overfitting, which is when a model matches the data in the training set too closely. This means that it will perform very well on the training set, but will not generalize well (i.e., work well with other datasets).
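
To see overfitting in its most extreme form, consider a "model" that simply memorizes the training set and answers with the label of the closest training point. This is a deliberately different technique from the linear regression used in this exercise, and the data below is synthetic; the sketch is only meant to show how a perfect training score can coexist with a worse score on unseen data.

```python
import random


def mse(actual, predicted):
    """Mean squared error between two equal-length sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def memorizer_predict(train_x, train_y, x):
    """Return the label of the closest training point (pure memorization)."""
    nearest = min(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    return train_y[nearest]


rng = random.Random(0)
# Synthetic data: y = 2x + noise. Values are invented for illustration.
xs = [rng.uniform(0, 10) for _ in range(40)]
ys = [2 * x + rng.gauss(0, 3) for x in xs]
train_x, train_y = xs[:30], ys[:30]
test_x, test_y = xs[30:], ys[30:]

train_pred = [memorizer_predict(train_x, train_y, x) for x in train_x]
test_pred = [memorizer_predict(train_x, train_y, x) for x in test_x]

train_mse = mse(train_y, train_pred)
test_mse = mse(test_y, test_pred)

# Each training point is its own nearest neighbour, so the training
# error is zero as long as the training x values are distinct.
print("train MSE:", train_mse)
print("test MSE:", test_mse)
```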

New Dataset

To illustrate our point further, let's have a look at how our model performs when confronted with a completely new, unseen, and larger dataset. For our scenario, we'll use data provided by the avalanche rescue charity's European branch.

In [17]:
!wget https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/dog-training-switzerland.csv

# Load an alternative dataset from the charity's European branch
new_data = pandas.read_csv("dog-training-switzerland.csv", delimiter="\t")

print(new_data.shape)
new_data.head()


--2023-03-13 16:04:28--  https://raw.githubusercontent.com/MicrosoftDocs/mslearn-introduction-to-machine-learning/main/Data/dog-training-switzerland.csv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 12362 (12K) [text/plain]
Saving to: ‘dog-training-switzerland.csv’


2023-03-13 16:04:28 (4.69 MB/s) - ‘dog-training-switzerland.csv’ saved [12362/12362]

(500, 5)


Unnamed: 0,month_old_when_trained,mean_rescues_per_year,age_last_year,weight_last_year,rescues_last_year
0,9,16.7,2,15.709342,30
1,33,24.2,8,14.760819,35
2,43,20.2,4,13.118374,19
3,37,19.2,5,10.614075,24
4,45,16.9,8,17.51989,28


In [18]:
# Plot the fitted model against this new dataset. 

graphing.scatter_2D(new_data, "weight_last_year", "rescues_last_year", trendline = lambda x: model.params.iloc[1] * x + model.params.iloc[0])

In [19]:
correct_labels = new_data['rescues_last_year']
predicted = model.predict(new_data['weight_last_year'])

MSE = mse(correct_labels, predicted)
print('MSE = %f ' % MSE)

# MSE (mean squared error)

MSE = 20.406905 


As expected, the model performs better on the training dataset than it does on the unseen dataset. This is consistent with the overfitting we saw previously.

Interestingly, the model performs better on this unseen dataset than it does on the test set. This is because our previous test set was quite small, and thus not a very good representation of 'real world' data. By contrast, this unseen dataset is large and a much better representation of the data we'll find outside of the lab. In essence, this shows us that part of the performance difference we see between training and test is due to model overfitting, and part is due to the test set not being perfect. In the next exercises, we'll explore the trade-off we have to make between training and test dataset sizes.
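
To get a feel for how much a test score can depend on which rows happen to land in the test set, the rough sketch below repeats a split many times on synthetic data and records the test MSE each time, once with a small test set and once with a larger one. Everything here (the data, the helper names fit_line and holdout_mse, the test sizes of 15 and 150) is an assumption chosen for illustration, not part of the exercise's dataset or libraries.

```python
import random
import statistics


def fit_line(xs, ys):
    """Closed-form least squares for one feature: returns (intercept, slope)."""
    mean_x = statistics.fmean(xs)
    mean_y = statistics.fmean(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return mean_y - slope * mean_x, slope


def holdout_mse(points, test_count, seed):
    """Shuffle, hold out test_count points, fit on the rest, score the held-out part."""
    shuffled = list(points)
    random.Random(seed).shuffle(shuffled)
    held_out, train = shuffled[:test_count], shuffled[test_count:]
    intercept, slope = fit_line([x for x, _ in train], [y for _, y in train])
    errors = [(y - (intercept + slope * x)) ** 2 for x, y in held_out]
    return statistics.fmean(errors)


# Synthetic data (invented): 500 points scattered around a straight line.
rng = random.Random(1)
points = []
for _ in range(500):
    x = rng.uniform(5, 25)
    points.append((x, 20 + 0.5 * x + rng.gauss(0, 4)))

spread = {}
for test_count in (15, 150):
    # Same data, 20 different shuffles: only the split changes.
    scores = [holdout_mse(points, test_count, seed) for seed in range(20)]
    spread[test_count] = scores
    print(test_count, round(min(scores), 2), round(max(scores), 2))
```

Comparing the minimum and maximum printed for each test size gives a quick sense of how stable the estimate is. A wide range means a single split's MSE should be read with caution.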