<a href="https://colab.research.google.com/github/axel-sirota/model_training_best_practices/blob/master/module2/ModelTraining_Mod2Demo2_Train_test_split.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Train Test Split

© Data Trainers LLC. GPL v 3.0.

Author: Axel Sirota

In [1]:
import numpy as np
import pandas as pd
from sklearn import linear_model, model_selection, metrics

import warnings
warnings.filterwarnings('ignore')

First let's download the dataset

In [2]:
%%writefile get_data.sh
mkdir -p data
if [ ! -f data/glass.csv ]; then
  wget -O data/glass.csv https://www.dropbox.com/scl/fi/dv522a61am4dsc3vkfp4p/glass.csv?rlkey=6l9v685sw98plzj2myvtjpes6&dl=0
fi

Overwriting get_data.sh


In [3]:
!bash get_data.sh

In [4]:
glass = pd.read_csv('data/glass.csv')
glass.columns = ['ri','na','mg','al','si','k','ca','ba','fe','glass_type']
glass.head(3)

FileNotFoundError: [Errno 2] No such file or directory: 'data/glass.csv'

**Pretend we want to predict `ri`, how can we know the best set of features? How could we do it using machine learning?**


**Answer:** We use *Train/Test/Validation splits* to train the model with the **Train set**, select the best model on a given metric with the **Validation set**, and finally get a measurement of the model performance on the **Test set**.


We will use a different regression algorithm, Ridge Regression, that has a *hyperparameter* to showcase the power of train/validation/test splits.

In [None]:
# Do the split
from sklearn.model_selection import train_test_split

glass_pretrain, glass_test = train_test_split(glass, test_size = 0.2, random_state = 42)

In [None]:
glass_train, glass_validation = train_test_split(glass_pretrain, test_size = 0.25, random_state = 42)

Notice we do 2 splits to end up with a 60/20/20 proportion!

In [None]:
from sklearn.linear_model import Ridge


# First we create a function that on a given feature set it will train the model and return the model performance
def get_model_performance(feature_cols, alpha):

  # Basic start
  X = glass_train[feature_cols]
  y = glass_train.ri
  model = Ridge(alpha=alpha)

  # Train

  model.fit(X, y)

  # Predict

  y_pred = model.predict(glass_validation[feature_cols])
  y_true = glass_validation.ri

  return np.sqrt(metrics.mean_squared_error(y_pred, y_true))

Let's test this

In [None]:
get_model_performance(['al'], 1)

0.0021685024069272997

It works! Let's test a bunch of feature combinations and alphas

In [None]:
results = {}
for alpha in np.linspace(0.5, 2, 10):
  for feature_cols in [['na','mg'], ['na','mg', 'al'], ['al','si','k']]:
    results[(alpha, tuple(feature_cols))] = get_model_performance(feature_cols, alpha=alpha)

In [None]:
results

{(0.5, ('na', 'mg')): 0.002582784313827305,
 (0.5, ('na', 'mg', 'al')): 0.0022065951256895395,
 (0.5, ('al', 'si', 'k')): 0.001985998574703533,
 (0.6666666666666666, ('na', 'mg')): 0.002582743449127876,
 (0.6666666666666666, ('na', 'mg', 'al')): 0.0022046074240648595,
 (0.6666666666666666, ('al', 'si', 'k')): 0.00198523926415055,
 (0.8333333333333333, ('na', 'mg')): 0.002582702558272736,
 (0.8333333333333333, ('na', 'mg', 'al')): 0.0022027111115613565,
 (0.8333333333333333, ('al', 'si', 'k')): 0.001984506228521506,
 (1.0, ('na', 'mg')): 0.0025826616413630946,
 (1.0, ('na', 'mg', 'al')): 0.0022009031173304814,
 (1.0, ('al', 'si', 'k')): 0.0019837987913295225,
 (1.1666666666666665, ('na', 'mg')): 0.002582620698507875,
 (1.1666666666666665, ('na', 'mg', 'al')): 0.0021991804783616595,
 (1.1666666666666665, ('al', 'si', 'k')): 0.0019831162978338414,
 (1.3333333333333333, ('na', 'mg')): 0.0025825797298233205,
 (1.3333333333333333, ('na', 'mg', 'al')): 0.002197540335399412,
 (1.33333333333333

Let's select the best

In [None]:
min(results, key=results.get)

(2.0, ('al', 'si', 'k'))

In [None]:
results[min(results, key=results.get)]

0.0019800564870258238

Now let's evaluate the true model performance

In [None]:
feature_cols = ['al', 'si', 'k']
X = glass_train[feature_cols]
y = glass_train.ri

model = Ridge(alpha=2)

model.fit(X, y)

y_pred = model.predict(glass_test[feature_cols])
y_true = glass_test.ri

np.sqrt(metrics.mean_squared_error(y_pred, y_true))

0.0022154513457950113

Notice how the value is a little worse, but that is OK and expected! This is how we use the *train/validation/test* splits to select the best model