# Training Machine Learning Models


## Underfitting and Overfitting

**Underfitting** occurs when an algorithm trained to predict a value does so poorly both on the training data and on future, unseen data.

Reconsider the *Titanic* dataset example below:

In [1]:
import pandas as pd
from pandas import DataFrame

In [2]:
titanic = pd.read_csv("titanic.csv")
titanic.head()

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
0,0,3,Mr. Owen Harris Braund,male,22.0,1,0,7.25
1,1,1,Mrs. John Bradley (Florence Briggs Thayer) Cum...,female,38.0,1,0,71.2833
2,1,3,Miss. Laina Heikkinen,female,26.0,0,0,7.925
3,1,1,Mrs. Jacques Heath (Lily May Peel) Futrelle,female,35.0,1,0,53.1
4,0,3,Mr. William Henry Allen,male,35.0,0,0,8.05


In [3]:
titanic.Survived.value_counts()

0    545
1    342
Name: Survived, dtype: int64

In [4]:
# Predict most common value
if titanic.Survived.value_counts()[0] > titanic.Survived.value_counts()[1]:
    guess = 0
else:
    guess = 1

predicted = pd.Series([guess] * len(titanic))
(titanic.Survived - predicted).abs().sum()    # Error count (trivial here)

342

In [5]:
(titanic.Survived - predicted).abs().mean()     # Error rate

0.3855693348365276

In [6]:
1 - (titanic.Survived - predicted).abs().mean()     # Correct prediction rate

0.6144306651634723

This algorithm ignores every feature of the passengers, so it underfits about as much as any sensible predictor can: it is right only as often as the most common label occurs, about 61% of the time.
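
The same majority-class baseline can be sketched with scikit-learn's `DummyClassifier`. This is only an illustration: it rebuilds the label counts shown above (545 zeros, 342 ones) instead of reading `titanic.csv`, and the feature matrix `X` is a placeholder, since this strategy ignores features.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Labels rebuilt from the value counts above (assumption: only the counts matter here)
y = np.array([0] * 545 + [1] * 342)
X = np.zeros((len(y), 1))    # Placeholder features; "most_frequent" never looks at them

baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X, y)
baseline_predicted = baseline.predict(X)    # Always the most common label
baseline_accuracy = accuracy_score(y_true=y, y_pred=baseline_predicted)
print(baseline_accuracy)
```

Any model worth keeping should beat this number on unseen data.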

**Overfitting** occurs when an algorithm predicts training data well but does not generalize to new data; on new data, the algorithm's error rate increases unacceptably.

Underfitting is usually obvious while training a system, but overfitting takes more care to detect because, by definition, we cannot measure performance on data we have not seen. There are techniques, though, for simulating unseen data.

## Training / Testing Split

The first technique is to split data into a training dataset and a testing dataset. We use the training data for developing our algorithm. We then see how well the algorithm generalizes by applying the trained algorithm to the test data and quantifying the error rate.

`train_test_split()`, from scikit-learn (**sklearn**), makes splitting data easy.

In [7]:
from sklearn.model_selection import train_test_split

In [8]:
titanic_train, titanic_test = train_test_split(titanic,          # Dataset to split (array-like)
                                               test_size=0.1)    # How large test set should be; in this case, 10% of
                                                                 # the whole (could also be an integer for fixed size)

In [9]:
titanic_train

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
517,1,1,Miss. Anne Perreault,female,30.00,0,0,93.5000
66,0,3,Mr. Ernest James Crease,male,19.00,0,0,8.1583
821,0,3,Mr. John Flynn,male,36.00,0,0,6.9500
200,0,3,Mr. Frederick Sage,male,17.00,8,2,69.5500
153,0,3,Mr. Ole Martin Olsen,male,27.00,0,0,7.3125
238,0,2,Mr. George Henry Hunt,male,33.00,0,0,12.2750
227,0,2,Mr. Arne Jonas Fahlstrom,male,18.00,0,0,13.0000
732,0,3,Mrs. Edward (Margaret Ann Watson) Ford,female,48.00,1,3,34.3750
823,1,2,Master. Andre Mallet,male,1.00,0,2,37.0042
746,1,2,Miss. Joan Wells,female,4.00,1,1,23.0000


In [10]:
titanic_train.shape

(798, 8)

In [11]:
titanic_test

Unnamed: 0,Survived,Pclass,Name,Sex,Age,Siblings/Spouses Aboard,Parents/Children Aboard,Fare
188,0,3,Mr. Stjepan Turcin,male,36.00,0,0,7.8958
255,1,1,Mrs. Gertrude Maybelle Thorne,female,38.00,0,0,79.2000
851,1,3,Mrs. Sam (Leah Rosen) Aks,female,18.00,0,1,9.3500
431,0,1,Mr. William Baird Silvey,male,50.00,1,0,55.9000
243,0,3,Mr. Sleiman Attalah,male,30.00,0,0,7.2250
211,0,3,Mr. John Henry Perkin,male,22.00,0,0,7.2500
357,1,3,Miss. Helen Mary Mockler,female,23.00,0,0,7.8792
36,1,3,Mr. Hanna Mamee,male,18.00,0,0,7.2292
764,0,3,Mr. Daniel J Moran,male,28.00,1,0,24.1500
710,0,3,Mr. August Viktor Larsson,male,29.00,0,0,9.4833


In [None]:
titanic_test.shape
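
The split above is random, so rerunning the cell selects different rows each time. A minimal sketch of two optional arguments of `train_test_split()` that help with this, using a made-up toy table (the `toy` data frame and its values are assumptions for illustration, not the Titanic data): `random_state` makes the split repeatable, and `stratify` keeps the label proportions the same in both parts.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data: 60 zeros and 40 ones in the label column
toy = pd.DataFrame({"Survived": [0] * 60 + [1] * 40,
                    "Fare": range(100)})

toy_train, toy_test = train_test_split(toy,
                                       test_size=0.1,               # 10% of rows go to the test set
                                       random_state=0,              # Fixed seed: same split every run
                                       stratify=toy["Survived"])    # Preserve the 60/40 label ratio

print(toy_train.shape, toy_test.shape)
print(toy_test["Survived"].value_counts())
```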

Let's now train a new algorithm. The table-lookup algorithm does the following:

1. *Look up all individuals in the training set with the same passenger class (`Pclass`), sex (`Sex`), siblings/spouses aboard (`Siblings/Spouses Aboard`) and parents/children aboard (`Parents/Children Aboard`).*
2. *Predict the most common value amongst those individuals.*

Below is the code for the algorithm.

In [12]:
def table_lookup_predictor(x, table):
    """Implements the table-lookup algorithm"""

    # Get most common label; idxmax() returns the label itself, whereas
    # argmax() returns a position in recent versions of pandas
    default = table.Survived.value_counts().idxmax()
    # Get similar individuals
    similar_tab = table.loc[(table["Pclass"] == x["Pclass"]) &
                            (table["Sex"] == x["Sex"]) &
                            (table["Siblings/Spouses Aboard"] == x["Siblings/Spouses Aboard"]) &
                            (table["Parents/Children Aboard"] == x["Parents/Children Aboard"]), "Survived"]
    if len(similar_tab) == 0:
        # If table is empty (no "similar" individuals), guess the most common label
        return default
    else:
        return similar_tab.value_counts().idxmax()
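
The function above rescans the whole table for every passenger. As a rough alternative, the lookup table can be built once with `groupby()` and then joined onto the rows to predict. The sketch below is one way to do that under some assumptions: `fit_lookup` and `predict_lookup` are hypothetical helper names, the toy data uses only two of the four key columns to stay short, and ties are broken by taking the smallest label, which may differ from the function above.

```python
import pandas as pd

def fit_lookup(train, keys, label="Survived"):
    """Build a table of the most common label for each key combination"""
    table = train.groupby(keys)[label].agg(lambda s: s.mode().iloc[0])
    default = train[label].mode().iloc[0]    # Fallback: most common label overall
    return table, default

def predict_lookup(rows, table, default, keys):
    """Look up each row's key combination; unseen combinations get the default"""
    lookup = table.rename("pred").reset_index()
    merged = rows[keys].merge(lookup, on=keys, how="left")    # Left join keeps row order
    preds = merged["pred"].fillna(default).astype(int)
    preds.index = rows.index    # Restore the original row labels
    return preds

# Toy data (made up for illustration)
toy_train = pd.DataFrame({"Pclass": [1, 1, 3, 3, 3],
                          "Sex": ["female", "female", "male", "male", "female"],
                          "Survived": [1, 1, 0, 0, 1]})
toy_new = pd.DataFrame({"Pclass": [1, 3, 2],
                        "Sex": ["female", "male", "male"]},
                       index=[10, 11, 12])

keys = ["Pclass", "Sex"]
table, default = fit_lookup(toy_train, keys)
toy_predicted = predict_lookup(toy_new, table, default, keys)
print(toy_predicted)
```

The third toy passenger has a key combination that never appears in the toy training data, so it receives the overall most common label.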

In [13]:
titanic_train.iloc[0,:]

Survived                                      1
Pclass                                        1
Name                       Miss. Anne Perreault
Sex                                      female
Age                                          30
Siblings/Spouses Aboard                       0
Parents/Children Aboard                       0
Fare                                       93.5
Name: 517, dtype: object

In [14]:
# Demonstration 1
table_lookup_predictor(titanic_train.iloc[0,:], titanic_train)    # Perfect!

1

In [15]:
tlu_train_predicted = titanic_train.apply(table_lookup_predictor, axis=1,
                                          table=titanic_train)    # Make predictions on training set
tlu_train_predicted

517    1
66     0
821    0
200    0
153    0
238    0
227    0
732    0
823    1
746    1
221    0
489    0
607    1
94     0
800    0
209    0
479    0
759    1
777    1
747    1
491    0
20     0
447    0
15     1
118    0
365    1
811    0
580    0
386    0
864    0
      ..
544    0
443    1
585    0
354    1
614    1
110    1
862    1
708    0
303    1
88     0
559    0
171    1
344    1
166    0
528    0
727    0
518    0
473    0
16     0
250    1
582    1
420    1
721    0
54     0
465    0
611    0
842    0
335    1
226    0
154    0
Length: 798, dtype: int64

We can easily compute how accurate our algorithm was on the training set using the scikit-learn function `accuracy_score()`, which returns the fraction of correct predictions (one minus the error rate).

In [16]:
from sklearn.metrics import accuracy_score

In [17]:
accuracy_score(y_true=titanic_train.Survived,    # True values
               y_pred=tlu_train_predicted)    # Predicted values

0.82957393483709274
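
To connect this with the error rate computed by hand earlier, here is a small sketch on made-up labels (the arrays are assumptions for illustration): `accuracy_score()` is the share of matching entries, and the error rate is its complement.

```python
import numpy as np
from sklearn.metrics import accuracy_score

y_true = np.array([1, 0, 0, 1, 1])    # Toy true labels
y_pred = np.array([1, 0, 1, 1, 0])    # Toy predictions (two mistakes)

acc = accuracy_score(y_true=y_true, y_pred=y_pred)
manual_acc = (y_true == y_pred).mean()    # Fraction of matching entries
error_rate = 1 - acc

print(acc, manual_acc, error_rate)
```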

The algorithm is very accurate on the training set. This is to be expected; it's just looking up values from the table! What about when it's applied to the test set?

In [18]:
tlu_test_predicted = titanic_test.apply(table_lookup_predictor, axis=1,
                                        table=titanic_train)    # Make predictions on test set

In [19]:
accuracy_score(y_true=titanic_test.Survived,    # True values
               y_pred=tlu_test_predicted)    # Predicted values

0.6629213483146067

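Accuracy fell from about 83% on the training set to about 66% on the test set, so the algorithm overfit the training data. The overfitting isn't terrible, though: the algorithm still does somewhat better than the 61% we got on the full dataset by always guessing the most common label.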

**NOTE:** Evaluating a model on the test set should be the *very last thing you do!* If you repeatedly refer to the test set, it no longer is "unseen" data.

## Cross-Validation

Many algorithms include **hyperparameters**, which are parameters that are characteristic of the algorithm itself rather than the underlying phenomenon. We have to choose their values ourselves, and we care about those values only insofar as they improve predictions.

Our algorithm does not account for passengers' ages when making predictions. Age is unfortunately not a binary variable, but we can use it to create one by fixing a cutoff age and marking every individual younger than the cutoff with 1 and everyone else with 0. The cutoff age behaves like a hyperparameter here.
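
A minimal sketch of that binarization on a few made-up ages (the `toy_ages` values and the cutoff of 10 are assumptions for illustration):

```python
import pandas as pd

toy_ages = pd.Series([4.0, 22.0, 30.0, 10.0, 61.0], name="Age")
cutoff = 10    # Candidate cutoff age

# 1 if strictly younger than the cutoff, 0 otherwise
is_young = (toy_ages < cutoff).astype(int)
print(is_young.tolist())
```

Note that a passenger exactly at the cutoff age is marked 0, because the comparison is strict.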

We don't want to pick our cutoff to maximize predictive accuracy in the training set, though, and we don't want to choose it so that it improves accuracy in the test set either. Instead we will employ **cross-validation**. The procedure works as follows:

1. *Divide data into $k$ **folds** (approximately equal size subsets of the original dataset that together form the whole dataset).*
2. *For each fold, do the following:*
    1. *Treat the fold as the "test" data and the rest of the data as "training" data.*
    2. *For each possible value of the hyperparameter, use the "training" data to fit the model and evaluate its performance on the "test" data; track performance.*
3. *Aggregate the performance of the algorithm across the different folds for each possible value of the hyperparameter.*
4. *Use the hyperparameter value that overall yielded the best performance.*

Cross-validation can be used for purposes other than choosing hyperparameters. For example, it can be a good way to evaluate an algorithm's performance and thus allow you to choose between different algorithms.

Here I will consider six candidate cutoff ages: 10, 20, 30, 40, 50, 60. I will use 10 folds.

scikit-learn provides multiple functions for supporting cross-validation. The `KFold` class can split a dataset up into folds as described. `cross_val_score()` can perform the entire cross-validation procedure, and is a good choice. Here I will do the cross-validation manually using only `KFold` but in future videos we may use `cross_val_score()`.
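
For reference, a call to `cross_val_score()` might look like the sketch below. It is only an illustration under some assumptions: the table-lookup function in this notebook is not a scikit-learn estimator, so a small `DecisionTreeClassifier` stands in for it, and the data is synthetic.

```python
import numpy as np
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: the label depends only on the sign of the first feature
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = (X[:, 0] > 0).astype(int)

model = DecisionTreeClassifier(max_depth=2, random_state=0)    # Stand-in estimator
scores = cross_val_score(model, X, y,
                         cv=KFold(n_splits=10),    # Same kind of splitter used below
                         scoring="accuracy")       # One accuracy value per fold

print(scores)
print(scores.mean())
```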

In [20]:
from sklearn.model_selection import KFold
import numpy as np

In [21]:
kf = KFold(n_splits=10)    # Prepare for cross-validation, creating an object for managing splits

In [22]:
# Preview; note that these are NumPy arrays
for train, test in kf.split(titanic_train):
    print("Training Indices:")
    print(train)
    print("\nTest Indices")
    print(test)
    print("\n----\n")

Training Indices:
[ 80  81  82  83  84  85  86  87  88  89  90  91  92  93  94  95  96  97
  98  99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115
 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133
 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151
 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169
 170 171 172 173 174 175 176 177 178 179 180 181 182 183 184 185 186 187
 188 189 190 191 192 193 194 195 196 197 198 199 200 201 202 203 204 205
 206 207 208 209 210 211 212 213 214 215 216 217 218 219 220 221 222 223
 224 225 226 227 228 229 230 231 232 233 234 235 236 237 238 239 240 241
 242 243 244 245 246 247 248 249 250 251 252 253 254 255 256 257 258 259
 260 261 262 263 264 265 266 267 268 269 270 271 272 273 274 275 276 277
 278 279 280 281 282 283 284 285 286 287 288 289 290 291 292 293 294 295
 296 297 298 299 300 301 302 303 304 305 306 307 308 309 310 311 312 313
 314 315 316 317 318 319 320 321 
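
The index listing for the real data is long, so a toy example may make the fold structure easier to see. This sketch splits six made-up items into three folds; `kf_toy` is a separate object so it does not interfere with `kf` above. Without shuffling, `KFold` uses contiguous blocks of positions as the "test" folds.

```python
import numpy as np
from sklearn.model_selection import KFold

toy_items = np.arange(6)        # Six items, positions 0 through 5
kf_toy = KFold(n_splits=3)      # No shuffling: folds are contiguous blocks

toy_folds = [(train.tolist(), test.tolist()) for train, test in kf_toy.split(toy_items)]
for train_idx, test_idx in toy_folds:
    print("train:", train_idx, "test:", test_idx)
```

Each position appears in exactly one "test" fold and in the "training" part of all the others.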

In [29]:
def table_lookup_predictor_2(x, table, age):
    """Implements the table-lookup algorithm with an age cutoff"""

    # Get most common label; idxmax() returns the label itself, whereas
    # argmax() returns a position in recent versions of pandas
    default = table.Survived.value_counts().idxmax()
    # Get similar individuals (same side of the age cutoff counts as "similar")
    similar_tab = table.loc[(table["Pclass"] == x["Pclass"]) &
                            (table["Sex"] == x["Sex"]) &
                            (table["Siblings/Spouses Aboard"] == x["Siblings/Spouses Aboard"]) &
                            (table["Parents/Children Aboard"] == x["Parents/Children Aboard"]) &
                            ((table["Age"] < age) == (x["Age"] < age)), "Survived"]
    if len(similar_tab) == 0:
        # If table is empty (no "similar" individuals), guess the most common label
        return default
    else:
        return similar_tab.value_counts().idxmax()

In [30]:
ages = [10, 20, 30, 40, 50, 60]
performance = dict()

for age in ages:
    cv_perf = list()
    for train, test in kf.split(titanic_train):
        # Get predicted values in "test" data using "train" data
        predicted = titanic_train.iloc[test,:].apply(table_lookup_predictor_2, 1, table=titanic_train.iloc[train,:],
                                                    age=age)
        actual = titanic_train.loc[:,"Survived"].iloc[test]
        # Add performance to a list
        cv_perf.append(accuracy_score(y_true=actual, y_pred=predicted))
    performance[age] = cv_perf

In [31]:
DataFrame(performance)

Unnamed: 0,10,20,30,40,50,60
0,0.7625,0.725,0.75,0.775,0.7625,0.7625
1,0.7875,0.7875,0.7625,0.7875,0.7875,0.7875
2,0.7875,0.7875,0.7875,0.8125,0.7875,0.7875
3,0.8,0.7875,0.7625,0.75,0.7375,0.7125
4,0.775,0.75,0.7375,0.775,0.75,0.7625
5,0.8125,0.8,0.775,0.775,0.7875,0.75
6,0.825,0.8375,0.8125,0.85,0.825,0.8375
7,0.8375,0.7875,0.775,0.75,0.8,0.7875
8,0.860759,0.848101,0.848101,0.860759,0.822785,0.822785
9,0.797468,0.797468,0.810127,0.810127,0.797468,0.810127


In [28]:
DataFrame(performance).mean()

10    0.804573
20    0.790807
30    0.782073
40    0.794589
50    0.785775
60    0.782041
dtype: float64

It appears we attain optimal performance by choosing our cutoff age to be 10 years.
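
The choice can also be read off programmatically. The sketch below copies the mean accuracies printed above into a Series (in the notebook itself this would be `DataFrame(performance).mean()`) and picks the cutoff with the highest mean; `mean_acc`, `best_age` and `margin` are illustrative names.

```python
import pandas as pd

# Mean cross-validation accuracy per cutoff age, copied from the output above
mean_acc = pd.Series({10: 0.804573, 20: 0.790807, 30: 0.782073,
                      40: 0.794589, 50: 0.785775, 60: 0.782041})

best_age = mean_acc.idxmax()    # Label (cutoff age) with the highest mean accuracy
margin = mean_acc.max() - mean_acc.drop(best_age).max()    # Lead over the runner-up

print(best_age)
print(round(margin, 6))
```

The lead over the next-best cutoff is about one percentage point, which is small compared with the variation between folds in the table above, so the preference for 10 should be read as tentative.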