<h1>CS4619: Artificial Intelligence 2</h1>
<h2>Error Estimation</h2>
<h3>
    Derek Bridge<br>
    School of Computer Science and Information Technology<br>
    University College Cork
</h3>

# Initialization $\newcommand{\Set}[1]{\{#1\}}$ $\newcommand{\Tuple}[1]{\langle#1\rangle}$ $\newcommand{\v}[1]{\pmb{#1}}$ $\newcommand{\cv}[1]{\begin{bmatrix}#1\end{bmatrix}}$ $\newcommand{\rv}[1]{[#1]}$

In [1]:
%load_ext autoreload
%autoreload 2
%matplotlib inline

In [30]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

In [31]:
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.model_selection import ShuffleSplit
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.model_selection import LeaveOneOut

<h1>Mean Squared Error</h1>
<p>
    So, you've trained an estimator on a training set. 
    You want to know how well it will do in practice, once you start to use it to make predictions.
    Easy, right? We have the training set, so we measure how well the estimator performs on the training set.
    For each example in the training set, we ask the estimator to predict the value of the dependent variable
    and compare with the <em>actual</em> value, which is also in the training set.
</p>
<p>
    For regression, we will compute the <b>mean squared error</b>:
    $$\frac{1}{m}\sum_{i=1}^m(\hat{y}_i - y_i)^2$$
    where $m$ is the number of examples, $\hat{y}_i$ is the predicted value for example $i$ and $y_i$ is the actual value.
</p>
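<p>
    As a quick check of the formula, here is a minimal sketch that computes the mean squared error by hand.
    The four actual and predicted values are made-up toy numbers (an assumption, not taken from any dataset).
    The result should agree with what <code>mean_squared_error</code> returns for the same arrays.
</p>

```python
import numpy as np

# Hypothetical toy values, chosen only to make the arithmetic easy to follow
y_actual = np.array([200.0, 350.0, 410.0, 275.0])
y_predicted = np.array([210.0, 340.0, 400.0, 300.0])

# Squared difference for each example, then the mean over all m examples
squared_errors = (y_predicted - y_actual) ** 2
mse_by_hand = squared_errors.sum() / len(y_actual)

print(squared_errors)
print(mse_by_hand)
```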

<h2>Example of Mean Squared Error on the Training Set</h2>
<p>
    Let's compute the mean squared error for Linear Regression trained on the Cork property dataset: 
</p>

In [32]:
# Use pandas to read the CSV file
df = pd.read_csv("dataset-corkA.csv")

# Get the feature-values and the target values into separate numpy arrays of numbers
X = df[['flarea', 'bdrms', 'bthrms']].values
y = df['price'].values

# Create linear regression object
estimator = LinearRegression()

# Train the model using the data
estimator.fit(X, y)

# Use the learned model to predict on the same examples
y_predicted = estimator.predict(X)

# Compute the mean squared error
mse = mean_squared_error(y, y_predicted)

# Display
mse

55456.553237981032

<h1>Training Error and Test Error</h1>
<p>
    We will refer to the error on the training set as the <b>training error</b>. (Some people call it the
    'resubstitution error' and sometimes the 'in-sample error'.) But, remember, we're not much
    interested in how well we have done on this data; we want to know how well we will perform in the future, 
    on unseen data. Is the training error a good indicator of performance on unseen data? The answer is, 
    in general: no.
</p>
<p>
    The estimator's training error (its performance on the very data on which it was trained) is likely to
    give an optimistic, even very optimistic, view of its future performance.
</p>
<ul>
    <li>
        One intuition of why this is wrong is that it's a bit like
        a teacher who sets exams whose questions test the very same examples s/he used when teaching the
        material.
    </li>
    <li>
        Another intuition of why this is wrong: which estimator that we have studied can have
        zero training error but would be likely to perform much less well in practice?
    </li>
</ul>
<p>
    (By the way, although the training error is not a good predictor of future performance, it can 
    still be useful, as we will see in the lecture on Underfitting and Overfitting.)
</p>
<p>
    To predict future performance, we need to measure error on an <em>independent</em> dataset &mdash; one
    that played no part in creating the estimator. This second dataset is called the <b>test set</b>, and
    our error on the test set we will call the <b>test error</b>. (In some circumstances people might call it 
    the 'out-of-sample error' or 'extra-sample error'.)
</p>
<ul>
    <li>
        If you have a ready supply of quality data, then collect one very large dataset 
        to be the training
        set, and collect another very large dataset to be the test set. But large datasets are not
        always available, and large high-quality datasets are even harder to come by. (And, remember, each dataset must 
        contain the actual values of the dependent variable as well as the values of the features.)
    </li>
    <li>
        If the supply of data is more limited, then collect one dataset (as large as you can) and partition it 
        into training set and test set. 
        This is called the <b>holdout</b> method, because the test set
        is withheld during training. It is essential that the test set is not used in any way to create
        the estimator. We look at holdout, and variations of it, in more detail in this lecture.
    </li>
</ul>

<h1>Holdout</h1>
<p>
   We split the dataset randomly into two <em>disjoint</em> sets: no example appears in both.
   There is a tension here. To learn a good estimator, we want the training set to be as big as possible.
   But for a good prediction of future performance, we want the test set to be as big as possible.
   Commonly, the training set will be between 50% and 80% of the dataset.
</p>
<p>
    Splitting the dataset in this way is very easy in scikit-learn:
</p>

In [33]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3)

# Create linear regression object
estimator = LinearRegression()

# Train the model using the training set
estimator.fit(X_train, y_train)

# The training error
# Predict on the training set and measure the difference between the predictions and the actual values in the training set:
y_predicted = estimator.predict(X_train)
mse_train = mean_squared_error(y_train, y_predicted)

# The test error
# Predict on the test set and measure the difference between the predictions and the actual values in the test set:
y_predicted = estimator.predict(X_test)
mse_test = mean_squared_error(y_test, y_predicted)

# Display
mse_train, mse_test

(8848.4480660590198, 241423.1846134419)

<p>
    You will find it instructive to run the above again and again to see the effect of different random splits.
</p>
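<p>
    If you would rather see several splits side by side, the following sketch repeats the holdout split with
    different seeds. It uses synthetic stand-in data (an assumption: the names <code>X_demo</code> and
    <code>y_demo</code> are hypothetical and this is not the Cork property dataset), and it fixes
    <code>random_state</code> only so that each split can be reproduced.
</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption: NOT the Cork property dataset)
rng = np.random.default_rng(0)
X_demo = rng.uniform(40, 250, size=(200, 1))
y_demo = 50 + 2.0 * X_demo[:, 0] + rng.normal(scale=40, size=200)

# One holdout split per seed; each split gives its own test error
test_errors = []
for seed in range(5):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X_demo, y_demo, test_size=0.3, random_state=seed)
    model = LinearRegression().fit(X_tr, y_tr)
    test_errors.append(mean_squared_error(y_te, model.predict(X_te)))

for seed, err in enumerate(test_errors):
    print(f"seed {seed}: test MSE = {err:.1f}")
```

<p>
    The spread between the printed values is the variability that the next section discusses.
</p>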

<h2>Pros and Cons of Holdout</h2>
<p>
    The advantage of this method is that the test error is independent of the training set.
</p>
<p>
    The disadvantages of the holdout method are:
</p>
<ul>
    <li>
        You will observe that results can vary quite a lot. Informally, you might get lucky &mdash; or unlucky.
        Maybe you get a very 'helpful' training set, or a very 'unhelpful' training set; a very 'easy-to-predict' 
        test set, or a very 'hard-to-predict' test set. In other words,
        in any one split, the data used for training or testing might not be representative.
    </li>
    <li>
        We are training on only a subset of the available dataset, perhaps as little as 50% of it. From so little
        data, we may learn a worse model and so our error measurement may be pessimistic.
    </li>
</ul>
<p>
    In practice, you would not use the holdout method &mdash; unless you had a very large dataset that
    would mitigate the above problems. Instead, you would use one of its variants that we 
    describe below. Each of these variants uses <b>resampling</b>, meaning that the examples get re-used
    for training and testing.
</p>

<h1>Repeated Holdout</h1>
<p>
    One solution to the problem of biased holdout sets is to <em>repeat</em> the whole process:
</p>
<ul style="background: lightgray; list-style: none">
    <li>
        repeatedly
        <ul>
            <li>split the dataset into training and test sets</li>
            <li>train on the training set</li>
            <li>make predictions for the test set</li>
            <li>measure error (e.g. MSE)</li>
        </ul>
        report the mean of the errors
    </li>
</ul>

<h2>Illustrating scikit-learn's ShuffleSplit Class</h2>
<p>
    scikit-learn provides a ShuffleSplit class, which generates arrays of integer indexes that split the dataset. Here's a simple use:
</p>

In [34]:
# Split the dataset 70%/30%. Do so 3 times.
ss = ShuffleSplit(n_splits = 3, test_size = 0.3)

# Display the indexes
for train_indexes, test_indexes in ss.split(X):
    print("TRAIN:", train_indexes)
    print("TEST:", test_indexes)

TRAIN: [119  76 203  21 162  78 131 128 199 193 180 188  89 192  87 117   0 140
 150  24 183  72  26  31  88   5 191 172  51 220 160  38 186 165  70 102
   1  11 185 133 178  43 104 138 118 149  61 121 115 214 222  97  99 200
 107  69  98 209  82  47  93 223 129 111  37  96  19 143  75 212 132  58
 106  35 201  22  57  84  62  85  12 171  27 116  48  81 105 136 145  44
  36 147 112  32 114  66  68  20 164 218 219  90  63 148 130 182 152 206
 163 213 151  64  83  33 216 196  42 103  77   3 167 101  80 108  39 139
  91 170  16   9  14   2 137 190 208  49  46 154  95 110 158 155  53 157
 141  41 126 174  73 194 142  71  28 122   8   6]
TEST: [134 179  92 211   4  67  17  25 173  79 153 124  65 210 161  56  54  55
  29 207  18 156 100 166 123  52  15  13 195 175  59 159 113 198 135  34
  60  50   7 125 197 202 168  10 127  45 215 184 146 204  86  40 177 221
  94 169  23  74 189 109 120 176 205  30 217 187 144 181]
TRAIN: [ 48  11  62   6  19  92 131  37 124 138 179  75 212 188  56 222  49 

<h2>Using ShuffleSplit to Compute Training Error and Test Error</h2>

In [35]:
def repeated_holdout_for_regression(estimator, X, y, n_splits = 10, test_size = 0.3):
    mses_train = np.zeros(n_splits)
    mses_test = np.zeros(n_splits)
    ss = ShuffleSplit(n_splits = n_splits, test_size = test_size)
    for i, (train_indexes, test_indexes) in enumerate(ss.split(X)):
        X_train = X[train_indexes]
        y_train = y[train_indexes]
        X_test = X[test_indexes]
        y_test = y[test_indexes]
        estimator.fit(X_train, y_train)
        y_predicted = estimator.predict(X_train)
        mses_train[i] = mean_squared_error(y_train, y_predicted)
        y_predicted = estimator.predict(X_test)
        mses_test[i] = mean_squared_error(y_test, y_predicted)
    return np.mean(mses_train), np.mean(mses_test)

# Here's an example of calling the function:
estimator = LinearRegression()
mean_mse_train, mean_mse_test = repeated_holdout_for_regression(estimator, X, y)
mean_mse_train, mean_mse_test

(48649.124348839061, 85196.270987307231)

<h2>Using ShuffleSplit to Compute Test Error</h2>
<p>
    scikit-learn does provide a more convenient way of doing this, but it only computes the test error and it computes the negative of the MSE (so that higher values are better):
</p>

In [36]:
ss = ShuffleSplit(n_splits = 10, test_size = 0.3)
estimator = LinearRegression()
mses_test = cross_val_score(estimator, X, y, scoring = 'neg_mean_squared_error', cv = ss)
mean_mse_test = np.mean(mses_test)
mean_mse_test

-68390.020459469481
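<p>
    If you want the training error as well without writing your own loop, one option is scikit-learn's
    <code>cross_validate</code> function with <code>return_train_score = True</code>. The sketch below
    runs on synthetic stand-in data (an assumption: <code>X_demo</code> and <code>y_demo</code> are
    hypothetical names, not the Cork property dataset) and flips the sign of the scores to recover ordinary MSEs.
</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import ShuffleSplit, cross_validate

# Synthetic stand-in data (an assumption: NOT the Cork property dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(100, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=100)

ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)
results = cross_validate(LinearRegression(), X_demo, y_demo,
                         scoring='neg_mean_squared_error', cv=ss,
                         return_train_score=True)

# Scores are negated MSEs, so flip the sign to recover ordinary MSEs
mean_mse_train = -np.mean(results['train_score'])
mean_mse_test = -np.mean(results['test_score'])
print(mean_mse_train, mean_mse_test)
```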

<h2>Pros and Cons of Repeated Holdout</h2>
<p>
    The advantage here is we can repeat indefinitely to improve our confidence. The disadvantage is training
    sets may overlap with each other and test sets may overlap with each other, although the effect of this
    is reduced if the dataset is large.
</p>
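<p>
    To see the overlap between test sets concretely, this sketch counts how often each example lands in a
    test set over 10 repetitions. The 20-example placeholder array is an assumption made purely for illustration.
</p>

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

m = 20
X_demo = np.arange(m).reshape(-1, 1)  # hypothetical placeholder data

ss = ShuffleSplit(n_splits=10, test_size=0.3, random_state=0)

# Count how many of the 10 test sets each example appears in
test_counts = np.zeros(m, dtype=int)
for train_indexes, test_indexes in ss.split(X_demo):
    test_counts[test_indexes] += 1

print(test_counts)
```

<p>
    Some examples are tested several times and others rarely or never, which is exactly what
    $k$-fold cross-validation avoids.
</p>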
<p>
    Let's look at another method.
</p>

<h1>$k$-Fold Cross-Validation</h1>
<p>
    In this approach, we randomly partition the data into $k$ disjoint subsets of equal size. (This is a different use of 
    $k$ from the $k$ in kNN.) Each of the partitions is called a <b>fold</b>. Typically, $k = 10$, so you have 10 folds. 
    But, for conventional statistical significance testing to be applicable, you should probably ensure that the number of
    examples in each fold does not fall below 30. If this isn't possible, then either use a smaller value for $k$, or
    do not use $k$-fold cross validation. 
</p>
<p>
    You take each fold in turn and use it as the test set, training the learner on 
    the remaining folds. Clearly, you can do this $k$ times, so that each fold gets 'a turn' at being the test set.
</p>
<ul style="background: lightgray; list-style: none">
    <li>
        partition the dataset $D$ into $k$ disjoint equal-sized subsets, $T_1, T_2,\ldots,T_k$
    </li>
    <li>
        <b>for</b> $i = 1$ to $k$
        <ul>
            <li>train on $D \setminus T_i$</li>
            <li>make predictions for $T_i$</li>
            <li>measure error (e.g. MSE)</li>
        </ul>
        report the mean of the errors
    </li>
</ul>
<p>
    By this method, each example is used exactly once for testing, and $k - 1$ times for training.
</p>
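<p>
    The claim above can be checked with a hand-rolled sketch of the partitioning step. It uses only numpy,
    and the values of <code>m</code> and <code>k</code> are arbitrary assumptions. When $k$ does not divide
    $m$ exactly, the folds cannot be quite equal in size, so <code>np.array_split</code> makes them differ
    by at most one example.
</p>

```python
import numpy as np

m, k = 23, 5  # hypothetical dataset size and number of folds
rng = np.random.default_rng(0)

# Shuffle the example indexes, then cut them into k nearly equal folds
shuffled = rng.permutation(m)
folds = np.array_split(shuffled, k)

test_uses = np.zeros(m, dtype=int)
train_uses = np.zeros(m, dtype=int)
for i in range(k):
    # Fold i is the test set; all the other folds together form the training set
    test_indexes = folds[i]
    train_indexes = np.concatenate([folds[j] for j in range(k) if j != i])
    test_uses[test_indexes] += 1
    train_uses[train_indexes] += 1

print(test_uses)   # how many times each example was tested
print(train_uses)  # how many times each example was trained on
```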

<h2>Pros and Cons of $k$-Fold Cross-Validation</h2>
<p>
    Compared with repeated holdout, the advantages of this method are:
</p>
<ul>
    <li>
        The test errors of the folds are independent &mdash; because examples are included in only one test set. 
    </li>
    <li>
        Better use is made of the dataset: for $k = 10$, for example, we train using 9/10 of the dataset.
    </li>
</ul>
<p>
     The disadvantages are: 
</p>
<ul>
    <li>
        While the test sets are independent of each other, the training sets are not: they will overlap
        with each other to some degree. (The effect of this will be less, of course, for larger datasets.)
    </li>
    <li>
        The number of folds is constrained by the size of the dataset and the desire to have folds of
        at least 30 examples.
    </li>
    <li>
        It can be costly to train the learning algorithm $k$ times.
    </li>
    <li>
        There may still be some variability in the results due to 'lucky'/'unlucky' splits &mdash; which
        motivates Repeated $k$-Fold Cross-Validation, below.
    </li>
</ul>

<h2>Illustrating scikit-learn's KFold Class</h2>
<p>
    scikit-learn provides the KFold class, which, like the ShuffleSplit class, generates arrays of training and test indexes. Here's a simple use:
</p>

In [37]:
kf = KFold(n_splits = 5, shuffle = True)

# Display the indexes
for train_indexes, test_indexes in kf.split(X):
    print("TRAIN:", train_indexes)
    print("TEST:", test_indexes)

TRAIN: [  0   1   2   3   4   6   7   8   9  11  12  14  15  16  18  21  22  23
  24  25  26  28  29  30  31  32  33  34  35  36  37  38  39  40  41  42
  44  46  47  48  49  50  51  52  53  54  55  56  58  59  61  62  64  66
  67  68  69  70  72  74  75  77  79  80  81  82  84  86  87  88  89  90
  91  92  93  94  95  96  97  99 100 101 102 103 104 106 107 108 109 110
 111 112 113 116 118 119 120 122 123 124 125 127 128 129 131 132 133 135
 136 137 138 139 142 143 144 145 146 147 148 149 150 151 152 153 154 155
 156 158 159 160 161 164 165 166 167 168 169 170 171 172 173 174 175 177
 178 179 180 181 182 183 184 186 187 188 189 190 192 196 197 198 199 200
 201 202 203 204 205 206 207 208 209 210 211 212 213 215 216 218 220]
TEST: [  5  10  13  17  19  20  27  43  45  57  60  63  65  71  73  76  78  83
  85  98 105 114 115 117 121 126 130 134 140 141 157 162 163 176 185 191
 193 194 195 214 217 219 221 222 223]
TRAIN: [  1   3   4   5   6   7   8   9  10  11  12  13  15  16  17  18  19 

<h2>Using KFold to Compute Test Error</h2>
<p>
    Assuming that we are happy to get just the test error, we can use the cross_val_score function again:
</p>

In [38]:
kf = KFold(n_splits = 10, shuffle = True)
estimator = LinearRegression()
mses_test = cross_val_score(estimator, X, y, scoring = 'neg_mean_squared_error', cv = kf)
mean_mse_test = np.mean(mses_test)
mean_mse_test

-71316.054085133728

<p>
    But $k$-fold cross-validation is so commonplace that there is a shorter way to write
    the code above, as follows:
</p>

In [39]:
estimator = LinearRegression()
mses_test = cross_val_score(estimator, X, y, scoring = 'neg_mean_squared_error', cv = 10)
mean_mse_test = np.mean(mses_test)
mean_mse_test

-100047.69646341191

<p>
    Be warned, however: when <code>cv</code> is an integer, scikit-learn does not shuffle the dataset before splitting it into folds.
    Q: Why might that be a problem?
</p>
<p>
    You should probably shuffle the <code>DataFrame</code> just after reading it in from the CSV file using, e.g.:<br>
    <code>df = df.take(np.random.permutation(len(df)))</code>
</p>
<p>
    Final observation: In the above, we ran the 10-fold cross validation on the Cork property dataset. That dataset has 
    only 224 examples
    &mdash; not enough examples to give at least 30 examples in each of the 10 folds. So this isn't an ideal use of 
    the method.
</p>
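<p>
    Both points can be illustrated with a short sketch. It assumes that an integer <code>cv</code> for a
    regressor behaves like an unshuffled <code>KFold</code>, and it uses a placeholder array with 224 rows
    in place of the real data, because only the number of rows matters for the split.
</p>

```python
import numpy as np
from sklearn.model_selection import KFold

m = 224  # number of examples reported for the Cork property dataset
placeholder = np.zeros((m, 1))  # hypothetical stand-in: only the row count matters

kf = KFold(n_splits=10)  # shuffle defaults to False
test_folds = [test_indexes for _, test_indexes in kf.split(placeholder)]

# Fold sizes, and the indexes that make up the first test fold
print([len(fold) for fold in test_folds])
print(test_folds[0])
```

<p>
    Without shuffling, each test fold is a contiguous block of rows, so any ordering in the CSV file
    (by date or by price, say) carries straight through into the folds. Every fold also has only 22 or 23 examples.
</p>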

<h1>Repeated $k$-Fold Cross-Validation</h1>
<p>
    It's not uncommon to find people repeating the $k$-fold cross validation to reduce variability in the results. 
    For example, you might run 10 times 10-fold
    cross-validation and average the results. This means running the learning algorithm 100 times, each time on a training
    set that is nine tenths of the full dataset &mdash; quite computationally expensive.
</p>
<p>
    We won't look at the code. Straightforwardly, you wrap an extra loop around the code we gave above. 
</p>
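<p>
    For readers who do want a sketch: rather than writing the extra loop by hand, one option is scikit-learn's
    <code>RepeatedKFold</code> class, which can be passed to <code>cross_val_score</code> like the other
    splitters. The data below is a synthetic stand-in (an assumption: <code>X_demo</code> and
    <code>y_demo</code> are hypothetical names, not the Cork property dataset).
</p>

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Synthetic stand-in data (an assumption: NOT the Cork property dataset)
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(120, 3))
y_demo = X_demo @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=120)

# 10 repetitions of 10-fold cross-validation: 100 train/test rounds in total
rkf = RepeatedKFold(n_splits=10, n_repeats=10, random_state=0)
scores = cross_val_score(LinearRegression(), X_demo, y_demo,
                         scoring='neg_mean_squared_error', cv=rkf)

# Scores are negated MSEs, so flip the sign before reporting
mean_mse_test = -np.mean(scores)
print(len(scores), mean_mse_test)
```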

<h1>Leave-One-Out Cross-Validation (LOOCV)</h1>
<p>
    Leave-one-out cross-validation is $k$-fold cross-validation in which $k = m$, the number of examples in the dataset:
    each example is in its own fold. In other words, you train the learner on all examples but one, and that one remaining
    example is used for testing. And you do this in turn for each example in the dataset. You'll get $m$ error values, which you
    can average.
</p>
<ul style="background: lightgray; list-style: none">
    <li>
        <b>for</b> $i = 1$ to $m$
        <ul>
            <li>train on $D \setminus \Set{\v{x}^{(i)}}$</li>
            <li>make prediction for $\v{x}^{(i)}$</li>
            <li>measure error (e.g. MSE)</li>
        </ul>
        report the mean of the errors
    </li>
</ul>
<p>
    As with $k$-fold cross-validation, each example is used exactly once in a test set. But each example is used in $m - 1$ 
    different training sets. 
</p>

<h2>Pros and Cons of LOOCV</h2>
<p>
    There are advantages:
</p>
<ul>
    <li>
        One advantage of LOOCV is that the maximum amount of data is used for training, which makes an accurate
        estimator more likely. (But, see the disadvantage below.)
    </li>
    <li>
        Another advantage is that there is no randomness: we can't get lucky or unlucky. And there's 
        no point in repeating the process: we'll get the same result each time.
    </li>
</ul>
<p>
    But there are disadvantages:
</p>
<ul>
    <li>
        The obvious disadvantage is the cost: the learner must be trained $m$ times, and each time it will be trained on almost 
        all the data. This method is therefore infeasible in some cases. 
        <p>
            Question:
        </p>
        <ul>
            <li>
                For which estimator do you think LOOCV is fairly common? Why?
            </li>
        </ul>
    </li>
    <li>
        More subtly, LOOCV's $m$ models are trained on almost identical data; in $k$-Fold Cross-Validation, the $k$ models
        are trained on data with less overlap.
    </li>
</ul>
<p>
    (Advanced note, which you can ignore: We said that LOOCV must train the learning algorithm $m$ times. In fact, 
    for some learners, including OLS linear regression, you can fit just one model, on the whole dataset, and then, with a bit of 
    maths, work out the LOOCV average error without learning any of the other models. So this makes LOOCV practical 
    for this class of learning algorithms.)
</p>
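<p>
    For the curious, here is a sketch of the 'bit of maths' in the advanced note, for OLS only. The lecture
    does not give the formula, so treat what follows as an assumption: it is the standard leverage identity
    $\frac{1}{m}\sum_{i=1}^m \left(\frac{\hat{y}_i - y_i}{1 - h_{ii}}\right)^2$, where $\hat{y}_i$ comes from
    the model fitted on all the data and $h_{ii}$ is the $i$th diagonal entry of the hat matrix. The sketch
    compares it with brute-force LOOCV on synthetic data, using only numpy.
</p>

```python
import numpy as np

# Synthetic data (an assumption, used only to compare the two computations)
rng = np.random.default_rng(0)
m = 40
X_demo = rng.normal(size=(m, 3))
y_demo = 2.0 + X_demo @ np.array([1.5, -0.7, 0.3]) + rng.normal(scale=0.5, size=m)

# Design matrix with a column of ones for the intercept
A = np.column_stack([np.ones(m), X_demo])

# Brute force: fit m models, each leaving out one example
errors = np.empty(m)
for i in range(m):
    keep = np.arange(m) != i
    beta, *_ = np.linalg.lstsq(A[keep], y_demo[keep], rcond=None)
    errors[i] = (A[i] @ beta - y_demo[i]) ** 2
loocv_brute = errors.mean()

# Shortcut: fit one model on all the data, then rescale its residuals
beta_full, *_ = np.linalg.lstsq(A, y_demo, rcond=None)
residuals = y_demo - A @ beta_full
leverage = np.diag(A @ np.linalg.pinv(A))  # diagonal of the hat matrix
loocv_shortcut = np.mean((residuals / (1.0 - leverage)) ** 2)

print(loocv_brute, loocv_shortcut)
```

<p>
    The two printed values should agree up to floating-point rounding, even though the shortcut fits only one model.
</p>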
<p>
    You can see that there is a trade-off here. Empirically, $k$-Fold Cross-Validation with $k = 5$ or $k = 10$ tends
    to report the most reliable error figures.
</p>

<h2>Using LeaveOneOut to Compute Test Error</h2>
<p>
    Here is LOOCV in scikit-learn:
</p>

In [40]:
loocv = LeaveOneOut()
estimator = LinearRegression()
mses_test = cross_val_score(estimator, X, y, scoring = 'neg_mean_squared_error', cv = loocv)
mean_mse_test = np.mean(mses_test)
mean_mse_test

-72376.115691746367

<h1>Final Remarks</h1>
<ol>
    <li>
        There are methods other than those covered here, including Bootstrapping and Permutation Tests.
    </li>
    <li>
        So you've used one of the above methods and found the test error of your estimator. The dirty secret
        of Machine Learning is this: at this point, if dissatisfied with the test error, many Machine
        Learning researchers start tweaking their learning algorithms to try to bring down the test error.
        This is wrong! It is called <b>leakage</b>: knowledge of the test set is being used to develop the
        estimator. It's like the teacher letting the students take the same exam again. It will result in
        the test error giving an optimistic view of the ultimate performance of the estimator on unseen data.
        <p>
            If you must do something like this, then somewhat less problematic is if you ensure that you are 
            using different random splits when evaluating your tweaks.
        </p>
    </li>
    <li>
        Finally, suppose you have used one of the above methods to estimate the error of your regressor. 
        You are ready to release your regressor on the world. At this point, you can train it on
        <em>all</em> the examples in your dataset, so as to maximize the use of the data.
    </li>
</ol>