# Finding hard instances with cross-validation

In the following, we explain how 10-fold cross-validation can be used to find hard instances:

* It is always worth analysing which instances are hard to predict correctly.
* Without cross-validation, the analysis of hard instances requires a big hold-out test set.
* Cross-validation provides a way to get predictions for all instances in the dataset.
* As each prediction is made by a model that did not see that instance during training, overfitting does not corrupt the predictions.

As a result, model dissection with cross-validation may reveal the cause of overfitting or problematic regions.
The technique is particularly useful when the model strongly overfits the data and you need to find the cause.

In [1]:
import numpy as np
import pandas as pd
import scipy.stats as stats
import sklearn

from pandas import Series
from pandas import DataFrame
from typing import Tuple

from tqdm.notebook import trange
from sklearn.linear_model import LogisticRegression
from plotnine import *

# Local imports
from common import *
from convenience import *

## I. Experiment setup

For simplicity, we consider a sampling procedure that outputs $2n$ instances of which $n$ are simple and $n$ are impossible to predict. 
The type of the instance is encoded into the first parameter $x_0$.
This is only for convenience, as similar data can be sampled with standard iid samplers.
To make our life harder, we use majority voting for prediction, as this makes the issue hard to see from the training error alone.

In [2]:
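def sampler(n: int) -> DataFrame:
    """Sample 2n instances; x_0 marks which of the two sources generated a row."""
    columns = ['x_{}'.format(i) for i in range(9)] + ['y']
    return pd.concat([
        # n instances generated with weights (0, 0), marked by x_0 = False
        data_sampler(n, 8, lambda x: logit(x, Series([0, 0]))).assign(x_0 = False),
        # n instances generated with weights (10, 10), marked by x_0 = True
        data_sampler(n, 8, lambda x: logit(x, Series([10, 10]))).assign(x_0 = True)],
        ignore_index = True)[columns]


clf_1 = MajorityVoting()
clf_2 = LogisticRegression(solver = 'lbfgs')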
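
The `MajorityVoting` class comes from the local `common` module and its code is not shown here. The following is a minimal sketch of one way such a classifier could work, assuming it memorises the majority label for every distinct feature vector seen in training and falls back to the overall majority label for unseen vectors. The class name `LookupMajorityVoting` and the fallback rule are assumptions made for illustration, not the course implementation.

```python
from collections import Counter, defaultdict

import pandas as pd


class LookupMajorityVoting:
    """Hypothetical lookup-table classifier (an assumption, not common.MajorityVoting)."""

    def fit(self, X: pd.DataFrame, y: pd.Series) -> "LookupMajorityVoting":
        self.features_ = list(X.columns)
        # Overall majority label, used for feature vectors never seen in training
        self.default_ = Counter(bool(label) for label in y).most_common(1)[0][0]
        # Count the labels observed for each distinct feature vector
        votes = defaultdict(Counter)
        for key, label in zip(X.itertuples(index=False, name=None), y):
            votes[key][bool(label)] += 1
        # Keep the most frequent label per vector (ties go to the label seen first)
        self.table_ = {key: c.most_common(1)[0][0] for key, c in votes.items()}
        return self

    def predict(self, X: pd.DataFrame) -> pd.Series:
        keys = X[self.features_].itertuples(index=False, name=None)
        return pd.Series([self.table_.get(k, self.default_) for k in keys], index=X.index)
```

Under this assumption the model reproduces almost every training label, while unseen feature vectors receive an uninformed default. That would be consistent with a low training error next to a high cross-validation error.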

In [3]:
sampler(2)

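| | x_0 | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | y |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | False | True | True | True | True | False | True | True | False | True |
| 1 | False | True | False | False | False | True | True | True | True | True |
| 2 | True | False | True | False | False | False | True | True | False | True |
| 3 | True | True | True | True | True | True | True | False | True | True |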


## II. Modified cross-validation algorithm

We will use the standard cross-validation scheme but instead of measuring test and training errors, we collect predictions on the test folds.

In [4]:
k = 10
m = 10
n = k * m
data = sampler(n)
features = list(data.columns.values[:-1])

### Analysis of majority voting

Let's use the skeleton of the cross-validation algorithm described in the previous notebook.

In [5]:
# Out-of-fold predictions: each instance is predicted by a model that did not see it
pred = Series(np.nan, index = data.index)
for i, training_samples, test_samples in crossvalidation_splits(data[features], data['y']):
    test_set = data.iloc[test_samples]
    training_set = data.iloc[training_samples]
    # Fit on the training folds only, then predict the held-out fold
    clf_1.fit(training_set[features], training_set['y'])
    pred.iloc[test_samples] = clf_1.predict(test_set[features])

data = data.assign(yp = pred)
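
For estimators that follow the scikit-learn interface, the same kind of out-of-fold predictions can be collected with `sklearn.model_selection.cross_val_predict`. The sketch below is an alternative to the manual loop, not the notebook's own `crossvalidation_splits` helper; the synthetic data, the seed and the variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_predict

# Synthetic boolean features (an assumption; not the notebook's sampler)
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.integers(0, 2, size=(200, 4)).astype(bool),
                 columns=['x_{}'.format(i) for i in range(4)])
# Labels follow x_0, with roughly 10% of them flipped at random
y = pd.Series(X['x_0'] ^ (rng.random(200) < 0.1), name='y')

# Every instance is predicted by the model fitted on the other nine folds
cv = KFold(n_splits=10, shuffle=True, random_state=0)
oof_pred = cross_val_predict(LogisticRegression(solver='lbfgs'), X, y, cv=cv)

result = X.assign(y=y, yp=oof_pred)
cv_error = (result['y'] != result['yp']).mean()
print('CV error: {:.1f}%'.format(100 * cv_error))
```

The manual loop remains the better fit when the classifier, like `MajorityVoting`, may not implement the full scikit-learn estimator interface.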

Predictions with the model trained over the entire dataset:

In [6]:
clf_1.fit(data[features], data['y'])
data = data.assign(yp_train = clf_1.predict(data[features]))

Compare the predictions:

In [7]:
print('CV error: {}%'.format(round(sum(data['y'] != data['yp'])/len(data)*100, 1)))
print('Training error: {}%'.format(round(sum(data['y'] != data['yp_train'])/len(data)*100, 1)))

CV error: 50.0%
Training error: 6.0%


Cross-validation flags far more misclassified instances (50.0%) than the predictions of the model trained over the entire dataset (6.0%). Unfortunately, a summary of the misclassified rows shows no strong signal that singles out the hard instances. This is understandable as majority voting cannot find the solution.

In [8]:
mdisplay([data.loc[data['y'] != data['yp']].describe(), data.loc[data['y'] != data['yp_train']].describe()], ['Crossvalidation', 'Training'])

Crossvalidation

| | x_0 | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | y | yp | yp_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 | 100 |
| unique | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | False | False | False | True | False | False | False | True | True | False | True | False |
| freq | 55 | 66 | 64 | 53 | 54 | 56 | 52 | 59 | 58 | 91 | 91 | 79 |

Training

| | x_0 | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | y | yp | yp_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 | 12 |
| unique | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 1 | 1 | 1 |
| top | False | True | False | False | False | False | False | True | True | False | True | True |
| freq | 8 | 6 | 8 | 7 | 9 | 8 | 7 | 8 | 8 | 12 | 12 | 12 |


The same analysis for logistic regression does not help much, as the training error (29.0%) and the cross-validation error (33.5%) are close.

In [9]:
pred = Series(np.nan, index = data.index)
for i, training_samples, test_samples in crossvalidation_splits(data[features], data['y']):
    test_set = data.iloc[test_samples]
    training_set = data.iloc[training_samples]
    clf_2.fit(training_set[features], training_set['y'])
    pred.iloc[test_samples] = clf_2.predict(test_set[features])

data = data.assign(yp = pred)

clf_2.fit(data[features], data['y'])
data = data.assign(yp_train = clf_2.predict(data[features]))

print('CV error: {}%'.format(round(sum(data['y'] != data['yp'])/len(data)*100, 1)))
print('Training error: {}%'.format(round(sum(data['y'] != data['yp_train'])/len(data)*100, 1)))

mdisplay([data.loc[data['y'] != data['yp']].describe(), data.loc[data['y'] != data['yp_train']].describe()], ['Crossvalidation', 'Training'])

CV error: 33.5%
Training error: 29.0%


Crossvalidation

| | x_0 | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | y | yp | yp_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 | 67 |
| unique | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | False | False | True | False | False | False | True | False | True | False | True | True |
| freq | 43 | 40 | 41 | 35 | 37 | 36 | 35 | 34 | 36 | 35 | 35 | 35 |

Training

| | x_0 | x_1 | x_2 | x_3 | x_4 | x_5 | x_6 | x_7 | x_8 | y | yp | yp_train |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 | 58 |
| unique | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 | 2 |
| top | False | False | True | True | False | False | True | False | True | False | True | True |
| freq | 37 | 34 | 35 | 30 | 32 | 33 | 30 | 30 | 30 | 30 | 31 | 30 |
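
The `describe()` summaries list only the most frequent value of each column among the misclassified rows. A complementary view is the error rate per feature value, which shows directly whether errors concentrate in one part of the feature space. The helper below is a sketch with a hypothetical name, `error_rate_by_feature`, applied to a hand-made toy frame; it assumes boolean features and prediction columns named as in this notebook.

```python
import pandas as pd


def error_rate_by_feature(data, features, target='y', pred='yp'):
    """Misclassification rate for each value of each feature (hypothetical helper)."""
    error = data[target] != data[pred]
    # One Series per feature, indexed by the feature's values
    rates = {f: error.groupby(data[f]).mean() for f in features}
    # Rows: features, columns: feature values
    return pd.DataFrame(rates).T


# Toy frame made up for illustration: both errors fall on rows with x_0 = False
toy = pd.DataFrame({
    'x_0': [False, False, False, False, True, True, True, True],
    'x_1': [False, True, False, True, False, True, False, True],
    'y':   [True, False, True, False, True, True, False, False],
    'yp':  [False, True, True, False, True, True, False, False],
})
rates = error_rate_by_feature(toy, ['x_0', 'x_1'])
print(rates)
```

A feature whose values have very different error rates, as `x_0` does in the toy frame, marks a region worth inspecting. With the notebook's data the call would be `error_rate_by_feature(data, features)`.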


# Homework

## 6.1 Hard instances and average ROC curve* (<font color='red'>3p</font>)

Most classifiers internally compute a numeric decision value and convert it into a binary decision by using a predefined threshold.
By changing the threshold, we can shift the classifier's output towards the positive or the negative class.
The receiver operating characteristic (ROC) curve is a two-dimensional plot which allows you to choose the right threshold value. See Wikipedia for further details. To draw it reliably, we need a large hold-out sample, or the curve will be too jumpy.
To see this in effect, consider a classification algorithm which computes its output as $\mathrm{sign}(x_2-x_1+b)$ for some fixed threshold $b$:

* Compute the ratio of true positives and the ratio of false positives for parameter values $b=-5, -4,\ldots, 5$ and draw the corresponding ROC curve. Since you cannot compute the true positive and false positive ratios analytically, compute these values on a hold-out dataset of size $100$. (<font color='red'>1p</font>)
 
* Study how precise the ROC curve based on 100 hold-out points is by repeating the same computations on $100$ datasets sampled from the same source and drawing all these ROC curves on the same plot. You should see a peculiar effect. Describe it and explain why it occurs.
* Repeat the same experiment with datasets of size $10$ and $1000$. Compare the resulting plots and interpret the results. Is there a minimal size of the hold-out sample for which the ROC curve makes sense? (<font color='red'>1p</font>)
 
* Now use cross-validation to extend the hold-out predictions over the entire dataset.
  Study how precise the ROC curve based on 100 hold-out points is by repeating the same computations on $100$ datasets sampled from the same source and drawing all these ROC curves on the same plot. Is the resulting ROC curve closer to the true ROC curve of 100 or 10 samples analyzed in the previous subtask? (<font color='red'>1p</font>)
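
As a starting point for the tasks above, the ratios behind a single ROC curve can be computed as sketched below. The function name `roc_points`, the toy Gaussian data and the label convention (predict positive when $x_2 - x_1 + b > 0$) are assumptions for illustration; the data source of the homework should be substituted.

```python
import numpy as np


def roc_points(x1, x2, y, thresholds):
    """Return (FPR, TPR) arrays for the rule: predict positive when x2 - x1 + b > 0."""
    y = np.asarray(y, dtype=bool)
    fpr, tpr = [], []
    for b in thresholds:
        pred = (x2 - x1 + b) > 0
        # Share of positives predicted positive, and of negatives predicted positive
        tpr.append((pred & y).sum() / max(y.sum(), 1))
        fpr.append((pred & ~y).sum() / max((~y).sum(), 1))
    return np.array(fpr), np.array(tpr)


# Toy hold-out sample of size 100 (assumed data, not the homework sampler)
rng = np.random.default_rng(0)
y = rng.random(100) < 0.5
x1 = rng.normal(0.0, 1.0, size=100)
x2 = rng.normal(0.0, 1.0, size=100) + 2.0 * y
fpr, tpr = roc_points(x1, x2, y, thresholds=np.arange(-5, 6))
```

Plotting `tpr` against `fpr` gives one curve; repeating the sampling step in a loop and overlaying the curves gives the comparison asked for in the tasks.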

In [10]:
%config IPCompleter.greedy=True