# Assignment 3

**Joris LIMONIER**

---


In [78]:
# imports
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
import plotly.io as pio
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import (
    accuracy_score,
    auc,
    precision_score,
    recall_score,
    roc_curve,
)
from sklearn.model_selection import (
    RepeatedKFold,
    StratifiedKFold,
    cross_val_score,
    train_test_split,
)

pio.templates.default = "plotly_white"


## Assignments 3.1

This first part of the assignment requires you to implement basic cross-validation strategies yourself.

**Exercise 1.** Define a 10-fold classification strategy to test the accuracy of a Linear Discriminant Analysis (LDA) classifier for the data created as follows:


In [79]:
# Declare variables
n_features = 2

# Generate data
X, y = make_classification(
    n_samples=100,
    n_features=n_features,
    n_redundant=0,
    n_informative=2,
    random_state=0,
    n_clusters_per_class=1,
    weights=[0.5],
)


In [80]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.5,
    random_state=0,
)

print(
    f"*** Training ***",
    f"Elements of class 0: {1 - y_train.mean()}",
    f"Elements of class 1: {y_train.mean()}",
    " ",
    f"*** Testing ***",
    f"Elements of class 0: {1 - y_test.mean()}",
    f"Elements of class 1: {y_test.mean()}",
    " ",
    f"*** Training + Testing ***",
    f"Elements of class 0: {1 - y.mean()}",
    f"Elements of class 1: {y.mean()}",
    sep="\n",
)


*** Training ***
Elements of class 0: 0.52
Elements of class 1: 0.48
 
*** Testing ***
Elements of class 0: 0.48
Elements of class 1: 0.52
 
*** Training + Testing ***
Elements of class 0: 0.5
Elements of class 1: 0.5


The data is roughly balanced, so we can work with `RepeatedKFold` rather than `StratifiedKFold`.


In [81]:
px.scatter(x=X[:, 0], y=X[:, 1], color=pd.Categorical(y))


The data seems almost linearly separable, so our LDA classifier should perform fairly well.


In [82]:
# Baseline
n_folds = 10
lda = LinearDiscriminantAnalysis()
cv_acc = cross_val_score(
    estimator=lda,
    X=X,
    y=y,
    scoring="accuracy",
    cv=n_folds,
)
mean_acc = np.mean(cv_acc)

print(f"Accuracy on each fold: {cv_acc}")
print(
    f"Mean accuracy over the {n_folds} folds:",
    f"{mean_acc:.2f}",
)
fig = px.bar(
    data_frame=cv_acc,
    template="plotly_white",
)
fig.add_hline(
    mean_acc,
    annotation_text=f"mean accuracy {mean_acc:.2f}",
)
fig.update_layout(
    showlegend=False,
    xaxis_title="Fold index",
    title="Accuracy per fold",
    font_family="Nato",
)


Accuracy on each fold: [0.9 1.  1.  0.9 1.  1.  1.  1.  0.9 0.7]
Mean accuracy over the 10 folds: 0.94
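
Note that when `cv` is an integer and the estimator is a classifier, `cross_val_score` uses a non-shuffled `StratifiedKFold` internally. A minimal sketch of passing an explicit splitter instead, assuming the same data-generation parameters as above (the names `X_demo`, `y_demo` and `splitter` are only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import RepeatedKFold, cross_val_score

# Same synthetic data as in the cells above
X_demo, y_demo = make_classification(
    n_samples=100,
    n_features=2,
    n_redundant=0,
    n_informative=2,
    random_state=0,
    n_clusters_per_class=1,
    weights=[0.5],
)

# Explicit splitter: 10 folds, repeated twice with different shuffles
splitter = RepeatedKFold(n_splits=10, n_repeats=2, random_state=0)
scores = cross_val_score(
    LinearDiscriminantAnalysis(),
    X_demo,
    y_demo,
    scoring="accuracy",
    cv=splitter,
)
print(f"{len(scores)} scores, mean {scores.mean():.2f} +/- {scores.std(ddof=1):.2f}")
```

With a splitter object, the number of returned scores is `n_splits * n_repeats`, one per train/test split.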


**Exercise 2.** Use the previous 10-fold cross-validation to plot and compute the average area under the curve of the LDA classifier. You can use the built-in method _predict_proba(X)_.


In [83]:
lda = LinearDiscriminantAnalysis()
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.5,
    random_state=0,
)


In [84]:
# Declare variables
n_repeats = 2
cv = RepeatedKFold(n_splits=n_folds, n_repeats=n_repeats, random_state=0)
roc_sample = []
aucs = []
store_metrics = []
mean_fpr = np.linspace(0, 1, 100)
fig = go.Figure()


# Declares loop
for i, (train, test) in enumerate(cv.split(X, y)):

    # Compute predictions
    predictions = lda.fit(X[train], y[train]).predict_proba(X[test])

    # Compute ROC and AUC data
    fpr, tpr, _ = roc_curve(y[test], predictions[:, 1])
    roc_sample.append(np.interp(mean_fpr, fpr, tpr))

    # Make the ROC curve start at 0: force the first interpolated TPR value
    # of the sample that was just added to 0
    roc_sample[-1][0] = 0.0

    # Compute AUC
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)

    # Plot
    fig.add_trace(
        go.Scatter(
            x=fpr,
            y=tpr,
            mode="lines",
            opacity=0.5,
            name=f"ROC fold {str(i).zfill(2)}, with AUC {roc_auc:.2f}",
        )
    )
    y_pred = lda.predict(X[test])

    # Compute accuracy, precision, recall and F1 score
    # (scikit-learn metrics expect y_true first, then y_pred)
    acc = accuracy_score(y[test], y_pred)
    prec = precision_score(y[test], y_pred)
    rec = recall_score(y[test], y_pred)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

    store_metrics.append([acc, prec, rec, f1])

fig.update_layout(
    title=f"Mean AUC: {np.mean(aucs):.4f}",
    font_family="Nato",
    template="plotly_white",
)
fig
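
The per-fold ROC curves are sampled at different FPR values, which is why the loop above re-samples each of them on the common grid `mean_fpr` with `np.interp` before averaging. A minimal sketch of that step on two toy curves (the numbers and variable names are made up for illustration):

```python
import numpy as np
from sklearn.metrics import auc

# Two toy ROC curves sampled at different FPR values (illustrative numbers)
fpr_a, tpr_a = np.array([0.0, 0.2, 1.0]), np.array([0.0, 0.8, 1.0])
fpr_b, tpr_b = np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.6, 1.0])

# Common grid on which both curves are re-sampled
grid = np.linspace(0, 1, 11)
tpr_a_grid = np.interp(grid, fpr_a, tpr_a)
tpr_b_grid = np.interp(grid, fpr_b, tpr_b)

# Point-wise mean and standard deviation across the curves
mean_curve = np.mean([tpr_a_grid, tpr_b_grid], axis=0)
std_curve = np.std([tpr_a_grid, tpr_b_grid], axis=0)

# Area under the averaged curve (trapezoidal rule)
mean_auc = auc(grid, mean_curve)
print(f"AUC of the mean curve: {mean_auc:.3f}")
```

Once every curve lives on the same grid, the mean and standard deviation can be taken point by point, which is what the `roc_sample` list is used for.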


In [85]:
df_met = pd.DataFrame(data=store_metrics, columns=["acc", "prec", "rec", "f1"])
fig = px.line(df_met)
fig


We can plot the average ROC curve with its uncertainty.


In [86]:
fig = go.Figure()

mean_tpr = np.mean(roc_sample, axis=0)
std_tpr = np.std(roc_sample, axis=0)
thresh_low = np.maximum(mean_tpr - std_tpr, 0)
thresh_upp = np.minimum(mean_tpr + std_tpr, 1)

fig.add_traces(
    [
        # Add mean ROC curve
        go.Scatter(
            x=mean_fpr,
            y=mean_tpr,
            name="Mean ROC",
            opacity=0.9,
        ),
        # Add upper bound of the +/- 1 std band
        go.Scatter(
            x=mean_fpr,
            y=thresh_upp,
            opacity=0.1,
            marker={"color": "rgba(0,0,255,0.15)"},
            showlegend=False,
        ),
        # Add lower bound of the +/- 1 std band (filled up to the previous trace)
        go.Scatter(
            x=mean_fpr,
            y=thresh_low,
            name="+/- 1 std",
            opacity=0.1,
            fillcolor="rgba(0,0,255,0.15)",
            fill="tonexty",
            mode="none",
        ),
        # Add diagonal line
        go.Scatter(
            x=mean_fpr,
            y=mean_fpr,
            name="Chance",
            opacity=1,
            line={"width": 1, "color": "red"},
            line_dash="dash",
        ),
    ]
)

# Add trade-off line at the smallest FPR where the mean TPR reaches its maximum
trade_off = mean_fpr[np.argmax(mean_tpr)]
fig.add_vline(
    x=trade_off,
    line_dash="dash",
    annotation_text=f"TPR/FPR trade-off (FPR={trade_off:.2f})",
    annotation_position="top",
)
fig.update_layout(
    title=f"Mean ROC over all folds",
    font_family="Nato",
    xaxis_title="FPR",
    yaxis_title="TPR",
)
fig
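
The vertical line in the cell above is placed with `np.argmax(mean_tpr)`, i.e. at the first grid point where the mean TPR is largest. A common alternative, which is not what the cell above uses, is Youden's J statistic, defined as TPR minus FPR. A minimal sketch on a toy curve (values and names are made up for illustration):

```python
import numpy as np

# Toy mean ROC curve on a regular FPR grid (illustrative values)
fpr_grid = np.linspace(0, 1, 11)
tpr_grid = np.array([0.0, 0.6, 0.8, 0.85, 0.9, 0.92, 0.95, 0.97, 0.98, 0.99, 1.0])

# Youden's J: vertical distance between the ROC curve and the chance diagonal
youden_j = tpr_grid - fpr_grid
best = int(np.argmax(youden_j))

# For comparison: first grid point where the TPR alone is largest
first_max_tpr = int(np.argmax(tpr_grid))

print(f"Youden's J picks FPR={fpr_grid[best]:.2f}, TPR={tpr_grid[best]:.2f}")
print(f"argmax(TPR) picks FPR={fpr_grid[first_max_tpr]:.2f}")
```

On a strictly increasing curve such as this toy one, the maximum TPR is only reached at FPR = 1, whereas Youden's J selects the point furthest above the diagonal.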


---

**Exercise 3.** Define the appropriate cross-validation strategy and measurement of the area under the curve for the data:


In [87]:
n_samples = 200
n_features = 5
i = 0

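# Candidate numbers of folds.
# Note: at this point `y` still holds the Exercise 1 labels (50 samples per
# class), so the range runs from 2 to 49 and is not capped by the
# minority-class size of the data generated in the next cell.
fold_range = np.arange(
    2,
    min(
        np.sum(y == 0),
        np.sum(y == 1),
    ),
)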

roc_sample = [[], []]
aucs = [[], []]
mean_fpr = np.linspace(0, 1, n_samples)


In [88]:
X, y = make_classification(
    n_samples=n_samples,
    n_features=n_features,
    n_redundant=0,
    n_informative=3,
    random_state=0,
    n_clusters_per_class=1,
    weights=[0.9],
)


In [89]:
px.scatter(x=X[:, 0], y=X[:, 1], color=pd.Categorical(y))


In [90]:
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.5,
    random_state=0,
)

print(
    f"*** Training ***",
    f"Elements of class 0: {1 - y_train.mean()}",
    f"Elements of class 1: {y_train.mean()}",
    " ",
    f"*** Testing ***",
    f"Elements of class 0: {1 - y_test.mean()}",
    f"Elements of class 1: {y_test.mean()}",
    " ",
    f"*** Training + Testing ***",
    f"Elements of class 0: {1 - y.mean()}",
    f"Elements of class 1: {y.mean()}",
    sep="\n",
)


*** Training ***
Elements of class 0: 0.94
Elements of class 1: 0.06
 
*** Testing ***
Elements of class 0: 0.87
Elements of class 1: 0.13
 
*** Training + Testing ***
Elements of class 0: 0.905
Elements of class 1: 0.095


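The data is heavily unbalanced (only 9.5% of the samples belong to class 1), so plain `RepeatedKFold` could produce folds with very few, or even no, samples of the minority class. We should use `StratifiedKFold` instead, which preserves the class proportions in every fold.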
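
To make the difference concrete, here is a minimal sketch that counts the minority-class samples in each test fold for both splitters, assuming the same data-generation parameters as above (the helper `minority_counts` and the names `X_imb`, `y_imb` are hypothetical, introduced only for this illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, StratifiedKFold

# Same unbalanced synthetic data as in the cells above
X_imb, y_imb = make_classification(
    n_samples=200,
    n_features=5,
    n_redundant=0,
    n_informative=3,
    random_state=0,
    n_clusters_per_class=1,
    weights=[0.9],
)


def minority_counts(splitter):
    """Number of class-1 samples in each test fold."""
    return [int(y_imb[test].sum()) for _, test in splitter.split(X_imb, y_imb)]


plain = minority_counts(KFold(n_splits=5, shuffle=True, random_state=0))
stratified = minority_counts(StratifiedKFold(n_splits=5))
print("KFold          :", plain)
print("StratifiedKFold:", stratified)
```

With stratification the per-fold counts differ by at most one, whereas a plain shuffled split gives no such guarantee.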

##### Investigate the right number of folds


In [91]:
res = {measure: [] for measure in ["mean", "std"]}

# Declare loop
for k in fold_range:

    # Model declaration
    lda = LinearDiscriminantAnalysis()

    # Initialize list
    accuracies_within_fold = []

    # Declare the stratified folding
    skf = StratifiedKFold(n_splits=k)
    for train_index, test_index in skf.split(X, y):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Fit model on X_train
        lda.fit(X_train, y_train)

        # Store Kfold accuracy
        accuracies_within_fold.append(accuracy_score(y_test, lda.predict(X_test)))

    # Average and std for a particular k fold
    res["mean"].append(np.mean(accuracies_within_fold))
    res["std"].append(np.std(accuracies_within_fold, ddof=1))


df_res = pd.DataFrame(res)
df_res.insert(loc=0, column="n_folds", value=fold_range)
df_res



The least populated class in y has only 19 members, which is less than n_splits=20.


[The same warning repeats for n_splits=21 through 30; the remaining output is truncated.]

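|   | n_folds | mean | std |
|---|---|---|---|
| 0 | 2 | 0.97 | 0.014142 |
| 1 | 3 | 0.969998 | 0.000261 |
| 2 | 4 | 0.97 | 0.02582 |
| 3 | 5 | 0.97 | 0.020917 |
| 4 | 6 | 0.969994 | 0.018607 |
| 5 | 7 | 0.970091 | 0.031096 |
| 6 | 8 | 0.97 | 0.028284 |
| 7 | 9 | 0.970136 | 0.031223 |
| 8 | 10 | 0.97 | 0.02582 |
| 9 | 11 | 0.970229 | 0.036982 |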


In [92]:
fig = go.Figure()
fig.add_traces(
    [
        go.Scatter(
            x=fold_range,
            y=df_res["mean"],
            name="mean"
        ),
        go.Scatter(
            x=fold_range,
            y=np.maximum(0, df_res["mean"] - df_res["std"]),
            line={"width": 1, "color": "rgba(0, 0, 255, 0.2)",},
            mode="lines",
            name="+/- 1 std",
        ),
        go.Scatter(
            x=fold_range,
            y=np.minimum(1, df_res["mean"] + df_res["std"]),
            fill="tonexty",
            fillcolor="rgba(0, 0, 255, 0.2)",
            mode="none",
            showlegend=False,
        ),
    ]
)
fig.update_layout(
    xaxis_title="Number of folds",
    yaxis_title="Accuracy",
    title="Accuracy vs. number of folds",
    font_family="Nato",
)

fig


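The mean accuracy is about 0.97 for every number of folds shown, so the standard deviation is what distinguishes them. Leaving aside `K=2`, which would correspond to a basic train-test split, the lowest standard deviations are obtained for `K=3` and `K=6`. These are therefore the two values we consider.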
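
Because the classes are so unbalanced, it can help to compare these accuracies with a trivial reference. The sketch below estimates the accuracy of a classifier that always predicts the majority class, assuming the same data-generation parameters as above; `DummyClassifier` is not used elsewhere in this notebook, and the names `X_imb`, `y_imb` and `baseline_acc` are only for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.dummy import DummyClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same unbalanced synthetic data as in the cells above
X_imb, y_imb = make_classification(
    n_samples=200,
    n_features=5,
    n_redundant=0,
    n_informative=3,
    random_state=0,
    n_clusters_per_class=1,
    weights=[0.9],
)

# Always predict the most frequent class seen in the training fold
baseline = DummyClassifier(strategy="most_frequent")
baseline_acc = cross_val_score(
    baseline,
    X_imb,
    y_imb,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=3),
).mean()
print(f"Majority-class accuracy: {baseline_acc:.3f}")
```

The baseline accuracy is close to the majority-class proportion, which is one reason to look at the AUC in addition to accuracy on this dataset.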

In [93]:
# Compute AUC for n_folds = 3
n_folds = 3

cv = StratifiedKFold(n_splits=n_folds)

# Declare variables
roc_sample = []
aucs = []
store_metrics = []
mean_fpr = np.linspace(0, 1, 100)
n_repeats = 3
fig = go.Figure()


# Declares loop
for i, (train, test) in enumerate(cv.split(X, y)):

    # Compute predictions
    predictions = lda.fit(X[train], y[train]).predict_proba(X[test])

    # Compute ROC and AUC
    fpr, tpr, _ = roc_curve(y[test], predictions[:, 1])
    roc_sample.append(np.interp(mean_fpr, fpr, tpr))

    # Make the ROC curve start at 0: force the first interpolated TPR value
    # of the sample that was just added to 0
    roc_sample[-1][0] = 0.0

    # Compute AUC
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)

    # Plot AUC
    fig.add_trace(
        go.Scatter(
            x=fpr,
            y=tpr,
            mode="lines",
            opacity=0.5,
            name=f"ROC fold {str(i).zfill(2)}, with AUC {roc_auc:.2f}",
        )
    )
    y_pred = lda.predict(X[test])

    # Compute accuracy, precision, recall and F1 score
    # (scikit-learn metrics expect y_true first, then y_pred)
    acc = accuracy_score(y[test], y_pred)
    prec = precision_score(y[test], y_pred)
    rec = recall_score(y[test], y_pred)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

    store_metrics.append([acc, prec, rec, f1])

fig.update_layout(
    title=f"Mean AUC: {np.mean(aucs):.4f} with {n_folds} folds",
    font_family="Nato",
)
fig


In [94]:
# this cell is just computing the ROC for n_folds=3
# but nothing is displayed because it will be displayed
# when we look at the ROC a bit below

# Compute ROC for n_folds = 3
fig = go.Figure()

mean_tpr = np.mean(roc_sample, axis=0)
std_tpr = np.std(roc_sample, axis=0)
thresh_low = np.maximum(mean_tpr - std_tpr, 0)
thresh_upp = np.minimum(mean_tpr + std_tpr, 1)

fig.add_traces(
    [
        # Add mean ROC curve
        go.Scatter(
            x=mean_fpr,
            y=mean_tpr,
            name="Mean ROC",
            opacity=0.9,
        ),
        # Add upper bound of the +/- 1 std band
        go.Scatter(
            x=mean_fpr,
            y=thresh_upp,
            opacity=0.1,
            marker={"color": "rgba(0,0,255,0.15)"},
            showlegend=False,
        ),
        # Add lower bound of the +/- 1 std band (filled up to the previous trace)
        go.Scatter(
            x=mean_fpr,
            y=thresh_low,
            name="+/- 1 std",
            opacity=0.1,
            fillcolor="rgba(0,0,255,0.15)",
            fill="tonexty",
            mode="none",
        ),
        # Add diagonal line
        go.Scatter(
            x=mean_fpr,
            y=mean_fpr,
            name="Chance",
            opacity=1,
            line={"width": 1, "color": "red"},
            line_dash="dash",
        ),
    ]
)

# Add trade-off line: FPR at which the mean TPR is highest, searching only
# the first half of the grid (FPR < 0.5). This restriction is a plotting
# choice to place the vertical line, not a statistical criterion.
trade_off = mean_fpr[np.argmax(mean_tpr[: len(mean_tpr) // 2])]
fig.add_vline(
    x=trade_off,
    annotation_text=f"TPR/FPR trade-off (FPR={trade_off:.2f})",
    line_dash="dash",
    annotation_position="top",
)
fig.update_layout(
    title=f"Mean ROC over all folds (on {n_folds} folds)",
    font_family="Nato",
    xaxis_title="FPR",
    yaxis_title="TPR",
)
fig_3_save = fig

In [95]:
# Compute AUC for n_folds = 6
n_folds = 6

cv = StratifiedKFold(n_splits=n_folds)

# Declare variables
roc_sample = []
aucs = []
store_metrics = []
mean_fpr = np.linspace(0, 1, 100)
n_repeats = 3
fig = go.Figure()


# Declares loop
for i, (train, test) in enumerate(cv.split(X, y)):

    # Compute predictions
    predictions = lda.fit(X[train], y[train]).predict_proba(X[test])

    # Compute ROC and AUC
    fpr, tpr, _ = roc_curve(y[test], predictions[:, 1])
    roc_sample.append(np.interp(mean_fpr, fpr, tpr))

    # Make the ROC curve start at 0: force the first interpolated TPR value
    # of the sample that was just added to 0
    roc_sample[-1][0] = 0.0

    # Compute AUC
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)

    # Plot AUC
    fig.add_trace(
        go.Scatter(
            x=fpr,
            y=tpr,
            mode="lines",
            opacity=0.5,
            name=f"ROC fold {str(i).zfill(2)}, with AUC {roc_auc:.2f}",
        )
    )
    y_pred = lda.predict(X[test])

    # Compute accuracy, precision, recall and F1 score
    # (scikit-learn metrics expect y_true first, then y_pred)
    acc = accuracy_score(y[test], y_pred)
    prec = precision_score(y[test], y_pred)
    rec = recall_score(y[test], y_pred)
    f1 = 2 * prec * rec / (prec + rec) if prec + rec > 0 else 0.0

    store_metrics.append([acc, prec, rec, f1])

fig.update_layout(
    title=f"Mean AUC: {np.mean(aucs):.4f} with {n_folds} folds",
    font_family="Nato",
)
fig


We see that the 3-fold strategy outperforms the 6-fold one in terms of mean AUC. Moreover, the 6-fold strategy performs very well on some splits but very poorly on others (it has more variance than its counterpart). My choice would therefore be the 3-fold strategy.
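
The same comparison can be summarised numerically. The sketch below is a more compact alternative to the loops above, not a replacement for them: it relies on `cross_val_score` with `scoring="roc_auc"` and assumes the same data-generation parameters (the names `X_imb`, `y_imb` and `auc_summary` are only for illustration).

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Same unbalanced synthetic data as in the cells above
X_imb, y_imb = make_classification(
    n_samples=200,
    n_features=5,
    n_redundant=0,
    n_informative=3,
    random_state=0,
    n_clusters_per_class=1,
    weights=[0.9],
)

auc_summary = {}
for k in (3, 6):
    # One AUC value per stratified fold
    fold_aucs = cross_val_score(
        LinearDiscriminantAnalysis(),
        X_imb,
        y_imb,
        scoring="roc_auc",
        cv=StratifiedKFold(n_splits=k),
    )
    auc_summary[k] = (fold_aucs.mean(), fold_aucs.std(ddof=1))
    print(f"K={k}: mean AUC {auc_summary[k][0]:.4f}, std {auc_summary[k][1]:.4f}")
```

The standard deviation across folds gives a rough measure of the spread discussed above.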

Now we compute and plot the ROC.

In [96]:
# See comment a few cells above
fig_3_save

In [97]:
# Compute ROC for n_folds = 6
fig = go.Figure()

mean_tpr = np.mean(roc_sample, axis=0)
std_tpr = np.std(roc_sample, axis=0)
thresh_low = np.maximum(mean_tpr - std_tpr, 0)
thresh_upp = np.minimum(mean_tpr + std_tpr, 1)

fig.add_traces(
    [
        # Add mean ROC curve
        go.Scatter(
            x=mean_fpr,
            y=mean_tpr,
            name="Mean ROC",
            opacity=0.9,
        ),
        # Add upper bound of the +/- 1 std band
        go.Scatter(
            x=mean_fpr,
            y=thresh_upp,
            opacity=0.1,
            marker={"color": "rgba(0,0,255,0.15)"},
            showlegend=False,
        ),
        # Add lower bound of the +/- 1 std band (filled up to the previous trace)
        go.Scatter(
            x=mean_fpr,
            y=thresh_low,
            name="+/- 1 std",
            opacity=0.1,
            fillcolor="rgba(0,0,255,0.15)",
            fill="tonexty",
            mode="none",
        ),
        # Add diagonal line
        go.Scatter(
            x=mean_fpr,
            y=mean_fpr,
            name="Chance",
            opacity=1,
            line={"width": 1, "color": "red"},
            line_dash="dash",
        ),
    ]
)

# Add trade-off line: FPR at which the mean TPR is highest, searching only
# the first half of the grid (FPR < 0.5). This restriction is a plotting
# choice to place the vertical line, not a statistical criterion.
trade_off = mean_fpr[np.argmax(mean_tpr[: len(mean_tpr) // 2])]
fig.add_vline(
    x=trade_off,
    line_dash="dash",
    annotation_text=f"TPR/FPR trade-off (FPR={trade_off:.2f})",
    annotation_position="top",
)
fig.update_layout(
    title=f"Mean ROC over all folds (on {n_folds} folds)",
    font_family="Nato",
    xaxis_title="FPR",
    yaxis_title="TPR",
)
fig


We see that both strategies perform fairly well, but a few observations can still be made:
- The variance is much smaller for the 3-fold strategy than for the 6-fold one.
- The trade-off points are at $(FPR, TPR) = (0.05, 0.85)$ and $(FPR, TPR) = (0.10, 0.85)$ for $K=3$ and $K=6$ respectively. Both are above 80% TPR, which is respectable.
- The ROC drops below the dashed "Chance" line and stops improving as FPR increases, before rising sharply at the very end. This could be due to overfitting, which is made easier by the fact that one class is heavily over-represented in our dataset.
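
One possible contributor to the larger variance with more folds is the small number of positives in each test fold: with $n$ positives, the TPR can only change in steps of $1/n$. A minimal sketch that counts them, assuming the same data-generation parameters as above (the names `X_imb`, `y_imb` and `positives_per_fold` are only for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedKFold

# Same unbalanced synthetic data as in the cells above
X_imb, y_imb = make_classification(
    n_samples=200,
    n_features=5,
    n_redundant=0,
    n_informative=3,
    random_state=0,
    n_clusters_per_class=1,
    weights=[0.9],
)

positives_per_fold = {}
for k in (3, 6):
    cv = StratifiedKFold(n_splits=k)
    counts = [int(y_imb[test].sum()) for _, test in cv.split(X_imb, y_imb)]
    positives_per_fold[k] = counts
    # With n positives in a fold, the TPR moves in steps of 1/n
    print(
        f"K={k}: positives per test fold {counts}, "
        f"largest TPR step {1 / min(counts):.2f}"
    )
```

Fewer positives per fold means coarser ROC curves, so individual folds can differ more from one another.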

___

## Assignments 3.2

**Exercise 1.** During the lesson we discussed the problem of _selection bias_ in cross-validation.
This problem is nicely investigated in the paper _On the Dangers of Cross-Validation. An Experimental Evaluation_, accessible here:

http://people.csail.mit.edu/romer/papers/CrossVal_SDM08.pdf

Read

- Section 1 (Introduction),
- Section 4 (Experiments on Synthetic Data),
- Section 7 (Discussion)

And write a short summary (~half a page) about these three sections (results and take home message).


#### Summary:

In this paper, the authors show that although CV and leave-one-out (LOO) CV are widely used in industry and in research, they are often over-used and/or misused. Indeed, most people in the field have some understanding of the drawbacks attached to CV, but when data is hard to collect, CV-related temptations can infect even the best practitioners.

The authors use a LOOCV strategy to train and evaluate $M$ algorithms on a synthetic dataset with a true error of $\frac{1}{2}$. The reported accuracy is averaged over multiple trials and computed for several values of $M$. The result is that as $M$ increases, the accuracy rises from $61.9\%$ ($M=10$) to $85.6\%$ ($M=10^6$), although the true error remains $\frac{1}{2}$.
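
The effect can be reproduced qualitatively with a toy simulation. This is a sketch in the spirit of the experiment summarised above, not the paper's actual protocol: it assumes that each "algorithm" simply outputs random predictions, skips the LOO loop entirely, and uses an arbitrary dataset size.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 100  # assumed dataset size, chosen only for illustration

# Labels are fair coin flips: no classifier can beat 50% true accuracy
y_true = rng.integers(0, 2, size=n_samples)

best_acc = {}
for n_algorithms in [1, 10, 100, 1000, 10000]:
    # Each "algorithm" outputs random predictions, independent of the labels
    preds = rng.integers(0, 2, size=(n_algorithms, n_samples))
    accuracies = (preds == y_true).mean(axis=1)
    # Selecting the best algorithm on this data is what creates the bias
    best_acc[n_algorithms] = float(accuracies.max())
    print(f"M={n_algorithms:>5}: best observed accuracy {best_acc[n_algorithms]:.2f}")
```

Every candidate has a true accuracy of one half, yet the best observed accuracy tends to grow with the number of candidates, because the maximum of many noisy estimates is optimistic.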

The authors warn that CV may become ineffective and should be used with caution. The penalties for not doing so are increased variance, overfitting and therefore inefficiency. The paper suggests keeping a “sequestered” test set, to be used only once training and validation are completed. The errors on the validation and test sets would then be compared to check that they are close. The idea is to prevent data leakage, for instance selecting features on the whole dataset (*e.g.* using correlation) before performing the train-test split. This is a major methodological mistake and should be avoided.
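
The feature-selection leak mentioned above can be illustrated on pure noise. The following is a minimal sketch under assumed settings (50 samples, 2000 noise features, 10 selected features, ANOVA F-test selection rather than correlation), none of which come from the paper: selecting features on the whole dataset before cross-validating is compared with refitting the selection inside each training fold through a pipeline.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Pure noise: the features carry no information about the labels
X_noise = rng.normal(size=(50, 2000))
y_noise = np.repeat([0, 1], 25)

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Leaky: features are selected using ALL samples, then cross-validated
X_selected = SelectKBest(f_classif, k=10).fit_transform(X_noise, y_noise)
leaky_acc = cross_val_score(
    LinearDiscriminantAnalysis(), X_selected, y_noise, cv=cv
).mean()

# Leak-free: the selection step is refit on each training fold only
pipe = make_pipeline(SelectKBest(f_classif, k=10), LinearDiscriminantAnalysis())
honest_acc = cross_val_score(pipe, X_noise, y_noise, cv=cv).mean()

print(f"Selection before splitting: {leaky_acc:.2f}")
print(f"Selection inside each fold: {honest_acc:.2f}")
```

Since the data is noise, the leak-free estimate should stay near chance level, while the leaky estimate is typically far too optimistic.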