# Estimating the quality of classifiers
## The learning curve and the decision regions of several classifiers on the Iris dataset
### This notebook is based on the notebook by Gabriel Shiu available on Kaggle at https://www.kaggle.com/code/gabrielshiu/iris-investigation-voting-learning-curve

In [None]:
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# from sklearn.inspection import plot_partial_dependence

%matplotlib inline
sns.set_style('whitegrid')


In [None]:
# Reset the seed of the random number generator, for reproducibility purposes

import os

def reset_seed(SEED = 0):
    """Reset the seed for every random library in use (System, numpy)"""

    os.environ['PYTHONHASHSEED']=str(SEED)
    np.random.seed(SEED)


reset_seed(2023)

In [None]:
# Import the Iris dataset from the sklearn library. 

from sklearn.datasets import load_iris

X, y = load_iris(return_X_y=True, as_frame=True)

# We already know from an earlier assignment that the sepal features do not help much with the classification
# We drop them. 
X.drop(['sepal length (cm)', 'sepal width (cm)'], axis=1, inplace=True)

# Split into train and test
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X.values, 
    y, 
    test_size=0.20, 
    shuffle=True,
    random_state=100,
    stratify=y,
)

from sklearn.preprocessing import StandardScaler

stand_scaler = StandardScaler()
stand_scaler = stand_scaler.fit(X_train)
X_train = stand_scaler.transform(X_train)
X_test = stand_scaler.transform(X_test)

In [None]:
# We evaluate several different classifiers on the Iris dataset
# To reduce the chance that the performance somehow comes from the specific validation dataset, 
#     we do a 10-fold cross-validation.
# The data is split into 10 equal bins (in a stratified way). 9 of them are used to train, 1 to validate.


from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)


# Here is the list of models we will evaluate, most already used in previous assignments

from sklearn.naive_bayes import GaussianNB, BernoulliNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

estimators = [
    KNeighborsClassifier(n_neighbors=5),
    SVC(),
    DecisionTreeClassifier(max_depth=None),
    RandomForestClassifier(n_estimators=100, max_depth=None),
    GaussianNB(),
    LogisticRegression(max_iter=1000),
    BernoulliNB(),
    LinearDiscriminantAnalysis(),
]

In [None]:
SMALL_SIZE = 12
MEDIUM_SIZE = 18
BIGGER_SIZE = 22
plt.rc('axes', titlesize=BIGGER_SIZE)     # fontsize of the axes title
plt.rc('axes', labelsize=MEDIUM_SIZE)    # fontsize of the x and y labels
plt.rc('xtick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('ytick', labelsize=SMALL_SIZE)    # fontsize of the tick labels
plt.rc('legend', fontsize=MEDIUM_SIZE)    # legend fontsize
plt.rc('figure', titlesize=BIGGER_SIZE)  # fontsize of the figure title
plt.rc('font', size=SMALL_SIZE)          # controls default text sizes

In [None]:
from sklearn.model_selection import cross_val_score

scores = pd.DataFrame(columns=['Estimator', 'CV Scores mean', 'CV Scores Variance'])

for i in range(len(estimators)):
    est = estimators[i]
    est_name = est.__class__.__name__
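    # Fit on the full training set. cross_val_score works on clones of the estimator,
    # so this fitted instance is the one the decision-region plots reuse later.
    est.fit(X_train, y_train)
    cv_scores = cross_val_score(est, X_train, y_train, cv=skf, n_jobs=-1)
    scores.loc[i] = [est_name, cv_scores.mean(), cv_scores.var()]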
    
scores.sort_values(by='CV Scores mean', ascending=False, inplace=True)
print(scores)

plt.figure(figsize=(20, 10))
sns.barplot(x=scores['CV Scores mean'], y=scores['Estimator'])
plt.show()

## Learning curves
#### Source: https://towardsdatascience.com/learning-curve-to-identify-overfitting-underfitting-problems-133177f38df5

A learning curve shows how the training and validation loss change as training examples are added incrementally. It helps to identify whether additional training examples would improve the validation score (the score on unseen data). If a model is overfit, adding training examples might improve its performance on unseen data. If a model is underfit, adding training examples does not help. Note that the descriptions below are phrased in terms of loss, whereas the plots in this notebook show the score (accuracy), so the curves appear flipped: a low loss corresponds to a high score.

#### Typical features of the learning curve of a well fit model
- Training loss and validation loss are close to each other, with the validation loss slightly greater than the training loss.
- Training and validation loss decrease initially, then stay fairly flat from some point until the end.

#### Typical features of the learning curve of an overfit model
- Training loss and Validation loss are far away from each other.
- Gradually decreasing validation loss (without flattening) upon adding training examples.
- Very low training loss that’s very slightly increasing upon adding training examples.

#### Typical features of the learning curve of an underfit model
- Increasing training loss upon adding training examples.
- Training loss and validation loss are close to each other at the end.
- Sudden dip in the training loss and validation loss at the end (not always).
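
The patterns above can be checked numerically as well as visually. The following is a minimal sketch, not part of the original notebook: it assumes the petal features of Iris and an unrestricted decision tree, and the variable `gap` (mean training score minus mean validation score) is a name introduced here for illustration. Because scikit-learn reports scores rather than losses, a large positive gap that persists as data is added points towards overfitting.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import StratifiedKFold, learning_curve
from sklearn.tree import DecisionTreeClassifier

# Petal length and petal width only, as in the rest of this notebook
X, y = load_iris(return_X_y=True)
X = X[:, 2:]

cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)

# shuffle=True so that small training subsets contain a mix of classes
sizes, train_scores, valid_scores = learning_curve(
    DecisionTreeClassifier(random_state=0), X, y,
    cv=cv, train_sizes=np.linspace(0.2, 1.0, 5),
    shuffle=True, random_state=0,
)

# Average over the 10 folds for each training-set size
train_mean = train_scores.mean(axis=1)
valid_mean = valid_scores.mean(axis=1)
gap = train_mean - valid_mean  # hypothetical helper quantity

for n, tr, va, g in zip(sizes, train_mean, valid_mean, gap):
    print(f"n={n:3d}  train={tr:.3f}  valid={va:.3f}  gap={g:.3f}")
```

A gap that shrinks towards zero while both scores stay high matches the well-fit description; two low scores that sit close together match the underfit one.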

In [None]:
# Check how the model learns with more data
# Train with just 10% of the training data, then with 20%, etc.
# We should see how the model quality evolves with more data to train on

from sklearn.model_selection import learning_curve

def plot_learning_curve(model, X, y, title, cv, train_sizes, plot_location):
    train_sizes, train_scores, valid_scores = learning_curve(
        model, X, y, cv=cv, n_jobs=-1, random_state=0, train_sizes=train_sizes
    )
    
    ax = plt.subplot(4, 2, plot_location)
    # Same y-range on every subplot so that the models are directly comparable
    ax.set(title=title, xlabel="Training examples", ylabel="Score", ylim=(0.4, 1.01))
    
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    valid_scores_mean = np.mean(valid_scores, axis=1)
    valid_scores_std = np.std(valid_scores, axis=1)
    
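    # Pass True explicitly: a bare plt.grid() toggles the grid, and the
    # 'whitegrid' seaborn style has already switched it on.
    plt.grid(True)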
    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                train_scores_mean + train_scores_std, alpha=0.3,
                color="g")
    plt.fill_between(train_sizes, valid_scores_mean - valid_scores_std,
                valid_scores_mean + valid_scores_std, alpha=0.1, color="r")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="g",
        label="Mean train score")
    plt.plot(train_sizes, valid_scores_mean, 'o-', color="r",
        label="Mean CV score")
    plt.legend(loc="best")

In [None]:
# Plot the learning curves on the training and on the validation datasets

plt.figure(figsize=(20,35))

for i in range(len(estimators)):
    plot_learning_curve(estimators[i], X_train, y_train,
                        estimators[i].__class__.__name__, skf, np.linspace(0.1, 0.9, 20),
                        plot_location=i+1
                       )    
plt.show()

#### Some conclusions about these models
- Bernoulli naive Bayes has the lowest score. With enough training data, its learning curves on the training and on the validation data are close to each other, which shows that the model is not overfit. Its poor score is probably because the data is simply not Bernoulli distributed: BernoulliNB expects binary features and, by default, binarises each continuous feature at a threshold of 0.
- The other 7 models have excellent cross-validation scores (calculated as the average of all CV models' score on their own validation data sets). Some differences between them can be seen in the learning curves. 
- The decision tree and the random forest classifiers have the biggest gap between the learning curves on the training and on the validation data sets. Also, the band around the training curve is very narrow, showing that these models fit the training data almost perfectly regardless of changes in the training set. Worse, the training learning curve is about one standard deviation away from the validation learning curve. This suggests that these models are overfit.
- The other models seem to be well fit: the two learning curves train/validation are close to each other, their +/- standard deviation bands overlap, with more variation of course on the validation band. Their score stabilises quickly, suggesting they can fit well even with relatively little training data from the Iris dataset. 
- The models differ widely in HOW they make their predictions. We can visualise this below through plotting the decision boundaries. See if you can confirm the points made here after seeing the decision boundaries. 

#### Decision boundaries

A decision boundary is a hypersurface (in our 2-dimensional case, a curve) that splits the n-dimensional space of the data points into the regions assigned to each of the K classes the model was trained on. Once the model is fit on the training dataset, it can predict the class of any data point in that space, and the decision boundaries separate the sub-regions corresponding to each of the K classes. For linear models these boundaries are hyperplanes (straight lines in 2-D); for other models they can be curved or piecewise. The shape of the decision boundaries differs from one model to another, depending on the mathematical functions underlying each model.
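
To make the idea concrete, the sketch below computes decision regions by hand instead of relying on a plotting helper. It is an illustration under assumptions, not the notebook's method: it uses unscaled petal features, a logistic regression model, and a 200 x 200 grid, and the names `grid`, `regions` and `boundary` are introduced here. The model predicts a class for every grid point, and the boundary is approximated as the places where horizontally neighbouring grid points receive different labels.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

# Petal length and petal width only (unscaled here, for simplicity)
X, y = load_iris(return_X_y=True)
X = X[:, 2:]

model = LogisticRegression(max_iter=1000).fit(X, y)

# Regular grid covering the data range with a small margin
xx, yy = np.meshgrid(
    np.linspace(X[:, 0].min() - 0.5, X[:, 0].max() + 0.5, 200),
    np.linspace(X[:, 1].min() - 0.5, X[:, 1].max() + 0.5, 200),
)
grid = np.c_[xx.ravel(), yy.ravel()]

# One predicted class per grid point, reshaped back to the grid layout
regions = model.predict(grid).reshape(xx.shape)

# True where two horizontally adjacent grid points get different classes
boundary = regions[:, 1:] != regions[:, :-1]

print("classes present on the grid:", np.unique(regions))
print("grid cells lying on a boundary:", int(boundary.sum()))
```

Passing `xx`, `yy` and `regions` to `plt.contourf` would draw the same kind of coloured regions as the plots in this section.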

In [None]:
from sklearn.inspection import DecisionBoundaryDisplay

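# Plot the decision regions of each estimator.
# The estimators were already fitted on X_train in the CV-scores cell above.
# X_train is standardised, so both axes are in standardised units, not in cm.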

fig, axes = plt.subplots(4, 2, figsize=(20,35))

for i in range(len(estimators)):
    axes[i//2, i%2].set(title=estimators[i].__class__.__name__)
    disp = DecisionBoundaryDisplay.from_estimator(
        estimators[i], X_train, response_method="predict",
        xlabel=X.columns[0], ylabel=X.columns[1],
        alpha=0.5,
        ax=axes[i//2, i%2],
        plot_method='contourf'
    )
    disp.ax_.scatter(X_train[:, 0], X_train[:, 1], c=y_train, edgecolor="k")

plt.show()

# Assignment 4

- Load the Wisconsin breast cancer dataset from OpenML (ID 43757) in the same way you did in assignment 3. 
- Train the same 8 classifiers we used in the tutorial part of this notebook, display their CV scores, learning curves and decision boundaries as done above. 
- Which models seem overfit?
- Focus on the decision tree and on the random forest models. Train each of them in 4 different variants, with the parameter min_samples_split taking the values 2, 5, 10, and 100. This parameter controls the minimum number of data points an internal node must contain (an internal node is one that the model decides to split through a further query; any node that receives fewer points than min_samples_split becomes a leaf in the tree/forest). Looking at their learning curves (their scores are close to each other), which one seems to be the best-fit model?
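
To get a feel for what min_samples_split does before starting, here is a small sketch on the Iris petal features rather than on the assignment dataset. It is only an illustration under assumptions: the dictionary `leaves` and the loop variable `mss` are names introduced here, and counting leaves is just one convenient way to see how the parameter limits tree growth.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

# Petal length and petal width only, as in the tutorial part
X, y = load_iris(return_X_y=True)
X = X[:, 2:]

leaves = {}  # hypothetical container: min_samples_split -> number of leaves
for mss in (2, 5, 10, 100):
    tree = DecisionTreeClassifier(min_samples_split=mss, random_state=0)
    tree.fit(X, y)
    leaves[mss] = tree.get_n_leaves()
    print(f"min_samples_split={mss:3d}  leaves={leaves[mss]}  "
          f"train accuracy={tree.score(X, y):.3f}")
```

A larger min_samples_split stops splitting earlier, which gives a smaller tree that follows the training data less closely.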