# TP2: MACHINE LEARNING APPLICATION

Following the previous notebook, we will go over the fundamentals of machine learning applied to a classification task: binary classification between Low Grade Glioma (LGG) and High Grade Glioma (HGG). Of course, the aim is not to find the best pipeline with the best algorithm for this task but to understand and experiment with the concepts. First, download the necessary materials for the practical session.

In [None]:
!git clone https://github.com/jeannebc/Radiomics-BrainTumorClassification.git
%cd Radiomics-BrainTumorClassification

### Installation of python dependencies

In [None]:
!pip install scikit-learn pandas matplotlib numpy


## General Machine Learning Steps

Before we start, let's review the basic steps of machine learning:

1. Data collection, preprocessing (e.g., integration, cleaning, etc.), and exploration; Splitting a dataset into the **training** and **testing** sets
2. Model development:  
    A. Let us define a model $\{f\}$ as a collection of candidate functions $f_{w}$, each parametrized by ${w}$;  
    B. Let us define a **cost function** $C({w})$ which quantifies "how well a particular $f_{w}$ can explain the training data." The lower the cost function the better;  
3. **Training:** employ an algorithm that finds the best (or good enough) function $f^{*}$ in the model that minimizes the cost function over the training dataset.
4. **Testing**: evaluate the performance of the learned $f^{*}$ using the testing dataset;
5. Apply the model in the real world !
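The steps above can be sketched end to end on toy data. This is a minimal illustration, assuming synthetic one-dimensional data and a hand-written gradient descent; all names (`x_syn`, `f_w`, `cost`, ...) are hypothetical and not part of the notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# Step 1: synthetic 1-D data (assumption for illustration), then a train/test split
x_syn = np.concatenate([rng.normal(-2.0, 1.0, 100), rng.normal(2.0, 1.0, 100)])
y_syn = np.concatenate([np.zeros(100), np.ones(100)])
order = rng.permutation(200)
x_syn, y_syn = x_syn[order], y_syn[order]
x_tr_syn, y_tr_syn = x_syn[:140], y_syn[:140]
x_te_syn, y_te_syn = x_syn[140:], y_syn[140:]

# Step 2A: model family f_w(x) = sigmoid(w0 + w1 * x), parametrized by w
def f_w(w, x):
    return 1.0 / (1.0 + np.exp(-(w[0] + w[1] * x)))

# Step 2B: cost function C(w), here the mean cross-entropy (lower is better)
def cost(w, x, y):
    p = np.clip(f_w(w, x), 1e-12, 1 - 1e-12)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Step 3: training = searching for w that minimizes C(w) on the training set
w_syn = np.zeros(2)
initial_cost = cost(w_syn, x_tr_syn, y_tr_syn)
for _ in range(500):
    err = f_w(w_syn, x_tr_syn) - y_tr_syn
    grad = np.array([err.mean(), (err * x_tr_syn).mean()])
    w_syn -= 0.5 * grad
final_cost = cost(w_syn, x_tr_syn, y_tr_syn)

# Step 4: testing on samples that were not used for training
test_accuracy = np.mean((f_w(w_syn, x_te_syn) > 0.5) == y_te_syn)
print(initial_cost, final_cost, test_accuracy)
```

In practice, libraries such as scikit-learn hide steps 2 and 3 behind a single `fit` call.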

## Loading data

In [None]:
import pandas as pd
from IPython.display import display

path_dataset = './data/radiomics_analysis_cleaned.csv'

data = pd.read_csv(path_dataset)
# we will only work with the full area segmentation with all sequences
data = data[data['segmentation']=='full']

data = data.pivot_table(index=['patient', 'label'],
                                columns=['sequence', 'segmentation'],
                                values=data.columns[4:])
data.columns = ['_'.join(col).strip() for col in data.columns.values]
data.reset_index(level=1, inplace=True)

display(data)

# Convert LGG into class 0 and HGG into class 1
data.loc[data['label'] == 'HGG', 'label'] = 1
data.loc[data['label'] == 'LGG', 'label'] = 0

## Splitting the data

Let’s now use the <code>train_test_split</code> function from scikit-learn to divide feature data (x_data) and target data (y_data, 0 or 1) even further into train and test cohorts. Here we will have 30% of the data for the test set. It is also a good practice to define a random state for reproducible results.

Note: the <code>stratify</code> parameter in the function ensures that class proportions are maintained in the split. For example, if variable y is a binary categorical variable with values 0 and 1, and there are 25% of zeros and 75% of ones, <code>stratify=y_data</code> will make sure that the 25%/75% proportions are also preserved in the training and testing sets.
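The effect of stratification can be checked on a toy label vector. This is a small sketch, assuming made-up labels with 25% zeros and 75% ones; `X_toy` and `y_toy` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels (assumption for illustration): 25 zeros and 75 ones
y_toy = np.array([0] * 25 + [1] * 75)
X_toy = np.arange(100).reshape(-1, 1)

_, _, y_tr_toy, y_te_toy = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=0, stratify=y_toy
)

# Fraction of ones in each subset
print('train:', y_tr_toy.mean(), 'test:', y_te_toy.mean())
```

Without `stratify`, the two fractions would generally differ from one random split to the next.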

In [None]:
x_data, y_data = data.drop(columns='label'), data['label'].astype(int).to_numpy()

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x_data, y_data, test_size=0.3, random_state=123, stratify=y_data)
print('x_train shape: ', x_train.shape)
print('x_test shape: ', x_test.shape)
print('y_train shape: ', y_train.shape)
print('y_test shape: ', y_test.shape)

We now have our training set to fit and validate our model. The test set is considered like an unseen set and **will be used for performance evaluation only**. Indeed, to be relevant, model evaluation has to be done on data that your model hasn’t seen before.

Nevertheless, the strategy to evaluate your model depends on your goal and the chosen approach:

 - **Scenario 1: Running a simple training**  
Split the dataset into separate training and testing sets. Train the model on the former, evaluate the model on the latter. Evaluation is done by a variety of performance metrics such as mean error, precision, recall, ROC auc, etc.

 - **Scenario 2: Training a model and tuning (optimize) its hyperparameters.**  
Split the dataset into separate training and test sets. Use techniques such as k-fold cross-validation on the training set to find the “optimal” set of hyperparameters for the model. After hyperparameter tuning, use the independent test set to get an unbiased estimate of its performance.

 - **Scenario 3: Comparing multiple models to identify the best architecture (e.g., SVM vs. logistic regression vs. Random Forests, etc.).**  
In this case, we use nested or double cross-validation which leverages an inner and an outer k-fold validation loop. The inner loop chooses the best model and tunes hyperparameters. The outer loop evaluates the resulting choice on unseen folds. Once the optimal model is identified, we evaluate it on the held out test set.

![Training options](https://raw.githubusercontent.com/jeannebc/Radiomics-BrainTumorClassification/main/images/evaluate_overview.png)
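Scenario 3 can be sketched with scikit-learn by wrapping a grid search (inner loop) inside `cross_val_score` (outer loop). This is only an illustration on synthetic data, assuming a logistic regression pipeline and an arbitrary grid of `C` values; names ending in `_demo` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data standing in for the training set (assumption)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)  # tunes hyperparameters
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)  # evaluates the tuned model

pipeline_demo = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
search_demo = GridSearchCV(
    pipeline_demo,
    param_grid={'logisticregression__C': [0.01, 0.1, 1, 10]},
    cv=inner_cv,
)

# Each outer fold reruns the full inner search on its own training part
nested_scores = cross_val_score(search_demo, X_demo, y_demo, cv=outer_cv)
print(nested_scores.mean(), nested_scores.std())
```

Putting the scaler inside the pipeline means it is refitted on each training fold, so no information from the held-out fold leaks into preprocessing.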

# 1) Introduction to data preprocessing

Data pre-processing is a crucial step in machine learning because the quality of the data and its interpretability directly affect the model's ability to learn useful information.

A researcher in AI will spend a large part of their time on data cleaning and processing. You will often hear: **Garbage in, garbage out!**


### Handling NULL and NaN (=Not A Number) Values

In any real-world dataset there are null values. Typically, models cannot handle these NULL or NaN values on their own, so they need to be removed or replaced in the data. In pandas, a missing value is represented with NaN, which stands for *Not a Number* (for instance, in NumPy, dividing zero by zero results in NaN).
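A short sketch of how NaN behaves, assuming a made-up two-column table (`toy` is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Silence the floating-point warnings raised by the two divisions below
with np.errstate(divide='ignore', invalid='ignore'):
    zero_over_zero = np.float64(0.0) / np.float64(0.0)  # undefined result: NaN
    one_over_zero = np.float64(1.0) / np.float64(0.0)   # infinite result: inf, not NaN

# NaN is never equal to anything, including itself, so use isnan / isnull to detect it
nan_equals_nan = (np.nan == np.nan)

# In a numeric pandas column, None is stored as NaN
toy = pd.DataFrame({'a': [1.0, None, 3.0], 'b': [np.nan, 2.0, 2.5]})
nan_counts = toy.isnull().sum()

print(np.isnan(zero_over_zero), np.isinf(one_over_zero), nan_equals_nan)
print(nan_counts)
```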

The first step is to identify potential NULL values in our data using the <code>isnull()</code> method.

In [None]:
data.isnull().sum()
# Returns the column names along with the number of NaN values in that particular column (we can specify the axis=1, if we want rows)

In [None]:
print(data.isnull().sum().any())# add .any() if you want to identify any NaN values in the sum (returns bool)

Luckily, there are no NULL values in our dataset. Nevertheless, let us look at some strategies to handle them just in case ...

#### 1. Quick and easy : dropping (i.e. removing) the rows or columns that contain NULL values.

In [None]:
data.dropna(); # axis=0 (default) drops rows, axis=1 drops columns; dropna() has other parameters such as 'how' and 'thresh',
# take a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html

#### 2. More evolved : imputation

Imputation is defined by the substitution of the missing values of our dataset. The NULL values can be replaced by mean, max, 0, a custom function ... We can even train another algorithm to predict the values of the missing features.

In [None]:
data.fillna(0); # will replace NaN values by a 0, take a look at https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.fillna.html
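Mean imputation can also be done with scikit-learn's `SimpleImputer`. The sketch below assumes tiny made-up arrays and learns the replacement values on the training part only, in the same way the scalers are fitted later in this notebook; `X_train_toy` and `X_test_toy` are hypothetical names.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy arrays with missing entries (assumption for illustration)
X_train_toy = np.array([[1.0, 10.0],
                        [np.nan, 20.0],
                        [3.0, np.nan]])
X_test_toy = np.array([[np.nan, np.nan]])

imputer = SimpleImputer(strategy='mean')
X_train_imputed = imputer.fit_transform(X_train_toy)  # column means learned on the training data
X_test_imputed = imputer.transform(X_test_toy)        # the same means are reused on the test data

print(X_train_imputed)
print(X_test_imputed)
```

Other strategies such as `'median'` or `'most_frequent'` are available through the same class.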

### Standardization/Normalization

Normalization rescales the data to ensure the values of different features fall in the specified range, e.g. [0,1].

Indeed, different radiomics features have different units and ranges. Some features are designed to be in the range [0,1], while others can have a very large range. For many machine learning algorithms, the objective function will not give relevant results without normalization. For example, many classifiers calculate the distance between two points using the Euclidean distance. If one of the features has a wide range of values, the distance will be governed by that particular feature, as the contributions of all other features become negligible. Normalization ensures each feature contributes comparably to the final distance. Let us introduce the most common approaches to normalization.
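The distance argument can be made concrete with four made-up samples and two features, one in [0, 1] and one in the thousands. This is an illustrative sketch with hypothetical names (`X_toy`, `dist`).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Column 0 lies in [0, 1]; column 1 spans thousands (toy values, assumption)
X_toy = np.array([[0.1, 1000.0],
                  [0.9, 1010.0],
                  [0.1, 3000.0],
                  [0.5, 2000.0]])

def dist(a, b):
    return float(np.linalg.norm(a - b))

# Samples 0 and 1 differ across the full range of column 0,
# samples 0 and 2 differ across the full range of column 1
raw_01, raw_02 = dist(X_toy[0], X_toy[1]), dist(X_toy[0], X_toy[2])

X_scaled = MinMaxScaler().fit_transform(X_toy)
scaled_01, scaled_02 = dist(X_scaled[0], X_scaled[1]), dist(X_scaled[0], X_scaled[2])

print('raw distances:   ', raw_01, raw_02)
print('scaled distances:', scaled_01, scaled_02)
```

Before scaling, the two distances differ by a factor of about 200 and column 0 is practically ignored; after scaling, both differences weigh about the same.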

#### 1. Min-Max scaling

This estimator scales and translates each feature individually such that it is in the desired range, e.g. between zero and one.

$$ X_{norm} = \frac{X - X_{min}}{X_{max}-X_{min}} $$

Scikit-learn directly implements this for us:

In [None]:
from sklearn.preprocessing import MinMaxScaler

scaler = MinMaxScaler()
scaler.fit(x_train)
x_train_standardize = scaler.transform(x_train)
# We apply the same transform on the test
x_test_standardize = scaler.transform(x_test)
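To connect the formula with the scaler, the sketch below applies the min-max formula by hand on made-up arrays and compares the result with `MinMaxScaler`. The arrays and the `_demo` names are assumptions for illustration.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

X_tr_demo = np.array([[1.0, 200.0], [3.0, 400.0], [5.0, 800.0]])
X_te_demo = np.array([[2.0, 1000.0]])

# X_min and X_max are computed on the training data only
x_min, x_max = X_tr_demo.min(axis=0), X_tr_demo.max(axis=0)
X_te_manual = (X_te_demo - x_min) / (x_max - x_min)

scaler_demo = MinMaxScaler().fit(X_tr_demo)
X_te_sklearn = scaler_demo.transform(X_te_demo)

print(X_te_manual, X_te_sklearn)
```

Note that a test value larger than the training maximum is mapped above 1: the [0, 1] range is only guaranteed on the data used to fit the scaler.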

#### 2. Z-Score normalization

It standardizes features by removing the mean and scaling to unit variance. In other words, we transform our values such that the mean of the values is 0 and the standard deviation is 1.

$$
Z = \frac{x_i - \mu}{\sigma}
$$  
with mean: $$\mu = \frac{1}{N} \sum_{i=1}^N (x_i)$$
and standard deviation: $$\sigma = \sqrt{\frac{1}{N} \sum_{i=1}^N (x_i - \mu)^2}$$

**Exercise:**
Complete the following script to:

- perform Z-score normalization on <code>x_train</code> and apply it to <code>x_test</code> without using the scikit-learn function.

In [None]:
#@title Exercise 1

# How is z-score normalization implemented?

import numpy as np
import operator
import pandas as pd


def z_score(X):
    # zero mean and unit variance
    mean =  '''CompleteHere'''
    std_dev =  '''CompleteHere'''
    z =  '''CompleteHere'''
    return z, mean, std_dev


x_train_standardize, mean, std_dev = '''CompleteHere'''(x_train)
x_test_standardize = (('''CompleteHere''' - mean) / std_dev)


In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(x_train)
x_train_standardize = scaler.transform(x_train)
# We apply the same transform on the test set
x_test_standardize = scaler.transform(x_test)

There are many other techniques. See this page, which
 <a href="https://scikit-learn.org/stable/auto_examples/preprocessing/plot_all_scaling.html#sphx-glr-auto-examples-preprocessing-plot-all-scaling-py">compares the effect of different scalings on data with outliers</a>.

# 2) Building models

#### Standard validation (Hold-Out method)

Previously, we have separated our data into train and test sets. We used the complete training set to derive estimates of means and variances for each feature in order to perform Z-score normalization.

We now further split the training set into a smaller training set and a validation set. The training set samples will all be used to optimize the parameters of a model (i.e. iteratively converge to a satisfying model from the input family of functions), and the validation set will be used to select a good set of hyper-parameters such as the model type (here, logistic regression, decision trees, support vector machines).

The function `train_test_split()` from scikit-learn is also used on the pair (features, labels) of the training set:

![Training options](https://raw.githubusercontent.com/jeannebc/Radiomics-BrainTumorClassification/main/images/holdout_method.png)

In [None]:
train_features, validation_features, train_labels, validation_labels = \
  train_test_split(x_train, y_train, test_size=0.3, random_state=123, stratify=y_train)

print('train_features shape: ', train_features.shape)
print('validation_features shape: ', validation_features.shape)
print('train_labels shape: ', train_labels.shape)
print('validation_labels shape: ', validation_labels.shape)

#### Example : building a model with `scikit-learn`

Because of its simplicity of use and its consistency, `scikit-learn` (or `sklearn`) has become the leading data science library. It implements a wide range of ML and data processing algorithms. Let us look at some examples of models and training pipelines from `sklearn`.

Logistic regression is a very popular classifier in the medical field. It models the relationship between a binary response variable $y$ and an input $x \in R^k$ of $k$ features by applying a logistic (sigmoid) function to a linear combination of the features. The goal of training is then to find the weights or coefficients of the linear function such that the output is as close as possible to the true label.
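As a worked example of the prediction rule, the sketch below computes the output of a logistic regression for a single sample. The weights, intercept and sample values are hypothetical numbers chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    # Maps any real-valued score to a probability in (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

w_demo = np.array([2.0, -1.0])  # hypothetical weights, one per feature
b_demo = 0.5                    # hypothetical intercept
x_demo = np.array([1.0, 2.5])   # one sample with k = 2 features

z_demo = float(w_demo @ x_demo + b_demo)  # linear score: 2*1 - 1*2.5 + 0.5 = 0
p_demo = sigmoid(z_demo)                  # probability of class 1: sigmoid(0) = 0.5

print(z_demo, p_demo)
```

A positive score gives a probability above 0.5 (class 1), a negative score a probability below 0.5 (class 0).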

While all of the training pipeline of logistic regression can be hand-coded, its implementation in `sklearn` fits in two lines only:

In [None]:
from sklearn.linear_model import LogisticRegression
logistic_regression_model = LogisticRegression()  # instantiate a logistic regression model with default parameters
print(logistic_regression_model)

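Training is then performed using the `fit` function, which runs an iterative optimization algorithm (by default the L-BFGS solver in scikit-learn, a quasi-Newton relative of gradient descent) with the input hyper-parameters, and stops once a stopping criterion is met, such as a tolerance on the improvement (`tol`) or a maximum number of iterations (`max_iter`):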

In [None]:
clf = logistic_regression_model.fit(X=train_features, y=train_labels)

##### Performance assessment

Once the model is trained, i.e. once the `fit` function has finished running, it can be used to classify any input sample with the correct shape (vectors of 140 features in this case). In particular, training and validation accuracies can be computed, along with other performance indicators:

In [None]:
accuracy_train = clf.score(X=train_features, y=train_labels)
probs = clf.predict_proba(train_features)
accuracy_validation = clf.score(X=validation_features, y=validation_labels)
print('Training accuracy', accuracy_train, '; Validation accuracy', accuracy_validation)

In our problem setting, the dataset contains more positive samples than negative samples. This implies that a model always predicting the positive class would reach an accuracy above 50%. In other words, accuracy is not always suited to the task at hand. Other performance metrics, such as balanced accuracy (https://en.wikipedia.org/wiki/Precision_and_recall), are implemented in scikit-learn (https://scikit-learn.org/stable/modules/model_evaluation.html). All implemented performance metrics take two vectors as inputs, one being the ground truths and the other the model predictions. We first need to explicitly compute the predictions of the trained model and then run the sklearn metrics functions:

*Note: here, the model prediction for each patient is a vector of size 2. The first number is the probability of the patient being in class 0 (LGG), and the second one the probability of the patient being in class 1 (HGG). These two probabilities always sum to 1.*

In [None]:
# Use the trained model (clf) to explicitly compute probabilities of all training and validation samples
predicted_training_probabilities = clf.predict_proba(train_features)[:, 1]  # for each sample, two outputs probas summing to 1: one for class 0 (LGG), the other for class 1 (HGG)
predicted_validation_probabilities = clf.predict_proba(validation_features)[:, 1]  # for each sample, two outputs probas summing to 1: one for class 0 (LGG), the other for class 1 (HGG)

# Compute predicted classes by thresholding the predicted probabilities at 0.5: a probability higher than 0.5 yields an HGG prediction, otherwise an LGG prediction
training_predicted_classes = list(map(lambda proba: int(proba > .5), predicted_training_probabilities))
validation_predicted_classes = list(map(lambda proba: int(proba > .5), predicted_validation_probabilities))

from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Metrics functions expect the ground-truth vector first, then the predicted probabilities/classes
training_balanced_accuracy = balanced_accuracy_score(train_labels, training_predicted_classes)
validation_balanced_accuracy = balanced_accuracy_score(validation_labels, validation_predicted_classes)
print('Training balanced accuracy', training_balanced_accuracy, '; Validation balanced accuracy', validation_balanced_accuracy)

training_auc = roc_auc_score(train_labels, predicted_training_probabilities)
validation_auc = roc_auc_score(validation_labels, predicted_validation_probabilities)
print('Training AUC', training_auc, '; Validation AUC', validation_auc)
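The column order of `predict_proba` and its link with `predict` can be checked on synthetic data. This is a sketch with hypothetical `_demo` names; the data comes from `make_classification`, not from the radiomics table.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X_demo, y_demo = make_classification(n_samples=100, n_features=5, random_state=0)
clf_demo = LogisticRegression(max_iter=1000).fit(X_demo, y_demo)

proba_demo = clf_demo.predict_proba(X_demo)

# Column i of predict_proba corresponds to the class stored in clf_demo.classes_[i]
print(clf_demo.classes_)

# Each row contains one probability per class and sums to 1
row_sums = proba_demo.sum(axis=1)

# Thresholding the class-1 probability at 0.5 reproduces what predict() returns
thresholded_demo = (proba_demo[:, 1] > 0.5).astype(int)
same_as_predict = np.array_equal(thresholded_demo, clf_demo.predict(X_demo))
print(same_as_predict)
```

Choosing a threshold other than 0.5 is a way to trade one type of error for another, which `predict` alone does not allow.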


Although the balanced accuracy alleviates the issue of class imbalance, a single number is not enough to assess the performance of a decision system. Performance should be assessed using at least two complementary measures, such as precision and recall. While large technology companies can afford mistakes in online tools, such as Facebook suggesting friends to tag on newly uploaded pictures, errors in routine medical tasks can have a significant impact on patient care or equipment maintenance. For instance, false positives (a test finds that a patient has cancer when in reality the patient is healthy) generally have much less impact than false negatives (a patient's cancer goes undetected). The community would expect guarantees that the rate of false negatives is close to zero for diagnostic tasks, even if this implies a significant number of false positives.
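The trade-off between false negatives and false positives can be illustrated by moving the decision threshold. The labels and probabilities below are made up for the example, and `precision_recall_at` is a hypothetical helper written with plain NumPy.

```python
import numpy as np

# Made-up ground truths and predicted probabilities of class 1 (assumption)
y_true_demo = np.array([1, 1, 1, 1, 0, 0, 0, 0, 0, 0])
p_hat_demo = np.array([0.9, 0.8, 0.45, 0.3, 0.6, 0.4, 0.35, 0.2, 0.1, 0.05])

def precision_recall_at(y_true, p_hat, threshold):
    y_pred = (p_hat > threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))  # true positives
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))  # false positives
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))  # false negatives
    return tp / (tp + fp), tp / (tp + fn)

precision_05, recall_05 = precision_recall_at(y_true_demo, p_hat_demo, 0.5)
precision_025, recall_025 = precision_recall_at(y_true_demo, p_hat_demo, 0.25)

print('threshold 0.50: precision', precision_05, 'recall', recall_05)
print('threshold 0.25: precision', precision_025, 'recall', recall_025)
```

Lowering the threshold from 0.5 to 0.25 removes all false negatives in this toy case (recall goes from 0.5 to 1.0), at the price of more false positives (precision goes from 2/3 to 4/7).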

**Exercise:**
Complete the following script to:

- evaluate the logistic regression model trained above, using the sklearn documentation (https://scikit-learn.org/stable/modules/model_evaluation.html)
- compute the precision (also called positive predictive value) and recall (also called true positive rate or sensitivity) of the trained classifier on both the training and validation sets.

In [None]:
#@title Exercise 2

from sklearn.metrics import precision_score, recall_score

training_precision = '''CompleteHere'''(train_labels, training_predicted_classes)
training_recall = '''CompleteHere'''('''CompleteHere''', '''CompleteHere''')

validation_precision = precision_score(validation_labels, validation_predicted_classes)
validation_recall = recall_score(validation_labels, validation_predicted_classes)

print('Training precision', '''CompleteHere''', '; training recall', '''CompleteHere''')
print('Validation precision', '''CompleteHere''', '; Validation recall', '''CompleteHere''')

# Note: both metric functions take predicted classes (not probabilities) as input

##### Analysing the trained model

Fundamentally, logistic regression is parametrized by one weight (or coefficient) per input feature, plus a single bias (or intercept) shared by all features. Intuitively, features whose weights have a large magnitude will have more influence on the output than those with weights close to 0.

For linear classifiers in scikit-learn such as logistic regression, the learned parameters can be extracted using:

In [None]:
# one weight per input feature (coef_) + a single intercept (intercept_)
logistic_regression_parameters = clf.coef_
logistic_regression_intercept = clf.intercept_

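Let us pair each parameter with its associated feature name and sort the parameters by magnitude. Then we retrieve the 10 parameters with the largest magnitude, which the model treats as **the 10 most relevant features for our classification task** (keep in mind that weight magnitudes are only comparable across features when the features share a common scale):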

In [None]:
# Pair each parameter with its associated feature name
logistic_regression_parameters_with_names = list(zip(logistic_regression_parameters[0], x_data.columns.values))
print('Example of param/feature name pairs', logistic_regression_parameters_with_names[:3])
print('Intercept value', logistic_regression_intercept)

# Sort paired data with respect to absolute value of parameters
logistic_regression_parameters_with_names = sorted(logistic_regression_parameters_with_names, key=lambda pair: abs(pair[0]))

print('\nMost important features:')
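# Select the 10 parameters with the largest magnitude and print the associated feature names
# (the slice [:-11:-1] walks backwards over the last 10 elements; [:-10:-1] would only return 9)
for parameter_value, feature_name in logistic_regression_parameters_with_names[:-11:-1]: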
  print('\t\t', feature_name, 'with value:', str(parameter_value))

So far, we trained a logistic regression model, looked at some metrics to assess its performance, and looked into the trained parameters to infer the most important features found by the model.

In a typical ML setting, we would then try to further boost the performance of our decision system by trying different parameters or by using another family of functions etc.

### Improving performance with hyper-parameter optimization

In the previous section, we implemented a logistic regression model using scikit-learn for the classification of MRI images into LGG or HGG. However, there is no guarantee that logistic regression is the best suited family of functions for this task, with this type of data. There are many families of machine learning decision systems, the behavior and performance of which depend on the experiment context. Let's take a look at other options !

First, we will train 6 algorithms and use the reported generalization performance on the validation set to extract the best performing family of models. We will also look into another hyper-parameter which is the standardization of the data, and compare the validation generalization performance of models trained with standardization with respect to those trained without to determine whether applying standardization is beneficial to this task.

#### Standard validation (Hold-Out method : we split the training set into a train and a validation set)

We saw how this works above!

##### Standardization

In [None]:
# Import libraries
from sklearn import model_selection
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler


def standardize(x_train, x_test, type='zscore'):
    if type == 'zscore':
        scaler = StandardScaler()
    else:
        scaler = MinMaxScaler()
    scaler.fit(x_train)
    x_train_standardize = scaler.transform(x_train)
    # We apply the same transform on the test
    x_test_standardize = scaler.transform(x_test)
    return x_train_standardize, x_test_standardize


random_state = 1234

# We load 6 different algorithms with associated algorithm name
models = []
models.append(('Logistic Regression', LogisticRegression(random_state=random_state)))
models.append(('Linear Discriminant Analysis', LinearDiscriminantAnalysis()))
models.append(('K-nearest neighbours', KNeighborsClassifier()))
models.append(('Decision Tree Classifier', DecisionTreeClassifier(random_state=random_state)))
models.append(('Gaussian Naive Bayes', GaussianNB()))
models.append(('Support Vector Classifier', SVC(random_state=random_state)))


# We will perform the same process for each of the 6 algorithms: train on training set (fit) then get score on validation set
results_with_std = []
for model_name, model in models:
    # Apply standardization
    train_features_standardize, validation_features_standardize = standardize(x_train=train_features, x_test=validation_features, type='zscore')
    # Train model; N.B.: the non-standardized training features are train_features
    clf = model.fit(X=train_features_standardize, y=train_labels)
    accuracy_train = clf.score(X=train_features_standardize, y=train_labels)  # get train performance
    accuracy_val = clf.score(X=validation_features_standardize, y=validation_labels) # get validation performance
    results_with_std.append(accuracy_val)
    print("%s: train = %.3f, validation = %.3f" % (model_name, accuracy_train, accuracy_val))

Generally, in machine learning, we cannot identify the best model until training is over.

Here, the models have been tested only once on the validation set. To make our pipeline more robust, let us run cross validation instead.

#### K-fold cross validation

So far, we have :
- split our entire dataset into a "global machine learning optimization" set and a testing set.
- split the "global machine learning optimization" set into a training set, used to optimize the parameters of decision systems, and a validation set for hyper-parameters tuning.

This second split is random, as there is no reason to privilege some data samples over others. Thus, we could take the "global machine learning optimization" set and partition it into a new training set and new associated validation set.

K-fold cross-validation partitions a set into k folds of roughly equal size and produces k train/validation splits: in each split, one fold is held out to assess performance and the remaining k-1 folds are used to train the model. This yields more robust estimates of model performance without requiring more data.
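The index bookkeeping behind k-fold splitting can be inspected on a toy array of 10 samples. This sketch only shows the splits, not model training; `X_toy` and `kf_demo` are hypothetical names.

```python
import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(10).reshape(-1, 1)  # 10 toy samples (assumption)
kf_demo = KFold(n_splits=5, shuffle=True, random_state=0)

validation_indices = []
for fold, (train_idx, val_idx) in enumerate(kf_demo.split(X_toy)):
    # Each split yields integer positions for the training part and the held-out fold
    print('fold', fold, 'train:', train_idx, 'validation:', val_idx)
    validation_indices.extend(val_idx.tolist())

# Over the 5 splits, every sample is held out exactly once
print(sorted(validation_indices))
```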


-------------------------------------------------------------------------------


**Exercise:**

Complete the script below to implement the K-fold cross validation based on https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html

K-fold cross validation is performed as follows:
  - Randomly split your entire dataset into k folds
  - For each fold in your dataset, train your model on the remaining k – 1 folds. Then, assess model performance on the hold-out fold.
  - Save the error you see on each of the predictions.
  - Repeat this until each of the k-folds has served as the test set.
  - The average of your k recorded errors is called the cross-validation error. It is used as your performance metric for the model.


  Now, one of the most commonly asked questions is, *“How do I choose the right value of k?”*.

A low value of k leaves less data for training in each split, so the performance estimate is more pessimistically biased. A higher value of k is less biased, but at the cost of more computation and, typically, an increase in the variability of the estimate.

A smaller value of k brings us closer towards the single validation set approach, whereas a higher value of k leads to Leave-One-Out Cross Validation (LOOCV) approach. Actually, LOOCV is equivalent to n-fold cross validation where n is the number of training examples.
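The equivalence between LOOCV and n-fold cross-validation can be verified directly by comparing the splits produced by the two scikit-learn classes. A small sketch on 6 toy samples, with hypothetical names:

```python
import numpy as np
from sklearn.model_selection import KFold, LeaveOneOut

X_toy = np.arange(6).reshape(-1, 1)  # n = 6 toy samples (assumption)

loo_splits = [(tr.tolist(), te.tolist()) for tr, te in LeaveOneOut().split(X_toy)]
# KFold without shuffling and with as many folds as samples
nfold_splits = [(tr.tolist(), te.tolist()) for tr, te in KFold(n_splits=len(X_toy)).split(X_toy)]

print(loo_splits[0])
print(loo_splits == nfold_splits)
```

With n samples, LOOCV requires training n models, which is why moderate values such as k = 5 or k = 10 are common compromises.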


![CV](https://raw.githubusercontent.com/jeannebc/Radiomics-BrainTumorClassification/main/images/cross_validation_method.png)

In [None]:
from sklearn.model_selection import KFold
import numpy as np
import pandas as pd


def cross_validation(X, y, model, num_folds=5):
    if isinstance(X, pd.DataFrame):
        X = X.to_numpy()
    if isinstance(y, pd.DataFrame):
        y = y.to_numpy()

    cv = KFold(n_splits=num_folds, random_state=123, shuffle=True)
    results_train, results_test = [], []
    for train_index, test_index in cv.split(X):
        X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]
        clf = model.fit('''CompleteHere''', '''CompleteHere''')
        accuracy_train = clf.score(X='''CompleteHere''', y='''CompleteHere''')
        accuracy_test = clf.score(X='''CompleteHere''', y='''CompleteHere''')  # Return the mean accuracy
        results_train.append(accuracy_train)
        results_test.append(accuracy_test)
    return '''CompleteHere''', '''CompleteHere'''


We implement cross validation on the <code>x_train</code> data

In [None]:
from sklearn.model_selection import KFold
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler


def standardize(x_train, x_test, type='zscore'):
    if type == 'zscore':
        scaler = StandardScaler()
    else:
        scaler = MinMaxScaler()
    scaler.fit(x_train)
    x_train_standardize = scaler.transform(x_train)
    # We apply the same transform on the test
    x_test_standardize = scaler.transform(x_test)
    return x_train_standardize, x_test_standardize


def cross_validation_with_standardization(X, y, model, num_folds=5):
    if isinstance(X, pd.DataFrame):
        X = X.to_numpy()
    if isinstance(y, pd.DataFrame):
        y = y.to_numpy()

    cv = KFold(n_splits=num_folds, random_state=123, shuffle=True)
    results_train, results_test = [], []
    for train_index, test_index in cv.split(X):
        X_train, X_test, y_train, y_test = X[train_index], X[test_index], y[train_index], y[test_index]

        # standardize with z-score
        X_train, X_test = standardize(x_train=X_train, x_test=X_test, type='zscore')

        clf = model.fit(X_train, y_train)
        accuracy_train = clf.score(X=X_train, y=y_train)
        accuracy_test = clf.score(X=X_test, y=y_test)  # Return the mean accuracy
        results_train.append(accuracy_train)
        results_test.append(accuracy_test)
    return results_train, results_test

In [None]:
all_results_cv_train_w_std, all_results_cv_val_w_std, names = [], [], []
for name, model in models:
    names.append(name)
    results_cv_train_w_std, results_cv_val_w_std = cross_validation_with_standardization(X=x_train, y=y_train, model=model, num_folds=5)
    all_results_cv_train_w_std.append(results_cv_train_w_std)
    all_results_cv_val_w_std.append(results_cv_val_w_std)
    means_train, stds_train = np.mean(results_cv_train_w_std), np.std(results_cv_train_w_std)
    means_val, stds_val = np.mean(results_cv_val_w_std), np.std(results_cv_val_w_std)
    msg = "%s: train = %.3f (+/- %.3f), test = %.3f (+/- %.3f)" % (name, means_train, stds_train, means_val, stds_val)
    print(msg)

In [None]:
import matplotlib.pyplot as plt

# boxplot algorithm comparison
plt.figure(figsize=(15,10))

ax = plt.subplot(122)
plt.title('Algorithm Comparison using cross validation with standardization')
plt.boxplot(all_results_cv_val_w_std)
ax.set_ylim([0.60, 0.95])
ax.set_xticklabels(names, rotation=40)
plt.show()

Compare the results of cross validation without standardization:

In [None]:
def cross_validation_without_standardization():
  """TODO"""

In [None]:
import matplotlib.pyplot as plt

# boxplot algorithm comparison
plt.figure(figsize=(15,10))

ax = plt.subplot(122)
plt.title('Algorithm Comparison using cross validation without standardization')
plt.boxplot("""COMPLETE HERE""")
ax.set_ylim([0.60, 0.95])
ax.set_xticklabels(names, rotation=40)
plt.show()

### Which model would you choose ?

# 3) Overfitting and Underfitting

When training a model, it is crucial to avoid two common pitfalls : overfitting or on the contrary, underfitting.

- **Overfitting**:  
A model suffers from overfitting when it becomes specific to the data it is trained on, and is no longer able to generalize accurately to unseen data. This is usually caused by overtraining the model on the training set. **The bigger the model (i.e. the more parameters), the higher the chances of overfitting**.

- **Underfitting**:  
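Intuitively, a model suffers from underfitting when it is too simple or has not been trained enough to capture the structure of the data: it performs poorly even on the training set, in the extreme case almost like a randomly initialized model. It is not specific enough to the data and the task.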

![over-underfitting](https://raw.githubusercontent.com/jeannebc/Radiomics-BrainTumorClassification/main/images/over_and_under_fitting.png)

*How do we detect overfitting?*  
A key challenge with overfitting, and with machine learning in general, is that we **do not know how well our model will perform on new data until we actually test it.**

To check whether the model overfits, we compare model performance on seen data (training set) versus unseen data (testing set). If our model does much better on the training set than on the testing set, we are likely overfitting. For instance, seeing a 99% training accuracy but a 55% testing accuracy is a clear sign of overfitting. The bigger the gap in performance, the stronger the overfit.
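The train/test comparison can be sketched with an unconstrained decision tree on noisy synthetic data. The dataset (with 20% of labels randomized through `flip_y`) and all `_demo` names are assumptions for illustration, not the radiomics data.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data: few informative features, 20% of labels assigned at random
X_demo, y_demo = make_classification(
    n_samples=300, n_features=20, n_informative=3, flip_y=0.2, random_state=0
)
X_tr_demo, X_te_demo, y_tr_demo, y_te_demo = train_test_split(
    X_demo, y_demo, test_size=0.3, random_state=0, stratify=y_demo
)

# A tree with no depth limit can memorize the training set, noise included
tree_demo = DecisionTreeClassifier(random_state=0).fit(X_tr_demo, y_tr_demo)
train_acc = tree_demo.score(X_tr_demo, y_tr_demo)
test_acc = tree_demo.score(X_te_demo, y_te_demo)

print('train accuracy:', train_acc, 'test accuracy:', test_acc, 'gap:', train_acc - test_acc)
```

A perfect training accuracy combined with a clearly lower test accuracy is the signature described above.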

*How do we prevent overfitting?*  

- **Cross-validation**  
As seen previously, it implies using the initial training data to generate multiple mini train-test splits to tune your model.
In standard k-fold cross-validation, the data is partitioned into k subsets, called folds. The model is iteratively trained on the k-1 folds, and tested on the remaining one (the holdout fold). Cross-validation allows for hyperparameter tuning without integrating additional data, keeping the test set apart for final model evaluation.

- **Remove features**  
This is a means of reducing the number of parameters and making the model simpler. Some algorithms have built-in feature selection. For those that do not, you can manually improve their generalizability by removing irrelevant input features.

- **Regularization**  
Regularization refers to a broad range of techniques for artificially forcing your model to be simpler. The method will depend on the type of learner you are using. For example, you could prune a decision tree, use dropout on a neural network, or add a penalty parameter to the cost function in regression.

- **Also : additional data, ensembling, early stopping ...**
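The k-fold partitioning described in the cross-validation item can be sketched in a few lines of plain Python. The function name `k_fold_indices` is hypothetical, and the sketch neither shuffles nor stratifies the samples, which a real experiment would usually do (for instance with scikit-learn's `StratifiedKFold`).

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, val_indices) for each of the k folds."""
    indices = list(range(n_samples))
    # Spread the remainder so that fold sizes differ by at most one sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val = indices[start:start + size]                   # the holdout fold
        train = indices[:start] + indices[start + size:]    # the k-1 remaining folds
        yield train, val
        start += size

folds = list(k_fold_indices(n_samples=10, k=5))
for fold_id, (train, val) in enumerate(folds):
    print(f"fold {fold_id}: train on {train}, validate on {val}")
```

Every sample is used for validation exactly once, and it is never part of the training indices of the fold in which it is validated.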


# 4) Hyperparameter optimization

Wikipedia states that "*hyperparameter tuning/optimization is choosing a set of optimal hyperparameters for a learning algorithm*". What exactly is a hyperparameter?

<div align="center">
    <i> a hyperparameter is a parameter whose value is set before the learning process begins </i>
</div>

An example of a hyperparameter is the penalty used in logistic regression.
In sklearn, hyperparameters are passed as arguments to the constructor of the model classes.
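For illustration, here is a short sketch of how hyperparameters are set and inspected on a scikit-learn estimator. The values `C=0.5` and `C=2.0` are arbitrary choices for the example, not tuned values.

```python
from sklearn.linear_model import LogisticRegression

# Hyperparameters are fixed when the estimator is built, before any call to fit().
clf = LogisticRegression(C=0.5, penalty="l2", solver="saga")

# get_params() returns the current hyperparameter values as a dictionary.
print(clf.get_params()["C"])

# set_params() changes them in place, which is what a search loop relies on.
clf.set_params(C=2.0)
print(clf.get_params()["C"])
```

The learned coefficients (`coef_`, `intercept_`) are parameters, not hyperparameters: they only exist after `fit` has been called.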
   
**Tuning Strategies:**
 - Grid Search  
 Also known as an exhaustive search, Grid search looks through each combination of hyperparameters. This means that every combination of specified hyperparameter values will be tried.
 - Random Search  
 As its name suggests, Random Search uses random combinations of hyperparameters. This means that not all of the parameter values are tried; instead, parameters are sampled for a fixed number of iterations.

 ![grid_search](https://raw.githubusercontent.com/jeannebc/Radiomics-BrainTumorClassification/main/images/random_grid_search.png)
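The difference between the two strategies can be sketched with the standard library alone. The search space below is a small hypothetical one chosen for the example; grid search enumerates its full Cartesian product, while random search draws a fixed number of candidates from it.

```python
import itertools
import random

# Hypothetical search space, for illustration only.
search_space = {
    "C": [0.1, 1.0, 10.0],
    "penalty": ["l1", "l2"],
    "fit_intercept": [True, False],
}

# Grid search: the Cartesian product of all candidate values.
keys = list(search_space)
grid = [dict(zip(keys, values)) for values in itertools.product(*search_space.values())]
print(len(grid))  # 3 * 2 * 2 = 12 combinations

# Random search: a fixed budget of independent draws from the same space.
rng = random.Random(0)
n_iterations = 5
random_candidates = [
    {key: rng.choice(values) for key, values in search_space.items()}
    for _ in range(n_iterations)
]
print(len(random_candidates))
```

The cost of grid search grows multiplicatively with every hyperparameter added, whereas the cost of random search is set by the iteration budget alone.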

### Implementing Grid Search

Let us define the grid search space. The pipeline implementing z-score normalisation will be used.

In [None]:
"""
Create a dictionary with the classifier name as key and its hyperparameter options as value for grid search
"""

import numpy as np

# Logistic Regression Params
C = [x for x in np.arange(0.1, 3, 0.2)]
penalty = ["l1", "l2"]
fit_intercept = [True, False]
solver = ["saga"]
lr_params = {'C': C,
             'penalty': penalty,
             'fit_intercept': fit_intercept,
             'solver': solver
             }

# DecisionTreeClassifier PARAMS
criterion = ['gini', 'entropy']
splitter = ['best', 'random']
class_weight = [None, "balanced"]
max_depth = [int(x) for x in np.linspace(10, 110, num=11)]
max_depth.append(None)
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 4]
max_features = [None, "sqrt", "log2"]
dtc_params = {'criterion': criterion,
              'splitter': splitter,
              'class_weight': class_weight,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'max_features': max_features
              }

# KNN PARAMS
n_neighbors = [int(x) for x in np.linspace(start=1, stop=20, num=2)]
weights = ["uniform", "distance"]
algorithm = ["auto", "ball_tree", "kd_tree", "brute"]
leaf_size = [int(x) for x in np.linspace(start=5, stop=50, num=2)]
p = [int(x) for x in np.linspace(start=1, stop=4, num=1)]
knn_params = {'n_neighbors': n_neighbors,
              'weights': weights,
              'algorithm': algorithm,
              'leaf_size': leaf_size,
              'p': p,
              }

# LDA PARAMS
solver = ["lsqr"]
shrinkage = ["auto", None, 0.1, 0.3, 0.5, 0.7, 0.9]
lda_params = {'solver': solver,
              'shrinkage': shrinkage
              }

# GaussianNB PARAMS
var_smoothing = [1e-9, 1e-8, 1e-7, 1e-6, 1e-5] #Portion of the largest variance of all features that is added to variances for calculation stability
gnb_params = {'var_smoothing': var_smoothing,
              }

# SVC PARAMS
C = [x for x in np.arange(0.1, 2, 0.2)]
gamma = ["auto"]
kernel = ["linear", "poly", "rbf", "sigmoid"]
degree = [1, 2, 3, 4, 5, 6]
svc_params = {'C': C,
              'gamma': gamma,
              'kernel': kernel,
              'degree': degree,
              }

hypertuned_params_gs = {"Logistic Regression": lr_params,
                     "Decision Tree Classifier": dtc_params,
                     "K-nearest neighbours": knn_params,
                     "Linear Discriminant Analysis": lda_params,
                     "Gaussian Naive Bayes": gnb_params,
                     "Support Vector Classifier": svc_params,
                     }


In [None]:
# grid search function
import numpy as np
import warnings
warnings.filterwarnings('ignore')

def grid_search(X, y, name, model, param_grid, verbose=False):
    """Evaluate every parameter set of param_grid with 5-fold CV and print the best one."""
    max_val_mean_val = -np.inf  # best mean validation score found so far
    for i, params in enumerate(param_grid):  # enumerate adds a counter to an iterable
        model = model.set_params(**params)
        results_cv_train_w_std, results_cv_val_w_std = cross_validation_with_standadization(X=X, y=y, model=model, num_folds=5)
        mean_train, std_train = np.mean(results_cv_train_w_std), np.std(results_cv_train_w_std)
        mean_val, std_val = np.mean(results_cv_val_w_std), np.std(results_cv_val_w_std)
        if verbose:
            print("%s - iteration %i: %f (%f)" % (name, i, mean_val, std_val))
        if mean_val > max_val_mean_val:
            # update the running maximum so that later iterations must beat it
            max_val_mean_val = mean_val
            max_val_std_val = std_val
            max_val_mean_train = mean_train
            max_val_std_train = std_train
            max_i = i
            best_params = params
    msg = "%s: Maximum value on validation = %.3f (+/- %.3f) with train = %.3f (+/- %.3f) for iteration %i with params: %s" % (name, max_val_mean_val, max_val_std_val, max_val_mean_train, max_val_std_train, max_i, best_params)
    print(msg)

In [None]:
from sklearn.model_selection import ParameterGrid

for name, model in models:
    param_grid = list(ParameterGrid(hypertuned_params_gs[name]))
    grid_search(X=x_train, y=y_train, name=name, model=model, param_grid=param_grid, verbose=False)  # you can set verbose to True to see each iteration

### Exercise 4
Compare the results obtained with grid search to the previous results obtained without any hyperparameter optimization.

### Random Search

Let us define the random search space. The pipeline implementing z-score normalisation will be used.

In [None]:
"""
Create a dictionary with the classifier name as key and its hyperparameter options as value for random search
"""

import numpy as np
from scipy.stats import uniform

# Logistic Regression Params
# Create regularization hyperparameter distribution using uniform distribution
C = uniform(loc=0, scale=4)
penalty = ["l1", "l2"]
fit_intercept = [True, False]
solver = ["saga"]
lr_params = {'C': C,
             'penalty': penalty,
             'fit_intercept': fit_intercept,
             'solver': solver
             }

# DecisionTreeClassifier PARAMS
criterion = ['gini', 'entropy']
splitter = ['best', 'random']
class_weight = [None, "balanced"]
max_depth = list(range(10, 501))
max_depth.append(None)
min_samples_split = list(range(2, 101))
min_samples_leaf = list(range(1, 50))
max_features = [None, "sqrt", "log2"]
dtc_params = {'criterion': criterion,
              'splitter': splitter,
              'class_weight': class_weight,
              'max_depth': max_depth,
              'min_samples_split': min_samples_split,
              'min_samples_leaf': min_samples_leaf,
              'max_features': max_features
              }

# KNN PARAMS
n_neighbors = list(range(1, 101))
weights = ["uniform", "distance"]
algorithm = ["auto", "ball_tree", "kd_tree", "brute"]
leaf_size = list(range(2, 101))
p = list(range(1, 11))
knn_params = {'n_neighbors': n_neighbors,
              'weights': weights,
              'algorithm': algorithm,
              'leaf_size': leaf_size,
              'p': p,
              }

# LDA PARAMS
solver = ["lsqr"]
shrinkage = ["auto", None, 0.1, 0.3, 0.5, 0.7, 0.9]
lda_params = {'solver': solver,
              'shrinkage': shrinkage
              }

# GaussianNB PARAMS
var_smoothing = uniform(loc=0, scale=0.1)
gnb_params = {'var_smoothing': var_smoothing,
              }

# SVC PARAMS
C =  uniform(loc=0, scale=2)
gamma = ["auto"]
kernel = ["linear", "poly", "rbf", "sigmoid"]
degree = list(range(1,11))
svc_params = {'C': C,
              'gamma': gamma,
              'kernel': kernel,
              'degree': degree,
              }

hypertuned_params_rs = {"Logistic Regression": lr_params,
                     "Decision Tree Classifier": dtc_params,
                     "K-nearest neighbours": knn_params,
                     "Linear Discriminant Analysis": lda_params,
                     "Gaussian Naive Bayes": gnb_params,
                     "Support Vector Classifier": svc_params,
                     }

In [None]:
# random search function
import random
import warnings
import numpy as np
warnings.filterwarnings('ignore')

def random_search(X, y, name, model, param_grid, nb_iterations, verbose=False):
    """Evaluate nb_iterations random parameter sets with 5-fold CV and print the best one."""
    max_val_mean_val = -np.inf  # best mean validation score found so far
    for i in range(nb_iterations):
        # draw one value per hyperparameter: sample distributions with rvs(), pick from lists at random
        params = {key: value.rvs() if hasattr(value, 'rvs') else random.choice(value) for key, value in param_grid.items()}
        model = model.set_params(**params)
        results_cv_train_w_std, results_cv_val_w_std = cross_validation_with_standadization(X=X, y=y, model=model, num_folds=5)
        mean_train, std_train = np.mean(results_cv_train_w_std), np.std(results_cv_train_w_std)
        mean_val, std_val = np.mean(results_cv_val_w_std), np.std(results_cv_val_w_std)
        if verbose:
            print("%s - iteration %i: %f (%f)" % (name, i, mean_val, std_val))
        if mean_val > max_val_mean_val:
            # update the running maximum so that later iterations must beat it
            max_val_mean_val = mean_val
            max_val_std_val = std_val
            max_val_mean_train = mean_train
            max_val_std_train = std_train
            max_i = i
            best_params = params
    msg = "%s: Maximum value on validation = %.3f (+/- %.3f) with train = %.3f (+/- %.3f) for iteration %i with params: %s" % (name, max_val_mean_val, max_val_std_val, max_val_mean_train, max_val_std_train, max_i, best_params)
    print(msg)

In [None]:
for name, model in models:
    dic_grid = hypertuned_params_rs[name]
    random_search(X=x_train, y=y_train, name=name, model=model, param_grid=dic_grid, nb_iterations=100, verbose=False)  # you can set verbose to True to see each iteration

Here, we implemented our own Grid Search and Random Search functions to understand the mechanisms at play. In practice, sklearn provides ready-made implementations of both: [Random Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html) and [Grid Search](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.GridSearchCV.html).
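As a hedged sketch of how these two classes could be used, the example below tunes the `C` hyperparameter of a logistic regression placed after a z-score step in a `Pipeline`. The synthetic data from `make_classification` is only a stand-in for `x_train` and `y_train`, and the candidate values for `C` are arbitrary.

```python
from scipy.stats import uniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for (x_train, y_train); replace with the real training set.
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

pipe = Pipeline([("sc", StandardScaler()),
                 ("clf", LogisticRegression(solver="saga", max_iter=5000))])

# Parameters of a pipeline step are addressed as "<step name>__<parameter>".
grid = GridSearchCV(pipe,
                    param_grid={"clf__C": [0.1, 1.0, 10.0]},
                    cv=5, scoring="accuracy")
grid.fit(X_demo, y_demo)
print(grid.best_params_, round(grid.best_score_, 3))

# Random search samples C from a continuous distribution for a fixed budget.
rand = RandomizedSearchCV(pipe,
                          param_distributions={"clf__C": uniform(loc=0.01, scale=4)},
                          n_iter=10, cv=5, scoring="accuracy", random_state=0)
rand.fit(X_demo, y_demo)
print(rand.best_params_, round(rand.best_score_, 3))
```

Because the scaler sits inside the pipeline, it is refitted on the training part of each fold, so no information from the validation fold leaks into the standardization.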

# 5) Ensembling

The ''No Free Lunch'' theorem states that no machine learning algorithm is universally better than the others in all domains. The goal of ensembling is to combine multiple learners to improve the applicability and achieve better performance. Importantly, if two models have comparable performance, the simpler one should be preferred (see [Occam's razor](https://simple.wikipedia.org/wiki/Occam%27s_razor)).

### Voting

Voting is arguably the most straightforward way to combine multiple learners $d^{(j)}(\cdot)$. The idea is to take a linear combination of the predictions made by the learners. For example, in multiclass classification, we have
$$\tilde{y}_k =\sum_{j=1}^{L} w_j d^{(j)}_k(\boldsymbol{x}), \text{ where }w_j\geq 0\text{ and }\sum_{j=1}^{L} w_j=1,$$<p>for any class $k$, where $L$ is the number of voters. This can be simplified to the <strong>plurality vote</strong> where each voter has the same weight:</p>
$$\tilde{y}_k =\sum_{j=1}^{L} \frac{1}{L} d^{(j)}_k(\boldsymbol{x}).$$<p> We use the <code>VotingClassifier</code> from Scikit-learn to combine several classifiers.</p>
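To make the weighted sum concrete, here is a small worked example in plain Python. The function `soft_vote` and the probabilities of the three learners are hypothetical; the function normalises the weights so that they sum to 1, as the formula requires.

```python
def soft_vote(probabilities, weights):
    """Combine per-learner class probabilities d^(j)_k(x) with weights w_j."""
    total = sum(weights)
    normalised = [w / total for w in weights]  # enforce sum_j w_j = 1
    n_classes = len(probabilities[0])
    return [
        sum(w * probs[k] for w, probs in zip(normalised, probabilities))
        for k in range(n_classes)
    ]

# Hypothetical outputs of L = 3 learners for one sample: [P(LGG), P(HGG)].
learner_probs = [[0.9, 0.1], [0.4, 0.6], [0.3, 0.7]]

weighted = soft_vote(learner_probs, weights=[1, 2, 2])
equal = soft_vote(learner_probs, weights=[1, 1, 1])
print(weighted, equal)
```

With equal weights the ensemble favours LGG (about 0.53 against 0.47), whereas giving the last two learners twice the weight of the first one flips the decision to HGG (0.54 against 0.46). This is why the choice of weights is itself worth validating.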

We will use the Sklearn <code>Pipeline</code> tools, which allow us to combine all the steps we have seen previously.


In [None]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.naive_bayes import GaussianNB
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler  # needed by the pipelines below

pipe1 = Pipeline([['sc', StandardScaler()], ['clf', LogisticRegression(**{'C': 0.1, 'fit_intercept': True, 'penalty': 'l1', 'solver': 'saga'}, random_state=random_state)]])
pipe2 = Pipeline([['clf', DecisionTreeClassifier(**{'class_weight': None, 'criterion': 'gini', 'max_depth': 80, 'max_features': 'log2', 'min_samples_leaf': 1, 'min_samples_split': 10, 'splitter': 'best'}, random_state=random_state)]])
pipe3 = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(**{'algorithm': 'auto', 'leaf_size': 5, 'n_neighbors': 20, 'p': 1, 'weights': 'distance'})]])
pipe4 = Pipeline([['sc', StandardScaler()], ['clf', LinearDiscriminantAnalysis(**{'shrinkage': 0.7, 'solver': 'lsqr'})]])
pipe5 = Pipeline([['sc', StandardScaler()], ['clf', GaussianNB(**{'var_smoothing': 1e-09})]])
pipe6 = Pipeline([['sc', StandardScaler()], ['clf', SVC(**{'C': 0.5, 'degree': 1, 'gamma': 'auto', 'kernel': 'sigmoid', 'probability': True}, random_state=random_state)]])

We can estimate the performance of individual classifiers via the 5-fold CV:

In [None]:
from sklearn.model_selection import cross_validate

clf_labels = ['LR', 'DTC', 'KNN', 'LDA', 'GNB', 'SVC']
print('[Individual]')
for pipe, label in zip([pipe1, pipe2, pipe3, pipe4, pipe5, pipe6], clf_labels):
    results = cross_validate(estimator=pipe, X=x_train, y=y_train, cv=5, scoring='accuracy', return_train_score=True)
    scores_val = results['test_score']
    scores_train = results['train_score']
    print('%s: train = %.3f (+/- %.3f), validation = %.3f (+/- %.3f)' % (label, scores_train.mean(), scores_train.std(), scores_val.mean(), scores_val.std()))

We combine the classifiers with the <code>VotingClassifier</code> from Scikit-learn and experiment with some weight combinations:

In [None]:
from sklearn.ensemble import VotingClassifier
import itertools

print('[Voting]')
best_vt, best_w, best_val_score, best_train_score = None, (), -1, 1
for a, b, c in list(itertools.permutations(range(0,3))): # try every ordering of the weights 0, 1 and 2 (a weight of 0 discards the classifier)
    clf = VotingClassifier(estimators=[('dtc', pipe2), ('knn', pipe3), ('svc', pipe6)],
                           voting='soft', weights=[a,b,c])
    results = cross_validate(estimator=clf, X=x_train, y=y_train, cv=5, scoring='accuracy', return_train_score=True)
    scores_val = results['test_score']
    scores_train = results['train_score']
    print('%s: train = %.3f (+/- %.3f), validation = %.3f (+/- %.3f)' % ((a,b,c), scores_train.mean(), scores_train.std(), scores_val.mean(), scores_val.std()))
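    # keep the best validation score among ensembles whose mean train score is below 1 (best_train_score stays at 1)
    if best_val_score < scores_val.mean() and best_train_score > scores_train.mean():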
        best_vt, best_w, best_val_score = clf, (a, b, c), scores_val.mean()

print('\nBest %s: %.3f' % (best_w, best_val_score))

### Exercise 5
What is the best ensemble?

# 6) Final prediction and Evaluation metrics

We have now reached the final step of our ML pipeline : evaluation on the hold out test set, which has remained untouched so far. Our final model will be an ensemble which combines the <code>KNeighborsClassifier</code> and <code>SVC</code> with a z-score standardization as feature preprocessing. Let us fit the model on all the training data as follows:

*Note : A decision tree does not need a z-score preprocessing because by nature the algorithm is scale invariant.*

In [None]:
import numpy as np
from sklearn.ensemble import VotingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, confusion_matrix, auc, ConfusionMatrixDisplay, RocCurveDisplay

pipe3 = Pipeline([['sc', StandardScaler()], ['clf', KNeighborsClassifier(**{'algorithm': 'auto', 'leaf_size': 5, 'n_neighbors': 20, 'p': 1, 'weights': 'distance'})]])
pipe6 = Pipeline([['sc', StandardScaler()], ['clf', SVC(**{'C': 0.5000000000000001, 'degree': 1, 'gamma': 'auto', 'kernel': 'sigmoid', 'probability': True})]])

clf = VotingClassifier(estimators=[('knn', pipe3), ('svc', pipe6)], voting='soft', weights=[1,2])
clf.fit(x_train, y_train)
classes = clf.classes_ # class labels, in the column order used by predict_proba

# Probability associated with each class
proba_test = clf.predict_proba(x_test)[:,1] # [:,1] refers to the second class (HGG)
y_pred = np.where(proba_test>0.5, 1, 0) # predicted label: 1 (HGG) if the probability exceeds 0.5, else 0 (LGG)


# 1 -- Confusion matrix
cm = confusion_matrix(y_test, y_pred)

# 2 -- ROC curve
fp_rates, tp_rates, _ = roc_curve(y_test, proba_test, pos_label=1)
roc_auc = auc(fp_rates, tp_rates)

# for binary labels, ravel() flattens the 2x2 matrix in the order tn, fp, fn, tp
tn, fp, fn, tp = cm.ravel()

# 3 -- Compute each metric
precision = tp / (tp + fp)
recall = tp / (tp + fn)
F1 = 2 * (precision * recall) / (precision + recall)
accuracy = (tn + tp) / (tn + fp + fn + tp)

printout = (
    f'Precision: {round(precision, 3)} | '
    f'Recall: {round(recall, 3)} | '
    f'F1 Score: {round(F1, 3)} | '
    f'Accuracy Score: {round(accuracy, 3)} | '
    f'ROC auc: {round(roc_auc, 3)}'
)
print(printout)


Now, we evaluate the classifier using different metrics.
Let us plot the confusion matrix and ROC curve:

In [None]:
ConfusionMatrixDisplay.from_estimator(clf, x_test, y_test);
RocCurveDisplay.from_estimator(clf, x_test, y_test);
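
A ROC curve is obtained by sweeping the decision threshold (fixed at 0.5 in the prediction step above) and recording the false positive rate and true positive rate at each value. The sketch below does this by hand for three thresholds; the labels and scores are hypothetical and are not results from this notebook.

```python
def rates_at_threshold(y_true, scores, threshold):
    """Return (false positive rate, true positive rate) when predicting 1 for score > threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s > threshold)
    fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s > threshold)
    positives = sum(y_true)
    negatives = len(y_true) - positives
    return fp / negatives, tp / positives

# Hypothetical labels (0 = LGG, 1 = HGG) and predicted HGG probabilities.
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.3, 0.45, 0.6, 0.4, 0.55, 0.8, 0.9]

roc_points = {t: rates_at_threshold(y_true, scores, t) for t in (0.25, 0.5, 0.75)}
for t, (fpr, tpr) in roc_points.items():
    print(f"threshold={t}: FPR={fpr:.2f}, TPR={tpr:.2f}")
```

Lowering the threshold catches more HGG cases (higher TPR) at the price of more false alarms (higher FPR); the area under the resulting curve summarises this trade-off in a single number.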

Finally, let's look at the distribution of predicted probabilities for each class.

In [None]:
import matplotlib.pyplot as plt

df = pd.DataFrame({'probPos': proba_test, 'target': y_test})
plt.figure(figsize=(10,5))
plt.hist(df[df.target == 0].probPos, density=True, bins=25,
             alpha=.5, color='green', label='LGG')
plt.hist(df[df.target == 1].probPos, density=True, bins=25,
             alpha=.5, color='red', label='HGG')
plt.axvline(.5, color='blue', linestyle='--', label='Boundary')
plt.xlim([0, 1])
plt.title('Distributions of Predictions', size=15)
plt.xlabel('Positive Probability (predicted)', size=13)
plt.ylabel('Samples (normalized scale)', size=13)
plt.legend(loc="upper right")
plt.show();