# Lab 4: Class Imbalance and Performance Metrics

**Exercise 1**

A small classification problem: glass type recognition.

1. Import the `glass.dat` dataset;
2. print the data and the classes (column `type`);
3. split the dataset into training and test sets, holding out 20% of the data for testing;
4. print the number of training and test instances;
5. train an SVM and compute the accuracy on the test set.

In [None]:
# YOUR CODE HERE

Are we happy with this accuracy? Let's look at the F1 score, the harmonic mean of precision and recall, computed with `metrics.f1_score`:

$$
    f1 = 2 \cdot \dfrac{(precision \cdot recall)}{(precision + recall)}
$$
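
As a quick sanity check of the formula, here is a minimal sketch on made-up labels (the arrays `y_true` and `y_hat` are illustrative and not part of the glass data). It also shows why argument order matters: scikit-learn metrics take the true labels first, and swapping the arguments swaps the roles of precision and recall.

```python
from sklearn import metrics

# Toy labels, made up for illustration: 4 positives and 4 negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_hat = [1, 1, 0, 0, 1, 0, 0, 0]  # tp=2, fn=2, fp=1, tn=3

precision = metrics.precision_score(y_true, y_hat)  # tp / (tp + fp) = 2 / 3
recall = metrics.recall_score(y_true, y_hat)        # tp / (tp + fn) = 2 / 4

# Apply the harmonic-mean formula by hand and compare with scikit-learn.
f1_manual = 2 * (precision * recall) / (precision + recall)
f1_sklearn = metrics.f1_score(y_true, y_hat)
print(precision, recall, f1_manual, f1_sklearn)

# With the arguments swapped, "precision" is computed as if y_hat were the truth,
# so it equals the recall obtained with the correct order.
swapped_precision = metrics.precision_score(y_hat, y_true)
print(swapped_precision)
```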

In [None]:
# scikit-learn metrics expect the true labels first: (y_true, y_pred)
print(metrics.f1_score(y_test, y_pred))

In [None]:
print(metrics.recall_score(y_test, y_pred))     # recall = tp / (tp + fn)
print(metrics.precision_score(y_test, y_pred))  # precision = tp / (tp + fp)

Both precision and recall are 0, but why? Let's print the entries of the confusion matrix with `metrics.confusion_matrix`.


<img src="img/conf_mat.jpg" alt="Drawing" style="width: 500px;"/>

In [None]:
conf_mat = metrics.confusion_matrix(y_test, y_pred)
print("confusion matrix \n",conf_mat,"\n")
print("True negative    \t",conf_mat[0][0],"--> predicted as negative, and really negative :) ")
print("False positive   \t",conf_mat[0][1]," --> predicted as positive, but negative :( ")
print("False negative  \t",conf_mat[1][0]," --> predicted as negative, but positive :( ")
print("True positive   \t",conf_mat[1][1]," --> predicted as positive, and really positive :) ")

In [None]:
# have a look at the predictions:

print(y_pred)

The classifier predicts only one class (class 0), which is really bad! Any ideas?
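
To see how a classifier that predicts a single class can still reach a high accuracy, here is a minimal sketch on hypothetical labels (90 negatives and 10 positives, chosen only for illustration; these are not the glass data).

```python
import numpy as np
from sklearn import metrics

# Hypothetical imbalanced labels: 90 negatives, 10 positives.
y_true = np.array([0] * 90 + [1] * 10)
# A degenerate "classifier" that always predicts the majority class.
y_majority = np.zeros_like(y_true)

acc = metrics.accuracy_score(y_true, y_majority)
rec = metrics.recall_score(y_true, y_majority)
# zero_division=0 avoids the warning raised when no positive is ever predicted.
f1 = metrics.f1_score(y_true, y_majority, zero_division=0)

# ravel() flattens the 2x2 confusion matrix in the order tn, fp, fn, tp.
tn, fp, fn, tp = metrics.confusion_matrix(y_true, y_majority).ravel()

print("accuracy:", acc, "recall:", rec, "f1:", f1)
print("tn, fp, fn, tp:", tn, fp, fn, tp)
```

The accuracy only reflects the share of the majority class, while recall and F1 expose that no positive sample is ever found.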

**Exercise 2**

1. Print the number of instances of class 0 and 1. Hint: you can use `np.unique` by setting a proper input parameter, have a look at the documentation.
2. Print the ratio of positive samples to total samples (the skewness).
3. Print the ratio of negative samples to total samples.
4. Is the dataset balanced?

In [None]:
# YOUR CODE HERE

It is good practice to compute the positive and negative ratios before training an ML model, to understand how imbalanced the data is.

In the next exercise we address the imbalance by weighting the minority class more heavily. This acts like oversampling: giving a larger weight to the minority class corresponds to duplicating its samples until they match the number of majority-class samples.
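
One common heuristic for choosing such weights is inverse class frequency. The sketch below uses scikit-learn's `compute_class_weight` helper on hypothetical labels (`y_toy` is made up for illustration); it is only one possible choice, and the weights could also be set by hand.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1.
y_toy = np.array([0] * 90 + [1] * 10)
classes = np.unique(y_toy)

# "balanced" weights are n_samples / (n_classes * count_of_each_class).
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)

# Rescaled so that the majority class has weight 1: the minority weight becomes
# the imbalance ratio, i.e. how many copies of each minority sample would be
# needed to match the majority class.
minority_weight = weights[1] / weights[0]
print({0: 1.0, 1: float(minority_weight)})
```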

**Exercise 3**
1. Train an SVM with class weights (see the documentation for the `class_weight` option).
2. The `class_weight` option takes a dict of the form `{0: weight_class_0, 1: weight_class_1}`; we can give weight 1 to the majority class and a larger weight to the minority class.
3. Print precision, recall, F1 score, accuracy and the confusion matrix.

In [None]:
# YOUR CODE HERE

## Plot Receiver Operating Characteristic and Precision/Recall curves

Scikit-learn provides utilities for computing the ROC and precision-recall curves: the `roc_curve` and `precision_recall_curve` functions in `sklearn.metrics`. In addition, the `auc` function returns the area under a curve given its x and y coordinates.
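
As a minimal sketch of how these functions fit together, the example below runs `roc_curve` and `auc` on a tiny binary problem. The labels and scores are made up for illustration.

```python
import numpy as np
from sklearn.metrics import roc_curve, auc

# Toy binary problem: true labels and the classifier's scores for the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# roc_curve sweeps a decision threshold over the scores and returns, for each
# threshold, the false positive rate and the true positive rate.
fpr, tpr, thresholds = roc_curve(y_true, scores)

# auc integrates tpr over fpr with the trapezoidal rule.
roc_auc = auc(fpr, tpr)

print("fpr:", fpr)
print("tpr:", tpr)
print("AUC:", roc_auc)
```

Note that `roc_curve` expects continuous scores (for example from `decision_function` or `predict_proba`), not hard class predictions.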

**Exercise 4**

Plot the ROC curve for each class of the Iris dataset. Understand and complete the following code.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn import svm, datasets
from sklearn.metrics import roc_curve, auc
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize
from sklearn.multiclass import OneVsRestClassifier


# Import the IRIS dataset by using the datasets.load_iris() function and inspect its attributes
# YOUR CODE HERE
iris = datasets.load_iris()
X = iris.data
y = iris.target

# Binarize the output
y = label_binarize(y, classes=[0, 1, 2])
n_classes = y.shape[1]

# Add noisy features to make the problem harder
random_state = np.random.RandomState(10)
n_samples, n_features = X.shape
X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]

# shuffle and split training and test sets: test size 50%, random_state=0
# YOUR CODE HERE

# Learn to predict each class against the other
classifier = OneVsRestClassifier(svm.SVC(kernel="linear", probability=True, random_state=random_state))

# Train the classifier and compute the scores of the test set (look at the SVM methods)
# YOUR CODE HERE

# Compute ROC curve and ROC area for each class
# Understand what the roc_curve() does and takes as input
fpr = dict()
tpr = dict()
th = dict()
roc_auc = dict()
for i in range(n_classes):
    # YOUR CODE HERE

# Compute micro-average ROC curve and ROC area
fpr["micro"], tpr["micro"], _ = roc_curve(y_test.ravel(), y_score.ravel())
roc_auc["micro"] = auc(fpr["micro"], tpr["micro"])

In [None]:
# Plot the ROC curve for each class with the AUC in the legend. Plot also the line from (0, 0) to (1, 1)
# YOUR CODE HERE

This task can also be performed with the `RocCurveDisplay` class in `sklearn.metrics`. The boolean parameter `plot_chance_level` (new in version 1.3) determines whether to plot the chance-level baseline.

In [None]:
from sklearn.metrics import RocCurveDisplay
import matplotlib.pyplot as plt

colors = ["aqua", "darkorange", "cornflowerblue"]
fig, ax = plt.subplots(figsize=(6, 6))

for class_id, class_name in enumerate(iris.target_names):
    RocCurveDisplay.from_predictions(
        y_test[:, class_id],
        y_score[:, class_id],
        name=f"ROC curve for {class_name}",
        color=colors[class_id],
        ax=ax)
    
_ = ax.set(
    xlabel="False Positive Rate",
    ylabel="True Positive Rate",
    title="Receiver Operating Characteristic\nto One-vs-Rest multiclass")

**Exercise 5**

Repeat the same exercise by plotting the precision-recall curve (look up the `precision_recall_curve` function in scikit-learn) together with the baseline P/(P + N), i.e. the fraction of positive samples.
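
As a hint on the ingredients, here is a minimal sketch on a toy binary problem (labels and scores are made up for illustration). It computes the curve, the average precision, and the baseline P/(P + N); plotting is left out.

```python
import numpy as np
from sklearn.metrics import average_precision_score, precision_recall_curve

# Toy binary problem: true labels and scores for the positive class.
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

# Precision and recall for each decision threshold; the last point is always
# (recall=0, precision=1) by convention.
precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Average precision summarizes the curve in a single number.
ap = average_precision_score(y_true, scores)

# Baseline: the precision of a classifier that labels everything as positive,
# i.e. the fraction of positive samples P / (P + N).
baseline = y_true.sum() / len(y_true)

print("precision:", precision)
print("recall:", recall)
print("average precision:", ap, "baseline:", baseline)
```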

In [None]:
# YOUR CODE HERE

This task can also be performed with the `PrecisionRecallDisplay` class in `sklearn.metrics`.

In [None]:
import matplotlib.pyplot as plt
from sklearn.metrics import average_precision_score, precision_recall_curve, PrecisionRecallDisplay

precision = dict()
recall = dict()
average_precision = dict()

fig, ax = plt.subplots(figsize=(6, 6))
ax.set_title("Precision Recall Curve\nto One-vs-Rest multiclass")  # call the method instead of overwriting it

colors = ["aqua", "darkorange", "cornflowerblue"]
for class_id, class_name in enumerate(iris.target_names):
    precision[class_id], recall[class_id], _ = precision_recall_curve(y_test[:, class_id], y_score[:, class_id])
    average_precision[class_id] = average_precision_score(y_test[:, class_id], y_score[:, class_id])
    display = PrecisionRecallDisplay(
        recall=recall[class_id],
        precision=precision[class_id],
        average_precision=average_precision[class_id],
    )
    display.plot(ax=ax, name=f"Precision-recall for class {class_name}", color=colors[class_id])

plt.show()

## Model Selection: K fold, cross validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called **overfitting**. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set `X_test`, `y_test`. Note that the word “experiment” is not intended to denote academic use only, because even in commercial settings machine learning usually starts out experimentally.

One problem with validation sets is that you "lose" some of the data: you use only, e.g., 3/4 of the data for training and 1/4 for validation. One option is K-fold cross-validation, where we split the data into K chunks and perform K fits, so that each chunk gets a turn as the validation set.

First, we need to understand how to split data into folds by using `KFold`:

In [1]:
import numpy as np
from sklearn.model_selection import KFold
import pandas as pd

data = pd.read_csv("glass.dat")
X = data.iloc[:, :-1].values
y = data['type'].values

kf = KFold(n_splits=3, shuffle=True, random_state=0)

for fold_id, (train_index, test_index) in enumerate(kf.split(X)):
    print(f"Fold id {fold_id} TRAIN: {train_index}")
    print(f"Fold id {fold_id} TEST: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(f"Label counts: {np.unique(y_train, return_counts=True)}")
    print("--------------------------------------------------------------------------------\n")

Fold id 0 TRAIN: [  0   1   2   3   6   9  10  11  14  17  19  20  21  23  25  27  28  29
  31  32  34  35  36  38  39  41  42  43  46  47  48  49  50  51  53  54
  57  58  59  62  64  65  67  68  69  70  72  73  77  78  79  81  82  84
  85  87  88  91  92  93  94  95  98  99 100 101 102 103 104 105 107 110
 113 114 115 116 117 118 119 120 121 123 124 125 127 128 130 131 132 133
 134 140 142 146 147 148 149 150 151 152 156 161 163 164 165 166 167 169
 171 172 173 174 175 177 178 179 181 182 183 184 185 186 187 188 190 192
 193 195 198 199 200 201 203 204 205 206 207 209 210 211 212 213]
Fold id 0 TEST: [  4   5   7   8  12  13  15  16  18  22  24  26  30  33  37  40  44  45
  52  55  56  60  61  63  66  71  74  75  76  80  83  86  89  90  96  97
 106 108 109 111 112 122 126 129 135 136 137 138 139 141 143 144 145 153
 154 155 157 158 159 160 162 168 170 176 180 189 191 194 196 197 202 208]
Label counts: (array([0, 1]), array([131,  11]))
------------------------------------------------

Notice that the class proportions are not the same in every fold. Use `StratifiedKFold` to obtain folds that preserve the class proportions of the whole dataset.

In [2]:
import numpy as np
from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)

for fold_id, (train_index, test_index) in enumerate(skf.split(X, y)):
    print(f"Fold id {fold_id} TRAIN: {train_index}")
    print(f"Fold id {fold_id} TEST: {test_index}")
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    print(f"Label counts: {np.unique(y_train, return_counts=True)}")
    print("--------------------------------------------------------------------------------\n")

Fold id 0 TRAIN: [  1   2   3   4   5   7   9  11  12  13  14  17  19  20  21  23  25  26
  27  28  32  33  35  36  38  39  40  43  44  47  48  49  50  51  52  53
  54  57  60  61  62  64  65  67  69  70  71  72  73  74  75  76  79  80
  82  83  84  88  89  92  93  94  95  96  97  98  99 100 101 104 105 106
 108 110 113 115 117 118 121 122 124 125 127 128 129 131 132 133 134 135
 136 137 138 140 141 146 147 149 150 151 154 156 157 160 161 164 165 167
 168 170 171 174 175 176 177 178 180 181 182 184 185 186 187 188 189 190
 191 193 194 195 196 197 200 201 202 203 207 208 209 210 211 213]
Fold id 0 TEST: [  0   6   8  10  15  16  18  22  24  29  30  31  34  37  41  42  45  46
  55  56  58  59  63  66  68  77  78  81  85  86  87  90  91 102 103 107
 109 111 112 114 116 119 120 123 126 130 139 142 143 144 145 148 152 153
 155 158 159 162 163 166 169 172 173 179 183 192 198 199 204 205 206 212]
Label counts: (array([0, 1]), array([134,   8]))
------------------------------------------------





When evaluating different settings (“hyperparameters”) for estimators, such as the C setting that must be manually set for an SVM, there is still a risk of overfitting on the test set because the parameters can be tweaked until the estimator performs optimally. This way, knowledge about the test set can “leak” into the model and evaluation metrics no longer report on generalization performance. To solve this problem, yet another part of the dataset can be held out as a so-called “validation set”: training proceeds on the training set, after which evaluation is done on the validation set, and when the experiment seems to be successful, final evaluation can be done on the test set.
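
A minimal sketch of such a three-way split, assuming a 60/20/20 partition on made-up data (the proportions and the names `X_tr`, `X_val`, `X_te` are illustrative choices, not prescribed by the lab):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 100 samples, 2 features, two balanced classes.
X_toy = np.arange(200).reshape(100, 2)
y_toy = np.array([0] * 50 + [1] * 50)

# First hold out the final test set (20% of all data).
X_trainval, X_te, y_trainval, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Then carve a validation set out of the remainder
# (25% of the remaining 80% = 20% of all data).
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, stratify=y_trainval, random_state=0
)

print(len(X_tr), len(X_val), len(X_te))
```

Hyperparameters would be tuned by comparing scores on `X_val`, and `X_te` would be used only once, for the final evaluation.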

However, by partitioning the available data into three sets, we drastically reduce the number of samples which can be used for learning the model, and the results can depend on a particular random choice for the pair of (train, validation) sets.

A solution to this problem is a procedure called cross-validation (CV for short). A test set should still be held out for final evaluation, but the validation set is no longer needed when doing CV. In the basic approach, called k-fold CV, the training set is split into k smaller sets. The following procedure is followed for each of the k “folds”:

- a model is trained using $k - 1$ of the folds as training data;

- the resulting model is validated on the remaining part of the data (i.e., it is used as a test set to compute a performance measure such as accuracy).

The performance measure reported by k-fold cross-validation is then the average of the values computed in the loop. This approach can be computationally expensive, but does not waste too much data (as is the case when fixing an arbitrary validation set), which is a major advantage in problems such as inverse inference where the number of samples is very small.
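
The procedure above can be sketched as an explicit loop. The example below uses synthetic data from `make_classification` and a linear SVM (both illustrative assumptions), and then compares the hand-written loop with `cross_val_score`, which performs the same steps internally.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=120, n_features=5, random_state=0)

model = SVC(kernel="linear", C=1)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in cv.split(X_toy, y_toy):
    # Train a fresh copy of the model on k - 1 folds...
    fold_model = clone(model)
    fold_model.fit(X_toy[train_idx], y_toy[train_idx])
    # ...and validate it on the remaining fold.
    y_val_pred = fold_model.predict(X_toy[val_idx])
    fold_scores.append(accuracy_score(y_toy[val_idx], y_val_pred))

# The reported performance is the average over the k folds.
manual_mean = np.mean(fold_scores)

# cross_val_score runs the same loop with the same splitter.
library_scores = cross_val_score(model, X_toy, y_toy, cv=cv, scoring="accuracy")
print(manual_mean, library_scores.mean())
```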

<br><br>
<img src="img/grid_search_cross_validation.png" style="width: 400px;"/>
<br><br>

Here is a flowchart of a typical cross-validation workflow in model training. The best hyperparameters can be determined by grid search:

<br><br>
<img src="img/grid_search_workflow.png" style="width: 400px;"/>
<br><br>
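
A minimal sketch of this workflow, assuming synthetic data and a logistic regression model with a small grid over `C` (all of these are illustrative choices, not the setup of the exercises that follow): the test set is held out first, the grid search cross-validates on the training set only, and the refitted best model is scored once on the test set.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic binary classification data, used only for illustration.
X_toy, y_toy = make_classification(n_samples=200, n_features=10, random_state=0)

# Hold out a test set for the final evaluation.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=0
)

# Candidate hyperparameter values: keys are parameter names of the estimator.
param_grid = {"C": [0.01, 0.1, 1, 10]}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

# Every candidate is evaluated with 5-fold CV on the training set; by default
# the best one is then refitted on the whole training set.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=cv, scoring="f1")
search.fit(X_tr, y_tr)

print("best params:", search.best_params_)
print("best mean CV f1:", search.best_score_)

# Final evaluation on the held-out test set, using the same scorer (f1).
test_f1 = search.score(X_te, y_te)
print("test f1:", test_f1)
```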

**Exercise 6**

Use the `google_play_store_apps_reviews_training` dataset to perform 5-fold cross-validation:

- Look at Lab 3 for the data import, preprocessing and splitting into train and test sets. Use `TfidfVectorizer` without stopwords;
- Combine a `StratifiedKFold` with `cross_val_score` from `sklearn.model_selection` to train a Support Vector Machine with a linear kernel. As the classification score, pass the F1 measure (see [here](https://scikit-learn.org/stable/modules/model_evaluation.html#common-cases-predefined-values)) to the `scoring` parameter of `cross_val_score` (see [here](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.cross_val_score.html));
- print the mean and standard deviation of the cross-validation scores.

In [None]:
### YOUR CODE HERE

**Exercise 7**

You can create a scikit-learn pipeline to chain several ML steps. A pipeline can be used like a normal classifier; here is an example:

In [3]:
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.model_selection import cross_val_score
from sklearn import preprocessing
from sklearn import svm, datasets
from sklearn.metrics import precision_recall_curve, auc

data = pd.read_csv('google_play_store_apps_reviews_training.csv')

# data cleaning
data = data.drop('package_name', axis=1)
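# regex=True is required in pandas >= 2.0, where str.replace defaults to literal matching
data['review'] = data['review'].str.strip().str.lower().str.replace(r'[^\w]', ' ', regex=True)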

# Split into training and testing data
X = data['review']
y = data['polarity']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

kf = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

clf_pipeline = make_pipeline(CountVectorizer(), svm.SVC(kernel='linear', C=1))
scores = cross_val_score(clf_pipeline, X_train, y_train, cv=kf, scoring='f1')
print(scores.mean(), scores.std())

0.6904193624647222 0.03892002616929922


Implement the following 3 pipelines for the above classification task and evaluate them with `cross_val_score`:
- `TfidfVectorizer()` and `svm.SVC(kernel='linear', C=1)`;
- `CountVectorizer()`, a `Normalizer()` feature-normalization step, and `svm.SVC(kernel='linear', C=1)`;
- `CountVectorizer()` and `svm.SVC(kernel='linear', C=1)`.

In [None]:
### YOUR CODE HERE

**Exercise 8**

Repeat Exercise 7 using `GridSearchCV` to perform an exhaustive search over a set of hyperparameters for an SVM with an 'rbf' kernel:
- use `TfidfVectorizer` as the vectorizer;
- use `StratifiedKFold` with 5 folds;
- define the hyperparameter grid as a Python dict: the keys are the hyperparameter names, the values are the candidate values for each hyperparameter;
- use `[1, 10, 100]` for C and `[0.01, 0.05, 0.1]` for gamma;
- run `GridSearchCV` with F1 as the score and try to run the folds in parallel (look at the documentation to see how);
- export the results to a Pandas dataframe and print it;
- which hyperparameter combination leads to the best average F1? What is the value of the best average F1?

In [None]:
### YOUR CODE HERE