# **Extra 3.1: Imbalanced Classification**

<hr>

## **1. Introduction**
In this additional practice, we are asked to:

> Create a binary classifier that learns to predict whether a lap has been deleted or not.

A lap is *"deleted"* when the driver commits an infraction during it, usually by exceeding track limits.

Our goal is to detect these types of laps using only data on *average speeds* at the *finish line* and *maximum speed in the sector*.

These speeds are typically altered when a driver does not comply with the sport regulations (exceeding limits, not respecting yellow flags, not reducing speed during the safety car, going too slow in exit or entry zones, etc.).

**In the following block of code, we load the data and retrieve the "Deleted" column that was removed during the preprocessing in practice 2.**

In [None]:
import pandas as pd

# Load files from Labs 2.2 and 3.1
data_full = pd.read_csv('https://raw.githubusercontent.com/AIC-Uniovi/Sistemas-Inteligentes/refs/heads/main/datasets/f1_23_monaco.csv')
data_reduced = pd.read_pickle('https://raw.githubusercontent.com/AIC-Uniovi/Sistemas-Inteligentes/refs/heads/main/datasets/f1_23_monaco.pkl')

# Merge missing columns without redundant rows
data_full['Time'] = data_full['Time'].astype(str)
data_reduced['Time'] = data_reduced['Time'].astype(str)
cols_to_add = [col for col in data_full.columns if col not in data_reduced.columns]
data = data_reduced.merge(data_full[['Time'] + cols_to_add], on = 'Time', how = 'left')
data = data.dropna(subset = ['SpeedI1'])

## **2. Preprocessing and Visualization**

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Create a new DataFrame <code>data_vel</code> with the relevant columns for our problem (<i>"SpeedI1", "SpeedI2", "SpeedFL", "SpeedST" and "Deleted"</i>). To simplify, we will only keep the <b>Williams</b> and <b>Alpine</b> teams.
</div>

In [None]:
teams = ['Williams', 'Alpine']
data_vel = data.loc[data['Team'].isin(teams), ['Team', 'SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST', 'Deleted']].copy()

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Add the "Class" column to the DataFrame.
</div>

In [None]:
# 1 = deleted lap (positive class), 0 = any other lap
data_vel['Class'] = (data_vel['Deleted'] == True).astype(int)

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Visualize the distribution of maximum speed in the sector ("SpeedST") for both teams.
</div>

In [None]:
import matplotlib.pyplot as plt
import seaborn as sns

# With a KDE
plt.figure(figsize = (10, 4))
sns.kdeplot(data = data_vel, x = 'SpeedST', hue = 'Team', fill = True)
plt.title('Distribution by team')
plt.show()

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Split the dataset into <code>data_vel_train</code> and <code>data_vel_test</code> using the <code>train_test_split</code> function. Use 30% for testing.
</div>

In [None]:
from sklearn.model_selection import train_test_split
seed = 2533
data_vel_train, data_vel_test = train_test_split(data_vel, test_size = 0.30, random_state = seed)
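
With a strongly imbalanced target, a purely random split can leave the test set with very few positive examples. One common remedy, which the exercise above does not use, is the `stratify` argument of `train_test_split`, which keeps the class proportions roughly equal in both partitions. The sketch below illustrates it on a hypothetical `toy` DataFrame (the data and names are assumptions made for illustration only):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 90 normal laps (0) and 10 deleted laps (1)
toy = pd.DataFrame({
    'SpeedST': range(100),
    'Class': [0] * 90 + [1] * 10,
})

# stratify preserves the 90/10 class ratio in both partitions
toy_train, toy_test = train_test_split(
    toy, test_size = 0.30, random_state = 2533, stratify = toy['Class']
)

print(toy_train['Class'].value_counts().to_dict())
print(toy_test['Class'].value_counts().to_dict())
```

Whether stratification is appropriate here depends on the goals of the practice; it is shown only as an option to consider.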

## **3. Baselines**

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Train the baselines <i>Random</i> and <i>Zero-R</i>. Remember to split the training data into X and Y first.
</div>

In [None]:
from sklearn.dummy import DummyClassifier

X_train = data_vel_train[['SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST']]
Y_train = data_vel_train['Class']

baseline_random = DummyClassifier(strategy = 'uniform', random_state = seed)
baseline_zeror = DummyClassifier(strategy = 'most_frequent')

baseline_random.fit(X_train, Y_train)
baseline_zeror.fit(X_train, Y_train)
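
To make the behaviour of the two baselines concrete, the following sketch fits them on hypothetical toy data (`X_toy` and `y_toy` are assumptions, not the lap data). Neither strategy looks at the features: `most_frequent` always returns the majority class, while `uniform` picks each class with equal probability.

```python
import numpy as np
from sklearn.dummy import DummyClassifier

# Hypothetical toy data: the features are irrelevant to both baselines
X_toy = np.zeros((1000, 1))
y_toy = np.array([0] * 950 + [1] * 50)

zeror = DummyClassifier(strategy = 'most_frequent').fit(X_toy, y_toy)
rand = DummyClassifier(strategy = 'uniform', random_state = 0).fit(X_toy, y_toy)

pred_z = zeror.predict(X_toy)
pred_r = rand.predict(X_toy)

# Zero-R only ever outputs the majority class
print('Zero-R predicted classes:', np.unique(pred_z))
# Random outputs class 1 about half of the time, regardless of class frequencies
print('Random share of 1s:', pred_r.mean())
```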

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Split the test data into X and Y and make predictions with each model (<code>pred_aleatorio</code> and <code>pred_zeror</code>).
</div>

In [None]:
X_test = data_vel_test[['SpeedI1', 'SpeedI2', 'SpeedFL', 'SpeedST']]
Y_test = data_vel_test['Class']

pred_aleatorio = baseline_random.predict(X_test)
pred_zeror = baseline_zeror.predict(X_test)

We now have our models trained and have made predictions on the test set, so let's move on to evaluating their performance in a more objective way.

For this, we will once again use the **"metrics"** module from the **"scikit-learn"** library.

Remember that the most relevant metrics in classification problems are the following:

<div style="width:800px;background:white;padding:10px">
    <img src="https://i.imgur.com/7WwY9bZ.jpeg" style="margin-bottom:10px"> </img>
</div>

For our specific problem, they would mean the following:

* **Accuracy:** The percentage of correct predictions over the total number of predictions. Useful when the classes are balanced.
* **Precision:** Of all the times the model predicted a 1 (deleted), how many times was it correct?
* **Recall:** Of all the laps that were actually deleted, how many did the model successfully detect?
* **F1-Score:** The "harmonic" mean between Precision and Recall.
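
As a small worked example of the four definitions above, the following sketch computes them by hand from confusion-matrix counts. The counts are hypothetical and chosen only to make the arithmetic easy to follow.

```python
# Hypothetical confusion-matrix counts for the positive class (1 = deleted)
tp, fp, fn, tn = 8, 2, 12, 178

accuracy = (tp + tn) / (tp + fp + fn + tn)          # correct predictions / all predictions
precision = tp / (tp + fp)                          # correct among predicted positives
recall = tp / (tp + fn)                             # detected among actual positives
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f'Accuracy:  {accuracy:.3f}')
print(f'Precision: {precision:.3f}')
print(f'Recall:    {recall:.3f}')
print(f'F1-score:  {f1:.3f}')
```

In this toy case Accuracy is high because the negative class dominates, even though fewer than half of the positive examples are detected.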

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Use the functions from <code>sklearn.metrics</code> to evaluate the <i>Random</i> and <i>Zero-R</i> models. Obtain <i>Accuracy</i>, <i>Precision</i>, <i>Recall</i>, and <i>F1-score</i> metrics, as well as the <i>Confusion Matrix</i>.
</div>

In [None]:
from sklearn import metrics

print('-' * 50)
print('Random')
print('-' * 50)
print('· Confusion matrix:')
print(metrics.confusion_matrix(Y_test, pred_aleatorio))
print('· Accuracy:', metrics.accuracy_score(Y_test, pred_aleatorio))
print('· Precision:', metrics.precision_score(Y_test, pred_aleatorio))
print('· Recall:', metrics.recall_score(Y_test, pred_aleatorio))
print('· F1 Score:', metrics.f1_score(Y_test, pred_aleatorio))
print()
print('-' * 50)
print('Zero-R')
print('-' * 50)
print('· Confusion matrix:')
print(metrics.confusion_matrix(Y_test, pred_zeror))
print('· Accuracy:', metrics.accuracy_score(Y_test, pred_zeror))
# Zero-R never predicts class 1, so Precision is undefined; report it as 0
print('· Precision:', metrics.precision_score(Y_test, pred_zeror, zero_division = 0))
print('· Recall:', metrics.recall_score(Y_test, pred_zeror))
print('· F1 Score:', metrics.f1_score(Y_test, pred_zeror))

The results obtained, except for Accuracy, leave much to be desired.

As you may remember, Accuracy is not a good metric when we are dealing with a **class imbalance**.
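
A minimal sketch of why Accuracy misleads under imbalance, using hypothetical labels (95 negatives, 5 positives) and a predictor that always answers 0, as Zero-R would:

```python
from sklearn import metrics

# Hypothetical labels: 95 normal laps and 5 deleted laps
y_true = [0] * 95 + [1] * 5
# Always predict the majority class, as Zero-R does
y_pred = [0] * 100

acc = metrics.accuracy_score(y_true, y_pred)
rec = metrics.recall_score(y_true, y_pred)

print('Accuracy:', acc)
print('Recall:', rec)
```

The Accuracy equals the share of the majority class, while no positive example is ever detected.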

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Obtain the number of examples of each class (positive and negative) in the <code>data_vel</code> set without splitting.
</div>

In [None]:
data_vel.groupby('Class').size()

This imbalance occurs because there are many more "normal" laps (class $0$) than "deleted" laps (class $1$).
Therefore, we cannot rely on Accuracy and will need to train a different model that maximizes one of the following:

- **Recall**, to avoid false negatives (deleted laps that the model did not detect).
- **Precision**, to avoid false positives (normal laps that the model classifies as deleted).
- **F1-score**, if we are interested in balancing the two previous metrics.

Now we will try to train more complex models in order to improve these metrics.

## **4. Learning**

Unlike the baselines, the models we train next do use the input data ($X$) to predict the outputs ($Y$); therefore, from now on, it is crucial to standardize this data.

Remember that this helps prevent variables measured on larger scales from dominating those measured on smaller scales, which would slow down the learning of the models.

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Use the <a href="https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html"><code>StandardScaler()</code></a> from <i>sklearn</i> to standardize the train and test X data. Store the new data in <code>X_train_std</code> and <code>X_test_std</code>.
    <hr>
    Remember that the test data is unknown to us, so it cannot influence the calculation of the mean and standard deviation.
    <hr>
    How to use <code>StandardScaler()</code>:
    <ul>
        <li> <code>fit_transform(X_train)</code>: This should only be done with the training data. This method performs two steps: it obtains the means and standard deviations of each column (<code>fit()</code>) and standardizes based on them (<code>transform()</code>).</li>
        <li> <code>transform(X_test)</code>: It uses the means and standard deviations calculated from the train set (with <code>fit()</code>) to standardize.</li>
    </ul>
</div>

In [None]:
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_std = scaler.fit_transform(X_train)
X_test_std = scaler.transform(X_test)

print(X_train_std.mean(axis = 0))
print(X_train_std.std(axis = 0))

print(X_test_std.mean(axis = 0))
print(X_test_std.std(axis = 0))
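
The sketch below, on hypothetical toy speeds, shows what `transform` does with the test data: it subtracts the training mean and divides by the training standard deviation. This is why the training columns end up with mean 0 and standard deviation 1, while the test columns are only approximately standardized.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy speeds (km/h) for two features
X_tr = np.array([[280.0, 150.0], [290.0, 160.0], [300.0, 170.0]])
X_te = np.array([[310.0, 155.0]])

toy_scaler = StandardScaler().fit(X_tr)

# Manual version: the statistics come from the training data only
mu = X_tr.mean(axis = 0)
sigma = X_tr.std(axis = 0)  # population standard deviation (ddof = 0)
X_te_manual = (X_te - mu) / sigma

print(toy_scaler.transform(X_te))
print(X_te_manual)
```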

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Train various models using the functions defined previously in practice 3.1.
</div>

In [None]:
def train_and_eval_model(model_name, model, X_train_std, Y_train, X_test_std, Y_test):
    # Train the model
    model.fit(X_train_std, Y_train.squeeze())
    # Predictions
    Y_train_pred = model.predict(X_train_std)
    Y_test_pred = model.predict(X_test_std)
    # Calculate metrics for train
    tr_accuracy = metrics.accuracy_score(Y_train, Y_train_pred)
    tr_precision = metrics.precision_score(Y_train, Y_train_pred, zero_division = 0)
    tr_recall = metrics.recall_score(Y_train, Y_train_pred)
    tr_f1 = metrics.f1_score(Y_train, Y_train_pred)
    
    # Calculate metrics for test
    tst_accuracy = metrics.accuracy_score(Y_test, Y_test_pred)
    tst_precision = metrics.precision_score(Y_test, Y_test_pred, zero_division = 0)
    tst_recall = metrics.recall_score(Y_test, Y_test_pred)
    tst_f1 = metrics.f1_score(Y_test, Y_test_pred)
    return (model_name, tr_accuracy, tr_precision, tr_recall, tr_f1, tst_accuracy, tst_precision, tst_recall, tst_f1)

In [None]:
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC


def train_and_eval(X_train_std, Y_train, X_test_std, Y_test):
    # Create a list to store the results of each model
    all_results = []
    
    # Random baseline
    the_model = DummyClassifier(strategy = 'uniform', random_state=seed)
    model_results = train_and_eval_model('Random', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    # Zero-R baseline
    the_model = DummyClassifier(strategy = 'most_frequent')
    model_results = train_and_eval_model('Zero-R', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    # Logistic Regression
    the_model = LogisticRegression()
    model_results = train_and_eval_model('Log. Reg.', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    # KNN
    the_model = KNeighborsClassifier(n_neighbors = 3)
    model_results = train_and_eval_model('KNN', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    # Decision Tree
    the_model = DecisionTreeClassifier(random_state = seed, max_depth = 2)
    model_results = train_and_eval_model('Tree', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    # Linear SVM
    the_model = SVC(kernel = 'linear')
    model_results = train_and_eval_model('Linear SVM', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    # Polynomial SVM
    the_model = SVC(kernel = 'poly', degree = 2, coef0 = 1)
    model_results = train_and_eval_model('Poly SVM', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    
    # Print the resulting dataframe
    multi_index = pd.MultiIndex.from_tuples([('Model', 'Name'), ('Train', 'Accuracy'), ('Train', 'Precision'), ('Train', 'Recall'), ('Train', 'F1'), ('Test', 'Accuracy'), ('Test', 'Precision'), ('Test', 'Recall'), ('Test', 'F1')])    
    all_results = pd.DataFrame(all_results, columns = multi_index)
    display(all_results)

train_and_eval(X_train_std, Y_train, X_test_std, Y_test)

## **5. Analysis of Results**

**What are the reasons for the models' poor performance?**

With such a class imbalance, the models **learn to predict only the majority class (0)**, generating high Accuracy values but low Precision, Recall, and F1 scores. 

This explains the poor performance in metrics relevant to the positive class, despite an acceptable Accuracy.

The decision tree stands out slightly: it detects some positives in training, although it fails to do so on the test set.

No other model correctly detects the positive class. Logistic Regression, KNN, and both SVMs have zero Recall and F1 scores, because they predict only the majority class.

Another factor that might be affecting performance is the insufficiency of the input data ($X$). We may need to add another column from our dataset as input to improve performance.

<div class="alert alert-block alert-info">
    <b>Exercise:</b> Modify a hyperparameter of the best model to try to improve its result.
</div>

In [None]:
def train_and_eval(X_train_std, Y_train, X_test_std, Y_test):
    # Create a list to store the results of each model
    all_results = []

    # Decision tree with class_weight = 'balanced' (max_depth could also be increased)
    the_model = DecisionTreeClassifier(random_state = seed, max_depth = 3, class_weight = 'balanced', min_samples_split = 4, min_samples_leaf = 2)
    model_results = train_and_eval_model('Tree', the_model, X_train_std, Y_train, X_test_std, Y_test)
    all_results.append(model_results)
    
    # Print the resulting dataframe
    multi_index = pd.MultiIndex.from_tuples([ ('Model', 'Name'), ('Train', 'Accuracy'), ('Train', 'Precision'), ('Train', 'Recall'), ('Train', 'F1'), ('Test', 'Accuracy'), ('Test', 'Precision'), ('Test', 'Recall'), ('Test', 'F1') ])    
    all_results = pd.DataFrame(all_results, columns = multi_index)
    display(all_results)

train_and_eval(X_train_std, Y_train, X_test_std, Y_test)
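
The `class_weight = 'balanced'` option used in the tree weights each class inversely to its frequency, following the formula given in the scikit-learn documentation: `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that computation on hypothetical labels (`y_toy` is an assumption) and compares it with scikit-learn's `compute_class_weight` helper.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 90 normal laps and 10 deleted laps
y_toy = np.array([0] * 90 + [1] * 10)

# Manual computation of the 'balanced' weights
n_classes = 2
manual_weights = len(y_toy) / (n_classes * np.bincount(y_toy))

# Same computation through scikit-learn's helper
sk_weights = compute_class_weight(class_weight = 'balanced', classes = np.array([0, 1]), y = y_toy)

print('Manual: ', manual_weights)
print('sklearn:', sk_weights)
```

With these toy labels, an error on a minority example counts nine times as much as an error on a majority example.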

As we can see, even after changing the hyperparameters, the class imbalance is so extreme that the model still cannot perform well.

Two possible solutions in this case are the previously mentioned inclusion of an additional column (perhaps the lap or sector times might help us) and applying an **oversampling** strategy.

<div class="alert alert-block alert-warning">
    <strong>Oversampling:</strong> A technique to balance the number of examples in each class by repeating the minority samples.
</div>
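
A minimal sketch of random oversampling with NumPy, assuming hypothetical arrays `X_toy` and `y_toy` that stand in for the training partition. Minority rows are drawn with replacement until both classes have the same number of examples. This is one simple variant, not necessarily the one intended for the next practice.

```python
import numpy as np

rng = np.random.default_rng(2533)

# Hypothetical training data: 90 normal laps and 10 deleted laps, 4 features
X_toy = rng.normal(size = (100, 4))
y_toy = np.array([0] * 90 + [1] * 10)

majority_idx = np.flatnonzero(y_toy == 0)
minority_idx = np.flatnonzero(y_toy == 1)

# Draw minority rows with replacement until both classes have the same size
n_extra = len(majority_idx) - len(minority_idx)
extra_idx = rng.choice(minority_idx, size = n_extra, replace = True)

balanced_idx = np.concatenate([majority_idx, minority_idx, extra_idx])
X_bal, y_bal = X_toy[balanced_idx], y_toy[balanced_idx]

print(np.bincount(y_bal))
```

Oversampling would normally be applied to the training partition only, after the train/test split, so that repeated copies of a lap never appear in the test set.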