# Session 10: Cross-Validation

Let's put into practice what we learned about cross-validation.

In this case, we are going to use the `K-Fold` cross-validation technique to evaluate the performance of a model.

We will be trying different combinations of hyperparameters for a `RandomForestClassifier` model within the `K-Fold` cross-validation. At the end, we will choose the best hyperparameters based on the average accuracy of the model.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix, precision_score, recall_score
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import cross_validate
import pandas as pd
import numpy as np
import plotly.express as px
```


```python
data = pd.read_csv('/Users/dgh/Desktop/IE/ie-pda/pda2_pt/data/shipping.csv')

# Normalize column names: dots and spaces become underscores, all lowercase
data.columns = [col.replace('.', '_').replace(' ', '_').lower() for col in data.columns]

data.head()
```

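|   | id | warehouse_block | mode_of_shipment | customer_care_calls | customer_rating | cost_of_the_product | prior_purchases | product_importance | gender | discount_offered | weight_in_gms | reached_on_time_y_n |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | D | Flight | 4 | 2 | 177 | 3 | low | F | 44 | 1233 | 1 |
| 1 | 2 | F | Flight | 4 | 5 | 216 | 2 | low | M | 59 | 3088 | 1 |
| 2 | 3 | A | Flight | 2 | 2 | 183 | 4 | low | M | 48 | 3374 | 1 |
| 3 | 4 | B | Flight | 3 | 3 | 176 | 4 | medium | M | 10 | 1177 | 1 |
| 4 | 5 | C | Flight | 2 | 2 | 184 | 3 | medium | F | 46 | 2484 | 1 |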


## The dataset

In this dataset, we have information about different orders shipped by an e-commerce company. The dataset contains the following columns:

- **ID:** ID Number of Customers.
- **Warehouse block:** The Company has a big Warehouse which is divided into blocks such as A, B, C, ...
- **Mode of shipment:** The Company ships the products in multiple ways such as Ship, Flight, and Road.
- **Customer care calls:** The number of calls made for enquiry of the shipment.
- **Customer rating:** The rating given by each customer. 1 is the lowest (worst), 5 is the highest (best).
- **Cost of the product:** Cost of the Product in US Dollars.
- **Prior purchases:** The Number of Prior Purchases.
- **Product importance:** The company has categorized each product's importance as low, medium, or high.
- **Gender:** Male and Female.
- **Discount offered:** Discount offered on that specific product.
- **Weight in gms:** It is the weight in grams.
- **Reached on time:** It is the target variable, where 1 indicates that the product has NOT reached on time and 0 indicates it has reached on time.

## Question 1

Separate the dataset into features and target variable.
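
As a starting point, here is a minimal sketch of one way to do this. It uses a tiny hand-made frame in place of `data` (the rows are illustrative, not read from the CSV), and the names `toy`, `x` and `y` are assumptions.

```python
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "mode_of_shipment": ["Flight", "Ship", "Road", "Ship"],
    "weight_in_gms": [1233, 3088, 3374, 1177],
    "reached_on_time_y_n": [1, 1, 0, 1],
})

target_col = "reached_on_time_y_n"
x = toy.drop(columns=[target_col])  # features: every column except the target
y = toy[target_col]                 # target: 1 = NOT on time, 0 = on time
```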

## Question 2

Is the dataset balanced?
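
One way to check is to look at the share of each class in the target. The sketch below uses a made-up target series; with the real data, the same call would be applied to the target column.

```python
import pandas as pd

# Hypothetical target values, for illustration only.
y = pd.Series([1, 1, 0, 1, 0, 1, 1, 0, 1, 0], name="reached_on_time_y_n")

# Fraction of rows in each class; values far from an even split suggest imbalance.
class_share = y.value_counts(normalize=True)
print(class_share)
```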

## Question 3

Dealing with the ID: should we keep it in the dataset?

If there are several rows with the same ID then it might be important to understand recurrence. If not, we can drop it.

Please check if there are any repeated IDs in the dataset. If there are, keep the column, otherwise drop it.
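
The check can be sketched as follows, assuming the column is called `id` after the renaming step. The frame below is a toy example, not the real dataset.

```python
import pandas as pd

# Toy stand-in for `data`; the values are made up for illustration.
toy = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "weight_in_gms": [1233, 3088, 3374, 1177],
})

# duplicated() marks every repeat of a value that already appeared earlier.
has_repeated_ids = toy["id"].duplicated().any()

if not has_repeated_ids:
    # Every ID is unique, so the column carries no recurrence information.
    toy = toy.drop(columns=["id"])
```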

## Question 4 

Which are the categorical columns in the dataset?
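
A quick way to list candidates is to select the non-numeric columns. This is a sketch on a toy frame; note that a numeric column can still be categorical in meaning, so the result deserves a manual look.

```python
import pandas as pd

# Toy stand-in for the features; the values are made up for illustration.
toy = pd.DataFrame({
    "warehouse_block": ["D", "F", "A", "B"],
    "mode_of_shipment": ["Flight", "Ship", "Road", "Ship"],
    "weight_in_gms": [1233, 3088, 3374, 1177],
})

# Keep only the columns whose dtype is not numeric.
categorical_cols = toy.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```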

## Question 4 (continued)

Encode the categorical variables, using the `OneHotEncoder` or `LabelEncoder` from `sklearn`.

## Question 5

Split the dataset into training and test sets.
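
The split could look like the sketch below. The data is synthetic (generated with `make_classification`), and the 80/20 ratio, `stratify=y` and `random_state=42` are assumptions rather than requirements of the exercise.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the encoded features and target.
x, y = make_classification(n_samples=200, n_features=8, random_state=0)

# stratify=y keeps the class proportions similar in both sets.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)
```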

## Question 6

Create a `RandomForestClassifier` model and use the `K-Fold` cross-validation technique to evaluate the model, without changing the hyperparameters.

* Use accuracy as the metric to evaluate the model with `cross_validate`.
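    * Accuracy is the number of correct predictions divided by the total number of predictions; `accuracy_score` computes it from the true and predicted labels.
    $$ \text{accuracy} = \frac{TP + TN}{TP + TN + FP + FN} $$
    where $TP$, $TN$, $FP$ and $FN$ are the counts of true positives, true negatives, false positives and false negatives.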

* Use 10 folds in the cross-validation.
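
To make the accuracy formula concrete, this small sketch computes it by hand from a confusion matrix and compares it with `accuracy_score`. The labels are made up for illustration.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true and predicted labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

# For binary labels, ravel() returns the counts in the order TN, FP, FN, TP.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

manual_accuracy = (tp + tn) / (tp + tn + fp + fn)
print(manual_accuracy, accuracy_score(y_true, y_pred))
```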

About classification metrics: https://scikit-learn.org/stable/modules/model_evaluation.html#classification-metrics
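
A minimal sketch of the cross-validation call, on synthetic data standing in for the training set. Passing `cv=10` with a classifier makes scikit-learn use stratified folds; `random_state=42` is only there to make the run repeatable.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for x_train / y_train.
x_train, y_train = make_classification(n_samples=300, n_features=8, random_state=0)

# Default hyperparameters, as the question asks.
classifier = RandomForestClassifier(random_state=42)

cv_results = cross_validate(classifier, x_train, y_train, cv=10, scoring="accuracy")

# One accuracy value per fold; the mean summarizes the model.
fold_scores = cv_results["test_score"]
print(fold_scores.mean(), fold_scores.std())
```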

## Brief intro to random forests

Random forests are an ensemble learning method that can be used for classification and regression.

It works by training several decision trees and outputting the majority class (classification) or the mean prediction (regression) of the individual trees.

The idea behind random forests is that each tree might overfit the data in a different way, so averaging the predictions of many trees reduces the variance of the model. To keep the trees different from one another, each tree is trained on a bootstrap sample of the training rows (sampled with replacement), and each split considers only a random subset of the variables. This reduces the correlation between the trees and makes the model more robust.
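
The sketch below illustrates the averaging step on synthetic data. In scikit-learn's implementation, the forest averages the class probabilities predicted by its trees (rather than counting hard votes), so the mean of the per-tree probabilities should match the forest's own `predict_proba`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, for illustration only.
x, y = make_classification(n_samples=200, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(x, y)

# Collect each fitted tree's class probabilities and average them.
tree_probas = np.stack([tree.predict_proba(x) for tree in forest.estimators_])
mean_proba = tree_probas.mean(axis=0)

print(np.allclose(mean_proba, forest.predict_proba(x)))
```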

The three hyperparameters that we are going to use in this exercise are:

- `n_estimators`: the number of trees in the forest (a positive integer).
    * More trees give more stable predictions, with diminishing returns and a higher training cost.
- `max_depth`: the maximum depth of each tree (a positive integer, or `None` for no limit).
    * The deeper the trees, the more complex the model and the more likely it is to overfit.
- `max_features`: the number of variables to consider when looking for the best split (an integer between 1 and the number of training variables).
    * The higher the number of features, the more similar the trees become and the more likely the forest is to overfit. This can be partly compensated with a higher `n_estimators`, since averaging over more trees smooths out the errors of individual trees.
    * A good rule of thumb for classification is to use the square root of the number of features, rounded down. In scikit-learn this is what `max_features="sqrt"` does.

## Question 7

Change the hyperparameters of the model and then use the `K-Fold` cross-validation technique to evaluate the model.

Using the following hyperparameters (more info [here](https://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html)):

* `n_estimators`: The number of trees in the forest.
* `max_depth`: The maximum depth of the tree.
* `max_features`: The number of variables from the training data to consider when looking for the best split.

Choose three combinations of hyperparameters, for example, `n_estimators=100`, `max_depth=10`, `max_features="sqrt"`, and then repeat the cross-validation process with the same value of `K` (10 folds) for each combination of hyperparameters.

Hint: in order to change the hyperparameters, you can create a new model with the desired hyperparameters.

```python
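classifier = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    max_depth=10,      # maximum depth of each tree
    # square root of the number of features, rounded down to a plain int
    max_features=int(np.sqrt(x_train.shape[1])),
)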
```
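
Repeating the process for several combinations can be sketched as a loop. The three parameter sets below are arbitrary examples, the data is synthetic, and the names `experiments` and `mean_accuracy` are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for x_train / y_train.
x_train, y_train = make_classification(n_samples=300, n_features=9, random_state=0)

# Three hypothetical hyperparameter combinations to compare.
experiments = [
    {"n_estimators": 50, "max_depth": None, "max_features": 3},
    {"n_estimators": 50, "max_depth": 3, "max_features": 3},
    {"n_estimators": 100, "max_depth": 10, "max_features": 3},
]

mean_accuracy = {}
for i, params in enumerate(experiments, start=1):
    # A fresh model per experiment, evaluated with the same 10 folds setting.
    classifier = RandomForestClassifier(random_state=42, **params)
    scores = cross_validate(classifier, x_train, y_train, cv=10, scoring="accuracy")
    mean_accuracy[i] = scores["test_score"].mean()

# Experiment number with the highest average validation accuracy.
best_experiment = max(mean_accuracy, key=mean_accuracy.get)
print(mean_accuracy, best_experiment)
```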

## Question 8

What was the best combination of hyperparameters in your opinion? Why?

Experiment 3 was the best in terms of accuracy in the validation sets.
* Experiment 1 seemed to overfit the data, while experiment 2 was not as good as experiment 3.
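
One way to back up a claim about overfitting is to compare training and validation accuracy across the folds. The sketch below does this on synthetic data with `return_train_score=True`; a large gap between the two means suggests the model is memorizing the training folds.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

# Synthetic stand-in for x_train / y_train.
x_train, y_train = make_classification(n_samples=300, n_features=8, random_state=0)

classifier = RandomForestClassifier(n_estimators=50, random_state=42)

cv_results = cross_validate(
    classifier, x_train, y_train,
    cv=10, scoring="accuracy", return_train_score=True,
)

train_mean = cv_results["train_score"].mean()
validation_mean = cv_results["test_score"].mean()

# Difference between accuracy on seen and unseen folds.
gap = train_mean - validation_mean
print(train_mean, validation_mean, gap)
```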

## Question 9

Now, use the best hyperparameters to train the model and evaluate it on the test set.

Print the accuracy of the model on the test set, and the confusion matrix.

Post the accuracy and confusion matrix of the model on the forum!
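
The final evaluation could be sketched as follows. The data is synthetic and `best_params` is a placeholder; substitute the combination you selected.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real features and target.
x, y = make_classification(n_samples=300, n_features=8, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.2, stratify=y, random_state=42
)

# Placeholder values, not the actual best combination.
best_params = {"n_estimators": 100, "max_depth": 10, "max_features": 3}

# Train on the full training set, then predict the held-out test set once.
model = RandomForestClassifier(random_state=42, **best_params).fit(x_train, y_train)
y_pred = model.predict(x_test)

test_accuracy = accuracy_score(y_test, y_pred)
cm = confusion_matrix(y_test, y_pred)  # rows: true class, columns: predicted class
print(test_accuracy)
print(cm)
```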

## Question 10

What can you say about the model's performance on the test set? Check the recall and precision of the model to try and explain the model's performance.
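
Precision and recall can be computed with `precision_score` and `recall_score`. The labels below are made up for illustration; by default both functions treat class 1 (here, "not on time") as the positive class.

```python
from sklearn.metrics import precision_score, recall_score

# Hypothetical true and predicted labels.
y_test = [1, 1, 1, 0, 0, 1, 0, 1]
y_pred = [1, 0, 1, 0, 1, 1, 0, 0]

# Precision: of the orders predicted as late, how many really were late.
precision = precision_score(y_test, y_pred)
# Recall: of the orders that really were late, how many the model caught.
recall = recall_score(y_test, y_pred)
print(precision, recall)
```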