# COMP0189: Applied Artificial Intelligence
## Week 3 (Model Selection and Assessment)

### After this week you will be able to ...
- encode categorical values with one-hot encoding
- know which encoding, scaling, and imputing method you should select in accordacne with the dataset characteristics
- impute missing data with KNN
- know how to streamline the preprocessing steps in advanced way (Pipeline and ColmnTransformer)
- perform model selection using different cross-validation methods
- perform model selection and model assessment using different partitions of the data

### Acknowledgements
- https://scikit-learn.org/stable/
- https://archive.ics.uci.edu/ml/datasets/adult

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Part 1: Encoding and Imputations

### Task 1: Load and Split the Dataset into train and test

In [2]:
# TASK 1: Load Dataset
# We are going to use the same adult dataset as previous week.
# We have cleaned the dataset, but did not touch the missing values.
from sklearn.model_selection import train_test_split
df = pd.read_csv("clean_adult.csv")
df

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Y
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [3]:
sorted(list(df["Education-num"].unique()))

[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16]

In [4]:
len(df["Education"].unique())

16

In [5]:
df = df.drop(columns=["Education"], axis=1) 
# remove Education column since its label encoded version is already present.

In [6]:
df

Unnamed: 0,Age,Workclass,Fnlwgt,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Y
0,39,State-gov,77516,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K


In [7]:
def train_test_split_df(df, test_size = 0.1, target_cols = ["Y"]):
    """
    This function splits the dataframe into train and test sets.
    It also separates the target column from the feature columns.
    """
    df_data = df.drop(columns=target_cols, axis=1)
    df_target = df[target_cols]

    split_index = int((1 - test_size) * len(df_data))

    train_X_df = df_data[:split_index]
    train_Y_df = df_target[:split_index]
    test_X_df = df_data[split_index:]
    test_Y_df = df_target[split_index:]

    train_Y_df = np.where(train_Y_df == "<=50K", 0, 1)
    test_Y_df = np.where(test_Y_df == "<=50K", 0, 1)

    return train_X_df, train_Y_df, test_X_df, test_Y_df

In [8]:
train_X_df, train_Y_df, test_X_df, test_Y_df = train_test_split_df(df) # splits the data into train and test sets.

In [9]:
for col in train_X_df:

    if col in [
        "Age",
        "Fnlwgt",
        "Capital-gain",
        "Capital-loss",
        "Hours-per-week"
    ]:
        continue
    train_uniq_vals = train_X_df[col].unique()
    test_uniq_vals = test_X_df[col].unique()

    if not set(test_uniq_vals).issubset(set(train_uniq_vals)): 
        # False: test has unexpected elements not present in the training set.
        print(col, "has values that are not in train set")

### Task 2: Encode categorical variables (label/ordinal encoding & one-hot encoding)

### Important: We need special care when we are encoding categorical variables

**1. Take care of the missing values**
- Beware not to encode missing values unless you are intending to do so.
- Sometimes you want to encode missing values to a separate cateogory. For example, when you want to predict if passengers of titanic had survived or not, missing data of certain features can actually have meaning, i.e., Cabin information can be missing because the body was not found.

**2. Know which encoding and scaling method you should select**
- If your categories are ordinal, then it makes sense to use a LabelEncoder with a MinMaxScaler. For example, you can encode [low, medium, high], as [1,2,3], i.e., distance between low to high is larger than that of medium and high.

- However, if you have non-ordinal categorical values, like [White, Hispanic, Black, Asian], then it would be better to use a OneHotEncoder instead of forcing ordinality with a LabelEncoder. Otherwise the algorithms you use (especially distance based algorithms like KNN) will make the assumption that the distance between White and Asian is larger than White and Hispanic, which is nonsensical.

**3. Split before you encode to avoid data leakage**
- Split the dataset before you encode your data. It is natural for algorithms to see unknown values in the validation/test set that was not appearing in the train set. `sklearn.preprocessing.OneHotEncoder` is good at handling these unknown categories (`handle_unknown` parameter).

- Discussion: What if you are certain about all the possible categories that can appear for each feature? Can you encode all the values before splitting the dataset into train and test set?


This notebook shows the three points in the following sections with examples.

### Task 2-1: Label Encoding (with missing values)

In [4]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

label_encoder = LabelEncoder()

In [5]:
# display the columns
ordinal_columns = ['Education', 'Education-num', 'Marital-status', 'Relationship',
                   'Race', 'Sex']

In [6]:
label_encoded_df = df.copy()
label_encoded_df[ordinal_columns] = df[ordinal_columns].apply(label_encoder.fit_transform)

In [7]:
label_encoded_df[ordinal_columns]

Unnamed: 0,Education,Education-num,Marital-status,Relationship,Race,Sex
0,9,12,4,1,4,1
1,9,12,2,0,4,1
2,11,8,0,1,4,1
3,1,6,2,0,2,1
4,9,12,2,5,2,0
...,...,...,...,...,...,...
32556,7,11,2,5,4,0
32557,11,8,2,0,4,1
32558,11,8,6,4,4,0
32559,11,8,4,3,4,1


### Task 2-2: One Hot Encoding (with missing values imputation)

Tip 1: Impute the missing values (choose the right strategy) before doing OHE  
Tip 2: Try creating a separate dataframe with one-hot encoded columns and combine the dataframe with the original dataframe for the final one.

In [8]:
# Let's first impute the missing values.
# Since it's a categorical value, we don't use KNN or mean imputation.
# We will replace with the most frequent value.
from sklearn.impute import SimpleImputer

# most frequent imputation since the method could be used for both string 
# and numerical columns. 
mode_imputer = SimpleImputer(strategy='most_frequent')

missing_columns = df.columns

missing_df = df.copy()

missing_df[missing_columns] = mode_imputer.fit_transform(df)

In [9]:
print(missing_df.isnull().sum())

Age               0
Workclass         0
Fnlwgt            0
Education         0
Education-num     0
Marital-status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Capital-gain      0
Capital-loss      0
Hours-per-week    0
Native-country    0
Y                 0
dtype: int64


In [10]:
missing_df.head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Y
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K


In [11]:
onehot_encoder = OneHotEncoder()

In [12]:
def apply_onehot_encoding(df, columns):
    # Perform One-Hot Encoding
    encoded_data = onehot_encoder.fit_transform(df[columns]).toarray()
    column_names = onehot_encoder.get_feature_names_out(columns)
    df_encoded = pd.DataFrame(encoded_data, columns=column_names)

    # Reset indices to ensure alignment
    df_reset = df.reset_index(drop=True)
    df_encoded_reset = df_encoded.reset_index(drop=True)

    # Drop original columns and concatenate the new One-Hot Encoded columns
    return pd.concat([df_reset.drop(columns, axis=1), df_encoded_reset], axis=1)

In [13]:
# categorical_columns = ['Workclass', 'Occupation', 'Native-country']
# experiment the code with the every categorical columns from the dataset
categorical_columns = ['Workclass', 'Education', 'Marital-status', 'Occupation', 'Relationship', 'Race', 'Sex', 'Native-country']

In [14]:
onehot_encoded_df = apply_onehot_encoding(missing_df, categorical_columns)

In [15]:
onehot_encoded_df.head()

Unnamed: 0,Age,Fnlwgt,Education-num,Capital-gain,Capital-loss,Hours-per-week,Y,Workclass_Federal-gov,Workclass_Local-gov,Workclass_Never-worked,...,Native-country_Portugal,Native-country_Puerto-Rico,Native-country_Scotland,Native-country_South,Native-country_Taiwan,Native-country_Thailand,Native-country_Trinadad&Tobago,Native-country_United-States,Native-country_Vietnam,Native-country_Yugoslavia
0,39,77516,13,2174,0,40,<=50K,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,50,83311,13,0,0,13,<=50K,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,38,215646,9,0,0,40,<=50K,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,53,234721,7,0,0,40,<=50K,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,28,338409,13,0,0,40,<=50K,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


In [16]:
final_df = pd.concat([df, onehot_encoded_df], axis=1) # combines the original dataframe with the onehot encoded dataframe for a dataset that contains all the needed information. 

In [17]:
final_df.head()

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,...,Native-country_Portugal,Native-country_Puerto-Rico,Native-country_Scotland,Native-country_South,Native-country_Taiwan,Native-country_Thailand,Native-country_Trinadad&Tobago,Native-country_United-States,Native-country_Vietnam,Native-country_Yugoslavia
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### Side Note: Data Imputation with KNN
For the adult dataset, missing data present only in categorical values, so imputing strategy that makes floating point may not make sense.
However, for continuous values, you can use various imputation strategies, such as taking simple mean or mean value from K nearest neighbors (KNN).
If you use `sklearn.imput.KNNImputer`, each sample’s missing values are imputed using the `mean` value from `n_neighbors` nearest neighbors found in the training set.
If you want to use `mode` value from neighbors (for categorical data imputation) you need to implement the imputer by yourself.

- `sklearn-pandas` package (https://pypi.org/project/sklearn-pandas/1.5.0/) provides `CategoricalImputer` class, which is suitable for such processing

Here, we use iris dataset to show how to use KNNImputer for continuous values

In [16]:
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer

In [17]:
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)

In [18]:
# Applying a random mask to make missing data
mask = np.random.choice([True, False], size=iris_df.shape[0] * iris_df.shape[1])
mask[:500] = True
np.random.shuffle(mask)
mask = np.reshape(mask, iris_df.shape)
iris_df = iris_df.mask(~mask)

iris_df.isnull().sum()

sepal length (cm)    17
sepal width (cm)      8
petal length (cm)    14
petal width (cm)     15
dtype: int64

In [19]:
train_X, test_X = iris_df[:100], iris_df[100:]

In [20]:
# It is important to impute the train and test set separately (not fitting KNN to test set) to avoid data leak.
imputer = KNNImputer(n_neighbors=5)
imputed_train_X = imputer.fit_transform(train_X)
imputed_test_X = imputer.transform(test_X)

In [21]:
del iris, iris_df, mask, train_X, test_X, imputer, imputed_train_X, imputed_test_X

### Task 3: Create different preprocessing strategies of your own
Create different versions of X (X1 and X2) by dropping missing values (X1) or using strategies for data imputation (X2). Define different preprocessing strategies using the `Pipeline` and `ColmnTransformer` class


### Task 3-1: Dropping missing values (X1)

In [22]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler
non_categorical_features = [
    "Age",
    "Fnlwgt",
    "Capital-gain",
    "Capital-loss",
    "Hours-per-week",
]
categorical_ohe_features = [
    "Workclass",
    "Education-num",
    "Marital-status",
    "Occupation",
    "Relationship",
    "Race",
    "Native-country",
]
categorical_le_features = ["Sex"]

In [23]:
# Your explorations here

print(df.columns[df.isnull().any()])

Index(['Workclass', 'Occupation', 'Native-country'], dtype='object')


In [24]:
X1 = df.dropna() # creates a dataframe, X1 that drops the null values.

In [25]:
X1.info()

<class 'pandas.core.frame.DataFrame'>
Index: 30162 entries, 0 to 32560
Data columns (total 15 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Age             30162 non-null  int64 
 1   Workclass       30162 non-null  object
 2   Fnlwgt          30162 non-null  int64 
 3   Education       30162 non-null  object
 4   Education-num   30162 non-null  int64 
 5   Marital-status  30162 non-null  object
 6   Occupation      30162 non-null  object
 7   Relationship    30162 non-null  object
 8   Race            30162 non-null  object
 9   Sex             30162 non-null  object
 10  Capital-gain    30162 non-null  int64 
 11  Capital-loss    30162 non-null  int64 
 12  Hours-per-week  30162 non-null  int64 
 13  Native-country  30162 non-null  object
 14  Y               30162 non-null  object
dtypes: int64(6), object(9)
memory usage: 3.7+ MB


#### Task 3-1: Using strategies for data imputation (X2)

In [26]:
categorical_features_with_na = ['Workclass', 'Occupation', 'Native-country']
numerical_features = ['Age', 'Fnlwgt', 'Education-num', 'Capital-gain', 'Capital-loss', 'Hours-per-week']
categorical_features_without_na = ['Education', 'Marital-status', 'Relationship', 'Race', 'Sex']

In [27]:
categorical_imputer = SimpleImputer(strategy='most_frequent')
categorical_pipeline = Pipeline(steps=[
    ('imputer', categorical_imputer),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

In [28]:
numerical_scaler = StandardScaler()
numerical_pipeline = Pipeline(steps=[
    ('scaler', numerical_scaler)
])

In [29]:
# Combining transformers for the ColumnTransformer
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_pipeline, numerical_features),
        ('cat_impute_ohe', categorical_pipeline, categorical_features_with_na),
        ('cat_ohe', OneHotEncoder(handle_unknown='ignore'), categorical_features_without_na)
    ]
)

In [30]:
X2_matrix = preprocessor.fit_transform(df)
X2_array = X2_matrix.toarray()
X2 = pd.DataFrame(X2_array)

In [31]:
type(X2)

pandas.core.frame.DataFrame

In [32]:
X2.head()

Unnamed: 0,0,1,2,3,4,5,6,7,8,9,...,95,96,97,98,99,100,101,102,103,104
0,0.030671,-1.063611,1.134739,0.148453,-0.21666,-0.035429,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
1,0.837109,-1.008707,1.134739,-0.14592,-0.21666,-2.222153,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
2,-0.042642,0.245079,-0.42006,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0
3,1.057047,0.425801,-1.197459,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,1.0
4,-0.775768,1.408176,1.134739,-0.14592,-0.21666,-0.035429,0.0,0.0,0.0,1.0,...,0.0,0.0,1.0,0.0,0.0,1.0,0.0,0.0,1.0,0.0


### Task 4:
Train different models (KNN, SVM) to predict the y from the two versions of X (X1 and X2) with a fixed value of the regularization parameter.
Centre and scale the data before training the models. Create tables or plots to show how accuracy varies for different imputation strategies or different models.

### Task 4-1: Training KNN and SVM Models with X1



In [33]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.feature_selection import mutual_info_classif

In [34]:
knn = KNeighborsClassifier(n_neighbors=5)

In [35]:
# split X1 into X and y 
y = X1['Y']
X = X1.drop(['Y'], axis=1)

In [36]:
# only select the non-categorical features for X
X = X[non_categorical_features]

In [37]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

In [38]:
X_train = StandardScaler().fit_transform(X_train)

In [39]:
knn.fit(X_train, y_train) # trains the training data on the KNN model. 

In [40]:
# test the data on the knn model
y_pred = knn.predict(X_test)



In [41]:
def generate_report(y_test, y_pred):
    print("Accuracy score: ", accuracy_score(y_test, y_pred))
    print("Confusion Matrix: \n", confusion_matrix(y_test, y_pred))
    print("Classification Report: \n", classification_report(y_test, y_pred))
    print(mutual_info_classif(X_train, y_train))

In [42]:
generate_report(y_test, y_pred)

Accuracy score:  0.757334659373446
Confusion Matrix: 
 [[4540    0]
 [1464   29]]
Classification Report: 
               precision    recall  f1-score   support

       <=50K       0.76      1.00      0.86      4540
        >50K       1.00      0.02      0.04      1493

    accuracy                           0.76      6033
   macro avg       0.88      0.51      0.45      6033
weighted avg       0.82      0.76      0.66      6033

[0.06593495 0.0244849  0.08479619 0.04068244 0.03819423]


In [43]:
svm_model = svm.SVC()

In [44]:
svm_model.fit(X_train, y_train)

In [45]:
y_pred = svm_model.predict(X_test)



In [46]:
generate_report(y_test, y_pred)

Accuracy score:  0.75252776396486
Confusion Matrix: 
 [[4540    0]
 [1493    0]]
Classification Report: 
               precision    recall  f1-score   support

       <=50K       0.75      1.00      0.86      4540
        >50K       0.00      0.00      0.00      1493

    accuracy                           0.75      6033
   macro avg       0.38      0.50      0.43      6033
weighted avg       0.57      0.75      0.65      6033



  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))
  _warn_prf(average, modifier, msg_start, len(result))


[0.06982779 0.02470751 0.08644799 0.03165074 0.04136588]


### Task 4-2: Training KNN and SVM Models with X2

## Part 2: Cross Validation (CV)

scikit-learn provides a nice visualisation of various cross validation methods.
This notebook focuses on different cross validation strategies and how to account for data structure during cross-validation.


Visit: https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#visualizing-cross-validation-behavior-in-scikit-learn

![kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_006.png)
![stra-kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_003.png)
![group-kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_004.png)
![stra-group-kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_010.png)

In [None]:
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    GroupKFold,
    StratifiedGroupKFold,
    GridSearchCV,
)

### Task 5
Now apply cross-validation to the train set (k=5) for optimizing the models hyperparameters. After identifying the best hyperparameter, measure the performance on the test data by training the models on the training data using the optimal hyperparameter. Note: remember that the pre-processing steps, including data centering and scaling should be embedded in the CV.


#### Task 5-1
Plot the model performance (mean accuracy and SD) for different hyper-parameter values.
- How does the accuracy vary as function of the hyperparameter?


#### Task 5-2
Print the average cross-validation score, the best cross-validation score, the best hyperparameter and the test-score.
 - Is there a difference between the average cross-validation score, the best cross-validation score and the test-score?



### Task 6
Repeat task 5 using stratified CV with k=5. Centre and scale the data before training the models. Print the average cross-validation score, the best cross-validation score, the best hyperparameter and the test-score.

- Did the performances changes with the stratified CV?


### Task 7
Repeat task 5 using stratified group CV considering 'Race' as a group with k=5. Centre and scale the data before training the models. Print the average cross-validation score, the best cross-validation score, the best hyperparameter and the test-score.
 - Did the performances changes with the stratified group CV?

### Task 8
Now implement a nested CV for optimize the models’ hyper-parameters and assessing the models’ performance (with k=5 for the inner and outer loop). The inner loop should optimize the models’ hyper-parameters and the outer loop should assess the models’ performance.