# Day 3 - Machine Learning
## Model Selection and Assessment

### After learning from this notebook you will be able to ...
- encode categorical values with one-hot encoding
- know which encoding, scaling, and imputation method to select according to the dataset's characteristics
- impute missing data with KNN
- know how to streamline the preprocessing steps with `Pipeline` and `ColumnTransformer`
- select the best model using various cross-validation methods

### Acknowledgements
- https://github.com/kimdanny/COMP0189-practical

In [None]:
# Optional: For black-style code formatting within the notebook env.
!pip install nb-black

In [2]:
# Optional: For black-style code formatting within the notebook env.
%load_ext nb_black


In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


## Part 1: Encoding and Imputations

### Task 1: Load and Split the Dataset into train and test

In [5]:
# TASK 1: Load Dataset
# We are going to use the same adult dataset as in the previous notebook,
# but this time we have cleaned it for you. However, we did not touch the missing values.
df = pd.read_csv("../../data/clean_adult.csv")
df

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country,Y
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,<=50K
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,<=50K
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,<=50K
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,<=50K
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,<=50K
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
32556,27,Private,257302,Assoc-acdm,12,Married-civ-spouse,Tech-support,Wife,White,Female,0,0,38,United-States,<=50K
32557,40,Private,154374,HS-grad,9,Married-civ-spouse,Machine-op-inspct,Husband,White,Male,0,0,40,United-States,>50K
32558,58,Private,151910,HS-grad,9,Widowed,Adm-clerical,Unmarried,White,Female,0,0,40,United-States,<=50K
32559,22,Private,201490,HS-grad,9,Never-married,Adm-clerical,Own-child,White,Male,0,0,20,United-States,<=50K



In [6]:
def train_test_split_df(df, test_ratio=0.1, target_col="Y"):
    df_data = df.drop([target_col], axis=1)
    df_target = df[target_col]

    split_index = int(len(df) * (1 - test_ratio))
    print(f"Splitting from index {split_index}")

    train_X_df = df_data[:split_index]
    test_X_df = df_data[split_index:]
    train_y_df = df_target[:split_index]
    test_y_df = df_target[split_index:]

    train_y_df = np.where(train_y_df == ">50K", 1, 0)
    test_y_df = np.where(test_y_df == ">50K", 1, 0)

    return train_X_df, train_y_df, test_X_df, test_y_df


In [7]:
# Splitting dataset into train and test
train_X_df, train_y_df, test_X_df, test_y_df = train_test_split_df(df)

Splitting from index 29304



### Task 2: Encode categorical variables (label/ordinal encoding & one-hot encoding)

### Important: We need special care when we are encoding categorical variables 

**1. Take care of the missing values**
- Be careful not to encode missing values unless you intend to.
- Sometimes you want to encode missing values as a separate category. For example, when predicting whether passengers of the Titanic survived, missing data in certain features can carry meaning, e.g., Cabin information can be missing because the body was not found.
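
As a minimal sketch of encoding missing values as their own category, the snippet below fills `NaN` with a placeholder string before one-hot encoding. The toy `cabin` array and the `"Missing"` placeholder are assumptions made for illustration; they are not part of the adult dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy column with one missing entry
cabin = np.array([["A"], [np.nan], ["B"], ["A"]], dtype=object)

# Replace NaN with an explicit placeholder category before encoding
fill_missing = SimpleImputer(strategy="constant", fill_value="Missing")
cabin_filled = fill_missing.fit_transform(cabin)

# The placeholder now gets its own one-hot column
cabin_ohe = OneHotEncoder(sparse_output=False)
cabin_encoded = cabin_ohe.fit_transform(cabin_filled)
cabin_categories = list(cabin_ohe.categories_[0])
print(cabin_categories)
print(cabin_encoded)
```

Whether a dedicated "missing" category helps depends on whether missingness itself is informative for the target.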

**2. Know which encoding and scaling method you should select**
- If your categories are ordinal, it makes sense to use a label/ordinal encoder (`OrdinalEncoder` for input features) with a MinMaxScaler. For example, you can encode [low, medium, high] as [1,2,3], so that the distance between low and high is larger than the distance between medium and high.

- However, if you have non-ordinal categorical values, like [White, Hispanic, Black, Asian], it is better to use a OneHotEncoder instead of forcing ordinality with a LabelEncoder. Otherwise the algorithms you use (especially distance-based algorithms like KNN) will assume that the distance between White and Asian is larger than the distance between White and Hispanic, which is nonsensical.
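
The contrast between the two encodings can be sketched on toy data. The arrays below are hypothetical. Note that `OrdinalEncoder` numbers categories from 0, so [low, medium, high] becomes [0, 1, 2] rather than [1, 2, 3]; the ordering is what matters.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

# Ordinal feature: pass the category order explicitly so that low < medium < high
levels = np.array([["low"], ["high"], ["medium"], ["low"]], dtype=object)
ordinal = OrdinalEncoder(categories=[["low", "medium", "high"]])
levels_encoded = ordinal.fit_transform(levels)
print(levels_encoded.ravel())

# Nominal feature: one-hot encoding keeps every pair of categories equally far apart
races = np.array([["White"], ["Asian"], ["Black"]], dtype=object)
onehot = OneHotEncoder(sparse_output=False)
races_encoded = onehot.fit_transform(races)
print(onehot.categories_[0])
print(races_encoded)
```

Without the explicit `categories` argument, `OrdinalEncoder` sorts categories alphabetically, which would put "high" before "low".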

**3. Split before you encode to avoid data leakage**
- Split the dataset before you encode your data. It is natural for the validation/test set to contain values that did not appear in the train set. `sklearn.preprocessing.OneHotEncoder` can handle these unknown categories (see the `handle_unknown` parameter).

- Discussion: What if you are certain about all the possible categories that can appear for each feature? Can you encode all the values before splitting the dataset into train and test set?
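
A small sketch of how `handle_unknown="ignore"` behaves when the test set contains a category the encoder never saw during fitting. The country values here are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# Hypothetical train/test columns; "Peru" never appears in the training data
train_country = np.array([["Cuba"], ["India"], ["Cuba"]], dtype=object)
test_country = np.array([["India"], ["Peru"]], dtype=object)

# Fit on the training split only, then transform the test split
encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoder.fit(train_country)
test_encoded = encoder.transform(test_country)

# The unseen category is encoded as an all-zero row instead of raising an error
print(test_encoded)
```

With the default `handle_unknown="error"`, the same `transform` call would raise a `ValueError`.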


This notebook shows the three points in the following sections with examples.

### Task 2-1: Label Encoding (with missing values)

In [9]:
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder, OneHotEncoder

label_encoder = LabelEncoder()


In [10]:
train_X_df.isnull().sum()

Age                  0
Workclass         1638
Fnlwgt               0
Education            0
Education-num        0
Marital-status       0
Occupation        1643
Relationship         0
Race                 0
Sex                  0
Capital-gain         0
Capital-loss         0
Hours-per-week       0
Native-country     525
dtype: int64


In [11]:
mask_df = train_X_df.isnull()
# We will encode these columns with LabelEncoder, and the rest with OneHotEncoder
categ = ["Sex"]
train_X_df[categ] = train_X_df[categ].apply(label_encoder.fit_transform)
# The masking below has no effect in this cell because 'Sex' has no missing values.
# However, if the columns being encoded contain missing values,
# restoring NaN with the mask keeps them from being encoded as a separate category.
# To see the effect of masking, add 'Workclass' to categ and check train_X_df.isnull().sum().
train_X_df = train_X_df.mask(mask_df, np.nan)
train_X_df

Unnamed: 0,Age,Workclass,Fnlwgt,Education,Education-num,Marital-status,Occupation,Relationship,Race,Sex,Capital-gain,Capital-loss,Hours-per-week,Native-country
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,1,2174,0,40,United-States
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,1,0,0,13,United-States
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,1,0,0,40,United-States
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,1,0,0,40,United-States
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,0,0,0,40,Cuba
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29299,39,Self-emp-not-inc,148443,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,1,0,0,60,United-States
29300,23,Private,91733,Bachelors,13,Never-married,Tech-support,Own-child,White,0,3325,0,40,United-States
29301,39,Private,176634,Assoc-acdm,12,Never-married,Tech-support,Not-in-family,White,0,0,0,40,United-States
29302,40,Local-gov,74949,Some-college,10,Never-married,Exec-managerial,Not-in-family,White,1,0,0,40,United-States



In [12]:
# The missing values are intact and were not encoded.
train_X_df.isnull().sum()

Age                  0
Workclass         1638
Fnlwgt               0
Education            0
Education-num        0
Marital-status       0
Occupation        1643
Relationship         0
Race                 0
Sex                  0
Capital-gain         0
Capital-loss         0
Hours-per-week       0
Native-country     525
dtype: int64


### Task 2-2: One Hot Encoding (with missing values imputation)

Tip 1: Impute the missing values (choose the right strategy) before doing OHE  
Tip 2: Try creating a separate dataframe with one-hot encoded columns and combine the dataframe with the original dataframe for the final one.

In [13]:
# Let's first impute the missing values.
# Since it's a categorical value, we don't use KNN or mean imputation.
# We will replace with the most frequent value.
from sklearn.impute import SimpleImputer

imputer = SimpleImputer(strategy="most_frequent")
imputed_train_X = imputer.fit_transform(train_X_df)
imputed_train_X_df = pd.DataFrame(imputed_train_X, columns=train_X_df.columns)

# Check that we have no missing values now
imputed_train_X_df.isnull().sum()

Age               0
Workclass         0
Fnlwgt            0
Education         0
Education-num     0
Marital-status    0
Occupation        0
Relationship      0
Race              0
Sex               0
Capital-gain      0
Capital-loss      0
Hours-per-week    0
Native-country    0
dtype: int64


In [14]:
# We want to turn these features into one-hot vectors
onehot_categ = [
    "Workclass",
    "Education",
    "Marital-status",
    "Occupation",
    "Relationship",
    "Race",
    "Native-country",
]
onehot_encoder = OneHotEncoder(sparse_output=False).fit(
    imputed_train_X_df[onehot_categ]
)
encoded = onehot_encoder.transform(imputed_train_X_df[onehot_categ])
encoded_df = pd.DataFrame(encoded, columns=onehot_encoder.get_feature_names_out())
encoded_df

Unnamed: 0,Workclass_Federal-gov,Workclass_Local-gov,Workclass_Never-worked,Workclass_Private,Workclass_Self-emp-inc,Workclass_Self-emp-not-inc,Workclass_State-gov,Workclass_Without-pay,Education_10th,Education_11th,...,Native-country_Portugal,Native-country_Puerto-Rico,Native-country_Scotland,Native-country_South,Native-country_Taiwan,Native-country_Thailand,Native-country_Trinadad&Tobago,Native-country_United-States,Native-country_Vietnam,Native-country_Yugoslavia
0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29299,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
29300,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
29301,0.0,0.0,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
29302,0.0,1.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0



In [15]:
# After finishing encoding categorical variables,
# we make the final dataframe by concatenating it with the imputed dataframe
imputed_train_X_df = imputed_train_X_df.drop(onehot_categ, axis=1)
final_df_train_X_df = pd.concat([imputed_train_X_df, encoded_df], axis=1)
final_df_train_X_df

Unnamed: 0,Age,Fnlwgt,Education-num,Sex,Capital-gain,Capital-loss,Hours-per-week,Workclass_Federal-gov,Workclass_Local-gov,Workclass_Never-worked,...,Native-country_Portugal,Native-country_Puerto-Rico,Native-country_Scotland,Native-country_South,Native-country_Taiwan,Native-country_Thailand,Native-country_Trinadad&Tobago,Native-country_United-States,Native-country_Vietnam,Native-country_Yugoslavia
0,39,77516,13,1,2174,0,40,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
1,50,83311,13,1,0,0,13,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
2,38,215646,9,1,0,0,40,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
3,53,234721,7,1,0,0,40,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
4,28,338409,13,0,0,0,40,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
29299,39,148443,9,1,0,0,60,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
29300,23,91733,13,0,3325,0,40,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
29301,39,176634,12,0,0,0,40,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0
29302,40,74949,10,1,0,0,40,0.0,1.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0



In [16]:
final_df_train_X_df.shape

(29304, 104)


### Another (recommended) Method: Using `Pipeline` and `ColumnTransformer` to streamline the workflow

We will perform the same operations (plus scaling), but much more succinctly.  
We use `Pipeline` and `ColumnTransformer`.  
`ColumnTransformer` is needed to apply different preprocessing steps to selected columns.

In [17]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer, KNNImputer
from sklearn.preprocessing import StandardScaler


In [18]:
# Reload the dataset
df = pd.read_csv("../../data/clean_adult.csv")
train_X, train_y, test_X, test_y = train_test_split_df(df)

Splitting from index 29304



In [19]:
non_categorical_features = [
    "Age",
    "Fnlwgt",
    "Education-num",
    "Capital-gain",
    "Capital-loss",
    "Hours-per-week",
]
categorical_ohe_features = [
    "Workclass",
    "Marital-status",
    "Occupation",
    "Relationship",
    "Race",
    "Native-country",
]
categorical_le_features = ["Sex", "Education"]


In [20]:
# For features like 'Age' and 'Fnlwgt'
non_categorical_transformer = Pipeline(
    # For KNNImputer, see the side note below
    # We can add scaling for non-categorical features
    steps=[("KNNImputer", KNNImputer(n_neighbors=5)), ("scaling", StandardScaler())]
)

# For features like 'Workclass' and 'Occupation'
categorical_ohe_transformer = Pipeline(
    steps=[
        ("SimpleImputer", SimpleImputer(strategy="most_frequent")),
        ("OHE", OneHotEncoder(sparse_output=False, handle_unknown="ignore")),
        # no need to scale
    ]
)

# For features like 'Sex' and 'Education'
categorical_le_transformer = Pipeline(
    steps=[
        ("ModeImputer", SimpleImputer(strategy="most_frequent")),
        # We use OrdinalEncoder here because LabelEncoder is meant for the target variable.
        # Try swapping OrdinalEncoder for LabelEncoder to see the error it raises,
        # and check the documentation of LabelEncoder.
        ("LE", OrdinalEncoder()),
        # For the adult dataset there is no need to scale the 'Sex' variable alone,
        # but other label/ordinal-encoded features can be ordinal, so we scale them.
        ("scaling", StandardScaler()),
    ]
)


In [21]:
ct = ColumnTransformer(
    transformers=[
        ("non-categorical", non_categorical_transformer, non_categorical_features),
        ("categorical-ohe", categorical_ohe_transformer, categorical_ohe_features),
        ("categorical-le", categorical_le_transformer, categorical_le_features),
    ]
)


In [22]:
transformed_train_X = ct.fit_transform(train_X, train_y)
transformed_test_X = ct.transform(test_X)


In [23]:
transformed_train_X.shape, train_y.shape, transformed_test_X.shape, test_y.shape

((29304, 89), (29304,), (3257, 89), (3257,))


### Side Note: Data Imputation with KNN
For the adult dataset, missing data are present only in categorical features, so an imputation strategy that produces floating-point values may not make sense.
However, for continuous values you can use various imputation strategies, such as the simple mean or the mean over the K nearest neighbors (KNN).
If you use `sklearn.impute.KNNImputer`, each sample's missing values are imputed using the `mean` value from the `n_neighbors` nearest neighbors found in the training set.
If you want to use the `mode` value from the neighbors (for categorical data imputation), you need to implement the imputer yourself.

- `sklearn-pandas` package (https://pypi.org/project/sklearn-pandas/1.5.0/) provides `CategoricalImputer` class, which is suitable for such processing
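
To make the "mean of the nearest neighbors" rule concrete, here is a minimal sketch on a hand-made 4x3 array (the values are illustrative only). With `n_neighbors=2`, each missing entry is replaced by the average of that feature over the two closest rows that have it, where closeness is measured on the features both rows have observed.

```python
import numpy as np
from sklearn.impute import KNNImputer

# Toy matrix with two missing entries
X_toy = np.array(
    [
        [1.0, 2.0, np.nan],
        [3.0, 4.0, 3.0],
        [np.nan, 6.0, 5.0],
        [8.0, 8.0, 7.0],
    ]
)

toy_imputer = KNNImputer(n_neighbors=2)
X_filled = toy_imputer.fit_transform(X_toy)

# Row 0, column 2: the two nearest rows hold 3.0 and 5.0, so the imputed value is 4.0
# Row 2, column 0: the two nearest rows hold 3.0 and 8.0, so the imputed value is 5.5
print(X_filled)
```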

Here, we use the iris dataset to show how to use KNNImputer for continuous values.

In [24]:
from sklearn.datasets import load_iris
from sklearn.impute import KNNImputer


In [25]:
iris = load_iris()
iris_df = pd.DataFrame(iris.data, columns=iris.feature_names)


In [26]:
# Applying a random mask to make missing data
mask = np.random.choice([True, False], size=iris_df.shape[0] * iris_df.shape[1])
mask[:500] = True
np.random.shuffle(mask)
mask = np.reshape(mask, iris_df.shape)
iris_df = iris_df.mask(~mask)

iris_df.isnull().sum()

sepal length (cm)    10
sepal width (cm)      9
petal length (cm)    11
petal width (cm)     13
dtype: int64


In [27]:
train_X, test_X = iris_df[:100], iris_df[100:]


In [28]:
# Fit the imputer on the training set only and reuse it to transform the test set (never fit on the test set) to avoid data leakage.
imputer = KNNImputer(n_neighbors=5)
imputed_train_X = imputer.fit_transform(train_X)
imputed_test_X = imputer.transform(test_X)


In [29]:
del iris, iris_df, mask, train_X, test_X, imputer, imputed_train_X, imputed_test_X


### Task 3: Create different preprocessing strategies of your own
Create different versions of X (X1 and X2) and y using different strategies for data imputation.  
Define different preprocessing strategies using the `Pipeline` and `ColumnTransformer` classes.

In [30]:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer


In [31]:
# Your explorations here


### Task 4: 
Train different models (KNN, SVM) to predict the y from the two versions of X (X1 and X2) with a fixed value of the regularization parameter. 
Centre and scale the data before training the models. Create tables or plots to show how accuracy varies for different imputation strategies or different models. 

### Task 4-1: Training KNN and SVM Models (with original preprocessing)

In [32]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn import svm

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.metrics import classification_report
from sklearn.feature_selection import mutual_info_classif


In [33]:
model_knn = KNeighborsClassifier()
model_knn.fit(transformed_train_X, train_y)

# Predict Output
y_hat_knn = model_knn.predict(transformed_test_X)
accuracy_score(test_y, y_hat_knn)  # signature is (y_true, y_pred)

0.8277556033159349


In [34]:
model_svm = svm.SVC(kernel="linear")  # Linear Kernel
model_svm.fit(transformed_train_X, train_y)

# Predict Output
y_hat_svm = model_svm.predict(transformed_test_X)
accuracy_score(test_y, y_hat_svm)  # signature is (y_true, y_pred)

0.8440282468529321


### Task 4-2: Investigation
Create tables or plots to show how accuracy varies for different imputation strategies or different models. 
- What is the impact of the imputation strategy on the accuracy? 
- What is the impact of the model on the accuracy? 
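
One possible way to organise such a comparison is sketched below: loop over imputers and models, fit each combination as a `Pipeline`, and pivot the accuracies into a table. To stay self-contained it uses synthetic data from `make_classification` with artificially removed entries, which is an assumption made for illustration; the names `imputers`, `models`, and `comparison` are hypothetical, and the numbers it produces say nothing about the adult dataset.

```python
import numpy as np
import pandas as pd
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in data (not the adult dataset) with about 10% of entries removed
rng = np.random.default_rng(0)
X_syn, y_syn = make_classification(n_samples=400, n_features=6, random_state=0)
X_syn[rng.random(X_syn.shape) < 0.1] = np.nan
X_tr, X_te, y_tr, y_te = X_syn[:300], X_syn[300:], y_syn[:300], y_syn[300:]

imputers = {
    "mean": SimpleImputer(strategy="mean"),
    "knn": KNNImputer(n_neighbors=5),
}
models = {"KNN": KNeighborsClassifier(), "SVM": SVC(kernel="linear")}

rows = []
for imp_name, imp in imputers.items():
    for model_name, model in models.items():
        # clone() gives each pipeline fresh, unfitted copies of the steps
        pipe = Pipeline(
            [
                ("impute", clone(imp)),
                ("scale", StandardScaler()),
                ("model", clone(model)),
            ]
        )
        pipe.fit(X_tr, y_tr)
        rows.append(
            {
                "imputer": imp_name,
                "model": model_name,
                "accuracy": pipe.score(X_te, y_te),
            }
        )

# One row per imputation strategy, one column per model
comparison = pd.DataFrame(rows).pivot(
    index="imputer", columns="model", values="accuracy"
)
print(comparison)
```

For the adult dataset, the same loop could wrap different `ColumnTransformer` variants (X1 and X2) in place of the simple imputers used here.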

## Part 2: Cross Validation (CV)

`scikit-learn` provides a nice visualisation of various cross-validation methods.  
This notebook focuses on four main CV techniques, `KFold`, `StratifiedKFold`, `GroupKFold`, and `StratifiedGroupKFold`, to optimize models' hyperparameters and to understand how different strategies might affect the models' performance.

Visit: https://scikit-learn.org/stable/auto_examples/model_selection/plot_cv_indices.html#visualizing-cross-validation-behavior-in-scikit-learn

![kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_006.png)
![stra-kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_003.png)
![group-kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_004.png)
![stra-group-kfold](https://scikit-learn.org/stable/_images/sphx_glr_plot_cv_indices_010.png)
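
The behaviour in the figures can also be checked numerically. The sketch below uses a toy dataset of 12 samples (an assumption for illustration) with imbalanced labels and three hypothetical groups, and counts what ends up in each test fold.

```python
import numpy as np
from sklearn.model_selection import GroupKFold, KFold, StratifiedKFold

# Toy data: 12 samples, 3 positives at the end, three groups of 4 samples each
X_cv = np.arange(12).reshape(-1, 1)
y_cv = np.array([0] * 9 + [1] * 3)
groups_cv = np.repeat(["a", "b", "c"], 4)


def positives_per_fold(splitter):
    """Count the positive labels that land in each test fold."""
    return [int(y_cv[test].sum()) for _, test in splitter.split(X_cv, y_cv)]


# Plain KFold (no shuffling) ignores labels: all positives fall into the last fold
kfold_pos = positives_per_fold(KFold(n_splits=3))

# StratifiedKFold preserves the class ratio in every fold
strat_pos = positives_per_fold(StratifiedKFold(n_splits=3))

# GroupKFold keeps each group entirely inside a single test fold
group_folds = [
    set(groups_cv[test])
    for _, test in GroupKFold(n_splits=3).split(X_cv, y_cv, groups_cv)
]

print(kfold_pos, strat_pos, group_folds)
```

`StratifiedGroupKFold` combines the last two ideas: groups are never split across folds, and class ratios are kept as balanced as that constraint allows.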

In [38]:
from sklearn.model_selection import (
    KFold,
    StratifiedKFold,
    GroupKFold,
    StratifiedGroupKFold,
    GridSearchCV,
)


### Task 5
Now use CV to optimize the models' hyperparameters using different k (2, 5, 10, 20) for the k-fold CV to predict y from the two versions of X (X1 and X2), and plot the model performance (mean accuracy and SD) for the different k. Centre and scale the data before training the models. 

Note: remember that the pre-processing steps, including data centering and scaling, should be embedded in the CV. 

- How does the accuracy vary as the number of folds increases? 
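
One way to embed the preprocessing in the CV is to put the scaler and the model in a single `Pipeline` and hand that pipeline to `GridSearchCV`, so the scaler is refit on the training part of every fold. The sketch below does this with nested CV on synthetic data. The data, the two values of k, and the names `pipe`, `param_grid`, and `summary` are assumptions chosen to keep the example small and fast.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data (not the adult dataset)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=0)

# Scaling lives inside the pipeline, so it is fit only on each fold's training part
pipe = Pipeline([("scale", StandardScaler()), ("knn", KNeighborsClassifier())])
param_grid = {"knn__n_neighbors": np.arange(1, 10)}

summary = {}
for k in (3, 5):
    # Inner loop selects n_neighbors; outer loop estimates generalisation accuracy
    inner = GridSearchCV(
        pipe, param_grid, cv=KFold(n_splits=k - 1), scoring="accuracy"
    )
    outer_scores = cross_val_score(inner, X_demo, y_demo, cv=KFold(n_splits=k))
    summary[k] = (outer_scores.mean(), outer_scores.std())

for k, (mean_acc, sd_acc) in summary.items():
    print(f"k={k}: mean accuracy={mean_acc:.3f}, SD={sd_acc:.3f}")
```

For the adult dataset, the `ColumnTransformer` defined earlier could take the place of the `"scale"` step, with the raw (untransformed) training dataframe passed to `cross_val_score`.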

In [39]:
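# KNN

# Note: the loop starts at 3 rather than 2 because the inner CV below uses n - 1 folds,
# and KFold requires at least 2 splits.
result = []
for n in [3, 5, 10, 20]: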
    kf = KFold(n_splits=n)
    tem = []
    for i, (train_index, test_index) in enumerate(kf.split(transformed_train_X)):
        model_knn = KNeighborsClassifier()
        k_range = np.arange(1, 10)
        # define grid search
        grid = dict(n_neighbors=k_range)
        cv = KFold(n_splits=n - 1)
        grid_search = GridSearchCV(
            estimator=model_knn,
            param_grid=grid,
            pre_dispatch=6,
            n_jobs=6,
            cv=cv,
            scoring="accuracy",
            error_score=0,
            verbose=1,
        )
        grid_result = grid_search.fit(
            transformed_train_X[train_index], train_y[train_index]
        )
        model_knn = KNeighborsClassifier(
            n_neighbors=grid_result.best_params_["n_neighbors"]
        )
        model_knn.fit(transformed_train_X[train_index], train_y[train_index])

        y_hat_knn = model_knn.predict(transformed_train_X[test_index])
        tem.append(accuracy_score(train_y[test_index], y_hat_knn))
    result.append(tem)

Fitting 2 folds for each of 9 candidates, totalling 18 fits
Fitting 2 folds for each of 9 candidates, totalling 18 fits
Fitting 2 folds for each of 9 candidates, totalling 18 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates, totalling 81 fits
Fitting 9 folds for each of 9 candidates


In [40]:
result

[[0.8392710892710893, 0.8341523341523341, 0.8374283374283374],
 [0.8334755161235284,
  0.8372291417846783,
  0.838423477222317,
  0.8469544446340215,
  0.840443686006826],
 [0.8242920504947117,
  0.8375980893892869,
  0.8437393381098601,
  0.8259979529170931,
  0.8440273037542663,
  0.8409556313993174,
  0.8402730375426621,
  0.8535836177474403,
  0.8361774744027304,
  0.8416382252559726],
 [0.8212824010914052,
  0.8281036834924966,
  0.8315143246930423,
  0.844474761255116,
  0.8382252559726963,
  0.8477815699658703,
  0.8389078498293515,
  0.8197952218430035,
  0.8505119453924914,
  0.8361774744027304,
  0.8382252559726963,
  0.8477815699658703,
  0.8368600682593856,
  0.8505119453924914,
  0.851877133105802,
  0.8552901023890785,
  0.8511945392491468,
  0.8232081911262799,
  0.8559726962457338,
  0.8368600682593856]]
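
To turn per-fold accuracies like these into the mean and SD the task asks for, a small summary table is enough. The sketch below uses only the k=3 and k=5 accuracies, copied from the output above and rounded to four decimals; the names `fold_scores` and `summary` are hypothetical.

```python
import numpy as np
import pandas as pd

# Fold accuracies for k=3 and k=5, copied (rounded) from the printed result
fold_scores = {
    3: [0.8393, 0.8342, 0.8374],
    5: [0.8335, 0.8372, 0.8384, 0.8470, 0.8404],
}

# One row per k, with the mean and standard deviation across folds
summary = pd.DataFrame(
    {
        "mean_accuracy": {k: np.mean(v) for k, v in fold_scores.items()},
        "sd_accuracy": {k: np.std(v) for k, v in fold_scores.items()},
    }
)
summary.index.name = "k"
print(summary)
```

With the full `result` list, the same table can be built from `zip([3, 5, 10, 20], result)` and plotted with `plt.errorbar(summary.index, summary["mean_accuracy"], yerr=summary["sd_accuracy"])`.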


### Task 6
Repeat task 5 using stratified CV with k=5. Compute the accuracy of the models implemented in task 5 (with k=5) and check if the model with stratified CV performs better across the different folds. Centre and scale the data before training the models. Create tables or plots to show these results.

In [41]:
# KNN

n = 5
kf = StratifiedKFold(n_splits=n)
tem = []
for i, (train_index, test_index) in enumerate(kf.split(transformed_train_X, train_y)):
    model_knn = KNeighborsClassifier()
    k_range = np.arange(1, 10)
    # define grid search
    grid = dict(n_neighbors=k_range)
    cv = StratifiedKFold(n_splits=n - 1)
    grid_search = GridSearchCV(
        estimator=model_knn,
        param_grid=grid,
        n_jobs=2,
        cv=cv,
        scoring="accuracy",
        error_score=0,
        verbose=1,
    )
    grid_result = grid_search.fit(
        transformed_train_X[train_index], train_y[train_index]
    )
    model_knn = KNeighborsClassifier(
        n_neighbors=grid_result.best_params_["n_neighbors"]
    )
    model_knn.fit(transformed_train_X[train_index], train_y[train_index])

    y_hat_knn = model_knn.predict(transformed_train_X[test_index])
    tem.append(accuracy_score(train_y[test_index], y_hat_knn))

Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits



In [42]:
tem

[0.8338167548199966,
 0.8365466643917421,
 0.8375703804811465,
 0.8467838252857874,
 0.8431740614334471]


### Task 7
Repeat task 5 using stratified group CV considering 'Race' as a group with k=5.
Compute the accuracy of the models implemented in task 5 (with k=5) and check if the model with stratified group CV performs better across the different races. 
Centre and scale the data before training the models. 
Create tables or plots to show these results. 
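
To compare performance across groups, a per-group accuracy table can help. The helper below is a hypothetical sketch (the function name and the toy inputs are not from the notebook): it marks each prediction as correct or not and averages within each group.

```python
import numpy as np
import pandas as pd


def accuracy_by_group(y_true, y_pred, groups):
    """Return the fraction of correct predictions within each group."""
    correct = pd.Series(np.asarray(y_true) == np.asarray(y_pred))
    return correct.groupby(np.asarray(groups)).mean()


# Hypothetical labels, predictions, and group memberships
y_true_demo = [1, 0, 1, 1, 0, 0]
y_pred_demo = [1, 0, 0, 1, 0, 0]
groups_demo = ["A", "A", "A", "B", "B", "B"]

per_group = accuracy_by_group(y_true_demo, y_pred_demo, groups_demo)
print(per_group)
```

In the notebook's setting, `groups` would be the 'Race' values of the samples in each test fold.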

In [43]:
n = 5
kf = StratifiedGroupKFold(n_splits=n)
tem = []
group = train_X_df["Race"]
for i, (train_index, test_index) in enumerate(
    kf.split(transformed_train_X, train_y, group)
):
    model_knn = KNeighborsClassifier()
    k_range = np.arange(1, 10)
    # define grid search
    grid = dict(n_neighbors=k_range)
    tem_group = group.iloc[train_index]  # positional indexing matches the fold indices
    cv = StratifiedGroupKFold(n_splits=n - 1)
    grid_search = GridSearchCV(
        estimator=model_knn,
        param_grid=grid,
        n_jobs=2,
        cv=cv,
        scoring="accuracy",
        error_score=0,
        verbose=1,
    )
    grid_result = grid_search.fit(
        transformed_train_X[train_index], train_y[train_index], groups=tem_group
    )
    model_knn = KNeighborsClassifier(
        n_neighbors=grid_result.best_params_["n_neighbors"]
    )
    model_knn.fit(transformed_train_X[train_index], train_y[train_index])

    y_hat_knn = model_knn.predict(transformed_train_X[test_index])
    tem.append(accuracy_score(train_y[test_index], y_hat_knn))

Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits
Fitting 4 folds for each of 9 candidates, totalling 36 fits



In [44]:
tem

[0.8188064232643605,
 0.9006056287851799,
 0.8261802575107297,
 0.8939929328621908,
 0.9193548387096774]
