# KFold Cross Validation and Hyperparameter Tuning

**Author:** Manaranjan Pradhan<br>
**Email ID:** manaranjan@gmail.com<br>
**LinkedIn:** https://www.linkedin.com/in/manaranjanpradhan/

## Load Dataset

Loading the used car resale price dataset.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sn

In [2]:
cars_df = pd.read_csv( "final_cars_maruti.csv" )

In [3]:
cars_df.sample(5)

|  | Location | Fuel_Type | Transmission | Owner_Type | Seats | Price | Age | Model | Mileage | Power | KM_Driven |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 186 | Pune | Petrol | Manual | First | 5 | 4.65 | 7 | swift | 18.6 | 85.8 | 32 |
| 965 | Coimbatore | Petrol | Manual | First | 5 | 4.11 | 1 | omni | 14.0 | 35.0 | 4 |
| 585 | Coimbatore | Petrol | Automatic | First | 5 | 6.93 | 3 | swift | 18.5 | 83.14 | 46 |
| 582 | Ahmedabad | Diesel | Manual | First | 5 | 6.25 | 5 | ciaz | 28.09 | 88.5 | 52 |
| 398 | Kochi | Petrol | Automatic | First | 5 | 7.22 | 3 | baleno | 21.4 | 83.1 | 36 |


In [4]:
cars_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1010 entries, 0 to 1009
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   Location      1010 non-null   object 
 1   Fuel_Type     1010 non-null   object 
 2   Transmission  1010 non-null   object 
 3   Owner_Type    1010 non-null   object 
 4   Seats         1010 non-null   int64  
 5   Price         1010 non-null   float64
 6   Age           1010 non-null   int64  
 7   Model         1010 non-null   object 
 8   Mileage       1010 non-null   float64
 9   Power         1010 non-null   float64
 10  KM_Driven     1010 non-null   int64  
dtypes: float64(3), int64(3), object(5)
memory usage: 86.9+ KB


Selecting the features that will be used for modeling.

In [5]:
x_features = ['Fuel_Type', 
              'Transmission', 
              'Owner_Type', 
              'Age', 
              'Model', 
              'KM_Driven']

In [6]:
x_features

['Fuel_Type', 'Transmission', 'Owner_Type', 'Age', 'Model', 'KM_Driven']

In [7]:
cat_vars = ['Fuel_Type',
            'Transmission',
            'Owner_Type',
            'Model']

In [8]:
num_vars = list(set(x_features) - set(cat_vars))

In [9]:
num_vars

['KM_Driven', 'Age']

### Setting X and y variables

In [10]:
X = cars_df[x_features]
y = cars_df['Price']

### Data Splitting

In [11]:
from sklearn.model_selection import train_test_split

In [12]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 80)

In [13]:
X_train.shape

(808, 6)

In [14]:
X_test.shape

(202, 6)

## Creating Pipelines for feature transformation.

1. Categorical columns
    - One-hot encoding (OHE)
2. Numerical columns
    - Min-max scaling

In [15]:
from sklearn.preprocessing import OneHotEncoder, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

#### Pipeline for OHE for categorical columns

In [16]:
ohe_encoder = OneHotEncoder(handle_unknown='ignore')
cat_transformer = Pipeline(steps=[('oheencoder', ohe_encoder)])

#### Pipeline for min-max scaling of numerical columns

In [17]:
minmax_scaler = MinMaxScaler()
num_transformer = Pipeline(steps=[('scaler', minmax_scaler)])

#### Defining the processing pipeline

In [18]:
preprocessor = ColumnTransformer(
        transformers = [('numerical', num_transformer, num_vars),
                        ('categorical', cat_transformer, cat_vars)])

## KNN Regression

Building the model.

In [19]:
from sklearn.neighbors import KNeighborsRegressor

We will use 5 neighbors and uniform weights to estimate the price.

In [20]:
knn = KNeighborsRegressor(n_neighbors=5, 
                          weights='uniform')

In [21]:
knn_pipeline = Pipeline (steps = [('preprocessor', preprocessor),
                                   ('regression', knn)])

In [22]:
knn_pipeline.fit(X_train, y_train)

## Predict on Test Set

How well does the model perform on the test set?

In [23]:
y_pred = knn_pipeline.predict(X_test)

In [24]:
from sklearn.metrics import mean_squared_error, r2_score

In [25]:
np.round(r2_score(y_test, y_pred), 5)

0.85531

## What happens if the training set and test set change?

- Change `random_state`, rebuild the model, and measure the R² score again

In [26]:
X_train, X_test, y_train, y_test = train_test_split(X,
                                                    y,
                                                    train_size = 0.8,
                                                    random_state = 200)

In [27]:
knn_pipeline.fit(X_train, y_train)
np.round(r2_score(y_test,  knn_pipeline.predict(X_test)), 5)

0.87923

## Drawbacks of using only one train and test split

Relying on a single train-test split to measure model performance can make the reported score biased and variable, and gives little basis for confidence in it.

- The measured performance can depend heavily on which instances land in the training and test sets. If the split happens to be unrepresentative of the overall data distribution, the resulting score may not reflect the true performance of the model. In the two runs above, changing only `random_state` moved the R² score from 0.85531 to 0.87923.

- A single train-test split yields a single score, so there is no measure of its uncertainty: confidence intervals cannot be estimated and the stability of the model's performance cannot be assessed.
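The split-to-split variability described above can be sketched by repeating the split with several seeds. This is only an illustration under stated assumptions: it uses synthetic data from `make_regression` rather than the cars dataset, and the names `X_demo`, `y_demo`, and `split_scores` are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in data (assumption): not the cars dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=20.0, random_state=0)

split_scores = []
for seed in range(10):
    # Same data and same model settings; only the split changes.
    X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                              train_size=0.8,
                                              random_state=seed)
    model = KNeighborsRegressor(n_neighbors=5, weights='uniform')
    model.fit(X_tr, y_tr)
    split_scores.append(r2_score(y_te, model.predict(X_te)))

split_scores = np.array(split_scores)
print(f"min R2: {split_scores.min():.3f}")
print(f"max R2: {split_scores.max():.3f}")
print(f"spread: {split_scores.max() - split_scores.min():.3f}")
```

Any one of these ten scores could have been the single number reported, which is why a single split gives little sense of how stable the estimate is.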

### K-Fold Cross Validation

- Cross-validation provides a more robust estimate of a model's performance than a single train-test split. It partitions the data into k subsets (folds) and performs k iterations, each time using a different fold as the test set and the remaining k-1 folds as the training set.

Source: https://scikit-learn.org/stable/modules/cross_validation.html

<img src="kfold.png" alt="K-fold cross-validation" width="600"/>

- It provides a more comprehensive evaluation of the model's performance across different data subsets.
- **Better utilization of data:** Cross-validation makes efficient use of the available data. In a single train-test split, the held-out portion is used only for testing and never for training. With k-fold cross-validation, every data point is used for validation exactly once and for training in the other k-1 iterations, leading to a more reliable estimate of performance.
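The fold mechanics can be made concrete with scikit-learn's `KFold` splitter on a tiny index array. This is a minimal sketch: the ten-row placeholder array and the names `placeholder_X` and `test_counts` are assumptions for illustration. It prints the train and test indices of each fold and counts how often each row lands in a test fold.

```python
import numpy as np
from sklearn.model_selection import KFold

# Ten placeholder rows (assumption); only the number of rows matters here.
n_samples = 10
placeholder_X = np.zeros((n_samples, 1))

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# Count how many times each row is used for testing.
test_counts = np.zeros(n_samples, dtype=int)
for fold_no, (train_idx, test_idx) in enumerate(kfold.split(placeholder_X), start=1):
    test_counts[test_idx] += 1
    print(f"Fold {fold_no}: train={train_idx}, test={test_idx}")

# Every row appears in exactly one test fold and in the training set of the other four.
print(test_counts)
```

When `cv` is given as an integer for a regressor, as in `cv = 5`, scikit-learn uses `KFold` without shuffling; passing a `KFold` object instead allows shuffling with a fixed seed.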

In [28]:
from sklearn.model_selection import cross_val_score

In [29]:
scores = cross_val_score(knn_pipeline,
                         X_train,
                         y_train,
                         cv = 5,
                         scoring = 'r2')

In [30]:
scores

array([0.85730414, 0.85397204, 0.82937946, 0.85533755, 0.85889236])

In [31]:
scores.mean()

0.8509771107614357

In [32]:
scores.std()

0.01092816485058668

### Interpreting the cross validation scores 

- **Mean score:** The mean of the fold scores (R² here) represents the average performance of the model. If the mean is high and the individual fold scores are close to it, the model is performing well and is likely to generalize to new data.


- **Variance:** Look at the variance or standard deviation of the scores across the folds. A high variance means the model's performance is sensitive to the specific data splits used in cross-validation. It may indicate *instability or inconsistency* in the model's predictions when faced with different subsets of the data. Here the standard deviation is about 0.011 around a mean of about 0.851, which suggests the folds agree closely.


- **Data Quality and Quantity:** The variance in the scores is also influenced by the quality and quantity of the available data. A small dataset, or one with a high level of noise or outliers, tends to produce higher variance. Increasing the amount of data or improving its quality can reduce it.


- **Model Selection and Hyperparameter Tuning:** Cross-validation helps in comparing and selecting the best model among different alternatives or in tuning hyperparameters. By evaluating each model on multiple folds and averaging the results, we can make more informed decisions about which model or set of hyperparameters performs the best on average across different data subsets.
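As a small worked example of the summary statistics discussed above, the fold scores printed by `cross_val_score` earlier can be condensed into a mean, a spread, and a range. The variable names are hypothetical; the five values are copied from the output above.

```python
import numpy as np

# Fold scores copied from the cross_val_score output shown earlier.
fold_scores = np.array([0.85730414, 0.85397204, 0.82937946,
                        0.85533755, 0.85889236])

mean_r2 = fold_scores.mean()
std_pop = fold_scores.std()            # ddof=0, the NumPy default used by scores.std()
std_sample = fold_scores.std(ddof=1)   # sample standard deviation, slightly larger

print(f"mean R2       : {mean_r2:.4f}")
print(f"std (ddof=0)  : {std_pop:.4f}")
print(f"std (ddof=1)  : {std_sample:.4f}")
print(f"range         : {fold_scores.min():.4f} to {fold_scores.max():.4f}")
```

With only five folds, the standard deviation is itself a rough estimate, so it is best read as an indication of stability rather than as a precise confidence interval.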

## Hyperparameter Tuning using Grid Search

- In Grid search, we define a grid of hyperparameter values to explore. 
- The grid search algorithm exhaustively searches through all possible combinations of hyperparameters to find the optimal set that yields the best performance metric, such as r2 score. 
- The performance of each combination is compared, and the hyperparameter set with the best performance is selected as the optimal choice.
- To compare the combinations more reliably, k-fold cross-validation is used: the training data is divided into multiple subsets (folds), the model is trained on all but one fold and evaluated on the remaining fold, and this is repeated so that each fold serves once as the validation fold. The whole procedure is run for each combination of hyperparameters, and the mean score across folds is used for comparison.

In [33]:
from sklearn.model_selection import GridSearchCV

### Defining pipeline

In [34]:
knn = KNeighborsRegressor()
knn_pipeline = Pipeline (steps = [('preprocessor', preprocessor),
                                  ('knn', knn)])

### Defining Grid

In [35]:
grid = {'knn__n_neighbors': list(range(5, 30, 2)),
        'knn__weights': ['uniform', 'distance']}

In [36]:
knn_grid = GridSearchCV(knn_pipeline,
                        param_grid = grid,
                        cv = 5,
                        scoring = 'r2')
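To get a feel for the size of this search, the grid can be expanded with scikit-learn's `ParameterGrid`, which enumerates the same combinations that `GridSearchCV` iterates over. This is a minimal sketch; `demo_grid`, `combos`, and `n_folds` are hypothetical names, with the grid values copied from the cell above.

```python
from sklearn.model_selection import ParameterGrid

# Same grid values as above, repeated so the example is self-contained.
demo_grid = {'knn__n_neighbors': list(range(5, 30, 2)),
             'knn__weights': ['uniform', 'distance']}

combos = list(ParameterGrid(demo_grid))
n_folds = 5

print(len(combos))             # 13 neighbor values x 2 weight schemes
print(len(combos) * n_folds)   # model fits performed during cross-validation
print(combos[:3])              # a few of the candidate settings
```

The cost of an exhaustive grid search grows multiplicatively with each added hyperparameter, so it is worth checking the number of fits before launching a large search.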

### Searching for optimal hyperparameters

In [37]:
knn_grid.fit(X_train, y_train)

### What are the optimal values for the parameters?

In [38]:
knn_grid.best_params_

{'knn__n_neighbors': 7, 'knn__weights': 'distance'}

In [39]:
knn_grid.best_score_

0.8531482991210609

### Detailed Search Results

In [40]:
pd.DataFrame(knn_grid.cv_results_)

|  | mean_fit_time | std_fit_time | mean_score_time | std_score_time | param_knn__n_neighbors | param_knn__weights | params | split0_test_score | split1_test_score | split2_test_score | split3_test_score | split4_test_score | mean_test_score | std_test_score | rank_test_score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.014722 | 0.004276 | 0.017744 | 0.005612 | 5 | uniform | `{'knn__n_neighbors': 5, 'knn__weights': 'unifo...` | 0.857304 | 0.853972 | 0.829379 | 0.855338 | 0.858892 | 0.850977 | 0.010928 | 4 |
| 1 | 0.012273 | 0.000885 | 0.014646 | 0.000757 | 5 | distance | `{'knn__n_neighbors': 5, 'knn__weights': 'dista...` | 0.85665 | 0.858882 | 0.817858 | 0.855694 | 0.857488 | 0.849314 | 0.015763 | 5 |
| 2 | 0.011225 | 0.000367 | 0.013766 | 0.000517 | 7 | uniform | `{'knn__n_neighbors': 7, 'knn__weights': 'unifo...` | 0.863646 | 0.849905 | 0.826109 | 0.86808 | 0.854528 | 0.852454 | 0.014658 | 2 |
| 3 | 0.011616 | 0.000594 | 0.013753 | 0.000372 | 7 | distance | `{'knn__n_neighbors': 7, 'knn__weights': 'dista...` | 0.861462 | 0.859444 | 0.817708 | 0.865248 | 0.86188 | 0.853148 | 0.017818 | 1 |
| 4 | 0.01106 | 0.000168 | 0.013241 | 0.000141 | 9 | uniform | `{'knn__n_neighbors': 9, 'knn__weights': 'unifo...` | 0.864624 | 0.838064 | 0.819089 | 0.852279 | 0.843348 | 0.843481 | 0.015158 | 9 |
| 5 | 0.011126 | 0.000271 | 0.013661 | 0.000182 | 9 | distance | `{'knn__n_neighbors': 9, 'knn__weights': 'dista...` | 0.86512 | 0.852849 | 0.816658 | 0.860852 | 0.861978 | 0.851491 | 0.01788 | 3 |
| 6 | 0.010777 | 9.1e-05 | 0.013264 | 0.000191 | 11 | uniform | `{'knn__n_neighbors': 11, 'knn__weights': 'unif...` | 0.857138 | 0.8233 | 0.816639 | 0.837699 | 0.823373 | 0.83163 | 0.01449 | 17 |
| 7 | 0.011472 | 0.000521 | 0.014372 | 0.000361 | 11 | distance | `{'knn__n_neighbors': 11, 'knn__weights': 'dist...` | 0.863945 | 0.845988 | 0.817736 | 0.856368 | 0.856036 | 0.848015 | 0.016179 | 6 |
| 8 | 0.012205 | 0.001056 | 0.014224 | 0.000488 | 13 | uniform | `{'knn__n_neighbors': 13, 'knn__weights': 'unif...` | 0.849161 | 0.807225 | 0.820745 | 0.827329 | 0.807801 | 0.822452 | 0.015405 | 18 |
| 9 | 0.012058 | 0.000407 | 0.016593 | 0.001557 | 13 | distance | `{'knn__n_neighbors': 13, 'knn__weights': 'dist...` | 0.863177 | 0.838261 | 0.821456 | 0.854579 | 0.853126 | 0.84612 | 0.014708 | 7 |
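The rows shown above are easier to compare once they are sorted and pivoted. The sketch below rebuilds a small frame from the `mean_test_score` and `std_test_score` values displayed in the table; the frame name `shown_results` and its shortened column names are assumptions for illustration. On the fitted search object itself, sorting `pd.DataFrame(knn_grid.cv_results_)` by `rank_test_score` would give the same ordering.

```python
import pandas as pd

# Values copied from the rows of the search results table shown above.
shown_results = pd.DataFrame({
    'n_neighbors': [5, 5, 7, 7, 9, 9, 11, 11, 13, 13],
    'weights': ['uniform', 'distance'] * 5,
    'mean_test_score': [0.850977, 0.849314, 0.852454, 0.853148, 0.843481,
                        0.851491, 0.83163, 0.848015, 0.822452, 0.84612],
    'std_test_score': [0.010928, 0.015763, 0.014658, 0.017818, 0.015158,
                       0.01788, 0.01449, 0.016179, 0.015405, 0.014708],
})

# Best three settings by mean cross-validated R2.
top3 = shown_results.sort_values('mean_test_score', ascending=False).head(3)
print(top3)

# Side-by-side comparison of the two weighting schemes for each k.
by_k = shown_results.pivot(index='n_neighbors', columns='weights',
                           values='mean_test_score')
print(by_k)

# Gap between the best and second-best setting, next to the fold-to-fold spread.
gap = top3['mean_test_score'].iloc[0] - top3['mean_test_score'].iloc[1]
print(f"gap between top two: {gap:.6f}")
print(f"smallest std across folds: {shown_results['std_test_score'].min():.6f}")
```

The gap between the top two settings is far smaller than the fold-to-fold standard deviation, so the leading candidates are close to equivalent on this data and the ranking among them should not be over-interpreted.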


## Building the final model

In [41]:
knn = KNeighborsRegressor(n_neighbors=7, weights='distance')
knn_pipeline = Pipeline (steps = [('preprocessor', preprocessor),
                                  ('knn', knn)])
knn_pipeline.fit(X_train, y_train)

In [42]:
r2_score(y_test, knn_pipeline.predict(X_test))

0.8603641992480204
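Rebuilding the pipeline by hand works, but a fitted `GridSearchCV` already holds an equivalent model: with the default `refit=True`, it refits the best combination on the whole training set and exposes it as `best_estimator_`. The sketch below illustrates this on synthetic data, so the dataset, the reduced grid, and the `demo_` names are assumptions and the scores it prints are unrelated to the cars results.

```python
from sklearn.datasets import make_regression
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in data (assumption): not the cars dataset.
X_demo, y_demo = make_regression(n_samples=300, n_features=5,
                                 noise=20.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          train_size=0.8, random_state=80)

demo_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()),
                                ('knn', KNeighborsRegressor())])
demo_grid = {'knn__n_neighbors': [5, 7, 9],
             'knn__weights': ['uniform', 'distance']}

demo_search = GridSearchCV(demo_pipeline, param_grid=demo_grid,
                           cv=5, scoring='r2')
demo_search.fit(X_tr, y_tr)

# The best pipeline, already refitted on all of X_tr.
final_model = demo_search.best_estimator_
test_r2 = r2_score(y_te, final_model.predict(X_te))

print(demo_search.best_params_)
print(f"cross-validated R2: {demo_search.best_score_:.4f}")
print(f"test R2           : {test_r2:.4f}")
```

Using `best_estimator_` avoids retyping the chosen hyperparameters, which removes one place where the final model could drift from what the search actually selected.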