# Hyperparameter Tuning for K-Nearest Neighbors Model

## Introduction

This notebook aims to optimize the performance of the K-Nearest Neighbors Model applied to the League of Legends matches dataset. We will compare different hyperparameter tuning methods: Grid Search, Randomized Search, and Bayesian Optimization.

## Installing Prerequisites


In [1]:
# !pip install catboost
# !pip install scikit-optimize
# !pip install category_encoders
# !pip install bayesian-optimization

## Importing Libraries

In [2]:
import time
import pickle
import numpy as np
import pandas as pd
import category_encoders as ce
import matplotlib.pyplot as plt
from skopt import BayesSearchCV
from catboost import CatBoostClassifier
from bayes_opt import BayesianOptimization
from sklearn.metrics import accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score

## Loading Data

First, we load the dataset and split it into a training and testing sets.

In [3]:
#Loading Dataset
file_path = '../../sentiment_scores.csv'
df = pd.read_csv(file_path)

#Splitting into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

## Data Preprocessing

We preprocess the data using One-Hot Encoding and Target Encoding.

One-Hot Encoding the low cardinality variables, namely, 'League',  'Season' and 'Type'.

In [4]:
# One-Hot Encoding
train_df_onehot = pd.get_dummies(train_df, columns=['League', 'Season', 'Type'])
test_df_onehot = pd.get_dummies(test_df, columns=['League', 'Season', 'Type'])
missing_cols = set(train_df_onehot.columns) - set(test_df_onehot.columns)

# Handling missing columns
for c in missing_cols:
    test_df_onehot[c] = 0
test_df_onehot = test_df_onehot[train_df_onehot.columns]

Applying Target Encoding to the high cardinality variables.

In [5]:
# Target Encoding
target_cols = [
    'blueTop', 'blueJungle', 'blueMiddle', 'blueADC', 'blueSupport',
    'redTop', 'redJungle', 'redMiddle', 'redADC', 'redSupport',
    'blueTopChamp', 'blueJungleChamp', 'blueMiddleChamp', 'blueADCChamp', 'blueSupportChamp',
    'redTopChamp', 'redJungleChamp', 'redMiddleChamp', 'redADCChamp', 'redSupportChamp',
    'blueTeamTag', 'redTeamTag'
]
target_variable = 'bResult'
encoder = ce.TargetEncoder(cols=target_cols)
encoder.fit(train_df, train_df[target_variable])
train_df_target_encoded = encoder.transform(train_df)
test_df_target_encoded = encoder.transform(test_df)

Combining the One-Hot and Target Encoded Dataframes for training and testing sets.

In [6]:
# Removing Target Encoded Columns from One-Hot Encoded DataFrame
train_df_onehot = train_df_onehot.drop(columns=target_cols, axis=1)
test_df_onehot = test_df_onehot.drop(columns=target_cols, axis=1)

# Concatenating One-Hot and Target Encoded DataFrames:
train_df_encoded = pd.concat([train_df_onehot, train_df_target_encoded[target_cols]], axis=1)
test_df_encoded = pd.concat([test_df_onehot, test_df_target_encoded[target_cols]], axis=1)

## Feature Selection

We perform feature selection using averaged importance scores from mutual information and CatBoost.

In [7]:
# Separate features and target variable
X_train = train_df_encoded.drop([target_variable], axis=1)
y_train = train_df_encoded[target_variable]

# Calculate Mutual Information scores
mi_scores = mutual_info_classif(X_train, y_train)
mi_scores = pd.Series(mi_scores, name='MI_Scores', index=X_train.columns)

# CatBoost Importance
catboost_model = CatBoostClassifier(iterations=100, verbose=0)
catboost_model.fit(X_train, y_train)
catboost_importances = pd.Series(catboost_model.get_feature_importance(), name='CatBoost_Importance', index=X_train.columns)

# Combine and Normalize
importance_df = pd.concat([mi_scores, catboost_importances], axis=1)
importance_df['MI_Scores'] = (importance_df['MI_Scores'] - importance_df['MI_Scores'].min()) / (importance_df['MI_Scores'].max() - importance_df['MI_Scores'].min())
importance_df['CatBoost_Importance'] = (importance_df['CatBoost_Importance'] - importance_df['CatBoost_Importance'].min()) / (importance_df['CatBoost_Importance'].max() - importance_df['CatBoost_Importance'].min())
importance_df['Combined_Importance'] = (importance_df['MI_Scores'] + importance_df['CatBoost_Importance']) / 2

# Sort and Select Features
sorted_features = importance_df.sort_values(by='Combined_Importance', ascending=False).index
N = 22
selected_features = sorted_features[:N]
X_train_selected = X_train[selected_features]
X_test_selected = test_df_encoded.drop([target_variable], axis=1)[selected_features]

## Hyperparameter Tuning

We use three methods for hyperparameter tuning: Grid Search, Randomized Search, and Bayesian Optimization.

In [8]:
# Training the model to prepare for Hyperparameter Tuning
def train_evaluate_knn(params):
    knn_clf = KNeighborsClassifier(
        n_neighbors=int(params['n_neighbors']),
        weights=params['weights'],
        algorithm=params['algorithm'],
        leaf_size=int(params['leaf_size']),
        p=params['p'],
    )
    knn_clf.fit(X_train_selected, y_train)
    y_pred = knn_clf.predict(X_test_selected)
    return accuracy_score(y_test, y_pred)

In [9]:
# Grid Search
param_grid = {
    'n_neighbors': [3, 5, 7, 9], 
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [20, 30, 40], 
    'p': [1, 2]  
}

grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_selected, y_train)
grid_best_params_knn = grid_search.best_params_
grid_best_accuracy_knn = grid_search.best_score_

print(f"Grid Search - Best Params: {grid_best_params_knn}, Best Accuracy: {grid_best_accuracy_knn}")

Grid Search - Best Params: {'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 9, 'p': 2, 'weights': 'distance'}, Best Accuracy: 0.6846954121147669


In [10]:
# Randomized Search
param_dist_knn = {
    'n_neighbors': [3, 5, 7, 9, 11],  
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
    'leaf_size': [20, 30, 40],  
    'p': [1, 2]  
}

random_search_knn = RandomizedSearchCV(KNeighborsClassifier(), param_dist_knn, n_iter=20, cv=5, scoring='accuracy')
random_search_knn.fit(X_train_selected, y_train)
random_best_params_knn = random_search_knn.best_params_
random_best_accuracy_knn = random_search_knn.best_score_

print(f"Randomized Search - Best Params: {random_best_params_knn}, Best Accuracy: {random_best_accuracy_knn}")

Randomized Search - Best Params: {'weights': 'distance', 'p': 2, 'n_neighbors': 11, 'leaf_size': 20, 'algorithm': 'ball_tree'}, Best Accuracy: 0.6873357824970728


In [11]:
# Bayesian Optimization
y_test = test_df_encoded[target_variable]

def knn_bayesian(n_neighbors, leaf_size, p, weights, algorithm):
    params = {
        'n_neighbors': int(n_neighbors),
        'leaf_size': int(leaf_size),
        'p': int(p),
        'weights': 'uniform' if weights < 0.5 else 'distance',
        'algorithm': 'auto' if algorithm < 0.5 else 'ball_tree'
    }
    return train_evaluate_knn(params)


optimizer_knn = BayesianOptimization(
    f=knn_bayesian,
    pbounds={
        'n_neighbors': (3, 15), 
        'leaf_size': (20, 50),  
        'p': (1, 2), 
        'weights': (0, 1),  
        'algorithm': (0, 1)  
    },
    random_state=42,
)

optimizer_knn.maximize(init_points=5, n_iter=15)

# Extracting the best parameters and the best accuracy
bayesian_best_params_knn = optimizer_knn.max['params']
bayesian_best_accuracy_knn = optimizer_knn.max['target']

# Convert optimized parameters to integers where necessary and map float to string parameters
bayesian_best_params_knn['n_neighbors'] = int(bayesian_best_params_knn['n_neighbors'])
bayesian_best_params_knn['leaf_size'] = int(bayesian_best_params_knn['leaf_size'])
bayesian_best_params_knn['p'] = int(bayesian_best_params_knn['p'])
bayesian_best_params_knn['weights'] = 'uniform' if bayesian_best_params_knn['weights'] < 0.5 else 'distance'
bayesian_best_params_knn['algorithm'] = 'auto' if bayesian_best_params_knn['algorithm'] < 0.5 else 'ball_tree'

print(f"Bayesian Optimization - Best Params: {bayesian_best_params_knn}, Best Accuracy: {bayesian_best_accuracy_knn}")

|   iter    |  target   | algorithm | leaf_size | n_neig... |     p     |  weights  |
-------------------------------------------------------------------------------------
| [0m1        [0m | [0m0.6148   [0m | [0m0.3745   [0m | [0m48.52    [0m | [0m11.78    [0m | [0m1.599    [0m | [0m0.156    [0m |
| [95m2        [0m | [95m0.6201   [0m | [95m0.156    [0m | [95m21.74    [0m | [95m13.39    [0m | [95m1.601    [0m | [95m0.7081   [0m |
| [0m3        [0m | [0m0.6121   [0m | [0m0.02058  [0m | [0m49.1     [0m | [0m12.99    [0m | [0m1.212    [0m | [0m0.1818   [0m |
| [0m4        [0m | [0m0.6069   [0m | [0m0.1834   [0m | [0m29.13    [0m | [0m9.297    [0m | [0m1.432    [0m | [0m0.2912   [0m |
| [0m5        [0m | [0m0.5897   [0m | [0m0.6119   [0m | [0m24.18    [0m | [0m6.506    [0m | [0m1.366    [0m | [0m0.4561   [0m |
| [0m6        [0m | [0m0.6161   [0m | [0m1.0      [0m | [0m24.76    [0m | [0m15.0     [0m | [0m2.0  

## Evaluating Models

After hyperparameter tuning, we evaluate the best models from each method on the test data.

In [12]:
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
import time
import pandas as pd

def train_evaluate(params, X_train, y_train, X_test, y_test, method):
    for param_name in ['n_neighbors', 'leaf_size', 'p']:
        if param_name in params:
            params[param_name] = int(params[param_name])

    # Training the model and recording training time
    start_train_time = time.time()
    knn_clf = KNeighborsClassifier(**params)
    knn_clf.fit(X_train, y_train)
    end_train_time = time.time()  
    
    # Testing the model and recording prediction time
    start_test_time = time.time()  
    y_pred = knn_clf.predict(X_test)
    y_prob = knn_clf.predict_proba(X_test)[:, 1]
    end_test_time = time.time()  
    
    # Calculating the metrics
    test_accuracy = accuracy_score(y_test, y_pred)
    test_precision = precision_score(y_test, y_pred)
    test_recall = recall_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred)
    train_time = end_train_time - start_train_time
    test_time = end_test_time - start_test_time
    
    # Creating the results dataframe
    results_df = pd.DataFrame({
        'Method': [method],
        'Accuracy': [test_accuracy],
        'Precision': [test_precision],
        'Recall': [test_recall],
        'F1-Score': [test_f1],
        'Training Time (s)': [train_time],  
        'Prediction Time (s)': [test_time]     
    })
    
    return results_df

In [13]:
# Prepare the test data
X_test_selected = test_df_encoded.drop([target_variable], axis=1)[selected_features]
y_test = test_df_encoded[target_variable]

# Convert Bayesian best parameters to the correct format for k-NN
bayesian_best_params_knn = {key: int(value) if isinstance(value, float) and value.is_integer() else value for key, value in bayesian_best_params_knn.items()}

# Evaluate the k-NN model using best parameters from each method
results_grid_knn = train_evaluate(grid_best_params_knn, X_train_selected, y_train, X_test_selected, y_test, 'Grid Search')
results_random_knn = train_evaluate(random_best_params_knn, X_train_selected, y_train, X_test_selected, y_test, 'Random Search')
results_bayesian_knn = train_evaluate(bayesian_best_params_knn, X_train_selected, y_train, X_test_selected, y_test, 'Bayesian Optimization')

## Results Comparison

Comparing the performance of the models using the best hyperparameters from Grid Search, Randomized Search, and Bayesian Optimization methods.

In [14]:
# Concatenate the k-NN results
final_results_knn = pd.concat([results_grid_knn, results_random_knn, results_bayesian_knn], axis=0).reset_index(drop=True)

# Display the best parameters for k-NN
print(f"Grid Search for k-NN - Best Params: {grid_best_params_knn}, Best Accuracy: {grid_best_accuracy_knn}")
print(f"Random Search for k-NN - Best Params: {random_best_params_knn}, Best Accuracy: {random_best_accuracy_knn}")
print(f"Bayesian Optimization for k-NN - Best Params: {bayesian_best_params_knn}, Best Accuracy: {bayesian_best_accuracy_knn}")

# Display the final results for k-NN
print(final_results_knn)

Grid Search for k-NN - Best Params: {'algorithm': 'auto', 'leaf_size': 20, 'n_neighbors': 9, 'p': 2, 'weights': 'distance'}, Best Accuracy: 0.6846954121147669
Random Search for k-NN - Best Params: {'weights': 'distance', 'p': 2, 'n_neighbors': 11, 'leaf_size': 20, 'algorithm': 'ball_tree'}, Best Accuracy: 0.6873357824970728
Bayesian Optimization for k-NN - Best Params: {'algorithm': 'auto', 'leaf_size': 21, 'n_neighbors': 13, 'p': 1, 'weights': 'distance'}, Best Accuracy: 0.6200527704485488
                  Method  Accuracy  Precision    Recall  F1-Score  \
0            Grid Search  0.612137   0.641892  0.678571  0.659722   
1          Random Search  0.608179   0.638202  0.676190  0.656647   
2  Bayesian Optimization  0.620053   0.646018  0.695238  0.669725   

   Training Time (s)  Prediction Time (s)  
0           0.004257             0.140374  
1           0.020000             0.311360  
2           0.000000             0.221242  
