# Hyperparameter Tuning for Gaussian Naive Bayes Model

## Introduction

This notebook aims to optimize the performance of the Gaussian Naive Bayes model applied to the League of Legends matches dataset. We will compare different hyperparameter tuning methods: Grid Search, Randomized Search, and Bayesian Optimization.

## Installing Prerequisites


In [1]:
# !pip install catboost
# !pip install scikit-optimize
# !pip install category_encoders
# !pip install bayesian-optimization

## Importing Libraries

In [2]:
import time
import pickle
import numpy as np
import pandas as pd
import category_encoders as ce
import matplotlib.pyplot as plt
from skopt import BayesSearchCV
from catboost import CatBoostClassifier
from sklearn.naive_bayes import GaussianNB
from bayes_opt import BayesianOptimization
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score

## Loading Data

First, we load the dataset and split it into training and testing sets, holding out 10% of the rows for testing.

In [3]:
#Loading Dataset
file_path = '../../league_of_legends.csv'
df = pd.read_csv(file_path)

#Splitting into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

## Data Preprocessing

We preprocess the data using One-Hot Encoding and Target Encoding.

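We one-hot encode the low-cardinality variables: 'League', 'Season', and 'Type'. Any dummy column that appears only in the training set is added to the test set as zeros so that both frames share the same columns.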

In [4]:
# One-Hot Encoding
train_df_onehot = pd.get_dummies(train_df, columns=['League', 'Season', 'Type'])
test_df_onehot = pd.get_dummies(test_df, columns=['League', 'Season', 'Type'])
missing_cols = set(train_df_onehot.columns) - set(test_df_onehot.columns)

# Handling missing columns
for c in missing_cols:
    test_df_onehot[c] = 0
test_df_onehot = test_df_onehot[train_df_onehot.columns]
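
The manual column alignment above can also be written with `DataFrame.reindex`. The following is a minimal sketch on toy data, not the notebook's actual frames: `train`, `test`, and their values are invented for illustration.

```python
import pandas as pd

# Toy frames (hypothetical): the test split contains a category the
# training split never saw ("D") and lacks two that it did ("B", "C").
train = pd.DataFrame({"League": ["A", "B", "C"], "gold": [1, 2, 3]})
test = pd.DataFrame({"League": ["A", "D"], "gold": [4, 5]})

train_oh = pd.get_dummies(train, columns=["League"])
test_oh = pd.get_dummies(test, columns=["League"])

# reindex keeps the training column order, fills columns missing from the
# test frame with 0, and drops columns that only exist in the test frame.
test_aligned = test_oh.reindex(columns=train_oh.columns, fill_value=0)

print(test_aligned.columns.tolist())
```

Like the loop it mirrors, this drops dummy columns for categories that occur only in the test split, since the model was never trained on them.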

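We apply Target Encoding to the high-cardinality variables (players, champions, and team tags). The encoder is fitted on the training set only and then applied to both sets.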

In [5]:
# Target Encoding
target_cols = [
    'blueTop', 'blueJungle', 'blueMiddle', 'blueADC', 'blueSupport',
    'redTop', 'redJungle', 'redMiddle', 'redADC', 'redSupport',
    'blueTopChamp', 'blueJungleChamp', 'blueMiddleChamp', 'blueADCChamp', 'blueSupportChamp',
    'redTopChamp', 'redJungleChamp', 'redMiddleChamp', 'redADCChamp', 'redSupportChamp',
    'blueTeamTag', 'redTeamTag'
]
target_variable = 'bResult'
encoder = ce.TargetEncoder(cols=target_cols)
encoder.fit(train_df, train_df[target_variable])
train_df_target_encoded = encoder.transform(train_df)
test_df_target_encoded = encoder.transform(test_df)
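
To illustrate the idea behind target encoding, here is a simplified sketch that replaces each category with the mean of the target observed for it in the training data. It is an assumption-laden simplification: `category_encoders.TargetEncoder` additionally smooths each category mean toward the global mean, so its output will generally differ from these plain means. The column names `champ` and `win` are hypothetical.

```python
import pandas as pd

# Toy training data (hypothetical): one categorical column, one binary target.
train = pd.DataFrame({
    "champ": ["Ahri", "Ahri", "Zed", "Zed", "Zed"],
    "win":   [1,      0,      1,     1,     0],
})
test = pd.DataFrame({"champ": ["Ahri", "Zed", "Yasuo"]})

# Per-category mean of the target, learned from the training rows only.
category_means = train.groupby("champ")["win"].mean()
global_mean = train["win"].mean()

# Categories unseen during training fall back to the global mean.
test_encoded = test["champ"].map(category_means).fillna(global_mean)

print(test_encoded.tolist())
```

Learning the means from the training rows alone is what keeps target information from the test set out of the features.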

We then combine the One-Hot and Target Encoded DataFrames for the training and testing sets.

In [6]:
# Removing Target Encoded Columns from One-Hot Encoded DataFrame
train_df_onehot = train_df_onehot.drop(columns=target_cols)
test_df_onehot = test_df_onehot.drop(columns=target_cols)

# Concatenating One-Hot and Target Encoded DataFrames:
train_df_encoded = pd.concat([train_df_onehot, train_df_target_encoded[target_cols]], axis=1)
test_df_encoded = pd.concat([test_df_onehot, test_df_target_encoded[target_cols]], axis=1)

## Feature Selection

We perform feature selection using averaged importance scores from mutual information and CatBoost.

In [7]:
# Separate features and target variable
X_train = train_df_encoded.drop([target_variable], axis=1)
y_train = train_df_encoded[target_variable]

# Calculate Mutual Information scores
mi_scores = mutual_info_classif(X_train, y_train)
mi_scores = pd.Series(mi_scores, name='MI_Scores', index=X_train.columns)

# CatBoost Importance
catboost_model = CatBoostClassifier(iterations=100, verbose=0)
catboost_model.fit(X_train, y_train)
catboost_importances = pd.Series(catboost_model.get_feature_importance(), name='CatBoost_Importance', index=X_train.columns)

# Combine and Normalize
importance_df = pd.concat([mi_scores, catboost_importances], axis=1)
importance_df['MI_Scores'] = (importance_df['MI_Scores'] - importance_df['MI_Scores'].min()) / (importance_df['MI_Scores'].max() - importance_df['MI_Scores'].min())
importance_df['CatBoost_Importance'] = (importance_df['CatBoost_Importance'] - importance_df['CatBoost_Importance'].min()) / (importance_df['CatBoost_Importance'].max() - importance_df['CatBoost_Importance'].min())
importance_df['Combined_Importance'] = (importance_df['MI_Scores'] + importance_df['CatBoost_Importance']) / 2

# Sort and Select Features
sorted_features = importance_df.sort_values(by='Combined_Importance', ascending=False).index
N = 22
selected_features = sorted_features[:N]
X_train_selected = X_train[selected_features]
X_test_selected = test_df_encoded.drop([target_variable], axis=1)[selected_features]
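
The normalize-then-average step can be checked by hand on a small example. This sketch uses three made-up features and made-up scores; `min_max` is a hypothetical helper, not a function from the notebook.

```python
import pandas as pd

def min_max(series):
    """Rescale a Series linearly so its minimum maps to 0 and its maximum to 1."""
    return (series - series.min()) / (series.max() - series.min())

# Hypothetical importance scores on very different scales.
mi = pd.Series([0.0, 0.1, 0.4], index=["a", "b", "c"])
cb = pd.Series([10.0, 50.0, 30.0], index=["a", "b", "c"])

# After rescaling, both sources contribute equally to the average.
combined = (min_max(mi) + min_max(cb)) / 2

# Keep the two highest-ranked features.
top_two = combined.sort_values(ascending=False).index[:2].tolist()
print(combined.round(3).to_dict(), top_two)
```

Without the rescaling, the CatBoost importances would dominate the average simply because they are numerically larger than the mutual information scores.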

## Hyperparameter Tuning

We tune `var_smoothing`, the only hyperparameter searched in this notebook, with three methods: Grid Search, Randomized Search, and Bayesian Optimization.
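
For context, scikit-learn documents `var_smoothing` as the portion of the largest feature variance that is added to all variances for numerical stability; the fitted model exposes the resulting additive value as `epsilon_`. The sketch below checks that relationship on synthetic data. `X_demo` and `y_demo` are invented and unrelated to the League of Legends dataset.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Synthetic features (hypothetical) with very different scales per column.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3)) * np.array([1.0, 10.0, 100.0])
y_demo = (X_demo[:, 0] > 0).astype(int)

largest_variance = X_demo.var(axis=0).max()

epsilons = {}
for vs in (1e-11, 1e-9, 1e-7):
    clf = GaussianNB(var_smoothing=vs).fit(X_demo, y_demo)
    # epsilon_ is the absolute value added to every per-class variance.
    epsilons[vs] = clf.epsilon_
    print(f"var_smoothing={vs:.0e}  epsilon_={clf.epsilon_:.3e}")
```

Because the added value is tiny compared with typical feature variances across this range, it is plausible for predictions to change little or not at all between candidates.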

In [8]:
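# Train a Gaussian Naive Bayes model with the given parameters and score it
# (used as the objective for Bayesian Optimization)
def train_evaluate_nb(params=None):
    # Create a Gaussian Naive Bayes classifier with the supplied hyperparameters;
    # fall back to the defaults when no parameters are given
    nb_clf = GaussianNB(**(params or {}))

    # Fit the model to the training data
    nb_clf.fit(X_train_selected, y_train)

    # Make predictions on the test data
    y_pred = nb_clf.predict(X_test_selected)

    # Return the accuracy score (y_test is defined in the Bayesian Optimization cell)
    return accuracy_score(y_test, y_pred)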

In [9]:
# Grid Search
param_grid = {
    'var_smoothing': [1e-11, 1e-10, 1e-9, 1e-8, 1e-7]
}

grid_search = GridSearchCV(GaussianNB(), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_selected, y_train)
grid_best_params_nb = grid_search.best_params_
grid_best_accuracy_nb = grid_search.best_score_

In [10]:
# Randomized Search
param_dist = {
    'var_smoothing': np.logspace(-11, -7, num=20)
}

random_search = RandomizedSearchCV(GaussianNB(), param_dist, n_iter=10, cv=5, scoring='accuracy', random_state=42)
random_search.fit(X_train_selected, y_train)
random_best_params_nb = random_search.best_params_
random_best_accuracy_nb = random_search.best_score_

In [11]:
# Bayesian Optimization
y_test = test_df_encoded[target_variable]

def nb_bayesian(var_smoothing):
    params = {
        'var_smoothing': var_smoothing
    }
    return train_evaluate_nb(params)

optimizer = BayesianOptimization(
    f=nb_bayesian,
    pbounds={
        'var_smoothing': (1e-11, 1e-7)
    },
    random_state=42
)

optimizer.maximize(init_points=5, n_iter=15)
bayesian_best_params_nb = optimizer.max['params']
bayesian_best_accuracy_nb = optimizer.max['target']

|   iter    |  target   | var_sm... |
-------------------------------------
| 1         | 0.6346    | 3.746e-08 |
| 2         | 0.6346    | 9.507e-08 |
| 3         | 0.6346    | 7.32e-08  |
| 4         | 0.6346    | 5.987e-08 |
| 5         | 0.6346    | 1.561e-08 |
| 6         | 0.6346    | 3.916e-08 |
| 7         | 0.6346    | 1.076e-08 |
| 8         | 0.6346    | 3.357e-08 |
| 9         | 0.6346    | 1.252e-08 |
| 10        | 0.6346    | 9.561e-08 |
| 11        | 0.6346    | 4.846e-08 |
| 12        | 0.6346    | 5.741e-09 |
| 13        | 0.6346    | 6.952e-08 |
| 14        | 0.6346    | 5.892e-08 |
| 15        | 0.6346    | 1.605e-09 |
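
The objective used for Bayesian Optimization returns accuracy on the test set, whereas Grid Search and Randomized Search score candidates with 5-fold cross-validation on the training set. One possible alternative, sketched here under stated assumptions, is an objective that uses the same cross-validation and searches the base-10 exponent of `var_smoothing`, since the bounds span four orders of magnitude. The function `nb_cv_objective` and the synthetic `X_demo`, `y_demo` are hypothetical stand-ins for the notebook's selected training features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

# Synthetic stand-in data (hypothetical); in the notebook this role would be
# played by X_train_selected and y_train.
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

def nb_cv_objective(log10_var_smoothing):
    """Mean 5-fold CV accuracy for var_smoothing = 10 ** log10_var_smoothing."""
    clf = GaussianNB(var_smoothing=10.0 ** log10_var_smoothing)
    scores = cross_val_score(clf, X_demo, y_demo, cv=5, scoring="accuracy")
    return scores.mean()

# Evaluate a single candidate, equivalent to var_smoothing=1e-9.
score = nb_cv_objective(-9.0)
print(f"CV accuracy at var_smoothing=1e-9: {score:.4f}")
```

A function of this shape could be passed as `f` to `BayesianOptimization` with `pbounds={'log10_var_smoothing': (-11, -7)}`, which would keep the test set untouched until the final evaluation.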

## Evaluating Models

After hyperparameter tuning, we evaluate the best models from each method on the test data.

In [12]:
def train_evaluate(params, X_train, y_train, X_test, y_test, method):
    var_smoothing = params.get('var_smoothing', 1e-9)
    
    # Training the model and recording training time
    naive_bayes_clf = GaussianNB(var_smoothing=var_smoothing)
    start_train_time = time.time()
    naive_bayes_clf.fit(X_train, y_train)
    end_train_time = time.time()  
    
    # Testing the model and recording prediction time
    start_test_time = time.time()
    y_pred = naive_bayes_clf.predict(X_test)
    y_prob = naive_bayes_clf.predict_proba(X_test)[:, 1]
    end_test_time = time.time()  
    
    # Calculating the metrics
    test_accuracy = accuracy_score(y_test, y_pred)
    test_precision = precision_score(y_test, y_pred)
    test_recall = recall_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred)
    train_time = end_train_time - start_train_time
    test_time = end_test_time - start_test_time
    
    # Creating the results dataframe
    results_df = pd.DataFrame({
        'Method': [method],
        'Accuracy': [test_accuracy],
        'Precision': [test_precision],
        'Recall': [test_recall],
        'F1-Score': [test_f1],
        'Training Time (s)': [train_time],  
        'Prediction Time (s)': [test_time]     
    })
    
    return results_df
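
Fitting and predicting with Gaussian Naive Bayes often takes well under a second, so the resolution of the timer matters. As a hedged alternative to `time.time()`, the sketch below uses `time.perf_counter()`, which the Python documentation describes as a clock with the highest available resolution for measuring short durations. The helper `timed` is hypothetical and not part of the notebook.

```python
import time

def timed(func, *args, **kwargs):
    """Call func(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = func(*args, **kwargs)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Example: time a cheap built-in operation.
total, seconds = timed(sum, range(100_000))
print(f"sum={total}, elapsed={seconds:.6f}s")
```

In the evaluation function, the same pattern would wrap the calls to `fit` and `predict`; single measurements this short remain noisy, so repeated runs give a more reliable picture.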

In [13]:
# Prepare the test data
X_test_selected = test_df_encoded.drop([target_variable], axis=1)[selected_features]
y_test = test_df_encoded[target_variable]

# Convert Bayesian best parameters to correct format if needed
bayesian_best_params_nb = {key: int(value) if isinstance(value, float) and value.is_integer() else value for key, value in bayesian_best_params_nb.items()}

# Evaluate the model using best parameters from each method for Gaussian Naive Bayes
results_grid_nb = train_evaluate(grid_best_params_nb, X_train_selected, y_train, X_test_selected, y_test, 'Grid Search')
results_random_nb = train_evaluate(random_best_params_nb, X_train_selected, y_train, X_test_selected, y_test, 'Random Search')
results_bayesian_nb = train_evaluate(bayesian_best_params_nb, X_train_selected, y_train, X_test_selected, y_test, 'Bayesian Optimization')

# Concatenate the results
final_results_nb = pd.concat([results_grid_nb, results_random_nb, results_bayesian_nb], axis=0).reset_index(drop=True)

## Results Comparison

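We compare the models obtained with the best hyperparameters from Grid Search, Randomized Search, and Bayesian Optimization. The "Best Accuracy" printed for Grid Search and Randomized Search is a 5-fold cross-validated score on the training set, whereas the Bayesian Optimization value is accuracy on the held-out test set, so these printed figures are not directly comparable. The table that follows evaluates all three parameter sets on the same test data.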

In [14]:
# Concatenate the results for Naive Bayes
final_results_nb = pd.concat([results_grid_nb, results_random_nb, results_bayesian_nb], axis=0).reset_index(drop=True)

# Best Parameters for Naive Bayes
print(f"Grid Search (Naive Bayes) - Best Params: {grid_best_params_nb}, Best Accuracy: {grid_best_accuracy_nb}")
print(f"Random Search (Naive Bayes) - Best Params: {random_best_params_nb}, Best Accuracy: {random_best_accuracy_nb}")
print(f"Bayesian Optimization (Naive Bayes) - Best Params: {bayesian_best_params_nb}, Best Accuracy: {bayesian_best_accuracy_nb}")

# Display the final results for Naive Bayes
print("Final Results for Naive Bayes:")
print(final_results_nb)

Grid Search (Naive Bayes) - Best Params: {'var_smoothing': 1e-11}, Best Accuracy: 0.7063919951016725
Random Search (Naive Bayes) - Best Params: {'var_smoothing': 1e-11}, Best Accuracy: 0.7063919951016725
Bayesian Optimization (Naive Bayes) - Best Params: {'var_smoothing': 3.7460266483547775e-08}, Best Accuracy: 0.6345646437994723
Final Results for Naive Bayes:
                  Method  Accuracy  Precision    Recall  F1-Score  Training Time (s)  Prediction Time (s)
0            Grid Search  0.634565   0.667447  0.678571  0.672963           0.000439             0.004463
1          Random Search  0.634565   0.667447  0.678571  0.672963           0.005017             0.002009
2  Bayesian Optimization  0.634565   0.667447  0.678571  0.672963           0.003999             0.001998
