# Hyperparameter Tuning for XGBoost Model

## Introduction

This notebook aims to optimize the performance of the XGBoost classifier applied to the League of Legends matches dataset. We will compare different hyperparameter tuning methods: Grid Search, Randomized Search, and Bayesian Optimization.

## Installing Prerequisites


In [1]:
# !pip install xgboost
# !pip install catboost
# !pip install scikit-optimize
# !pip install category_encoders
# !pip install bayesian-optimization

## Importing Libraries

In [2]:
import time
import pickle
import numpy as np
import pandas as pd
import xgboost as xgb
import category_encoders as ce
import matplotlib.pyplot as plt
from skopt import BayesSearchCV
from catboost import CatBoostClassifier
from bayes_opt import BayesianOptimization
from sklearn.feature_selection import mutual_info_classif
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_curve, roc_auc_score

## Loading Data

First, we load the dataset and split it into training and testing sets, holding out 10% of the rows for testing.

In [3]:
# Load the dataset
file_path = '../../league_of_legends.csv'
df = pd.read_csv(file_path)

# Split into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.1, random_state=42)

## Data Preprocessing

We preprocess the data using One-Hot Encoding and Target Encoding.

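We one-hot encode the low-cardinality variables 'League', 'Season', and 'Type'. Categories that appear only in the training split are added to the test split as all-zero columns so that both frames share the same columns.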

In [4]:
# One-Hot Encoding
train_df_onehot = pd.get_dummies(train_df, columns=['League', 'Season', 'Type'])
test_df_onehot = pd.get_dummies(test_df, columns=['League', 'Season', 'Type'])
missing_cols = set(train_df_onehot.columns) - set(test_df_onehot.columns)

# Handling missing columns
for c in missing_cols:
    test_df_onehot[c] = 0
test_df_onehot = test_df_onehot[train_df_onehot.columns]
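
The alignment step above can also be written with `DataFrame.reindex`, which adds the missing dummy columns and fixes the column order in one call. The sketch below uses toy data; the frames, the `gamelength` column, and the values are made up for illustration.

```python
import pandas as pd

# Toy frames: the test split happens to contain no 'LCK' rows.
train = pd.DataFrame({"League": ["LCK", "LCS", "LEC"], "gamelength": [30, 41, 35]})
test = pd.DataFrame({"League": ["LCS", "LEC"], "gamelength": [28, 33]})

train_onehot = pd.get_dummies(train, columns=["League"])
test_onehot = pd.get_dummies(test, columns=["League"])

# Reindex to the training columns: dummies missing from the test split are
# filled with 0, and the column order matches the training frame.
test_aligned = test_onehot.reindex(columns=train_onehot.columns, fill_value=0)

print(list(test_aligned.columns))
```

Unlike the loop above, `reindex` also drops any dummy column that exists only in the test split, which the indexing step `test_df_onehot[train_df_onehot.columns]` does as well.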

Applying Target Encoding to the high cardinality variables.

In [5]:
# Target Encoding
target_cols = [
    'blueTop', 'blueJungle', 'blueMiddle', 'blueADC', 'blueSupport',
    'redTop', 'redJungle', 'redMiddle', 'redADC', 'redSupport',
    'blueTopChamp', 'blueJungleChamp', 'blueMiddleChamp', 'blueADCChamp', 'blueSupportChamp',
    'redTopChamp', 'redJungleChamp', 'redMiddleChamp', 'redADCChamp', 'redSupportChamp',
    'blueTeamTag', 'redTeamTag'
]
target_variable = 'bResult'
encoder = ce.TargetEncoder(cols=target_cols)
encoder.fit(train_df, train_df[target_variable])
train_df_target_encoded = encoder.transform(train_df)
test_df_target_encoded = encoder.transform(test_df)
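
As a rough intuition for what target encoding does, each category is replaced by the mean of the target over the training rows in that category. The sketch below is a simplified, unsmoothed version on made-up data; `fit_target_means` and `transform` are illustrative helpers. `category_encoders.TargetEncoder` additionally blends each category mean with the global mean, so its outputs will differ.

```python
from collections import defaultdict

def fit_target_means(categories, targets):
    """Return (per-category mean target, global mean) learned from training rows."""
    sums, counts = defaultdict(float), defaultdict(int)
    for cat, y in zip(categories, targets):
        sums[cat] += y
        counts[cat] += 1
    means = {cat: sums[cat] / counts[cat] for cat in sums}
    global_mean = sum(targets) / len(targets)
    return means, global_mean

def transform(categories, means, global_mean):
    """Encode categories; unseen ones fall back to the global mean."""
    return [means.get(cat, global_mean) for cat in categories]

# Hypothetical champion picks and blue-side results (1 = blue win).
train_champs = ["Ahri", "Ahri", "Zed", "Zed", "Zed", "Lux"]
train_wins = [1, 0, 1, 1, 0, 1]

means, global_mean = fit_target_means(train_champs, train_wins)
encoded_test = transform(["Ahri", "Zed", "Yasuo"], means, global_mean)
print(encoded_test)
```

The statistics are learned from the training rows only and then applied to the test rows, which mirrors the `fit` on `train_df` followed by `transform` on both splits in the cell above.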

Combining the One-Hot and Target Encoded Dataframes for training and testing sets.

In [6]:
# Removing Target Encoded Columns from One-Hot Encoded DataFrame
train_df_onehot = train_df_onehot.drop(columns=target_cols)
test_df_onehot = test_df_onehot.drop(columns=target_cols)

# Concatenating One-Hot and Target Encoded DataFrames:
train_df_encoded = pd.concat([train_df_onehot, train_df_target_encoded[target_cols]], axis=1)
test_df_encoded = pd.concat([test_df_onehot, test_df_target_encoded[target_cols]], axis=1)

## Feature Selection

We perform feature selection using averaged importance scores from mutual information and CatBoost.

In [7]:
# Separate features and target variable
X_train = train_df_encoded.drop([target_variable], axis=1)
y_train = train_df_encoded[target_variable]

# Calculate Mutual Information scores
mi_scores = mutual_info_classif(X_train, y_train)
mi_scores = pd.Series(mi_scores, name='MI_Scores', index=X_train.columns)

# CatBoost Importance
catboost_model = CatBoostClassifier(iterations=100, verbose=0)
catboost_model.fit(X_train, y_train)
catboost_importances = pd.Series(catboost_model.get_feature_importance(), name='CatBoost_Importance', index=X_train.columns)

# Combine and Normalize
importance_df = pd.concat([mi_scores, catboost_importances], axis=1)
importance_df['MI_Scores'] = (importance_df['MI_Scores'] - importance_df['MI_Scores'].min()) / (importance_df['MI_Scores'].max() - importance_df['MI_Scores'].min())
importance_df['CatBoost_Importance'] = (importance_df['CatBoost_Importance'] - importance_df['CatBoost_Importance'].min()) / (importance_df['CatBoost_Importance'].max() - importance_df['CatBoost_Importance'].min())
importance_df['Combined_Importance'] = (importance_df['MI_Scores'] + importance_df['CatBoost_Importance']) / 2

# Sort and Select Features
sorted_features = importance_df.sort_values(by='Combined_Importance', ascending=False).index
N = 22
selected_features = sorted_features[:N]
X_train_selected = X_train[selected_features]
X_test_selected = test_df_encoded.drop([target_variable], axis=1)[selected_features]
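
The combination step rescales both score types to the range [0, 1] before averaging, since mutual information and CatBoost importances live on different scales. A minimal sketch of the same arithmetic in plain Python, using hypothetical scores for three of the columns:

```python
def min_max(scores):
    """Rescale a dict of scores to [0, 1]; assumes the scores are not all equal."""
    lo, hi = min(scores.values()), max(scores.values())
    return {name: (value - lo) / (hi - lo) for name, value in scores.items()}

# Hypothetical raw scores on different scales.
mi = {"blueTeamTag": 0.04, "redTeamTag": 0.03, "blueTop": 0.01}
catboost = {"blueTeamTag": 20.0, "redTeamTag": 30.0, "blueTop": 10.0}

mi_n, cb_n = min_max(mi), min_max(catboost)

# Average the two normalized scores and keep the top N features.
combined = {name: (mi_n[name] + cb_n[name]) / 2 for name in mi}
ranked = sorted(combined, key=combined.get, reverse=True)
top_n = ranked[:2]
print(top_n)
```

In this toy case `redTeamTag` ranks first even though `blueTeamTag` has the higher mutual information, because the two normalized scores are weighted equally.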

## Hyperparameter Tuning

We use three methods for hyperparameter tuning: Grid Search, Randomized Search, and Bayesian Optimization.

In [8]:
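# Objective used by Bayesian Optimization: trains an XGBoost classifier with
# the given hyperparameters and returns its accuracy.
# NOTE: the score is computed on the test set (X_test_selected, y_test), so
# hyperparameters tuned with this function have already seen the test data.
# y_test is a global that is defined in the Bayesian Optimization cell.
def train_evaluate_xgb(params):
    xgb_clf = xgb.XGBClassifier(
        learning_rate=params['learning_rate'],
        max_depth=int(params['max_depth']),        # int() truncates, e.g. 3.9 -> 3
        n_estimators=int(params['n_estimators']),  # int() truncates here as well
        subsample=params['subsample'],
        colsample_bytree=params['colsample_bytree'],
        random_state=42
    )
    xgb_clf.fit(X_train_selected, y_train)
    y_pred = xgb_clf.predict(X_test_selected)
    return accuracy_score(y_test, y_pred)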
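
Because `train_evaluate_xgb` scores candidates on the test set, a search driven by it tunes against the same data that is later used for the final evaluation. One common alternative is to score candidates on held-out folds of the training data. The sketch below illustrates the idea in plain Python; `k_fold_indices`, `cv_accuracy`, and the stand-in `majority_fit_predict` model are illustrative names, not part of the notebook.

```python
import random

def k_fold_indices(n_samples, n_splits=5, seed=42):
    """Yield (train_idx, val_idx) pairs; each sample is used for validation once."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::n_splits] for i in range(n_splits)]
    for i, val_idx in enumerate(folds):
        train_idx = [j for k, fold in enumerate(folds) if k != i for j in fold]
        yield train_idx, val_idx

def cv_accuracy(fit_predict, X, y, n_splits=5):
    """Mean validation accuracy over the folds; the test set is never touched."""
    scores = []
    for train_idx, val_idx in k_fold_indices(len(X), n_splits):
        preds = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in val_idx])
        correct = sum(p == y[i] for p, i in zip(preds, val_idx))
        scores.append(correct / len(val_idx))
    return sum(scores) / len(scores)

# Stand-in model: always predicts the majority class of its training fold.
def majority_fit_predict(X_tr, y_tr, X_val):
    majority = max(set(y_tr), key=y_tr.count)
    return [majority] * len(X_val)

X = [[i] for i in range(20)]
y = [1] * 14 + [0] * 6
score = cv_accuracy(majority_fit_predict, X, y)
print(round(score, 3))
```

In the notebook itself, the objective could return a cross-validated score on `X_train_selected` and `y_train`, for example via the already imported `cross_val_score`, instead of test accuracy.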

In [9]:
# Grid Search
param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [2, 3, 4],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

grid_search = GridSearchCV(xgb.XGBClassifier(random_state=42), param_grid, cv=5, scoring='accuracy')
grid_search.fit(X_train_selected, y_train)
grid_best_params = grid_search.best_params_
grid_best_accuracy = grid_search.best_score_
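
Grid Search evaluates every combination in `param_grid`, so its cost grows multiplicatively with each added parameter. The short sketch below counts the candidates and model fits implied by the grid above, using only `itertools`:

```python
from itertools import product

param_grid = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [2, 3, 4],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}

# Every combination of one value per parameter is one candidate.
candidates = list(product(*param_grid.values()))
n_candidates = len(candidates)   # 3 values for each of 5 parameters
n_fits = n_candidates * 5        # cv=5 folds per candidate

print(n_candidates, n_fits)
```

With scikit-learn's default `refit=True`, one further fit on the full training set follows for the best candidate.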

In [10]:
# Randomized Search
param_dist = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [2, 3, 4],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9]
}

random_search = RandomizedSearchCV(xgb.XGBClassifier(random_state=42), param_dist, n_iter=20, cv=5, scoring='accuracy')
random_search.fit(X_train_selected, y_train)
random_best_params = random_search.best_params_
random_best_accuracy = random_search.best_score_
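
Randomized Search draws a fixed number of candidates from the same space instead of enumerating it. When every parameter is given as a list, scikit-learn samples candidates without replacement. The sketch below mimics this with the standard library to show how little of the grid `n_iter=20` covers; the seed is an arbitrary choice for the illustration.

```python
import random
from itertools import product

param_dist = {
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [2, 3, 4],
    'n_estimators': [100, 200, 300],
    'subsample': [0.7, 0.8, 0.9],
    'colsample_bytree': [0.7, 0.8, 0.9],
}

all_candidates = list(product(*param_dist.values()))

# Draw 20 distinct candidates, as n_iter=20 would.
rng = random.Random(42)
sampled = rng.sample(all_candidates, k=20)

coverage = len(sampled) / len(all_candidates)
print(f"{coverage:.1%} of the grid is evaluated")
```

The `RandomizedSearchCV` call above does not pass `random_state`, so the sampled candidates, and therefore the best parameters it reports, can change between runs.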

In [11]:
# Bayesian Optimization
y_test = test_df_encoded[target_variable]

def xgb_bayesian(learning_rate, max_depth, n_estimators, subsample, colsample_bytree):
    params = {
        'learning_rate': learning_rate,
        'max_depth': max_depth,
        'n_estimators': n_estimators,
        'subsample': subsample,
        'colsample_bytree': colsample_bytree
    }
    return train_evaluate_xgb(params)

optimizer = BayesianOptimization(
    f=xgb_bayesian,
    pbounds={
        'learning_rate': (0.01, 0.2),
        'max_depth': (2, 4),
        'n_estimators': (100, 300),
        'subsample': (0.7, 0.9),
        'colsample_bytree': (0.7, 0.9)
    },
    random_state=42
)

optimizer.maximize(init_points=5, n_iter=15)
bayesian_best_params = optimizer.max['params']
bayesian_best_accuracy = optimizer.max['target']

|   iter    |  target   | colsam... | learni... | max_depth | n_esti... | subsample |
-------------------------------------------------------------------------------------
| 1         | 0.6385    | 0.7749    | 0.1906    | 3.464     | 219.7     | 0.7312    |
| 2         | 0.6385    | 0.7312    | 0.02104   | 3.732     | 220.2     | 0.8416    |
| 3         | 0.6332    | 0.7041    | 0.1943    | 3.665     | 142.5     | 0.7364    |
| 4         | 0.6464    | 0.7367    | 0.06781   | 3.05      | 186.4     | 0.7582    |
| 5         | 0.6372    | 0.8224    | 0.0365    | 2.584     | 173.3     | 0.7912    |
| 6         | 0.6425    | 0.8789    | 0.01221   | 3.932     | 194.7
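
The optimizer proposes continuous values within the bounds, and `train_evaluate_xgb` casts `max_depth` and `n_estimators` with `int()`, which truncates toward zero. With bounds of (2, 4), a depth of 4 is therefore used only if the optimizer proposes exactly 4.0. The sketch below compares truncation with rounding on the `max_depth` values from the log above; using `round()` is a suggested alternative, not what the notebook does.

```python
# max_depth values proposed in iterations 1 to 6 of the log above.
proposed_depths = [3.464, 3.732, 3.665, 3.05, 2.584, 3.932]

truncated = [int(d) for d in proposed_depths]   # what the notebook does
rounded = [round(d) for d in proposed_depths]   # alternative: nearest integer

print(truncated)
print(rounded)
```

Several distinct proposals collapse to the same integer depth either way, so the optimizer may spend iterations on points that train effectively the same tree depth.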

## Evaluating Models

After hyperparameter tuning, we evaluate the best models from each method on the test data.

In [12]:
def train_evaluate(params, X_train, y_train, X_test, y_test, method):
    for param_name in ['max_depth', 'n_estimators']:
        if param_name in params:
            params[param_name] = int(params[param_name])
    
    # Training the model and recording training time
    xgb_clf = xgb.XGBClassifier(random_state=42, **params)
    start_train_time = time.time()
    xgb_clf.fit(X_train, y_train)
    end_train_time = time.time()  
    
    # Testing the model and recording prediction time
    start_test_time = time.time()  
    y_pred = xgb_clf.predict(X_test)
    y_prob = xgb_clf.predict_proba(X_test)[:, 1]
    end_test_time = time.time()  
    
    # Calculating the metrics
    test_accuracy = accuracy_score(y_test, y_pred)
    test_precision = precision_score(y_test, y_pred)
    test_recall = recall_score(y_test, y_pred)
    test_f1 = f1_score(y_test, y_pred)
    train_time = end_train_time - start_train_time
    test_time = end_test_time - start_test_time
    
    # Creating the results dataframe
    results_df = pd.DataFrame({
        'Method': [method],
        'Accuracy': [test_accuracy],
        'Precision': [test_precision],
        'Recall': [test_recall],
        'F1-Score': [test_f1],
        'Training Time (s)': [train_time],  
        'Prediction Time (s)': [test_time]     
    })
    return results_df
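
The timings recorded by `train_evaluate` come from a single `time.time()` measurement each, which can be noisy for sub-second operations. A minimal sketch of a more robust measurement, assuming a hypothetical helper `time_call` that uses `time.perf_counter` and keeps the best of several repeats:

```python
import time

def time_call(func, *args, repeats=5, **kwargs):
    """Return (result, best elapsed seconds) over several repeated calls."""
    best = float("inf")
    result = None
    for _ in range(repeats):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return result, best

# Stand-in workload; in the notebook this could be xgb_clf.predict(X_test).
total, elapsed = time_call(sum, range(100_000))
print(total, f"{elapsed:.6f}s")
```

Repeating a call is cheap for prediction, but repeating `fit` retrains the model each time, so a smaller `repeats` value may be preferable there.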

In [13]:
# Prepare the test data
X_test_selected = test_df_encoded.drop([target_variable], axis=1)[selected_features]
y_test = test_df_encoded[target_variable]

# Convert whole-number floats (e.g. 3.0) in the Bayesian best parameters to int;
# train_evaluate casts max_depth and n_estimators to int in any case
bayesian_best_params = {key: int(value) if isinstance(value, float) and value.is_integer() else value for key, value in bayesian_best_params.items()}

# Evaluate the model using best parameters from each method
results_grid = train_evaluate(grid_best_params, X_train_selected, y_train, X_test_selected, y_test, 'Grid Search')
results_random = train_evaluate(random_best_params, X_train_selected, y_train, X_test_selected, y_test, 'Random Search')
results_bayesian = train_evaluate(bayesian_best_params, X_train_selected, y_train, X_test_selected, y_test, 'Bayesian Optimization')

## Results Comparison

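We compare the models trained with the best hyperparameters found by Grid Search, Randomized Search, and Bayesian Optimization. The printed "Best Accuracy" values are not directly comparable: for Grid Search and Randomized Search they are mean 5-fold cross-validation accuracies on the training set, whereas for Bayesian Optimization the value is the accuracy on the test set, which the optimizer used as its objective. The table of test-set metrics is the like-for-like comparison.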

In [14]:
# Concatenate the results
final_results = pd.concat([results_grid, results_random, results_bayesian], axis=0).reset_index(drop=True)

#Best Parameters
print(f"Grid Search - Best Params: {grid_best_params}, Best Accuracy: {grid_best_accuracy}")
print(f"Random Search - Best Params: {random_best_params}, Best Accuracy: {random_best_accuracy}")
print(f"Bayesian Optimization - Best Params: {bayesian_best_params}, Best Accuracy: {bayesian_best_accuracy}")

# Display the final results
print(final_results)

Grid Search - Best Params: {'colsample_bytree': 0.7, 'learning_rate': 0.1, 'max_depth': 3, 'n_estimators': 200, 'subsample': 0.8}, Best Accuracy: 0.7169439162987551
Random Search - Best Params: {'subsample': 0.8, 'n_estimators': 300, 'max_depth': 2, 'learning_rate': 0.1, 'colsample_bytree': 0.9}, Best Accuracy: 0.7140121169153427
Bayesian Optimization - Best Params: {'colsample_bytree': 0.8716002139300233, 'learning_rate': 0.18670599405725474, 'max_depth': 2, 'n_estimators': 251, 'subsample': 0.8533695880915422}, Best Accuracy: 0.6569920844327177
                  Method  Accuracy  Precision    Recall  F1-Score  \
0            Grid Search  0.663588   0.686230  0.723810  0.704519   
1          Random Search  0.645119   0.669663  0.709524  0.689017   
2  Bayesian Optimization  0.656992   0.678571  0.723810  0.700461   

   Training Time (s)  Prediction Time (s)  
0           0.180879             0.006434  
1           0.218117             0.010000  
2           0.180362             0.010
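
As a quick consistency check on the table, F1 is the harmonic mean of precision and recall, so it can be recomputed from the other two columns. The sketch below uses the Grid Search row as printed above; `f1_from` is an illustrative helper.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Precision and recall from the Grid Search row of the results table.
grid_f1 = f1_from(0.686230, 0.723810)
print(round(grid_f1, 4))
```

The recomputed value agrees with the reported F1-Score of 0.704519 up to the rounding of the printed inputs.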