# Hyperparameter Tuning in XGBoost with Grid Search

The easiest way to tune XGBoost hyperparameters is through its scikit-learn API (XGBClassifier). Because the estimator follows scikit-learn conventions, we can use scikit-learn's tuning utilities, such as GridSearchCV.

In [71]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score
)

from xgboost import XGBClassifier

In [72]:
hr_data = pd.read_csv('hr_analytics.csv')

hr_data.head().T

| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| employee_id | 65438 | 65141 | 7513 | 2542 | 48945 |
| department | Sales & Marketing | Operations | Sales & Marketing | Sales & Marketing | Technology |
| region | region_7 | region_22 | region_19 | region_23 | region_26 |
| education | Master's & above | Bachelor's | Bachelor's | Bachelor's | Bachelor's |
| gender | f | m | m | m | m |
| recruitment_channel | sourcing | other | sourcing | other | other |
| no_of_trainings | 1 | 1 | 1 | 2 | 1 |
| age | 35 | 30 | 34 | 39 | 45 |
| previous_year_rating | 5.0 | 5.0 | 3.0 | 1.0 | 3.0 |
| length_of_service | 8 | 4 | 7 | 10 | 2 |


In [73]:
hr_data.drop("employee_id", axis=1, inplace=True)

cat_features = hr_data.select_dtypes(include='object').columns.tolist()

for col in cat_features:
    hr_data[col] = hr_data[col].astype('category')

hr_data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 54735 entries, 0 to 54734
Data columns (total 12 columns):
 #   Column                Non-Null Count  Dtype   
---  ------                --------------  -----   
 0   department            54735 non-null  category
 1   region                54735 non-null  category
 2   education             52329 non-null  category
 3   gender                54735 non-null  category
 4   recruitment_channel   54735 non-null  category
 5   no_of_trainings       54735 non-null  int64   
 6   age                   54735 non-null  int64   
 7   previous_year_rating  50616 non-null  float64 
 8   length_of_service     54735 non-null  int64   
 9   awards_won?           54735 non-null  int64   
 10  avg_training_score    54735 non-null  int64   
 11  is_promoted           54735 non-null  int64   
dtypes: category(5), float64(1), int64(6)
memory usage: 3.2 MB


In [74]:
X = hr_data.drop('is_promoted', axis=1)
y = hr_data['is_promoted']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42, stratify=y)

X_train.shape, X_test.shape

((38314, 11), (16421, 11))

## Baseline model with default parameters

In [75]:
xgb_clf = XGBClassifier(
    enable_categorical=True, 
    objective='binary:logistic', 
    tree_method='hist',
    eval_metric='auc',
    random_state=42
)


xgb_clf.fit(X_train, y_train)

print("XGBoost model trained successfully.")

XGBoost model trained successfully.


In [76]:
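def compute_metrics(model):
    """Print accuracy, precision, recall and F1 for `model` on the held-out test set.

    Relies on the notebook-level X_test and y_test created by train_test_split.
    """
    # Predicted class labels (0 or 1) for the test set
    y_pred = model.predict(X_test)

    # All four metrics treat is_promoted == 1 as the positive class
    accuracy = accuracy_score(y_test, y_pred)
    precision = precision_score(y_test, y_pred)
    recall = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)

    print("Evaluation Metrics:")
    print("Accuracy:", round(accuracy, 4))
    print("Precision:", round(precision, 4))
    print("Recall:", round(recall, 4))
    print("F1 Score:", round(f1, 4))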

Now evaluate the baseline model on the test set. As the output below shows, recall is very low with the default parameters, so the next step is to tune the hyperparameters for better recall.

In [78]:
compute_metrics(xgb_clf)

Evaluation Metrics:
Accuracy: 0.9406
Precision: 0.8813
Recall: 0.3503
F1 Score: 0.5013


Let's optimize recall using hyperparameter tuning. We are willing to accept more false positives in exchange for identifying as many of the actual positives (promoted employees) as possible.
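
To make the trade-off concrete, here is a minimal sketch of how precision and recall follow from confusion-matrix counts. The helper `precision_recall` and all of the counts are hypothetical illustrations, not values taken from the HR dataset.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp)  # share of predicted positives that are correct
    recall = tp / (tp + fn)     # share of actual positives that were found
    return precision, recall


# Hypothetical counts for a dataset with 100 actual positives.
# A cautious model flags few rows: most flags are right, but many positives are missed.
cautious = precision_recall(tp=35, fp=5, fn=65)

# An eager model flags many rows: it finds more positives at the cost of false alarms.
eager = precision_recall(tp=63, fp=160, fn=37)

print("cautious: precision=%.3f recall=%.3f" % cautious)
print("eager:    precision=%.3f recall=%.3f" % eager)
```

Optimizing for recall pushes the model toward the second behaviour: fewer missed positives, more false positives.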

In [79]:
param_grid = {
    'learning_rate': [0.3, 0.01],
    'max_depth': [2, 5, 10],
    'gamma': [0, 0.1, 1],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 2],
    'scale_pos_weight': [2, 8, 10],
    'n_estimators': [200, 500, 800]
}

xgb_clf = XGBClassifier(
    enable_categorical=True, 
    objective='binary:logistic', 
    tree_method='hist',
    eval_metric='auc',
    random_state=42
)

grid = GridSearchCV(
    estimator=xgb_clf,
    param_grid=param_grid,
    cv=5,
    scoring='recall',
    n_jobs=-1,
    return_train_score=True
)

grid.fit(X_train, y_train)
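
Grid search is exhaustive, so its cost grows multiplicatively with every value added to the grid. The following sketch, in plain Python, estimates how many models the search above has to train. The grid is copied from the cell above, and `n_folds` is an illustrative name that mirrors `cv=5`.

```python
import math

# Same grid as in the search above
param_grid = {
    'learning_rate': [0.3, 0.01],
    'max_depth': [2, 5, 10],
    'gamma': [0, 0.1, 1],
    'reg_alpha': [0, 0.1],
    'reg_lambda': [1, 2],
    'scale_pos_weight': [2, 8, 10],
    'n_estimators': [200, 500, 800],
}
n_folds = 5  # mirrors cv=5

# Every combination of values is one candidate, each fitted once per fold
n_candidates = math.prod(len(values) for values in param_grid.values())
n_fits = n_candidates * n_folds

print("Candidates:", n_candidates)  # Candidates: 648
print("CV fits:", n_fits)           # CV fits: 3240
```

With the default `refit=True`, GridSearchCV trains one more model on the full training set using the best combination, so expect this cell to take a while even with `n_jobs=-1`.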

In [80]:
print("Best parameters:", grid.best_params_)

Best parameters: {'gamma': 1, 'learning_rate': 0.3, 'max_depth': 2, 'n_estimators': 500, 'reg_alpha': 0.1, 'reg_lambda': 2, 'scale_pos_weight': 10}


In [81]:
compute_metrics(grid.best_estimator_)

Evaluation Metrics:
Accuracy: 0.8294
Precision: 0.2784
Recall: 0.6297
F1 Score: 0.3861
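
As a sanity check, the two reports can be compared through the F1 definition, the harmonic mean of precision and recall. This sketch plugs in the rounded precision and recall values printed above, so agreement with the reported F1 is only approximate; the helper name `f1_from` is illustrative.

```python
def f1_from(precision, recall):
    """F1 score as the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values copied from the two evaluation outputs above
baseline_f1 = f1_from(precision=0.8813, recall=0.3503)
tuned_f1 = f1_from(precision=0.2784, recall=0.6297)

print("Baseline F1:", round(baseline_f1, 4))
print("Tuned F1:", round(tuned_f1, 4))
```

Tuning for recall raised it from 0.3503 to 0.6297, but precision fell from 0.8813 to 0.2784, which is why F1 and accuracy are both lower for the tuned model. This is the trade-off accepted when false positives are tolerable.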
