<a href="https://colab.research.google.com/github/ketanp23/scsd-ddm-class/blob/main/XGBoost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

Learning Rate (η): Scales how much each tree contributes to the final prediction. Smaller values often produce more accurate models, but they require more trees.

Max Depth: Caps the depth of every tree. It is the main control on model complexity, and shallower trees are less prone to overfitting.

Gamma: The minimum loss reduction required to split a node. A higher gamma makes the algorithm more conservative, because splits that do not reduce the loss by at least that amount are rejected. This helps keep tree complexity in check.

Subsample: The fraction of training rows sampled at random to grow each tree, which lowers variance and improves generalization. Setting it too low can cause underfitting.

Colsample Bytree: Establishes the percentage of features that will be sampled at random for growing each tree.

n_estimators: Specifies the number of boosting rounds.

alpha (L1 regularization term) and lambda (L2 regularization term) : Control the strength of L1 and L2 regularization respectively. A higher value results in stronger regularization.

min_child_weight: The minimum sum of instance weights (Hessian) required in a child node. A split that would leave a child below this value is rejected, so larger values produce more conservative trees.

scale_pos_weight: Useful in imbalanced class scenarios to control the balance of positive and negative weights.
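
To make the roles of gamma, lambda and min_child_weight concrete, here is a minimal sketch of the regularized split gain described in the XGBoost documentation. It is a simplification, not the library's implementation: it ignores alpha (L1), the function names are hypothetical, and `G` and `H` stand for the sums of gradients and Hessians of the rows falling into each child.

```python
def leaf_score(G, H, lam):
    """Structure score of a leaf: G^2 / (H + lambda)."""
    return G * G / (H + lam)


def split_gain(G_left, H_left, G_right, H_right, lam=1.0, gamma=0.0):
    """Loss reduction from splitting one node into two children, minus gamma."""
    parent = leaf_score(G_left + G_right, H_left + H_right, lam)
    children = leaf_score(G_left, H_left, lam) + leaf_score(G_right, H_right, lam)
    return 0.5 * (children - parent) - gamma


def should_split(G_left, H_left, G_right, H_right,
                 lam=1.0, gamma=0.0, min_child_weight=1.0):
    """Accept a split only if both children are heavy enough and the gain is positive."""
    if H_left < min_child_weight or H_right < min_child_weight:
        return False
    return split_gain(G_left, H_left, G_right, H_right, lam, gamma) > 0


# Hypothetical gradient/Hessian sums for one candidate split.
print(split_gain(-4.0, 3.0, 5.0, 4.0))                           # 4.4375
print(should_split(-4.0, 3.0, 5.0, 4.0, gamma=5.0))              # False
print(should_split(-4.0, 3.0, 5.0, 4.0, min_child_weight=3.5))   # False
```

Raising gamma or min_child_weight turns an otherwise acceptable split into a rejected one, which is why both act as brakes on tree growth.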

In [1]:
from sklearn.metrics import accuracy_score
import xgboost as xgb
from sklearn.model_selection import train_test_split
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

In [5]:
dataset = pd.read_csv('Churn_Modelling_xgboost.csv')
X = dataset.iloc[:, 3:13].copy()  # copy so later column assignments do not touch a view of dataset
y = dataset.iloc[:, 13].values
print(X)
print(y)

      CreditScore Geography  Gender  Age  Tenure    Balance  NumOfProducts  \
0             619    France  Female   42       2       0.00              1   
1             608     Spain  Female   41       1   83807.86              1   
2             502    France  Female   42       8  159660.80              3   
3             699    France  Female   39       1       0.00              2   
4             850     Spain  Female   43       2  125510.82              1   
...           ...       ...     ...  ...     ...        ...            ...   
9995          771    France    Male   39       5       0.00              2   
9996          516    France    Male   35      10   57369.61              1   
9997          709    France  Female   36       7       0.00              1   
9998          772   Germany    Male   42       3   75075.31              2   
9999          792    France  Female   28       4  130142.79              1   

      HasCrCard  IsActiveMember  EstimatedSalary  
0           

XGBoost can handle categorical features internally, so no manual encoding is needed.

The code below converts the Geography and Gender columns to the pandas categorical data type.

The categorical type represents each category internally as an integer while retaining the semantic meaning of the categories.


In [6]:
X['Geography'] = X['Geography'].astype('category')
X['Gender'] = X['Gender'].astype('category')
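
To see what the categorical type stores, the sketch below converts a small hypothetical series of country names (not taken from the dataset) and prints the category labels next to their integer codes.

```python
import pandas as pd

# Hypothetical toy values, used only to illustrate the category dtype.
geo = pd.Series(["France", "Spain", "France", "Germany"]).astype("category")

categories = geo.cat.categories.tolist()  # unique labels, sorted
codes = geo.cat.codes.tolist()            # integer code for each row

print(categories)  # ['France', 'Germany', 'Spain']
print(codes)       # [0, 2, 0, 1]
```

The labels stay available through `.cat.categories`, while each row is stored as a small integer.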

We split the dataset into a training set, used to fit the model, and a test set, used to evaluate it.

test_size=0.25: Holds out 25% of the data for testing and uses the remaining 75% for training.
random_state=0: Ensures reproducibility
X_train, X_test: Feature sets
y_train, y_test: Target labels

In [7]:
X_train, X_test, y_train, y_test = train_test_split(X,y, test_size=0.25, random_state=0)
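
As a quick check of what `test_size` and `random_state` do, the sketch below splits a small synthetic array (an assumption for illustration, not the churn data) twice with the same seed and compares the results.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 20 rows, 2 features, alternating labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.arange(20) % 2

a_train, a_test, ya_train, ya_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)
b_train, b_test, yb_train, yb_test = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

print(a_train.shape, a_test.shape)     # (15, 2) (5, 2)
print(np.array_equal(a_test, b_test))  # True: same seed, same split
```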

Next, we convert the data into a DMatrix, XGBoost's own data structure, which is optimized for training speed and memory use.

The xgb.DMatrix constructor takes both the features and the labels.

Set enable_categorical=True so that pandas categorical columns are handled automatically.

In [8]:
xgb_train = xgb.DMatrix(X_train, y_train, enable_categorical=True)
xgb_test = xgb.DMatrix(X_test, y_test, enable_categorical=True)

We set the model's hyperparameters (a binary logistic objective, a maximum tree depth and a learning rate) and then train the model on the `xgb_train` dataset for 50 boosting rounds.


objective: 'binary:logistic' for binary classification
max_depth: 3 limits tree depth

learning_rate: 0.1 controls step size

xgb.train(...) trains the XGBoost model using specified params and training data

These hyperparameters define the model's structure and training behavior, and therefore affect its accuracy and generalization on the dataset. Tuning them is usually necessary to get good performance on a different dataset.

In [9]:
params = {
    'objective': 'binary:logistic',
    'max_depth': 3,
    'learning_rate': 0.1,
}
n = 50
model = xgb.train(params=params, dtrain=xgb_train, num_boost_round=n)
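
The interaction between the learning rate and the number of boosting rounds can be illustrated without XGBoost. The sketch below is a deliberately simplified gradient-boosting loop for squared error, using depth-1 trees (stumps) on made-up one-dimensional data. It is not how XGBoost is implemented, and the function names are hypothetical.

```python
def fit_stump(xs, residuals):
    """Find the threshold split whose two leaf means best fit the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # skip the last value so the right side is never empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv = sum(left) / len(left)
        rv = sum(right) / len(right)
        err = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1], best[2], best[3]


def boosted_mse(xs, ys, learning_rate, n_rounds):
    """Boost stumps on the residuals and return the final training MSE."""
    preds = [0.0] * len(xs)
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        t, lv, rv = fit_stump(xs, residuals)
        # Each stump's output is shrunk by the learning rate before being added.
        preds = [p + learning_rate * (lv if x <= t else rv)
                 for x, p in zip(xs, preds)]
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)


# Made-up toy data: y = x^2 on ten points.
xs = list(range(10))
ys = [x * x for x in xs]

for lr in (0.1, 0.5):
    print(lr, boosted_mse(xs, ys, lr, 10), boosted_mse(xs, ys, lr, 50))
```

On this toy data the smaller learning rate leaves a larger training error after 10 rounds and needs more rounds to catch up, which is the trade-off behind choosing `learning_rate` and `num_boost_round` together.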

We predict probabilities for the test set and round them to 0/1 labels, so that they can be compared directly with the true labels to compute accuracy.

In [10]:
preds = model.predict(xgb_test)
preds = np.round(preds)
accuracy= accuracy_score(y_test,preds)
print('Accuracy of the model is:', accuracy*100)

Accuracy of the model is: 86.83999999999999
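
Rounding the probabilities amounts to using a cutoff of 0.5, with one subtlety: `np.round` rounds exact halves to the nearest even number, so a probability of exactly 0.5 becomes 0. The sketch below, on hypothetical probabilities, compares it with an explicit threshold, which also makes the cutoff easy to change.

```python
import numpy as np

# Hypothetical predicted probabilities.
probs = np.array([0.2, 0.5, 0.51, 0.9])

rounded = np.round(probs).astype(int)         # round-half-to-even
thresholded = (probs >= 0.5).astype(int)      # explicit cutoff at 0.5

print(rounded.tolist())      # [0, 0, 1, 1]
print(thresholded.tolist())  # [0, 1, 1, 1]
```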


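The model reaches an accuracy of about 86.8% on the held-out test set, which suggests it is learning useful patterns from this real-world dataset. Accuracy alone can be misleading when the classes are imbalanced, so it is worth comparing this figure with the share of the majority class.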
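
One way to put an accuracy figure in context is to compare it with what a model that always predicts the majority class would score. The sketch below does this on a hypothetical label list (not the churn labels) and also computes the negative-to-positive ratio, which the XGBoost documentation suggests as a typical starting value for scale_pos_weight.

```python
from collections import Counter

# Hypothetical labels: 8 negatives and 2 positives.
y_demo = [0] * 8 + [1] * 2

counts = Counter(y_demo)
majority_baseline = max(counts.values()) / len(y_demo)  # accuracy of always guessing the majority
scale_pos_weight = counts[0] / counts[1]                # negatives divided by positives

print(majority_baseline)  # 0.8
print(scale_pos_weight)   # 4.0
```

Applied to the real data, the same two numbers can be computed from `y_train` before deciding whether the reported accuracy is a meaningful improvement.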

