## AdaBoost, Gradient Boosting, and XGBoost Classifiers

We will work with a fake advertising data set that indicates whether or not a particular internet user clicked on an advertisement on a company website. We will build a model that predicts whether a user will click on an ad based on the features of that user.

This data set contains the following features:

* 'Daily Time Spent on Site': consumer time on site in minutes
* 'Age': customer age in years
* 'Area Income': Avg. Income of geographical area of consumer
* 'Daily Internet Usage': Avg. minutes a day consumer is on the internet
* 'Ad Topic Line': Headline of the advertisement
* 'City': City of consumer
* 'Male': Whether or not consumer was male
* 'Country': Country of consumer
* 'Timestamp': Time at which consumer clicked on Ad or closed window
* 'Clicked on Ad': 0 or 1, where 1 indicates that the consumer clicked on the Ad

## Import Libraries

In [None]:
import xgboost
import sklearn

print("scikit-learn version:", sklearn.__version__)
print("xgboost version:", xgboost.__version__)

# Versions used here: scikit-learn 1.4.0 and XGBoost 2.1.3.
# As of 12.13.2025, scikit-learn 1.6.0 is the latest release, but it is not compatible with this XGBoost version;
# XGBoost 2.1.3 is the latest XGBoost release as of the same date.
# Use these two versions together, otherwise you will get an error.

In [None]:
sklearn.__version__

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.rcParams["figure.figsize"] = (7,4)
import warnings
warnings.filterwarnings("ignore")
warnings.warn("this will not show")

In [None]:
df = pd.read_csv('advertising2.csv')
df.head()

## Exploratory Data Analysis and Visualization

In [None]:
df.info()

In [None]:
df.describe()

In [None]:
sns.pairplot(df, hue='Clicked on Ad')

## Train | Test Split

In [None]:
for feature in df.select_dtypes("object").columns:
    print(feature, df[feature].nunique())

# We count the unique values of each categorical feature.
# We will drop the features that contain many unique categories.
# Tree-based models can assign too much weight to features with many unique categories.
# Also, the model may not learn anything from categories that have only a few observations.
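
The cardinality check above can also be turned into an explicit rule. The following is a minimal sketch on a hypothetical toy frame (`toy`), not the advertising data; the 0.5 cut-off for the ratio of unique values to rows is an arbitrary assumption chosen only for illustration.

```python
import pandas as pd

# Hypothetical toy data: "City" is unique per row, "Country" repeats.
toy = pd.DataFrame({
    "City": ["A", "B", "C", "D"],
    "Country": ["X", "X", "Y", "Y"],
    "Age": [30, 41, 25, 37],
})

# Number of unique values in every non-numeric column.
cardinality = toy.select_dtypes(exclude="number").nunique()

# Flag columns where more than half of the rows carry a distinct category
# (the 0.5 threshold is an assumption, tune it for real data).
high_card = cardinality[cardinality / len(toy) > 0.5].index.tolist()

print(cardinality)
print("High-cardinality columns:", high_card)
```

A column flagged this way behaves almost like a row identifier, which is why it is a candidate for dropping before fitting a tree-based model.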

In [None]:
from sklearn.model_selection import train_test_split

In [None]:
cat = df.select_dtypes("object").columns
cat
# We first identify the categorical features. We have already decided above to drop them.

In [None]:
list(cat)

In [None]:
cat2 = list(cat) + ['Clicked on Ad']
cat2

# We add the target to the list of features to drop, then drop them all from X (the features).

In [None]:
X = df.drop(columns=cat2)
y = df['Clicked on Ad']

In [None]:
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=42)

In [None]:
X_train.head(1)

## Modelling and Model Performance

In [None]:
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
# AdaBoost uses DecisionTreeClassifier as its default base estimator in the background,
# so we import it to show how to tune its hyperparameters.
from sklearn.metrics import confusion_matrix, classification_report,\
                            accuracy_score, recall_score, precision_score,\
                            f1_score, roc_auc_score
from sklearn.model_selection import cross_validate

In [None]:
def eval_metric(model, X_train, y_train, X_test, y_test):
    y_train_pred = model.predict(X_train)
    y_pred = model.predict(X_test)

    print("Test_Set")
    print(confusion_matrix(y_test, y_pred))
    print(classification_report(y_test, y_pred))
    print()
    print("Train_Set")
    print(confusion_matrix(y_train, y_train_pred))
    print(classification_report(y_train, y_train_pred))

In [None]:
# Default weak learner: estimator=DecisionTreeClassifier(max_depth=1)
# (the argument was called base_estimator in older scikit-learn releases)
ada_model = AdaBoostClassifier(n_estimators=50, random_state=42)

In [None]:
ada_model.fit(X_train,y_train)

In [None]:
eval_metric(ada_model, X_train, y_train, X_test, y_test)

# No overfitting. We will confirm this with CV.

In [None]:
model = AdaBoostClassifier(n_estimators=50, random_state=42)

scores = cross_validate(model,
                        X_train,
                        y_train,
                        scoring=["accuracy",
                                 "precision",
                                 "recall",
                                 "f1"],
                       cv = 10,
                       return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean().iloc[2:]  # skip fit_time and score_time

# Train and validation scores are close, so there is no overfitting.

## Tree Visualization

In [None]:
from sklearn.tree import plot_tree

In [None]:
model = AdaBoostClassifier(n_estimators=3, random_state=42)
model.fit(X_train,y_train)

# We visualize only 3 trees to walk through the working logic.

In [None]:
# targets = df["Clicked on Ad"].astype("str")

In [None]:
# Class labels must be passed as strings, otherwise plot_tree raises an error.
# model.classes_ is sorted, so the names line up with the classes inside each tree.
class_names = [str(c) for c in model.classes_]
plt.figure(figsize=(15, 6), dpi=100)
for i in range(3):
    plt.subplot(1, 3, i + 1)
    plot_tree(model.estimators_[i],
              filled=True,
              feature_names=list(X.columns),
              class_names=class_names,
              fontsize=10);

# Every tree in AdaBoostClassifier is trained on as many observations as the train set contains,
# but the weight of each observation changes from tree to tree.
# After each tree, the misclassified observations are passed on to the next tree with a larger weight,
# so the next tree concentrates on them.
# Conceptually, this weighting is like repeating the misclassified observations in the next tree's sample,
# while some of the correctly classified observations drop out of it.
# The learning rate controls how strongly the weights are updated from one tree to the next.

# Now let's predict the following observation with the 3 trees separately to understand the working logic.

# Daily Time Spent on Site 68.95
# Age 35.00
# Area Income 61833.90
# Daily Internet Usage 156.09
# Male 0.00

# Tree 1 predicts class 1. Weight of the tree: 1.1174
# Tree 2 predicts class 0. Weight of the tree: 0.8418
# Tree 3 predicts class 0. Weight of the tree: 0.4349

# Total weight of the trees predicting class 0 (trees 2 and 3) = 0.8418 + 0.4349 = 1.2767
# Total weight of the trees predicting class 1 (tree 1) = 1.1174

# Since the total weight of the trees predicting class 0 is greater than the total weight of the trees
# predicting class 1, the model assigns the observation above to class 0.
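
The weighted vote above can be reproduced in a few lines. This is a minimal sketch that assumes the discrete (SAMME-style) voting rule; the predictions and tree weights are copied from the notes above rather than recomputed, and the variable names are made up for illustration. On a fitted model, the corresponding weights are stored in `estimator_weights_`.

```python
# Predictions and weights of the three trees, copied from the notes above.
tree_predictions = [1, 0, 0]
tree_weights = [1.1174, 0.8418, 0.4349]

# Add up the weights of the trees that vote for each class.
class_scores = {0: 0.0, 1: 0.0}
for pred, weight in zip(tree_predictions, tree_weights):
    class_scores[pred] += weight

# The class with the largest total weight wins the vote.
predicted_class = max(class_scores, key=class_scores.get)

print(class_scores)
print("Predicted class:", predicted_class)
```

Two weaker trees outvote one stronger tree here because their combined weight (about 1.2767) exceeds 1.1174.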

## Gridsearch

In [None]:
from sklearn.model_selection import GridSearchCV

In [None]:
model = AdaBoostClassifier(random_state=42)

In [None]:
param_grid = {"n_estimators": [20, 30, 100, 200],
              "learning_rate": [0.01, 0.1, 0.2, 0.5, 1.0]}

# There is always a trade off between the number of trees and the learning rate.
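
One way to look at this trade-off is to track the test accuracy after every added tree for a small and a large learning rate. The sketch below is only an illustration under assumptions: it uses synthetic data from `make_classification` instead of the advertising data, and the names `X_demo`, `y_demo`, `demo_model` and `staged` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumption: 5 numeric features, binary target).
X_demo, y_demo = make_classification(n_samples=500, n_features=5,
                                     n_informative=3, n_redundant=0,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo,
                                          test_size=0.2, random_state=42)

staged = {}
for lr in (0.1, 1.0):
    demo_model = AdaBoostClassifier(n_estimators=100, learning_rate=lr,
                                    random_state=42)
    demo_model.fit(X_tr, y_tr)
    # staged_score yields the test accuracy after each boosting iteration.
    staged[lr] = list(demo_model.staged_score(X_te, y_te))

for lr, scores in staged.items():
    print(f"learning_rate={lr}: first tree {scores[0]:.3f}, "
          f"all {len(scores)} trees {scores[-1]:.3f}")
```

A small learning rate usually needs more trees to reach the same score, which is why the grid search varies both hyperparameters together.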

In [None]:
ada_grid_model = GridSearchCV(model,
                              param_grid,
                              cv=5,
                              scoring='f1',
                              return_train_score=True)

In [None]:
ada_grid_model.fit(X_train, y_train)

In [None]:
ada_grid_model.best_estimator_

In [None]:
pd.DataFrame(ada_grid_model.cv_results_).loc[ada_grid_model.best_index_, ["mean_test_score", "mean_train_score"]]

In [None]:
y_pred = ada_grid_model.predict(X_test)
y_pred_proba = ada_grid_model.predict_proba(X_test)

ada_f1 = f1_score(y_test, y_pred)
ada_recall = recall_score(y_test, y_pred)
ada_auc = roc_auc_score(y_test, y_pred_proba[:,1])
eval_metric(ada_grid_model, X_train, y_train, X_test, y_test)

## Feature importances

In [None]:
model = AdaBoostClassifier(n_estimators=100,
                           learning_rate=0.1,
                           random_state=42)
model.fit(X_train, y_train)
model.feature_importances_

feats = pd.DataFrame(index=X.columns,
                     data=model.feature_importances_,
                     columns=['ada_importance'])
ada_imp_feats = feats.sort_values("ada_importance", ascending = False)
ada_imp_feats

In [None]:
#plt.figure(figsize=(12,6))
ax = sns.barplot(data=ada_imp_feats,
                 x=ada_imp_feats.index,
                 y='ada_importance')
ax.bar_label(ax.containers[0],fmt="%.3f");
plt.xticks(rotation=90);

## Evaluating ROC Curves and AUC

In [None]:
from sklearn.metrics import roc_auc_score,\
                            RocCurveDisplay, PrecisionRecallDisplay

In [None]:
RocCurveDisplay.from_estimator(ada_grid_model, X_test, y_test);

## Gradient Boosting Modelling and Model Performance

In [None]:
# Gradient boosting is a tree-based model that reduces its errors step by step, in the spirit of gradient descent:
# each new tree is fitted to the residuals left by the trees before it.
# Every observation starts from the same fixed probability. If class 1 is the majority class in the data,
# this probability is greater than 0.5.
# If class 0 is the majority class, this probability is less than 0.5. Subtracting this probability
# from the actual labels (1 and 0) gives the residuals. The model improves its estimates
# by bringing these residuals closer to 0.
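
The starting probability and the first residuals can be worked out by hand. The sketch below uses a hypothetical toy target (`y_toy`) with 6 clicks and 4 non-clicks, and assumes the log loss setting described above, where the residual is the label minus the predicted probability.

```python
import numpy as np

# Hypothetical toy target: 6 observations of class 1 and 4 of class 0.
y_toy = np.array([1, 1, 1, 1, 1, 1, 0, 0, 0, 0])

# Starting probability: the share of class 1, identical for every observation.
p0 = y_toy.mean()

# The same starting point expressed as log-odds, the scale the trees add to.
log_odds0 = np.log(p0 / (1 - p0))

# Residuals: actual label minus the current predicted probability.
residuals = y_toy - p0

print("Starting probability:", p0)
print("Starting log-odds:", round(log_odds0, 4))
print("Residuals:", residuals)
```

Because class 1 is the majority here, the starting probability is above 0.5, so class 1 rows begin with a positive residual and class 0 rows with a negative one.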

In [None]:
from sklearn.ensemble import GradientBoostingClassifier

In [None]:
grad_model = GradientBoostingClassifier(random_state=42)

# Unlike random forest:
# 1. max_depth is 3 by default so that each tree stays a weak learner.

# 2. The learning_rate hyperparameter sets how much each tree contributes to the prediction.
# The default is 0.1.

# 3. Because a gradient descent-based approach is used in the background, classification has a loss hyperparameter.
# The default is log_loss, so gradient boosting tries to minimize the residuals through the log loss function.

# 4. friedman_mse, which is calculated in a similar way to MSE, is the default split criterion (criterion).
# The trees work like regression trees, because the model tries to minimize residuals in the background.

# 5. subsample sets the fraction of observations used in each tree. With subsample=0.8,
# each tree uses a random 80% of the train data. The random selection is repeated for each tree.
# Lowering subsample can help reduce overfitting.

# 6. Most other hyperparameters are the same as, or similar to, those of random forest.

# The trade-off between the number of trees and the learning rate should be tuned carefully; otherwise the model can overfit.
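
The row sampling behind `subsample` can be pictured as follows. This is only a conceptual sketch, not scikit-learn's internal code: it assumes a hypothetical train set of 900 rows and draws a fresh 80% of the row indices, without replacement, for each of three trees.

```python
import numpy as np

rng = np.random.default_rng(42)

n_samples = 900   # hypothetical number of training rows
subsample = 0.8   # fraction of rows given to each tree
n_trees = 3

samples_per_tree = []
for tree_idx in range(n_trees):
    # Draw a new random 80% of the row indices for every tree.
    idx = rng.choice(n_samples, size=int(round(subsample * n_samples)),
                     replace=False)
    samples_per_tree.append(idx)
    print(f"Tree {tree_idx + 1}: {len(idx)} rows, first five: {np.sort(idx)[:5]}")
```

Each tree sees a slightly different slice of the data, which is the source of the extra randomness mentioned in point 5.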

In [None]:
grad_model.fit(X_train, y_train)

In [None]:
eval_metric(grad_model, X_train, y_train, X_test, y_test)

In [None]:
model = GradientBoostingClassifier(random_state=42)

scores = cross_validate(model,
                        X_train,
                        y_train,
                        scoring=['accuracy',
                                 'precision',
                                 'recall',
                                 'f1',
                                 'roc_auc'],
                        cv = 10,
                        return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
df_scores.mean().iloc[2:]  # skip fit_time and score_time

## Gridsearch

In [None]:
param_grid = {"n_estimators":[100, 200, 300],
              "subsample":[0.5, 1],
              "max_features" : [None, 2, 3, 4],
              "learning_rate": [0.001, 0.01, 0.1],
              'max_depth':[3,4,5,6]} #0.8

In [None]:
gb_model = GradientBoostingClassifier(random_state = 42)

In [None]:
grid = GridSearchCV(gb_model,
                    param_grid,
                    scoring = "f1",
                    verbose=2,
                    n_jobs=-1,
                    return_train_score=True)

grid.fit(X_train, y_train)

In [None]:
grid.best_estimator_

In [None]:
pd.DataFrame(grid.cv_results_).loc[grid.best_index_, ["mean_test_score", "mean_train_score"]]

In [None]:
y_pred = grid.predict(X_test)
y_pred_proba = grid.predict_proba(X_test)

gb_f1 = f1_score(y_test, y_pred)
gb_recall = recall_score(y_test, y_pred)
gb_auc = roc_auc_score(y_test, y_pred_proba[:,1])

eval_metric(grid, X_train, y_train, X_test, y_test)

## Feature importances

In [None]:
model = GradientBoostingClassifier(max_features= 3,
                                   n_estimators = 100,
                                   subsample = 0.5,
                                   random_state=42)
model.fit(X_train, y_train)

model.feature_importances_

feats = pd.DataFrame(index=X.columns,
                     data=model.feature_importances_,
                     columns=['grad_importance'])
grad_imp_feats = feats.sort_values("grad_importance", ascending=False)
grad_imp_feats

In [None]:
#plt.figure(figsize=(12,6))
ax = sns.barplot(data=grad_imp_feats,
                 x=grad_imp_feats.index,
                 y='grad_importance')
ax.bar_label(ax.containers[0],fmt="%.3f")
plt.xticks(rotation=90);

## Evaluating ROC Curves and AUC

In [None]:
RocCurveDisplay.from_estimator(grid, X_test, y_test);

## XG Boosting Modelling and Model Performance

In [None]:
from xgboost import XGBClassifier

In [None]:
xgb = XGBClassifier(random_state=42, use_label_encoder=False)
# use_label_encoder=False silences a label-encoding warning in older XGBoost 1.x releases;
# recent releases no longer need it and may report it as an unused parameter.
xgb.fit(X_train, y_train)

# Hyperparameters:
# base_score: the starting prediction for all observations. With a starting probability of 0.5
# (the default in older XGBoost releases; recent releases may estimate it from the data), the model
# tries to pull this value towards 1.0 for class 1 and towards 0.0 for class 0.

# max_depth=6 and learning_rate=0.3 are the defaults. These are among the parameters
# that have the most impact on overfitting.

# subsample=1 by default, so every tree uses all observations. Values around 0.8 are worth trying to reduce overfitting.
# subsample=0.8 means that each tree is trained on a random 80% of the observations in the train set.
# The random selection is repeated for each tree, which increases randomness.

# colsample_bytree=1 by default. It sets the fraction of features used for each tree.
# If the data has 20 features and colsample_bytree=0.5, each tree uses only 10 randomly selected features
# out of the 20. Used to increase randomness.

# colsample_bylevel=1 by default. If colsample_bylevel=0.5, 5 features randomly selected from the
# 10 features available to the tree are used at each depth level. The selection is repeated for every new level
# (colsample_bynode does the same for every single split). Used to increase randomness.

# gamma=0 by default. Used to prevent overfitting. It can take values between 0 and +infinity.
# A split is kept only if it reduces the loss function by more than gamma. With gamma=0, branching continues
# as long as a split still reduces the loss on the train data and stops as soon as it no longer does,
# so gamma acts like an early stop for branching.
# Small increases in gamma can reduce overfitting.

# min_child_weight=1 by default. Used to prevent overfitting. It takes a value between 0 and +infinity.
# A split is allowed only if the total weight of the observations falling on each new leaf
# is at least min_child_weight. If every observation had a weight of 1, branching could continue
# until 1 observation falls on each leaf. In binary classification the weight of an observation
# is p * (1 - p), where p is its current predicted probability, so each observation weighs less than 1.

# scale_pos_weight=1 is the class weighting parameter. It is only used for binary targets.
# If the ratio of positive to negative observations is 1/10, set this parameter to 10 to weight the minority class.

# reg_alpha is the L1 (lasso) penalty and reg_lambda is the L2 (ridge) penalty. Only ridge is active by default.

# XGBoost does not use metrics such as gini, entropy or mse to split the nodes of each tree.
# Instead it uses a similarity score, whose calculation includes the regularization parameter.
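
The similarity score and the role of `reg_lambda` and `gamma` can be illustrated with a small calculation. The sketch below follows the common textbook form for binary classification, similarity = (sum of residuals)^2 / (sum of p * (1 - p) + lambda); it is an assumption-laden illustration rather than XGBoost's internal code, which differs by a constant factor. The helper `similarity` and the toy values are hypothetical.

```python
import numpy as np

def similarity(residuals, prev_proba, reg_lambda=1.0):
    """Similarity score of a node: squared residual sum over (hessian sum + lambda)."""
    residuals = np.asarray(residuals, dtype=float)
    prev_proba = np.asarray(prev_proba, dtype=float)
    hessian_sum = np.sum(prev_proba * (1.0 - prev_proba))
    return residuals.sum() ** 2 / (hessian_sum + reg_lambda)

# Toy node: labels [1, 1, 0, 0], every observation starts at probability 0.5.
proba = np.full(4, 0.5)
residuals = np.array([1, 1, 0, 0]) - proba

# Candidate split: the two class 1 rows go left, the two class 0 rows go right.
root = similarity(residuals, proba)
left = similarity(residuals[:2], proba[:2])
right = similarity(residuals[2:], proba[2:])

# Gain of the split; the split is kept only if the gain exceeds gamma.
gain = left + right - root
gamma = 0.0
keep_split = gain - gamma > 0

print(f"root={root:.4f}, left={left:.4f}, right={right:.4f}, gain={gain:.4f}")
print("Keep split:", keep_split)
```

Raising `reg_lambda` shrinks every similarity score and therefore the gain, while raising `gamma` raises the bar a split has to clear; both make the trees more conservative.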

In [None]:
eval_metric(xgb, X_train, y_train, X_test, y_test)

In [None]:
model = XGBClassifier(random_state=42, use_label_encoder=False)

scores = cross_validate(model,
                        X_train,
                        y_train,
                        scoring=['accuracy',
                                 'precision',
                                 'recall',
                                 'f1',
                                 'roc_auc'],
                        cv = 10,
                        return_train_score=True)
df_scores = pd.DataFrame(scores, index = range(1, 11))
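df_scores.mean().iloc[2:]  # skip fit_time and score_time

# If fit() were called with extra parameters, the same parameters would also have to be passed to cross_validate
# (via fit_params, or params in newer scikit-learn releases).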

## Gridsearch

In [None]:
param_grid = {"n_estimators":[50, 100, 200],
              'max_depth':[3,4,5],
              "learning_rate": [0.1, 0.2],
              "subsample":[0.5, 0.8, 1],
              "colsample_bytree":[0.5,0.7, 1]}

In [None]:
xgb_model = XGBClassifier(random_state=42, use_label_encoder=False)

In [None]:
xgb_grid = GridSearchCV(xgb_model,
                        param_grid,
                        scoring="f1",
                        verbose=2,
                        n_jobs=-1,
                        return_train_score=True)

xgb_grid.fit(X_train, y_train)

In [None]:
xgb_grid.best_params_

In [None]:
xgb_grid.best_estimator_

In [None]:
pd.DataFrame(xgb_grid.cv_results_).loc[xgb_grid.best_index_, ["mean_test_score", "mean_train_score"]]

In [None]:
y_pred = xgb_grid.predict(X_test)
y_pred_proba = xgb_grid.predict_proba(X_test)

xgb_f1 = f1_score(y_test, y_pred)
xgb_recall = recall_score(y_test, y_pred)
xgb_auc = roc_auc_score(y_test, y_pred_proba[:,1])

eval_metric(xgb_grid, X_train, y_train, X_test, y_test)

## Feature importances

In [None]:
model = XGBClassifier(n_estimators=50,
                      colsample_bytree=0.7,
                      subsample=0.8,
                      learning_rate=0.1,
                      max_depth= 3,
                      random_state=42,
                      use_label_encoder=False)
model.fit(X_train, y_train)

model.feature_importances_

feats = pd.DataFrame(index=X.columns,
                     data=model.feature_importances_,
                     columns=['xgb_importance'])
xgb_imp_feats = feats.sort_values("xgb_importance", ascending=False)
xgb_imp_feats

In [None]:
ax = sns.barplot(data=xgb_imp_feats,
                 x=xgb_imp_feats.index,
                 y='xgb_importance')
ax.bar_label(ax.containers[0],fmt="%.3f")
plt.xticks(rotation=90);

## Feature importance comparison

In [None]:
pd.concat([ada_imp_feats, grad_imp_feats, xgb_imp_feats], axis=1)

## Evaluating ROC Curves and AUC

In [None]:
RocCurveDisplay.from_estimator(xgb_grid, X_test, y_test);

## Comparing Models

In [None]:
compare = pd.DataFrame({"Model": ["AdaBoost","GradientBoost", "XGBoost"],
                        "F1": [ada_f1, gb_f1, xgb_f1],
                        "Recall": [ada_recall, gb_recall, xgb_recall],
                        "ROC_AUC": [ada_auc, gb_auc, xgb_auc]})


plt.figure(figsize=(14,10))

plt.subplot(311)
compare = compare.sort_values(by="F1", ascending=False)
ax=sns.barplot(x="F1", y="Model", data=compare, palette="Blues_d")
ax.bar_label(ax.containers[0],fmt="%.3f")

plt.subplot(312)
compare = compare.sort_values(by="Recall", ascending=False)
ax=sns.barplot(x="Recall", y="Model", data=compare, palette="Blues_d")
ax.bar_label(ax.containers[0],fmt="%.3f")

plt.subplot(313)
compare = compare.sort_values(by="ROC_AUC", ascending=False)
ax=sns.barplot(x="ROC_AUC", y="Model", data=compare, palette="Blues_d")
ax.bar_label(ax.containers[0],fmt="%.3f")
plt.show()