# XGBoost Algorithm

XGBoost (eXtreme Gradient Boosting) is an open-source, **efficient** and **flexible** implementation of **gradient boosting**. It is widely used for **classification** and **regression** problems and is known for its **speed** and **accuracy**, particularly on **large tabular datasets**. As a tree ensemble it is less transparent than a single decision tree, but it provides feature importance scores that help with **interpretability**.

XGBoost is **based on gradient boosting**, an ensemble learning technique that combines many weak models (usually shallow decision trees) into a strong predictive model. Training is iterative: each new model is fit to the errors of the ensemble built so far (the residuals in the squared-error case, or more generally the negative gradient of the loss), and its scaled output is added to the ensemble. The final prediction is the sum of the contributions of all the models.
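
The residual-fitting loop described above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not XGBoost's actual implementation: it uses squared-error loss, one-split decision stumps on a single feature, and made-up data, and the helper names (`fit_stump`, `gradient_boost`, `predict`) are hypothetical. XGBoost adds regularization, second-order gradient information, and far more efficient tree construction on top of this idea.

```python
# Toy gradient boosting for squared-error regression on one feature.
# Decision stumps stand in for XGBoost's regularized trees (assumption).

def fit_stump(xs, residuals):
    """Return (threshold, left_mean, right_mean) minimizing squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:  # every split that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    return best[1:]

def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x <= threshold else right_mean

def gradient_boost(xs, ys, n_rounds=50, learning_rate=0.1):
    """Start from the mean, then repeatedly fit a stump to the current residuals."""
    base = sum(ys) / len(ys)
    preds = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        stumps.append(stump)
        preds = [p + learning_rate * predict_stump(stump, x)
                 for x, p in zip(xs, preds)]
    return base, learning_rate, stumps

def predict(model, x):
    base, learning_rate, stumps = model
    return base + learning_rate * sum(predict_stump(s, x) for s in stumps)

def mean_squared_error_py(ys, preds):
    return sum((y - p) ** 2 for y, p in zip(ys, preds)) / len(ys)

# Made-up data: a step from about 1 to about 3 with a little noise.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]

model = gradient_boost(xs, ys)
mse_base = mean_squared_error_py(ys, [sum(ys) / len(ys)] * len(ys))
mse_boosted = mean_squared_error_py(ys, [predict(model, x) for x in xs])
print("MSE of the mean-only model:", mse_base)
print("MSE after boosting:", mse_boosted)
```

Each round shrinks the remaining residuals a little; the learning rate controls how much of each stump's correction is applied, which is the same role `learning_rate` plays in the hyperparameter searches later in this notebook.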

XGBoost has several key features that make it efficient and flexible. These include:

* **Handling missing values**: XGBoost accepts missing values (NaN) directly. At each split it learns a default direction for rows where the feature is missing, so imputation is not strictly required.
* **Handling categorical variables**: Recent versions offer native support for categorical features (through the `enable_categorical` option). Otherwise, as in the examples below, categories are encoded as numbers before training.
* **Handling large datasets**: XGBoost parallelizes tree construction and is designed to scale to large datasets.
* **Handling complex models**: Because it builds trees, XGBoost can capture non-linear relationships and interactions between variables without manual feature engineering.
* **Interpretability**: XGBoost exposes feature importance scores, and fitted models can be combined with external tools such as scikit-learn's partial dependence plots.

Overall, XGBoost is a powerful and flexible algorithm that is widely used in machine learning and data science. Its speed and accuracy make it a strong default choice for tabular data, although, as the comparisons below show, it is not guaranteed to beat other ensembles on every dataset.


<img src="https://towardsdatascience.com/wp-content/uploads/2022/10/1tSaJ_yv8yDt4XU0ZJEaMeg.jpeg" style="width:70%"/>

Official Website: https://xgboost.readthedocs.io/en/stable/#

**XGBoost Classifier Example**
----------
Car Purchase Dataset

In [1]:
# import basic libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [2]:
# install and import the XGBoost library; it is a separate package, not part of scikit-learn
!pip install xgboost


Collecting xgboost
  Using cached xgboost-3.0.5-py3-none-win_amd64.whl.metadata (2.1 kB)
Using cached xgboost-3.0.5-py3-none-win_amd64.whl (56.8 MB)
Installing collected packages: xgboost
Successfully installed xgboost-3.0.5


In [None]:
from xgboost import XGBClassifier

In [4]:
# load the dataset
df = pd.read_csv('2_9_car_purchase_dataset.csv')
df.head()

Unnamed: 0,User ID,Gender,Age,AnnualSalary,Purchased
0,385,Male,35,20000,0
1,681,Male,40,43500,0
2,353,Male,49,74000,0
3,895,Male,40,107500,1
4,661,Male,25,79000,0


In [5]:
# drop User ID column
df.drop('User ID', axis=1, inplace=True)
df.head()

Unnamed: 0,Gender,Age,AnnualSalary,Purchased
0,Male,35,20000,0
1,Male,40,43500,0
2,Male,49,74000,0
3,Male,40,107500,1
4,Male,25,79000,0


**Explore Data - EDA**
-------

In [6]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column        Non-Null Count  Dtype 
---  ------        --------------  ----- 
 0   Gender        1000 non-null   object
 1   Age           1000 non-null   int64 
 2   AnnualSalary  1000 non-null   int64 
 3   Purchased     1000 non-null   int64 
dtypes: int64(3), object(1)
memory usage: 31.4+ KB


In [7]:
df.describe()

Unnamed: 0,Age,AnnualSalary,Purchased
count,1000.0,1000.0,1000.0
mean,40.106,72689.0,0.402
std,10.707073,34488.341867,0.490547
min,18.0,15000.0,0.0
25%,32.0,46375.0,0.0
50%,40.0,72000.0,0.0
75%,48.0,90000.0,1.0
max,63.0,152500.0,1.0


In [8]:
# data is very clean and there are no missing values
df.isnull().sum()

Gender          0
Age             0
AnnualSalary    0
Purchased       0
dtype: int64

In [9]:
# just convert gender to numeric
df['Gender'] = df['Gender'].map({'Male': 0, 'Female': 1})
df.head()

Unnamed: 0,Gender,Age,AnnualSalary,Purchased
0,0,35,20000,0
1,0,40,43500,0
2,0,49,74000,0
3,0,40,107500,1
4,0,25,79000,0


In [11]:
# Split the data into training and test sets
from sklearn.model_selection import train_test_split

X = df.drop('Purchased', axis=1)
y = df['Purchased']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)


X_train shape: (800, 3)
X_test shape: (200, 3)
y_train shape: (800,)
y_test shape: (200,)


In [12]:
# Create an XGBoost classifier and train it
model = XGBClassifier()
model.fit(X_train, y_train)

In [13]:
# prediction with test set
y_pred = model.predict(X_test)
y_pred

array([0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1,
       1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 1, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0,
       0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0,
       1, 0, 0, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0,
       0, 0, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0,
       1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 0, 1,
       0, 1])

In [14]:
# see the metrics and accuracy
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

accuracy = accuracy_score(y_test, y_pred)

precision = precision_score(y_test, y_pred)

recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)

print("Accuracy:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1 Score:", f1)

Accuracy: 0.9
Precision: 0.9047619047619048
Recall: 0.8636363636363636
F1 Score: 0.8837209302325582


In [16]:
# see the confusion matrix and classification report
# (results are stored under new names so the imported functions are not overwritten)
from sklearn.metrics import confusion_matrix, classification_report

cm = confusion_matrix(y_test, y_pred)

print("Confusion Matrix:")
print(cm)

report = classification_report(y_test, y_pred)

print("Classification Report:")
print(report)

Confusion Matrix:
[[104   8]
 [ 12  76]]
Classification Report:
              precision    recall  f1-score   support

           0       0.90      0.93      0.91       112
           1       0.90      0.86      0.88        88

    accuracy                           0.90       200
   macro avg       0.90      0.90      0.90       200
weighted avg       0.90      0.90      0.90       200
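
The metrics printed earlier can be recomputed by hand from the confusion matrix above, which is a useful sanity check. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes, so the matrix reads `[[TN, FP], [FN, TP]]`); the variable names are chosen here for illustration.

```python
# Counts copied from the confusion matrix above: [[TN, FP], [FN, TP]].
tn, fp = 104, 8
fn, tp = 12, 76

total = tn + fp + fn + tp

accuracy_from_cm = (tp + tn) / total      # share of all predictions that are correct
precision_from_cm = tp / (tp + fp)        # of predicted buyers, how many really bought
recall_from_cm = tp / (tp + fn)           # of actual buyers, how many were found
f1_from_cm = (2 * precision_from_cm * recall_from_cm
              / (precision_from_cm + recall_from_cm))

print("Accuracy:", accuracy_from_cm)
print("Precision:", precision_from_cm)
print("Recall:", recall_from_cm)
print("F1 Score:", f1_from_cm)
```

The results agree with the `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` outputs: 180 of 200 test rows are classified correctly, and the 12 missed buyers are what pull recall below precision.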



In [17]:
# compare with Gradient Boosting and AdaBoost
from sklearn.ensemble import GradientBoostingClassifier, AdaBoostClassifier

gb_model = GradientBoostingClassifier()
gb_model.fit(X_train, y_train)
y_pred_gb = gb_model.predict(X_test)

adaboost_model = AdaBoostClassifier()
adaboost_model.fit(X_train, y_train)
y_pred_adaboost = adaboost_model.predict(X_test)

accuracy_gb = accuracy_score(y_test, y_pred_gb)
accuracy_adaboost = accuracy_score(y_test, y_pred_adaboost)

print("Accuracy with XGBoost:", accuracy)
print("Accuracy with Gradient Boosting:", accuracy_gb)
print("Accuracy with AdaBoost:", accuracy_adaboost)

Accuracy with XGBoost: 0.9
Accuracy with Gradient Boosting: 0.905
Accuracy with AdaBoost: 0.885


In [18]:
# hyperparameter tuning with RandomizedSearchCV
from sklearn.model_selection import RandomizedSearchCV

param_dist = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 2, 3],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.2],
    'reg_lambda': [1, 2, 3]
}

xgb_model = XGBClassifier()

random_search = RandomizedSearchCV(xgb_model, param_distributions=param_dist, n_iter=10, cv=5, random_state=42)

random_search.fit(X_train, y_train)

print("Best Hyperparameters:", random_search.best_params_)
print("Best Score:", random_search.best_score_)

best_model = random_search.best_estimator_

y_pred_best = best_model.predict(X_test)

accuracy_best = accuracy_score(y_test, y_pred_best)

print("Accuracy of Best Model:", accuracy_best)


Best Hyperparameters: {'subsample': 1.0, 'reg_lambda': 1, 'reg_alpha': 0.1, 'n_estimators': 50, 'min_child_weight': 1, 'max_depth': 3, 'learning_rate': 0.2, 'gamma': 0.2, 'colsample_bytree': 0.8}
Best Score: 0.9025000000000001
Accuracy of Best Model: 0.915
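
To put `n_iter=10` in perspective, it helps to count how many combinations the search space contains. The sketch below is a plain-Python illustration of the idea behind randomized search, assuming the same value lists as the cell above. The name `search_space` and the use of Python's `random` module are choices made here; the sampled combinations will not match the ones scikit-learn picked.

```python
import itertools
import math
import random

# Same value lists as the param_dist dictionary above.
search_space = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 2, 3],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.2],
    'reg_lambda': [1, 2, 3],
}

# Size of the full grid: the product of the list lengths.
grid_size = math.prod(len(values) for values in search_space.values())

# Draw 10 distinct combinations, as a randomized search over lists does.
all_combos = list(itertools.product(*search_space.values()))
rng = random.Random(42)  # this seed will not reproduce scikit-learn's sample
sampled = rng.sample(all_combos, 10)
candidates = [dict(zip(search_space.keys(), combo)) for combo in sampled]

n_folds = 5
print("Combinations in the full grid:", grid_size)
print("Candidates tried:", len(candidates))
print("Share of the grid explored: {:.2%}".format(len(candidates) / grid_size))
print("Model fits during the search:", len(candidates) * n_folds)
```

An exhaustive grid search over the same lists would need 11,664 candidates times 5 folds, which is why a randomized search is used here. The trade-off is that only a small fraction of the space is explored, so the reported best parameters are the best of the sampled candidates rather than of the whole grid.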


**XGBoost Regressor Example**
-----

In [19]:
# we will work with the Ankara House Prices dataset again
df = pd.read_csv('2_8_ankara_house_prices_cleaned_dataset_gradient_boosting.csv')
df.head()

Unnamed: 0.1,Unnamed: 0,RoomCount,Floor,Size(m2),District,Price(TL),Age_labeled
0,0,4,3.0,130.0,Akyurt,3450000,0
1,1,5,2.0,175.0,Akyurt,3975000,0
2,2,1,1.0,550.0,Akyurt,3600000,1
3,3,5,4.0,170.0,Akyurt,3705000,7
4,4,4,1.0,110.0,Akyurt,2099999,7


In [20]:
# drop Unnamed: 0 column
df.drop('Unnamed: 0', axis=1, inplace=True)
df.head()

Unnamed: 0,RoomCount,Floor,Size(m2),District,Price(TL),Age_labeled
0,4,3.0,130.0,Akyurt,3450000,0
1,5,2.0,175.0,Akyurt,3975000,0
2,1,1.0,550.0,Akyurt,3600000,1
3,5,4.0,170.0,Akyurt,3705000,7
4,4,1.0,110.0,Akyurt,2099999,7


In [21]:
# label encode District column
from sklearn.preprocessing import LabelEncoder

le = LabelEncoder()
df['District'] = le.fit_transform(df['District'])
df.head()

Unnamed: 0,RoomCount,Floor,Size(m2),District,Price(TL),Age_labeled
0,4,3.0,130.0,0,3450000,0
1,5,2.0,175.0,0,3975000,0
2,1,1.0,550.0,0,3600000,1
3,5,4.0,170.0,0,3705000,7
4,4,1.0,110.0,0,2099999,7


In [22]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10244 entries, 0 to 10243
Data columns (total 6 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   RoomCount    10244 non-null  int64  
 1   Floor        10244 non-null  float64
 2   Size(m2)     10244 non-null  float64
 3   District     10244 non-null  int64  
 4   Price(TL)    10244 non-null  int64  
 5   Age_labeled  10244 non-null  int64  
dtypes: float64(2), int64(4)
memory usage: 480.3 KB


In [23]:
# import the XGBoost regressor
from xgboost import XGBRegressor


In [25]:
# Split the data into training and test sets

X = df.drop('Price(TL)', axis=1)
y = df['Price(TL)']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)

X_train shape: (8195, 5)
X_test shape: (2049, 5)
y_train shape: (8195,)
y_test shape: (2049,)


In [26]:
# train the model
model = XGBRegressor()
model.fit(X_train, y_train)

In [27]:
# prediction with test set
y_pred = model.predict(X_test)
y_pred

array([2769118.2, 2507106.5, 3018220.2, ..., 3199868.8, 2991561.5,
       8025480.5], dtype=float32)

In [28]:
# calculate metrics
from sklearn.metrics import mean_squared_error, mean_absolute_error, r2_score

mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)
print("R-squared:", r2)

Mean Squared Error: 1231857778688.0
Mean Absolute Error: 720881.3125
R-squared: 0.6908290982246399


In [35]:
# hyperparameter tuning with randomized search
from sklearn.model_selection import RandomizedSearchCV

params = {
    'n_estimators': [50, 100, 200, 300],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [3, 4, 5],
    'min_child_weight': [1, 2, 3],
    'gamma': [0, 0.1, 0.2],
    'subsample': [0.8, 1.0],
    'colsample_bytree': [0.8, 1.0],
    'reg_alpha': [0, 0.1, 0.2],
    'reg_lambda': [1, 2, 3]
}


In [36]:
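# run the randomized search; with no scoring argument, the regressor's default score (R-squared) is used
rs = RandomizedSearchCV(model, param_distributions=params, n_iter=10, cv=5, random_state=42)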
rs.fit(X_train, y_train)

In [37]:
# report the best parameters and cross-validated score (mean R-squared), then evaluate on the test set
print("Best Parameters:", rs.best_params_)
print("Best Score:", rs.best_score_)

y_pred = rs.predict(X_test)
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)

print("Mean Squared Error:", mse)
print("Mean Absolute Error:", mae)
print("R-squared:", r2)

Best Parameters: {'subsample': 0.8, 'reg_lambda': 3, 'reg_alpha': 0.2, 'n_estimators': 100, 'min_child_weight': 3, 'max_depth': 4, 'learning_rate': 0.2, 'gamma': 0.2, 'colsample_bytree': 0.8}
Best Score: 0.7201737642288208
Mean Squared Error: 1278077960192.0
Mean Absolute Error: 737492.1875
R-squared: 0.6792287826538086
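
The MSE values above are in squared lira, which makes them hard to read. Taking the square root gives the RMSE in TL, on the same scale as the MAE. The sketch below simply reuses the two test-set MSE values printed above; the variable names are chosen here for illustration.

```python
import math

# Test-set MSE values copied from the outputs above (units: TL squared).
mse_default = 1231857778688.0   # XGBRegressor with default settings
mse_tuned = 1278077960192.0     # best model from the randomized search

# RMSE is the square root of MSE, so it is expressed in TL.
rmse_default = math.sqrt(mse_default)
rmse_tuned = math.sqrt(mse_tuned)

print(f"RMSE, default model: {rmse_default:,.0f} TL")
print(f"RMSE, tuned model:   {rmse_tuned:,.0f} TL")
print(f"Change after tuning: {rmse_tuned - rmse_default:+,.0f} TL")
```

Both RMSE values are a little over 1.1 million TL, and the tuned model is slightly worse on the test set (R-squared of about 0.679 versus 0.691). The best cross-validated score of 0.720 is measured on folds of the training data, so it does not guarantee an improvement on the held-out test set, especially when only 10 candidates are sampled.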


In [38]:
# compare with the other regressors: fit each model in a for loop and compare
# all the regressors learned so far: Linear Regression, KNN Regressor, SVR,
# Decision Tree Regressor, Random Forest Regressor, AdaBoost, Gradient Boosting Regressor


from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.ensemble import AdaBoostRegressor
from sklearn.ensemble import GradientBoostingRegressor

from sklearn.metrics import mean_squared_error, r2_score

models = [
    LinearRegression(),
    KNeighborsRegressor(),
    SVR(),
    DecisionTreeRegressor(),
    RandomForestRegressor(),
    AdaBoostRegressor(),
    GradientBoostingRegressor(),
    XGBRegressor()
]

# collect all the results in a DataFrame
results = []
for model in models:
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)
    mse = mean_squared_error(y_test, y_pred)
    r2 = r2_score(y_test, y_pred)
    results.append((model.__class__.__name__, mse, r2))

df_results = pd.DataFrame(results, columns=["Model", "MSE", "R2"])
df_results

Unnamed: 0,Model,MSE,R2
0,LinearRegression,2601443000000.0,0.347091
1,KNeighborsRegressor,1777963000000.0,0.553768
2,SVR,4282558000000.0,-0.074834
3,DecisionTreeRegressor,1877749000000.0,0.528724
4,RandomForestRegressor,1262272000000.0,0.683196
5,AdaBoostRegressor,2220093000000.0,0.442802
6,GradientBoostingRegressor,1332699000000.0,0.66552
7,XGBRegressor,1231858000000.0,0.690829
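
The negative R2 for SVR in the table above can look surprising. R-squared compares a model's squared error with that of a baseline that always predicts the mean of the targets, so a value below zero means the model does worse than that baseline. The sketch below computes the metric by hand on made-up toy numbers; the function name `r2_manual` and the data are hypothetical.

```python
def r2_manual(y_true, y_pred):
    """R-squared = 1 - (residual sum of squares / total sum of squares)."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

# Made-up targets, used only to illustrate the three cases.
y_true = [1.0, 2.0, 3.0, 4.0]

r2_mean = r2_manual(y_true, [2.5, 2.5, 2.5, 2.5])   # always predict the mean
r2_good = r2_manual(y_true, [1.1, 1.9, 3.2, 3.8])   # close to the targets
r2_bad = r2_manual(y_true, [4.0, 3.0, 2.0, 1.0])    # systematically wrong

print("Mean-only baseline:", r2_mean)
print("Good predictions:", r2_good)
print("Bad predictions:", r2_bad)
```

In this notebook the features and the target were passed to every model unscaled. SVR and KNN are sensitive to feature scale, so their scores here may understate what they could achieve with scaling, while the tree-based models are largely unaffected by it.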


--END--