# ==========================================================
# <h2 style="text-align:center;">XGBoost Classification</h2>
# ==========================================================


# 🌳 Introduction to XGBoost

**XGBoost (Extreme Gradient Boosting)** is a widely used ensemble learning library built on gradient-boosted decision trees. It supports both **classification and regression tasks** and is a common first choice for tabular data.

It is based on the principle of **Boosting**, which is an ensemble technique that combines multiple weak learners (usually decision trees) to create a strong learner.

---

## 🚀 Key Concepts of XGBoost

1. **Boosting**  
   - Unlike Bagging (Random Forest), which trains models independently, Boosting trains models sequentially.  
   - Each new tree focuses on correcting the errors of the previous ones.  

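2. **Gradient Boosting**  
   - Each new tree is fit to the negative gradient of the loss function with respect to the current predictions, so adding it moves the ensemble in the direction that reduces the loss.  
   - XGBoost is an optimized implementation of Gradient Boosting that adds techniques such as **regularization** (penalties on tree complexity and leaf weights) and **parallel processing**.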

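3. **Advantages of XGBoost**  
   - Strong predictive accuracy on many tabular problems  
   - Handles missing values natively by learning a default branch direction at each split  
   - Built-in regularization (to reduce overfitting)  
   - Works well with structured/tabular data  
   - Faster training through parallelized split finding within each tree (the trees themselves are still built sequentially)  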
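
The sequential idea behind points 1 and 2 can be sketched by hand. The following is a minimal illustration, not XGBoost's actual implementation: it uses scikit-learn's `DecisionTreeRegressor` on synthetic data with squared-error loss, for which the negative gradient is simply the residual. The helper names `fit_boosted_trees` and `predict_boosted` and all parameter values are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor


def fit_boosted_trees(X, y, n_trees=50, learning_rate=0.1, max_depth=2):
    """Fit shallow trees sequentially, each one on the current residuals."""
    base = float(np.mean(y))              # start from a constant prediction
    pred = np.full(len(y), base)
    trees = []
    for _ in range(n_trees):
        residuals = y - pred              # negative gradient of 0.5 * (y - pred)^2
        tree = DecisionTreeRegressor(max_depth=max_depth, random_state=0)
        tree.fit(X, residuals)
        pred = pred + learning_rate * tree.predict(X)  # shrunken update
        trees.append(tree)
    return base, trees


def predict_boosted(X, base, trees, learning_rate=0.1):
    """Sum the constant start value and the shrunken tree outputs."""
    pred = np.full(X.shape[0], base)
    for tree in trees:
        pred = pred + learning_rate * tree.predict(X)
    return pred


# Synthetic regression data (illustrative only)
rng = np.random.default_rng(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

base, trees = fit_boosted_trees(X_demo, y_demo)
mse_start = np.mean((y_demo - base) ** 2)
mse_end = np.mean((y_demo - predict_boosted(X_demo, base, trees)) ** 2)
print("Training MSE with constant prediction:", mse_start)
print("Training MSE after boosting:", mse_end)
```

Each tree on its own is a weak learner (depth 2), yet the training error drops as trees accumulate, because every new tree targets what the ensemble so far still gets wrong.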

---

## 🏆 Where is XGBoost used?
- Credit risk modeling  
- Customer churn prediction  
- Fraud detection  
- Kaggle competitions (very popular)  


In [3]:
# ==========================================================
# Importing the libraries
# ==========================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


In [5]:
# ==========================================================
# Importing the dataset
# ==========================================================
dataset = pd.read_csv("../data/Churn_Modelling.csv")

# Independent features (excluding RowNumber, CustomerId, Surname, and last column)
X = dataset.iloc[:, 3:-1].values
y = dataset.iloc[:, -1].values

print("Shape of X:", X.shape)  # ➤ Check dimensions
print("Shape of y:", y.shape)  # ➤ Target vector


Shape of X: (10000, 10)
Shape of y: (10000,)


In [6]:
# ==========================================================
# Encoding categorical data
# ==========================================================
from sklearn.preprocessing import LabelEncoder
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Encode "Gender" column
le = LabelEncoder()
X[:, 2] = le.fit_transform(X[:, 2])

# OneHotEncode "Geography" column
ct = ColumnTransformer(
    transformers=[('encoder', OneHotEncoder(), [1])],
    remainder='passthrough'
)
X = np.array(ct.fit_transform(X))

print("After Encoding Shape:", X.shape)  # ➤ Check new shape


After Encoding Shape: (10000, 12)


In [7]:
# ==========================================================
# Splitting the dataset into Training & Test sets
# ==========================================================
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

print("Training set shape:", X_train.shape)
print("Test set shape:", X_test.shape)


Training set shape: (8000, 12)
Test set shape: (2000, 12)
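
When the classes are imbalanced, as they are in churn data, a plain random split can leave the two sets with slightly different positive rates. A minimal sketch of a stratified split follows, using the `stratify` argument of `train_test_split` on hypothetical labels with an assumed 80/20 class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80% class 0, 20% class 1
y_demo = np.array([0] * 800 + [1] * 200)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0, stratify=y_demo
)

# Stratification keeps the class ratio the same in both parts
print("Positive rate (train):", y_tr.mean())
print("Positive rate (test):", y_te.mean())
```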


In [None]:
# ==========================================================
# Training the XGBoost Classifier
# ==========================================================
from xgboost import XGBClassifier


classifier = XGBClassifier(random_state=0)
classifier.fit(X_train, y_train)

print("✅ XGBoost Model Trained Successfully")


✅ XGBoost Model Trained Successfully


In [9]:
# ==========================================================
# Predicting the Test set results
# ==========================================================
y_pred = classifier.predict(X_test)

print("Sample Predictions:", y_pred[:10])  # ➤ Show first 10 predictions


Sample Predictions: [1 0 0 0 0 1 0 0 0 0]
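
`predict` returns hard 0/1 labels. When the cost of missing a churner differs from the cost of a false alarm, it can help to work with probabilities instead (for example the second column of `classifier.predict_proba(X_test)`) and choose the cut-off explicitly. The sketch below uses a small hypothetical probability array rather than real model output, and the 0.3 threshold is an arbitrary illustrative value.

```python
import numpy as np

# Hypothetical churn probabilities, standing in for predict_proba(...)[:, 1]
proba_churn = np.array([0.82, 0.10, 0.35, 0.48, 0.05, 0.61])

# Usual cut-off at 0.5
default_labels = (proba_churn >= 0.5).astype(int)

# Lower cut-off: flags more customers, trading precision for recall
lenient_labels = (proba_churn >= 0.3).astype(int)

print("Threshold 0.5:", default_labels)
print("Threshold 0.3:", lenient_labels)
```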


In [10]:
# ==========================================================
# Model Evaluation (Confusion Matrix & Accuracy)
# ==========================================================
from sklearn.metrics import confusion_matrix, accuracy_score

cm = confusion_matrix(y_test, y_pred)
ac = accuracy_score(y_test, y_pred)

print("Confusion Matrix:\n", cm)
print("Accuracy Score:", ac)


Confusion Matrix:
 [[1496   99]
 [ 192  213]]
Accuracy Score: 0.8545
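
Accuracy alone hides how the errors are distributed across the two classes. The sketch below derives precision, recall, and F1 for the positive class directly from the confusion matrix printed above. It assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and that label 1 marks a customer who churned; the variable name `cm_reported` is introduced here for illustration.

```python
import numpy as np

# Confusion matrix reported above: rows = actual, columns = predicted
cm_reported = np.array([[1496, 99],
                        [192, 213]])
tn, fp, fn, tp = cm_reported.ravel()

accuracy = (tp + tn) / cm_reported.sum()
precision = tp / (tp + fp)   # of predicted churners, how many actually churned
recall = tp / (tp + fn)      # of actual churners, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1 score:  {f1:.4f}")
```

The recall works out to roughly 0.53, meaning the model misses close to half of the customers who actually churned even though overall accuracy is above 85%.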


In [11]:
# ==========================================================
# Train vs. Test Accuracy (Overfitting Check)
# ==========================================================
# Note: these are accuracies, not bias and variance themselves.
# A large gap between them points to overfitting (high variance).
train_accuracy = classifier.score(X_train, y_train)
test_accuracy = classifier.score(X_test, y_test)

print("Train Accuracy:", train_accuracy)
print("Test Accuracy:", test_accuracy)


Train Accuracy: 0.953875
Test Accuracy: 0.8545


In [12]:
# ==========================================================
# Applying k-Fold Cross Validation
# ==========================================================
from sklearn.model_selection import cross_val_score

accuracies = cross_val_score(estimator=classifier, X=X_train, y=y_train, cv=5)
print("Cross-Validation Accuracy Mean: {:.2f} %".format(accuracies.mean()*100))
print("Cross-Validation Std Dev: {:.2f} %".format(accuracies.std()*100))


Cross-Validation Accuracy Mean: 85.15 %
Cross-Validation Std Dev: 0.48 %
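
For a classifier, passing an integer such as `cv=5` makes scikit-learn use stratified folds without shuffling. Passing an explicit splitter makes the shuffling and the seed visible. The sketch below shows this on synthetic data, with scikit-learn's `GradientBoostingClassifier` as a stand-in model so that the example does not depend on the churn file; the data shape and the 80/20 class weights are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Synthetic, imbalanced binary data (illustrative only)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=10, weights=[0.8, 0.2], random_state=0
)

# Stand-in gradient boosting model, not XGBoost itself
model = GradientBoostingClassifier(n_estimators=50, random_state=0)

# Explicit splitter: 5 stratified folds, shuffled with a fixed seed
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean: {:.2f} %".format(scores.mean() * 100))
print("Std Dev: {:.2f} %".format(scores.std() * 100))
```

The same `cv` object can be passed to `cross_val_score` together with the XGBoost `classifier` from this notebook.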


# ✅ Summary of XGBoost Classification

In this notebook, we performed **customer churn prediction** using **XGBoost Classifier**.

### Steps we covered:
1. Imported & preprocessed the dataset  
   - Label-encoded the Gender column  
   - One-hot encoded the Geography column  
2. Split data into training and testing sets  
3. Trained an **XGBoost classifier**  
4. Evaluated the model with  
   - Confusion Matrix  
   - Accuracy Score  
   - Train vs. test accuracy comparison (overfitting check)  
   - k-Fold Cross Validation  

### 🔑 Key Takeaways
- XGBoost is a **Boosting-based Ensemble Learning algorithm**.  
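- Boosting mainly reduces **bias** relative to a single shallow Decision Tree, while regularization helps keep **variance** in check. The gap between training accuracy (95.39%) and test accuracy (85.45%) suggests the default settings still overfit somewhat on this dataset.  
- The model reached **85.45% test accuracy** and a mean 5-fold cross-validation accuracy of **85.15%** (standard deviation 0.48%).  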
- Works very well for **structured data** like customer churn, credit scoring, etc.  
