# Gradient Boosting Classifier - End-to-End ML Project

🎯 Problem Statement:
**Predict whether a patient is likely to be readmitted to the hospital based on their clinical data.**

📋 Simulated Dataset Columns:
age, blood_pressure, cholesterol, blood_sugar, heart_rate (numeric)

gender (categorical)

smoking_status (categorical)

readmitted (target: 0 = No, 1 = Yes)

🧱 Step-by-Step Structure:
✅ Data Cleaning
✅ Preprocessing (Label Encoding, Scaling)
✅ Train-Test Split
✅ Model Training (Gradient Boosting)
✅ Evaluation (Accuracy, F1, Confusion Matrix)
✅ Hyperparameter Tuning
✅ Pros & Cons

In [1]:
# Import Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report


In [2]:
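# Create Simulated Dataset
# Note: 'readmitted' is drawn independently of every feature,
# so this data contains no real signal for a model to learn.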

np.random.seed(42)

data = {
    'age': np.random.randint(20, 80, 100),
    'blood_pressure': np.random.randint(90, 180, 100),
    'cholesterol': np.random.randint(150, 300, 100),
    'blood_sugar': np.random.randint(70, 180, 100),
    'heart_rate': np.random.randint(60, 100, 100),
    'gender': np.random.choice(['male', 'female'], 100),
    'smoking_status': np.random.choice(['never', 'former', 'current'], 100),
    'readmitted': np.random.choice([0, 1], 100, p=[0.7, 0.3])
}

df = pd.DataFrame(data)
df.head(10)

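|   | age | blood_pressure | cholesterol | blood_sugar | heart_rate | gender | smoking_status | readmitted |
| - | --- | -------------- | ----------- | ----------- | ---------- | ------ | -------------- | ---------- |
| 0 | 58  | 167            | 286         | 102         | 71         | male   | never          | 0          |
| 1 | 71  | 176            | 211         | 117         | 82         | male   | current        | 0          |
| 2 | 48  | 151            | 200         | 145         | 74         | female | never          | 0          |
| 3 | 34  | 129            | 208         | 128         | 87         | female | current        | 0          |
| 4 | 62  | 174            | 267         | 155         | 93         | male   | never          | 0          |
| 5 | 27  | 169            | 245         | 91          | 61         | female | never          | 0          |
| 6 | 40  | 171            | 262         | 179         | 91         | male   | current        | 1          |
| 7 | 58  | 142            | 211         | 99          | 82         | male   | never          | 0          |
| 8 | 77  | 113            | 201         | 107         | 81         | female | current        | 0          |
| 9 | 38  | 115            | 161         | 171         | 84         | male   | never          | 1          |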
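Because the `readmitted` column above is sampled independently of the features, no classifier can do better than chance on it. As a rough sketch only, one way to simulate labels that do depend on the features is to pass a linear score through a logistic function. The coefficients and the names `logit`, `prob`, and `readmitted_sim` are hypothetical, chosen purely for illustration and not taken from the original notebook.

```python
import numpy as np

rng = np.random.default_rng(42)
n = 100

age = rng.integers(20, 80, n)
blood_sugar = rng.integers(70, 180, n)

# Hypothetical coefficients: older patients and higher blood sugar
# get a higher simulated readmission probability.
logit = -1.0 + 0.04 * (age - 50) + 0.02 * (blood_sugar - 125)
prob = 1.0 / (1.0 + np.exp(-logit))

# Draw one Bernoulli label per patient from its own probability.
readmitted_sim = rng.binomial(1, prob)

print("Simulated readmission rate:", readmitted_sim.mean())
```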


In [3]:
# Data Cleaning

print(df.isnull().sum())  # No missing values in this synthetic data
print(df.dtypes)


age               0
blood_pressure    0
cholesterol       0
blood_sugar       0
heart_rate        0
gender            0
smoking_status    0
readmitted        0
dtype: int64
age                int64
blood_pressure     int64
cholesterol        int64
blood_sugar        int64
heart_rate         int64
gender            object
smoking_status    object
readmitted         int64
dtype: object


In [4]:
# Preprocessing

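# Encode categorical features as integer codes.
# LabelEncoder sorts the classes alphabetically before numbering them;
# it is designed for targets, and OrdinalEncoder/OneHotEncoder are the
# feature-oriented alternatives in scikit-learn.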
df['gender'] = LabelEncoder().fit_transform(df['gender'])
df['smoking_status'] = LabelEncoder().fit_transform(df['smoking_status'])

# Features and target
X = df.drop('readmitted', axis=1)
y = df['readmitted']

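# Standard Scaling
# Tree-based models do not need scaled inputs. Note that fitting the scaler
# on the full dataset before the split exposes test-set statistics to it.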
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
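An alternative to the preprocessing above is to split first and wrap the encoders in a `Pipeline`, so that every transformer is fitted on the training rows only. The sketch below is not the notebook's method: it swaps label encoding for one-hot encoding, uses a reduced set of columns, and the names `raw`, `preprocess`, and `pipe` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Small stand-in dataset with the same kinds of columns as the notebook.
rng = np.random.default_rng(42)
n = 100
raw = pd.DataFrame({
    'age': rng.integers(20, 80, n),
    'blood_pressure': rng.integers(90, 180, n),
    'gender': rng.choice(['male', 'female'], n),
    'smoking_status': rng.choice(['never', 'former', 'current'], n),
    'readmitted': rng.choice([0, 1], n, p=[0.7, 0.3]),
})

numeric_cols = ['age', 'blood_pressure']
categorical_cols = ['gender', 'smoking_status']

# Scale numeric columns, one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ('num', StandardScaler(), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

pipe = Pipeline([
    ('preprocess', preprocess),
    ('gb', GradientBoostingClassifier(random_state=42)),
])

X_raw = raw.drop(columns='readmitted')
y_raw = raw['readmitted']

# Split BEFORE fitting anything, keeping the class ratio in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_raw, y_raw, test_size=0.2, random_state=42, stratify=y_raw)

pipe.fit(X_tr, y_tr)           # transformers see training rows only
pipe_pred = pipe.predict(X_te)
```

Passing `pipe` to `GridSearchCV` in place of a bare estimator would also refit the preprocessing inside each cross-validation fold.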


In [5]:
# Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)


In [6]:
# Train Gradient Boosting Classifier

model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1, max_depth=3, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


In [7]:
# Evaluation

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))



Accuracy: 0.55
Confusion Matrix:
 [[7 5]
 [4 4]]
Classification Report:
               precision    recall  f1-score   support

           0       0.64      0.58      0.61        12
           1       0.44      0.50      0.47         8

    accuracy                           0.55        20
   macro avg       0.54      0.54      0.54        20
weighted avg       0.56      0.55      0.55        20
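To make the report above easier to read, the class-1 metrics can be recomputed by hand from the confusion matrix. This is a small worked example using only the four counts printed above; scikit-learn's convention is rows = actual class, columns = predicted class.

```python
# Confusion matrix copied from the output above.
cm = [[7, 5],
      [4, 4]]

tn, fp = cm[0]   # actual 0: predicted 0, predicted 1
fn, tp = cm[1]   # actual 1: predicted 0, predicted 1

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision_1 = tp / (tp + fp)          # of predicted readmissions, how many were real
recall_1 = tp / (tp + fn)             # of real readmissions, how many were caught
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)

print(f"accuracy={accuracy:.2f} precision={precision_1:.2f} "
      f"recall={recall_1:.2f} f1={f1_1:.2f}")
# prints: accuracy=0.55 precision=0.44 recall=0.50 f1=0.47
```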



In [8]:
# Hyperparameter Tuning

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.2],
    'max_depth': [2, 3, 4]
}

grid = GridSearchCV(GradientBoostingClassifier(random_state=42),
                    param_grid, cv=3, scoring='accuracy')

grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)

best_model = grid.best_estimator_
y_pred_tuned = best_model.predict(X_test)


Best Parameters: {'learning_rate': 0.01, 'max_depth': 2, 'n_estimators': 50}
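With imbalanced classes, selecting hyperparameters by plain accuracy can favor models that mostly predict the majority class. One possible variation, sketched here on stand-in random data rather than the notebook's `X_train`, is to score the grid with `balanced_accuracy` and pass explicit shuffled stratified folds. The names `X_demo`, `y_demo`, `small_grid`, and `search` are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Stand-in data: 80 rows, 7 features, roughly 70/30 class split.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(80, 7))
y_demo = rng.choice([0, 1], size=80, p=[0.7, 0.3])

# Shuffled stratified folds keep the class ratio similar in every fold.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

small_grid = {
    'n_estimators': [50, 100],
    'learning_rate': [0.01, 0.1],
    'max_depth': [2, 3],
}

search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    small_grid,
    cv=cv,
    scoring='balanced_accuracy',   # mean of per-class recall
)
search.fit(X_demo, y_demo)

print("Best parameters:", search.best_params_)
print("Best CV balanced accuracy:", round(search.best_score_, 3))
```

A model that always predicts one class scores 0.5 on balanced accuracy, so it no longer looks better than a model that finds some of the minority class.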


In [9]:
# Final Evaluation After Tuning

print("Tuned Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tuned))
print("Classification Report:\n", classification_report(y_test, y_pred_tuned))


Tuned Accuracy: 0.6
Confusion Matrix:
 [[12  0]
 [ 8  0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.60      1.00      0.75        12
           1       0.00      0.00      0.00         8

    accuracy                           0.60        20
   macro avg       0.30      0.50      0.38        20
weighted avg       0.36      0.60      0.45        20
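The tuned confusion matrix above shows that every test row was predicted as class 0. A quick check, using only the counts printed above, compares the tuned accuracy with a baseline that always predicts the majority class.

```python
# Test-set class counts from the "support" column above.
n_class0, n_class1 = 12, 8
total = n_class0 + n_class1

# Baseline: always predict the majority class.
baseline_accuracy = max(n_class0, n_class1) / total

# Tuned model's confusion matrix from the output above.
tuned_cm = [[12, 0],
            [8, 0]]
tuned_accuracy = (tuned_cm[0][0] + tuned_cm[1][1]) / total
tuned_recall_1 = tuned_cm[1][1] / (tuned_cm[1][0] + tuned_cm[1][1])

print("Majority-class baseline accuracy:", baseline_accuracy)
print("Tuned model accuracy:", tuned_accuracy)
print("Tuned model recall for class 1:", tuned_recall_1)
```

Both accuracies come out at 0.6, and recall for readmitted patients is 0.0, so the rise from 0.55 to 0.60 reflects a model that has stopped predicting readmissions rather than one that has improved.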





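| Advantage                                      | Reason                                                                                                 |
| ---------------------------------------------- | ------------------------------------------------------------------------------------------------------ |
| 🎯 High predictive accuracy                    | Often strong on tabular data that contains real signal                                                 |
| 🔁 Works with mixed data types                 | Numeric features directly; categorical features once encoded                                           |
| ✅ Reduces bias                                 | Each new tree is fit to the errors of the current ensemble; shallow trees and shrinkage limit variance |
| ⚖️ Can handle class imbalance (with tuning)    | Via `sample_weight` in `fit`; `scale_pos_weight` is an XGBoost/LightGBM parameter, not scikit-learn's  |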
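As a minimal sketch of the sample-weight approach to class imbalance, scikit-learn's `compute_sample_weight` can produce per-row weights that `GradientBoostingClassifier.fit` accepts through its `sample_weight` argument. The data and the names `X_imb`, `y_imb`, and `weighted_model` below are stand-ins, not part of the original notebook.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.utils.class_weight import compute_sample_weight

# Stand-in imbalanced data: roughly 70% class 0, 30% class 1.
rng = np.random.default_rng(1)
X_imb = rng.normal(size=(200, 5))
y_imb = rng.choice([0, 1], size=200, p=[0.7, 0.3])

# 'balanced' gives each row a weight inversely proportional to its
# class frequency, so both classes carry equal total weight.
weights = compute_sample_weight(class_weight='balanced', y=y_imb)

weighted_model = GradientBoostingClassifier(random_state=42)
weighted_model.fit(X_imb, y_imb, sample_weight=weights)

print("Total weight, class 0:", weights[y_imb == 0].sum())
print("Total weight, class 1:", weights[y_imb == 1].sum())
```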


| Limitation                            | Reason                         |
| ------------------------------------- | ------------------------------ |
| 🐢 Slower to train than Random Forest | Sequential, not parallel       |
| 🔧 Sensitive to hyperparameters       | Requires tuning                |
| ❌ Can overfit on noisy data           | Especially with too many trees |
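One way to limit overfitting from too many trees is the built-in early stopping in `GradientBoostingClassifier`, which holds out part of the training data and stops adding trees once the validation score stops improving. The sketch below runs on stand-in noisy data; the specific values for `n_iter_no_change` and `validation_fraction` are illustrative, not recommendations.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Stand-in data: the label depends on the first feature plus noise.
rng = np.random.default_rng(2)
X_noise = rng.normal(size=(300, 5))
y_noise = (X_noise[:, 0] + rng.normal(scale=1.0, size=300) > 0).astype(int)

es_model = GradientBoostingClassifier(
    n_estimators=500,          # upper bound on the number of trees
    learning_rate=0.1,
    max_depth=3,
    validation_fraction=0.2,   # share of training data held out internally
    n_iter_no_change=10,       # stop after 10 rounds without improvement
    tol=1e-4,
    random_state=42,
)
es_model.fit(X_noise, y_noise)

# n_estimators_ reports how many trees were actually fitted.
print("Trees actually fitted:", es_model.n_estimators_)
```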


Real-World Use Cases
🏥 Readmission prediction

🏦 Credit scoring

📉 Stock price movement prediction

📦 Customer churn detection


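| Item      | Summary                                                                                                                            |
| --------- | ---------------------------------------------------------------------------------------------------------------------------------- |
| Algorithm | Gradient Boosting Classifier                                                                                                       |
| Dataset   | Simulated hospital readmission data                                                                                                |
| Accuracy  | 0.55 before tuning, 0.60 after tuning on the 20-row test set; the labels are random, so neither figure reflects real predictive skill |
| Good For  | Tabular binary classification                                                                                                      |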
