# AdaBoost Classifier - End-to-End ML Project

## 🎯 Problem Statement:
**Predict whether a customer will subscribe to a term deposit using bank marketing data.**

- This notebook uses boosting, specifically AdaBoost (Adaptive Boosting), an ensemble method that combines many weak learners (such as depth-1 decision trees) into a stronger predictive model.
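
To make "weak learners into a strong model" concrete, here is a minimal NumPy-only sketch of discrete AdaBoost with decision stumps. It is an illustration under stated assumptions, not scikit-learn's implementation: the helper names (`fit_stump`, `adaboost_fit`, `adaboost_predict`) and the toy data are hypothetical, and labels are encoded as -1/+1 rather than 0/1.

```python
import numpy as np

def stump_predict(X, j, thr, sign):
    """Predict `sign` where feature j <= thr, and `-sign` elsewhere."""
    return np.where(X[:, j] <= thr, sign, -sign)

def fit_stump(X, y, w):
    """Exhaustively search for the stump with the lowest weighted error."""
    best = None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1, -1):
                pred = stump_predict(X, j, thr, sign)
                err = np.sum(w[pred != y])
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def adaboost_fit(X, y, n_rounds=3):
    """Discrete AdaBoost for labels in {-1, +1} (illustrative sketch)."""
    w = np.full(len(y), 1.0 / len(y))       # start with uniform sample weights
    ensemble = []
    for _ in range(n_rounds):
        err, j, thr, sign = fit_stump(X, y, w)
        err = np.clip(err, 1e-10, 1 - 1e-10)  # avoid division by zero
        alpha = 0.5 * np.log((1 - err) / err)  # vote strength of this stump
        pred = stump_predict(X, j, thr, sign)
        w = w * np.exp(-alpha * y * pred)      # upweight misclassified samples
        w /= w.sum()
        ensemble.append((alpha, j, thr, sign))
    return ensemble

def adaboost_predict(X, ensemble):
    """Sign of the alpha-weighted vote over all stumps."""
    score = sum(a * stump_predict(X, j, thr, s) for a, j, thr, s in ensemble)
    return np.where(score >= 0, 1, -1)

# Toy 1-D data (+ + - - + +) that no single threshold can separate
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]])
y_toy = np.array([1, 1, -1, -1, 1, 1])

single = adaboost_fit(X_toy, y_toy, n_rounds=1)
boosted = adaboost_fit(X_toy, y_toy, n_rounds=3)
acc_single = np.mean(adaboost_predict(X_toy, single) == y_toy)
acc_boosted = np.mean(adaboost_predict(X_toy, boosted) == y_toy)
print("one stump:", acc_single, "| three boosted stumps:", acc_boosted)
```

On this toy set the best single stump gets 4 of 6 points right, while three boosted stumps classify all 6 correctly, because each round shifts weight onto the points the previous stumps missed.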

📋 Simulated Dataset Columns:

- `age`, `balance`, `day`, `duration`, `campaign`, `pdays`, `previous` (numeric)
- `job_type` (categorical)
- `contact` (categorical)
- `subscribed` (target: 0 = No, 1 = Yes)

In [7]:
# Import Required Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix


In [19]:
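# Create Sample Dataset
# Note: every column, including the target, is drawn independently at random,
# so the features carry no real signal about `subscribed`.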

np.random.seed(42)

data = {
    'age': np.random.randint(18, 65, 100),
    'balance': np.random.randint(-2000, 10000, 100),
    'day': np.random.randint(1, 31, 100),
    'duration': np.random.randint(0, 500, 100),
    'campaign': np.random.randint(1, 10, 100),
    'pdays': np.random.randint(-1, 100, 100),
    'previous': np.random.randint(0, 5, 100),
    'job_type': np.random.choice(['admin', 'blue-collar', 'technician'], 100),
    'contact': np.random.choice(['cellular', 'telephone'], 100),
    'subscribed': np.random.choice([0, 1], 100, p=[0.7, 0.3])
}

df = pd.DataFrame(data)
df.head(10)


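|   | age | balance | day | duration | campaign | pdays | previous | job_type    | contact   | subscribed |
| - | --- | ------- | --- | -------- | -------- | ----- | -------- | ----------- | --------- | ---------- |
| 0 | 56  | 695     | 3   | 403      | 4        | 98    | 3        | blue-collar | telephone | 0          |
| 1 | 46  | 7687    | 17  | 151      | 5        | 32    | 3        | blue-collar | cellular  | 0          |
| 2 | 32  | 3258    | 5   | 53       | 7        | 50    | 2        | technician  | cellular  | 0          |
| 3 | 60  | 3618    | 17  | 119      | 1        | 93    | 1        | technician  | telephone | 0          |
| 4 | 25  | 4736    | 24  | 160      | 3        | 8     | 4        | admin       | telephone | 0          |
| 5 | 38  | -1609   | 17  | 407      | 2        | 17    | 4        | technician  | cellular  | 0          |
| 6 | 56  | 3892    | 27  | 115      | 9        | 56    | 2        | blue-collar | telephone | 0          |
| 7 | 36  | 1561    | 17  | 74       | 6        | 94    | 3        | blue-collar | telephone | 0          |
| 8 | 40  | 8470    | 2   | 112      | 3        | -1    | 0        | admin       | cellular  | 0          |
| 9 | 28  | 4184    | 2   | 455      | 8        | 67    | 3        | blue-collar | telephone | 0          |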


In [9]:
# Data Cleaning

print(df.isnull().sum())     # Check for missing values
print(df.dtypes)             # Check types



age           0
balance       0
day           0
duration      0
campaign      0
pdays         0
previous      0
job_type      0
contact       0
subscribed    0
dtype: int64
age            int64
balance        int64
day            int64
duration       int64
campaign       int64
pdays          int64
previous       int64
job_type      object
contact       object
subscribed     int64
dtype: object


In [10]:
# Preprocessing

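# Encode categorical columns as integer codes.
# LabelEncoder is intended for target labels; it works here, but it imposes an
# arbitrary order on nominal categories (usually tolerable for tree-based learners).
le1 = LabelEncoder()
le2 = LabelEncoder()
df['job_type'] = le1.fit_transform(df['job_type'])
df['contact'] = le2.fit_transform(df['contact'])

# Features and target
X = df.drop('subscribed', axis=1)
y = df['subscribed']

# Feature Scaling
# Decision stumps split on thresholds, so scaling does not change what they learn.
# Fitting the scaler on all rows before the split also leaks test-set statistics;
# in a real project, fit it on the training split only.
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)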
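
As an alternative to the preprocessing above, nominal columns such as `job_type` are often one-hot encoded, and the encoding can be wrapped in a pipeline so that it is fitted on training rows only. The sketch below is one possible setup, not the notebook's method: `demo_df` is a hypothetical stand-in that uses only a subset of the columns, and the variable names are made up for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical stand-in for the simulated data (subset of columns)
rng = np.random.default_rng(42)
n = 100
demo_df = pd.DataFrame({
    "age": rng.integers(18, 65, n),
    "balance": rng.integers(-2000, 10000, n),
    "duration": rng.integers(0, 500, n),
    "job_type": rng.choice(["admin", "blue-collar", "technician"], n),
    "contact": rng.choice(["cellular", "telephone"], n),
    "subscribed": rng.choice([0, 1], n, p=[0.7, 0.3]),
})

X_demo = demo_df.drop(columns="subscribed")
y_demo = demo_df["subscribed"]

# Split first, so every fitted step only ever sees training rows
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# One-hot encode the nominal columns; pass numeric columns through unscaled,
# since threshold-based stumps are insensitive to feature scale
preprocess = ColumnTransformer(
    transformers=[
        ("onehot", OneHotEncoder(handle_unknown="ignore"), ["job_type", "contact"]),
    ],
    remainder="passthrough",
)

pipe = Pipeline([
    ("prep", preprocess),
    # The default base learner of AdaBoostClassifier is a depth-1 decision tree
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=42)),
])
pipe.fit(X_tr, y_tr)
demo_pred = pipe.predict(X_te)
print("Predicted labels for the first five test rows:", demo_pred[:5])
```

Because the whole pipeline is a single estimator, it can also be passed to `GridSearchCV`, which then refits the encoder inside each cross-validation fold.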


In [11]:
# Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)



In [13]:
# Train AdaBoost Classifier

# Using a decision stump as base estimator
base_estimator = DecisionTreeClassifier(max_depth=1)
model = AdaBoostClassifier(estimator=base_estimator, n_estimators=100, learning_rate=0.8, random_state=42)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)


In [14]:
# Evaluation

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.7
Confusion Matrix:
 [[14  2]
 [ 4  0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.78      0.88      0.82        16
           1       0.00      0.00      0.00         4

    accuracy                           0.70        20
   macro avg       0.39      0.44      0.41        20
weighted avg       0.62      0.70      0.66        20
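
The accuracy above needs a reference point. The test set holds 16 negatives and 4 positives, so a model that always predicts "no" would already score 0.80. The sketch below rebuilds the labels and predictions from the confusion matrix printed above (the row order within the arrays is an assumption, which does not affect these metrics) and compares the model with that majority-class baseline.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Rebuild labels and predictions from the confusion matrix [[14, 2], [4, 0]]
y_true = np.array([0] * 16 + [1] * 4)
y_model = np.array([0] * 14 + [1] * 2 + [0] * 4)  # 14 TN, 2 FP, 4 FN, 0 TP
y_baseline = np.zeros_like(y_true)                 # always predict "no"

model_acc = accuracy_score(y_true, y_model)
baseline_acc = accuracy_score(y_true, y_baseline)

# Balanced accuracy averages per-class recall, so it ignores class frequency
model_bal = balanced_accuracy_score(y_true, y_model)
baseline_bal = balanced_accuracy_score(y_true, y_baseline)

print("Accuracy          model:", model_acc, "baseline:", baseline_acc)
print("Balanced accuracy model:", model_bal, "baseline:", baseline_bal)
```

The model's 0.70 accuracy is below the 0.80 baseline, and its balanced accuracy of 0.4375 is below the 0.50 of a constant guess. That is expected here, because the simulated labels are generated independently of the features.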



In [17]:
# Hyperparameter Tuning

param_grid = {
    'n_estimators': [50, 100, 200],
    'learning_rate': [0.01, 0.1, 0.8, 1],
    'estimator__max_depth': [1, 2]
}

grid = GridSearchCV(AdaBoostClassifier(estimator=DecisionTreeClassifier(), random_state=42),
                    param_grid, cv=3, scoring='accuracy')
grid.fit(X_train, y_train)

print("Best Params:", grid.best_params_)
best_model = grid.best_estimator_


Best Params: {'estimator__max_depth': 1, 'learning_rate': 0.01, 'n_estimators': 50}


In [18]:
# Evaluation After Tuning

y_pred_tuned = best_model.predict(X_test)
print("Tuned Accuracy:", accuracy_score(y_test, y_pred_tuned))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred_tuned))
print("Classification Report:\n", classification_report(y_test, y_pred_tuned))



Tuned Accuracy: 0.8
Confusion Matrix:
 [[16  0]
 [ 4  0]]
Classification Report:
               precision    recall  f1-score   support

           0       0.80      1.00      0.89        16
           1       0.00      0.00      0.00         4

    accuracy                           0.80        20
   macro avg       0.40      0.50      0.44        20
weighted avg       0.64      0.80      0.71        20
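
The tuned confusion matrix shows that the model now predicts class 0 for every test row, so the 0.80 accuracy equals the majority-class rate and recall for subscribers is still 0. Optimizing plain accuracy on imbalanced data tends to reward this behavior. One possible adjustment, sketched below, is to tune on a class-aware metric with stratified folds. The data here is hypothetical (generated with `make_classification` so that it contains real signal), and the grid is deliberately small and leaves the base learner at its default depth-1 tree.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Hypothetical imbalanced data with real signal (roughly 70% / 30%)
X_sig, y_sig = make_classification(
    n_samples=300, n_features=7, n_informative=3,
    weights=[0.7, 0.3], random_state=42,
)

small_grid = {"n_estimators": [25, 50], "learning_rate": [0.1, 1.0]}

# Stratified folds keep the class ratio similar in every split
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=42)

search = GridSearchCV(
    AdaBoostClassifier(random_state=42),
    small_grid,
    cv=cv,
    scoring="balanced_accuracy",  # mean of per-class recall
)
search.fit(X_sig, y_sig)

print("Best params:", search.best_params_)
print("Best balanced accuracy (CV):", round(search.best_score_, 3))
```

On the notebook's own random labels no scoring choice can recover signal that is not there, but the same pattern applies unchanged to real bank marketing data.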



| Advantage                                 | Why It’s Great                                                  |
| ----------------------------------------- | --------------------------------------------------------------- |
| 🔍 Focuses on difficult cases             | Each round reweights the samples earlier learners misclassified |
| 🔁 Primarily reduces bias                 | Often better than a single weak model                           |
| 📦 Simple to implement                    | Scikit-learn supports it well                                   |
| 📉 Often resists overfitting on clean data | Test error can keep improving as more estimators are added      |


| Limitation                                     | Why It Hurts                                                            |
| ---------------------------------------------- | ----------------------------------------------------------------------- |
| ⚠️ Sensitive to noisy data                     | Mislabeled points and outliers keep gaining weight and can be overfit   |
| ❌ Slower than simpler models                   | Estimators are trained sequentially, so cost grows with `n_estimators`  |
| 🚫 `AdaBoostClassifier` is classification-only | Regression needs the separate `AdaBoostRegressor`                       |
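
On the regression point in the table above: scikit-learn ships a separate `AdaBoostRegressor` for continuous targets. The sketch below fits it to a hypothetical noisy sine curve. The data and variable names are made up for illustration, and the default base learner (a shallow regression tree) is left unchanged.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor

# Hypothetical 1-D regression problem: noisy sine wave
rng = np.random.default_rng(0)
X_reg = np.sort(rng.uniform(0, 6, size=(200, 1)), axis=0)
y_reg = np.sin(X_reg).ravel() + rng.normal(0, 0.1, size=200)

reg = AdaBoostRegressor(n_estimators=50, random_state=0)
reg.fit(X_reg, y_reg)

# R^2 on the training data, only to confirm the model fits the curve
r2_train = reg.score(X_reg, y_reg)
print("Training R^2:", round(r2_train, 3))
```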


**Real-World Use Cases:**

✅ Email spam detection

✅ Fraud detection in banking

✅ Medical diagnosis

✅ Customer churn prediction

| Step    | Description                                       |
| ------- | ------------------------------------------------- |
| Model   | AdaBoost (Boosted Decision Stumps)                |
| Dataset | Simulated bank marketing dataset                  |
| Goal    | Predict subscription to term deposit              |
| Tuning  | `n_estimators`, `learning_rate`, `estimator__max_depth` |
| Output  | Accuracy, confusion matrix, classification report |
