# Bank Marketing: Term Deposit Prediction
## A2: Classification Case Study

**Author:** [Your Name]  
**Date:** February 18, 2026  
**Objective:** Predict whether a client will subscribe to a term deposit (Target: `y`).

### Business Context
The bank wants to optimize its direct marketing campaigns. Telemarketing is expensive; calling customers who are unlikely to buy is a waste of resources. By predicting the propensity to subscribe (`y=1`), the bank can focus its effort on high-probability leads, increasing conversion rates and reducing costs.

### Analytical Approach
1.  **Data Preparation:** Handling categorical variables and the binary target.
2.  **Feature Engineering:** One-Hot Encoding for categorical data.
3.  **Modeling:** Testing candidate models (Logistic Regression, Decision Tree, Random Forest, Gradient Boosting), compared on **AUC** because the classes are imbalanced.
4.  **Evaluation:** Analyzing the **Train-Test Gap** to check whether a model is overfitting.

## 1. Imports and Setup

In [None]:
# Importing essential libraries
import pandas as pd                       # Data manipulation
import numpy as np                        # Mathematical operations
import matplotlib.pyplot as plt           # Base plotting
import seaborn as sns                     # Enhanced plotting

# Machine Learning Utils
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import confusion_matrix, roc_auc_score, classification_report, make_scorer

# Models
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.neural_network import MLPClassifier

# Suppress warnings for clean output
import warnings
warnings.filterwarnings('ignore')

# Pandas display options
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 1000)

print("Libraries imported successfully.")

## 2. Data Loading and Cleaning

In [None]:
# Load the dataset
# Note: The UCI Bank Marketing dataset uses a semicolon ';' delimiter
try:
    df = pd.read_csv('bank-full.csv', sep=';')
    print("Data loaded successfully.")
except FileNotFoundError:
    print("Error: 'bank-full.csv' not found. Please ensure the file is uploaded.")
    raise  # Re-raise: every later cell depends on df

# Map target 'y' to binary (1 = yes, 0 = no)
df['y_binary'] = df['y'].map({'yes': 1, 'no': 0})

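# DROP VARIABLES:
# 1. 'y': Original target (replaced by y_binary)
# 2. 'poutcome': Outcome of the previous campaign; dropped here as a conservative choice
#    Note: the UCI documentation flags 'duration' (call length, unknown before the call is made)
#    as the main leakage risk, so consider dropping it as well for a realistic model
df = df.drop(['y', 'poutcome'], axis=1)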

# Check Class Imbalance
print(f"\nDataset Shape: {df.shape}")
print("Target Variable Distribution:")
print(df['y_binary'].value_counts(normalize=True).round(4))
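
One caveat with the mapping step above: `Series.map` silently turns any label outside the dictionary into `NaN`. The sketch below uses a small made-up Series (not the bank data) to show one way to fail loudly when that happens. The helper name `to_binary` is hypothetical and not part of the original notebook.

```python
import pandas as pd

def to_binary(labels: pd.Series) -> pd.Series:
    """Map 'yes'/'no' labels to 1/0 and raise on anything else."""
    mapped = labels.map({'yes': 1, 'no': 0})
    # Labels that were not in the dictionary come back as NaN
    unmapped = labels[mapped.isna()].unique()
    if len(unmapped) > 0:
        raise ValueError(f"Unexpected target labels: {list(unmapped)}")
    return mapped.astype(int)

# Toy labels for illustration only
toy = pd.Series(['no', 'yes', 'no', 'no'])
toy_binary = to_binary(toy)
print(toy_binary.value_counts(normalize=True))
```

In the notebook, the same check could be applied to `df['y']` before the original column is dropped.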

## 3. Feature Engineering (One-Hot Encoding)
We convert categorical variables (text) into numerical dummy variables so the algorithms can process them.

In [None]:
# Identify column types
categorical_cols = df.select_dtypes(include=['object']).columns

# Create dummy variables
# drop_first=True avoids the "dummy variable trap" (multicollinearity) for linear models
df_encoded = pd.get_dummies(df, columns=categorical_cols, drop_first=True)

print(f"Shape after One-Hot Encoding: {df_encoded.shape}")
df_encoded.head(3)
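
To make the effect of `drop_first=True` concrete, here is a minimal sketch on a three-row toy frame. The values are invented for illustration; only the `pd.get_dummies` call mirrors the cell above.

```python
import pandas as pd

# Toy frame for illustration only
toy = pd.DataFrame({
    'age': [30, 45, 52],
    'marital': ['married', 'single', 'divorced'],
})

toy_encoded = pd.get_dummies(toy, columns=['marital'], drop_first=True)

# Categories are sorted alphabetically, so 'divorced' becomes the baseline:
# it has no column of its own and shows up as zeros in both dummy columns.
print(list(toy_encoded.columns))  # ['age', 'marital_married', 'marital_single']
print(toy_encoded)
```

The dropped category is still recoverable from the remaining dummies, which is why no information is lost for linear models.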

## 4. Train-Test Split
We use **Stratified Sampling** because the data is imbalanced (roughly 88% No / 12% Yes). This ensures the Train and Test sets contain the same proportion of "Yes" customers.

In [None]:
# 1. Separate Features (x) and Target (y)
y_target = df_encoded['y_binary']
x_features = df_encoded.drop('y_binary', axis=1)

# 2. Split Data
# Random State 219 is used for reproducibility
x_train, x_test, y_train, y_test = train_test_split(
    x_features,
    y_target,
    test_size=0.25,
    random_state=219,
    stratify=y_target  # CRITICAL for imbalanced data
)

print(f"Training Data: {x_train.shape}")
print(f"Testing Data:  {x_test.shape}")
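
As a quick sanity check on what `stratify` does, the sketch below splits synthetic labels with a 12% positive rate and compares the rate in each partition. The arrays `x_toy` and `y_toy` are placeholders, not the bank data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels: 880 negatives and 120 positives (12% positive rate)
y_toy = np.array([0] * 880 + [1] * 120)
x_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.25, random_state=219, stratify=y_toy
)

print(f"Train positive rate: {y_tr.mean():.3f}")
print(f"Test positive rate:  {y_te.mean():.3f}")
```

The same two `mean()` calls on `y_train` and `y_test` would confirm the split in the notebook itself.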

## 5. Model Candidate Loop
We test multiple models to find the best performer.

**Metric Focus:**
* **AUC (Area Under the ROC Curve):** The primary metric, because accuracy is misleading on imbalanced data.
* **Train-Test Gap:** The absolute difference between Train AUC and Test AUC. As a rule of thumb, a gap above 0.05 suggests **overfitting**.
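
The point about accuracy can be illustrated with a degenerate baseline. The sketch below scores a "model" that always predicts "No" on synthetic labels with the same 88/12 split. The numbers come from this toy setup, not from the bank data.

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: 880 negatives, 120 positives
y_true = np.array([0] * 880 + [1] * 120)

# A "model" that always predicts the majority class with a constant score
y_pred_all_no = np.zeros_like(y_true)
constant_scores = np.zeros(len(y_true), dtype=float)

baseline_accuracy = accuracy_score(y_true, y_pred_all_no)
baseline_auc = roc_auc_score(y_true, constant_scores)

print(f"Accuracy: {baseline_accuracy:.2f}")  # 0.88
print(f"AUC:      {baseline_auc:.2f}")       # 0.50
```

Accuracy rewards the baseline for the class balance alone, while AUC reports that it cannot rank a subscriber above a non-subscriber any better than chance.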

In [None]:
# Dictionary of models to test
models = {
    'Logistic Regression': LogisticRegression(max_iter=1000, random_state=42),
    'Decision Tree':       DecisionTreeClassifier(random_state=42, max_depth=8),
    'Random Forest':       RandomForestClassifier(random_state=42, n_estimators=100, max_depth=8),
    'Gradient Boosting':   GradientBoostingClassifier(random_state=42, learning_rate=0.1, max_depth=3)
}

# List to collect one result dict per model
model_results = []

print("Training models...\n")

for name, model in models.items():
    # Create a pipeline to scale data automatically
    # Scaling helps Logistic Regression converge and keeps its regularization comparable across features;
    # tree-based models are unaffected by it, so we apply it to all for consistency in this loop
    pipe = Pipeline([('scaler', StandardScaler()), ('clf', model)])

    # Fit the model
    pipe.fit(x_train, y_train)

    # Predict Probabilities (needed for AUC)
    # [:, 1] grabs the probability of the positive class (1)
    y_train_pred_proba = pipe.predict_proba(x_train)[:, 1]
    y_test_pred_proba  = pipe.predict_proba(x_test)[:, 1]

    # Calculate Scores (AUC)
    train_auc = roc_auc_score(y_train, y_train_pred_proba)
    test_auc  = roc_auc_score(y_test, y_test_pred_proba)

    # Calculate Gap
    gap = abs(train_auc - test_auc)

    # Store results
    model_results.append({
        'Model': name,
        'Train AUC': train_auc,
        'Test AUC': test_auc,
        'Gap': gap
    })

    print(f"{name} processed.")

# Create a results DataFrame and sort by Test AUC
results_df = pd.DataFrame(model_results)
results_df = results_df.sort_values(by='Test AUC', ascending=False)

print("\n--- Model Performance Metrics ---")
results_df.round(4)
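
A single train-test split gives one AUC estimate per model, which can be noisy. As a possible complement, the sketch below estimates AUC with stratified 5-fold cross-validation. It runs on synthetic data from `make_classification` standing in for `x_train` and `y_train`, and the names `x_demo`, `y_demo`, and `demo_pipe` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic imbalanced data standing in for x_train / y_train
x_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.88, 0.12], random_state=219
)

demo_pipe = Pipeline([
    ('scaler', StandardScaler()),
    ('clf', LogisticRegression(max_iter=1000, random_state=42)),
])

# Each fold keeps roughly the same positive rate as the full set
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=219)
cv_auc = cross_val_score(demo_pipe, x_demo, y_demo, cv=cv, scoring='roc_auc')

print(f"CV AUC: {cv_auc.mean():.4f} +/- {cv_auc.std():.4f}")
```

Because the scaler sits inside the pipeline, it is refit on each training fold, so no statistics from the held-out fold leak into training.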

## 6. Final Model Selection
Based on the results above, we select the model with the highest **Test AUC** that also maintains a reasonable **Gap** (indicating stability).

In [None]:
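# Select the best model: highest Test AUC among models whose Gap is within 0.05
# (results_df is already sorted by Test AUC; fall back to all models if none qualify)
stable_df = results_df[results_df['Gap'] <= 0.05]
best_model_name = (stable_df if not stable_df.empty else results_df).iloc[0]['Model']
print(f"The best performing model is: {best_model_name}")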

# Retrain the best model for final visualization (Confusion Matrix)
best_model = models[best_model_name]
final_pipe = Pipeline([('scaler', StandardScaler()), ('clf', best_model)])
final_pipe.fit(x_train, y_train)

# Predictions
y_pred = final_pipe.predict(x_test)

# Confusion Matrix Visualization
conf_matrix = confusion_matrix(y_test, y_pred)

plt.figure(figsize=(6, 4))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', cbar=False,
            xticklabels=['Predicted No', 'Predicted Yes'],
            yticklabels=['Actual No', 'Actual Yes'])
plt.title(f'Confusion Matrix: {best_model_name}')
plt.xlabel('Prediction')
plt.ylabel('Truth')
plt.show()

print("\nClassification Report:")
print(classification_report(y_test, y_pred))
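
Note that `predict` applies a default probability cutoff of 0.5, which on imbalanced data often yields low recall for the "Yes" class. The sketch below shows how precision and recall shift when the cutoff is lowered. The labels and probabilities are made up for illustration; in the notebook they would come from `y_test` and `final_pipe.predict_proba(x_test)[:, 1]`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Made-up labels and predicted probabilities for ten customers
y_true_demo = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 0])
proba_demo = np.array([0.05, 0.10, 0.20, 0.35, 0.15, 0.08, 0.62, 0.41, 0.33, 0.55])

threshold_metrics = {}
for threshold in (0.5, 0.3):
    # Label a customer "Yes" when the probability reaches the cutoff
    y_hat = (proba_demo >= threshold).astype(int)
    threshold_metrics[threshold] = (
        precision_score(y_true_demo, y_hat),
        recall_score(y_true_demo, y_hat),
    )
    p, r = threshold_metrics[threshold]
    print(f"threshold={threshold:.1f}  precision={p:.2f}  recall={r:.2f}")
```

In this toy case, lowering the cutoff from 0.5 to 0.3 raises recall from 0.33 to 1.00 while precision moves from 0.50 to 0.60. On real data the two usually trade off, so the cutoff is a business decision about call cost versus missed subscribers.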

## 7. Conclusion

1.  **Model Choice:** The top-ranked model in the results table (expected to be the **Gradient Boosting Classifier**) provided the best balance of AUC and stability.
2.  **Implication:** By using this model, the bank can rank customers by their predicted probability of subscribing.
3.  **Next Steps:** The marketing team should prioritize the top decile of customers ranked by this model, where the expected conversion rate, and therefore ROI, is highest.
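
The top-decile recommendation can be checked with a simple lift calculation: compare the conversion rate among the highest-scored 10% of customers with the overall rate. The following is a minimal sketch on synthetic scores. The helper `top_decile_summary` and the data are hypothetical; in the notebook the inputs would be `y_test` and the predicted probabilities from `final_pipe`.

```python
import numpy as np
import pandas as pd

def top_decile_summary(y_true, proba):
    """Compare the conversion rate in the top 10% of scores with the overall rate."""
    scored = pd.DataFrame({'y': y_true, 'proba': proba})
    scored = scored.sort_values('proba', ascending=False)
    n_top = max(1, len(scored) // 10)
    top_rate = scored['y'].head(n_top).mean()
    base_rate = scored['y'].mean()
    return {'top_rate': top_rate, 'base_rate': base_rate, 'lift': top_rate / base_rate}

# Synthetic scores: positives tend to score higher than negatives
rng = np.random.default_rng(219)
y_demo = (rng.random(1000) < 0.12).astype(int)
proba_demo = np.clip(0.1 + 0.4 * y_demo + rng.normal(0, 0.15, size=1000), 0, 1)

summary = top_decile_summary(y_demo, proba_demo)
print({k: round(float(v), 3) for k, v in summary.items()})
```

A lift well above 1 in the top decile would support concentrating calls there; a lift close to 1 would mean the ranking adds little over calling customers at random.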