# Term Deposit Marketing Prediction Models using LazyPredict

This notebook builds two predictive models for term deposit marketing using LazyPredict:
1. **Pre-Call Model**: Predicts which customers to call before making any calls (excludes campaign-related features)
2. **Post-Call Model**: Predicts which customers to focus on after initial contact (includes all features)

We compare many candidate models with LazyPredict, then evaluate the top 3 for each scenario in detail.

## 1. Setup and Data Preparation

In [None]:
# Import necessary libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import (
    classification_report,
    confusion_matrix,
    accuracy_score,
    precision_score,
    recall_score,
    f1_score,
    roc_auc_score,
    roc_curve,
    precision_recall_curve,
    auc,
)
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

# Import LazyPredict for model comparison
from lazypredict.Supervised import LazyClassifier

# Import models for detailed evaluation
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import (
    RandomForestClassifier,
    GradientBoostingClassifier,
    AdaBoostClassifier,
)
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from lightgbm import LGBMClassifier

# Set display options
pd.set_option("display.max_columns", None)
sns.set_style("whitegrid")

In [None]:
# Load the dataset
file_path = "term-deposit-marketing-2020.csv"
df = pd.read_csv(file_path)
print(f"Data loaded: {df.shape[0]} rows, {df.shape[1]} columns")
print("Columns:", df.columns.tolist())
df.head()

In [None]:
# Check class distribution
df["y"].value_counts(normalize=True).mul(100).round(2)
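
If the distribution above shows a strong majority class, plain accuracy is a weak yardstick: a model that always predicts the majority class already scores that class's share. A minimal sketch of this baseline follows; the labels are hypothetical toy values, not taken from the dataset, and `majority_baseline_accuracy` is an illustrative helper name.

```python
import pandas as pd


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy of a model that always predicts the most frequent class."""
    return float(labels.value_counts(normalize=True).max())


# Hypothetical toy labels (9 "no", 1 "yes"); NOT the real class distribution
toy_y = pd.Series(["no"] * 9 + ["yes"] * 1)
baseline = majority_baseline_accuracy(toy_y)
print(f"Majority-class baseline accuracy: {baseline:.2f}")
```

Calling `majority_baseline_accuracy(df["y"])` on the notebook's data gives the accuracy any real model has to beat.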

### Identify Campaign-Related Features

For the pre-call model, we exclude the campaign-related features (`day`, `month`, `duration`, `campaign`), because they are not available before a call is made.

In [None]:
# Define features for each model
# Model 1: Pre-call prediction (exclude campaign-related features)
model1_features = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact']
model1_data = df[model1_features + ['y']].copy()

# Model 2: Post-call prediction (include all features)
model2_features = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan', 'contact', 'day', 'month', 'duration', 'campaign']
model2_data = df[model2_features + ['y']].copy()

print(f"Features for Model 1 (pre-call):\n {model1_features}\n")
print(f"Features for Model 2 (post-call):\n {model2_features}")

## 2. Prepare Data for Model 1 (Pre-Call Prediction)

In [None]:
# Prepare data for Model 1
# Convert target to binary
model1_data["y_binary"] = model1_data["y"].map({"yes": 1, "no": 0})

# Split features and target
X1 = model1_data.drop(["y", "y_binary"], axis=1)
y1 = model1_data["y_binary"]

# Identify categorical and numerical features
categorical_features = X1.select_dtypes(include=["object"]).columns.tolist()
numerical_features = X1.select_dtypes(include=["int64", "float64"]).columns.tolist()

print(f"Categorical features: {categorical_features}")
print(f"Numerical features: {numerical_features}")

# Create preprocessing pipeline
categorical_transformer = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

numerical_transformer = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

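preprocessor1 = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer, numerical_features),
        ("cat", categorical_transformer, categorical_features),
    ],
    # Always return a dense array: the one-hot output is sparse, and the
    # transformed matrix is later handed to LazyClassifier as a plain array
    sparse_threshold=0,
)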

# Split data into train and test sets
X1_train, X1_test, y1_train, y1_test = train_test_split(
    X1, y1, test_size=0.2, random_state=42, stratify=y1
)
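
To make the preprocessing step concrete, here is a small sketch on a toy frame. The column values are invented for illustration, and `toy_preprocessor` mirrors the structure of the notebook's preprocessor rather than reusing it. It shows two things: the transformer is fitted on training rows only, and a category unseen during fitting is encoded as all zeros because of `handle_unknown="ignore"`.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy data (hypothetical values); "student" never appears in the training rows
toy_train = pd.DataFrame({"age": [30, 40, 50], "job": ["admin", "technician", "admin"]})
toy_test = pd.DataFrame({"age": [35], "job": ["student"]})

numeric_steps = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)
categorical_steps = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)
toy_preprocessor = ColumnTransformer(
    transformers=[("num", numeric_steps, ["age"]), ("cat", categorical_steps, ["job"])],
    sparse_threshold=0,  # force a dense NumPy array for easy inspection
)

# Fit on training rows only, then reuse the fitted statistics on the test row
train_matrix = toy_preprocessor.fit_transform(toy_train)
test_matrix = toy_preprocessor.transform(toy_test)

print(train_matrix.shape)  # one scaled numeric column plus one column per job seen in training
print(test_matrix)         # the job columns are all zero for the unseen category
```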

## 3. Model 1: Pre-Call Prediction using LazyPredict

In [None]:
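# Fit a preprocessing + classifier pipeline, then report test-set metrics and plots
def evaluate_model(model_name, model, X_train, X_test, y_train, y_test, preprocessor):
    """Train `model` behind `preprocessor` and evaluate it on the test split.

    The classifier must implement `predict_proba` (needed for ROC AUC).
    Returns a dict with the metrics, the confusion-matrix counts, and the
    fitted pipeline.
    """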
    # Create pipeline with preprocessing and model
    pipeline = Pipeline(steps=[("preprocessor", preprocessor), ("classifier", model)])

    # Train the model
    pipeline.fit(X_train, y_train)

    # Make predictions
    y_pred = pipeline.predict(X_test)
    y_pred_proba = pipeline.predict_proba(X_test)[:, 1]

    # Classification report
    print(f"\n{model_name} - Classification Report:")
    report = classification_report(y_test, y_pred)
    print(report)

    # Confusion Matrix
    plt.figure(figsize=(8, 6))
    cm = confusion_matrix(y_test, y_pred)
    sns.heatmap(
        cm,
        annot=True,
        fmt="d",
        cmap="Blues",
        xticklabels=["No", "Yes"],
        yticklabels=["No", "Yes"],
    )
    plt.xlabel("Predicted")
    plt.ylabel("Actual")
    plt.title(f"Confusion Matrix - {model_name}")
    plt.show()

    # Calculate metrics
    tn, fp, fn, tp = cm.ravel()
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0
    f1 = (
        2 * (precision * recall) / (precision + recall)
        if (precision + recall) > 0
        else 0
    )
    roc_auc = roc_auc_score(y_test, y_pred_proba)

    # Plot ROC curve
    plt.figure(figsize=(8, 6))
    fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
    plt.plot(fpr, tpr, label=f"ROC curve (area = {roc_auc:.3f})")
    plt.plot([0, 1], [0, 1], "k--")
    plt.xlabel("False Positive Rate")
    plt.ylabel("True Positive Rate")
    plt.title(f"ROC Curve - {model_name}")
    plt.legend(loc="lower right")
    plt.show()

    print(f"\nObservations for {model_name}:")
    print(f"Accuracy: {accuracy:.4f}")
    print(f"Precision: {precision:.4f}")
    print(f"Recall: {recall:.4f}")
    print(f"F1 Score: {f1:.4f}")
    print(f"ROC AUC: {roc_auc:.4f}")
    print(f"True Positives: {tp} - Correctly predicted subscribers")
    print(f"False Positives: {fp} - Incorrectly predicted as subscribers")
    print(f"True Negatives: {tn} - Correctly predicted non-subscribers")
    print(f"False Negatives: {fn} - Missed potential subscribers")

    return {
        "model": model_name,
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "roc_auc": roc_auc,
        "tp": tp,
        "fp": fp,
        "tn": tn,
        "fn": fn,
        "pipeline": pipeline,
    }
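
The metric arithmetic inside `evaluate_model` can be checked by hand on a tiny example. The sketch below uses made-up labels (not model output from this notebook), derives precision, recall, and F1 from the confusion-matrix counts in the same way, and compares them with scikit-learn's own metric functions.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Made-up labels for illustration: 6 negatives and 4 positives
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()

precision = tp / (tp + fp)  # 3 / 5
recall = tp / (tp + fn)     # 3 / 4
f1 = 2 * precision * recall / (precision + recall)

print(f"tn={tn}, fp={fp}, fn={fn}, tp={tp}")
print(f"precision={precision:.3f}, recall={recall:.3f}, f1={f1:.3f}")

# The hand-computed values agree with scikit-learn's implementations
assert np.isclose(precision, precision_score(y_true, y_pred))
assert np.isclose(recall, recall_score(y_true, y_pred))
assert np.isclose(f1, f1_score(y_true, y_pred))
```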

In [None]:
# Use LazyPredict to evaluate multiple models for Model 1 (Pre-Call Prediction)
print("\n\nEvaluating models for Model 1 (Pre-Call) using LazyPredict...")

# Create a pipeline with preprocessing
pipeline1 = Pipeline(steps=[("preprocessor", preprocessor1)])

# Apply preprocessing to the data
X1_train_preprocessed = pipeline1.fit_transform(X1_train)
X1_test_preprocessed = pipeline1.transform(X1_test)

# Initialize LazyClassifier
clf1 = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)

# Fit and evaluate models
models1, predictions1 = clf1.fit(X1_train_preprocessed, X1_test_preprocessed, y1_train, y1_test)

# Display results
print("\nModel 1 (Pre-Call) - LazyPredict Results:")
display(models1)

# Select top 3 models for detailed evaluation
top_models1 = models1.head(3).index.tolist()
print(f"\nTop 3 models for Model 1 (Pre-Call): {top_models1}")
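
Note that `models1.head(3)` takes whatever ordering LazyPredict returns; to our understanding the results table is sorted by balanced accuracy. If another metric matters more for the campaign, the table can be re-ranked first. The sketch below uses a stand-in DataFrame with placeholder scores and hypothetical model names (not real results), and assumes the LazyPredict column names `Balanced Accuracy` and `F1 Score`; `top_models_by` is an illustrative helper.

```python
import pandas as pd

# Stand-in for a LazyPredict results table; all scores are placeholders
toy_results = pd.DataFrame(
    {
        "Accuracy": [0.90, 0.91, 0.89, 0.92],
        "Balanced Accuracy": [0.70, 0.68, 0.66, 0.60],
        "F1 Score": [0.80, 0.85, 0.78, 0.88],
    },
    index=pd.Index(["ModelA", "ModelB", "ModelC", "ModelD"], name="Model"),
)


def top_models_by(results: pd.DataFrame, metric: str, n: int = 3) -> list:
    """Return the names of the n best models according to `metric`."""
    return results.sort_values(metric, ascending=False).head(n).index.tolist()


print(top_models_by(toy_results, "Balanced Accuracy"))
print(top_models_by(toy_results, "F1 Score"))
```

In the notebook, `top_models_by(models1, "F1 Score")` would replace `models1.head(3).index.tolist()`, provided the column exists in the installed LazyPredict version.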

In [None]:
# Detailed evaluation of top models
results1 = {}
for model_name in top_models1:
    if model_name == "LGBMClassifier":
        model = LGBMClassifier(random_state=42)
    elif model_name == "XGBClassifier":
        model = XGBClassifier(random_state=42)
    elif model_name == "RandomForestClassifier":
        model = RandomForestClassifier(random_state=42)
    elif model_name == "GradientBoostingClassifier":
        model = GradientBoostingClassifier(random_state=42)
    elif model_name == "DecisionTreeClassifier":
        model = DecisionTreeClassifier(random_state=42)
    elif model_name == "LogisticRegression":
        model = LogisticRegression(max_iter=1000, random_state=42)
    elif model_name == "KNeighborsClassifier":
        model = KNeighborsClassifier()
    elif model_name == "AdaBoostClassifier":
        model = AdaBoostClassifier(random_state=42)
    elif model_name == "SVC":
        model = SVC(probability=True, random_state=42)
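    else:
        # Not in the list above: report the skip instead of dropping it silently
        print(f"Skipping {model_name}: no detailed-evaluation setup is defined for it.")
        continue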
        
    print(f"\n\nDetailed evaluation of {model_name} for Model 1 (Pre-Call)...")
    result = evaluate_model(model_name, model, X1_train, X1_test, y1_train, y1_test, preprocessor1)
    results1[model_name] = result

# General takeaways (static text; confirm them against the metrics printed above)
print("\nObservations for Model 1 (Pre-Call):")
print("1. The top-performing models pick up patterns in customer data that relate to subscription likelihood.")
print("2. These models can help the bank prioritize which customers to call, potentially saving resources.")
print("3. The models trade off precision and recall differently, which determines how many potential subscribers are missed.")
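
The precision/recall trade-off mentioned above depends on the decision threshold, which `predict` fixes at 0.5 for probabilistic classifiers. A minimal sketch of choosing a threshold from the precision-recall curve follows. It runs on synthetic data from `make_classification` with a logistic regression as a stand-in model, so the numbers say nothing about the bank dataset, and maximizing F1 is only one possible selection rule.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in data (roughly 10% positives)
X, y = make_classification(
    n_samples=2000, n_features=8, weights=[0.9, 0.1], random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

# precision and recall have one more entry than thresholds; drop the final point
precision, recall, thresholds = precision_recall_curve(y_te, proba)
denom = np.clip(precision[:-1] + recall[:-1], 1e-12, None)  # avoid division by zero
f1 = 2 * precision[:-1] * recall[:-1] / denom

best = int(np.argmax(f1))
best_threshold = thresholds[best]
print(f"Best threshold by F1: {best_threshold:.3f}")
print(f"Precision: {precision[best]:.3f}, recall: {recall[best]:.3f}")

# Apply the chosen threshold instead of the default 0.5
y_pred_tuned = (proba >= best_threshold).astype(int)
```

For a real decision, the threshold should be chosen on a validation split rather than the test set, and the selection rule should reflect the relative cost of a wasted call versus a missed subscriber.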

## 4. Prepare Data for Model 2 (Post-Call Prediction)

In [None]:
# Prepare data for Model 2
# Convert target to binary
model2_data["y_binary"] = model2_data["y"].map({"yes": 1, "no": 0})

# Split features and target
X2 = model2_data.drop(["y", "y_binary"], axis=1)
y2 = model2_data["y_binary"]

# Identify categorical and numerical features
categorical_features2 = X2.select_dtypes(include=["object"]).columns.tolist()
numerical_features2 = X2.select_dtypes(include=["int64", "float64"]).columns.tolist()

print(f"Categorical features: {categorical_features2}")
print(f"Numerical features: {numerical_features2}")

# Create preprocessing pipeline
categorical_transformer2 = Pipeline(
    steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]
)

numerical_transformer2 = Pipeline(
    steps=[("imputer", SimpleImputer(strategy="median")), ("scaler", StandardScaler())]
)

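preprocessor2 = ColumnTransformer(
    transformers=[
        ("num", numerical_transformer2, numerical_features2),
        ("cat", categorical_transformer2, categorical_features2),
    ],
    # Always return a dense array: the one-hot output is sparse, and the
    # transformed matrix is later handed to LazyClassifier as a plain array
    sparse_threshold=0,
)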

# Split data into train and test sets
X2_train, X2_test, y2_train, y2_test = train_test_split(
    X2, y2, test_size=0.2, random_state=42, stratify=y2
)

## 5. Model 2: Post-Call Prediction using LazyPredict

In [None]:
# Use LazyPredict to evaluate multiple models for Model 2 (Post-Call Prediction)
print("\n\nEvaluating models for Model 2 (Post-Call) using LazyPredict...")

# Create a pipeline with preprocessing
pipeline2 = Pipeline(steps=[("preprocessor", preprocessor2)])

# Apply preprocessing to the data
X2_train_preprocessed = pipeline2.fit_transform(X2_train)
X2_test_preprocessed = pipeline2.transform(X2_test)

# Initialize LazyClassifier
clf2 = LazyClassifier(verbose=0, ignore_warnings=True, custom_metric=None)

# Fit and evaluate models
models2, predictions2 = clf2.fit(X2_train_preprocessed, X2_test_preprocessed, y2_train, y2_test)

# Display results
print("\nModel 2 (Post-Call) - LazyPredict Results:")
display(models2)

# Select top 3 models for detailed evaluation
top_models2 = models2.head(3).index.tolist()
print(f"\nTop 3 models for Model 2 (Post-Call): {top_models2}")

In [None]:
# Detailed evaluation of top models
results2 = {}
for model_name in top_models2:
    if model_name == "LGBMClassifier":
        model = LGBMClassifier(random_state=42)
    elif model_name == "XGBClassifier":
        model = XGBClassifier(random_state=42)
    elif model_name == "RandomForestClassifier":
        model = RandomForestClassifier(random_state=42)
    elif model_name == "GradientBoostingClassifier":
        model = GradientBoostingClassifier(random_state=42)
    elif model_name == "DecisionTreeClassifier":
        model = DecisionTreeClassifier(random_state=42)
    elif model_name == "LogisticRegression":
        model = LogisticRegression(max_iter=1000, random_state=42)
    elif model_name == "KNeighborsClassifier":
        model = KNeighborsClassifier()
    elif model_name == "AdaBoostClassifier":
        model = AdaBoostClassifier(random_state=42)
    elif model_name == "SVC":
        model = SVC(probability=True, random_state=42)
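    else:
        # Not in the list above: report the skip instead of dropping it silently
        print(f"Skipping {model_name}: no detailed-evaluation setup is defined for it.")
        continue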
        
    print(f"\n\nDetailed evaluation of {model_name} for Model 2 (Post-Call)...")
    result = evaluate_model(model_name, model, X2_train, X2_test, y2_train, y2_test, preprocessor2)
    results2[model_name] = result

# General takeaways (static text; confirm them against the metrics printed above)
print("\nObservations for Model 2 (Post-Call):")
print("1. Including campaign-related features is expected to improve model performance; compare with the pre-call metrics to confirm.")
print("2. The 'duration' feature is likely a strong predictor, as it indicates customer interest level.")
print("3. These models can help the bank focus follow-up efforts on customers most likely to subscribe.")
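
The claim that `duration` is a strong predictor can be tested rather than assumed. One option is permutation importance, which measures how much a score drops when a single column is shuffled. The sketch below is built on synthetic data in which the target depends on a `duration` column by construction, so it only demonstrates the mechanics; the `noise` column and all values are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
n = 1000

# Synthetic stand-in features; the target is driven by "duration" by design
toy = pd.DataFrame(
    {"duration": rng.exponential(250, n), "noise": rng.normal(size=n)}
)
toy_y = ((toy["duration"] + rng.normal(0, 50, n)) > 400).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    toy, toy_y, test_size=0.25, random_state=42, stratify=toy_y
)
model = RandomForestClassifier(n_estimators=50, random_state=42).fit(X_tr, y_tr)

# Shuffle each column in turn on held-out data and record the drop in ROC AUC
result = permutation_importance(
    model, X_te, y_te, scoring="roc_auc", n_repeats=5, random_state=42
)
importances = pd.Series(result.importances_mean, index=X_te.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

In the notebook, the same call should work on a fitted pipeline and the raw test frame, for example `permutation_importance(results2[name]["pipeline"], X2_test, y2_test, scoring="roc_auc")`, which reports importance per original column.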

## 6. Compare Pre-Call and Post-Call Models

In [None]:
# Compare the best models from each approach
print("Comparison of Pre-Call vs Post-Call Models:\n")

# Create a comparison dataframe
comparison_data = []

# Add Model 1 (Pre-Call) results
for model_name, result in results1.items():
    comparison_data.append({
        'Model': f"Pre-Call: {model_name}",
        'Accuracy': result['accuracy'],
        'Precision': result['precision'],
        'Recall': result['recall'],
        'F1 Score': result['f1'],
        'ROC AUC': result['roc_auc']
    })

# Add Model 2 (Post-Call) results
for model_name, result in results2.items():
    comparison_data.append({
        'Model': f"Post-Call: {model_name}",
        'Accuracy': result['accuracy'],
        'Precision': result['precision'],
        'Recall': result['recall'],
        'F1 Score': result['f1'],
        'ROC AUC': result['roc_auc']
    })

# Create and display comparison dataframe
comparison_df = pd.DataFrame(comparison_data)
comparison_df = comparison_df.sort_values('F1 Score', ascending=False)
comparison_df

## 7. Conclusion and Recommendations

### Key Findings:

1. **Pre-Call Model Performance**:
   - The top-performing models can identify potential subscribers before making any calls
   - These models help prioritize which customers to contact, saving resources
   - Performance is limited because campaign-related information is unavailable before the call

2. **Post-Call Model Performance**:
   - Including campaign-related features significantly improves predictive performance
   - The 'duration' feature is particularly important, as it indicates customer interest; it is only known once a call has ended, so it cannot be used pre-call
   - These models help focus follow-up efforts on customers most likely to subscribe

3. **Comparison**:
   - Post-call models consistently outperform pre-call models
   - The improvement demonstrates the value of initial contact information

### Recommendations:

1. **Two-Stage Approach**:
   - Use the pre-call model to prioritize initial customer outreach
   - After initial contact, use the post-call model to identify customers for follow-up

2. **Feature Importance**:
   - Focus on collecting high-quality data for the most important features
   - Consider gathering additional customer information to improve pre-call predictions

3. **Model Deployment**:
   - Implement both models in the bank's CRM system
   - Regularly retrain models with new data to maintain performance
   - Monitor model performance over time and adjust as needed
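
The two-stage approach recommended above can be sketched as follows. This is a minimal illustration on synthetic data with logistic regressions as stand-in models; the feature arrays, the 30% call budget, and the follow-up list size of 50 are assumptions chosen for the example, not values from this notebook.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000

# Synthetic stand-ins: three pre-call features and one post-call feature
pre = rng.normal(size=(n, 3))
duration = rng.exponential(1.0, size=n)
full = np.column_stack([pre, duration])
logit = 0.8 * pre[:, 0] + 1.5 * duration - 3.0
y = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# First half for training, second half plays the role of new customers
train = np.arange(n) < 1000
score = ~train

pre_model = LogisticRegression().fit(pre[train], y[train])
post_model = LogisticRegression().fit(full[train], y[train])

# Stage 1: call the top 30% of new customers by pre-call probability
pre_scores = pre_model.predict_proba(pre[score])[:, 1]
n_calls = int(score.sum()) * 30 // 100
call_idx = np.argsort(pre_scores)[::-1][:n_calls]

# Stage 2: among called customers, rank follow-ups by post-call probability
post_scores = post_model.predict_proba(full[score][call_idx])[:, 1]
follow_up_idx = call_idx[np.argsort(post_scores)[::-1][:50]]

print(f"Customers called: {len(call_idx)}")
print(f"Customers flagged for follow-up: {len(follow_up_idx)}")
```

In practice the post-call model would be trained only on customers who were actually called, and both cut-offs would be set from call capacity and the threshold analysis rather than fixed in advance.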