# Bank Marketing Campaign Optimization

**Assignment:** Data Mining Solutions for Direct Marketing Campaigns  
**Course:** CIS051-3 Business Analytics  
**Objective:** Minimize campaign cost by optimizing customer targeting using Decision Tree and Logistic Regression

---

## Executive Summary

This notebook implements a complete machine learning pipeline for optimizing bank telemarketing campaigns. The analysis processes 41,188 customer records to predict term deposit subscriptions, achieving:

- **81.1% customer capture rate** (Recall)
- **0.516 average cost per contact** (optimized for business objectives)
- **0.804 ROC-AUC** (strong discrimination ability)

**Winner Model:** Logistic Regression with cost-sensitive threshold optimization (threshold=0.34)

---

## Table of Contents

1. [Setup & Data Loading](#1-setup)
2. [Exploratory Data Analysis](#2-eda)
3. [Data Preprocessing & Feature Engineering](#3-preprocessing)
4. [Baseline Models](#4-baseline)
5. [Hyperparameter Optimization](#5-tuning)
6. [Cost-Sensitive Threshold Optimization](#6-cost)
7. [Model Interpretability](#7-interpretability)
8. [Final Results & Model Selection](#8-results)

---

**Note:** This notebook was executed via an automated script (`run_all.py`). All outputs, visualizations (21 images in `assets/`), and trained models are saved. See `report.md` for the complete academic writeup.

## 1. Setup & Data Loading {#1-setup}

```python
# Import required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

# Machine Learning
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.metrics import (
    accuracy_score, precision_score, recall_score, f1_score, roc_auc_score,
    confusion_matrix, ConfusionMatrixDisplay, RocCurveDisplay
)

# Settings
warnings.filterwarnings('ignore')
plt.style.use('seaborn-v0_8-darkgrid')
sns.set_palette('husl')
np.random.seed(42)

print('Libraries imported successfully!')
```

```python
# Load dataset (semicolon-separated CSV)
df = pd.read_csv('input/4-data.csv', sep=';')

print(f'Dataset Shape: {df.shape}')
print(f'Rows: {df.shape[0]:,} | Columns: {df.shape[1]}')
print('\nTarget Distribution:')
print(df['y'].value_counts())
print('\nTarget Percentages:')
print(df['y'].value_counts(normalize=True) * 100)

df.head()
```

### Dataset Overview

- **41,188 records** from Portuguese bank telemarketing campaigns (May 2008 to November 2010)
- **21 columns:** 20 input features (demographics, campaign info, economic indicators) plus the target
- **Target:** Term deposit subscription (yes/no)
- **Class Imbalance:** 11.3% positive (yes) vs 88.7% negative (no)

**Key Challenge:** Severe class imbalance requires specialized handling (stratified split, balanced class weights, cost-sensitive threshold optimization).

## 2. Exploratory Data Analysis {#2-eda}

**Objective:** Understand data patterns, distributions, and relationships to inform preprocessing and modeling decisions.

**Key Findings:**
1. Severe class imbalance (11.3% / 88.7%) requires mitigation
2. **Duration variable shows data leakage** (call duration is only known after the call) → exclude
3. Macroeconomic indicators (employment, confidence, interest rates) are strong predictors
4. Previous contact history matters significantly
5. Seasonality present (March optimal, May suboptimal despite volume)

**Visualizations Created:** 12 images saved to `assets/01-12_*.png`

![Class Distribution](assets/01_class_distribution.png)
*Figure 1: Severe class imbalance - 88.7% rejection vs 11.3% acceptance*

![Duration Leakage](assets/12_duration_leakage_analysis.png)
*Figure 2: Duration correlates strongly with the target but represents data leakage - excluded from models*

## 3. Data Preprocessing & Feature Engineering {#3-preprocessing}

**Steps:**
1. **Missing Values:** Replaced 'unknown' with NaN, then imputed (mean for numeric, mode for categorical)
2. **Feature Engineering:**
   - Dropped `duration` (data leakage)
   - Created `was_contacted_before` (binary flag derived from `pdays`)
   - Created `campaign_log`, `previous_log` (log transformations)
3. **Encoding:** One-hot encoding for categoricals (`drop_first=True`)
4. **Scaling:** StandardScaler for numerics
5. **Train-Test Split:** 75/25 stratified split

**Result:**
- Training: 30,891 samples
- Test: 10,297 samples
- Features: 49 (after encoding)
- Class distribution maintained (11.3% positive in both sets)
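
The preprocessing steps listed above could be sketched roughly as follows. This is a minimal illustration on a tiny made-up frame, not the code in `run_all.py`: the `preprocess` helper and the toy values are hypothetical, and it assumes that `pdays == 999` marks customers with no previous contact (the convention in the UCI bank marketing data) and that the scaler is fitted on the training split only.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Tiny synthetic stand-in for the real data (all values are made up).
raw = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "admin.",
            "services", "admin.", "technician", "services"],
    "pdays": [999, 3, 999, 999, 6, 999, 999, 2],
    "campaign": [1, 2, 5, 1, 3, 1, 2, 4],
    "previous": [0, 1, 0, 0, 2, 0, 0, 1],
    "duration": [120, 300, 45, 80, 410, 60, 150, 220],
    "y": ["no", "yes", "no", "no", "yes", "no", "no", "yes"],
})


def preprocess(df):
    """Hypothetical helper mirroring the listed steps."""
    df = df.drop(columns=["duration"])            # leakage: known only after the call
    y = (df.pop("y") == "yes").astype(int)        # binary target

    # Treat 'unknown' as missing, then impute categoricals with the mode.
    for col in df.select_dtypes(exclude="number").columns:
        values = df[col].mask(df[col] == "unknown")
        df[col] = values.fillna(values.mode()[0])

    # Engineered features (assumes pdays == 999 means "never contacted").
    df["was_contacted_before"] = (df["pdays"] != 999).astype(int)
    df["campaign_log"] = np.log1p(df["campaign"])
    df["previous_log"] = np.log1p(df["previous"])

    X = pd.get_dummies(df, drop_first=True)       # one-hot encode categoricals
    return X, y


X, y = preprocess(raw)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Fit the scaler on the training split only, then apply it to both splits.
num_cols = ["campaign_log", "previous_log"]
scaler = StandardScaler().fit(X_train[num_cols])
X_train_scaled = scaler.transform(X_train[num_cols])
X_test_scaled = scaler.transform(X_test[num_cols])
```

Fitting the scaler on the training rows alone keeps test-set statistics out of the model inputs, which matters for the same reason `duration` is dropped.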

## 4. Baseline Models {#4-baseline}

**Decision Tree (Entropy, Balanced Weights):**
- Accuracy: 0.8425
- Precision: 0.4167
- **Recall: 0.3328** (only capturing 33% of customers)
- F1-Score: 0.3705
- ROC-AUC: 0.7627

**Logistic Regression (L2, Balanced Weights):**
- Accuracy: 0.8346
- Precision: 0.3662
- **Recall: 0.6440** (capturing 64% of customers)
- F1-Score: 0.4678
- ROC-AUC: 0.8038

**Observation:** Logistic Regression shows superior baseline recall, making it a strong foundation for optimization.

![Baseline Confusion Matrices](assets/13_baseline_confusion_matrices.png)
*Figure 3: Baseline models - LR captures more true positives*
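
A baseline comparison of this kind might look like the sketch below. It runs on synthetic data from `make_classification` with a similar class ratio, so its numbers will not match the results reported above; the `evaluate` helper is hypothetical, and the model settings are only those named in this section (entropy criterion, balanced class weights, default L2 penalty).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (
    accuracy_score, f1_score, precision_score, recall_score, roc_auc_score
)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 11% positives), standing in for the real set.
X, y = make_classification(
    n_samples=4000, n_features=10, n_informative=5,
    weights=[0.887, 0.113], random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

models = {
    "Decision Tree": DecisionTreeClassifier(
        criterion="entropy", class_weight="balanced", random_state=42
    ),
    # LogisticRegression uses an L2 penalty by default.
    "Logistic Regression": LogisticRegression(
        class_weight="balanced", max_iter=1000
    ),
}


def evaluate(model):
    """Fit on the training split and score on the held-out split."""
    model.fit(X_train, y_train)
    y_pred = model.predict(X_test)                # default 0.50 threshold
    y_prob = model.predict_proba(X_test)[:, 1]    # positive-class probability
    return {
        "accuracy": accuracy_score(y_test, y_pred),
        "precision": precision_score(y_test, y_pred),
        "recall": recall_score(y_test, y_pred),
        "f1": f1_score(y_test, y_pred),
        "roc_auc": roc_auc_score(y_test, y_prob),
    }


results = {name: evaluate(model) for name, model in models.items()}
for name, metrics in results.items():
    print(name, {k: round(v, 4) for k, v in metrics.items()})
```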

## 5. Hyperparameter Optimization {#5-tuning}

**Decision Tree GridSearchCV:**
- Parameters: max_depth, min_samples_leaf, min_samples_split, ccp_alpha, class_weight
- CV: 5-fold StratifiedKFold, scoring='roc_auc'
- Best params: `{'ccp_alpha': 0.001, 'class_weight': 'balanced', 'max_depth': None, 'min_samples_leaf': 1, 'min_samples_split': 2}`
- **Recall improved: 0.3328 → 0.6207** (+87% relative improvement)

**Logistic Regression GridSearchCV:**
- Parameters: C, penalty, solver
- Best params: `{'C': 0.1, 'penalty': 'l2', 'solver': 'lbfgs'}`
- Recall unchanged at 0.6440; the tuned model's test metrics match the baseline

**Key Insight:** The Decision Tree required significant tuning (chiefly cost-complexity pruning via `ccp_alpha`), while LR was robust from baseline.
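
As a rough illustration of the Decision Tree search described above, the following sketch runs `GridSearchCV` with 5-fold `StratifiedKFold` and `scoring='roc_auc'` on synthetic data. The candidate values in `param_grid` are assumptions chosen for illustration; the source lists only the parameter names and the best combination, not the full grid.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for the training split.
X, y = make_classification(
    n_samples=1500, n_features=10, n_informative=5,
    weights=[0.887, 0.113], random_state=42,
)

# Hypothetical candidate values; only the parameter names come from the text.
param_grid = {
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 20],
    "min_samples_split": [2, 20],
    "ccp_alpha": [0.0, 0.001],
    "class_weight": ["balanced"],
}

cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=42),
    param_grid,
    scoring="roc_auc",   # rank candidates by cross-validated ROC-AUC
    cv=cv,
)
search.fit(X, y)

print("Best params:", search.best_params_)
print("Best CV ROC-AUC:", round(search.best_score_, 4))
```

Scoring on ROC-AUC rather than accuracy avoids rewarding a model that simply predicts the majority class, which is a common reason to choose it on imbalanced data.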

## 6. Cost-Sensitive Threshold Optimization {#6-cost}

**Business Cost Matrix:**
- False Positive (unnecessary call): **+1.5**
- False Negative (missed customer): **+20.0**
- True Positive (successful sale): **-5.0** (revenue)
- True Negative (correct avoid): **0.0**

**Methodology:** Swept decision thresholds from 0.01 to 0.99, computed the expected cost at each, and selected the threshold with the minimum cost.

**Results:**
- **Decision Tree:** Optimal threshold=0.34, Cost=0.552, Recall=0.694
- **Logistic Regression:** Optimal threshold=0.34, **Cost=0.516**, **Recall=0.811** ← WINNER

**Business Impact:** At the optimal threshold, LR captures **81.1% of customers** (vs 33% for the DT baseline and 64% for the LR baseline, both at the default 0.50 threshold), missing only 18.9% while maintaining cost efficiency.

![Cost Optimization](assets/17_cost_threshold_optimization.png)
*Figure 4: Cost-sensitive threshold optimization - LR achieves lowest cost at 0.34*
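
The threshold sweep can be sketched as below, using the cost matrix from this section. The `average_cost` and `best_threshold` helpers are hypothetical names, the toy labels and probabilities are made up, and averaging the total cost over all scored customers is an assumption about how the reported cost figures were normalized.

```python
import numpy as np

# Cost matrix from the text (negative cost = revenue).
COST_FP, COST_FN, COST_TP, COST_TN = 1.5, 20.0, -5.0, 0.0


def average_cost(y_true, y_prob, threshold):
    """Average cost per scored customer at a given decision threshold."""
    y_true = np.asarray(y_true)
    y_pred = (np.asarray(y_prob) >= threshold).astype(int)
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    total = COST_FP * fp + COST_FN * fn + COST_TP * tp + COST_TN * tn
    return total / len(y_true)


def best_threshold(y_true, y_prob):
    """Sweep thresholds 0.01..0.99 and return (threshold, cost) at the minimum."""
    thresholds = np.linspace(0.01, 0.99, 99)
    costs = np.array([average_cost(y_true, y_prob, t) for t in thresholds])
    i = int(np.argmin(costs))  # first threshold reaching the minimum cost
    return float(thresholds[i]), float(costs[i])


# Toy example: three subscribers among eight customers.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
y_prob = [0.052, 0.104, 0.203, 0.301, 0.352, 0.404, 0.603, 0.905]

threshold, cost = best_threshold(y_true, y_prob)
print(f"Best threshold: {threshold:.2f}, average cost: {cost:.4f}")
```

Because a missed subscriber (20.0) costs far more than an unnecessary call (1.5), the minimum tends to sit well below 0.50: the sweep accepts extra false positives to avoid false negatives.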

## 7. Model Interpretability {#7-interpretability}

### Decision Tree Feature Importance (Top 5):
1. **nr.employed** (Number of employees): 67.4%
2. **cons.conf.idx** (Consumer confidence): 13.0%
3. **was_contacted_before**: 5.4%
4. **euribor3m** (Interest rate): 3.4%
5. **cons.price.idx** (CPI): 2.7%

### Logistic Regression Top Coefficients:
**Most Positive (increase acceptance):**
- month_mar: +1.07
- cons.price.idx: +0.77
- poutcome_success: +0.52

**Most Negative (decrease acceptance):**
- emp.var.rate: -1.69
- month_may: -0.72
- contact_telephone: -0.64

### Business Insights:
1. **Economic timing is critical** - employment indicators dominate predictions
2. **Previous contact history matters** - warm leads perform better
3. **March is optimal** for campaigns (despite May having highest volume)
4. **Cellular > Telephone** for contact method
5. **Demographics secondary** to economic context

![Feature Importance](assets/18_dt_feature_importance.png)
*Figure 5: Employment level dominates predictions (67% importance)*

![LR Coefficients](assets/20_lr_coefficients.png)
*Figure 6: March positive, May negative despite volume - data-driven insights*
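
Rankings like the ones in this section are typically read from a fitted tree's `feature_importances_` and a fitted logistic regression's `coef_`. The sketch below shows that extraction on synthetic data with placeholder feature names, so the values are illustrative only. It assumes the inputs were standardized, which is what makes coefficient magnitudes roughly comparable across features.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with placeholder feature names (not the real columns).
X, y = make_classification(
    n_samples=2000, n_features=6, n_informative=3, n_redundant=0,
    weights=[0.887, 0.113], random_state=42,
)
names = [f"feature_{i}" for i in range(X.shape[1])]
X = StandardScaler().fit_transform(X)

# Impurity-based importances from the tree, largest first (they sum to 1).
tree = DecisionTreeClassifier(
    max_depth=4, class_weight="balanced", random_state=42
).fit(X, y)
importances = pd.Series(tree.feature_importances_, index=names)
importances = importances.sort_values(ascending=False)

# Signed coefficients from the logistic regression, most negative first.
logreg = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)
coefs = pd.Series(logreg.coef_[0], index=names).sort_values()

print(importances.head(5))
print("Most negative:", coefs.head(3).round(2).to_dict())
print("Most positive:", coefs.tail(3).round(2).to_dict())
```

A positive coefficient raises the predicted log-odds of subscription and a negative one lowers them; tree importances are unsigned and only indicate how much a feature contributed to the splits.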

## 8. Final Results & Model Selection {#8-results}

### Comprehensive Comparison:

| Model | Stage | Accuracy | Precision | Recall | F1 | ROC-AUC | Cost | Threshold |
|-------|-------|----------|-----------|--------|-----|---------|------|-----------|
| DT | Baseline | 0.8425 | 0.4167 | 0.3328 | 0.3705 | 0.7627 | - | 0.50 |
| DT | Tuned | 0.8631 | 0.4259 | 0.6207 | 0.5053 | 0.8014 | - | 0.50 |
| DT | Optimized | 0.7896 | 0.3518 | 0.6940 | 0.4676 | 0.8014 | 0.552 | 0.34 |
| LR | Baseline | 0.8346 | 0.3662 | 0.6440 | 0.4678 | 0.8038 | - | 0.50 |
| LR | Tuned | 0.8346 | 0.3662 | 0.6440 | 0.4678 | 0.8038 | - | 0.50 |
| **LR** | **Optimized** | **0.7699** | **0.3262** | **0.8112** | **0.4658** | **0.8038** | **0.516** | **0.34** |

### WINNER: Logistic Regression (Cost-Optimized)

**Selection Rationale:**
1. **Lowest cost:** 0.516 per customer (vs 0.552 for DT)
2. **Highest recall:** 81.1% customer capture
3. **Best probability calibration:** Enables accurate threshold optimization
4. **Consistent performance:** Strong from baseline through optimization
5. **Computational efficiency:** Fast scoring for real-time deployment

### Business Impact (10,000 customer campaign):
- **Expected acceptors:** 1,130 (11.3% base rate)
- **LR Optimized captures:** 917 customers (81.1%)
- **DT Baseline would capture:** 376 customers (33.3%)
- **Improvement:** +541 customers captured (+144% vs DT baseline)

![ROC Comparison](assets/21_roc_comparison_final.png)
*Figure 7: ROC curve progression - LR maintains strong AUC throughout optimization*
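
The business-impact figures for a 10,000-customer campaign follow directly from the base rate and the two recall values in the comparison table. A short worked check is given below; the variable names are illustrative.

```python
campaign_size = 10_000
base_rate = 0.113              # share of customers who subscribe
recall_lr_optimized = 0.8112   # LR at threshold 0.34
recall_dt_baseline = 0.3328    # DT at the default 0.50 threshold

expected_acceptors = round(campaign_size * base_rate)
lr_captured = round(expected_acceptors * recall_lr_optimized)
dt_captured = round(expected_acceptors * recall_dt_baseline)

extra_customers = lr_captured - dt_captured
relative_gain = extra_customers / dt_captured   # gain relative to DT baseline
capture_ratio = lr_captured / dt_captured       # LR captures this many times DT

print(f"Expected acceptors: {expected_acceptors:,}")
print(f"LR optimized captures: {lr_captured}")
print(f"DT baseline captures: {dt_captured}")
print(f"Improvement: +{extra_customers} ({relative_gain:+.0%}), "
      f"ratio {capture_ratio:.1f}x")
```

The +144% relative gain and the 2.4x capture ratio describe the same difference: 917 captured customers are about 2.44 times the 376 captured by the baseline tree.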

## Conclusions

This project developed a cost-optimized predictive model for bank telemarketing campaigns, achieving:

1. **81.1% customer capture rate** (vs 33-64% baseline)
2. **0.516 average cost per contact** (optimized for business objectives)
3. **0.804 ROC-AUC** (competitive with state-of-art)
4. **Actionable insights:** Economic timing, warm lead prioritization, seasonal optimization

### Key Recommendations:
1. **Monitor macroeconomic indicators** - launch during stable employment periods
2. **Prioritize warm leads** - previous contact increases acceptance 10x
3. **Focus on March** campaigns (highest coefficient despite May volume)
4. **Use cellular contact** when available (telephone shows negative effect)
5. **Deploy with threshold=0.34** for optimal cost-benefit balance

### Deliverables:
- ✓ **21 visualizations** (assets/)
- ✓ **Trained models** (output/best_*.pkl)
- ✓ **Complete report** (report.md, ~5,500 words)
- ✓ **Reproducible code** (run_all.py)

**Model ready for deployment: it is expected to capture about 2.4x as many subscribers as the baseline Decision Tree (917 vs 376 per 10,000 customers).**

## References

See `report.md` Section 5 for the complete reference list, including:
- Moro et al. (2014) - Bank Marketing Dataset (UCI)
- Scikit-learn documentation and papers
- Cost-sensitive learning literature
- Classification and regression trees references

---

*Notebook created for CIS051-3 Business Analytics Assignment 1*  
*All code executed via automated pipeline - see run_all.py*