# Practical Application III: Comparing Classifiers

**Overview**: In this practical application, your goal is to compare the performance of the classifiers we encountered in this section, namely K Nearest Neighbor, Logistic Regression, Decision Trees, and Support Vector Machines.  We will utilize a dataset related to marketing bank products over the telephone.  



### Getting Started

Our dataset comes from the UCI Machine Learning repository [link](https://archive.ics.uci.edu/ml/datasets/bank+marketing).  The data is from a Portuguese banking institution and is a collection of the results of multiple marketing campaigns.  We will make use of the article accompanying the dataset [here](CRISP-DM-BANK.pdf) for more information on the data and features.



### Problem 1: Understanding the Data

To gain a better understanding of the data, please read the information provided in the UCI link above, and examine the **Materials and Methods** section of the paper.  How many marketing campaigns does this data represent?

**Answer (from the UCI + CRISP-DM companion paper):**  
This dataset aggregates the outcomes of **17 phone marketing campaigns** run by the Portuguese bank (covering roughly May 2008 to November 2010).  
The modeling goal is to learn, from client attributes, contact history, and economic context, which clients are most likely to subscribe to the term deposit (`y = yes`).

### Problem 2: Read in the Data

Use pandas to read in the dataset `bank-additional-full.csv` and assign to a meaningful variable name.

In [None]:
# Core libraries
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Modeling
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_validate, GridSearchCV
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import (
    accuracy_score, roc_auc_score, classification_report, ConfusionMatrixDisplay
)
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC

import time


In [None]:
# Read the dataset (UCI Bank Marketing - bank-additional-full)
# Note: This file uses ';' as the delimiter.
# The path below is environment-specific; point it at your local copy of the file.
df = pd.read_csv('/mnt/data/bank-additional-full.csv', sep=';')
df.shape


In [None]:
df.head()

### Problem 3: Understanding the Features


Examine the data description below, and determine if any of the features are missing values or need to be coerced to a different data type.


```
Input variables:
# bank client data:
1 - age (numeric)
2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')
3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)
4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')
5 - default: has credit in default? (categorical: 'no','yes','unknown')
6 - housing: has housing loan? (categorical: 'no','yes','unknown')
7 - loan: has personal loan? (categorical: 'no','yes','unknown')
# related with the last contact of the current campaign:
8 - contact: contact communication type (categorical: 'cellular','telephone')
9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')
10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')
11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.
# other attributes:
12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)
13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)
14 - previous: number of contacts performed before this campaign and for this client (numeric)
15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')
# social and economic context attributes
16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)
17 - cons.price.idx: consumer price index - monthly indicator (numeric)
18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)
19 - euribor3m: euribor 3 month rate - daily indicator (numeric)
20 - nr.employed: number of employees - quarterly indicator (numeric)

Output variable (desired target):
21 - y - has the client subscribed a term deposit? (binary: 'yes','no')
```
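The description above implies two conventions that a plain `isna()` check will not catch: missing categorical values are stored as the string `'unknown'`, and `pdays` uses `999` as a sentinel for "not previously contacted". The sketch below shows one possible way to surface and handle both. It runs on a small hand-made frame, not on the real file; the frame `toy` and the derived columns `previously_contacted` and `pdays_clean` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows that mimic the documented conventions (not real records).
toy = pd.DataFrame({
    "job": ["admin.", "unknown", "technician", "services"],
    "default": ["no", "unknown", "no", "unknown"],
    "pdays": [999, 3, 999, 6],
})

# 1) Share of 'unknown' placeholders per column; isna() alone would report zero.
unknown_share = (toy == "unknown").mean()

# 2) Optionally convert the placeholders into real missing values.
toy_nan = toy.replace("unknown", np.nan)

# 3) Split the pdays sentinel into a 0/1 flag plus a day count that is NaN
#    when the client was never contacted before.
toy_nan["previously_contacted"] = (toy_nan["pdays"] != 999).astype(int)
toy_nan["pdays_clean"] = toy_nan["pdays"].where(toy_nan["pdays"] != 999)

print(unknown_share)
print(toy_nan)
```

Whether to keep `'unknown'` as its own category or treat it as missing is a modeling choice; the notebook keeps it as a category, which one-hot encoding handles without extra steps.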



In [None]:
# Quick audit: missing values vs. 'unknown' placeholders
na_counts = df.isna().sum().sort_values(ascending=False)
unknown_counts = (df == 'unknown').sum(numeric_only=False).sort_values(ascending=False)

display(na_counts.head(10))
display(unknown_counts.head(10))

# Target distribution (class imbalance check)
display(df['y'].value_counts(normalize=True))

# Sanity check data types
df.dtypes


### Problem 4: Understanding the Task

After examining the description and data, your goal now is to clearly state the *Business Objective* of the task.  State the objective below.

In [None]:
df.info()

In [None]:
# Descriptive statistics
display(df.describe(include='number').T)

# A few quick visual checks
plt.figure(figsize=(6,4))
plt.hist(df['age'], bins=30)
plt.title('Age distribution')
plt.xlabel('Age')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(6,4))
sns.countplot(data=df, x='y')
plt.title('Target distribution (term deposit subscription)')
plt.xlabel('Subscribed? (y)')
plt.ylabel('Count')
plt.show()

plt.figure(figsize=(10,4))
sns.countplot(data=df, x='job', hue='y', order=df['job'].value_counts().index)
plt.title('Subscription outcome by job')
plt.xlabel('Job')
plt.ylabel('Count')
plt.xticks(rotation=45, ha='right')
plt.legend(title='y')
plt.tight_layout()
plt.show()


**Business Objective (Problem 4):**  
The bank wants to **increase term-deposit subscriptions while reducing calling cost**.  
Given a client’s profile, prior contact history, and macro-economic context, we will build a classifier that **predicts whether the client will subscribe (`y = yes`)**.  
Operationally, this can be used to **prioritize outbound calls** (or tailor scripts/offers) toward higher-probability clients and reduce calls to low-probability clients.

### Problem 5: Engineering Features

Now that you understand your business objective, we will build a basic model to get started.  Before we can do this, we must work to encode the data.  Using just the bank information features, prepare the features and target column for modeling with appropriate encoding and transformations.

In [None]:
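# Separate features and target
# 'duration' is dropped: per the data description it is only known after the call
# ends, so keeping it would leak the outcome into a model meant to rank clients
# before they are called.
X = df.drop(columns=['y', 'duration'])
y = (df['y'] == 'yes').astype(int)  # 1 = subscribed, 0 = not subscribed

# Identify numeric vs categorical columns (everything non-numeric is categorical here)
num_cols = X.select_dtypes(include='number').columns.tolist()
cat_cols = X.select_dtypes(exclude='number').columns.tolist()

num_cols, cat_cols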


In [None]:
# Preprocessing: scale numeric features + one-hot encode categoricals
preprocess = ColumnTransformer(
    transformers=[
        ('num', StandardScaler(), num_cols),
        ('cat', OneHotEncoder(handle_unknown='ignore'), cat_cols)
    ],
    remainder='drop'
)


In [None]:
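# Helper for building comparable pipelines
from sklearn.base import clone

def make_pipe(model):
    # Clone the preprocessor so each pipeline fits its own copy instead of
    # sharing (and refitting) a single ColumnTransformer instance.
    return Pipeline(steps=[
        ('preprocess', clone(preprocess)),
        ('model', model)
    ])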


### Problem 6: Train/Test Split

With your data prepared, split it into a train and test set.
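With a rare positive class, a purely random split can leave the train and test sets with slightly different subscription rates. The sketch below compares a plain split with a stratified one on synthetic labels drawn at roughly the ~11% positive rate seen in this dataset. All names (`X_toy`, `y_toy`, `gap_plain`, `gap_strat`) are hypothetical and the data is randomly generated.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic target with about 11% positives and three random features.
y_toy = (rng.random(10_000) < 0.11).astype(int)
X_toy = rng.normal(size=(10_000, 3))

# Plain random split.
_, _, y_tr_plain, y_te_plain = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42
)

# Stratified split: class proportions are preserved in both parts.
_, _, y_tr_strat, y_te_strat = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

gap_plain = abs(y_tr_plain.mean() - y_te_plain.mean())
gap_strat = abs(y_tr_strat.mean() - y_te_strat.mean())

print(f"Positive-rate gap, plain split:      {gap_plain:.4f}")
print(f"Positive-rate gap, stratified split: {gap_strat:.4f}")
```

Stratification keeps the gap within rounding error of zero, so test metrics are not skewed by an unlucky draw of positives.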

In [None]:
# Train/test split (stratified due to class imbalance)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y
)

X_train.shape, X_test.shape, y_train.mean(), y_test.mean()


In [None]:
# We'll use ROC-AUC as the primary metric because the 'yes' class is relatively rare (~11%).
# Accuracy is still reported, but ROC-AUC is more informative for ranking/prioritization use cases.


In [None]:
primary_metric = 'roc_auc'

### Problem 7: A Baseline Model

Before we build our first model, we want to establish a baseline.  What is the baseline performance that our classifier should aim to beat?
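The arithmetic behind the baseline is simple: a model that always predicts the majority class is correct exactly as often as that class occurs, and because it gives every client the same score it cannot rank anyone, so its ROC-AUC is 0.5. The sketch below checks this on synthetic labels with 11% positives (an assumption mirroring the approximate class balance of this dataset); `X_toy` and `y_toy` are hypothetical placeholders.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, roc_auc_score

# Synthetic labels: exactly 11% positives.
y_toy = np.array([1] * 110 + [0] * 890)
X_toy = np.zeros((len(y_toy), 1))  # features are ignored by DummyClassifier

dummy = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)

# Accuracy equals the majority-class share: 890 / 1000.
acc = accuracy_score(y_toy, dummy.predict(X_toy))

# Every row gets the same score, so the ranking is uninformative.
auc = roc_auc_score(y_toy, dummy.predict_proba(X_toy)[:, 1])

print(f"accuracy={acc:.2f}, roc_auc={auc:.2f}")
```

This is why an accuracy near 0.89 is not impressive on its own here, and why the notebook tracks ROC-AUC alongside it.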

In [None]:
# Baseline model: always predict the majority class
baseline = DummyClassifier(strategy='most_frequent', random_state=42)
baseline_pipe = make_pipe(baseline)

t0 = time.perf_counter()
baseline_pipe.fit(X_train, y_train)
fit_time = time.perf_counter() - t0

y_pred = baseline_pipe.predict(X_test)
y_proba = baseline_pipe.predict_proba(X_test)[:,1]

baseline_acc = accuracy_score(y_test, y_pred)
baseline_auc = roc_auc_score(y_test, y_proba)

baseline_acc, baseline_auc, fit_time


In [None]:
print(f"Baseline (most frequent) accuracy: {baseline_acc:.3f}")
print(f"Baseline ROC-AUC: {baseline_auc:.3f}")

In [None]:
ConfusionMatrixDisplay.from_predictions(y_test, y_pred)
plt.title("Baseline confusion matrix")
plt.show()


### Problem 8: A Simple Model

Use Logistic Regression to build a basic model on your data.  

In [None]:
# Simple Logistic Regression model (baseline ML model)
lgr = LogisticRegression(max_iter=2000)
lgr_pipe = make_pipe(lgr)

t0 = time.perf_counter()
lgr_pipe.fit(X_train, y_train)
lgr_fit_time = time.perf_counter() - t0

y_pred_lgr = lgr_pipe.predict(X_test)
y_proba_lgr = lgr_pipe.predict_proba(X_test)[:,1]

lgr_acc = accuracy_score(y_test, y_pred_lgr)
lgr_auc = roc_auc_score(y_test, y_proba_lgr)

print(f"LogReg fit time: {lgr_fit_time:.3f}s | Test accuracy: {lgr_acc:.3f} | Test ROC-AUC: {lgr_auc:.3f}")


### Problem 9: Score the Model

What is the accuracy of your model?

In [None]:
print(classification_report(y_test, y_pred_lgr, target_names=['no','yes']))

ConfusionMatrixDisplay.from_predictions(y_test, y_pred_lgr)
plt.title("Logistic Regression confusion matrix")
plt.show()


### Problem 10: Model Comparisons

Now, we aim to compare the performance of the Logistic Regression model to our KNN algorithm, Decision Tree, and SVM models.  Using the default settings for each of the models, fit and score each.  Also, be sure to compare the fit time of each of the models.  Present your findings in a `DataFrame` similar to that below:

| Model | Train Time | Train Accuracy | Test Accuracy |
| ----- | ---------- | -------------  | -----------   |
|     |    |.     |.     |

In [None]:
models = {
    "Logistic Regression": LogisticRegression(max_iter=2000),
    "KNN": KNeighborsClassifier(),  # distance-based; benefits from scaling in pipeline
    "Decision Tree": DecisionTreeClassifier(random_state=42),
    "SVM (RBF)": SVC(random_state=42)  # decision_function supports ROC-AUC; faster than probability=True
}

results = []
for name, model in models.items():
    pipe = make_pipe(model)
    t0 = time.perf_counter()
    pipe.fit(X_train, y_train)
    train_time = time.perf_counter() - t0

    yhat_train = pipe.predict(X_train)
    yhat_test = pipe.predict(X_test)

    # Score for AUC
    if hasattr(pipe.named_steps['model'], "predict_proba"):
        train_score = pipe.predict_proba(X_train)[:,1]
        test_score = pipe.predict_proba(X_test)[:,1]
    else:
        train_score = pipe.decision_function(X_train)
        test_score = pipe.decision_function(X_test)
        
    results.append({
        "Model": name,
        "Train Time (s)": train_time,
        "Train Accuracy": accuracy_score(y_train, yhat_train),
        "Test Accuracy": accuracy_score(y_test, yhat_test),
        "Train ROC-AUC": roc_auc_score(y_train, train_score),
        "Test ROC-AUC": roc_auc_score(y_test, test_score)
    })

results_df = pd.DataFrame(results).sort_values(by="Test ROC-AUC", ascending=False)
results_df


In [None]:
# Visual comparison
plt.figure(figsize=(8,4))
plt.bar(results_df["Model"], results_df["Test ROC-AUC"])
plt.title("Test ROC-AUC by model (default settings)")
plt.ylabel("ROC-AUC")
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()

plt.figure(figsize=(8,4))
plt.bar(results_df["Model"], results_df["Train Time (s)"])
plt.title("Training time by model (default settings)")
plt.ylabel("Seconds")
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()


In [None]:
results_df.style.format({
    "Train Time (s)": "{:.3f}",
    "Train Accuracy": "{:.3f}",
    "Test Accuracy": "{:.3f}",
    "Train ROC-AUC": "{:.3f}",
    "Test ROC-AUC": "{:.3f}",
})


### Problem 11: Improving the Model

Now that we have some basic models on the board, we want to try to improve these.  Below, we list a few things to explore in this pursuit.


- Hyperparameter tuning and grid search.  All of our models have additional hyperparameters to tune and explore, for example the number of neighbors in KNN or the maximum depth of a Decision Tree.  
- Adjust your performance metric
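As one way to act on the second point, several metrics can be scored in the same cross-validation run so that accuracy is read next to metrics that are more sensitive to the rare class. The sketch below is an illustration on synthetic data from `make_classification`, not on the bank dataset; `average_precision` is included as one reasonable candidate for imbalanced problems, not as a metric the assignment prescribes. The names `X_toy`, `y_toy`, `cv_toy`, and `summary` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced problem (about 11% positives).
X_toy, y_toy = make_classification(
    n_samples=3000, n_features=10, weights=[0.89, 0.11], random_state=0
)

metric_names = ["accuracy", "roc_auc", "average_precision", "recall"]
cv_toy = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# One cross-validation run, several metrics.
scores = cross_validate(
    LogisticRegression(max_iter=1000),
    X_toy, y_toy,
    cv=cv_toy,
    scoring=metric_names,
)

# Mean of each metric across the five folds.
summary = {name: scores[f"test_{name}"].mean() for name in metric_names}
for name, value in summary.items():
    print(f"{name:>18}: {value:.3f}")
```

Accuracy and recall depend on the default 0.5 cut-off, while ROC-AUC and average precision evaluate the ranking itself, which is closer to how a call list would be used.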

In [None]:
# Cross-validation comparison (5-fold Stratified CV)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scoring = {
    "accuracy": "accuracy",
    "roc_auc": "roc_auc",
    "f1": "f1",
    "precision": "precision",
    "recall": "recall"
}

cv_rows = []
for name, model in models.items():
    pipe = make_pipe(model)
    cv_out = cross_validate(pipe, X_train, y_train, cv=cv, scoring=scoring, n_jobs=None, return_train_score=False)
    cv_rows.append({
        "Model": name,
        "CV ROC-AUC (mean)": cv_out["test_roc_auc"].mean(),
        "CV ROC-AUC (std)": cv_out["test_roc_auc"].std(),
        "CV Accuracy (mean)": cv_out["test_accuracy"].mean(),
        "CV F1 (mean)": cv_out["test_f1"].mean(),
        "CV Recall (mean)": cv_out["test_recall"].mean(),
        "Fit Time (mean s)": cv_out["fit_time"].mean()
    })

cv_df = pd.DataFrame(cv_rows).sort_values("CV ROC-AUC (mean)", ascending=False)
cv_df


In [None]:
plt.figure(figsize=(8,4))
plt.bar(cv_df["Model"], cv_df["CV ROC-AUC (mean)"])
plt.title("Cross-validated ROC-AUC (mean) on training set")
plt.ylabel("ROC-AUC")
plt.xticks(rotation=30, ha='right')
plt.tight_layout()
plt.show()


In [None]:
# Hyperparameter tuning (Grid Search) — modest grids to keep runtime reasonable
param_grids = {
    "Logistic Regression": {
        "model__C": [0.1, 1.0, 10.0],
        "model__penalty": ["l2"],
        "model__solver": ["lbfgs"],
        "model__class_weight": [None, "balanced"]
    },
    "KNN": {
        "model__n_neighbors": [5, 15, 35, 75],
        "model__weights": ["uniform", "distance"],
        "model__p": [1, 2]  # Manhattan vs Euclidean
    },
    "Decision Tree": {
        "model__max_depth": [3, 5, 10, None],
        "model__min_samples_split": [2, 10, 50],
        "model__min_samples_leaf": [1, 5, 20],
        "model__class_weight": [None, "balanced"]
    },
    "SVM (RBF)": {
        "model__C": [0.5, 1.0, 5.0],
        "model__gamma": ["scale", 0.1, 0.01],
        "model__class_weight": [None, "balanced"]
    }
}

grid_results = []
best_estimators = {}

for name, base_model in models.items():
    pipe = make_pipe(base_model)
    grid = GridSearchCV(
        estimator=pipe,
        param_grid=param_grids[name],
        scoring="roc_auc",
        cv=cv,
        n_jobs=None
    )
    t0 = time.perf_counter()
    grid.fit(X_train, y_train)
    search_time = time.perf_counter() - t0

    best_estimators[name] = grid.best_estimator_
    grid_results.append({
        "Model": name,
        "Best CV ROC-AUC": grid.best_score_,
        "Search Time (s)": search_time,
        "Best Params": grid.best_params_
    })

grid_df = pd.DataFrame(grid_results).sort_values("Best CV ROC-AUC", ascending=False)
grid_df


In [None]:
# Evaluate tuned models on the held-out test set
test_rows = []
for name, best_pipe in best_estimators.items():
    yhat = best_pipe.predict(X_test)
    if hasattr(best_pipe.named_steps["model"], "predict_proba"):
        yscore = best_pipe.predict_proba(X_test)[:,1]
    else:
        yscore = best_pipe.decision_function(X_test)
    test_rows.append({
        "Model": name,
        "Test Accuracy": accuracy_score(y_test, yhat),
        "Test ROC-AUC": roc_auc_score(y_test, yscore)
    })

tuned_test_df = pd.DataFrame(test_rows).sort_values("Test ROC-AUC", ascending=False)
tuned_test_df


In [None]:
# Pick the best tuned model by cross-validated ROC-AUC (selecting on test scores
# would bias the test estimate), then inspect detailed test performance
best_model_name = grid_df.iloc[0]["Model"]
best_model = best_estimators[best_model_name]

print("Best tuned model:", best_model_name)

yhat_best = best_model.predict(X_test)
if hasattr(best_model.named_steps["model"], "predict_proba"):
    yscore_best = best_model.predict_proba(X_test)[:,1]
else:
    yscore_best = best_model.decision_function(X_test)

print("Test Accuracy:", f"{accuracy_score(y_test, yhat_best):.3f}")
print("Test ROC-AUC:", f"{roc_auc_score(y_test, yscore_best):.3f}")
print()
print(classification_report(y_test, yhat_best, target_names=["no","yes"]))

ConfusionMatrixDisplay.from_predictions(y_test, yhat_best)
plt.title(f"{best_model_name} (tuned) confusion matrix")
plt.show()


In [None]:
# Interpretability: if the best model is Logistic Regression, inspect coefficients
if best_model_name == "Logistic Regression":
    ohe = best_model.named_steps["preprocess"].named_transformers_["cat"]
    cat_feature_names = ohe.get_feature_names_out(cat_cols)
    feature_names = np.concatenate([np.array(num_cols), cat_feature_names])

    coefs = best_model.named_steps["model"].coef_.ravel()
    coef_df = pd.DataFrame({"feature": feature_names, "coef": coefs})
    coef_df["abs_coef"] = coef_df["coef"].abs()
    display(coef_df.sort_values("abs_coef", ascending=False).head(15))
else:
    print("Best model is not Logistic Regression; coefficient interpretation not directly available.")


In [None]:
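# Threshold tuning (useful for call-prioritization)
# If the model provides probabilities, tune over [0,1]. If not, tune over score quantiles.
# Caveat: the threshold is picked on the test set here, so the reported F1 is optimistic;
# for deployment, choose it on validation or out-of-fold predictions instead.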
from sklearn.metrics import f1_score

if (yscore_best.min() >= 0.0) and (yscore_best.max() <= 1.0):
    thresholds = np.linspace(0.05, 0.95, 19)
else:
    thresholds = np.quantile(yscore_best, np.linspace(0.05, 0.95, 19))

f1s = []
for t in thresholds:
    preds = (yscore_best >= t).astype(int)
    f1s.append(f1_score(y_test, preds))

best_t = thresholds[int(np.argmax(f1s))]

plt.figure(figsize=(6,4))
plt.plot(thresholds, f1s, marker='o')
plt.title("F1 vs classification threshold (test set)")
plt.xlabel("Threshold (probability or score)")
plt.ylabel("F1")
plt.tight_layout()
plt.show()

print("Best threshold (by F1):", best_t, "| Best F1:", max(f1s))


### Findings, actionable insights, and recommendations

**What worked best:**  
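- The strongest model is the one with the highest cross-validated ROC-AUC in the grid-search table above. On tabular marketing data like this, **Logistic Regression** (interpretable, fast to fit) and **SVM** (often competitive on ROC-AUC, but much slower to fit) are common front-runners; confirm against the tables rather than assuming either one wins.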

**Actionable insights (nontechnical framing):**
- Use the model scores to **rank customers** and focus calling resources on the top segment (e.g., top 10–20% predicted probability/score).
- Consider **decision threshold tuning** depending on campaign goals:
  - If the bank wants **more sign-ups**, lower the threshold to increase recall (more “yes” captured) at the cost of more calls.
  - If the bank wants **higher efficiency**, raise the threshold to increase precision (fewer wasted calls).
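The precision/recall trade-off described above can be made concrete by sweeping a few thresholds over out-of-fold probabilities, so the threshold is not chosen on the same data used for the final test estimate. This is a minimal sketch on synthetic data with a plain Logistic Regression; the thresholds `0.1`, `0.3`, `0.5` and the names `X_toy`, `y_toy`, `oof_proba`, and `rows` are illustrative assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import cross_val_predict

# Synthetic imbalanced problem (about 11% positives).
X_toy, y_toy = make_classification(
    n_samples=4000, n_features=10, weights=[0.89, 0.11], random_state=0
)

# Out-of-fold probabilities: each row is scored by a model that never saw it.
oof_proba = cross_val_predict(
    LogisticRegression(max_iter=1000), X_toy, y_toy, cv=5, method="predict_proba"
)[:, 1]

rows = []
for t in (0.1, 0.3, 0.5):
    preds = (oof_proba >= t).astype(int)
    rows.append({
        "threshold": t,
        "precision": precision_score(y_toy, preds, zero_division=0),
        "recall": recall_score(y_toy, preds),
        "share_called": preds.mean(),  # fraction of clients that would be called
    })

for r in rows:
    print(r)
```

Lower thresholds call more clients and capture more subscribers; higher thresholds shrink the call list, and recall can only fall as the threshold rises.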

**Next steps:**
1. Add cost-sensitive evaluation (e.g., expected value per call) if the business can estimate: call cost, average deposit value, and conversion value.  
2. Perform deeper feature analysis (e.g., SHAP for tree-based extensions).  
3. Monitor drift by month/quarter (macroeconomic variables suggest performance could shift over time).  
4. Deploy as a scoring job (batch scoring before each campaign day) and track lift vs. current strategy.
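The first next step can be sketched with a simple expected-value rule: if a call costs $c$ and a subscription is worth $v$, calling a client with predicted probability $p$ has expected profit $p \cdot v - c$, which is positive only when $p > c / v$. The figures below (`CALL_COST`, `CONVERSION_VALUE`) and the probabilities in `proba` are invented for illustration, and the rule assumes the model's probabilities are reasonably well calibrated.

```python
import numpy as np

CALL_COST = 2.0          # hypothetical cost of one call
CONVERSION_VALUE = 40.0  # hypothetical value of one subscription


def expected_profit(proba, threshold):
    """Expected profit from calling every client with proba >= threshold."""
    called = proba >= threshold
    return float(np.sum(proba[called] * CONVERSION_VALUE - CALL_COST))


# A call pays off in expectation only above this probability.
break_even = CALL_COST / CONVERSION_VALUE

# Hypothetical predicted probabilities for six clients.
proba = np.array([0.02, 0.04, 0.08, 0.15, 0.40, 0.70])

for t in (0.0, break_even, 0.5):
    print(f"threshold={t:.2f} -> expected profit={expected_profit(proba, t):.1f}")
```

With these made-up numbers the break-even threshold is far below 0.5, which illustrates why a default cut-off can leave value on the table when conversions are worth much more than a call costs.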


In [None]:
# Save key result tables for README/reporting
results_df.to_csv("model_comparison_default.csv", index=False)
cv_df.to_csv("model_comparison_cv.csv", index=False)
grid_df.to_csv("model_comparison_grid.csv", index=False)
tuned_test_df.to_csv("model_comparison_tuned_test.csv", index=False)

print("Saved comparison tables to CSV in the current working directory.")


In [None]:
# Quick look at which categorical features are most associated with the target (chi-square)
from scipy.stats import chi2_contingency, ttest_ind

chi2_rows=[]
for col in cat_cols:
    ct = pd.crosstab(df[col], df['y'])
    chi2, p, dof, exp = chi2_contingency(ct)
    chi2_rows.append({"feature": col, "chi2": chi2, "p_value": p})

chi2_df = pd.DataFrame(chi2_rows).sort_values("p_value")
chi2_df.head(10)


In [None]:
# Example inferential check for a numeric variable (age): do subscribers differ in mean age?
age_yes = df.loc[df['y']=='yes', 'age']
age_no = df.loc[df['y']=='no', 'age']
t_stat, p_val = ttest_ind(age_yes, age_no, equal_var=False)

print(f"Mean age (yes): {age_yes.mean():.2f} | Mean age (no): {age_no.mean():.2f}")
print(f"Welch t-test: t={t_stat:.2f}, p={p_val:.3e}")



## Executive Summary (for README)

- **Goal:** predict whether a client will subscribe to a term deposit (`y`) so the bank can prioritize outbound calls and reduce wasted contacts.  
- **Primary metric:** **ROC-AUC** (class imbalance ~11% "yes").  
- **Approach:** standardized numeric features + one-hot encoded categoricals; compared **Logistic Regression, KNN, Decision Tree, and SVM** with default settings and 5-fold stratified CV; then performed **grid search** per model.  
- **Outcome:** the tuned model selected above (see the grid-search and hold-out test tables for its scores) can be used as a **ranking score** for call prioritization.  
- **Recommendation:** deploy batch scoring before campaigns, choose a threshold aligned to campaign goals (efficiency vs. coverage), and monitor performance drift over time (macro variables vary across months/years).  
