# Term Deposit Predictor
by Jackson Lu, Daniel Yorke, Charlene Chin, and Mohammed Ibrahim 2025/11/21



# Summary
This project focuses on predicting whether clients will subscribe to a term deposit using the Bank Marketing dataset. A logistic regression model was developed, incorporating all available predictor variables after appropriate preprocessing. The model was evaluated using shuffled, stratified cross-validation, with an emphasis on the F1 score, which balances precision and recall. The analysis was conducted using Python and key libraries such as NumPy, pandas, and scikit-learn, with all code documented for reproducibility.
Our final classifier performed fairly well on an unseen test data set, achieving an accuracy of 0.844, an F1 score of 0.551, and a ROC-AUC score of 0.91. This indicates that the model is reasonably effective at identifying clients who will subscribe to a term deposit, although there is room for improvement, particularly in recall. Further refinements could involve exploring additional features, tuning hyperparameters, or experimenting with alternative modeling techniques to enhance predictive performance.

# Introduction

Financial institutions rely heavily on effective marketing strategies to identify which clients are most likely to subscribe to long-term financial products such as term deposits. These products support both customer financial planning and bank stability, yet subscription rates are often low due to ineffective targeting. Traditional marketing approaches depend heavily on human judgment, intuition, and repeated client contact, which can be costly, time-consuming, and inconsistent in effectiveness. As a result, developing more objective and data-driven methods for understanding and predicting client behaviour has become increasingly important.

In this project, we ask whether a machine learning algorithm can accurately predict whether a bank client will subscribe to a term deposit based on demographic attributes, financial information, and past marketing interactions. This question is important because traditional marketing strategies tend to rely on broad outreach rather than individualized prediction, leading to inefficiencies and potential client fatigue. Furthermore, understanding which client characteristics are associated with subscription behavior may support more personalized communication strategies and improve customer experience. If a machine learning classifier such as logistic regression can reliably predict subscription outcomes, it may enable more data-driven, scalable, and cost-effective marketing decisions, ultimately improving the performance of future campaigns.

# Methods 
## Data

The dataset used in this project is the Bank Marketing dataset, created by Sérgio Moro, P. Cortez, and P. Rita in 2014 at the University of Minho in Portugal. It records a series of direct marketing campaigns conducted by a Portuguese banking institution. The data is publicly available through the UCI Machine Learning Repository and contains information on client demographics, financial status, and details related to previous marketing contacts. The dataset can be found [here](https://archive.ics.uci.edu/ml/datasets/bank+marketing).

The dataset contains 45,211 observations and 17 columns in total, comprising 16 predictor variables and 1 binary target variable (y) indicating whether the client subscribed to a term deposit. Each record represents a client who was contacted during a marketing campaign. The predictor variables capture a mix of demographic, financial, and campaign-related information. Among these, several features contain missing values (e.g., job, education, contact, and poutcome), requiring appropriate imputation or handling during preprocessing. Missing categorical values were imputed with a constant placeholder (“unknown”), and numerical features were standardized using StandardScaler to ensure comparability across variables. The target variable y is binary (yes or no), with only around 11–12% of the clients subscribing to a term deposit, resulting in a class imbalance that must be considered in model evaluation. Together, these attributes provide a rich and diverse feature set for assessing whether logistic regression can effectively capture the patterns associated with successful term-deposit subscriptions.
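The class balance and missing-value counts described above can be checked with a few lines of pandas. The sketch below uses a small made-up frame (`toy`, a hypothetical name) in place of the real data, so the printed numbers are illustrative only; the column names simply mirror those in the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real data: 3 subscribers out of 25 clients.
toy = pd.DataFrame({
    "job": ["admin.", None, "technician", "services", None] * 5,
    "balance": np.arange(25) * 100,
    "y": ["yes"] * 3 + ["no"] * 22,
})

# Share of each class in the target; a small "yes" share signals imbalance.
class_share = toy["y"].value_counts(normalize=True)

# Number of missing entries per column, to decide what needs imputation.
missing_counts = toy.isna().sum()

print(class_share.round(3))
print(missing_counts)
```

Running the same two calls on the full data frame would show which columns need imputation and how strongly the target is skewed toward "no".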

This dataset is licensed under a Creative Commons Attribution 4.0 International (CC BY 4.0) license.

## Analysis

A logistic regression classifier was developed to model the probability that a client would subscribe to a term deposit (y). All predictor variables from the original dataset were included after appropriate preprocessing, which involved encoding categorical features with OneHotEncoder and scaling numerical features using StandardScaler. The dataset was randomly divided into a training set (80%) and a test set (20%) to enable unbiased performance evaluation.

Prior exploratory analysis examined the distributions of all input variables in the training set, with plots colored by the binary outcome (“yes” or “no”). Most numerical predictors—such as previous, pdays, campaign, duration, age, and balance—displayed substantial overlap between the two classes. However, some features, particularly duration, showed clear differences: clients who subscribed tended to have significantly longer call durations. This observation is consistent with findings from the original dataset documentation, confirming duration as a strong predictor of subscription. Other variables, such as campaign, previous, and pdays, were highly right-skewed with long tails, while categorical variables (e.g., job, marital status, education, and contact type) appeared to carry complementary contextual information about clients. These exploratory patterns were visualized in Figure 1, which displays feature distributions by subscription status. Figure 2 presents the correlation matrix among numerical predictors.

Correlation matrices (both Pearson and Spearman) were also examined to assess relationships among predictors. Overall, correlations between numerical features were weak, indicating low multicollinearity, which supports the use of logistic regression as an interpretable linear model. Some moderate associations were found among pdays, previous, and campaign, reflecting their shared connection to marketing contact history.
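To illustrate why both coefficients are worth checking on skewed features, the following minimal sketch compares Pearson and Spearman correlations on synthetic data. The column names (`x`, `x_cubed`, `noise`) and the data are assumptions made for illustration and are not taken from the Bank Marketing dataset.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(522)

# Hypothetical right-skewed feature, loosely imitating contact-history counts.
x = rng.exponential(scale=2.0, size=500)
toy = pd.DataFrame({
    "x": x,
    "x_cubed": x ** 3,               # monotone but non-linear in x
    "noise": rng.normal(size=500),   # unrelated to x
})

pearson = toy.corr(method="pearson")    # measures linear association
spearman = toy.corr(method="spearman")  # measures rank (monotone) association

print(pearson.round(2))
print(spearman.round(2))
```

Spearman treats the monotone pair as perfectly associated, while Pearson is pulled down by the non-linearity and the long tail, which is why the two matrices can disagree on right-skewed variables.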

Model evaluation was conducted using stratified 5-fold cross-validation to address class imbalance. Performance was primarily assessed using the F1-score, which balances precision and recall, along with accuracy and ROC-AUC for a more complete picture. Across the five folds, the model achieved a mean accuracy of 0.844, a mean F1-score of 0.551, and a mean ROC-AUC of 0.910. Training and validation scores were closely aligned across folds, indicating minimal overfitting. These results suggest that the logistic regression model provides strong discriminatory ability, though recall could be improved by further class rebalancing or feature engineering.
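As a reminder of how the F1-score combines precision and recall, here is a small worked example. The confusion-matrix counts are hypothetical and are not results from this analysis; `precision_recall_f1` is a helper name introduced only for this sketch.

```python
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    """Return precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 80 true positives, 120 false positives, 20 false negatives.
precision, recall, f1 = precision_recall_f1(tp=80, fp=120, fn=20)
print(f"precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
# prints: precision=0.40, recall=0.80, F1=0.53
```

Because F1 is a harmonic mean, it stays moderate whenever either precision or recall is low, even if the other is high.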

All analysis was conducted in Python (Van Rossum & Drake, 2009) using NumPy (Harris et al., 2020), pandas (McKinney, 2010), scikit-learn (Pedregosa et al., 2011), and Altair for visualization. All code for data processing, modeling, and figure generation is documented within this notebook for reproducibility.


# Results and Discussion

The results demonstrate that logistic regression can effectively distinguish clients likely to subscribe to a term deposit, achieving strong performance across multiple evaluation metrics. The identification of duration as the most influential predictor aligns with expectations—longer calls typically indicate higher engagement and interest in the product. The moderate F1-score, however, reflects difficulty in recalling all positive cases, which was anticipated due to the dataset’s pronounced class imbalance (only around 11–12% subscribed).

These findings highlight the model’s practical potential: banks could apply such a model to prioritize high-probability clients, improving campaign efficiency while reducing unnecessary contact costs. The high ROC-AUC value (0.91) suggests that even a simple, interpretable model can meaningfully support decision-making in marketing strategy.
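One way such prioritization could work is to rank clients by predicted subscription probability and contact only the top fraction. The sketch below is a minimal illustration on synthetic data from `make_classification`; the 10% contact budget, the train/holdout slicing, and all variable names are assumptions, not part of the original analysis.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic, imbalanced stand-in for the bank data (roughly 12% positives).
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=8, weights=[0.88], random_state=522
)

model = LogisticRegression(max_iter=2000, class_weight="balanced")
model.fit(X_toy[:1500], y_toy[:1500])

# Probability of the positive class for clients not used in fitting.
proba = model.predict_proba(X_toy[1500:])[:, 1]

# Contact only the top 10% of clients ranked by predicted probability.
n_contact = int(0.10 * len(proba))
top_idx = np.argsort(proba)[::-1][:n_contact]

hit_rate_top = y_toy[1500:][top_idx].mean()
hit_rate_all = y_toy[1500:].mean()
print(f"Subscription rate in top 10%: {hit_rate_top:.2f}")
print(f"Subscription rate overall:    {hit_rate_all:.2f}")
```

Comparing the subscription rate among the contacted group with the overall rate gives a simple measure of how much a ranked campaign might gain over untargeted outreach.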

Future work could explore whether non-linear models (e.g., tree-based or ensemble methods) further improve recall, or whether feature engineering on time-related or interaction variables enhances predictive performance. In addition, investigating the relative influence of demographic versus campaign-related features could deepen understanding of what drives client subscription behavior.


In [None]:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler, LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.impute import SimpleImputer
from sklearn.model_selection import cross_val_score, cross_validate
from sklearn.model_selection import StratifiedKFold
import pandas as pd
import numpy as np
import os
import warnings
warnings.filterwarnings('ignore', category=FutureWarning)
import altair as alt
import altair_ally as ally
from altair import datum

# Store chart data as external JSON files (under ../data/altair/) rather than embedding it in each chart
alt.data_transformers.enable('json', prefix='../data/altair/')


In [None]:
from ucimlrepo import fetch_ucirepo 
  
# fetch dataset 
bank_marketing = fetch_ucirepo(id=222) 
  
# data (as pandas dataframes) 
X = bank_marketing.data.features 
y = bank_marketing.data.targets 

In [None]:
# Define the folder path
folder_path = '../data/'
altair_path = '../data/altair/'

# Ensure the directory exists (create it if it doesn't)
os.makedirs(folder_path, exist_ok=True)
os.makedirs(altair_path, exist_ok=True)

# Define file paths
features_file_path = os.path.join(folder_path, 'bank_marketing_features.csv')
targets_file_path = os.path.join(folder_path, 'bank_marketing_targets.csv')

# Export the DataFrames to CSV
X.to_csv(features_file_path, index=False) # index=False prevents pandas from writing row indices to the file
y.to_csv(targets_file_path, index=False)

df = pd.concat([X, y], axis=1)

In [None]:
# Suppress a known UserWarning raised when altair_ally passes data to Altair
warnings.filterwarnings(
    "ignore",
    message="You passed a `<class 'narwhals.stable.v1.DataFrame'>` to `is_pandas_dataframe`.",
    category=UserWarning,
    module="altair.utils.data"
)

In [None]:
df.head()

In [None]:
df.info()

In [None]:
df.describe()

Here we show the distribution of each feature, colored by the target variable.

In [None]:
ally.alt.data_transformers.enable('vegafusion')
ally.dist(df, color='y')


Figure 1. Key Feature Distributions

Here we show the correlations between features.

In [None]:
ally.corr(df)

Figure 2. Feature Correlations

We created pipelines to transform numerical and categorical features separately. Numerical features were imputed with the median and standardized using StandardScaler, while categorical features were imputed with the placeholder "unknown" and encoded using OneHotEncoder. The final pipeline combined these preprocessing steps with the LogisticRegression model.

In [None]:
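# Numeric features: fill missing values with the median, then standardize.
# Categorical features: fill missing values with 'unknown', then one-hot encode.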
numeric_pipeline = make_pipeline(
    SimpleImputer(strategy='median'),
    StandardScaler()
)

categorical_pipeline = make_pipeline(
    SimpleImputer(strategy='constant', fill_value='unknown'),
    OneHotEncoder(drop='first')
)

In [None]:
# First, let's prepare the data
# Handle categorical variables in features
categorical_columns = X.select_dtypes(include=['object']).columns.tolist()
numerical_columns = X.select_dtypes(include=['int64', 'float64']).columns.tolist()

print(f"Categorical columns: {categorical_columns}")
print(f"Numerical columns: {numerical_columns}")

In [None]:
# Create preprocessing pipeline
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_pipeline, numerical_columns),
        ('cat', categorical_pipeline, categorical_columns)
    ])

In [None]:
full_pipeline = make_pipeline(
    preprocessor,
    LogisticRegression(random_state=522, max_iter=2000, class_weight="balanced")
)

In [None]:
# Prepare target variable
# LabelEncoder just creates a simple mapping - no statistics involved
label_encoder = LabelEncoder()
y_encoded = label_encoder.fit_transform(y.values.ravel())

# What it does:
# 'no'  → 0
# 'yes' → 1

print(f"Target classes: {label_encoder.classes_}")
# Note: ravel() flattens the single-column target of shape (n_samples, 1) into a 1D array of shape (n_samples,), which is what LabelEncoder expects.

In [None]:
# Split the data
# 'stratify=y_encoded' ensures that your train and test sets have the same class distribution as your original dataset.
X_train, X_test, y_train, y_test = train_test_split(
    X, y_encoded, test_size=0.2, random_state=522, stratify=y_encoded
)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")

In [None]:
# Use stratified CV for imbalanced data
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=522)
cv_results = cross_validate(
    full_pipeline,
    X,
    y_encoded,
    cv=skf,  # ← Use stratified splits!
    scoring={'accuracy': 'accuracy', 'f1': 'f1', 'roc_auc': 'roc_auc'},
    return_train_score=True,
    n_jobs=-1
)


In [None]:
pd.DataFrame(cv_results).agg(['mean', 'std']).round(3).T

Table 1. Cross-validation performance metrics for logistic regression model

Our prediction model performed quite well on test data, with a final overall accuracy of 0.844 and F1 score of 0.551. The ROC-AUC score of 0.91 indicates that the model is effective at distinguishing between clients who will and will not subscribe to a term deposit. However, there is room for improvement in identifying all potential subscribers, as some were missed by the model.
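The cells above report cross-validated scores. A single evaluation on a held-out split could look like the following sketch, which fits a pipeline on the training portion only and scores it once on the test portion. It runs on synthetic stand-in data (`toy_X`, `toy_y`, and the simplified two-column preprocessing are assumptions), so its output says nothing about the real model.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

rng = np.random.default_rng(522)
n = 1000

# Hypothetical stand-in data: one numeric and one categorical feature.
toy_X = pd.DataFrame({
    "duration": rng.exponential(scale=250, size=n),
    "contact": rng.choice(["cellular", "telephone"], size=n),
})
# Longer calls are made more likely to end in a subscription.
p_yes = 1 / (1 + np.exp(-(toy_X["duration"] - 600) / 150))
toy_y = (rng.random(n) < p_yes).astype(int).to_numpy()

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=522, stratify=toy_y
)

pipe = make_pipeline(
    ColumnTransformer([
        ("num", StandardScaler(), ["duration"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["contact"]),
    ]),
    LogisticRegression(max_iter=2000, class_weight="balanced"),
)

# Fit on the training split only, then score once on the held-out split.
pipe.fit(X_tr, y_tr)
y_pred = pipe.predict(X_te)
y_score = pipe.predict_proba(X_te)[:, 1]

test_metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "f1": f1_score(y_te, y_pred),
    "roc_auc": roc_auc_score(y_te, y_score),
}
print({name: round(value, 3) for name, value in test_metrics.items()})
```

Keeping the test split out of both fitting and cross-validation is what makes the final numbers an estimate of performance on unseen clients.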
