# Term Depost Marketing

## Background
#### We are a small startup is focusing mainly on providing machine learning solutions in the European banking market. We work on a variety of problems including fraud detection, sentiment classification and customer intention prediction and classification.

#### We are interested in developing a robust machine learning system that leverages information coming from call center data.

#### Ultimately, we are looking for ways to improve the success rate for calls made to customers for any product that our clients offer. Towards this goal we are working on designing an ever evolving machine learning product that offers high success outcomes while offering interpretability for our clients to make informed decisions.

## Goals
#### - Predict if a customer will subscribe to a term deposit.
#### - Find out which customers are more likely to buy the investment product.
#### - Determine which features make the customer buy.  




# ------------------------------------------------------------------------
## Setup

In [None]:
#install libraries
!pip install pycaret
!pip install imbalanced-learn
!pip install optuna lightgbm
!pip install duckdb
!pip install nbconvert --upgrade

In [None]:
# import libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import gdown
import warnings
import logging
import random
import lightgbm as lgb
import requests
import sys
import optuna
import duckdb as dd
import umap
from imblearn.under_sampling import RandomUnderSampler
from imblearn.combine import SMOTEENN, SMOTETomek
from sklearn import preprocessing
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.model_selection import GridSearchCV
from sklearn.manifold import TSNE
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, BaggingClassifier, VotingClassifier, StackingClassifier, GradientBoostingClassifier
from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score,classification_report, confusion_matrix, make_scorer, recall_score, precision_score, f1_score
from sklearn.feature_selection import RFE
from pycaret.datasets import get_data
from pycaret.classification import setup, compare_models, predict_model
warnings.filterwarnings('ignore', category=UserWarning)

In [None]:
# code to suppress lightGBM when running Lazy Classifer
class CustomLogger:
    def __init__(self):
        self.logger = logging.getLogger('lightgbm_custom')
        self.logger.setLevel(logging.ERROR)

    def info(self, message):
        self.logger.info(message)

    def warning(self, message):
        pass# Suppress warnings by not doing anything pass

    def error(self, message):
        self.logger.error(message)
# Register the custom logger
lgb.register_logger(CustomLogger())

In [None]:
# connect to personal google drive
from google.colab import drive
drive.mount('/content/drive')

In [None]:
# define function to download file from google drive
def download_file_from_google_drive(id, destination):
    URL = "https://docs.google.com/uc?export=download"

    session = requests.Session()

    response = session.get(URL, params={'id': id}, stream=True)
    token = get_confirm_token(response)

    if token:
        params = {'id': id, 'confirm': token}
        response = session.get(URL, params=params, stream=True)

    save_response_content(response, destination)

def get_confirm_token(response):
    for key, value in response.cookies.items():
        if key.startswith('download_warning'):
            return value
    return None

def save_response_content(response, destination):
    CHUNK_SIZE = 32768

    with open(destination, "wb") as f:
        for chunk in response.iter_content(CHUNK_SIZE):
            if chunk:  # filter out keep-alive new chunks
                f.write(chunk)

file_id = '1EW-XMnGfxn-qzGtGPa3v_C63Yqj2aGf7'
destination = 'term-deposit-marketing-2020.csv'
download_file_from_google_drive(file_id, destination)
df=pd.read_csv(destination)
df.head()

In [None]:
# setup duck db connection
conn = dd.connect()

# Exploratory Data Analysis

In [None]:
# check for missing values
df.isnull().sum()

In [None]:
# check data types
df.dtypes

In [None]:
print(df.describe())
print(df.describe(include=['object']))

In [None]:
# filter df for subscribed
subscribed_df = conn.execute('''
    SELECT *
    FROM df
    WHERE y = 'yes'
      ''').fetchdf()

# get statistical summary for continuous features
print('subscribed:')
print(subscribed_df.describe())

# filter df for unsubscribed
not_subscribed_df = conn.execute('''
    SELECT *
    FROM df
    WHERE y = 'no'
      ''').fetchdf()

# get statistical summary for continuous features
print('not subscribed:')
print(not_subscribed_df.describe())

In [None]:
#  get statistical summary for categorical features
print('subscribed:')
print(subscribed_df.describe(include=object))
print('not subscribed:')
print(not_subscribed_df.describe(include=object))

In [None]:
# list unique values and counts in categorical columns
for column in ['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact']:
    unique_values = df[column].value_counts()
    print(f"Value counts in column '{column}':")
    print(unique_values)
    print("\n")

# Phase 1:  Determine customers that are more like to buy the investment product.

## Data Wrangling

In [None]:
# replace yes, no with 1 and 0
df.replace(['no', 'yes'], [0,1], inplace=True)
df.head()

In [None]:
# create a month number column
month_map = {'jan': 1, 'feb': 2, 'mar': 3, 'apr': 4, 'may': 5, 'jun': 6,
             'jul': 7, 'aug': 8, 'sep': 9, 'oct': 10, 'nov': 11, 'dec': 12}

# Apply the mapping to the 'month' column
df['month_number'] = df['month'].map(month_map)
# Remove month column
df.drop('month', axis=1, inplace=True)
df.head()

In [None]:
#split data between target 'Y' and variables 'X'
X = df.drop('y',axis=1)
Y = df['y']
print(Y)
print(X.head())

In [None]:
# Split data out customer related data
X_customer_cols = ['age', 'job', 'marital', 'education', 'default', 'balance', 'housing', 'loan']
X_customer_raw = X[X_customer_cols]
X_customer_raw.head()

## Data Visualization

In [None]:
# Replace this with your actual data and labels
y_counts = df['y'].value_counts()
labels = y_counts.index
sizes = y_counts.values


plt.figure(figsize=(8, 6))  # Adjust figure size as needed
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)  # Use autopct for percentage display
plt.title('Pie Chart of Y')

# Add a legend with total values
total = sum(sizes)
legend_labels = [f'{label}: Total = {size}' for label, size in zip(labels,sizes)]
plt.legend(legend_labels, loc="best")


plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()


#### Pie chart shows highly imbalance data with ~ 93% unsubsribed



### Subscription status by categorical features

In [None]:
plt.figure(figsize=(12, 20))
for i, col in enumerate(['job', 'marital', 'education', 'default', 'housing', 'loan']):
    plt.subplot(5, 2, i + 1)
    sns.countplot(data=df, x=col, hue='y')
    plt.title(f'Subscription status count by {col}')
    plt.xlabel(col)
    plt.ylabel('Count')
    plt.xticks(rotation=80)
plt.tight_layout()
plt.show()

### Subscription status by continous features

In [None]:
plt.figure(figsize=(12, 20))
for i, col in enumerate(['age', 'balance']):
    plt.subplot(5, 2, i + 1)
    sns.boxplot(y=col, x='y', data=df, hue='y')
    plt.xlabel('Subscribed')
    plt.ylabel(col)
    plt.xticks(rotation=80)
plt.tight_layout()
plt.show()

## Data Rebalancing and Model Selection

In [None]:
X_customer_raw

In [None]:
# convert categorical columns with one hot encoding
X_customer = pd.get_dummies(data=X_customer_raw, columns=(['job', 'marital', 'education']), drop_first=True)

# scaling of continous features
continous_cols = ['age', 'balance']
transform = preprocessing.StandardScaler()
X_customer[continous_cols] = transform.fit_transform(X_customer[continous_cols])
# convert binary values to float type
X_customer = X_customer.astype(float)
X_customer.head()

In [None]:
# convert target values to float type

Y = Y.astype(float)
Y.head()

In [None]:
# good seeds 4134
# seed = random.randint(1000,9999)
seed = 4134
print(seed)

In [None]:
# Splitting training and testing sets before rebalancing to preserve
X_train, X_test, y_train, y_test = train_test_split(X_customer,Y, test_size=0.2, random_state=seed)
print(f'X_train shape:',X_train.shape)
print(f'y_train shape:',y_train.shape)
print(f'X_test shape:',X_test.shape)
print(f'y_test shape:',y_test.shape)

### Method 1:  Random Undersampler

In [None]:
rus = RandomUnderSampler(random_state=seed)
X_rus, y_rus = rus.fit_resample(X_train, y_train)
print(f'X_train resample:',X_rus.shape)
print(f'y_train resample:',y_rus.shape)

In [None]:
# Replace this with your actual data and labels
y_rus_counts = y_rus.value_counts()
labels = y_rus_counts.index
sizes = y_rus_counts.values


plt.figure(figsize=(8, 6))  # Adjust figure size as needed
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)  # Use autopct for percentage display
plt.title('Pie Chart of Y after rebalancing')

# Add a legend with total values
total = sum(sizes)
legend_labels = [f'{label}: Total = {size}' for label, size in zip(labels,sizes)]
plt.legend(legend_labels, loc="best")


plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

### Model Comparison after rebalacing with RUS



In [None]:
# Initialize PyCaret
clf = setup(data = pd.concat([X_rus, pd.DataFrame(y_rus, columns=['y'])], axis=1), target = 'y', fold = 5, session_id=seed)

# Compare models
best_model = compare_models()

# Evaluate the best model on the test set (optional)
predict_model(best_model, data=pd.concat([X_test, pd.DataFrame(y_test, columns=['y'])], axis=1));

### Method 2:  SMOTE-ENN

In [None]:
sme = SMOTEENN(random_state=seed)
X_sme, y_sme = sme.fit_resample(X_train, y_train)
print(f'X_train resample:',X_sme.shape)
print(f'y_train resample:',y_sme.shape)

In [None]:
# Replace this with your actual data and labels
y_sme_counts = y_sme.value_counts()
labels = y_sme_counts.index
sizes = y_sme_counts.values


plt.figure(figsize=(8, 6))  # Adjust figure size as needed
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)  # Use autopct for percentage display
plt.title('Pie Chart of Y after rebalancing')

# Add a legend with total values
total = sum(sizes)
legend_labels = [f'{label}: Total = {size}' for label, size in zip(labels,sizes)]
plt.legend(legend_labels, loc="best")


plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

### Model Comparison after rebalacing with SMOTE-ENN

In [None]:
# Initialize PyCaret
clf = setup(data = pd.concat([X_sme, pd.DataFrame(y_sme, columns=['y'])], axis=1), target = 'y', fold = 5, session_id=seed)

# Compare models
best_model = compare_models()

# Evaluate the best model on the test set (optional)
predict_model(best_model, data=pd.concat([X_test, pd.DataFrame(y_test, columns=['y'])], axis=1));

### Method 3:  SMOTE-Tomek

In [None]:
smt = SMOTETomek(random_state=seed)
X_smt, y_smt = smt.fit_resample(X_train, y_train)
print(f'X_train resample:',X_smt.shape)
print(f'y_train resample:',y_smt.shape)

In [None]:
# Replace this with your actual data and labels
y_smt_counts = y_smt.value_counts()
labels = y_smt_counts.index
sizes = y_smt_counts.values


plt.figure(figsize=(8, 6))  # Adjust figure size as needed
plt.pie(sizes, labels=labels, autopct='%1.1f%%', startangle=90)  # Use autopct for percentage display
plt.title('Pie Chart of Y after rebalancing')

# Add a legend with total values
total = sum(sizes)
legend_labels = [f'{label}: Total = {size}' for label, size in zip(labels,sizes)]
plt.legend(legend_labels, loc="best")


plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle.
plt.show()

### Model Comparison after rebalacing with SMOTE-Tomek

In [None]:
# Initialize PyCaret
clf = setup(data = pd.concat([X_smt, pd.DataFrame(y_smt, columns=['y'])], axis=1), target = 'y', fold = 5, session_id=seed)

# Compare models
best_model = compare_models()

# Evaluate the best model on the test set (optional)
predict_model(best_model, data=pd.concat([X_test, pd.DataFrame(y_test, columns=['y'])], axis=1));

## Best peforming rebalancing method and model combination
####  The Extra Trees classifer using SMOTE-ENN rebalanced data had the highest pefromance with an F1-score of 0.1655.

## Model Optimatization and Evaluation

In [None]:
# define a confusion matrix plotter for visualizing classification report results
def plot_confusion_matrix(y,y_predict):
    "this function plots the confusion matrix"
    from sklearn.metrics import confusion_matrix

    cm = confusion_matrix(y, y_predict)
    ax= plt.subplot()
    sns.heatmap(cm, annot=True, ax = ax, fmt='d'); #annot=True to annotate cells
    ax.set_xlabel('Predicted labels')
    ax.set_ylabel('True labels')
    ax.set_title('Confusion Matrix');
    ax.xaxis.set_ticklabels(['not subsribed', 'subscribed']); ax.yaxis.set_ticklabels(['not subsribed', 'subscribed'])
    plt.show()

## Extra Trees Classifier with Smote-ENN re-balanced data





In [None]:
# optimizing Extra trees classfier with optuna.  Class weight set to balanced to adjust class weights.

optuna.logging.set_verbosity(optuna.logging.WARNING)
def objective(trial):
    n_estimators = trial.suggest_int("n_estimators", 50, 500)
    max_depth = trial.suggest_int("max_depth", 2, 32, log=True)
    min_samples_split = trial.suggest_int("min_samples_split", 2, 20)
    min_samples_leaf = trial.suggest_int("min_samples_leaf", 1, 20)
    max_features = trial.suggest_categorical("max_features", ['sqrt', 'log2'])

    model = ExtraTreesClassifier(
        class_weight='balanced',
        n_estimators=n_estimators,
        max_depth=max_depth,
        min_samples_split=min_samples_split,
        min_samples_leaf=min_samples_leaf,
        max_features=max_features,
        random_state=seed
    )
    model.fit(X_sme, y_sme)
    y_pred = model.predict(X_test)
    f1 = f1_score(y_test, y_pred, pos_label=1)
    return f1

sampler = optuna.samplers.TPESampler(seed=seed)
study = optuna.create_study(sampler=sampler, direction="maximize")
study.optimize(objective, n_trials=100, timeout=600)

print("Number of finished trials:", len(study.trials))
print("Best trial parameters:", study.best_trial.params)
print("Best trial value:", study.best_trial.value)

best_params_et = study.best_trial.params
et_best = ExtraTreesClassifier(**best_params_et, class_weight='balanced', random_state=seed)
et_best.fit(X_sme, y_sme)
y_pred = et_best.predict(X_test)

report = classification_report(y_test, y_pred, target_names=['Not Subscribed', 'Subscribed'])
print(report)
plot_confusion_matrix(y_test, y_pred)

## Final Model Results
#### After optimization recall of subscribers improved to 0.54 while maintaining precision of 0.11.   The model successfully captured 54% of subscribers by reaching out to 66% less customers.  If the customer quantity was maintained at 8000 calls focusing only on likely subscribers as predicted by the model, 910 subscribers could be reached compared to the current 578, a 57% increase.

In [None]:
cm = confusion_matrix(y_test, y_pred)

# Convert the confusion matrix to a DataFrame for easier handling
cm_df = pd.DataFrame(cm, columns=['Predicted Not Subscribed', 'Predicted Subscribed'],
                     index=['Actual Not Subscribed', 'Actual Subscribed'])

# Save the DataFrame to a CSV file
cm_df.to_csv('confusion_matrix.csv')



## Customer segmentation

In [None]:
X_test.head()

In [None]:
# Splitting training and testing sets before rebalancing to preserve
X_train_raw, X_test_raw, y_train_raw, y_test_raw = train_test_split(X,Y, test_size=0.2, random_state=seed)
X_test_raw.head()

In [None]:
# add y_pred to X using X index

df_pred = X_test_raw

df_pred['y_pred'] = y_pred
df_pred['y_actual'] = y_test_raw
print(df_pred.shape)
print(df_pred.head())



In [None]:
# sum call durations for predicted subscribers and total
true_pos_sub = cm[1,1]
actual_sub = df_pred['y_actual'].sum()

print(f"Predicted Subscribers: {true_pos_sub}")
print(f"Actual Subscribers: {actual_sub}")

# sum call durations for predicted subscribers and total
campaigns_pred_sub = df_pred.loc[df_pred['y_pred'] == 1, 'campaign'].sum()
campaigns_total = df_pred['campaign'].sum()

print(f"Sum of campaigns where y_pred = 1: {campaigns_pred_sub}")
print(f"Sum of campaigns for all y_pred: {campaigns_total}")

sub_rate_pred = campaigns_pred_sub/true_pos_sub
sub_rate_actual = campaigns_total/actual_sub

print(f"Campaigns per subcriber predicted: {sub_rate_pred}")
print(f"Campaigns per subscriber actual: {sub_rate_actual}")

sub_rate_improvement = (sub_rate_actual - sub_rate_pred)/sub_rate_actual
print(f"Improvement in campaigns per subscriber: {sub_rate_improvement}")



In [None]:
# add y_pred to X expanded using X index
df_pred_exp = []
df_pred_exp = X_test
df_pred_exp['campaign'] = df_pred['campaign']
df_pred_exp['age'] = df_pred['age']
df_pred_exp['balance'] = df_pred['balance']
df_pred_exp['y_pred'] = y_pred
df_pred_exp.head()

In [None]:
# filter df for subscribed
predicted_sub = conn.execute('''
    SELECT *
    FROM df_pred_exp
    WHERE y_pred = 1
      ''').fetchdf()

predicted_sub = predicted_sub.drop('y_pred', axis=1)
predicted_sub.head()


In [None]:
# Select relevant features for clustering
features_for_clustering = ['age', 'balance']  # You can add more features if needed
X_cluster = predicted_sub[features_for_clustering]

# Scale the features
scaler = StandardScaler()
X_cluster_scaled = scaler.fit_transform(X_cluster)

# Determine the optimal number of clusters using the elbow method (optional)
n_samples = X_cluster_scaled.shape[0]
inertia = []
for i in range(1, min(11, n_samples)):  # Limit clusters to less than or equal to number of samples
    kmeans = KMeans(n_clusters=i, random_state=seed)
    kmeans.fit(X_cluster_scaled)
    inertia.append(kmeans.inertia_)

plt.plot(range(1, min(11, n_samples)), inertia, marker='o') #update range for plot
plt.title('Elbow Method for Optimal K')
plt.xlabel('Number of clusters')
plt.ylabel('Inertia')
plt.show()



#### Using the elbow method, the optimal number of clusters is 3.

In [None]:
# Based on the elbow method or other analysis, choose the number of clusters
n_clusters = 4  # Replace with the optimal number of clusters

# Perform K-means clustering
kmeans = KMeans(n_clusters=n_clusters, random_state=seed)
kmeans.fit(X_cluster_scaled)

# Add cluster labels to the DataFrame
predicted_sub['cluster'] = kmeans.labels_

# show count of each
count = predicted_sub.groupby('cluster').size()
print(count)
# Select only numeric columns for calculating the mean
numeric_predicted_sub = predicted_sub.select_dtypes(include=np.number)

print(numeric_predicted_sub.groupby('cluster').mean()) # Calculate mean for numeric columns only

# You can also visualize the clusters using scatter plots or other methods.
# For example:
#sns.scatterplot(x='age', y='balance', hue='cluster', data=predicted_sub)
#plt.show()

#print(predicted_sub.head())


In [None]:
sns.countplot(data=predicted_sub, x='cluster', hue='cluster')
plt.title('Count of Each Cluster')
plt.xlabel('Cluster')
plt.ylabel('Count')
plt.show()


# Customer segments
#### 4 customer segments were identified.  With 1 segment containing only 1 customer, it will be ignored for analysis.  

#### Common between the 3 remaining cluster are majority have education secondary or above.  No personal loans or credit in default. On average have a positive balance.  It approx. 3 calls to turn the customers into subscribers.

#### Young/Unmarried:  Average age of 32 yrs, 72% single, higher rate of having a home loan, with an average balance of ~$900.

#### Older/Married:  Average age of 51 yrs, 64% married, lower rate of having a home loan, with an average balance of ~$1200.

#### Middle aged/College Educated:  Middle aged with average age of 40 yrs, ~50% are in management, 67% college educated, with an average balance of $12000.


## Cluster visualization using T-SNE

In [None]:
X_values = predicted_sub.values
# Perform t-SNE dimensionality reduction
tsne = TSNE(n_components=2, random_state=seed)
X_tsne = tsne.fit_transform(X_values)

# Create a DataFrame with t-SNE coordinates and cluster labels
tsne_df = pd.DataFrame({'x': X_tsne[:, 0], 'y': X_tsne[:, 1], 'cluster': predicted_sub['cluster']})

# Plot the t-SNE visualization, colored by cluster
plt.figure(figsize=(8, 6))
sns.scatterplot(x='x', y='y', hue='cluster', data=tsne_df, palette='viridis')
plt.title('t-SNE Visualization of Predicted Subscribers, Colored by Cluster')
plt.show()


## Cluster visualization using UMAP

In [None]:
# Apply UMAP
reducer = umap.UMAP(n_components=2, random_state=seed)  # Reduce to 2D for visualization
X_umap = reducer.fit_transform(X_values)

# Create a DataFrame with UMAP coordinates and cluster labels
umap_df = pd.DataFrame({'x': X_umap[:, 0], 'y': X_umap[:, 1], 'cluster': predicted_sub['cluster']})

# Plot the UMAP visualization, colored by cluster
plt.figure(figsize=(8, 6))
sns.scatterplot(x='x', y='y', hue='cluster', data=umap_df, palette='viridis')
plt.title('UMAP Visualization of Predicted Subscribers, Colored by Cluster')
plt.show()

## Conclusions.

#### A predictive model was developed improving the customer subscription rate from 7% to 11%, a 57% increase.  The rate of campaigns needed per subsriber improved from 40 to 26, a 35% reduction. Three distinct groups were identified as likely customers, young/unmarried, older/married, middle-aged/college educated.  Common between the groups was a high rate of education level secondary or above, positive balances, and no credit default or personal loans.