# Banking Dataset - Marketing Targets

### Content

- The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact to the same client was required, in order to access if the product (bank term deposit) would be ('yes') or not ('no') subscribed by the customer or not. The data folder contains two datasets:-

    - train.csv: 45,211 rows and 18 columns ordered by date (from May 2008 to November 2010)
    - test.csv: 4521 rows and 18 columns with 10% of the examples (4521), randomly selected from train.csv

#### Dataset Link : https://www.kaggle.com/datasets/rashmiranu/banking-dataset-classification

In [4]:
# Import required libraries

import pandas as pd
import numpy as np
from autoviz.AutoViz_Class import AutoViz_Class
from ydata_profiling import ProfileReport
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier 
from sklearn.metrics import accuracy_score,precision_score,recall_score
from sklearn.model_selection import RandomizedSearchCV
import pickle

In [5]:
# Read the dataset

df = pd.read_csv('data/train.csv', sep=';')
df.sample(5)

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
35484,35,blue-collar,married,primary,no,1033,yes,no,cellular,7,may,244,2,-1,0,unknown,no
39114,36,services,single,secondary,no,165,yes,no,cellular,18,may,182,2,-1,0,unknown,no
29906,45,entrepreneur,married,secondary,no,242,no,yes,cellular,4,feb,510,1,198,4,failure,no
10073,49,blue-collar,married,primary,no,238,no,no,unknown,11,jun,204,1,-1,0,unknown,no
22086,36,management,single,tertiary,no,1246,no,no,cellular,21,aug,63,2,-1,0,unknown,no


In [6]:
# Checking the information about the dataset

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


- There are no missing values in the above dataframe
- Lets check for duplicates

In [7]:
# check for duplicates
print(df.duplicated().sum())
# len(df[df.duplicated()])

0


- As there are no duplicate entries found there is not need to drop the duplicates
    - df.drop_duplicates()

- The above dataset looks mostly clean, so need for further cleaning of dataset
- Lets try to visualize the dataframe using automated python libraries

In [8]:
# Automated plots for easy visuzlization

AV = AutoViz_Class()
profile_autoviz = AV.AutoViz('data/train.csv', sep=';', depVar='y', dfte='Data', header=0, verbose=1, lowess=False,
               chart_format='html',save_plot_dir='AutoViz_Plots')

Shape of your Data Set loaded: (45211, 17)
#######################################################################################
######################## C L A S S I F Y I N G  V A R I A B L E S  ####################
#######################################################################################
Classifying variables in data set...
    Number of Numeric Columns =  0
    Number of Integer-Categorical Columns =  7
    Number of String-Categorical Columns =  6
    Number of Factor-Categorical Columns =  0
    Number of String-Boolean Columns =  3
    Number of Numeric-Boolean Columns =  0
    Number of Discrete String Columns =  0
    Number of NLP String Columns =  0
    Number of Date Time Columns =  0
    Number of ID Columns =  0
    Number of Columns to Delete =  0
    16 Predictors classified...
        No variables removed since no ID or low-information variables found in data set

################ Binary_Classification problem #####################




Saving scatterplots in HTML format
  0%|          | 0/7 [00:00<?, ?it/s]



 14%|█▍        | 1/7 [00:00<00:04,  1.27it/s]



 29%|██▊       | 2/7 [00:01<00:04,  1.22it/s]



 43%|████▎     | 3/7 [00:02<00:03,  1.24it/s]



 57%|█████▋    | 4/7 [00:03<00:02,  1.26it/s]



 71%|███████▏  | 5/7 [00:03<00:01,  1.26it/s]



 86%|████████▌ | 6/7 [00:04<00:00,  1.26it/s]



                                             





Saving pair_scatters in HTML format
                                               



Saving distplots_cats in HTML format
                                             

Saving distplots_nums in HTML format
                                             

KDE plot is erroring due to problems with DynamicMaps. Hence it is skipped


Saving violinplots in HTML format


No date vars could be found in data set


Saving heatmaps in HTML format


Saving cat_var_plots in HTML format
                                               



Time to run AutoViz (in seconds) = 134


In [9]:
# Another report generating tool to analyze/visualize the reports in pandas
# Pandas profiling
profile_pandas = ProfileReport(df)
profile_pandas.to_file('pandas_profile.html')

Summarize dataset: 100%|██████████| 75/75 [00:03<00:00, 21.64it/s, Completed]                  
Generate report structure: 100%|██████████| 1/1 [00:01<00:00,  1.12s/it]
Render HTML: 100%|██████████| 1/1 [00:00<00:00,  2.02it/s]
Export report to file: 100%|██████████| 1/1 [00:00<00:00, 193.35it/s]


In [10]:
# Lets try to convert the non-numerical columns to numerical columns for better analysis
# There are multiple methods to convert the non-numerical columns to numerical columns (one-hot encoding, label-encoding)

le = LabelEncoder()

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = le.fit_transform(df[col])

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype
---  ------     --------------  -----
 0   age        45211 non-null  int64
 1   job        45211 non-null  int64
 2   marital    45211 non-null  int64
 3   education  45211 non-null  int64
 4   default    45211 non-null  int64
 5   balance    45211 non-null  int64
 6   housing    45211 non-null  int64
 7   loan       45211 non-null  int64
 8   contact    45211 non-null  int64
 9   day        45211 non-null  int64
 10  month      45211 non-null  int64
 11  duration   45211 non-null  int64
 12  campaign   45211 non-null  int64
 13  pdays      45211 non-null  int64
 14  previous   45211 non-null  int64
 15  poutcome   45211 non-null  int64
 16  y          45211 non-null  int64
dtypes: int64(17)
memory usage: 5.9 MB


In [11]:
# Let us check a sample of the dataframe after performing the encoding

df.head()

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3,0
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3,0
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3,0
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3,0
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3,0


In [12]:
# Let us try to understand the distribution of the dataset using the describe function

df.describe().transpose()

Unnamed: 0,count,mean,std,min,25%,50%,75%,max
age,45211.0,40.93621,10.618762,18.0,33.0,39.0,48.0,95.0
job,45211.0,4.339762,3.272657,0.0,1.0,4.0,7.0,11.0
marital,45211.0,1.167725,0.60823,0.0,1.0,1.0,2.0,2.0
education,45211.0,1.224813,0.747997,0.0,1.0,1.0,2.0,3.0
default,45211.0,0.018027,0.133049,0.0,0.0,0.0,0.0,1.0
balance,45211.0,1362.272058,3044.765829,-8019.0,72.0,448.0,1428.0,102127.0
housing,45211.0,0.555838,0.496878,0.0,0.0,1.0,1.0,1.0
loan,45211.0,0.160226,0.36682,0.0,0.0,0.0,0.0,1.0
contact,45211.0,0.640242,0.897951,0.0,0.0,0.0,2.0,2.0
day,45211.0,15.806419,8.322476,1.0,8.0,16.0,21.0,31.0


- Columns like balance, housing, contact, duration is not normally distributed based on the describe function and also on the automized plots
- So let us try to perform zscalar method for scaling the dataframe
- Before scaling, let us split the dataframe to train and validation

In [13]:
independent_features = df.drop('y', axis=1)
target_feature = df[['y']]

target_feature.value_counts()

y
0    39922
1     5289
Name: count, dtype: int64

- Target feature is highly imbalanced, so applying smote for balancing the dataset

In [14]:
count = target_feature.value_counts()
print('Before sampling: \n', count)

smote = SMOTE()
independent_features_sampled, target_feature_sampled = smote.fit_resample(independent_features, target_feature)

count_sam = target_feature_sampled.value_counts()
print('After sampling: \n', count_sam)

Before sampling: 
 y
0    39922
1     5289
Name: count, dtype: int64
After sampling: 
 y
0    39922
1    39922
Name: count, dtype: int64


In [15]:
# Train_test_split (As the dataset is not very large, let us take the test_size to be 15% of the entire dataframe)
# As we have sampled, startify is not necessary

X_train, X_val, y_train, y_val = train_test_split(independent_features_sampled, target_feature_sampled, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  shuffle=True, 
                                                  stratify=target_feature_sampled)

print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)

(67867, 16) (11977, 16)
(67867, 1) (11977, 1)


In [16]:
# Let us scale the features using z-scalar technique
sc = StandardScaler() 
X_train_sc = sc.fit_transform(X_train) # Equivalent to X_train_ = (X_train - X_train.mean()) / X_train.std()
X_val_sc = sc.fit_transform(X_val)

# Checking the max and min value of the series
print(X_train_sc.max(), X_train_sc.min())         # Gives the max and min values of all the features combined
print(X_val_sc.max(), X_val_sc.min())


120.73224665568677 -3.0234804242900983
33.18048272312725 -2.344391283168157


- Now that we have scaled, we can apply the dataset on the algorithm
- As it is a binary classification problem we will start with logistic regression and try all the classification algorithms

In [17]:
def get_metrics(y_true, y_pred):
    acc = accuracy_score(y_true, y_pred)
    prec = precision_score(y_true, y_pred, average='micro')
    recall = recall_score(y_true, y_pred, average='micro')
    return {'accuracy': round(acc, 2), 'precision': round(prec, 2), 'recall': round(recall, 2)}

In [18]:
# Model Building

# Logistic Regression
LG = LogisticRegression()
LG.fit(X_train_sc, y_train)
y_pred_lg = LG.predict(X_val_sc)

# Gaussian NaiveBayes
NB = GaussianNB()
NB.fit(X_train_sc, y_train)
y_pred_nb = NB.predict(X_val_sc)

# # Support Vector Classifier
SVM = SVC(C=0.8, kernel='linear', probability=True)
SVM.fit(X_train_sc, y_train)
y_pred_svm = SVM.predict(X_val_sc)

# KNN
n = list(np.arange(3,20,2))
acc = []
for k in n:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_sc, y_train)
    # predict the response
    y_pred = knn.predict(X_val_sc)
    # evaluate accuracy
    scores = accuracy_score(y_val, y_pred)
    acc.append(scores)
# changing to misclassification error
MSE = [1 - x for x in acc]
# determining the best 'k' value
optimal_k = n[MSE.index(min(MSE))]
# Training with optimal 'k'
KNN = KNeighborsClassifier(n_neighbors=optimal_k)
KNN.fit(X_train_sc, y_train)
y_pred_knn = KNN.predict(X_val_sc)

# Decision Tree Classifier
DT = DecisionTreeClassifier()
DT.fit(X_train_sc, y_train)
y_pred_dt = DT.predict(X_val_sc)

# Bagging Classifier
BG = BaggingClassifier()
BG.fit(X_train_sc, y_train)
y_pred_bg = BG.predict(X_val_sc)

# Random Forest Classifier
RF = RandomForestClassifier()
RF.fit(X_train_sc, y_train)
y_pred_rf = RF.predict(X_val_sc)

# Gradient Boosting Classifier
GB = GradientBoostingClassifier()
GB.fit(X_train_sc, y_train)
y_pred_gb = GB.predict(X_val_sc)

# Ada Boost Classifier
AB = AdaBoostClassifier()
AB.fit(X_train_sc, y_train)
y_pred_ab = AB.predict(X_val_sc)


In [28]:
# Get the metrics for all the defined models

LG_model = get_metrics(y_val, y_pred_lg)
NB_model = get_metrics(y_val, y_pred_nb)
SVM_model = get_metrics(y_val, y_pred_svm)
KNN_model = get_metrics(y_val, y_pred_knn)
DT_model = get_metrics(y_val, y_pred_dt)
BG_model = get_metrics(y_val, y_pred_bg)
RF_model = get_metrics(y_val, y_pred_rf)
GB_model = get_metrics(y_val, y_pred_gb)
AB_model = get_metrics(y_val, y_pred_ab)

model_metrics = {'Model_Name': ['LG_model', 'NB_model', 'SVM_model', 'KNN_model', 'DT_model', 'BG_model', 'RF_model', 'GB_model', 'AB_model'],
                 'Model_Classifier': ['LogisticRegression', 'GaussianNB', 'SVC', 'KNeighborsClassifier', 'DecisionTreeClassifier', 'BaggingClassifier',
                                      'RandomForestClassifier', 'GradientBoostingClassifier', 'AdaBoostClassifier'],
                 'Model_Metrics': [LG_model, NB_model, SVM_model, KNN_model, DT_model, BG_model, RF_model, GB_model, AB_model]}

model_df = pd.DataFrame(model_metrics)

model_df

Unnamed: 0,Model_Name,Model_Classifier,Model_Metrics
0,LG_model,LogisticRegression,"{'accuracy': 0.85, 'precision': 0.85, 'recall': 0.85}"
1,NB_model,GaussianNB,"{'accuracy': 0.74, 'precision': 0.74, 'recall': 0.74}"
2,SVM_model,SVC,"{'accuracy': 0.85, 'precision': 0.85, 'recall': 0.85}"
3,KNN_model,KNeighborsClassifier,"{'accuracy': 0.9, 'precision': 0.9, 'recall': 0.9}"
4,DT_model,DecisionTreeClassifier,"{'accuracy': 0.88, 'precision': 0.88, 'recall': 0.88}"
5,BG_model,BaggingClassifier,"{'accuracy': 0.91, 'precision': 0.91, 'recall': 0.91}"
6,RF_model,RandomForestClassifier,"{'accuracy': 0.93, 'precision': 0.93, 'recall': 0.93}"
7,GB_model,GradientBoostingClassifier,"{'accuracy': 0.9, 'precision': 0.9, 'recall': 0.9}"
8,AB_model,AdaBoostClassifier,"{'accuracy': 0.88, 'precision': 0.88, 'recall': 0.88}"


In [26]:
# Extract accuracy values from dictionaries
model_df['accuracy'] = model_df['Model_Metrics'].apply(lambda x: x.get('accuracy'))

# Find the model with the highest accuracy
best_model_index = model_df['accuracy'].idxmax()
best_model_name = model_df.loc[best_model_index, 'Model_Name']
best_model_classifier = model_df.loc[best_model_index, 'Model_Classifier']
best_accuracy = model_df['accuracy'].max()

print(f"Model with highest accuracy: {best_model_name}, {best_model_classifier}, (Accuracy: {best_accuracy:.2f})")

Model with highest accuracy: RF_model, RandomForestClassifier, (Accuracy: 0.93)


In [None]:
# Out of all the methods Random Forest Classifier performs better on the pre-processed dataset
# Let us apply some hyper parameter tuning on the Random Forest technique

# Hyper parameter tuning

def hyper_parameter_tuning(X_train, y_train):
    # define random parameters grid
    n_estimators = [5,21,51,101] # number of trees in the random forest
    max_features = ['auto', 'sqrt'] # number of features in consideration at every split
    max_depth = [int(x) for x in np.linspace(10, 120, num = 12)] # maximum number of levels allowed in each decision tree
    min_samples_split = [2, 6, 10] # minimum sample number to split a node
    min_samples_leaf = [1, 3, 4] # minimum sample number that can be stored in a leaf node
    bootstrap = [True, False] # method used to sample data points

    random_grid = {'n_estimators': n_estimators,
                    'max_features': max_features,
                    'max_depth': max_depth,
                    'min_samples_split': min_samples_split,
                    'min_samples_leaf': min_samples_leaf,
                    'bootstrap': bootstrap
                  }
    
    classifier = RandomForestClassifier()
    model_tuning = RandomizedSearchCV(estimator = classifier, param_distributions = random_grid,
                   n_iter = 100, cv = 5, verbose=2, random_state=35, n_jobs = 1)
    model_tuning.fit(X_train, y_train)

    print ('Random grid: ', random_grid, '\n')
    # print the best parameters
    print ('Best Parameters: ', model_tuning.best_params_, ' \n')

    best_params = model_tuning.best_params_
    
    n_estimators = best_params['n_estimators']
    min_samples_split = best_params['min_samples_split']
    min_samples_leaf = best_params['min_samples_leaf']
    max_features = best_params['max_features']
    max_depth = best_params['max_depth']
    bootstrap = best_params['bootstrap']
    
    model_tuned = RandomForestClassifier(n_estimators = n_estimators, min_samples_split = min_samples_split,
                                         min_samples_leaf= min_samples_leaf, max_features = max_features,
                                         max_depth= max_depth, bootstrap=bootstrap) 
    model_tuned.fit( X_train, y_train)

    return model_tuned,best_params

In [None]:
Tuned_model, best_params = hyper_parameter_tuning(X_train_sc, y_train)

In [None]:
# Predicting the output with the tuned model

y_pred = Tuned_model.predict(X_val_sc)
get_metrics(y_val, y_pred)

In [None]:
# Save the model to a file
with open('model/tuned_model.pkl', 'wb') as file:
    pickle.dump(Tuned_model, file)

In [None]:
# Load the test dataset and then pre=process it before making the prediction

df_test = pd.read_csv('data/test.csv', sep=';')

# Label Encoding
le = LabelEncoder()
for col in df_test.columns:
    if df_test[col].dtype == 'object':
        df_test[col] = le.transform(df_test[col])

# Convering the dataframe to series
X_test = df_test.drop('y', axis=1)
y_test = df_test[['y']]

# Scaling
# Let us scale the features using z-scalar technique
sc = StandardScaler() 
X_test_sc = sc.fit_transform(X_test) # Equivalent to X_train_ = (X_train - X_train.mean()) / X_train.std()

In [None]:
# Loading the saved model
with open('model/tuned_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

y_pred = loaded_model.predict(X_test_sc)
get_metrics(y_test, y_pred)