# Banking Dataset - Marketing Targets

### Content

- The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (a bank term deposit) would be subscribed ('yes') or not ('no'). The data folder contains two datasets:

    - train.csv: 45,211 rows and 18 columns ordered by date (from May 2008 to November 2010)
    - test.csv: 4,521 rows and 18 columns (10% of the examples, randomly selected from train.csv)

#### Dataset Link : https://www.kaggle.com/datasets/rashmiranu/banking-dataset-classification

In [None]:
# Import required libraries

import pandas as pd
import numpy as np
from autoviz.AutoViz_Class import AutoViz_Class
from ydata_profiling import ProfileReport
from sklearn.preprocessing import LabelEncoder
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier, AdaBoostClassifier, BaggingClassifier 


from sklearn.metrics import accuracy_score,precision_score,recall_score

from sklearn.model_selection import RandomizedSearchCV

import pickle




In [None]:
# Read the dataset

df = pd.read_csv('data/train.csv', sep=';')
df.sample(5)

In [None]:
# Checking the information about the dataset

df.info()

- There are no missing values in the above dataframe
- Let's check for duplicates

In [None]:
# check for duplicates
print(df.duplicated().sum())
# len(df[df.duplicated()])

- As no duplicate entries were found, there is no need to drop duplicates
    - df.drop_duplicates()

- The dataset looks mostly clean, so no further cleaning is needed
- Let's visualize the dataframe using automated Python libraries

In [None]:
# Automated plots for easy visualization

AV = AutoViz_Class()
profile_autoviz = AV.AutoViz('data/train.csv', sep=';', depVar='y', dfte='Data', header=0, verbose=1, lowess=False,
               chart_format='html',save_plot_dir='AutoViz_Plots')

In [None]:
# Another report-generating tool to analyze/visualize a pandas dataframe
# ydata-profiling (formerly pandas-profiling)
profile_pandas = ProfileReport(df)
profile_pandas.to_file('pandas_profile.html')

In [None]:
# Let's convert the non-numerical columns to numerical columns for better analysis
# There are multiple methods to do this (for example one-hot encoding or label encoding); label encoding is used here

le = LabelEncoder()

for col in df.columns:
    if df[col].dtype == 'object':
        df[col] = le.fit_transform(df[col])

df.info()

In [None]:
# Let us check a sample of the dataframe after performing the encoding

df.head()
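
The encoding step above reuses a single `LabelEncoder`, so only the mapping of the last column survives. If the same mappings are needed again later (for example on test.csv), one option is to keep a fitted encoder per column. The following is a minimal sketch on made-up data; the frames `train_toy` and `test_toy` and the `encoders` dictionary are hypothetical names, not part of the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frames standing in for train.csv and test.csv (hypothetical values)
train_toy = pd.DataFrame({"job": ["admin.", "services", "technician"],
                          "age": [30, 41, 52],
                          "y": ["no", "yes", "no"]})
test_toy = pd.DataFrame({"job": ["technician", "admin."],
                         "age": [28, 60],
                         "y": ["yes", "no"]})

# Columns that are not numeric are treated as categorical
text_cols = [c for c in train_toy.columns
             if not pd.api.types.is_numeric_dtype(train_toy[c])]

# Fit one encoder per text column and keep it for later reuse
encoders = {}
for col in text_cols:
    encoders[col] = LabelEncoder()
    train_toy[col] = encoders[col].fit_transform(train_toy[col])

# Apply the already fitted encoders to new data; transform() raises a
# ValueError if it meets a category that was not present during fitting
for col, enc in encoders.items():
    test_toy[col] = enc.transform(test_toy[col])

print(test_toy)
```

Because `LabelEncoder` sorts the classes, the codes are assigned alphabetically within each column.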

In [None]:
# Let us try to understand the distribution of the dataset using the describe function

df.describe().transpose()

- Columns such as balance, housing, contact, and duration are not normally distributed, judging by the gap between their mean and median
- So let us standardize the features with z-score scaling
- Before scaling, let us split the dataframe into train and validation sets

In [None]:
independent_features = df.drop('y', axis=1)
target_feature = df[['y']]

target_feature.value_counts()

- The target feature is highly imbalanced, so SMOTE is applied to balance the dataset
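
To make the idea behind SMOTE concrete, here is a simplified NumPy sketch of its core step: a synthetic minority row is placed at a random point on the segment between a minority sample and one of its nearest minority neighbours. The function `smote_like` is a hypothetical illustration and not the imbalanced-learn implementation.

```python
import numpy as np

def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)

    # Pairwise Euclidean distances between minority samples
    diff = minority[:, None, :] - minority[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)              # a point is not its own neighbour
    neighbours = np.argsort(dist, axis=1)[:, :k]

    synthetic = np.empty((n_new, minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(minority))      # random minority sample
        nb = rng.choice(neighbours[base])       # one of its k nearest neighbours
        gap = rng.random()                      # position along the segment, in [0, 1)
        synthetic[i] = minority[base] + gap * (minority[nb] - minority[base])
    return synthetic

# Made-up minority class with two features
minority_toy = np.array([[1.0, 1.0], [1.5, 1.2], [2.0, 1.1], [1.2, 1.8]])
new_rows = smote_like(minority_toy, n_new=6)
print(new_rows.shape)
```

Since the new rows are interpolated, label-encoded categorical columns can end up with fractional values after resampling, which is worth keeping in mind when interpreting the resampled dataframe.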

In [None]:
count = target_feature.value_counts()
print('Before sampling: \n', count)

smote = SMOTE()
independent_features_sampled, target_feature_sampled = smote.fit_resample(independent_features, target_feature)

count_sam = target_feature_sampled.value_counts()
print('After sampling: \n', count_sam)

In [None]:
# Train/validation split (as the dataset is not very large, the validation set is 15% of the resampled dataframe)
# After SMOTE the classes are balanced, so stratify is not strictly necessary; it is kept so both splits stay exactly balanced

X_train, X_val, y_train, y_val = train_test_split(independent_features_sampled, target_feature_sampled, 
                                                  test_size=0.15, 
                                                  random_state=42, 
                                                  shuffle=True, 
                                                  stratify=target_feature_sampled)

print(X_train.shape, X_val.shape)
print(y_train.shape, y_val.shape)

In [None]:
# Let us scale the features using z-score standardization
sc = StandardScaler()
X_train_sc = sc.fit_transform(X_train) # Equivalent to (X_train - X_train.mean()) / X_train.std(ddof=0)
X_val_sc = sc.transform(X_val)         # Reuse the training mean and std; do not refit on validation data

# Checking the max and min value of the scaled arrays
print(X_train_sc.max(), X_train_sc.min())         # Gives the max and min values of all the features combined
print(X_val_sc.max(), X_val_sc.min())


In [None]:
# Sample of scaled series
X_train_sc
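
A common convention is to estimate the scaling statistics on the training rows only and to reuse them for every other split. The sketch below checks this by hand on random data; `train_toy` and `val_toy` are hypothetical arrays, not the notebook's splits.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
train_toy = rng.normal(loc=50, scale=10, size=(200, 2))
val_toy = rng.normal(loc=50, scale=10, size=(50, 2))

scaler = StandardScaler().fit(train_toy)      # statistics come from the training rows only
train_scaled = scaler.transform(train_toy)
val_scaled = scaler.transform(val_toy)        # same mean and std reused

# StandardScaler divides by the population standard deviation (ddof=0)
manual = (val_toy - train_toy.mean(axis=0)) / train_toy.std(axis=0)
print(np.allclose(val_scaled, manual))
```

With this approach the validation columns will generally not have exactly zero mean and unit variance, which is expected.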

- Now that the features are scaled, we can fit the algorithms on the dataset
- As it is a binary classification problem, we will start with logistic regression and then try the other classification algorithms

In [None]:
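def get_metrics(y_true, y_pred):
    """Return accuracy, precision and recall (positive class = 1) rounded to two decimals."""
    acc = accuracy_score(y_true, y_pred)
    # average='micro' would make precision and recall identical to accuracy for a
    # single-label problem, so the default binary averaging is used instead
    prec = precision_score(y_true, y_pred)
    recall = recall_score(y_true, y_pred)
    return {'accuracy': round(acc, 2), 'precision': round(prec, 2), 'recall': round(recall, 2)}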
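
For a single-label problem, micro-averaged precision and recall coincide with accuracy, whereas the binary variants describe the positive class only. The following check on made-up labels illustrates the difference; `y_true` and `y_pred` here are hypothetical lists, not the notebook's predictions.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]

acc = accuracy_score(y_true, y_pred)
micro_prec = precision_score(y_true, y_pred, average="micro")
micro_rec = recall_score(y_true, y_pred, average="micro")
binary_prec = precision_score(y_true, y_pred)   # default average="binary", pos_label=1
binary_rec = recall_score(y_true, y_pred)

print(acc, micro_prec, micro_rec)   # the three values are identical
print(binary_prec, binary_rec)      # computed for class 1 only
```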

In [None]:
# Model Building

# Logistic Regression
LG = LogisticRegression()
LG.fit(X_train_sc, y_train)
y_pred_lg = LG.predict(X_val_sc)

# Gaussian NaiveBayes
NB = GaussianNB()
NB.fit(X_train_sc, y_train)
y_pred_nb = NB.predict(X_val_sc)

# Support Vector Classifier
SVM = SVC(C=0.8, kernel='linear', probability=True)
SVM.fit(X_train_sc, y_train)
y_pred_svm = SVM.predict(X_val_sc)

# KNN
n = list(np.arange(3,20,2))
acc = []
for k in n:
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(X_train_sc, y_train)
    # predict the response
    y_pred = knn.predict(X_val_sc)
    # evaluate accuracy
    scores = accuracy_score(y_val, y_pred)
    acc.append(scores)
# changing to misclassification error
MSE = [1 - x for x in acc]
# determining the best 'k' value
optimal_k = n[MSE.index(min(MSE))]
# Training with optimal 'k'
KNN = KNeighborsClassifier(n_neighbors=optimal_k)
KNN.fit(X_train_sc, y_train)
y_pred_knn = KNN.predict(X_val_sc)

# Decision Tree Classifier
DT = DecisionTreeClassifier()
DT.fit(X_train_sc, y_train)
y_pred_dt = DT.predict(X_val_sc)

# Bagging Classifier
BG = BaggingClassifier()
BG.fit(X_train_sc, y_train)
y_pred_bg = BG.predict(X_val_sc)

# Random Forest Classifier
RF = RandomForestClassifier()
RF.fit(X_train_sc, y_train)
y_pred_rf = RF.predict(X_val_sc)

# Gradient Boosting Classifier
GB = GradientBoostingClassifier()
GB.fit(X_train_sc, y_train)
y_pred_gb = GB.predict(X_val_sc)

# Ada Boost Classifier
AB = AdaBoostClassifier()
AB.fit(X_train_sc, y_train)
y_pred_ab = AB.predict(X_val_sc)
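
The fit and predict pattern above repeats for every model, so it could also be written as a loop over a dictionary of estimators. This is a sketch on synthetic data with only three of the models; `X_tr`, `X_va`, `models` and `predictions` are hypothetical names.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the scaled training and validation arrays
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X, y, test_size=0.15, random_state=42, stratify=y)

models = {
    "LG": LogisticRegression(max_iter=1000),
    "NB": GaussianNB(),
    "DT": DecisionTreeClassifier(random_state=0),
}

# Fit each model and keep its validation predictions under the same key
predictions = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    predictions[name] = model.predict(X_va)
    print(name, round(accuracy_score(y_va, predictions[name]), 2))
```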


In [None]:
# Get the metrics for all the defined models

LG_model = get_metrics(y_val, y_pred_lg)
NB_model = get_metrics(y_val, y_pred_nb)
SVM_model = get_metrics(y_val, y_pred_svm)
KNN_model = get_metrics(y_val, y_pred_knn)
DT_model = get_metrics(y_val, y_pred_dt)
BG_model = get_metrics(y_val, y_pred_bg)
RF_model = get_metrics(y_val, y_pred_rf)
GB_model = get_metrics(y_val, y_pred_gb)
AB_model = get_metrics(y_val, y_pred_ab)

model_metrics = {'Model_Name': ['LG_model', 'NB_model', 'SVM_model', 'KNN_model', 'DT_model', 'BG_model', 'RF_model', 'GB_model', 'AB_model'],
                 'Model_Metrics': [LG_model, NB_model, SVM_model, KNN_model, DT_model, BG_model, RF_model, GB_model, AB_model]}

model_df = pd.DataFrame(model_metrics)

model_df
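
Storing each metrics dictionary in a single column makes the table hard to sort. One alternative is to expand the dictionaries into one column per metric. The sketch below uses placeholder numbers only; the model names and values in `results` are hypothetical and are not results from this notebook.

```python
import pandas as pd

# Placeholder metric values, only to show the reshaping
results = {
    "model_a": {"accuracy": 0.80, "precision": 0.78, "recall": 0.83},
    "model_b": {"accuracy": 0.90, "precision": 0.89, "recall": 0.92},
    "model_c": {"accuracy": 0.75, "precision": 0.70, "recall": 0.85},
}

# One row per model, one column per metric, highest accuracy first
results_df = (pd.DataFrame.from_dict(results, orient="index")
                .sort_values("accuracy", ascending=False))
print(results_df)
```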

In [None]:
# Out of all the methods, the Random Forest Classifier performs best on the pre-processed dataset
# Let us apply some hyperparameter tuning to the Random Forest model

# Hyper parameter tuning

def hyper_parameter_tuning(X_train, y_train):
    # define random parameters grid
    n_estimators = [5,21,51,101] # number of trees in the random forest
    max_features = ['sqrt', 'log2'] # number of features considered at every split ('auto' was removed in scikit-learn 1.3)
    max_depth = [int(x) for x in np.linspace(10, 120, num = 12)] # maximum number of levels allowed in each decision tree
    min_samples_split = [2, 6, 10] # minimum sample number to split a node
    min_samples_leaf = [1, 3, 4] # minimum sample number that can be stored in a leaf node
    bootstrap = [True, False] # method used to sample data points

    random_grid = {'n_estimators': n_estimators,
                    'max_features': max_features,
                    'max_depth': max_depth,
                    'min_samples_split': min_samples_split,
                    'min_samples_leaf': min_samples_leaf,
                    'bootstrap': bootstrap
                  }
    
    classifier = RandomForestClassifier()
    model_tuning = RandomizedSearchCV(estimator = classifier, param_distributions = random_grid,
                   n_iter = 100, cv = 5, verbose=2, random_state=35, n_jobs = 1)
    model_tuning.fit(X_train, y_train)

    print ('Random grid: ', random_grid, '\n')
    # print the best parameters
    print ('Best Parameters: ', model_tuning.best_params_, ' \n')

    best_params = model_tuning.best_params_
    
    n_estimators = best_params['n_estimators']
    min_samples_split = best_params['min_samples_split']
    min_samples_leaf = best_params['min_samples_leaf']
    max_features = best_params['max_features']
    max_depth = best_params['max_depth']
    bootstrap = best_params['bootstrap']
    
    model_tuned = RandomForestClassifier(n_estimators = n_estimators, min_samples_split = min_samples_split,
                                         min_samples_leaf= min_samples_leaf, max_features = max_features,
                                         max_depth= max_depth, bootstrap=bootstrap) 
    model_tuned.fit( X_train, y_train)

    return model_tuned,best_params

In [None]:
Tuned_model, best_params = hyper_parameter_tuning(X_train_sc, y_train)
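
As a side note, `RandomizedSearchCV` refits the best configuration on the full training data by default (`refit=True`), so the tuned model is also available as `best_estimator_`. The following is a reduced sketch on synthetic data with a much smaller search space than the one above; the parameter values are chosen only to keep it fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

param_distributions = {
    "n_estimators": [5, 11, 21],
    "max_depth": [3, 5, None],
    "min_samples_leaf": [1, 3],
}

search = RandomizedSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_distributions=param_distributions,
    n_iter=5, cv=3, random_state=35, n_jobs=1,
)
search.fit(X, y)

# With refit=True (the default) the best configuration is already refitted on all of X, y
best_model = search.best_estimator_
print(search.best_params_, round(search.best_score_, 3))
```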

In [None]:
# Predicting the output with the tuned model

y_pred = Tuned_model.predict(X_val_sc)
get_metrics(y_val, y_pred)

In [None]:
# Save the model to a file
with open('model/tuned_model.pkl', 'wb') as file:
    pickle.dump(Tuned_model, file)
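
The saved model expects inputs that were pre-processed the same way as during training, so it can help to store the fitted scaler together with the model. This is a minimal sketch, assuming a dictionary bundle and a temporary file; the names `bundle` and `bundle.pkl` are hypothetical.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Made-up data and a small model standing in for the tuned classifier
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(100, 3))
y_toy = (X_toy[:, 0] > 0).astype(int)

scaler = StandardScaler().fit(X_toy)
model = LogisticRegression().fit(scaler.transform(X_toy), y_toy)

# Keep the fitted scaler next to the model so inference uses the same statistics
bundle = {"scaler": scaler, "model": model}

path = os.path.join(tempfile.mkdtemp(), "bundle.pkl")   # hypothetical file name
with open(path, "wb") as file:
    pickle.dump(bundle, file)

with open(path, "rb") as file:
    restored = pickle.load(file)

# The restored pair should reproduce the original predictions
same = np.array_equal(
    model.predict(scaler.transform(X_toy)),
    restored["model"].predict(restored["scaler"].transform(X_toy)),
)
print(same)
```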

In [None]:
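# Load the test dataset and then pre-process it before making the prediction

df_test = pd.read_csv('data/test.csv', sep=';')

# Label Encoding
# A LabelEncoder must be fitted before transform() can be called. Fitting per column here
# reproduces the training codes only if every category also occurs in test.csv
for col in df_test.columns:
    if df_test[col].dtype == 'object':
        df_test[col] = LabelEncoder().fit_transform(df_test[col])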

# Splitting the dataframe into features and target
X_test = df_test.drop('y', axis=1)
y_test = df_test[['y']]

# Scaling
# Reuse the scaler fitted earlier instead of fitting a new one on the test data
X_test_sc = sc.transform(X_test)

In [None]:
# Loading the saved model
with open('model/tuned_model.pkl', 'rb') as file:
    loaded_model = pickle.load(file)

y_pred = loaded_model.predict(X_test_sc)
get_metrics(y_test, y_pred)
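
Since test.csv is described as a random sample of train.csv, the score above may be optimistic: the model could have seen the same rows during training. One way to check the overlap is an inner join on all shared columns. The sketch below uses toy frames; `train_toy` and `test_toy` are hypothetical stand-ins for the raw tables.

```python
import pandas as pd

# Toy frames; in the notebook these would be the raw train.csv and test.csv tables
train_toy = pd.DataFrame({"age": [30, 41, 52, 28], "job": ["a", "b", "c", "a"]})
test_toy = pd.DataFrame({"age": [41, 60], "job": ["b", "c"]})

# An inner join on all shared columns keeps only test rows that also appear in train
overlap = test_toy.merge(train_toy.drop_duplicates(), how="inner")
share = len(overlap) / len(test_toy)
print(f"{share:.0%} of the test rows also occur in the training data")
```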