## Bank Marketing Campaign
## Specialization: Data Science
## Data Glacier Virtual Internship
### Presented by the Greeks
### Galanakis Michalis, Konioris Aggelos, Moysiadis Giorgos

#### First, we import all the libraries used in this notebook

In [1]:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, recall_score
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.neighbors import KNeighborsClassifier as kNN
from sklearn.preprocessing import LabelEncoder, MinMaxScaler
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

import warnings
warnings.filterwarnings('ignore')

# for reproducibility
SEED = 0

#### Before starting the model-building analysis, we rerun the code from the previous assignment that handles missing values

In [2]:
# The file is semicolon-delimited
bank_additional_full = pd.read_csv("bank-additional-full.csv", delimiter = ';')
df = bank_additional_full
# Treat the literal string 'unknown' as a missing value
df.replace('unknown', np.nan, inplace = True)
# Fill these three columns with their most frequent value
# (assigning back avoids relying on chained in-place modification)
for column in ['loan', 'marital', 'default']:
    df[column] = df[column].fillna(df[column].value_counts().index[0])

def na_randomfill(column):
    # Fill the remaining missing values by sampling from the column's observed values
    na = pd.isnull(column)
    number_null = na.sum()
    if number_null == 0:
        return column
    fill_values = column[~na].sample(n = number_null, replace = True, random_state = SEED)
    # Align the sampled values with the positions that were missing
    fill_values.index = column.index[na]
    return column.fillna(fill_values)

df = df.apply(na_randomfill)
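To see what random-sample imputation does to a single column, here is a minimal sketch on a toy Series. The data and the helper name `random_fill` are made up for illustration; the logic mirrors `na_randomfill` above.

```python
import numpy as np
import pandas as pd

def random_fill(column, seed=0):
    """Fill missing entries with values sampled from the observed entries."""
    na = column.isna()
    n_missing = int(na.sum())
    if n_missing == 0:
        return column
    # Sample with replacement so the fill works even when few values are observed.
    fill_values = column[~na].sample(n=n_missing, replace=True, random_state=seed)
    # Re-index the sampled values onto the positions that were missing.
    fill_values.index = column.index[na]
    return column.fillna(fill_values)

# Toy column with two missing entries (made-up data).
toy = pd.Series(["yes", np.nan, "no", "no", np.nan, "no"], name="housing")
filled = random_fill(toy)

print(filled.isna().sum())                    # count of remaining missing values
print(set(filled.unique()) <= {"yes", "no"})  # fills come only from observed values
```

Because the fill values are drawn from the observed entries, the imputed column roughly keeps the original category proportions, which filling with the most frequent value does not.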

#### This step encodes the categorical variables as integers so that the models can use them

In [3]:
obj_column = df.dtypes[df.dtypes == 'object'].index

labelencoder_X = LabelEncoder()
for column in obj_column:
    df[column] = labelencoder_X.fit_transform(df[column])

df.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,...,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
0,56,3,1,0,0,0,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
1,57,7,1,3,0,0,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
2,37,7,1,3,0,1,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
3,40,0,1,1,0,0,0,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
4,56,7,1,3,0,0,1,1,6,1,...,1,999,0,1,1.1,93.994,-36.4,4.857,5191.0,0
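`LabelEncoder` assigns integer codes by the sorted order of the labels, so the codes identify categories but imply no meaningful ranking. The loop above reuses a single encoder, so only the last column's mapping is retained afterwards. The sketch below, on a made-up two-column frame, keeps one encoder per column in a hypothetical `encoders` dict so that codes can be mapped back to labels.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up data for illustration only.
toy = pd.DataFrame({
    "marital": ["married", "single", "divorced", "married"],
    "contact": ["telephone", "cellular", "cellular", "telephone"],
})

encoders = {}  # hypothetical name: one fitted encoder per column
encoded = toy.copy()
for column in toy.columns:
    encoders[column] = LabelEncoder()
    encoded[column] = encoders[column].fit_transform(toy[column])

# Classes are stored in sorted order; a label's code is its position there.
print(list(encoders["marital"].classes_))  # ['divorced', 'married', 'single']
print(encoded["marital"].tolist())         # [1, 2, 0, 1]

# Each stored encoder can recover the original labels for its own column.
recovered = encoders["marital"].inverse_transform(encoded["marital"])
print(list(recovered))
```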


#### Dimensionality reduction - You can think of this method as taking many features and combining similar or redundant features together to form a new, smaller feature set.

In [4]:
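# Note: PCA is fit below on the full encoded frame (target y included) without scaling,
# so columns with large numeric ranges dominate the components
pca = PCA(n_components = 3)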
pca_feats = pca.fit_transform(df)
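PCA is sensitive to feature scale: a column measured in large units tends to dominate the leading components. The sketch below uses synthetic data (not the bank dataset) to show how `explained_variance_ratio_` changes once the features are standardized. `StandardScaler` is swapped in here purely for illustration and is not part of the original pipeline.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 500
base = rng.normal(size=n)
# Synthetic features: two strongly correlated columns on a small scale
# and one independent column on a much larger scale.
X = np.column_stack([
    base,
    base + 0.1 * rng.normal(size=n),
    1000.0 * rng.normal(size=n),
])

raw_pca = PCA(n_components=2).fit(X)
scaled_pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

# Without scaling, the large-scale column accounts for almost all the variance.
print(raw_pca.explained_variance_ratio_)
# After standardizing, the correlated pair forms the leading component.
print(scaled_pca.explained_variance_ratio_)

# Index of the feature with the largest absolute loading on the first component.
print(np.abs(raw_pca.components_[0]).argmax(),
      np.abs(scaled_pca.components_[0]).argmax())
```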

#### After fitting PCA, we identify, for each principal component, the original feature with the largest absolute loading

In [5]:
n_pcs = pca.components_.shape[0]
most_important = [np.abs(pca.components_[i]).argmax() for i in range(n_pcs)]

initial_feature_names = list(df.columns.values)

most_important_names = [initial_feature_names[most_important[i]] for i in range(n_pcs)]
dic = {'PC{}'.format(i + 1): most_important_names[i] for i in range(n_pcs)}
pc_df = pd.DataFrame(sorted(dic.items()))
pc_df

Unnamed: 0,0,1
0,PC1,duration
1,PC2,pdays
2,PC3,nr.employed


#### Split our data into train and test sets, stratifying on the target

In [6]:
X_train, X_test, y_train, y_test = train_test_split(df.drop('y', axis = 1), df['y'], test_size = .2, 
                                                    random_state = SEED, stratify = df['y'])
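The `stratify` argument keeps the class proportions of the target the same in both splits, which matters when one class is rare. A minimal sketch on a made-up imbalanced target:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target: 10 positives out of 100 samples (made-up data).
X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Both splits keep the 10% positive rate of the full data.
print(y_tr.mean(), y_te.mean())
```

Without `stratify`, a random split of so few positives could leave the test set with noticeably more or fewer of them than the training set.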

#### Rescale each feature to the 0-1 range, fitting the scaler on the training set only

In [7]:
scaler = MinMaxScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
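Fitting the scaler on the training set and only transforming the test set means the test data never influences the learned minimum and maximum. One consequence, sketched below with made-up numbers, is that scaled test values can fall outside the 0-1 range.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler

train = np.array([[0.0], [5.0], [10.0]])  # made-up training column
test = np.array([[2.5], [12.0]])          # 12.0 lies outside the training range

scaler = MinMaxScaler()
train_scaled = scaler.fit_transform(train)  # learns min=0 and max=10 from train only
test_scaled = scaler.transform(test)        # reuses the training min and max

print(train_scaled.ravel())  # [0.  0.5 1. ]
print(test_scaled.ravel())   # [0.25 1.2 ]
```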

# Model building and training
#### Classification Models

- Logistic Regression : model using the logistic function to estimate the probability of the positive class
- Decision Tree Classifier : model using a series of decision nodes to predict the result
- Random Forest Classifier : model combining multiple decision tree classifiers to predict the result
- Support Vector Classifier : model separating the classes with a maximum-margin boundary defined by support vectors
- Gradient Boosting Classifier (XGBClassifier) : model using a sequence of sub-models, each correcting its predecessor's errors to improve performance
- K-Nearest Neighbors Classifier : model using the labels of the nearest data points to predict the result

#### Training the classification models

In [8]:
models = [LogisticRegression(random_state = SEED),
          DecisionTreeClassifier(random_state = SEED),
          RandomForestClassifier(random_state = SEED),
          SVC(random_state = SEED),
          XGBClassifier(verbosity = 0, random_state = SEED),
          kNN()]

results = {}

for model in models:
    fit = model.fit(X_train, y_train)
    test_pred = fit.predict(X_test)
    results[fit.__class__.__name__] = [
        round(fit.score(X_test, y_test), 2),
        round(f1_score(y_test, test_pred), 2),
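        # Note: this refits the model on folds of the test split;
        # cross-validating on X_train, y_train is the more usual choice
        round(cross_val_score(model, X_test, y_test, cv = 5).mean(), 2),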
        round(recall_score(y_test, test_pred), 2),
        confusion_matrix(y_test, test_pred)]
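The loop above cross-validates on the held-out test split. A common alternative, sketched below, is to cross-validate on the training split only and to score with F1 as well as accuracy, since accuracy can look high on imbalanced data even when few positives are found. The synthetic data from `make_classification` and the choice of `LogisticRegression` are assumptions for illustration, not the notebook's own setup.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic imbalanced data standing in for the bank dataset (an assumption).
X, y = make_classification(n_samples=2000, n_features=10, weights=[0.9, 0.1],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0, stratify=y)

model = LogisticRegression(max_iter=1000, random_state=0)

# Cross-validate on the training split; the test split stays untouched.
cv_accuracy = cross_val_score(model, X_tr, y_tr, cv=5, scoring="accuracy")
cv_f1 = cross_val_score(model, X_tr, y_tr, cv=5, scoring="f1")

print(cv_accuracy.mean(), cv_f1.mean())
```

Keeping the test split out of cross-validation leaves it available for a single final evaluation of the chosen model.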

#### Finally, we build a data frame with the scores and the confusion matrix of each model

In [9]:
index = ['accuracy', 'f1_score', 'cross_val_score', 'recall', 'confusion_matrix']
results_df = pd.DataFrame(data = results, index = index, columns = list(results.keys()))
results_df

Unnamed: 0,LogisticRegression,DecisionTreeClassifier,RandomForestClassifier,SVC,XGBClassifier,KNeighborsClassifier
accuracy,0.91,0.89,0.91,0.9,0.91,0.89
f1_score,0.48,0.52,0.57,0.35,0.59,0.36
cross_val_score,0.91,0.89,0.91,0.9,0.91,0.89
recall,0.38,0.52,0.5,0.23,0.55,0.28
confusion_matrix,"[[7132, 178], [575, 353]]","[[6860, 450], [444, 484]]","[[7055, 255], [460, 468]]","[[7204, 106], [711, 217]]","[[7018, 292], [422, 506]]","[[7077, 233], [669, 259]]"


#### The F1 score is the harmonic mean of precision and recall; it reaches its best value at 1 and its worst at 0
#### cross_val_score evaluates a model by cross-validation; here it reports the mean accuracy over 5 folds
#### Recall is the true positive rate, the share of actual positives that the model identifies. The formula is: recall = tp/(tp+fn)
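As a worked check, the scores in the table can be recomputed by hand from a confusion matrix. The sketch below uses the matrix reported for XGBClassifier, read in scikit-learn's layout `[[tn, fp], [fn, tp]]`; precision is not in the table and is computed here only because F1 depends on it.

```python
# Confusion matrix reported for XGBClassifier in the results table,
# in scikit-learn's layout [[tn, fp], [fn, tp]].
tn, fp = 7018, 292
fn, tp = 422, 506

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)
recall = tp / (tp + fn)
# F1 is the harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2), round(precision, 2), round(recall, 2), round(f1, 2))
# 0.91 0.63 0.55 0.59
```

The accuracy, recall, and F1 values match the XGBClassifier column of the table, and they show how a model can reach 0.91 accuracy while still missing almost half of the positive cases.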