## Bank Marketing Dataset :

### Overview

> The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution.<br><br>
The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

 __Source : https://archive.ics.uci.edu/ml/datasets/Bank+Marketing#__

The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no').

## What is a Term Deposit ?


A term deposit is a deposit that a bank or other financial institution offers at a fixed rate (often better than that of a regular deposit account), with your money returned at a specific maturity date.

__The classification goal is to predict whether the client will subscribe (yes / no) to a term deposit (variable 'y').__

### Read The Data

Header | Data Type | Definition
---|---------|---------
`age` | int64 | Age of customer
`job` | object | Job of customer
`marital` | object | Marital status of customer
`education` | object | Customer education level
`default` | object | Has credit in default?
`housing` | object | Whether the customer has a housing loan
`loan` | object | Whether the customer has a personal loan
`balance` | int64 | Customer's individual balance
`contact` | object | Communication type
`month` | object | Last contact month of year
`day` | int64 | Last contact day of the month
`duration` | int64 | Last contact duration, in seconds
`campaign` | int64 | Number of contacts performed during this campaign for this client
`pdays` | int64 | Number of days since the client was last contacted in a previous campaign
`previous` | int64 | Number of contacts performed before this campaign for this client
`poutcome` | object | Outcome of the previous marketing campaign
`y` | object | Has the client subscribed to a term deposit?

> __Here _y_ is the target variable.__

## Libraries

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

import random
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import train_test_split
from sklearn.model_selection import RandomizedSearchCV
from sklearn.metrics import accuracy_score, classification_report
from sklearn.metrics import confusion_matrix
from sklearn.decomposition import PCA
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import mean_squared_error
from sklearn.metrics import roc_curve
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import average_precision_score
from sklearn.metrics import accuracy_score
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
import sklearn
import scipy
from sklearn.ensemble import IsolationForest
from sklearn.neighbors import LocalOutlierFactor
from sklearn.svm import OneClassSVM
from pylab import rcParams
from collections import Counter
from imblearn.under_sampling import NearMiss
from imblearn.over_sampling import SMOTE
from sklearn import metrics

# for Machine learning
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import StackingClassifier

import pickle

# To ignore unwanted warnings
import warnings
warnings.filterwarnings('ignore')

# for styling (newer Matplotlib releases renamed the 'seaborn-whitegrid' style, so use seaborn's own API)
sns.set_style('whitegrid')

In [None]:
df = pd.read_csv('../input/bank-marketing-dataset-analysis-classification/bank-full.csv')
df.head()

## EDA and Visualization

- Statistical Description of Data
- Plots
- Relationship between the attributes

In [None]:
# shows the statistical summary of numerical columns

df.describe()

In [None]:
# CATEGORICAL VALUES

cat= df.select_dtypes(include= object)
cat_columns = cat.columns

In [None]:
cat_columns

In [None]:
for feature in cat_columns:
    print('The feature is {} and number of categories are {}'.format(feature,len(df[feature].unique())))

> __Features _job_ and _month_ have the highest number of categories.__

In [None]:
cat_features=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y']

In [None]:
# NUMERICAL VALUES

numeric = df.select_dtypes(include=np.number)
numeric_columns= numeric.columns
numeric_columns

### Outliers and IQR

In [None]:
def Outdet(df):
    # IQR fences for every numeric column (numeric_only skips the object columns)
    Q1=df.quantile(0.25, numeric_only=True)
    Q3=df.quantile(0.75, numeric_only=True)
    IQR=Q3-Q1
    LR=Q1-(IQR*1.5)
    UR=Q3+(IQR*1.5)
    return LR,UR

LR,UR=Outdet(df)
print("The lower outlier bounds are :\n",LR)
print("The upper outlier bounds are :\n",UR)

In [None]:
# IQR
Q1 = np.percentile(df['age'], 25,
                   interpolation = 'midpoint')
 
Q3 = np.percentile(df['age'], 75,
                   interpolation = 'midpoint')
IQR = Q3 - Q1
 
print("Old Shape: ", df.shape)
 
# Upper bound
upper = np.where(df['age'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['age'] <= (Q1-1.5*IQR))
 
''' Detecting the Outliers '''
print("Outliers are:",upper,lower)

In [None]:
# IQR
Q1 = np.percentile(df['balance'], 25,
                   interpolation = 'midpoint')
 
Q3 = np.percentile(df['balance'], 75,
                   interpolation = 'midpoint')
IQR = Q3 - Q1
 
print("Old Shape: ", df.shape)
 
# Upper bound
upper = np.where(df['balance'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['balance'] <= (Q1-1.5*IQR))
 
''' Detecting the Outliers '''
print("Outliers are:",upper,lower)

In [None]:
# IQR
Q1 = np.percentile(df['day'], 25,
                   interpolation = 'midpoint')
 
Q3 = np.percentile(df['day'], 75,
                   interpolation = 'midpoint')
IQR = Q3 - Q1
 
print("Old Shape: ", df.shape)
 
# Upper bound
upper = np.where(df['day'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['day'] <= (Q1-1.5*IQR))
 
''' Detecting the Outliers '''
print("Outliers are:",upper,lower)

In [None]:
# IQR
Q1 = np.percentile(df['duration'], 25,
                   interpolation = 'midpoint')
 
Q3 = np.percentile(df['duration'], 75,
                   interpolation = 'midpoint')
IQR = Q3 - Q1
 
print("Old Shape: ", df.shape)
 
# Upper bound
upper = np.where(df['duration'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['duration'] <= (Q1-1.5*IQR))
 
''' Detecting the Outliers '''
print("Outliers are:",upper,lower)

In [None]:
# IQR
Q1 = np.percentile(df['campaign'], 25,
                   interpolation = 'midpoint')
 
Q3 = np.percentile(df['campaign'], 75,
                   interpolation = 'midpoint')
IQR = Q3 - Q1
 
print("Old Shape: ", df.shape)
 
# Upper bound
upper = np.where(df['campaign'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['campaign'] <= (Q1-1.5*IQR))
 
''' Detecting the Outliers '''
print("Outliers are:",upper,lower)

In [None]:
# IQR
Q1 = np.percentile(df['pdays'], 25,
                   interpolation = 'midpoint')
 
Q3 = np.percentile(df['pdays'], 75,
                   interpolation = 'midpoint')
IQR = Q3 - Q1
 
print("Old Shape: ", df.shape)
 
# Upper bound
upper = np.where(df['pdays'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['pdays'] <= (Q1-1.5*IQR))
 
''' Detecting the Outliers '''
print("Outliers are:",upper,lower)

In [None]:
# IQR
Q1 = np.percentile(df['previous'], 25,
                   interpolation = 'midpoint')
 
Q3 = np.percentile(df['previous'], 75,
                   interpolation = 'midpoint')
IQR = Q3 - Q1
 
print("Old Shape: ", df.shape)
 
# Upper bound
upper = np.where(df['previous'] >= (Q3+1.5*IQR))
# Lower bound
lower = np.where(df['previous'] <= (Q1-1.5*IQR))
 
''' Detecting the Outliers '''
print("Outliers are:",upper,lower)
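
The per-column cells above all apply the same rule, so the check can be wrapped in one helper and looped over columns. The following is a minimal sketch under stated assumptions: the function name `iqr_outlier_mask` and the toy values are hypothetical, and it uses NumPy's default linear percentile interpolation with strict inequalities, whereas the cells above use midpoint interpolation and `>=` / `<=`.

```python
import numpy as np

def iqr_outlier_mask(values, k=1.5):
    """Return (mask, lower, upper); mask is True for points outside the IQR fences."""
    values = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(values, [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    mask = (values < lower) | (values > upper)
    return mask, lower, upper

# Toy columns (hypothetical values, not taken from the bank dataset)
toy = {"age": [25, 31, 38, 42, 47, 95], "campaign": [1, 1, 2, 2, 3, 3]}

outlier_counts = {}
for name, column in toy.items():
    mask, lower, upper = iqr_outlier_mask(column)
    outlier_counts[name] = int(mask.sum())
    print(f"{name}: fences=({lower:.2f}, {upper:.2f}), outliers={outlier_counts[name]}")
```

With the real data, the same loop could run over `df[c]` for each numeric column instead of the toy dictionary.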

In [None]:
num_features=['age', 'balance', 'day', 'duration',
       'campaign', 'pdays', 'previous']

In [None]:
# Boxplot for each numerical feature

fig, axes = plt.subplots(7, 1, figsize=(8, 25))
for i, c in enumerate(num_features):
    f = df[[c]].boxplot(ax=axes[i], vert=False) 

In [None]:
# boxplot to show target distribution with respect to numerical features

plt.figure(figsize=(20,60), facecolor='white')
plotnumber =1
for feature in num_features:
    ax = plt.subplot(12,3,plotnumber)
    sns.boxplot(x="y", y= df[feature], data=df)
    plt.xlabel(feature)
    plotnumber+=1
plt.show()

In [None]:
# boxplot on numerical features to find outliers

plt.figure(figsize=(20,60), facecolor='white')
plotnumber =1
for feature in num_features:  # a separate loop variable keeps the num_features list intact
    ax = plt.subplot(12,3,plotnumber)
    sns.boxplot(x=df[feature])
    plt.xlabel(feature)
    plotnumber+=1
plt.show()

In [None]:
num_features=['age', 'balance', 'day', 'duration',
       'campaign', 'pdays', 'previous']

In [None]:
# Kernel Density Estimation plot for each numerical feature

fig, axes = plt.subplots(7, 1, figsize=(8, 25))
for i, c in enumerate(num_features):
    f = df[[c]].plot(kind='kde',ax=axes[i])

In [None]:
# Distribution of Continuous Numerical Features
# plot a univariate distribution of continuous observations

plt.figure(figsize=(20,60), facecolor='white')
plotnumber =1
for feature in num_features:  # a separate loop variable keeps the num_features list intact
    ax = plt.subplot(12,3,plotnumber)
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    sns.histplot(df[feature], kde=True, stat='density')
    plt.xlabel(feature)
    plotnumber+=1
plt.show()

In [None]:
cat_features=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y']

In [None]:
# check count based on categorical features

plt.figure(figsize=(15,80), facecolor='white')
plotnumber =1
for feature in cat_features:  # a separate loop variable keeps the cat_features list intact
    ax = plt.subplot(12,3,plotnumber)
    sns.countplot(y=feature,data=df)
    plt.xlabel(feature)
    plt.title(feature)
    plotnumber+=1
plt.show()

In [None]:
cat_features=['job', 'marital', 'education', 'default', 'housing', 'loan', 'contact',
       'month', 'poutcome', 'y']

In [None]:
# check target label split over categorical features
# Finding out the relationship between categorical variable and dependent variable

for feature in cat_features:  # a separate loop variable keeps the cat_features list intact
    sns.catplot(x='y', col=feature, kind='count', data= df)
plt.show()

In [None]:
# Categorical feature(s)
#name : marital
#labels : 0 , 1 , 2 ('married', 'single', 'divorced')

# Pie and count plot for Categorical feature

fig, axes = plt.subplots(1, 2, figsize=(20, 6))
data = df['marital'].value_counts()
barplot = data.plot(kind='pie', ax=axes[0], title='MARITAL', autopct="%.2f", fontsize=14, ylabel='')
countplot = sns.countplot(x='marital', data=df, ax=axes[1])

In [None]:
import scipy.stats as stats

plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(df["age"], dist="norm", plot=plt)
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(df["day"], dist="norm", plot=plt)
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(df["balance"], dist="norm", plot=plt)
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (10,6)
stats.probplot(df['previous'], dist ="norm", plot = plt)
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (10,6)
stats.probplot(df['pdays'], dist ="norm", plot = plt)
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (10,6)
stats.probplot(df['campaign'], dist ="norm", plot = plt)
plt.show()

In [None]:
plt.rcParams["figure.figsize"] = (10, 6)
stats.probplot(df["duration"], dist="norm", plot=plt)
plt.show()

In [None]:
# PAIRPLOT for scatterness and correlation

sns.pairplot(df,hue='y',corner=True)

In [None]:
# Check target label split over categorical features and find the count

df.y.value_counts().to_frame(name='Count')

In [None]:
# Percentage of the count of target label

round(df['y'].value_counts()/len(df)*100,2)

In [None]:
# Count of the Target label

sns.countplot(x='y',data=df)
plt.show()

In [None]:
print("Total number of people who opened a term deposit: {0}".format(len(df.y[df.y=='yes'])))
print("Total number of people who haven't opened a term deposit: {0}".format(len(df.y[df.y=='no'])))

- __From the counts of the target label, we can say that the dataset is _highly imbalanced_: only 11.7% of customers subscribed to a term deposit.__

## Data Pre-processing

- Remove null values and duplicates
- Data Cleaning
- Normalization
- Encoding
- Feature Selection
- Correlation Analysis ( Heat Map )

In [None]:
df.info()

In [None]:
# find number of rows and column

df.shape

In [None]:
# find unique values in each object datatype 

for col in df.select_dtypes(include='object').columns:
    print(col)
    print(df[col].unique())

In [None]:
# displays unique values in each columns

for column in df.columns:
    print(column,df[column].nunique())

> __No feature with only one value.__

In [None]:
# find missing values

df.isnull().sum()

In [None]:
# % of null values

df.isna().sum()/len(df)*100

In [None]:
# dropping null values (dropna returns a new DataFrame; assign the result if it should be kept)

df.dropna()

> __We do not have any missing values.__

In [None]:
# check for duplicate values 

df1=df.duplicated()
df[~df1]

In [None]:
df.duplicated().sum()

> __There are no duplicates in the dataset.__

In [None]:
# Correlation between numerical features

df.corr(numeric_only=True)  # object columns must be excluded explicitly in recent pandas

In [None]:
corr_matt=df.corr(numeric_only=True)
corr_matt

In [None]:
# Checking for correlation (Pearson's)

fig = plt.figure(figsize=(18,10))
sns.heatmap(corr_matt,annot=True)

> __No pair of numerical features is _highly correlated_, so we do not remove any attributes from the data.__

### Encoding

__Encoding is a technique of converting categorical variables into numerical values so that it could be easily fitted to a machine learning model.__



> __Ordinal Encoding__ : ```LABEL ENCODING```

> __Nominal Encoding__ : ```ONE HOT ENCODING```



__where ```Nominal``` means the data can only be categorized, and ```Ordinal``` means the data can be categorized and ranked.__
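
To make the distinction concrete, here is a small sketch in plain Python. The helper names `label_encode` and `one_hot_encode`, the sample values, and the education ranking are illustrative assumptions, not the mapping this notebook applies.

```python
def label_encode(values, mapping):
    """Ordinal case: replace each category with its rank from an explicit mapping."""
    return [mapping[v] for v in values]

def one_hot_encode(values):
    """Nominal case: one 0/1 column per category, so no order is implied."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

# Hypothetical ranking for an ordinal feature
education_order = {"primary": 0, "secondary": 1, "tertiary": 2}
education = ["secondary", "primary", "tertiary"]
encoded_education = label_encode(education, education_order)

# A nominal feature has no natural order, so each category gets its own column
marital = ["married", "single", "divorced", "single"]
categories, encoded_marital = one_hot_encode(marital)

print(encoded_education)
print(categories)
print(encoded_marital)
```

Label encoding keeps a single column but implies an order between categories; one-hot encoding avoids that at the cost of extra columns.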

### __Label Encoding for Categorical attributes__

In [None]:
label_encoding = {
    "y":{"no":0,"yes":1},
    "poutcome":{"unknown":0,"failure":1,"other":2,"success":3},
    "month":{"jan":0,"feb":1,"mar":2,"apr":3,"may":4,"jun":5,"jul":6,"aug":7,"sep":8,"oct":9,"nov":10,"dec":11},
    "contact":{"unknown":0,"cellular":1,"telephone":2},
    "loan":{"no":0,"yes":1},
    "housing":{"no":0,"yes":1},
    "default":{"no":0,"yes":1},
    "education":{"tertiary":0,"secondary":1,"unknown":2,"primary":3},
    "marital":{"married":0,"single":1,"divorced":2},   
    "job":{"management":0,"technician":1,"entrepreneur":2,"blue-collar":3,"unknown":4,"retired":5,"admin.":6,"services":7,"self-employed":8,"unemployed":9,"housemaid":10,"student":11}
}

In [None]:
df1 = df.replace(label_encoding)

In [None]:
df1.head(10)

In [None]:
# Pearson's Correlation of features w.r.t target label

corr_matt1=df1.corr()['y']
corr_matt1

### __MinMax Scaling for Numerical attributes__

__Min-max normalization is one of the most common ways to normalize data. For every feature, the minimum value of that feature gets transformed into a 0, the maximum value gets transformed into a 1, and every other value gets transformed into a decimal between 0 and 1.__
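
The transformation described above is x' = (x - min) / (max - min), applied per feature. The sketch below spells it out with NumPy; the function names and toy numbers are hypothetical. It learns the minimum and maximum from the training rows only and reuses them for new rows, which is one common way to keep test information out of the scaling step.

```python
import numpy as np

def fit_min_max(train):
    """Learn the per-column minimum and maximum from the training rows."""
    train = np.asarray(train, dtype=float)
    return train.min(axis=0), train.max(axis=0)

def apply_min_max(values, col_min, col_max):
    """Scale with x' = (x - min) / (max - min); constant columns map to 0."""
    values = np.asarray(values, dtype=float)
    span = np.where(col_max > col_min, col_max - col_min, 1.0)
    return (values - col_min) / span

# Toy rows (hypothetical): columns are age and balance
train = np.array([[18.0, 0.0], [40.0, 500.0], [62.0, 2000.0]])
test = np.array([[29.0, 1000.0]])

col_min, col_max = fit_min_max(train)
train_scaled = apply_min_max(train, col_min, col_max)
test_scaled = apply_min_max(test, col_min, col_max)
print(train_scaled)
print(test_scaled)
```

Values in new rows that fall outside the training range would land outside [0, 1] under this scheme.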

In [None]:
num_features=['age', 'balance', 'day', 'duration',
       'campaign', 'pdays', 'previous']

In [None]:
scaler = MinMaxScaler(feature_range=(0, 1))

df1[num_features] = scaler.fit_transform(df1[num_features]) 

In [None]:
df2 = df1[num_features]
df2

In [None]:
#SPLITTING THE DATA

# get all the features
features = [feat for feat in df1.columns if feat !='y']

x = df2[num_features] # feature set
y = df1['y'] # target

# Splitting data into train and test
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42, stratify=y)

# train and test datasets dimensions
x_train.shape, x_test.shape

In [None]:
x_train[num_features]

In [None]:
x_test[num_features]

## Oversampling

__Oversampling is used when one class has too few examples in the collected data.__

__When one class is the underrepresented minority in the data sample, oversampling techniques may be used to add minority-class examples so that training sees a more balanced number of positive results. SMOTE, used here, creates synthetic minority samples rather than duplicating existing ones.__
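
As a rough illustration of how synthetic oversampling can work, the sketch below interpolates between a minority point and one of its nearest minority neighbours. It is a simplified stand-in written for this explanation, not the `imblearn` implementation; the function name `smote_like_samples` and the toy points are hypothetical.

```python
import numpy as np

def smote_like_samples(minority, n_new, k=3, seed=0):
    """Create n_new points, each on the segment between a minority point
    and one of its k nearest minority neighbours."""
    rng = np.random.default_rng(seed)
    minority = np.asarray(minority, dtype=float)
    # Pairwise Euclidean distances between minority points
    dists = np.linalg.norm(minority[:, None, :] - minority[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    synthetic = []
    for _ in range(n_new):
        i = rng.integers(len(minority))      # pick a minority point
        j = neighbours[i, rng.integers(k)]   # pick one of its neighbours
        gap = rng.random()                   # position along the segment
        synthetic.append(minority[i] + gap * (minority[j] - minority[i]))
    return np.array(synthetic)

# Toy minority class (hypothetical): the four corners of the unit square
minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
new_points = smote_like_samples(minority, n_new=6)
print(new_points)
```

Because every synthetic point lies between two real minority points, the new samples stay inside the region the minority class already occupies.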


In [None]:
x_train, x_test, y_train, y_test = train_test_split(x,y,train_size= 0.7,random_state=42)
smote=SMOTE()
x_train_os, y_train_os = smote.fit_resample(x_train, y_train)
print("The number of Classes before fit {}".format(Counter(y_train)))
print("The number of Classes after fit {}".format(Counter(y_train_os)))

In [None]:
x_train

In [None]:
y_train

In [None]:
y_train_os.value_counts()

In [None]:
print('not_deposited :'  , y_train_os.value_counts()[0]/len(y_train_os)*100,'%')
print('deposited: ' , y_train_os.value_counts()[1]/len(y_train_os)*100,'%')

> __Undersampling is not used here because it leads to loss of data.__

## Cross-Validation

__Cross-validation is a statistical method used to estimate the performance of machine learning models.__

> __It helps detect overfitting and underfitting by evaluating the model on data it was not trained on.__

### Overfitting

__Overfitting happens when a model learns the detail and noise in the training data to the extent that it negatively__ 

__impacts the performance of the model on new data.__

### Underfitting

__Underfitting hurts the accuracy of a machine learning model. It means that the model or the algorithm__

__does not fit the data well enough. It usually happens when there is too little data to build an accurate model, or when__

__we try to fit a linear model to non-linear data.__
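
To show what a stratified split does before relying on `StratifiedKFold`, the sketch below deals the indices of each class round-robin into folds so that every fold keeps roughly the original class ratio. The function name `stratified_folds` and the toy labels are hypothetical, and this is only an approximation of the idea, not scikit-learn's algorithm.

```python
import numpy as np

def stratified_folds(y, n_splits=5, seed=0):
    """Return a list of test-index arrays, one per fold, preserving class ratios."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y)
    folds = [[] for _ in range(n_splits)]
    for label in np.unique(y):
        # Shuffle the indices of this class, then deal them out one fold at a time
        idx = rng.permutation(np.flatnonzero(y == label))
        for position, sample in enumerate(idx):
            folds[position % n_splits].append(int(sample))
    return [np.array(sorted(f)) for f in folds]

# Toy labels (hypothetical): 40 negatives and 10 positives
y_toy = np.array([0] * 40 + [1] * 10)
folds = stratified_folds(y_toy, n_splits=5)

for k, test_idx in enumerate(folds):
    train_idx = np.setdiff1d(np.arange(len(y_toy)), test_idx)
    print(k, len(train_idx), len(test_idx), y_toy[test_idx].mean())
```

Each fold serves once as the held-out set while the model trains on the remaining folds, and the scores are then averaged.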

In [None]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import cross_val_score
skfold = StratifiedKFold(n_splits=5)
model = DecisionTreeClassifier(random_state=40)
scores=cross_val_score(model,x_train_os,y_train_os,cv=skfold)
print(np.mean(scores))

In [None]:
scores

## Model Building

### __Hyper parameter tuning__

In [None]:

from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import cross_val_score
from sklearn import svm,datasets
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier

model_params = {
    'GaussianNB': {
        'model': GaussianNB(priors= None, var_smoothing = 1e-09), 
        'params': {
            
            } 
    },
    'random_forest' : {
        'model': RandomForestClassifier(),
        'params' : {
            'n_estimators': [1,5,10]
        }
    },
    'logistic_regression' : {
        'model': LogisticRegression(solver='liblinear'),
        'params': {
            'C': [1,5,10]
        }
    }
}

scores = []

for model_name, mp in model_params.items():
    clf =  GridSearchCV(mp['model'], mp['params'], cv=5, return_train_score=False)
    clf.fit(x_train_os, y_train_os)
    scores.append({
        'model': model_name,
        'best_score': clf.best_score_,
        'best_params': clf.best_params_
    })
    
dff = pd.DataFrame(scores,columns=['model','best_score','best_params'])
dff


### DecisionTree Classifier

```A decision tree is a type of supervised machine learning used to categorize or make predictions based on how a previous set of questions were answered. The model is a form of supervised learning, meaning that the model is trained and tested on a set of data that contains the desired categorization.```

In [None]:
from sklearn.tree import DecisionTreeClassifier
deseciontree_model=DecisionTreeClassifier(max_depth = 10, random_state = 40)
deseciontree_model.fit(x_train_os, y_train_os)
y_predicted_deseciontree = deseciontree_model.predict(x_test)
y_predicted_deseciontree

In [None]:
deseciontree_model.score(x_test,y_test)
print("Accuracy:",metrics.accuracy_score(y_test,y_predicted_deseciontree))
print("Precision:",metrics.precision_score(y_test,y_predicted_deseciontree))
print("Recall:",metrics.recall_score(y_test, y_predicted_deseciontree))

In [None]:
cm=confusion_matrix(y_test,y_predicted_deseciontree)
cm

In [None]:
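from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
plot_3 = ConfusionMatrixDisplay.from_estimator(deseciontree_model, x_test, y_test, display_labels =["NO","YES"], cmap = plt.cm.Reds, values_format = '.2f')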
plot_3.figure_.suptitle("Confusion Matrix")
plt.show()

### Logistic Regression

```Logistic Regression is a Machine Learning classification algorithm that is used to predict the probability of a categorical dependent variable.```
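
The probability mentioned above comes from passing a weighted sum of the features through the sigmoid function. The sketch below shows that step alone; the weights, bias, and inputs are hypothetical numbers chosen for illustration, not values fitted on this dataset.

```python
import numpy as np

def predict_proba(x, weights, bias):
    """Sigmoid of the linear score: p = 1 / (1 + exp(-(w . x + b)))."""
    z = np.dot(x, weights) + bias
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical parameters for two features
weights = np.array([2.0, -1.0])
bias = -0.5

x = np.array([[0.5, 0.0], [1.0, 0.5], [0.0, 1.0]])
proba = predict_proba(x, weights, bias)
labels = (proba >= 0.5).astype(int)  # 0.5 is the usual default threshold
print(proba)
print(labels)
```

Lowering the threshold below 0.5 would trade precision for recall, which can matter on an imbalanced target.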

In [None]:
from sklearn.linear_model import LogisticRegression
modelreg=LogisticRegression(C=10, random_state = 40)
modelreg.fit(x_train_os,y_train_os)
ypred=modelreg.predict(x_test)
ypred

In [None]:
modelreg.score(x_test,y_test)
x_test.shape
y_test.shape

In [None]:
print("Accuracy:",metrics.accuracy_score(y_test, ypred))
print("Precision:",metrics.precision_score(y_test, ypred))
print("Recall:",metrics.recall_score(y_test, ypred))

In [None]:
cm=confusion_matrix(y_test,ypred)
cm

In [None]:
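from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
plot_3 = ConfusionMatrixDisplay.from_estimator(modelreg, x_test, y_test, display_labels =["NO","YES"], cmap = plt.cm.Reds, values_format = '.2f')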
plot_3.figure_.suptitle("Confusion Matrix")
plt.show()

### KNeighborsClassifier

```KNeighborsClassifier assigns each data point the majority class among its nearest neighbors; the n_neighbors argument sets how many neighbors are considered. There is no universal rule of thumb for choosing this number.```
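
The voting rule can be written in a few lines. The sketch below uses Euclidean distance, which is what `metric='minkowski'` with `p=2` amounts to; the function name `knn_predict` and the toy points are hypothetical.

```python
import numpy as np
from collections import Counter

def knn_predict(x_train, y_train, query, k=5):
    """Return the majority label among the k training points closest to query."""
    x_train = np.asarray(x_train, dtype=float)
    dists = np.linalg.norm(x_train - np.asarray(query, dtype=float), axis=1)
    nearest = np.argsort(dists)[:k]
    votes = Counter(int(y_train[i]) for i in nearest)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): two well-separated groups
x_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [1.0, 1.0], [0.9, 1.1], [1.1, 0.9]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(x_toy, y_toy, [0.05, 0.05], k=3)
pred_b = knn_predict(x_toy, y_toy, [0.95, 1.0], k=3)
print(pred_a, pred_b)
```

An odd k avoids tied votes in a binary problem, which is one practical reason values such as 3 or 5 are common starting points.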

In [None]:
from sklearn.neighbors import KNeighborsClassifier
KNN_model = KNeighborsClassifier(n_neighbors = 5, metric = 'minkowski', p = 2)
KNN_model.fit(x_train_os, y_train_os)
y_predicted_KNN = KNN_model.predict(x_test)
y_predicted_KNN 

In [None]:
KNN_model.score(x_test,y_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_predicted_KNN))
print("Precision:",metrics.precision_score(y_test,y_predicted_KNN))
print("Recall:",metrics.recall_score(y_test, y_predicted_KNN))

In [None]:
cm=confusion_matrix(y_test,y_predicted_KNN)
cm

In [None]:
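from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
plot_3 = ConfusionMatrixDisplay.from_estimator(KNN_model, x_test, y_test, display_labels =["NO","YES"], cmap = plt.cm.Reds, values_format = '.2f')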
plot_3.figure_.suptitle("Confusion Matrix")
plt.show()

### Naive Bayes

```The Naïve Bayes classifier is one of the simplest and most effective classification algorithms; it helps in building fast machine learning models that can make quick predictions.```

In [None]:
from sklearn.naive_bayes import GaussianNB
naive_bayes_model= GaussianNB()
naive_bayes_model.fit(x_train_os, y_train_os)
y_predicted_naive = naive_bayes_model.predict(x_test)
y_predicted_naive

In [None]:
naive_bayes_model.score(x_test,y_test)
print("Accuracy:",metrics.accuracy_score(y_test, y_predicted_naive))
print("Precision:",metrics.precision_score(y_test,y_predicted_naive))
print("Recall:",metrics.recall_score(y_test, y_predicted_naive))

In [None]:
cm=confusion_matrix(y_test,y_predicted_naive)
cm

In [None]:
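from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
plot_3 = ConfusionMatrixDisplay.from_estimator(naive_bayes_model, x_test, y_test, display_labels =["NO","YES"], cmap = plt.cm.Reds, values_format = '.2f')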
plot_3.figure_.suptitle("Confusion Matrix")
plt.show()

### RandomForest

```A random forest is a meta estimator that fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.```

In [None]:
from sklearn.ensemble import RandomForestClassifier
randomforest_model= RandomForestClassifier(n_estimators = 10)
randomforest_model.fit(x_train_os, y_train_os)
y_predicted_randomforest = randomforest_model.predict(x_test)
y_predicted_randomforest

In [None]:
randomforest_model.score(x_test,y_test)
print("Accuracy:",metrics.accuracy_score(y_test,y_predicted_randomforest))
print("Precision:",metrics.precision_score(y_test,y_predicted_randomforest))
print("Recall:",metrics.recall_score(y_test, y_predicted_randomforest))

In [None]:
cm=confusion_matrix(y_test,y_predicted_randomforest)
cm

In [None]:
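from sklearn.metrics import ConfusionMatrixDisplay
# plot_confusion_matrix was removed in scikit-learn 1.2; from_estimator is its replacement
plot_3 = ConfusionMatrixDisplay.from_estimator(randomforest_model, x_test, y_test, display_labels =["NO","YES"], cmap = plt.cm.Reds, values_format = '.2f')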
plot_3.figure_.suptitle("Confusion Matrix")
plt.show()

In [None]:
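# ROC plot
# plot_roc_curve was removed in scikit-learn 1.2; RocCurveDisplay.from_estimator is its replacement
from sklearn.metrics import RocCurveDisplay
classifiers = [deseciontree_model, modelreg, KNN_model, naive_bayes_model,randomforest_model]
ax = plt.gca()
for i in classifiers:
    RocCurveDisplay.from_estimator(i, x_test , y_test, ax=ax)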

In [None]:
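# ROC and AUC need scores (class-1 probabilities), not hard 0/1 predictions
y_score_randomforest = randomforest_model.predict_proba(x_test)[:, 1]
roc_auc_score(y_test,y_score_randomforest)
fpr,tpr,threshold =roc_curve(y_test,y_score_randomforest)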

In [None]:
from sklearn.metrics import classification_report,roc_auc_score,roc_curve,auc
roc_auc = auc(fpr,tpr)  # a distinct name keeps the imported auc function usable
plt.figure(figsize=(5,5),dpi=100)
plt.plot(fpr,tpr,linestyle='-',label = "(auc = %0.3f)" % roc_auc)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.legend()
plt.show()

## Final Analysis

In [None]:
dfnew = pd.DataFrame()
dfnew['Model'] = ['DecisionTree','LogisticRegression','KNeighborsClassifier','GaussianNB','RandomForestClassifier']
dfnew['Accuracy'] = [0.840017693895606,0.784355647301681,0.7976260690061928,0.7615010321439104,0.86670598643468]
dfnew['Precision']=[0.38757861635220126,0.3105909220667999,0.3275187969924812,0.2853396275898243,0.43823529411764706]
dfnew['Recall']=[0.6170212765957447,0.6808510638297872,0.681476846057572,0.6808510638297872,0.4662077596996245]
dfnew

In [None]:
cm = sns.light_palette('seagreen',as_cmap=True)
s = dfnew.style.background_gradient(cmap=cm)
s

In [None]:
plt.figure(figsize=(20,5))
sns.set(style="whitegrid")
ax = sns.barplot(y ='Accuracy',x = 'Model',data = dfnew)

> __Here, we can clearly see that RandomForestClassifier has the highest accuracy (about 0.87) of the five algorithms, although its recall (about 0.47) is the lowest.__
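
For reference, the three columns in the table are all derived from the four confusion-matrix counts. The sketch below computes them from hypothetical counts (not results from this notebook) to show why accuracy alone can look good on an imbalanced target.

```python
def classification_metrics(tn, fp, fn, tp):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tn + fp + fn + tp
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical scenario: 1000 clients, 100 of whom subscribed
always_no = classification_metrics(tn=900, fp=0, fn=100, tp=0)
some_model = classification_metrics(tn=800, fp=100, fn=30, tp=70)

print("always 'no':", always_no)
print("some model :", some_model)
```

In this made-up scenario, a classifier that always predicts "no" scores higher on accuracy than the second model while finding none of the subscribers, which is why precision and recall are reported alongside accuracy.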

## PCA and Clustering

- __Principal Component Analysis is an unsupervised learning algorithm that is used for the dimensionality reduction in machine learning.__

- __Clustering is an unsupervised machine learning method of identifying and grouping similar data points in larger datasets without concern for the specific outcome.__
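
As a rough picture of what PCA computes, the sketch below centers the data and projects it onto the leading right singular vectors from an SVD. The function name `pca_project` and the synthetic data are hypothetical; `sklearn.decomposition.PCA` is what the notebook itself uses.

```python
import numpy as np

def pca_project(data, n_components=2):
    """Project centered data onto its top principal directions.
    Returns the projected data and the explained-variance ratio per component."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    # Rows of vt are the principal directions, ordered by decreasing singular value
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    explained = singular_values ** 2 / np.sum(singular_values ** 2)
    return centered @ vt[:n_components].T, explained[:n_components]

# Synthetic data (hypothetical): the second column is almost a multiple of the first
rng = np.random.default_rng(0)
base = rng.normal(size=(200, 1))
data = np.hstack([base,
                  2 * base + 0.01 * rng.normal(size=(200, 1)),
                  rng.normal(size=(200, 1))])

projected, explained = pca_project(data, n_components=2)
print(projected.shape, explained)
```

Because two of the three columns are nearly redundant here, two components retain almost all of the variance, which is the situation where dimensionality reduction loses little information.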

In [None]:
from sklearn.cluster import KMeans
km = KMeans(n_clusters=2)
y_predicted = km.fit_predict(x_test)  # KMeans is unsupervised, so the labels are not passed
y_predicted

In [None]:
x_test['cluster']=y_predicted
x_test

In [None]:
from sklearn.datasets import load_digits
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
import numpy as np
 
#Load Data
data = df2
pca = PCA(2)
 
#Transform the data
df3 = pca.fit_transform(data)
 
df3.shape

~~~
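The K-means clustering algorithm alternates between assigning each point to its nearest centroid and recomputing each centroid as the mean of its assigned points, repeating until the centroids stop changing.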
~~~
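
The iteration can be sketched directly with NumPy. This is a minimal illustration of the assign-then-update loop with random initial centroids, not scikit-learn's `KMeans` (which uses a smarter initialization by default); the function name `kmeans` and the toy points are hypothetical.

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    """Basic k-means: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    points = np.asarray(points, dtype=float)
    # Start from k distinct points chosen at random
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the closest centroid for every point
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: each centroid moves to the mean of its points
        new_centroids = np.array([
            points[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data (hypothetical): two tight groups far apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                   [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result can depend on the starting centroids, which is why fixing `random_state` makes clustering runs reproducible.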

In [None]:
error = []
for i in range(1,11):
    kmeans = KMeans(n_clusters=i).fit(df3)  # one fit per K is enough
    error.append(kmeans.inertia_)  # inertia_ is the WCSS for this K
import matplotlib.pyplot as plt
plt.plot(range(1,11),error)
plt.title("Elbow Method")
plt.xlabel("No. of clusters")
plt.ylabel("error")
plt.show()

~~~
WCSS is the sum of squared distance between each point and the centroid in a cluster. When we plot the WCSS with the K value, the plot looks like an Elbow. As the number of clusters increases, the WCSS value will start to decrease. 
WCSS value is largest when K = 1.
~~~
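
The quantity plotted in the elbow curve can be computed by hand. The sketch below evaluates WCSS for a toy layout with K = 1 and K = 2; the function name `wcss`, the points, and the cluster assignments are hypothetical. In scikit-learn the same quantity is exposed as `inertia_` on a fitted `KMeans`.

```python
import numpy as np

def wcss(points, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    points = np.asarray(points, dtype=float)
    diffs = points - np.asarray(centroids, dtype=float)[labels]
    return float(np.sum(diffs ** 2))

# Toy data (hypothetical): two pairs of points, ten units apart
points = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

# K = 1: every point belongs to one cluster centred on the overall mean
labels_k1 = np.zeros(4, dtype=int)
centroids_k1 = points.mean(axis=0, keepdims=True)

# K = 2: one cluster per pair
labels_k2 = np.array([0, 0, 1, 1])
centroids_k2 = np.array([[0.0, 1.0], [10.0, 1.0]])

wcss_k1 = wcss(points, labels_k1, centroids_k1)
wcss_k2 = wcss(points, labels_k2, centroids_k2)
print(wcss_k1, wcss_k2)
```

The large drop from K = 1 to K = 2 in this toy layout is the kind of bend the elbow method looks for.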

> __Here, we take _K = 2_: the elbow appears at 2, and beyond it adding more clusters reduces the WCSS only gradually.__

In [None]:
from sklearn.metrics import silhouette_score

km = KMeans(n_clusters=2, random_state=42)
#
# Fit the KMeans model
#
km.fit_predict(df3)
#
# Calculate Silhouette Score
#
score = silhouette_score(df3, km.labels_, metric='euclidean')
#
# Print the score
#
print('Silhouette Score: %.3f' % score)

~~~
The value of the silhouette coefficient lies in the range [-1, 1]. A score of 1 is the best, meaning that the data point is very compact within the cluster to which it belongs and far away from the other clusters. The worst value is -1.
Values near 0 denote overlapping clusters.
~~~
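
For a single point, the coefficient is s = (b - a) / max(a, b), where a is the mean distance to the other points in its own cluster and b is the smallest mean distance to the points of any other cluster. The sketch below computes it for toy data; the function name `silhouette_of_point` and the points are hypothetical, and it does not guard against duplicate points where both a and b are zero.

```python
import numpy as np

def silhouette_of_point(points, labels, index):
    """Silhouette coefficient of one point: (b - a) / max(a, b)."""
    points = np.asarray(points, dtype=float)
    labels = np.asarray(labels)
    dists = np.linalg.norm(points - points[index], axis=1)
    own = labels == labels[index]
    own[index] = False  # exclude the point itself
    if not own.any():
        return 0.0  # convention for a cluster with a single member
    a = dists[own].mean()
    b = min(dists[labels == other].mean()
            for other in np.unique(labels) if other != labels[index])
    return float((b - a) / max(a, b))

# Toy data (hypothetical): two pairs of points, five units apart
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]])

s_good = silhouette_of_point(points, [0, 0, 1, 1], index=0)  # sensible clustering
s_bad = silhouette_of_point(points, [0, 1, 1, 1], index=1)   # point 1 put in the wrong cluster
print(s_good, s_bad)
```

`silhouette_score` reports the mean of this value over all points, so a few badly assigned points pull the average down.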

In [None]:
range_n_clusters = [2, 3, 4, 5, 6, 7, 8]
silhouette_avg = []
for num_clusters in range_n_clusters:
    kmeans = KMeans(n_clusters=num_clusters)
    kmeans.fit(df3)
    cluster_labels = kmeans.labels_
    silhouette_avg.append(silhouette_score(df3, cluster_labels))
plt.plot(range_n_clusters,silhouette_avg,'bx-')
plt.xlabel('Values of K') 
plt.ylabel('Silhouette score') 
plt.title('Silhouette analysis For Optimal k')
plt.show()

> __Here , the Silhouette score is maximum at 2 hence we take K = 2.__

In [None]:
#Import required module
from sklearn.cluster import KMeans
 
#Initialize the class object
kmeans = KMeans(n_clusters= 2)
 
#predict the labels of clusters.
label = kmeans.fit_predict(df3)
 
print(label)

In [None]:
#Getting unique labels
 
u_labels = np.unique(label)
 
#plotting the results:
 
for i in u_labels:
    plt.scatter(df3[label == i , 0] , df3[label == i , 1] , label = i)
plt.legend()
plt.show()

In [None]:
#Getting the Centroids
centroids = kmeans.cluster_centers_
u_labels = np.unique(label)
 
#plotting the results:
 
for i in u_labels:
    plt.scatter(df3[label == i , 0] , df3[label == i , 1] , label = i)
plt.scatter(centroids[:,0] , centroids[:,1] , s = 100, color = 'k')
plt.legend()
plt.show()


~~~
The goal of this algorithm is to find groups in the data, with the number of groups represented by the variable K.
~~~

## Conclusion 

__The dataset contains 16 features and 1 target variable for binary classification, which indicates whether a client will subscribe to a term deposit. With the given bank data, we performed exploratory data analysis, visualized the data, built machine learning models, and evaluated them. After pre-processing the data, we applied several classification algorithms; the Random Forest classifier achieved the highest accuracy (87%) of the algorithms compared.__