# Client Subscription Prediction


<img src="Subscription.jpg" height='400px' width='100%'><br/>






## Table of Contents

1. [Problem Statement](#section1)<br>
2. [Importing Packages](#section2)<br>
3. [Loading Data](#section3)<br>
  - 3.1 [Description of the Datasets](#section301)<br>
  - 3.2 [Pandas Profiling before Data Preprocessing](#section302)<br>
4. [Data Preprocessing](#section4)<br>
  - 4.1 [Filling missing values](#section401)<br>
  - 4.2 [Remove highly correlated columns](#section402)<br>
  - 4.3 [Handling Outliers](#section403)<br>
  - 4.4 [Pandas Profiling after Data Preprocessing](#section404)<br>
  - 4.5 [Exploratory Data Analysis](#section405)<br>
5. [Data preparation for model building](#section5)<br>
  - 5.1 [Dummification / One-Hot Encoding](#section501)<br>
  - 5.2 [Standardizing continuous independent variables](#section502)<br>
  - 5.3 [Segregating Train and Test data](#section503)<br>
6. [Ensemble Modelling and Prediction](#section6)<br>  
  - 6.1 [Logistic Regression](#section601)
      - 6.1.1 [Using Default Model](#section60102)
          - 6.1.1.1 [Building Model and Prediction](#section6010201)
          - 6.1.1.2 [Model Evaluation](#section6010202)
      - 6.1.2 [Using RFE](#section60103)
          - 6.1.2.1 [Building Model and Prediction](#section6010301)
          - 6.1.2.2 [Model Evaluation](#section6010301)
      - 6.1.3 [Using RandomSearchCV](#section60104)
          - 6.1.3.1 [Building Model and Prediction](#section6010401)
          - 6.1.3.2 [Model Evaluation](#section6010401)
  - 6.2 [Decision Tree](#section602)
      - 6.2.1 [Using Default Model](#section60201)
          - 6.2.1.1 [Building Model and Prediction](#section6020101)
          - 6.2.1.2 [Model Evaluation](#section6020102)
      - 6.2.2 [Using GridSearchCV](#section60202)
          - 6.2.2.1 [Building Model and Prediction](#section6020201)
          - 6.2.2.2 [Model Evaluation](#section6020202)
      - 6.2.3 [Using RandomSearchCV](#section60203)
          - 6.2.3.1 [Building Model and Prediction](#section6020301)
          - 6.2.3.2 [Model Evaluation](#section6020302)
  - 6.3 [Random Forest](#section603)
      - 6.3.1 [Using Default Model](#section60301)
          - 6.3.1.1 [Building Model and Prediction](#section6030101)
          - 6.3.1.2 [Model Evaluation](#section6030102)
      - 6.3.2 [Using GridSearchCV](#section60302)
          - 6.3.2.1 [Building Model and Prediction](#section6030201)
          - 6.3.2.2 [Model Evaluation](#section6030202)
      - 6.3.3 [Using RandomSearchCV](#section60303)
          - 6.3.3.1 [Building Model and Prediction](#section6030301)
          - 6.3.3.2 [Model Evaluation](#section6030302)
7.  [Conclusion](#section7)<br>
    - 7.1 [Choosing Best Model for Prediction](#section701)
    - 7.2 [Final Prediction](#section702)

<a id=section1></a>
# 1. Problem Statement

The data relates to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict whether the client will subscribe to a term deposit (variable y).

Link: https://archive.ics.uci.edu/ml/datasets/Bank+Marketing  




<a id=section2></a>
# 2. Importing Packages

In [0]:
import numpy as np                     

import pandas as pd

# To suppress pandas warnings.
pd.set_option('mode.chained_assignment', None)

# To display all the data in each column
pd.set_option('display.max_colwidth', None)

# To display up to 10000 rows
pd.set_option('display.max_rows', 10000)

# To display every column of the dataset in head()
pd.options.display.max_columns = 100

import warnings
warnings.filterwarnings('ignore')     

In [0]:
import matplotlib.pyplot as plt
%matplotlib inline

# To apply seaborn styles to the plots.
import seaborn as sns
sns.set(style='whitegrid', font_scale=1.3, color_codes=True)      

<a id=section3></a>

# 3. Loading Data

In [0]:
# loading data from csv file to a data frame
df_train = pd.read_csv("bank-train.csv", delimiter=';')
df_test = pd.read_csv("bank-test.csv",delimiter=';')

# Adding new column 'is_test_data' so that we can easily separate train and test 
# data during prediction process
df_train['is_test_data'] = 0

df_test['is_test_data'] = 1

# concat train and test data for data pre processing
df_bank = pd.concat([df_train,df_test])

del df_train
del df_test

df_bank.head()

Unnamed: 0,age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y,is_test_data
0,56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,0
1,57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,0
2,37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,0
3,40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,0
4,56,services,married,high.school,no,no,yes,telephone,may,mon,307,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no,0
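The flag-and-concatenate pattern used above can be sketched on toy data. The two small frames and their values are invented for illustration; only the `is_test_data` column name comes from the notebook.

```python
import pandas as pd

# Toy stand-ins for the real CSV files (values are made up).
toy_train = pd.DataFrame({"age": [30, 41, 55], "y": ["no", "yes", "no"]})
toy_test = pd.DataFrame({"age": [28, 63], "y": ["no", "yes"]})

# Tag each frame so the rows can be told apart after concatenation.
toy_train["is_test_data"] = 0
toy_test["is_test_data"] = 1
toy_all = pd.concat([toy_train, toy_test])

# Each frame keeps its original index, so labels repeat after concat.
print(toy_all.index.tolist())

# Recover the two parts later by filtering on the flag.
recovered_train = toy_all[toy_all["is_test_data"] == 0].drop(columns="is_test_data")
recovered_test = toy_all[toy_all["is_test_data"] == 1].drop(columns="is_test_data")
print(recovered_train.shape, recovered_test.shape)
```

Because the index labels repeat, row selection after the concat is safer through the flag column than through index labels.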


<a id=section301></a>
## 3.1 Description of the Datasets

#### a. Check shape

In [0]:
#shape of data
df_bank.shape

(45307, 22)

#### b. info

**Input variables:**

**Bank client data:**

1 - age (numeric)

2 - job : type of job (categorical: 'admin.','blue-collar','entrepreneur','housemaid','management','retired','self-employed','services','student','technician','unemployed','unknown')

3 - marital : marital status (categorical: 'divorced','married','single','unknown'; note: 'divorced' means divorced or widowed)

4 - education (categorical: 'basic.4y','basic.6y','basic.9y','high.school','illiterate','professional.course','university.degree','unknown')

5 - default: has credit in default? (categorical: 'no','yes','unknown')

6 - housing: has housing loan? (categorical: 'no','yes','unknown')

7 - loan: has personal loan? (categorical: 'no','yes','unknown')

**Related with the last contact of the current campaign:**

8 - contact: contact communication type (categorical: 'cellular','telephone')

9 - month: last contact month of year (categorical: 'jan', 'feb', 'mar', ..., 'nov', 'dec')

10 - day_of_week: last contact day of the week (categorical: 'mon','tue','wed','thu','fri')

11 - duration: last contact duration, in seconds (numeric). Important note: this attribute highly affects the output target (e.g., if duration=0 then y='no'). Yet, the duration is not known before a call is performed. Also, after the end of the call y is obviously known. Thus, this input should only be included for benchmark purposes and should be discarded if the intention is to have a realistic predictive model.

**Other attributes:**

12 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

13 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric; 999 means client was not previously contacted)

14 - previous: number of contacts performed before this campaign and for this client (numeric)

15 - poutcome: outcome of the previous marketing campaign (categorical: 'failure','nonexistent','success')

**Social and economic context attributes:**

16 - emp.var.rate: employment variation rate - quarterly indicator (numeric)

17 - cons.price.idx: consumer price index - monthly indicator (numeric)

18 - cons.conf.idx: consumer confidence index - monthly indicator (numeric)

19 - euribor3m: euribor 3 month rate - daily indicator (numeric)

20 - nr.employed: number of employees - quarterly indicator (numeric)

**Output variable (desired target):**

21 - y - has the client subscribed a term deposit? (binary: 'yes','no')





In [0]:
df_bank.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 45307 entries, 0 to 4118
Data columns (total 22 columns):
age               45307 non-null int64
job               45307 non-null object
marital           45307 non-null object
education         45307 non-null object
default           45307 non-null object
housing           45307 non-null object
loan              45307 non-null object
contact           45307 non-null object
month             45307 non-null object
day_of_week       45307 non-null object
duration          45307 non-null int64
campaign          45307 non-null int64
pdays             45307 non-null int64
previous          45307 non-null int64
poutcome          45307 non-null object
emp.var.rate      45307 non-null float64
cons.price.idx    45307 non-null float64
cons.conf.idx     45307 non-null float64
euribor3m         45307 non-null float64
nr.employed       45307 non-null float64
y                 45307 non-null object
is_test_data      45307 non-null int64
dtypes: float64(5), int64(6), object(11)

**Observations :**  

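1. No column contains null values: all 22 columns report 45307 non-null entries. Missing information is instead encoded as the category 'unknown' in several categorical columns (for example `job`, `education` and `default`).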

#### c. describe

In [0]:
df_bank.describe()

Unnamed: 0,age,duration,campaign,pdays,previous,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,is_test_data
count,45307.0,45307.0,45307.0,45307.0,45307.0,45307.0,45307.0,45307.0,45307.0,45307.0,45307.0
mean,40.032203,258.148917,2.564835,962.288785,0.174543,0.082166,93.576032,-40.502282,3.621297,5166.985525,0.090913
std,10.411407,258.8642,2.752261,187.370863,0.499364,1.570231,0.578881,4.625101,1.73435,72.380791,0.287489
min,17.0,0.0,1.0,0.0,0.0,-3.4,92.201,-50.8,0.634,4963.6,0.0
25%,32.0,102.0,1.0,999.0,0.0,-1.8,93.075,-42.7,1.344,5099.1,0.0
50%,38.0,180.0,2.0,999.0,0.0,1.1,93.749,-41.8,4.857,5191.0,0.0
75%,47.0,319.0,3.0,999.0,0.0,1.4,93.994,-36.4,4.961,5228.1,0.0
max,98.0,4918.0,56.0,999.0,7.0,1.4,94.767,-26.9,5.045,5228.1,1.0


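**Observations :** Some columns appear to contain outliers: `duration` has a maximum of 4918 against a 75th percentile of 319, and `campaign` has a maximum of 56 against a 75th percentile of 3. The pandas profiling report in the next step helps confirm this.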

 <a id=section302></a>
## 3.2 Pandas Profiling before Data Preprocessing

In [0]:
# To install pandas profiling please run this command.

!pip install folium==0.2.1
!pip install pandas-profiling --upgrade

In [0]:
from pandas_profiling import ProfileReport

# Running pandas profiling to get better understanding of data
profile =  ProfileReport(df_bank, title='Pandas Profiling Report before data preprocessing', html={'style':{'full_width':True}})
profile.to_file(output_file="report_before_processing.html")

 <a id=section4></a>
# 4. Data Preprocessing

 <a id=section401></a>
 
## 4.1 Remove highly correlated columns

In [0]:
# extracting feature columns
feature_cols = list(df_bank.columns)
feature_cols.remove('y')
feature_cols.remove('is_test_data')
feature_cols

['age',
 'job',
 'marital',
 'education',
 'default',
 'housing',
 'loan',
 'contact',
 'month',
 'day_of_week',
 'duration',
 'campaign',
 'pdays',
 'previous',
 'poutcome',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'euribor3m',
 'nr.employed']

In [0]:
# extracting highly correlated columns(except target variable) to drop

# Create correlation matrix from the numeric feature columns only
corr_matrix = df_bank[feature_cols].select_dtypes(include='number').corr().abs()

# Select upper triangle of correlation matrix
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))

# Find feature columns with correlation greater than 0.80
cols_to_drop = [column for column in upper.columns if any(upper[column] > 0.80)]
cols_to_drop

['euribor3m', 'nr.employed']
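The upper-triangle filter can be checked on a small example. This is a sketch on invented data: column `b` is an exact multiple of `a`, and `c` is only weakly related to either.

```python
import numpy as np
import pandas as pd

# Toy numeric frame (values are made up for illustration).
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [2.0, 4.0, 6.0, 8.0, 10.0],
    "c": [5.0, 1.0, 4.0, 2.0, 3.0],
})

corr = toy.corr().abs()
# Keep only entries strictly above the diagonal so each pair appears once.
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
upper = corr.where(mask)

# A column is dropped when it correlates above the threshold with any earlier column.
to_drop = [col for col in upper.columns if (upper[col] > 0.80).any()]
print(to_drop)
```

Of each highly correlated pair, the column that appears later in the frame is the one dropped, so the result depends on column order.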

In [0]:
# lets drop highly correlated columns
df_bank.drop(cols_to_drop, axis=1, inplace=True)

 <a id=section402></a>
## 4.2 Handling Outliers

Outliers are only defined for continuous variables.

NOTE:

1. Remove outliers if they make up less than 2% of the rows.

2. Otherwise, fill the outlier values with the median (continuous) or mode (categorical), depending on the data.
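The two rules in the note could be applied as in the following sketch. The helper name `handle_outliers`, the 1.5 × IQR bounds on the 25th and 75th percentiles, and the toy data are assumptions for illustration; the notebook does not define such a function, and the sketch covers continuous columns only.

```python
import numpy as np
import pandas as pd

def handle_outliers(df, column, max_drop_share=0.02):
    """Drop outlier rows if they are rare, otherwise replace them with the median.

    Hypothetical helper: bounds are 1.5 * IQR beyond the 25th/75th percentiles.
    """
    q1, q3 = np.percentile(df[column], [25, 75])
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    is_outlier = (df[column] < lower) | (df[column] > upper)

    if is_outlier.mean() < max_drop_share:
        # Rule 1: fewer than 2% of rows are affected, so remove them.
        return df.loc[~is_outlier].copy()

    # Rule 2: too many rows to drop, so overwrite the outliers with the median.
    result = df.copy()
    result[column] = result[column].astype(float)
    result.loc[is_outlier, column] = result[column].median()
    return result

# 100 ordinary values plus one extreme value: about 1% outliers, so the row is dropped.
toy = pd.DataFrame({"duration": list(range(100, 200)) + [5000]})
cleaned = handle_outliers(toy, "duration")
print(len(toy), len(cleaned))
```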



In [0]:
# split the columns into continuous (numeric dtype) and categorical (everything else)
continuous_columns = list(df_bank.select_dtypes(include='number').columns)
categorical_columns = list(df_bank.columns.difference(continuous_columns))

print(continuous_columns)
print(categorical_columns)

['age', 'duration', 'campaign', 'pdays', 'previous', 'emp.var.rate', 'cons.price.idx', 'cons.conf.idx', 'is_test_data']
['contact', 'day_of_week', 'default', 'education', 'housing', 'job', 'loan', 'marital', 'month', 'poutcome', 'y']


In [0]:
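for column in continuous_columns:
  values = df_bank[column]
  # bounds sit 1.5 times the 10th-90th percentile range beyond those percentiles
  p10, p90 = np.percentile(values, [10, 90])
  spread = p90 - p10
  lower_bound = p10 - (1.5 * spread)
  upper_bound = p90 + (1.5 * spread)
  # a value is an outlier when it lies below the lower bound OR above the upper bound
  has_outliers = ((values < lower_bound) | (values > upper_bound)).any()

  print(column, has_outliers)


**Observations:** A column is flagged when any of its values lies outside these bounds. The summary statistics in 3.1 suggest that at least `duration` (maximum 4918 against a 75th percentile of 319) and `campaign` (maximum 56 against a 75th percentile of 3) contain such values. No rows are removed or imputed in this notebook; the row count stays at 45307.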

 <a id=section403></a>
## 4.3 Miscellaneous

#### Drop 'previous' column 

In [0]:
# dropping 'previous' column because it has 86% zeroes
df_bank.drop('previous', axis=1, inplace=True)

#### Replace yes with 1 and no with 0 for target variable

In [0]:
# replace yes with 1 and no with 0 for target variable
df_bank.replace({'y': {"yes": 1,'no':0}},inplace=True)

 <a id=section404></a>
## 4.4 Pandas Profiling after Data Preprocessing

In [0]:
from pandas_profiling import ProfileReport

# Running pandas profiling to get better understanding of data
profile =  ProfileReport(df_bank, title='Pandas Profiling Report after data preprocessing', html={'style':{'full_width':True}})
profile.to_file(output_file="report_after_processing.html")

 <a id=section5></a>
# 5. Data preparation for model building

 <a id=section501></a>
## 5.1 Dummification / One-Hot Encoding of categorical variables

In [0]:
# lets look at how many unique labels each category has
for column in categorical_columns:
  print(column, " - ", df_bank[column].nunique())

contact  -  2
day_of_week  -  5
default  -  3
education  -  8
housing  -  3
job  -  12
loan  -  3
marital  -  4
month  -  10
poutcome  -  3
y  -  2


In [0]:
df_bank_cat = pd.get_dummies(data=df_bank[categorical_columns],drop_first=True)
df_bank_cat.columns

Index(['y', 'contact_telephone', 'day_of_week_mon', 'day_of_week_thu',
       'day_of_week_tue', 'day_of_week_wed', 'default_unknown', 'default_yes',
       'education_basic.6y', 'education_basic.9y', 'education_high.school',
       'education_illiterate', 'education_professional.course',
       'education_university.degree', 'education_unknown', 'housing_unknown',
       'housing_yes', 'job_blue-collar', 'job_entrepreneur', 'job_housemaid',
       'job_management', 'job_retired', 'job_self-employed', 'job_services',
       'job_student', 'job_technician', 'job_unemployed', 'job_unknown',
       'loan_unknown', 'loan_yes', 'marital_married', 'marital_single',
       'marital_unknown', 'month_aug', 'month_dec', 'month_jul', 'month_jun',
       'month_mar', 'month_may', 'month_nov', 'month_oct', 'month_sep',
       'poutcome_nonexistent', 'poutcome_success'],
      dtype='object')

In [0]:
df_bank_cat.shape

(45307, 44)

**Observations:**

The encoded frame has 44 columns: the target `y` (already numeric, so left unchanged) plus 43 dummy columns produced by the one-hot encoding.
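The effect of `drop_first=True` is easiest to see on a small frame. A minimal sketch with invented rows; the column names and category labels are taken from the dataset description.

```python
import pandas as pd

toy = pd.DataFrame({
    "contact": ["cellular", "telephone", "cellular"],
    "poutcome": ["failure", "nonexistent", "success"],
})

# drop_first=True omits the alphabetically first level of each column,
# so a column with k levels becomes k - 1 indicator columns.
encoded = pd.get_dummies(toy, drop_first=True)
print(list(encoded.columns))
```

The dropped level (`cellular`, `failure`) becomes the baseline: a row with zeros in all of a variable's indicator columns belongs to that level.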

 <a id=section502></a>
 ## 5.2 Standardizing continuous variables

In [0]:
continuous_columns.remove('previous')
continuous_columns

['age',
 'duration',
 'campaign',
 'pdays',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx',
 'is_test_data']

In [0]:
from sklearn.preprocessing import StandardScaler

continuous_columns.remove('is_test_data')
# standardizing of data
scaler = StandardScaler().fit(df_bank[continuous_columns])
data = scaler.transform(df_bank[continuous_columns])

In [0]:
# forming dataframe after standardization
df_bank_num= pd.DataFrame(data)
df_bank_num.columns = continuous_columns
df_bank_num.index = df_bank.index
print(df_bank_num.shape)

(45307, 7)
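`StandardScaler` replaces each value x with (x - mean) / std, where std is the population standard deviation of the column. A minimal sketch of that arithmetic with invented ages, using only the standard library:

```python
import statistics

ages = [20.0, 30.0, 40.0, 50.0, 60.0]  # made-up values

mean = statistics.fmean(ages)
std = statistics.pstdev(ages)  # population standard deviation (divides by n)

# After scaling, the column has mean 0 and standard deviation 1.
scaled = [(x - mean) / std for x in ages]
print([round(z, 3) for z in scaled])
```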


#### Merging all columns together.

In [0]:
# merge categorical and continuous columns
df_bank_sd = pd.concat([df_bank_cat, df_bank_num],axis=1).reindex(df_bank.index)
df_bank_sd.shape

(45307, 51)

In [0]:
# add is_test_data column
df_bank_sd = pd.concat([df_bank_sd, df_bank['is_test_data']],axis=1).reindex(df_bank.index)

df_bank_sd.shape

(45307, 52)

In [0]:
# add target column
# df_bank_sd = pd.concat([df_bank_sd, df_bank['y']],axis=1).reindex(df_bank.index)
# df_bank_sd.shape

 <a id=section604></a>
 ## 5.3 Segregating Train and Test data

In [0]:
df_bank_train = df_bank_sd[df_bank_sd['is_test_data'] == 0]
df_bank_test = df_bank_sd[df_bank_sd['is_test_data'] == 1]

In [0]:
# dropping is_test_data column
df_bank_train = df_bank_train.drop('is_test_data', axis=1)
df_bank_test = df_bank_test.drop('is_test_data', axis=1)

In [0]:
print(df_bank_train.shape)
print(df_bank_test.shape)

(41188, 51)
(4119, 51)


 <a id=section505></a>
 ## 5.4 Splitting Train data further into train and test data

In [0]:
feature_cols = list(df_bank_train.columns)
feature_cols.remove('y')
feature_cols

['contact_telephone',
 'day_of_week_mon',
 'day_of_week_thu',
 'day_of_week_tue',
 'day_of_week_wed',
 'default_unknown',
 'default_yes',
 'education_basic.6y',
 'education_basic.9y',
 'education_high.school',
 'education_illiterate',
 'education_professional.course',
 'education_university.degree',
 'education_unknown',
 'housing_unknown',
 'housing_yes',
 'job_blue-collar',
 'job_entrepreneur',
 'job_housemaid',
 'job_management',
 'job_retired',
 'job_self-employed',
 'job_services',
 'job_student',
 'job_technician',
 'job_unemployed',
 'job_unknown',
 'loan_unknown',
 'loan_yes',
 'marital_married',
 'marital_single',
 'marital_unknown',
 'month_aug',
 'month_dec',
 'month_jul',
 'month_jun',
 'month_mar',
 'month_may',
 'month_nov',
 'month_oct',
 'month_sep',
 'poutcome_nonexistent',
 'poutcome_success',
 'age',
 'duration',
 'campaign',
 'pdays',
 'emp.var.rate',
 'cons.price.idx',
 'cons.conf.idx']

In [0]:
X = df_bank_train[feature_cols]
y = df_bank_train['y']

In [0]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.2)

In [0]:
print('Train cases as below')
print('X_train shape: ',X_train.shape)
print('y_train shape: ',y_train.shape)
print('\nTest cases as below')
print('X_test shape: ',X_test.shape)
print('y_test shape: ',y_test.shape)

Train cases as below
X_train shape:  (32950, 50)
y_train shape:  (32950,)

Test cases as below
X_test shape:  (8238, 50)
y_test shape:  (8238,)
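Passing `stratify=y` keeps the class ratio the same in both parts of the split, which matters here because only about 11% of rows are positive. A sketch on invented labels; `random_state=0` is an assumption added to make the toy split repeatable (the notebook's call does not set it, so its split changes between runs).

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Imbalanced toy labels: 90 negatives and 10 positives.
labels = [0] * 90 + [1] * 10
rows = list(range(len(labels)))

rows_train, rows_test, y_tr, y_te = train_test_split(
    rows, labels, stratify=labels, test_size=0.2, random_state=0
)

# Both parts keep the 9:1 ratio of the full label list.
print(Counter(y_tr), Counter(y_te))
```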


 <a id=section6></a>
 # 6. Ensemble Modelling and Prediction
 
 Ensemble modeling is a process where multiple diverse models are created to predict an outcome, either by using many different modeling algorithms or by using different training data sets. The ensemble then aggregates the predictions of the base models into one final prediction for unseen data.
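One way to aggregate base models is majority voting. The sketch below uses scikit-learn's `VotingClassifier` on synthetic data from `make_classification`; both the aggregation method and the data are assumptions for illustration, since the models in this section are fitted and evaluated separately.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the bank data.
X_toy, y_toy = make_classification(n_samples=500, n_features=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, stratify=y_toy, test_size=0.2, random_state=0
)

# Hard voting: each base model casts one vote and the majority class wins.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
        ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X_tr, y_tr)
predictions = ensemble.predict(X_te)
print(predictions.shape)
```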


 <a id=section601></a>
 ## 6.1 Logistic Regression
 
 
Logistic regression is the appropriate regression analysis to conduct when the dependent variable is dichotomous (binary).  Like all regression analyses, the logistic regression is a predictive analysis.  Logistic regression is used to describe data and to explain the relationship between one dependent binary variable and one or more nominal, ordinal, interval or ratio-level independent variables.

Logistic Regression was used in the biological sciences in early twentieth century. It was then used in many social science applications. Logistic Regression is used when the dependent variable(target) is categorical.

For example,
1. To predict whether an email is spam (1) or not (0)
2. To predict whether a tumor is malignant (1) or not (0)
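Logistic regression passes a weighted sum of the features through the sigmoid function and thresholds the resulting probability. A minimal sketch of that step; the weights, intercept, and feature values are invented for illustration and are not the coefficients fitted in this notebook.

```python
import math

def sigmoid(z):
    """Map any real number to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

weights = [0.8, -0.5]   # hypothetical coefficients
intercept = -0.2        # hypothetical intercept
features = [1.5, 0.4]   # one hypothetical, already standardized observation

# Linear score, then probability of class 1, then a 0.5 decision threshold.
z = intercept + sum(w * x for w, x in zip(weights, features))
probability = sigmoid(z)
predicted_class = 1 if probability >= 0.5 else 0
print(round(probability, 3), predicted_class)
```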


In [0]:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import RandomizedSearchCV

def logistic_reg(gridsearch = False):
    logreg = LogisticRegression()
    # Assumed search space over LogisticRegression's own hyperparameters
    # (the values are illustrative).
    parameters = {'C': [0.01, 0.1, 1, 10], 'class_weight': [None, 'balanced']}
    if not gridsearch:
        # randomized search samples n_iter combinations from the space
        return RandomizedSearchCV(logreg, parameters, n_iter=5, cv=10, refit=True)
    # grid search evaluates every combination
    return GridSearchCV(logreg, parameters, cv=10, refit=True)

 <a id=section60101></a>
### 6.1.1 Using Default Model

 <a id=section6010101></a>
#### 6.1.1.1 Building Model and Prediction

In [0]:
logreg = LogisticRegression(random_state=101)
logreg.fit(X_train,y_train)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=101, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
 # make predictions on the training set
y_pred_train_lr = logreg.predict(X_train) 

In [0]:
 # make predictions on the testing set
y_pred_test_lr = logreg.predict(X_test)  

 <a id=section6010102></a>
#### 6.1.1.2 Model Evaluation

In [0]:
from sklearn.metrics import classification_report
from sklearn.metrics import f1_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score

In [0]:
print('Report:\n',classification_report(y_train, y_pred_train_lr))
print("F1 Score:",f1_score(y_pred_train_lr,y_train))
print('confusion Matrix:\n',confusion_matrix(y_pred_train_lr,y_train))
print('cross validation:',cross_val_score(logreg, X, y, cv=5))

Report:
               precision    recall  f1-score   support

           0       0.93      0.97      0.95     29238
           1       0.67      0.42      0.52      3712

    accuracy                           0.91     32950
   macro avg       0.80      0.70      0.73     32950
weighted avg       0.90      0.91      0.90     32950

F1 Score: 0.5158940397350993
confusion Matrix:
 [[28468  2154]
 [  770  1558]]
cross validation: [0.89366351 0.88674436 0.87921826 0.64428797 0.64416657]
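The class 1 scores in the report can be reproduced by hand from the confusion matrix. Because the call above passes the predictions first, the rows of the printed matrix are predicted classes and the columns are actual classes; scikit-learn's own convention is `confusion_matrix(y_true, y_pred)`, which would print the transpose.

```python
# Counts copied from the training confusion matrix printed above
# (rows = predicted class, columns = actual class).
pred0_true0, pred0_true1 = 28468, 2154
pred1_true0, pred1_true1 = 770, 1558

true_positives = pred1_true1
false_positives = pred1_true0   # predicted 1, actually 0
false_negatives = pred0_true1   # predicted 0, actually 1

precision = true_positives / (true_positives + false_positives)
recall = true_positives / (true_positives + false_negatives)
f1 = 2 * precision * recall / (precision + recall)

print(round(precision, 2), round(recall, 2), f1)
```

A recall of 0.42 for class 1 means that more than half of the actual subscribers in the training split are missed. The 0.91 accuracy hides this because about 89% of the rows belong to class 0.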


In [0]:
print('Report:\n',classification_report(y_test, y_pred_test_lr))
print("F1 Score:",f1_score(y_pred_test_lr,y_test))
print('confusion Matrix:\n',confusion_matrix(y_pred_test_lr,y_test))
print('cross validation:',cross_val_score(logreg, X, y, cv=5))

Report:
               precision    recall  f1-score   support

           0       0.93      0.97      0.95      7310
           1       0.67      0.44      0.53       928

    accuracy                           0.91      8238
   macro avg       0.80      0.70      0.74      8238
weighted avg       0.90      0.91      0.90      8238

F1 Score: 0.5294117647058824
confusion Matrix:
 [[7113  523]
 [ 197  405]]
cross validation: [0.89366351 0.88674436 0.87921826 0.64428797 0.64416657]


 <a id=section60101></a>
 ### 6.1.2 Using RFE

 <a id=section6010201></a>
#### 6.1.2.1 Building Model and Prediction

In [0]:
from sklearn.feature_selection import RFE
rfe = RFE(logreg, n_features_to_select=15)
rfe.fit(X_train,y_train)

RFE(estimator=LogisticRegression(C=1.0, class_weight=None, dual=False,
                                 fit_intercept=True, intercept_scaling=1,
                                 l1_ratio=None, max_iter=100,
                                 multi_class='auto', n_jobs=None, penalty='l2',
                                 random_state=101, solver='lbfgs', tol=0.0001,
                                 verbose=0, warm_start=False),
    n_features_to_select=15, step=1, verbose=0)

In [0]:
print(X_train.columns[rfe.support_])
imp_cols = X_train.columns[rfe.support_]
logreg.fit(X_train[imp_cols],y_train)

Index(['contact_telephone', 'default_unknown', 'education_illiterate',
       'job_retired', 'month_aug', 'month_dec', 'month_mar', 'month_may',
       'month_oct', 'month_sep', 'poutcome_nonexistent', 'poutcome_success',
       'duration', 'emp.var.rate', 'cons.price.idx'],
      dtype='object')


LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=101, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

In [0]:
y_pred_train_lr_rfe = logreg.predict(X_train[imp_cols])

In [0]:
y_pred_test_lr_rfe = logreg.predict(X_test[imp_cols])

 <a id=section6010202></a>
#### 6.1.2.2 Model Evaluation

In [0]:
print('Report:\n',classification_report(y_train, y_pred_train_lr_rfe))
print("F1 Score:",f1_score(y_pred_train_lr_rfe,y_train)) 
print('confusion Matrix:\n',confusion_matrix(y_pred_train_lr_rfe,y_train))
print('cross validation:',cross_val_score(logreg, X[imp_cols], y, cv=5))

Report:
               precision    recall  f1-score   support

           0       0.93      0.97      0.95     29238
           1       0.66      0.41      0.50      3712

    accuracy                           0.91     32950
   macro avg       0.80      0.69      0.73     32950
weighted avg       0.90      0.91      0.90     32950

F1 Score: 0.5030941629034955
confusion Matrix:
 [[28475  2208]
 [  763  1504]]
cross validation: [0.89524156 0.89900461 0.88783685 0.89122253 0.70596091]


In [0]:
print('Report:\n',classification_report(y_test, y_pred_test_lr_rfe))
print("F1 Score:",f1_score(y_pred_test_lr_rfe,y_test))
print('confusion Matrix:\n',confusion_matrix(y_pred_test_lr_rfe,y_test))
print('cross validation:',cross_val_score(logreg, X[imp_cols], y, cv=5))

Report:
               precision    recall  f1-score   support

           0       0.93      0.97      0.95      7310
           1       0.66      0.44      0.53       928

    accuracy                           0.91      8238
   macro avg       0.80      0.70      0.74      8238
weighted avg       0.90      0.91      0.90      8238

F1 Score: 0.5259740259740259
confusion Matrix:
 [[7103  523]
 [ 207  405]]
cross validation: [0.89524156 0.89900461 0.88783685 0.89122253 0.70596091]


In [0]:
## Feature Importance
from sklearn.feature_selection import SelectFromModel
smf = SelectFromModel(logreg)
smf.fit(X_train,y_train)
features = smf.get_support()
feature_name = X_train.columns[features]
feature_name

Index(['contact_telephone', 'default_unknown', 'education_illiterate',
       'job_retired', 'job_self-employed', 'job_student', 'month_aug',
       'month_mar', 'month_may', 'month_nov', 'poutcome_nonexistent',
       'poutcome_success', 'duration', 'emp.var.rate', 'cons.price.idx'],
      dtype='object')

 <a id=section602></a>
 ## 6.2 Decision Trees

 <a id=section60201></a>
### 6.2.1 Using Default Model

 <a id=section6020101></a>
#### 6.2.1.1 Building Model and Prediction

In [0]:
from sklearn.tree import DecisionTreeClassifier

# default hyperparameters, except class_weight='balanced' to offset the class imbalance
dt_cls = DecisionTreeClassifier( class_weight='balanced')
dt_cls.fit(X_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight='balanced', criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')

In [0]:
#prediction on training data
y_pred_train_dt = dt_cls.predict(X_train)

#prediction on testing data
y_pred_test_dt = dt_cls.predict(X_test)

<a id=section6020102></a>
#### 6.2.1.2 Model Evaluation

In [0]:
print('Report:\n',classification_report(y_train, y_pred_train_dt))
print("F1 Score:",f1_score(y_pred_train_dt,y_train))
print('confusion Matrix:\n',confusion_matrix(y_pred_train_dt,y_train))
print('cross validation:',cross_val_score(logreg, X[imp_cols], y, cv=5))

Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       1.00      1.00      1.00      3712

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950

F1 Score: 1.0
confusion Matrix:
 [[29238     0]
 [    0  3712]]
cross validation: [0.89524156 0.89900461 0.88783685 0.89122253 0.70596091]


In [0]:
print('Report:\n',classification_report(y_test, y_pred_test_dt))
print("F1 Score:",f1_score(y_pred_test_dt,y_test))
print('confusion Matrix:\n',confusion_matrix(y_pred_test_dt,y_test))
print('cross validation:',cross_val_score(dt_cls, X[imp_cols], y, cv=5))

Report:
               precision    recall  f1-score   support

           0       0.94      0.94      0.94      7310
           1       0.50      0.49      0.49       928

    accuracy                           0.89      8238
   macro avg       0.72      0.72      0.72      8238
weighted avg       0.89      0.89      0.89      8238

F1 Score: 0.4948787061994609
confusion Matrix:
 [[6842  469]
 [ 468  459]]
cross validation: [0.89305657 0.68390386 0.81585336 0.2128202  0.57253855]


 <a id=section60202></a>
### 6.2.2 Using GridSearchCV

 <a id=section6020201></a>
#### 6.2.2.1 Building Model and Prediction

In [0]:
from sklearn.model_selection import RandomizedSearchCV

In [0]:
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]


# Create the random grid
random_grid = {
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf
}


In [0]:
# Instantiate the random search model
dt_cls_rs = RandomizedSearchCV(estimator = dt_cls, param_distributions = random_grid, n_iter = 100, cv = 3, 
                               verbose=2, random_state=42, n_jobs = -1)

In [0]:
dt_cls_rs.fit(X_train, y_train)

Fitting 3 folds for each of 100 candidates, totalling 300 fits


[Parallel(n_jobs=-1)]: Using backend LokyBackend with 2 concurrent workers.
[Parallel(n_jobs=-1)]: Done  48 tasks      | elapsed:    3.7s
[Parallel(n_jobs=-1)]: Done 290 tasks      | elapsed:   14.1s
[Parallel(n_jobs=-1)]: Done 297 out of 300 | elapsed:   14.4s remaining:    0.1s
[Parallel(n_jobs=-1)]: Done 300 out of 300 | elapsed:   14.5s finished


RandomizedSearchCV(cv=3, error_score=nan,
                   estimator=DecisionTreeClassifier(ccp_alpha=0.0,
                                                    class_weight='balanced',
                                                    criterion='gini',
                                                    max_depth=None,
                                                    max_features=None,
                                                    max_leaf_nodes=None,
                                                    min_impurity_decrease=0.0,
                                                    min_impurity_split=None,
                                                    min_samples_leaf=1,
                                                    min_samples_split=2,
                                                    min_weight_fraction_leaf=0.0,
                                                    presort='deprecated',
                                                    random_state=None,
    

In [0]:
#prediction on training data
y_pred_train_dt_rs = dt_cls_rs.predict(X_train)

#prediction on testing data
y_pred_test_dt_rs = dt_cls_rs.predict(X_test)
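
The fitted search object also records which parameter combination was selected. Below is a minimal sketch on synthetic data of reading `best_params_`, `best_score_` and `best_estimator_` from a fitted `RandomizedSearchCV`. The data, the smaller parameter lists, and `scoring='f1'` are assumptions chosen for illustration, not the notebook's settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=1000, n_features=10,
                                     weights=[0.89, 0.11], random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

# A reduced search space, for illustration only
param_distributions = {
    'max_features': ['sqrt', None],
    'max_depth': [5, 10, 20, None],
    'min_samples_split': [2, 5, 10],
    'min_samples_leaf': [1, 2, 4],
}

search = RandomizedSearchCV(
    estimator=DecisionTreeClassifier(class_weight='balanced', random_state=42),
    param_distributions=param_distributions,
    n_iter=20, cv=3, scoring='f1', random_state=42, n_jobs=1,
)
search.fit(X_tr, y_tr)

print('best parameters:', search.best_params_)
print('best mean CV F1:', search.best_score_)

# With the default refit=True, predict() uses the best estimator refit on all of X_tr
y_pred_search = search.predict(X_te)
y_pred_best = search.best_estimator_.predict(X_te)
```

In the notebook, the corresponding call would be `dt_cls_rs.best_params_` after `dt_cls_rs.fit(X_train, y_train)`.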

<a id=section6020202></a>
#### 6.2.2.2 Model Evaluation

In [0]:
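# Training-set metrics for the tuned tree.
# confusion_matrix receives (y_pred, y_true) here: rows are predicted labels, columns are true labels.
print('Report:\n',classification_report(y_train, y_pred_train_dt_rs))
print("F1 Score:",f1_score(y_pred_train_dt_rs,y_train))
print('confusion Matrix:\n',confusion_matrix(y_pred_train_dt_rs,y_train))
# Passing the search object to cross_val_score re-runs the full randomized search
# inside each of the 5 outer folds (nested cross-validation): 5 x 300 fits.
print('cross validation:',cross_val_score(dt_cls_rs, X[imp_cols], y, cv=5))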

Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00     29238
           1       1.00      1.00      1.00      3712

    accuracy                           1.00     32950
   macro avg       1.00      1.00      1.00     32950
weighted avg       1.00      1.00      1.00     32950

F1 Score: 1.0
confusion Matrix:
 [[29238     0]
 [    0  3712]]
cross validation: [0.88747269 0.59213401 0.77664482 0.25664684 0.52652665]
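
The training report shows perfect scores, while the test F1 for class 1 reported further down is about 0.48. A gap like this is typical of a fully grown tree that memorises the training set. The sketch below, on synthetic data, compares train and test F1 for an unconstrained tree and a constrained one; the data and the values `max_depth=5` and `min_samples_leaf=20` are assumptions for illustration and are not taken from the notebook's search grid.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced and noisy stand-in data (hypothetical)
X_demo, y_demo = make_classification(n_samples=2000, n_features=10,
                                     weights=[0.89, 0.11], flip_y=0.05,
                                     random_state=42)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                          stratify=y_demo, random_state=42)

settings = {
    'unconstrained': {},                                    # grows until leaves are pure
    'constrained': {'max_depth': 5, 'min_samples_leaf': 20},  # illustrative limits
}

results = {}
for name, params in settings.items():
    tree = DecisionTreeClassifier(class_weight='balanced', random_state=42, **params)
    tree.fit(X_tr, y_tr)
    # f1_score takes (y_true, y_pred)
    train_f1 = f1_score(y_tr, tree.predict(X_tr))
    test_f1 = f1_score(y_te, tree.predict(X_te))
    results[name] = (train_f1, test_f1)
    print(f'{name}: train F1 = {train_f1:.3f}, test F1 = {test_f1:.3f}')
```

Comparing the two rows shows how much of the training score comes from memorisation; whether constraining the tree also improves the test score depends on the data.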



In [0]:
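# Test-set metrics for the tuned tree.
# confusion_matrix receives (y_pred, y_true) here: rows are predicted labels, columns are true labels.
print('Report:\n',classification_report(y_test, y_pred_test_dt_rs))
print("F1 Score:",f1_score(y_pred_test_dt_rs,y_test))
print('confusion Matrix:\n',confusion_matrix(y_pred_test_dt_rs,y_test))
# Nested cross-validation: the randomized search is repeated inside each of the 5 outer folds
print('cross validation:',cross_val_score(dt_cls_rs, X[imp_cols], y, cv=5))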

Report:
               precision    recall  f1-score   support

           0       0.93      0.93      0.93      7310
           1       0.48      0.48      0.48       928

    accuracy                           0.88      8238
   macro avg       0.71      0.71      0.71      8238
weighted avg       0.88      0.88      0.88      8238

F1 Score: 0.4809447128287708
confusion Matrix:
 [[6823  480]
 [ 487  448]]
cross validation: [0.89075018 0.56457878 0.79764506 0.19703776 0.42090567]
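
The evaluation cells pass the predictions first and the true labels second to `confusion_matrix` and `f1_score`. The sketch below uses a small set of made-up labels to show what that changes: the confusion matrix is transposed relative to scikit-learn's `(y_true, y_pred)` convention, while the binary F1 score is the same in both orders. The toy arrays are assumptions for illustration only.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

# Toy labels (hypothetical): 3 true negatives, 2 false positives,
# 1 false negative, 2 true positives
y_true = np.array([0, 0, 0, 0, 0, 1, 1, 1])
y_pred = np.array([0, 0, 1, 1, 0, 1, 0, 1])

# Conventional order: rows are true labels, columns are predicted labels
cm_standard = confusion_matrix(y_true, y_pred)
# Swapped order: rows are predicted labels, columns are true labels
cm_swapped = confusion_matrix(y_pred, y_true)

print(cm_standard)  # [[3 2]
                    #  [1 2]]
print(cm_swapped)   # [[3 1]
                    #  [2 2]]

# Swapping the arguments exchanges precision and recall, which leaves binary F1 unchanged
f1_standard = f1_score(y_true, y_pred)
f1_swapped = f1_score(y_pred, y_true)
print(f1_standard, f1_swapped)
```

Read this way, the test matrix above has predicted labels on the rows: its second row is the clients predicted as class 1, and its second column is the clients whose true class is 1.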



 <a id=section603></a>
 ## 6.3 Random Forest