# Case Intro
Term deposits are a major source of income for a bank. A term deposit is a cash investment held at a financial institution: the money is invested at an agreed rate of interest for a fixed period, or term. The bank uses several outreach channels to sell term deposits to its customers, such as email marketing, advertisements, telephonic marketing, and digital marketing.

Telephonic marketing campaigns remain one of the most effective ways to reach people. However, they require a large investment, because large call centers are hired to run them. It is therefore crucial to identify in advance the customers most likely to convert, so that they can be targeted specifically by phone.

The data is related to direct marketing campaigns (phone calls) of a Portuguese banking institution. The classification goal is to predict if the client will subscribe to a term deposit (variable y).

Content
The data is related to the direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no'). The dataset used here is:

Bank.csv: 45,211 rows and 17 columns, ordered by date (from May 2008 to November 2010)

Detailed Column Descriptions
bank client data:

1 - age (numeric)

2 - job : type of job (categorical: "admin.","unknown","unemployed","management","housemaid","entrepreneur","student",
"blue-collar","self-employed","retired","technician","services")

3 - marital : marital status (categorical: "married","divorced","single"; note: "divorced" means divorced or widowed)

4 - education (categorical: "unknown","secondary","primary","tertiary")

5 - default: has credit in default? (binary: "yes","no")

6 - balance: average yearly balance, in euros (numeric)

7 - housing: has housing loan? (binary: "yes","no")

8 - loan: has personal loan? (binary: "yes","no")

# related to the last contact of the current campaign:

9 - contact: contact communication type (categorical: "unknown","telephone","cellular")

10 - day: last contact day of the month (numeric)

11 - month: last contact month of year (categorical: "jan", "feb", "mar", …, "nov", "dec")

12 - duration: last contact duration, in seconds (numeric)

# other attributes:
13 - campaign: number of contacts performed during this campaign and for this client (numeric, includes last contact)

14 - pdays: number of days that passed by after the client was last contacted from a previous campaign (numeric, -1 means client was not previously contacted)

15 - previous: number of contacts performed before this campaign and for this client (numeric)

16 - poutcome: outcome of the previous marketing campaign (categorical: "unknown","other","failure","success")

Output variable (desired target):

17 - y - has the client subscribed to a term deposit? (binary: "yes","no")

Missing Attribute Values: None


In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
df=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/Bank.csv',sep = ';')
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,management,married,tertiary,no,2143,yes,no,unknown,5,may,261,1,-1,0,unknown,no
1,44,technician,single,secondary,no,29,yes,no,unknown,5,may,151,1,-1,0,unknown,no
2,33,entrepreneur,married,secondary,no,2,yes,yes,unknown,5,may,76,1,-1,0,unknown,no
3,47,blue-collar,married,unknown,no,1506,yes,no,unknown,5,may,92,1,-1,0,unknown,no
4,33,unknown,single,unknown,no,1,no,no,unknown,5,may,198,1,-1,0,unknown,no
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,technician,married,tertiary,no,825,no,no,cellular,17,nov,977,3,-1,0,unknown,yes
45207,71,retired,divorced,primary,no,1729,no,no,cellular,17,nov,456,2,-1,0,unknown,yes
45208,72,retired,married,secondary,no,5715,no,no,cellular,17,nov,1127,5,184,3,success,yes
45209,57,blue-collar,married,secondary,no,668,no,no,telephone,17,nov,508,4,-1,0,unknown,no


In [2]:
print(df.shape)
df.info()
df.isnull().sum()

(45211, 17)
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45211 entries, 0 to 45210
Data columns (total 17 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   age        45211 non-null  int64 
 1   job        45211 non-null  object
 2   marital    45211 non-null  object
 3   education  45211 non-null  object
 4   default    45211 non-null  object
 5   balance    45211 non-null  int64 
 6   housing    45211 non-null  object
 7   loan       45211 non-null  object
 8   contact    45211 non-null  object
 9   day        45211 non-null  int64 
 10  month      45211 non-null  object
 11  duration   45211 non-null  int64 
 12  campaign   45211 non-null  int64 
 13  pdays      45211 non-null  int64 
 14  previous   45211 non-null  int64 
 15  poutcome   45211 non-null  object
 16  y          45211 non-null  object
dtypes: int64(7), object(10)
memory usage: 5.9+ MB


age          0
job          0
marital      0
education    0
default      0
balance      0
housing      0
loan         0
contact      0
day          0
month        0
duration     0
campaign     0
pdays        0
previous     0
poutcome     0
y            0
dtype: int64

In [3]:
# Show the value counts of every categorical (object-typed) column
for cn in df.columns:
    if df[cn].dtype == object:
        print(df[cn].value_counts())

blue-collar      9732
management       9458
technician       7597
admin.           5171
services         4154
retired          2264
self-employed    1579
entrepreneur     1487
unemployed       1303
housemaid        1240
student           938
unknown           288
Name: job, dtype: int64
married     27214
single      12790
divorced     5207
Name: marital, dtype: int64
secondary    23202
tertiary     13301
primary       6851
unknown       1857
Name: education, dtype: int64
no     44396
yes      815
Name: default, dtype: int64
yes    25130
no     20081
Name: housing, dtype: int64
no     37967
yes     7244
Name: loan, dtype: int64
cellular     29285
unknown      13020
telephone     2906
Name: contact, dtype: int64
may    13766
jul     6895
aug     6247
jun     5341
nov     3970
apr     2932
feb     2649
jan     1403
oct      738
sep      579
mar      477
dec      214
Name: month, dtype: int64
unknown    36959
failure     4901
other       1840
success     1511
Name: poutcome, dtype: int64

In [4]:
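from sklearn import preprocessing


def Encoder(df):
    """Label-encode every object/category column of df in place and return df."""
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = preprocessing.LabelEncoder()
    for feature in columnsToEncode:
        try:
            # fit_transform refits the encoder, so each column gets its own 0..k-1 codes
            df[feature] = le.fit_transform(df[feature])
        except Exception as exc:
            print('Error encoding ' + feature + ': ' + str(exc))
    return df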


In [5]:
df=Encoder(df)
df

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome,y
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3,0
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3,0
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3,0
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3,0
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3,1
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3,1
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2,1
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3,0
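
The integer codes in the encoded table follow from how `LabelEncoder` assigns labels: it sorts the distinct values and numbers them from 0. The sketch below reproduces that mapping in plain Python for the `month` column. The helper name `label_codes` is made up for illustration, and the code mimics scikit-learn rather than calling it. It shows why `may` becomes 8 and `nov` becomes 9.

```python
# Plain-Python sketch of the mapping LabelEncoder produces:
# sorted distinct values -> 0..k-1. `label_codes` is a hypothetical helper.
def label_codes(values):
    classes = sorted(set(values))
    return {label: code for code, label in enumerate(classes)}


months = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]
month_codes = label_codes(months)
print(month_codes["may"], month_codes["nov"])  # prints: 8 9
```

Because the codes are alphabetical, they carry no calendar or rank order (`apr` is 0 and `jan` is 4, for example). Tree-based models can still split on them, but the numeric order itself should not be read as meaningful.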


In [6]:
y = df['y'] #Output
X = df.drop('y',axis=1)
X

Unnamed: 0,age,job,marital,education,default,balance,housing,loan,contact,day,month,duration,campaign,pdays,previous,poutcome
0,58,4,1,2,0,2143,1,0,2,5,8,261,1,-1,0,3
1,44,9,2,1,0,29,1,0,2,5,8,151,1,-1,0,3
2,33,2,1,1,0,2,1,1,2,5,8,76,1,-1,0,3
3,47,1,1,3,0,1506,1,0,2,5,8,92,1,-1,0,3
4,33,11,2,3,0,1,0,0,2,5,8,198,1,-1,0,3
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
45206,51,9,1,2,0,825,0,0,0,17,9,977,3,-1,0,3
45207,71,5,0,0,0,1729,0,0,0,17,9,456,2,-1,0,3
45208,72,5,1,1,0,5715,0,0,0,17,9,1127,5,184,3,2
45209,57,1,1,1,0,668,0,0,1,17,9,508,4,-1,0,3


In [7]:
from sklearn.model_selection import  train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)

 


Q1) Using Random Forest, XGBoost, LightGBM, and Gradient Boosting classifiers with default parameters (no parameters specified except random_state), calculate the accuracy on the test data. Which method gives the best accuracy on the test data?

In [8]:
from sklearn.metrics import accuracy_score
#Random Forest 
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(random_state=17 )
forest.fit(X_train, y_train)
print('Random Forest: ', end = " ")
print(accuracy_score(y_test, forest.predict(X_test)))

# XGBoost
from xgboost import XGBClassifier
model = XGBClassifier(random_state=17)
model.fit(X_train, y_train)
print('XGBoost      : ', end = " ")
print(accuracy_score(y_test, model.predict(X_test)))

#Light GBM
import lightgbm as lgb
lgb_model = lgb.LGBMClassifier(random_state=17)
lgb_model.fit(X_train, y_train)
print('Light GBM    : ', end = " ")

print(accuracy_score(y_test, lgb_model.predict(X_test)))

#Gradient Boosting
from sklearn.ensemble import GradientBoostingClassifier
gbm_model = GradientBoostingClassifier(random_state=17)
gbm_model.fit(X_train, y_train)
print('Gradient B   : ', end = " ")
print(accuracy_score(y_test, gbm_model.predict(X_test)))

Random Forest:  0.9025037600636999
XGBoost      :  0.90020348580023
Light GBM    :  0.9073697248518092
Gradient B   :  0.9018844554543042


In [9]:
'''All four models produced similar results, but LightGBM was slightly more accurate.'''

'All four models produced similar results, but LightGBM was slightly more accurate.'
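
Since the four accuracies differ by less than one percentage point, it may help to compare them with the accuracy of always predicting the most frequent class. The sketch below computes that baseline. The function name `majority_baseline_accuracy` and the toy labels are invented for illustration and are not taken from the bank data.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy obtained by always predicting the most frequent label."""
    counts = Counter(labels)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(labels)


# Toy labels (hypothetical): eight 0s and two 1s.
toy_labels = [0] * 8 + [1] * 2
baseline = majority_baseline_accuracy(toy_labels)
print(baseline)  # prints: 0.8
```

In the notebook, the same function could be called as `majority_baseline_accuracy(y_test)` to see how far each model's accuracy sits above the trivial predictor.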

In [10]:
print(str(model))

XGBClassifier(random_state=17)


Q2) Using the Optuna hyperparameter optimization technique with 100 trials:

 a) Find the best method and parameters using 3-fold cross-validation (CV=3) over the parameter ranges below. What are the best parameters for the method with the highest cross-validation accuracy?

"max_depth": range(2, 16), "max_features": range(2, 16)

 b) Evaluate the method with the highest cross-validation accuracy on the test data. What is the accuracy value?
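
The parameter ranges in the question are written with Python's `range`, which excludes its upper end, whereas Optuna's `trial.suggest_int(name, low, high)` includes both bounds. The small sketch below converts a step-1 `range` into inclusive bounds. The helper name `inclusive_bounds` is an assumption made for illustration.

```python
def inclusive_bounds(r):
    """Return (low, high) covering the same integers as a step-1 range."""
    if r.step != 1:
        raise ValueError("only step-1 ranges are handled in this sketch")
    return r.start, r.stop - 1


low, high = inclusive_bounds(range(2, 16))
print(low, high)  # prints: 2 15
```

Under this reading, the search space would be `trial.suggest_int("max_depth", low, high)`, that is 2 to 15. The recorded run in this notebook uses `X_train.shape[1]` (16) as the upper bound, so it also tries the value 16.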


In [11]:
pip install optuna



In [12]:
import optuna
import sklearn.ensemble
import sklearn.model_selection
import lightgbm as lgb
from xgboost import XGBClassifier


def objective(trial):
    x, y = X_train, y_train
    n_features = X_train.shape[1]

    classifier_name = trial.suggest_categorical(
        "classifier",
        ["Random Forest", "XGBoost", "LightGBM", "GradientBoostingClassifier"],
    )
    # suggest_int includes both bounds, so each parameter is searched over 2..n_features.
    max_depth = trial.suggest_int("max_depth", 2, n_features)
    max_features = trial.suggest_int("max_features", 2, n_features)
    params = dict(random_state=17, max_depth=max_depth, max_features=max_features)

    if classifier_name == "Random Forest":
        classifier_obj = sklearn.ensemble.RandomForestClassifier(**params)
    elif classifier_name == "XGBoost":
        # max_features is not a documented XGBoost parameter (the closest
        # analogue is colsample_bytree); it is kept to match the recorded run.
        classifier_obj = XGBClassifier(**params)
    elif classifier_name == "LightGBM":
        # max_features is not a documented LightGBM parameter either (the
        # closest analogue is colsample_bytree / feature_fraction).
        classifier_obj = lgb.LGBMClassifier(**params)
    else:
        classifier_obj = sklearn.ensemble.GradientBoostingClassifier(**params)

    # Mean accuracy over 3 cross-validation folds (the default score for classifiers)
    accuracy = sklearn.model_selection.cross_val_score(
        classifier_obj, x, y, n_jobs=-1, cv=3
    ).mean()
    return accuracy


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)

[I 2022-03-28 12:38:15,944] A new study created in memory with name: no-name-25b2cd22-7f54-4ac8-ba5d-58a542e53cb1
[I 2022-03-28 12:38:23,450] Trial 0 finished with value: 0.8997581898639169 and parameters: {'classifier': 'Random Forest', 'max_depth': 7, 'max_features': 4}. Best is trial 0 with value: 0.8997581898639169.
[I 2022-03-28 12:38:42,952] Trial 1 finished with value: 0.9049192677462871 and parameters: {'classifier': 'Random Forest', 'max_depth': 9, 'max_features': 8}. Best is trial 1 with value: 0.9049192677462871.
[I 2022-03-28 12:39:05,422] Trial 2 finished with value: 0.9026483798051328 and parameters: {'classifier': 'XGBoost', 'max_depth': 12, 'max_features': 16}. Best is trial 1 with value: 0.9049192677462871.
[I 2022-03-28 12:39:27,710] Trial 3 finished with value: 0.9020290595397314 and parameters: {'classifier': 'XGBoost', 'max_depth': 16, 'max_features': 8}. Best is trial 1 with value: 0.9049192677462871.

In [13]:
study.best_params

{'classifier': 'LightGBM', 'max_depth': 12, 'max_features': 16}

In [None]:
'''study.best_params == > {'classifier': 'LightGBM', 'max_depth': 12, 'max_features': 16} '''

In [15]:
lgb_model = lgb.LGBMClassifier(random_state=17,max_depth=12,max_features=16)
lgb_model.fit(X_train, y_train)
print(accuracy_score(y_test, lgb_model.predict(X_test)))

0.9062195877200743
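
To put the tuned and default LightGBM accuracies side by side, the reported fractions can be converted back into counts of correct test predictions. The worked arithmetic below assumes the test-set size follows from `test_size=0.25` applied to the 45,211 rows, with scikit-learn rounding the test size up. The variable names are chosen for illustration.

```python
import math

n_rows = 45211      # rows in Bank.csv, from df.shape
test_size = 0.25    # as passed to train_test_split
# Assumption: with a float test_size, scikit-learn rounds the test-set size up.
n_test = math.ceil(test_size * n_rows)

default_acc = 0.9073697248518092  # LightGBM with default parameters (Q1)
tuned_acc = 0.9062195877200743    # LightGBM with the Optuna parameters (Q2)

default_correct = round(default_acc * n_test)
tuned_correct = round(tuned_acc * n_test)
print(n_test, default_correct, tuned_correct, default_correct - tuned_correct)
# prints: 11303 10256 10243 13
```

Under these assumptions the two models differ by 13 correct predictions out of 11,303, so the gap between the default and tuned test accuracy is small.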


For Q3 and Q4, use the following data.

In [16]:
dr=pd.read_csv('https://raw.githubusercontent.com/ogut77/DataScience/main/data/diamond.csv')
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,Ideal,H,SI1,VG,EX,GIA,5169
1,0.83,Ideal,H,VS1,ID,ID,AGSL,3470
2,0.85,Ideal,H,SI1,EX,EX,GIA,3183
3,0.91,Ideal,E,SI1,VG,VG,GIA,4370
4,0.83,Ideal,G,SI1,EX,EX,GIA,3171
...,...,...,...,...,...,...,...,...
5995,1.03,Ideal,D,SI1,EX,EX,GIA,6250
5996,1.00,Very Good,D,SI1,VG,VG,GIA,5328
5997,1.02,Ideal,D,SI1,EX,EX,GIA,6157
5998,1.27,Signature-Ideal,G,VS1,EX,EX,GIA,11206


In [17]:
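from sklearn import preprocessing


def Encoder(df):
    """Label-encode every object/category column of df in place and return df."""
    columnsToEncode = list(df.select_dtypes(include=['category', 'object']))
    le = preprocessing.LabelEncoder()
    for feature in columnsToEncode:
        try:
            # fit_transform refits the encoder, so each column gets its own 0..k-1 codes
            df[feature] = le.fit_transform(df[feature])
        except Exception as exc:
            print('Error encoding ' + feature + ': ' + str(exc))
    return df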


In [18]:
dr=Encoder(dr)
dr

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report,Price
0,1.10,2,4,2,3,0,1,5169
1,0.83,2,4,3,2,2,0,3470
2,0.85,2,4,2,0,0,1,3183
3,0.91,2,1,2,3,3,1,4370
4,0.83,2,3,2,0,0,1,3171
...,...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1,6250
5996,1.00,4,0,2,3,3,1,5328
5997,1.02,2,0,2,0,0,1,6157
5998,1.27,3,3,3,0,0,1,11206


In [19]:
y = dr['Price'] #Output
X = dr.drop('Price',axis=1)
X

Unnamed: 0,Carat Weight,Cut,Color,Clarity,Polish,Symmetry,Report
0,1.10,2,4,2,3,0,1
1,0.83,2,4,3,2,2,0
2,0.85,2,4,2,0,0,1
3,0.91,2,1,2,3,3,1
4,0.83,2,3,2,0,0,1
...,...,...,...,...,...,...,...
5995,1.03,2,0,2,0,0,1
5996,1.00,4,0,2,3,3,1
5997,1.02,2,0,2,0,0,1
5998,1.27,3,3,3,0,0,1


In [20]:
from sklearn.model_selection import  train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=17)

Q3) Using Linear Regression, Decision Tree, Random Forest, XGBoost, LightGBM, and Gradient Boosting regressors with default parameters (no parameters specified except random_state), calculate the R2 statistic on the test data. Which method gives the best R2 on the test data?

In [21]:
from sklearn.metrics import r2_score

#LinearRegression 
from sklearn.linear_model import LinearRegression
lr = LinearRegression()
lr.fit(X_train, y_train)
print('Linear Reg : ', end = " ")
print(r2_score(y_test, lr.predict(X_test)))

#Decision Tree 
from sklearn.tree import DecisionTreeRegressor
dt = DecisionTreeRegressor(random_state=17 )
dt.fit(X_train, y_train)
print('Decision T : ', end = " ")
print(r2_score(y_test, dt.predict(X_test)))

#Random Forest 
from sklearn.ensemble import RandomForestRegressor
forest = RandomForestRegressor(random_state=17 )
forest.fit(X_train, y_train)
print('Random Forest: ', end = " ")
print(r2_score(y_test, forest.predict(X_test)))

# XGBoost
from xgboost import XGBRegressor
model = XGBRegressor(random_state=17)
model.fit(X_train, y_train)
print('XGBoost      : ', end = " ")
print(r2_score(y_test, model.predict(X_test)))

#Light GBM
import lightgbm as lgb
lgb_model = lgb.LGBMRegressor(random_state=17)
lgb_model.fit(X_train, y_train)
print('Light GBM    : ', end = " ")
print(r2_score(y_test, lgb_model.predict(X_test)))

#Gradient Boosting
from sklearn.ensemble import GradientBoostingRegressor
gbm_model = GradientBoostingRegressor(random_state=17)
gbm_model.fit(X_train, y_train)
print('Gradient B   : ', end = " ")
print(r2_score(y_test, gbm_model.predict(X_test)))

Linear Reg :  0.823384582969696
Decision T :  0.95923297527452
Random Forest:  0.9799660276771226
XGBoost      :  0.9727452099437852
Light GBM    :  0.9812609895578275
Gradient B   :  0.9739035655494326
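
For reference, the R2 statistic reported above is one minus the ratio of the residual sum of squares to the total sum of squares around the mean of the true values. The sketch below computes it by hand. The function name `r2` and the toy numbers are made up for illustration and are unrelated to the diamond data.

```python
def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot


# Toy values (hypothetical): SS_res = 2, SS_tot = 20, so R2 = 1 - 2/20.
toy_true = [3, 5, 7, 9]
toy_pred = [2, 5, 8, 9]
score = r2(toy_true, toy_pred)
print(score)  # prints: 0.9
```

A model that always predicts the mean of the true values scores 0 under this definition, which gives a reference point for the values between 0.82 and 0.98 listed above.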


Q4) Using the Optuna hyperparameter optimization technique (100 trials) with Random Forest, XGBoost, LightGBM, and Gradient Boosting regressors:

 a) Find the best method and parameters using 3-fold cross-validation (CV=3) over the parameter ranges below. What are the best parameters for the method with the highest cross-validation R2?

"max_depth": range(2, 7), "max_features": range(2, 7)

 b) Evaluate the method with the highest cross-validation R2 on the test data. What is the R2 value?


In [22]:
import optuna
import sklearn.ensemble
import sklearn.model_selection
import lightgbm as lgb
from xgboost import XGBRegressor


def objective(trial):
    x, y = X_train, y_train
    n_features = X_train.shape[1]  # 7 predictors in the diamond data

    regressor_name = trial.suggest_categorical(
        "regressor", ["GradientBoosting", "XGBoost", "LightGBM", "RandomForest"]
    )
    # suggest_int includes both bounds, so each parameter is searched over 2..7.
    max_depth = trial.suggest_int("max_depth", 2, n_features)
    max_features = trial.suggest_int("max_features", 2, n_features)
    params = dict(random_state=17, max_depth=max_depth, max_features=max_features)

    if regressor_name == "GradientBoosting":
        regressor_obj = sklearn.ensemble.GradientBoostingRegressor(**params)
    elif regressor_name == "XGBoost":
        # max_features is not a documented XGBoost parameter (the closest
        # analogue is colsample_bytree); it is kept to match the recorded run.
        regressor_obj = XGBRegressor(**params)
    elif regressor_name == "LightGBM":
        # max_features is not a documented LightGBM parameter either (the
        # closest analogue is colsample_bytree / feature_fraction).
        regressor_obj = lgb.LGBMRegressor(**params)
    else:
        regressor_obj = sklearn.ensemble.RandomForestRegressor(**params)

    # Mean R2 over 3 cross-validation folds (the default score for regressors)
    r2 = sklearn.model_selection.cross_val_score(
        regressor_obj, x, y, n_jobs=-1, cv=3
    ).mean()
    return r2


study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=100)
print(study.best_trial)


[I 2022-03-28 12:50:35,755] A new study created in memory with name: no-name-897f0440-d876-4e76-9e2d-e19904d3b9af
[I 2022-03-28 12:50:36,257] Trial 0 finished with value: 0.9577336844167635 and parameters: {'regressor': 'GradientBoosting', 'max_depth': 3, 'max_features': 2}. Best is trial 0 with value: 0.9577336844167635.
[I 2022-03-28 12:50:36,760] Trial 1 finished with value: 0.9577336844167635 and parameters: {'regressor': 'GradientBoosting', 'max_depth': 3, 'max_features': 2}. Best is trial 0 with value: 0.9577336844167635.
[I 2022-03-28 12:50:37,872] Trial 2 finished with value: 0.9845567951821383 and parameters: {'regressor': 'XGBoost', 'max_depth': 6, 'max_features': 5}. Best is trial 2 with value: 0.9845567951821383.
[I 2022-03-28 12:50:38,352] Trial 3 finished with value: 0.9758941121368242 and parameters: {'regressor': 'LightGBM', 'max_depth': 6, 'max_features': 2}. Best is trial 2 with value: 0.9845567951821383.

FrozenTrial(number=86, values=[0.985430200504429], datetime_start=datetime.datetime(2022, 3, 28, 12, 51, 36, 197643), datetime_complete=datetime.datetime(2022, 3, 28, 12, 51, 36, 816382), params={'regressor': 'GradientBoosting', 'max_depth': 4, 'max_features': 7}, distributions={'regressor': CategoricalDistribution(choices=('GradientBoosting', 'XGBoost', 'LightGBM', 'RandomForest')), 'max_depth': IntUniformDistribution(high=7, low=2, step=1), 'max_features': IntUniformDistribution(high=7, low=2, step=1)}, user_attrs={}, system_attrs={}, intermediate_values={}, trial_id=86, state=TrialState.COMPLETE, value=None)


In [23]:
study.best_params

{'max_depth': 4, 'max_features': 7, 'regressor': 'GradientBoosting'}

In [24]:
from sklearn.ensemble import GradientBoostingRegressor
gbm_model = GradientBoostingRegressor(random_state=17, max_depth = 4, max_features = 7)
gbm_model.fit(X_train, y_train)
print('Gradient B   : ', end = " ")
print(r2_score(y_test, gbm_model.predict(X_test)))

Gradient B   :  0.9829135298723956


In [None]:
'''Gradient B   :  0.9829135298723956'''

In [25]:
'''study.best_params == > {'max_depth': 4, 'max_features': 7, 'regressor': 'GradientBoosting'} '''

"study.best_params == > {'max_depth': 4, 'max_features': 7, 'regressor': 'GradientBoosting'} "