# **Problem Statement**


Beta Bank customers are leaving: little by little, chipping away every month. The bankers
figured out it’s cheaper to save the existing customers rather than to attract new ones.

We need to predict whether a customer will leave the bank soon. You have the data on
clients’ past behavior and termination of contracts with the bank.

Build a model with the maximum possible F1 score. To pass the project, you need an F1
score of at least 0.59. Check the F1 for the test set.

Additionally, measure the AUC-ROC metric and compare it with the F1.

1. Download and prepare the data. Explain the procedure.
2. Examine the balance of classes. Train the model without taking into account the
   imbalance. Briefly describe your findings.
3. Improve the quality of the model. Make sure you use at least two approaches to
   fixing class imbalance. Use the training set to pick the best parameters. Train
   different models on training and validation sets. Find the best one. Briefly
   describe your findings.
4. Perform the final testing.

**Data description**

- Dataset URL (CSV File): https://bit.ly/2XZK7Bo
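Since the pass criterion is stated in terms of F1, it may help to recall how that score is built from precision and recall. The sketch below is only an illustration: the helper name `f1_from_counts` and the confusion-matrix counts are hypothetical and do not come from the bank data.

```python
def f1_from_counts(tp, fp, fn):
    """Return (precision, recall, f1) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # F1 is the harmonic mean of precision and recall
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts, chosen only for illustration
precision, recall, f1 = f1_from_counts(tp=60, fp=40, fn=20)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```

Because F1 ignores true negatives, it is not inflated by the large share of customers who stay, which is presumably why the task prefers it to accuracy.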

**1. Data Importation**

In [None]:
# importing the relevant libraries (only those used in this notebook)
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier
from sklearn.utils import shuffle

In [None]:
#load the data
df_bank = pd.read_csv('https://bit.ly/2XZK7Bo')
df_bank.head()


,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


**2. Data Exploration & Data Preparation.**

In [None]:
df_bank.shape

(10000, 14)

In [None]:
# look for duplicates

df_bank.duplicated().sum()

0

In [None]:
# look for missing records
df_bank.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Columns: 2945 entries, RowNumber to Gender_Male
dtypes: float64(3), int64(8), uint8(2934)
memory usage: 28.8 MB


In [None]:
df_bank.describe().T

,count,mean,std,min,25%,50%,75%,max
RowNumber,10000.0,5.000500e+03,2886.895680,1.0,2500.75,5000.5,7500.25,10000.0
CustomerId,10000.0,1.569094e+07,71936.186123,15565701.0,15628528.25,15690738.0,15753233.75,15815690.0
CreditScore,10000.0,6.505288e+02,96.653299,350.0,584.00,652.0,718.00,850.0
Age,10000.0,3.892180e+01,10.487806,18.0,32.00,37.0,44.00,92.0
Tenure,9091.0,4.997690e+00,2.894723,0.0,2.00,5.0,7.00,10.0
...,...,...,...,...,...,...,...,...
Surname_Zuyev,10000.0,2.000000e-04,0.014141,0.0,0.00,0.0,0.00,1.0
Surname_Zuyeva,10000.0,2.000000e-04,0.014141,0.0,0.00,0.0,0.00,1.0
Geography_Germany,10000.0,2.509000e-01,0.433553,0.0,0.00,0.0,1.00,1.0
Geography_Spain,10000.0,2.477000e-01,0.431698,0.0,0.00,0.0,0.00,1.0


From the above, we can see that our target is `Exited`. The `Tenure` feature has missing values (9,091 non-null out of 10,000), so I will impute them with the mean.
Columns such as `RowNumber`, `CustomerId` and `Surname` are not useful features, so I will remove them.

In [None]:
# Drop identifier columns that carry no predictive signal
df_bank = df_bank.drop(['RowNumber', 'CustomerId', 'Surname'], axis=1)
df_bank.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           9091 non-null   float64
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 859.5+ KB


In [None]:
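# Fill missing Tenure values with the column mean
df_bank['Tenure'] = df_bank['Tenure'].fillna(df_bank['Tenure'].mean())
# astype(int) truncates, so the imputed mean of about 4.998 is stored as 4
df_bank['Tenure'] = df_bank['Tenure'].astype(int)
df_bank.info()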

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      10000 non-null  int64  
 1   Geography        10000 non-null  object 
 2   Gender           10000 non-null  object 
 3   Age              10000 non-null  int64  
 4   Tenure           10000 non-null  int64  
 5   Balance          10000 non-null  float64
 6   NumOfProducts    10000 non-null  int64  
 7   HasCrCard        10000 non-null  int64  
 8   IsActiveMember   10000 non-null  int64  
 9   EstimatedSalary  10000 non-null  float64
 10  Exited           10000 non-null  int64  
dtypes: float64(2), int64(7), object(2)
memory usage: 859.5+ KB


The data is now in a usable state: I have dropped the irrelevant columns and filled in the missing values.
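One detail of the imputation step is worth illustrating: casting to `int` after filling with the mean truncates the filled value rather than rounding it. The sketch below uses a made-up toy column (not the bank data) to compare truncation, rounding, and median imputation; which option is preferable here is a judgment call, not something the notebook settles.

```python
import numpy as np
import pandas as pd

# Toy column standing in for Tenure; the values are made up for illustration
tenure = pd.Series([2.0, 1.0, 8.0, np.nan, 8.0, np.nan])

mean_value = tenure.mean()                  # mean of the four known values
filled = tenure.fillna(mean_value)

truncated = filled.astype(int)              # drops the fractional part
rounded = filled.round().astype(int)        # rounds to the nearest integer first
median_filled = tenure.fillna(tenure.median()).astype(int)

print(pd.DataFrame({
    "truncated": truncated,
    "rounded": rounded,
    "median": median_filled,
}))
```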

**3. Data Modeling**

In [None]:
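# One-hot encode the categorical columns (Geography, Gender);
# drop_first=True drops one dummy per column to avoid redundant features
df_bank = pd.get_dummies(df_bank, drop_first=True)
df_bank.head()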


,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited,Geography_Germany,Geography_Spain,Gender_Male
0,619,42,2,0.0,1,1,1,101348.88,1,0,0,0
1,608,41,1,83807.86,1,0,1,112542.58,0,0,1,0
2,502,42,8,159660.8,3,1,0,113931.57,1,0,0,0
3,699,39,1,0.0,2,0,0,93826.63,0,0,0,0
4,850,43,2,125510.82,1,1,1,79084.1,0,0,1,0


In [None]:
#creating a features dataframe and a target dataframe
features = df_bank.drop(columns=['Exited'])
target =  df_bank['Exited']

print(features.shape)
print(target.shape)

(10000, 11)
(10000,)


In [None]:
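# Split the dataset into a training set, a validation set and a test set.
# First hold out 20% of the data for testing, then 20% of the remainder for validation,
# which gives a 64% / 16% / 20% train / validation / test split.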
features_train, features_test, target_train, target_test = train_test_split(features, target, test_size=0.20, random_state=12345)
features_train, features_valid, target_train, target_valid = train_test_split(features_train, target_train, test_size=0.2, random_state=12345 )

#Let's take a look at the split:
print(len(features_train))
print(len(target_train))
print(len(features_test))
print(len(target_test))
print(len(features_valid))
print(len(target_valid))

6400
6400
2000
2000
1600
1600
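With an imbalanced target, a plain random split can leave slightly different class shares in each subset. One optional variation, not used in this notebook, is to pass `stratify` to `train_test_split` so that every subset keeps the same share of positives. The sketch below uses synthetic stand-in data (the columns `x1`, `x2` and the 80/20 labels are assumptions for illustration).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in data: 1000 rows, 20% positives (not the bank dataset)
rng = np.random.default_rng(12345)
X = pd.DataFrame({"x1": rng.normal(size=1000), "x2": rng.normal(size=1000)})
y = pd.Series([1] * 200 + [0] * 800, name="Exited")

# 20% for test, then 20% of the remainder for validation (64/16/20 overall)
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.20, random_state=12345, stratify=y)
X_train, X_valid, y_train, y_valid = train_test_split(
    X_rest, y_rest, test_size=0.20, random_state=12345, stratify=y_rest)

# Each subset should show the same positive share as the full data
for name, part in [("train", y_train), ("valid", y_valid), ("test", y_test)]:
    print(name, len(part), round(part.mean(), 3))
```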


In [None]:
to_normalize = ['CreditScore', 'Age', 'Balance', 'EstimatedSalary', 'Tenure', 'NumOfProducts']
# Work on copies so the assignments below do not trigger SettingWithCopyWarning
features_train, features_valid, features_test = (
    part.copy() for part in (features_train, features_valid, features_test))
scaler = StandardScaler()
# Fit on the training set only, so no information leaks from validation or test data
scaler.fit(features_train[to_normalize])
for part in (features_train, features_valid, features_test):
    part[to_normalize] = scaler.transform(part[to_normalize])

**Examine the balance of classes. Train the model without taking into account the imbalance. Briefly describe your findings**

In [None]:
#First, let's look at the class imbalance.
print(df_bank[df_bank['Exited'] == 1]['Exited'].count())
print(df_bank[df_bank['Exited'] == 0]['Exited'].count())

2037
7963


The classes are imbalanced: 2,037 customers exited (about 20%) and 7,963 stayed (about 80%).
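To put the imbalance in perspective, the sketch below rebuilds a target with the counts printed above and computes the class shares. It also shows why accuracy alone is a weak yardstick here: a hypothetical model that always predicts "stayed" would already be right for the majority share of customers while catching no churners at all.

```python
import pandas as pd
from sklearn.metrics import f1_score

# Class counts reported above: 2037 exited, 7963 stayed
target = pd.Series([1] * 2037 + [0] * 7963, name="Exited")

shares = target.value_counts(normalize=True)
print(shares)

# Hypothetical constant model that never predicts an exit
always_stay = pd.Series(0, index=target.index)
majority_baseline_accuracy = (always_stay == target).mean()
majority_baseline_f1 = f1_score(target, always_stay, zero_division=0)

print("baseline accuracy:", majority_baseline_accuracy)
print("baseline F1:", majority_baseline_f1)
```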

In [None]:
# Train the model without taking into account the imbalance

**Logistic Regression**

In [None]:
LrModel = LogisticRegression(solver='liblinear', random_state=12345)
LrModel.fit(features_train,target_train)
print('Accuracy', LrModel.score(features_valid, target_valid))
print('f1 score:' ,f1_score(target_valid, LrModel.predict(features_valid)))
print('AUC:', roc_auc_score(target_valid, LrModel.predict_proba(features_valid)[:,1]))



Accuracy 0.8175
f1 score: 0.3145539906103286
AUC: 0.7634576873261729


In [52]:
LrModelBal = LogisticRegression(solver='liblinear', random_state=12345, class_weight='balanced')
LrModelBal.fit(features_train,target_train)
print('Accuracy', LrModelBal.score(features_valid, target_valid))
print('f1 score:' ,f1_score(target_valid, LrModelBal.predict(features_valid)))
# AUC must come from the balanced model's probabilities, not from LrModel
print('AUC:', roc_auc_score(target_valid, LrModelBal.predict_proba(features_valid)[:,1]))

Accuracy 0.70875
f1 score: 0.46924829157175396


With balanced class weights, accuracy drops from 0.82 to 0.71, but the F1 score improves from 0.31 to 0.47.
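For reference, scikit-learn documents the `'balanced'` mode as weighting each class by `n_samples / (n_classes * np.bincount(y))`. The sketch below applies that formula to the full-dataset class counts shown earlier; the training split will have slightly different counts, so the exact weights used by the model are an approximation here.

```python
import numpy as np

# Class counts from the full dataset (the training split will differ slightly)
y = np.array([0] * 7963 + [1] * 2037)

n_samples = len(y)
counts = np.bincount(y)                      # rows per class
n_classes = len(counts)

# Formula documented for class_weight='balanced'
weights = n_samples / (n_classes * counts)
print(dict(enumerate(weights.round(3))))

# After weighting, both classes contribute the same total weight
print(weights * counts)
```

Each exited customer therefore counts roughly four times as much as a retained one in the loss, which pushes the model toward predicting more exits.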

**Improve the quality of the model. Make sure you use at least two approaches to fixing class imbalance. Use the training set to pick the best parameters. Train different models on training and validation sets. Find the best one. Briefly describe your findings**

In [53]:
def upsample(features, target, repeat):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_upsampled = pd.concat([features_zeros] + [features_ones] * repeat)
    target_upsampled = pd.concat([target_zeros] + [target_ones] * repeat)
    
    features_upsampled, target_upsampled = shuffle(
        features_upsampled, target_upsampled, random_state=12345)
    
    return features_upsampled, target_upsampled

features_upsampled, target_upsampled = upsample(features_train, target_train, 10)

upsampled_LogReg_model =LogisticRegression(random_state=12345,solver='liblinear')
upsampled_LogReg_model.fit(features_upsampled, target_upsampled)
upsampled_LogReg_predicted_valid = upsampled_LogReg_model.predict(features_valid)


print('Accuracy', upsampled_LogReg_model.score(features_valid, target_valid))
print('f1 score:' ,f1_score(target_valid, upsampled_LogReg_predicted_valid))
print('AUC:',roc_auc_score(target_valid, upsampled_LogReg_model.predict_proba(features_valid)[:,1]))

Accuracy 0.454375
f1 score: 0.39332870048644897
AUC: 0.7678598237618673


In [54]:
#Downsampling function 
def downsample(features, target, fraction):
    features_zeros = features[target == 0]
    features_ones = features[target == 1]
    target_zeros = target[target == 0]
    target_ones = target[target == 1]

    features_downsampled = pd.concat(
        [features_zeros.sample(frac=fraction, random_state=12345)] + [features_ones])
    target_downsampled = pd.concat(
        [target_zeros.sample(frac=fraction, random_state=12345)] + [target_ones])
    
    features_downsampled, target_downsampled = shuffle(
        features_downsampled, target_downsampled, random_state=12345)
    
    return features_downsampled, target_downsampled

features_downsampled, target_downsampled = downsample(features_train, target_train, 0.1)

down_LogReg_model =LogisticRegression(random_state=12345,solver='liblinear')
down_LogReg_model.fit(features_downsampled, target_downsampled)
down_LogReg_predicted_valid = down_LogReg_model.predict(features_valid)

print("F1:", f1_score(target_valid, down_LogReg_predicted_valid))
print('Accuracy:', down_LogReg_model.score(features_valid, target_valid))
print("AUC-ROC:", roc_auc_score(target_valid, down_LogReg_model.predict_proba(features_valid)[:,1]))

F1: 0.39145416953824946
Accuracy: 0.448125
AUC-ROC: 0.7647757836693462
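For logistic regression, neither upsampling (F1 0.39) nor downsampling (F1 0.39) beats class weighting (F1 0.47), and both lower accuracy to about 0.45. A likely reason is that `repeat=10` and `fraction=0.1` overshoot: they make the positive class the majority. The sketch below estimates settings that would roughly equalize the classes. The helper names are hypothetical, and the full-dataset counts are used as a stand-in for the training split.

```python
def balancing_repeat(n_negative, n_positive):
    """Number of copies of the positive class that roughly matches the negatives."""
    return max(1, round(n_negative / n_positive))


def balancing_fraction(n_negative, n_positive):
    """Fraction of negatives to keep so both classes have about the same size."""
    return min(1.0, n_positive / n_negative)


# Full-dataset counts from above, used as a stand-in for the training split
n_neg, n_pos = 7963, 2037

repeat = balancing_repeat(n_neg, n_pos)
fraction = balancing_fraction(n_neg, n_pos)
print("repeat:", repeat, "fraction:", round(fraction, 3))

# Share of positives after resampling: settings used above vs. balancing settings
share_up_10 = (n_pos * 10) / (n_pos * 10 + n_neg)
share_up_bal = (n_pos * repeat) / (n_pos * repeat + n_neg)
share_down_01 = n_pos / (n_pos + n_neg * 0.1)
share_down_bal = n_pos / (n_pos + n_neg * fraction)

print(round(share_up_10, 3), round(share_up_bal, 3))
print(round(share_down_01, 3), round(share_down_bal, 3))
```

Whether the balancing values actually raise the validation F1 would need to be checked by rerunning the cells above with them.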


**Decision trees**

In [55]:
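depth_param = {'max_depth':range(1,25)}
DecTreeModel = DecisionTreeClassifier(random_state=12345)
# With no `scoring` argument, GridSearchCV ranks candidates by accuracy (the classifier's default score)
DecTreeModelOpt = GridSearchCV(DecTreeModel,depth_param)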
DecTreeModelOpt.fit(features_train, target_train)
print(DecTreeModelOpt.best_estimator_)
DecTreeModelOpt_predicted_valid = DecTreeModelOpt.predict(features_valid)
print("F1:", f1_score(target_valid, DecTreeModelOpt_predicted_valid))
print('Accuracy:', DecTreeModelOpt.score(features_valid, target_valid))
print("AUC-ROC:", roc_auc_score(target_valid, DecTreeModelOpt.predict_proba(features_valid)[:,1]))


DecisionTreeClassifier(max_depth=6, random_state=12345)
F1: 0.5176991150442477
Accuracy: 0.86375
AUC-ROC: 0.8132692606191999
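Because the project target is F1, one possible refinement is to let the grid search optimize F1 directly by passing `scoring='f1'`. The sketch below shows the idea on synthetic data generated with `make_classification` (an assumption for illustration, not the bank features), so its numbers say nothing about the results in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data (about 80/20), standing in for the bank features
X, y = make_classification(n_samples=1000, n_features=8, weights=[0.8, 0.2],
                           random_state=12345)
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=12345, stratify=y)

search = GridSearchCV(
    DecisionTreeClassifier(random_state=12345),
    param_grid={"max_depth": range(1, 11)},
    scoring="f1",   # rank candidates by cross-validated F1 instead of accuracy
    cv=5,
)
search.fit(X_train, y_train)

print(search.best_params_)
print("CV F1:", search.best_score_)
print("Validation F1:", f1_score(y_valid, search.predict(X_valid)))
```

With `scoring='f1'`, `search.score(X_valid, y_valid)` also returns F1 rather than accuracy, so the labels on any printed metrics should be adjusted accordingly.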


**Random Forests**

In [56]:
depth_param = {'max_depth':range(1,10), 'n_estimators':range(1,50)}
RandForestMod = RandomForestClassifier(random_state=12345)
RandForestOpt = GridSearchCV(RandForestMod,depth_param)
RandForestOpt.fit(features_train, target_train)
print(RandForestOpt.best_estimator_)
RandForestOpt_predicted_valid = RandForestOpt.predict(features_valid)
print("F1:", f1_score(target_valid, RandForestOpt_predicted_valid))
print('Accuracy', RandForestOpt.score(features_valid, target_valid))
print("AUC-ROC:", roc_auc_score(target_valid, RandForestOpt.predict_proba(features_valid)[:,1]))

RandomForestClassifier(max_depth=8, n_estimators=40, random_state=12345)
F1: 0.5431578947368422
Accuracy 0.864375
AUC-ROC: 0.8501874088719589


The random forest gives an F1 score of 0.54, but we need at least 0.59. One option is to keep one parameter constant and widen the range of the other to see whether that improves the score. For the model above, a `max_depth` of 8 gave the best result, so let's keep it fixed, widen the range of `n_estimators`, and try again. Most importantly, let's add the argument `class_weight='balanced'`, since enlarging the parameter space alone is unlikely to raise the F1 score by much.

In [57]:
depth_param = {'n_estimators':range(1,150)}
RandForestMod = RandomForestClassifier(random_state=12345, max_depth = 8,class_weight='balanced')
RandForestOpt = GridSearchCV(RandForestMod, depth_param)
RandForestOpt.fit(features_train, target_train)
print(RandForestOpt.best_estimator_)
RandForestOpt_predicted_valid = RandForestOpt.predict(features_valid)
print("F1:", f1_score(target_valid, RandForestOpt_predicted_valid))
print('Accuracy:', RandForestOpt.score(features_valid, target_valid))
print("AUC-ROC:", roc_auc_score(target_valid, RandForestOpt.predict_proba(features_valid)[:,1]))

RandomForestClassifier(class_weight='balanced', max_depth=8, n_estimators=126,
                       random_state=12345)
F1: 0.5979680696661829
Accuracy: 0.826875
AUC-ROC: 0.8602385296355386


**Perform the final testing**

In [58]:
RandForestOpt_predicted_test = RandForestOpt.predict(features_test)
print("F1:", f1_score(target_test, RandForestOpt_predicted_test))
print("AUC-ROC:", roc_auc_score(target_test, RandForestOpt.predict_proba(features_test)[:,1]))
# Report accuracy on the test set, consistent with the other two metrics
print('Accuracy:', RandForestOpt.score(features_test, target_test))


F1: 0.6452991452991452
AUC-ROC: 0.8673904337093606


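On the test set, the F1 score is 0.645 and the AUC-ROC is 0.867. The F1 score clears the required threshold of 0.59, so the model meets the expectations of the assignment. The two metrics are not directly comparable: AUC-ROC summarizes how well the model ranks customers across all thresholds, while F1 reflects precision and recall at the single threshold used by `predict`.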
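The difference between the two metrics can be seen on a tiny example. In the sketch below, the labels and predicted probabilities are made up for illustration: the AUC-ROC is a single number computed from the ranking of the scores, whereas the F1 score changes as the decision threshold moves.

```python
import numpy as np
from sklearn.metrics import f1_score, roc_auc_score

# Made-up labels and predicted probabilities, for illustration only
y_true = np.array([0, 0, 0, 0, 0, 0, 1, 0, 1, 1])
y_prob = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.40, 0.45, 0.55, 0.60, 0.90])

# AUC-ROC depends only on how the scores rank positives against negatives
auc = roc_auc_score(y_true, y_prob)

# F1 depends on which threshold turns probabilities into class labels
f1_by_threshold = {
    t: f1_score(y_true, (y_prob >= t).astype(int))
    for t in (0.3, 0.5, 0.7)
}

print("AUC-ROC:", round(auc, 3))
for t, score in f1_by_threshold.items():
    print("threshold", t, "F1:", round(score, 3))
```

A high AUC-ROC therefore suggests there may be a threshold with a better F1 than the default, although tuning it would have to be done on the validation set rather than the test set.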