
# Problem Statement

Beta Bank customers are leaving. The bank has worked out that retaining existing customers is cheaper than attracting new ones.


Task: Predict whether a customer will leave the bank soon, using a model with the highest possible F1 score.

Required: an F1 score of at least 0.59 on the test dataset.
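
For reference, F1 is the harmonic mean of precision and recall for the positive class (here, customers who exited). The sketch below computes it from raw counts; the function name and the example counts are illustrative and are not taken from the notebook.

```python
def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 = 2*TP / (2*TP + FP + FN); defined here as 0.0 when the denominator is zero."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Illustrative counts: 8 churners caught, 2 false alarms, 4 churners missed.
example_f1 = f1_from_counts(tp=8, fp=2, fn=4)

# A model that never predicts the positive class has tp = 0, so its F1 is 0.
never_positive_f1 = f1_from_counts(tp=0, fp=0, fn=10)

print(round(example_f1, 3), never_positive_f1)
```

Because true negatives do not appear in the formula, a model can score high accuracy on an imbalanced dataset and still have a low F1.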


# Downloading the Data





Importing required libraries

In [321]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.utils import shuffle
from sklearn.utils import resample
from sklearn.metrics import confusion_matrix
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression

from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

In [187]:
#loading the dataset
credit_df = pd.read_csv('https://bit.ly/2XZK7Bo')


In [188]:
#viewing the dataset
credit_df

Unnamed: 0,RowNumber,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,1,15634602,Hargrave,619,France,Female,42,2.0,0.00,1,1,1,101348.88,1
1,2,15647311,Hill,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,3,15619304,Onio,502,France,Female,42,8.0,159660.80,3,1,0,113931.57,1
3,4,15701354,Boni,699,France,Female,39,1.0,0.00,2,0,0,93826.63,0
4,5,15737888,Mitchell,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.10,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9995,9996,15606229,Obijiaku,771,France,Male,39,5.0,0.00,2,1,0,96270.64,0
9996,9997,15569892,Johnstone,516,France,Male,35,10.0,57369.61,1,1,1,101699.77,0
9997,9998,15584532,Liu,709,France,Female,36,7.0,0.00,1,0,1,42085.58,1
9998,9999,15682355,Sabbatini,772,Germany,Male,42,3.0,75075.31,2,1,0,92888.52,1


In [189]:
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000 entries, 0 to 9999
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        10000 non-null  int64  
 1   CustomerId       10000 non-null  int64  
 2   Surname          10000 non-null  object 
 3   CreditScore      10000 non-null  int64  
 4   Geography        10000 non-null  object 
 5   Gender           10000 non-null  object 
 6   Age              10000 non-null  int64  
 7   Tenure           9091 non-null   float64
 8   Balance          10000 non-null  float64
 9   NumOfProducts    10000 non-null  int64  
 10  HasCrCard        10000 non-null  int64  
 11  IsActiveMember   10000 non-null  int64  
 12  EstimatedSalary  10000 non-null  float64
 13  Exited           10000 non-null  int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.1+ MB



The dataset contains 10,000 rows and 14 columns.

One of the columns, 'Tenure', contains missing values (9,091 non-null entries out of 10,000).

Three of the columns, 'Surname', 'Geography' and 'Gender', hold string values, while the rest of the columns are of type integer or float.



In [190]:
credit_df.duplicated(['CustomerId']).value_counts()

False    10000
dtype: int64


No 'CustomerId' value is duplicated, so each row of the dataset contains information about one distinct customer.

In [191]:
#examining the balance of the classes using the 'Exited' column
class_0 = credit_df[credit_df['Exited'] == 0]
class_1 = credit_df[credit_df['Exited'] == 1]

#print the shapes of the classes
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)


class 0: (7963, 14)
class 1: (2037, 14)
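
The counts printed above imply that roughly one customer in five has left. As a quick check, the sketch below turns the two counts from that output into class shares; the variable names are illustrative.

```python
class_counts = {0: 7963, 1: 2037}  # row counts per class, copied from the output above
total = sum(class_counts.values())

# Share of each class and the majority-to-minority ratio.
class_shares = {label: count / total for label, count in class_counts.items()}
imbalance_ratio = class_counts[0] / class_counts[1]

print(class_shares, round(imbalance_ratio, 2))
```

The same shares can be read directly from the data with `credit_df['Exited'].value_counts(normalize=True)`.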


In [192]:
# dropping the rows with missing values in the 'Tenure' column

credit_df.dropna(subset = ["Tenure"], inplace=True)
credit_df.info()


<class 'pandas.core.frame.DataFrame'>
Int64Index: 9091 entries, 0 to 9998
Data columns (total 14 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   RowNumber        9091 non-null   int64  
 1   CustomerId       9091 non-null   int64  
 2   Surname          9091 non-null   object 
 3   CreditScore      9091 non-null   int64  
 4   Geography        9091 non-null   object 
 5   Gender           9091 non-null   object 
 6   Age              9091 non-null   int64  
 7   Tenure           9091 non-null   float64
 8   Balance          9091 non-null   float64
 9   NumOfProducts    9091 non-null   int64  
 10  HasCrCard        9091 non-null   int64  
 11  IsActiveMember   9091 non-null   int64  
 12  EstimatedSalary  9091 non-null   float64
 13  Exited           9091 non-null   int64  
dtypes: float64(3), int64(8), object(3)
memory usage: 1.0+ MB


In [193]:
# dropping the columns that will not be useful in our model, i.e. RowNumber, Surname and CustomerId
credit_df = credit_df.drop(labels=['RowNumber', 'Surname', 'CustomerId'], axis=1)
credit_df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 9091 entries, 0 to 9998
Data columns (total 11 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   CreditScore      9091 non-null   int64  
 1   Geography        9091 non-null   object 
 2   Gender           9091 non-null   object 
 3   Age              9091 non-null   int64  
 4   Tenure           9091 non-null   float64
 5   Balance          9091 non-null   float64
 6   NumOfProducts    9091 non-null   int64  
 7   HasCrCard        9091 non-null   int64  
 8   IsActiveMember   9091 non-null   int64  
 9   EstimatedSalary  9091 non-null   float64
 10  Exited           9091 non-null   int64  
dtypes: float64(3), int64(6), object(2)
memory usage: 852.3+ KB


# Data Preparation 




### Training, Validation and Test Sets

Splitting the Dataset 

We split our dataset into training, validation and test sets: 60% of the data for the training set, 20% for the validation set and 20% for the test set.

We also set `random_state` to 12345 so that the split is reproducible.

In [269]:
# split credit_df into train, validation and test sets (0.25 of the remaining 80% gives the 20% validation set)
train_val_df, test_df = train_test_split(credit_df, test_size=0.2, random_state=12345)
train_df, val_df = train_test_split(train_val_df, test_size=0.25, random_state=12345)
print(train_df.shape, val_df.shape, test_df.shape)

(5454, 11) (1818, 11) (1819, 11)
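
The split above is purely random, so the share of exited customers can differ slightly between the three sets. One option, which the notebook does not use, is the `stratify` argument of `train_test_split`. The sketch below shows it on a hypothetical toy frame (`toy_df` and its `feature` column are made up for illustration), keeping the same 60/20/20 proportions.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy frame: 1,000 rows with a 20% positive class, loosely mirroring 'Exited'.
rng = np.random.default_rng(12345)
toy_df = pd.DataFrame({
    "feature": rng.normal(size=1000),
    "Exited": np.repeat([1, 0], [200, 800]),
})

# stratify= keeps the class ratio (almost) identical in every split.
toy_train_val, toy_test = train_test_split(
    toy_df, test_size=0.2, random_state=12345, stratify=toy_df["Exited"]
)
toy_train, toy_val = train_test_split(
    toy_train_val, test_size=0.25, random_state=12345, stratify=toy_train_val["Exited"]
)

for name, part in [("train", toy_train), ("val", toy_val), ("test", toy_test)]:
    print(name, part.shape, round(part["Exited"].mean(), 3))
```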


### Identifying Input and Target Columns

In this dataset, the target column is 'Exited'. It is kept out of the inputs, and every other column is used as an input column.

In [270]:
input_columns = list(train_df.columns)[:-1]
target_column = 'Exited'

In [271]:
input_columns

['CreditScore',
 'Geography',
 'Gender',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary']

In [272]:
target_column

'Exited'

We can now create inputs and targets for the training, validation and test sets for further processing and model training.

In [273]:
train_inputs = train_df[input_columns].copy() # this is a dataframe
train_targets = train_df[target_column].copy() # this is a series

In [274]:
val_inputs = val_df[input_columns].copy()
val_targets = val_df[target_column].copy()

In [275]:
test_inputs = test_df[input_columns].copy()
test_targets = test_df[target_column].copy()

In [201]:
train_inputs

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
3706,629,Spain,Female,44,6.0,125512.98,2,0,0,79082.76
6805,614,France,Female,35,1.0,0.00,2,1,1,3342.62
4449,666,France,Male,36,3.0,0.00,2,1,0,35156.54
598,683,Germany,Female,57,5.0,162448.69,1,0,0,9221.78
1845,737,France,Male,36,9.0,0.00,1,0,1,188670.90
...,...,...,...,...,...,...,...,...,...,...
8706,850,Spain,Female,55,7.0,0.00,1,0,0,171762.87
113,675,Spain,Male,36,9.0,106190.55,1,0,1,22994.32
4961,689,Germany,Male,45,0.0,130170.82,2,1,0,150856.38
2403,641,France,Female,26,4.0,91547.84,2,0,1,28157.34


In [276]:
train_targets

3706    0
6805    0
4449    0
598     1
1845    1
       ..
8706    1
113     0
4961    0
2403    0
208     1
Name: Exited, Length: 5454, dtype: int64

Let's also identify which of the columns are numerical and which ones are categorical. This will be useful later, as we'll need to convert the categorical data to numbers for training the models

In [277]:
numeric_cols = train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = train_inputs.select_dtypes('object').columns.tolist()

In [278]:
categorical_cols

['Geography', 'Gender']

In [279]:
numeric_cols

['CreditScore',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary']

### Scaling Numeric Features 

Scaling numeric features ensures that no particular feature has a disproportionate impact on the model. The numeric columns in our dataset have widely varying ranges, so they will be scaled to the $(0,1)$ range using `MinMaxScaler` from `sklearn.preprocessing`.

In [280]:
scaler = MinMaxScaler()

First, we `fit` the scaler to the data i.e. compute the range of values for each numeric column.

In [281]:
scaler.fit(credit_df[numeric_cols])

MinMaxScaler(copy=True, feature_range=(0, 1))
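
The cell above fits the scaler on the full `credit_df`, so the validation and test rows contribute to the min/max statistics. A common alternative, sketched here on a hypothetical toy frame rather than the notebook's data, is to fit on the training rows only and reuse those statistics for the other splits.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical toy split; the column name mirrors the notebook's 'Age' column.
toy_train = pd.DataFrame({"Age": [20.0, 30.0, 40.0]})
toy_val = pd.DataFrame({"Age": [25.0, 50.0]})

# The min and max are learned from the training rows only.
toy_scaler = MinMaxScaler().fit(toy_train[["Age"]])

toy_train_scaled = toy_scaler.transform(toy_train[["Age"]])
# Unseen rows reuse the training statistics, so values can fall outside [0, 1].
toy_val_scaled = toy_scaler.transform(toy_val[["Age"]])

print(toy_train_scaled.ravel(), toy_val_scaled.ravel())
```

With this variant, a validation or test value beyond the training range maps outside $(0,1)$, which is expected rather than an error.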

Checking the minimum and maximum values in each column.

In [282]:
print('Minimum:')
list(scaler.data_min_)

Minimum:


[350.0, 18.0, 0.0, 0.0, 1.0, 0.0, 0.0, 11.58]

In [283]:
print('Maximum:')
list(scaler.data_max_)

Maximum:


[850.0, 92.0, 10.0, 250898.09, 4.0, 1.0, 1.0, 199992.48]

Separately scaling the training, validation and test sets using the `transform` method of `scaler`.

In [285]:
train_inputs[numeric_cols] = scaler.transform(train_inputs[numeric_cols])
val_inputs[numeric_cols] = scaler.transform(val_inputs[numeric_cols])
test_inputs[numeric_cols] = scaler.transform(test_inputs[numeric_cols])

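Checking whether the values in each column lie in the $(0,1)$ range. In the output below they do not: 'CreditScore', for example, spans -0.7 to -0.698. This is the result obtained when `transform` is applied to columns that have already been scaled, so the scaling cell above should be run exactly once on freshly created inputs.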

In [286]:
train_inputs[numeric_cols].describe()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
count,5454.0,5454.0,5454.0,5454.0,5454.0,5454.0,5454.0,5454.0
mean,-0.698804,-0.2394,0.049793,1.216699e-06,-0.273438,0.704987,0.508251,-5.5e-05
std,0.00039,0.00192,0.028974,9.888135e-07,0.065469,0.45609,0.499978,1e-06
min,-0.7,-0.243243,0.0,0.0,-0.333333,0.0,0.0,-5.8e-05
25%,-0.699076,-0.240687,0.02,0.0,-0.333333,0.0,0.0,-5.7e-05
50%,-0.6988,-0.239774,0.05,1.544521e-06,-0.333333,1.0,1.0,-5.5e-05
75%,-0.698532,-0.238495,0.07,2.02777e-06,-0.222222,1.0,1.0,-5.4e-05
max,-0.698,-0.22973,0.1,3.530868e-06,0.0,1.0,1.0,-5.3e-05
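
The min-max transform is $x' = (x - x_{\min}) / (x_{\max} - x_{\min})$. As a worked check of the 'CreditScore' column, the sketch below applies that formula once and then a second time, using the bounds 350 and 850 reported by `scaler.data_min_` and `scaler.data_max_`. The helper name `min_max` is illustrative and not part of the notebook.

```python
def min_max(x: float, lo: float, hi: float) -> float:
    """Min-max scaling of a single value given the fitted lower and upper bounds."""
    return (x - lo) / (hi - lo)


credit_min, credit_max = 350.0, 850.0  # CreditScore bounds reported by the fitted scaler

# First pass: raw scores map to the ends of the [0, 1] range.
once_low = min_max(credit_min, credit_min, credit_max)
once_high = min_max(credit_max, credit_min, credit_max)

# Second pass: the already scaled values are fed through the same transform again.
twice_low = min_max(once_low, credit_min, credit_max)
twice_high = min_max(once_high, credit_min, credit_max)

print(once_low, once_high, twice_low, twice_high)
```

The second pass gives approximately -0.7 and -0.698, the `min` and `max` shown for 'CreditScore' in the table above.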


## Encoding Categorical Data

One-hot encoding converts each categorical column into a separate 0/1 column for every unique category.

In [287]:
credit_df[categorical_cols].nunique()

Geography    3
Gender       2
dtype: int64

Performing one-hot encoding using the `OneHotEncoder` class from `sklearn.preprocessing`.

In [216]:
# note: scikit-learn 1.2 renamed `sparse` to `sparse_output`, and `sparse` was removed in 1.4
encoder = OneHotEncoder(sparse=False, handle_unknown='ignore')

`Fitting` the encoder to the data i.e. identify the full list of categories across all categorical columns.

In [289]:
encoder.fit(credit_df[categorical_cols])

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=False)

In [291]:
encoder.categories_

[array(['France', 'Germany', 'Spain'], dtype=object),
 array(['Female', 'Male'], dtype=object)]

The encoder has created a list of categories for each of the categorical columns in the dataset. 

Generating the column names for each individual category using `get_feature_names`.

In [292]:
# note: scikit-learn 1.0+ provides `get_feature_names_out`; `get_feature_names` was removed in 1.2
encoded_cols = list(encoder.get_feature_names(categorical_cols))
print(encoded_cols)

['Geography_France', 'Geography_Germany', 'Geography_Spain', 'Gender_Female', 'Gender_Male']


Adding the above columns to `train_inputs`, `val_inputs` and `test_inputs` using the `transform` method of `encoder`.

In [293]:
train_inputs[encoded_cols] = encoder.transform(train_inputs[categorical_cols])
val_inputs[encoded_cols] = encoder.transform(val_inputs[categorical_cols])
test_inputs[encoded_cols] = encoder.transform(test_inputs[categorical_cols])

Verifying that these new columns have been added to our training, test and validation sets.

In [294]:
pd.set_option('display.max_columns', None)

In [295]:
test_inputs

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
862,-0.698500,France,Male,-0.239043,0.07,1.810651e-06,-0.333333,1.0,1.0,-0.000055,1.0,0.0,0.0,0.0,1.0
9727,-0.699280,France,Female,-0.238313,0.01,0.000000e+00,-0.333333,0.0,1.0,-0.000053,1.0,0.0,0.0,1.0,0.0
1717,-0.698572,Spain,Female,-0.240139,0.03,9.003116e-07,-0.333333,1.0,0.0,-0.000057,0.0,0.0,1.0,1.0,0.0
8640,-0.698480,France,Female,-0.240687,0.09,2.027990e-06,-0.333333,0.0,0.0,-0.000056,1.0,0.0,0.0,1.0,0.0
5288,-0.699072,France,Male,-0.241052,0.02,0.000000e+00,-0.222222,1.0,1.0,-0.000055,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
7697,-0.698996,Spain,Female,-0.239043,0.03,0.000000e+00,-0.222222,1.0,0.0,-0.000057,0.0,0.0,1.0,1.0,0.0
8323,-0.698552,Spain,Female,-0.241052,0.06,0.000000e+00,-0.222222,1.0,0.0,-0.000054,0.0,0.0,1.0,1.0,0.0
3900,-0.698164,France,Male,-0.239956,0.09,1.094230e-06,-0.222222,0.0,1.0,-0.000055,1.0,0.0,0.0,0.0,1.0
5474,-0.698796,France,Female,-0.240321,0.09,0.000000e+00,-0.222222,1.0,0.0,-0.000054,1.0,0.0,0.0,1.0,0.0


# Data Modelling on the imbalanced dataset 




## Training a Logistic Regression Model

Logistic regression is a commonly used technique for solving binary classification problems. 

Training a logistic regression model, using the `LogisticRegression` class from Scikit-learn.

In [309]:
model = LogisticRegression(solver='liblinear', random_state=12345)

In [310]:
model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=12345, solver='liblinear', tol=0.0001,
                   verbose=0, warm_start=False)

## Making Predictions and Evaluating the Logistic Regression Model

Using the trained model to make predictions on the training, validation and test sets

In [311]:
X_train = train_inputs[numeric_cols + encoded_cols]
X_val = val_inputs[numeric_cols + encoded_cols]
X_test = test_inputs[numeric_cols + encoded_cols]


In [312]:
train_preds = model.predict(X_train)
val_preds = model.predict(X_val)
test_preds = model.predict(X_test)

In [313]:
train_preds

array([0, 0, 0, ..., 0, 0, 0])

In [314]:
train_targets

3706    0
6805    0
4449    0
598     1
1845    1
       ..
8706    1
113     0
4961    0
2403    0
208     1
Name: Exited, Length: 5454, dtype: int64

In [315]:
model.classes_

array([0, 1])

We can test the accuracy of the model's predictions by computing the percentage of matching values in `train_preds` and `train_targets`.

This can be done using the `accuracy_score` function from `sklearn.metrics`.

In [316]:
accuracy_score(train_targets, train_preds)

0.7948294829482948

In [317]:
#accuracy of the model's predictions on the validation set
accuracy_score(val_targets, val_preds)

0.7986798679867987

In [318]:
#accuracy of the model's predictions on the test set
accuracy_score(test_targets, test_preds)

0.7971412864211105

In [320]:
# calculate the F-Score
print('F1:', f1_score(test_targets, test_preds))

F1: 0.0


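The model achieves an accuracy of 79.5% on the training set, 79.9% on the validation set and 79.7% on the test set.


The model's F1 score on the imbalanced dataset is 0.0: it predicts class 0 for every customer, so its accuracy simply equals the share of customers who stayed.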
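
As a worked check, the accuracy and F1 values above are exactly what a constant "nobody leaves" prediction produces. The sketch assumes, as the confusion matrices in this section show, that every test prediction is 0; the variable names are illustrative.

```python
n_test = 1819                       # test rows, from the split shapes printed earlier
test_accuracy = 0.7971412864211105  # accuracy_score output on the test set

# With every prediction equal to 0, the correct predictions are exactly the rows with Exited == 0.
n_neg = round(test_accuracy * n_test)
n_pos = n_test - n_neg

# No positive predictions means no true positives and no false positives.
tp, fp, fn = 0, 0, n_pos
denominator = 2 * tp + fp + fn
f1 = 2 * tp / denominator if denominator else 0.0

print(n_neg, n_pos, n_neg / n_test, f1)
```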


Using a confusion matrix to visualize the breakdown of correctly and incorrectly classified inputs:

In [323]:
#breakdown of correctly and incorrectly classified inputs in the train set
confusion_matrix(train_targets, train_preds, normalize='true')

array([[1., 0.],
       [1., 0.]])

In [324]:
#breakdown of correctly and incorrectly classified inputs in the validation set
confusion_matrix(val_targets, val_preds, normalize='true')

array([[1., 0.],
       [1., 0.]])

In [325]:
#breakdown of correctly and incorrectly classified inputs in the test set
confusion_matrix(test_targets, test_preds, normalize='true')

array([[1., 0.],
       [1., 0.]])
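
With `normalize='true'`, each row of the confusion matrix is divided by the number of samples in that true class, so rows sum to 1 and the second row reads as (false negative rate, recall). The sketch below shows this on small made-up label arrays; `y_true`, `always_zero` and `mixed` are illustrative.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

y_true = np.array([0, 0, 0, 0, 1, 1])

# A model that always predicts 0: both rows put all their mass in the first column.
always_zero = np.zeros_like(y_true)
cm_always_zero = confusion_matrix(y_true, always_zero, normalize="true")

# A model with one false positive and one false negative.
mixed = np.array([0, 0, 0, 1, 1, 0])
cm_mixed = confusion_matrix(y_true, mixed, normalize="true")

print(cm_always_zero)
print(cm_mixed)
```

The first matrix has the same form as the three matrices above, which is how an all-zero prediction shows up.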

## Training a Random Forest Model

To train a Random Forest model that predicts the 'Exited' values, we use the `RandomForestClassifier` class from Scikit-learn.

In [326]:
random_model = RandomForestClassifier(n_jobs=-1, random_state=12345)

In [327]:
random_model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

## Making Predictions and Evaluating the Random Forest 

We can now use the trained model to make predictions on the training, validation and test sets.

In [328]:
Random_X_train = train_inputs[numeric_cols + encoded_cols]
Random_X_val = val_inputs[numeric_cols + encoded_cols]
Random_X_test = test_inputs[numeric_cols + encoded_cols]


In [329]:
random_train_preds = random_model.predict(Random_X_train)
random_val_preds = random_model.predict(Random_X_val)
random_test_preds = random_model.predict(Random_X_test)

In [330]:
random_train_preds

array([0, 0, 0, ..., 0, 0, 1])

In [331]:
train_targets

3706    0
6805    0
4449    0
598     1
1845    1
       ..
8706    1
113     0
4961    0
2403    0
208     1
Name: Exited, Length: 5454, dtype: int64

In [332]:
random_model.classes_

array([0, 1])

We can test the accuracy of the model's predictions by computing the percentage of matching values in `random_train_preds` and `train_targets`.

This can be done using the `accuracy_score` function from `sklearn.metrics`.

In [333]:
accuracy_score(train_targets, random_train_preds)

1.0

In [334]:
#accuracy of the model's predictions on the validation set
accuracy_score(val_targets, random_val_preds)

0.856985698569857

In [335]:
#accuracy of the model's predictions on the test set
accuracy_score(test_targets, random_test_preds)

0.8499175371083013

In [336]:
# calculate the F-Score
print('F1:', f1_score(test_targets, random_test_preds))

F1: 0.5202108963093146


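The model achieves an accuracy of 100% on the training set, 85.7% on the validation set and 85.0% on the test set. The gap between the training and validation scores suggests that the model overfits the training data.

The model's F1 score on the imbalanced dataset is 0.52.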

Using a confusion matrix to visualize the breakdown of correctly and incorrectly classified inputs:

In [337]:
#breakdown of correctly and incorrectly classified inputs in the train set
confusion_matrix(train_targets, random_train_preds, normalize='true')

array([[1., 0.],
       [0., 1.]])

In [338]:
#breakdown of correctly and incorrectly classified inputs in the validation set
confusion_matrix(val_targets, random_val_preds, normalize='true')

array([[0.96763085, 0.03236915],
       [0.58196721, 0.41803279]])

In [339]:
#breakdown of correctly and incorrectly classified inputs in the test set
confusion_matrix(test_targets, random_test_preds, normalize='true')

array([[0.96413793, 0.03586207],
       [0.59891599, 0.40108401]])

### Findings

On the test set, the Random Forest model makes fewer errors than the logistic regression model: its false negative rate is 59.9% and its false positive rate is 3.6%.

The logistic regression model has a false negative rate of 100% and a false positive rate of 0%. This means that it predicted that no customer would churn, and so missed every customer who did.

For the imbalanced set, the Random Forest model gives better results. This is because the tree-based algorithm works by learning a hierarchy of if/else questions, which depend only on how the feature values are ordered and not on their scale.
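
The rates quoted above can be tied back to the reported F1 score. The sketch below rebuilds the Random Forest's test-set counts from the normalised confusion matrix. The class sizes (1,450 stayed, 369 exited) are inferred from the logistic regression outputs, which score 0.7971 accuracy on 1,819 test rows while predicting 0 throughout; they are not printed anywhere in the notebook.

```python
# Inferred test-set class sizes (see the note above); these are an assumption of this sketch.
n_neg, n_pos = 1450, 369

# Rates from the normalised test-set confusion matrix of the Random Forest model.
fp_rate, fn_rate = 0.03586207, 0.59891599

fp = round(fp_rate * n_neg)  # customers who stayed but were flagged as leaving
fn = round(fn_rate * n_pos)  # customers who left but were missed
tp = n_pos - fn              # customers who left and were caught

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)

print(tp, fp, fn, round(precision, 3), round(recall, 3), round(f1, 4))
```

The result agrees with the `f1_score` output of about 0.52: recall of roughly 0.40 pulls the score down even though precision is 0.74.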

# Downsampling - Dealing with Imbalanced Classes

Downsampling randomly removes observations from the majority class until both classes are the same size.

In [341]:
credit_df.head(5)

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,619,France,Female,42,2.0,0.0,1,1,1,101348.88,1
1,608,Spain,Female,41,1.0,83807.86,1,0,1,112542.58,0
2,502,France,Female,42,8.0,159660.8,3,1,0,113931.57,1
3,699,France,Female,39,1.0,0.0,2,0,0,93826.63,0
4,850,Spain,Female,43,2.0,125510.82,1,1,1,79084.1,0


In [340]:
#examining the balance of the classes using the 'Exited' column
class_0 = credit_df[credit_df['Exited'] == 0]
class_1 = credit_df[credit_df['Exited'] == 1]

#print the shapes of the classes
print('class 0:', class_0.shape)
print('class 1:', class_1.shape)


class 0: (7237, 11)
class 1: (1854, 11)


In [342]:
#display the imbalanced class counts
credit_df.Exited.value_counts()

0    7237
1    1854
Name: Exited, dtype: int64

In [343]:
# Separate majority and minority classes
df_majority = credit_df[credit_df.Exited==0]
df_minority = credit_df[credit_df.Exited==1]
 
# Downsampling the majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sampling without replacement
                                 n_samples=1854,     # to match the minority class
                                 random_state=12345) # for reproducible results
 


In [344]:
# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])
 
# Display new class counts
df_downsampled.Exited.value_counts()


1    1854
0    1854
Name: Exited, dtype: int64

The new downsampled dataframe has fewer observations than the original (3,708 rows instead of 9,091), and the ratio of the two classes is now 1:1.

In [345]:
df_downsampled.shape

(3708, 11)
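
The downsampling step above hard-codes the minority count (1854). A small helper that reads the count from the data is sketched below; `downsample_majority` and the toy frame are hypothetical and not part of the notebook. The notebook downsamples the full dataset before splitting; one alternative, offered here only as a sketch, is to apply the same step to the training split alone so that validation and test scores reflect the original class ratio.

```python
import pandas as pd
from sklearn.utils import resample


def downsample_majority(df: pd.DataFrame, target: str, random_state: int = 12345) -> pd.DataFrame:
    """Return a copy of df in which every class has as many rows as the smallest class."""
    counts = df[target].value_counts()
    minority_label = counts.idxmin()
    n_minority = int(counts.min())

    parts = []
    for label, group in df.groupby(target):
        if label == minority_label:
            parts.append(group)  # keep every minority row
        else:
            # Sample without replacement down to the minority size.
            parts.append(
                resample(group, replace=False, n_samples=n_minority, random_state=random_state)
            )
    return pd.concat(parts)


# Hypothetical toy frame: 8 rows of class 0 and 2 rows of class 1.
toy = pd.DataFrame({"x": range(10), "Exited": [0] * 8 + [1] * 2})
balanced = downsample_majority(toy, "Exited")
print(balanced["Exited"].value_counts())
```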

### Training, Validation and Test Sets

Splitting the Downsampled Dataset 

We split the downsampled dataset into training, validation and test sets: 60% of the data for the training set, 20% for the validation set and 20% for the test set.

We also set `random_state` to 12345 so that the split is reproducible.

In [347]:
# split df_downsampled into train, validation and test sets

train_val_df, downsampled_test_df = train_test_split(df_downsampled, test_size=0.2, random_state=12345)
downsampled_train_df, downsampled_val_df = train_test_split(train_val_df, test_size=0.25, random_state=12345)
print(downsampled_train_df.shape, downsampled_val_df.shape, downsampled_test_df.shape)

(2224, 11) (742, 11) (742, 11)


### Identifying Input and Target Columns

As before, the target column is 'Exited'. It is kept out of the inputs, and every other column is used as an input column.

In [348]:
input_columns = list(downsampled_train_df.columns)[:-1]
target_column = 'Exited'

In [349]:
input_columns

['CreditScore',
 'Geography',
 'Gender',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary']

In [350]:
target_column

'Exited'

Creating inputs and targets for the training, validation and test sets for further processing and model training.

In [None]:
downsampled_train_inputs = downsampled_train_df[input_columns].copy() # this is a dataframe
downsampled_train_targets = downsampled_train_df['Exited'].copy() # this is a series

In [None]:
downsampled_val_inputs = downsampled_val_df[input_columns].copy()
downsampled_val_targets = downsampled_val_df['Exited'].copy()

In [89]:
downsampled_test_inputs = downsampled_test_df[input_columns].copy()
downsampled_test_targets = downsampled_test_df['Exited'].copy()

In [90]:
downsampled_train_inputs

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
4541,617,Spain,Female,36,7.0,115617.24,1,1,1,71519.40
516,468,France,Female,56,10.0,0.00,3,0,1,62256.87
3394,466,France,Male,29,6.0,0.00,2,1,1,2797.27
3370,698,Spain,Female,47,6.0,0.00,1,1,0,50213.81
6428,627,Germany,Female,39,5.0,124586.93,1,1,0,93132.61
...,...,...,...,...,...,...,...,...,...,...
144,691,France,Female,31,5.0,40915.55,1,1,0,126213.84
5213,752,Germany,Male,29,4.0,129514.99,1,1,1,102930.46
3083,466,France,Male,40,4.0,91592.06,1,1,0,141210.18
5685,705,Spain,Female,47,3.0,63488.70,1,0,1,28640.92


In [91]:
downsampled_train_targets

4541    0
516     1
3394    0
3370    1
6428    1
       ..
144     1
5213    0
3083    1
5685    1
618     0
Name: Exited, Length: 2224, dtype: int64

Identifying which of the columns are numerical and which ones are categorical, as we'll need to convert the categorical data to numbers for training a logistic regression model.

In [92]:
numeric_cols = downsampled_train_inputs.select_dtypes(include=np.number).columns.tolist()
categorical_cols = downsampled_train_inputs.select_dtypes('object').columns.tolist()

In [93]:
categorical_cols

['Geography', 'Gender']

In [94]:
numeric_cols

['CreditScore',
 'Age',
 'Tenure',
 'Balance',
 'NumOfProducts',
 'HasCrCard',
 'IsActiveMember',
 'EstimatedSalary']

### Scaling Numeric Features for the Downsampled Dataset

Scaling numeric features ensures that no particular feature has a disproportionate impact on the model. The numeric columns in our dataset have widely varying ranges, so they will be scaled to the $(0,1)$ range.


Let's use `MinMaxScaler` from `sklearn.preprocessing` to scale values to the $(0,1)$ range.

In [96]:
d_scaler = MinMaxScaler()

First, we `fit` the scaler to the data i.e. compute the range of values for each numeric column.

In [97]:
d_scaler.fit(df_downsampled[numeric_cols])

MinMaxScaler(copy=True, feature_range=(0, 1))

We can now separately scale the training, validation and test sets using the `transform` method of `d_scaler`.

In [99]:
downsampled_train_inputs[numeric_cols] = d_scaler.transform(downsampled_train_inputs[numeric_cols])
downsampled_val_inputs[numeric_cols] = d_scaler.transform(downsampled_val_inputs[numeric_cols])
downsampled_test_inputs[numeric_cols] = d_scaler.transform(downsampled_test_inputs[numeric_cols])

We can now verify that values in each column lie in the range $(0,1)$

In [100]:
downsampled_train_inputs[numeric_cols].describe()

Unnamed: 0,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
count,2224.0,2224.0,2224.0,2224.0,2224.0,2224.0,2224.0,2224.0
mean,0.592129,0.331462,0.508993,0.32504,0.168915,0.715378,0.459982,0.496486
std,0.194775,0.153364,0.292875,0.246759,0.226885,0.451336,0.498508,0.289759
min,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.000393
25%,0.458,0.228571,0.3,0.0,0.0,0.0,0.0,0.241135
50%,0.589,0.314286,0.5,0.409354,0.0,1.0,0.0,0.504163
75%,0.724,0.428571,0.8,0.515179,0.333333,1.0,1.0,0.747615
max,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.99966


## Encoding Categorical Data

One-hot encoding converts each categorical column into a separate 0/1 column for every unique category.

In [101]:
df_downsampled[categorical_cols].nunique()

Geography    3
Gender       2
dtype: int64

We can perform one-hot encoding using the `OneHotEncoder` class from `sklearn.preprocessing`. Here we reuse the `encoder` object created earlier.

First, we `fit` the encoder to the downsampled data, i.e. identify the full list of categories across all categorical columns.

In [102]:
encoder.fit(df_downsampled[categorical_cols])

OneHotEncoder(categories='auto', drop=None, dtype=<class 'numpy.float64'>,
              handle_unknown='ignore', sparse=False)

In [103]:
encoder.categories_

[array(['France', 'Germany', 'Spain'], dtype=object),
 array(['Female', 'Male'], dtype=object)]

The encoder has created a list of categories for each of the categorical columns in the dataset. 

We can generate column names for each individual category using `get_feature_names`.

In [104]:
encoded_cols = list(encoder.get_feature_names(categorical_cols))
print(encoded_cols)

['Geography_France', 'Geography_Germany', 'Geography_Spain', 'Gender_Female', 'Gender_Male']


Adding the above columns to `downsampled_train_inputs`, `downsampled_val_inputs` and `downsampled_test_inputs` using the `transform` method of `encoder`.

In [105]:
downsampled_train_inputs[encoded_cols] = encoder.transform(downsampled_train_inputs[categorical_cols])
downsampled_val_inputs[encoded_cols] = encoder.transform(downsampled_val_inputs[categorical_cols])
downsampled_test_inputs[encoded_cols] = encoder.transform(downsampled_test_inputs[categorical_cols])


We can verify that these new columns have been added to our training, test and validation sets.

In [106]:
pd.set_option('display.max_columns', None)

In [107]:
downsampled_test_inputs

Unnamed: 0,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Geography_France,Geography_Germany,Geography_Spain,Gender_Female,Gender_Male
181,0.320,France,Male,0.671429,0.2,0.000000,0.333333,1.0,1.0,0.240423,1.0,0.0,0.0,0.0,1.0
8170,0.498,Germany,Male,0.614286,0.1,0.497166,0.000000,0.0,1.0,0.452121,0.0,1.0,0.0,0.0,1.0
7944,0.340,France,Female,0.300000,0.1,0.292920,0.000000,0.0,1.0,0.548353,1.0,0.0,0.0,1.0,0.0
4502,0.696,Spain,Male,0.385714,0.5,0.655449,0.000000,1.0,0.0,0.710155,0.0,0.0,1.0,0.0,1.0
1763,0.734,France,Male,0.285714,0.7,0.388441,0.000000,0.0,0.0,0.946304,1.0,0.0,0.0,0.0,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9898,0.478,France,Male,0.285714,0.4,0.000000,0.000000,1.0,0.0,0.477604,1.0,0.0,0.0,0.0,1.0
7815,0.466,France,Male,0.285714,0.8,0.000000,0.000000,1.0,0.0,0.239307,1.0,0.0,0.0,0.0,1.0
1248,0.678,Spain,Female,0.142857,0.3,0.000000,0.333333,1.0,1.0,0.962679,0.0,0.0,1.0,1.0,0.0
5937,0.780,Spain,Female,0.185714,0.8,0.000000,0.333333,0.0,0.0,0.433451,0.0,0.0,1.0,1.0,0.0


## Data Modelling on the downsampled dataset 




### Training a Logistic Regression Model using the downsampled dataset




In [108]:
d_model = LogisticRegression(solver='liblinear')

In [116]:
d_model.fit(downsampled_train_inputs[numeric_cols + encoded_cols], downsampled_train_targets)

LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=None, solver='liblinear', tol=0.0001, verbose=0,
                   warm_start=False)

### Making Predictions and Evaluating the Logistic Regression Model

We can now use the trained model to make predictions on the training, validation and test sets.

In [117]:
X_train = downsampled_train_inputs[numeric_cols + encoded_cols]
X_val = downsampled_val_inputs[numeric_cols + encoded_cols]
X_test = downsampled_test_inputs[numeric_cols + encoded_cols]


In [123]:
downsampled_train_preds = d_model.predict(X_train)
downsampled_val_preds = d_model.predict(X_val)
downsampled_test_preds = d_model.predict(X_test)

In [124]:
downsampled_train_preds

array([0, 1, 0, ..., 1, 1, 0])

In [125]:
downsampled_train_targets

4541    0
516     1
3394    0
3370    1
6428    1
       ..
144     1
5213    0
3083    1
5685    1
618     0
Name: Exited, Length: 2224, dtype: int64

In [126]:
d_model.classes_

array([0, 1])

We can test the accuracy of the model's predictions by computing the percentage of matching values in `downsampled_train_preds` and `downsampled_train_targets`.

This can be done using the `accuracy_score` function from `sklearn.metrics`.

In [127]:
accuracy_score(downsampled_train_targets, downsampled_train_preds)

0.7162769784172662

In [128]:
#accuracy of the model's predictions on the validation set
accuracy_score(downsampled_val_targets, downsampled_val_preds)

0.6994609164420486

In [129]:
#accuracy of the model's predictions on the test set
accuracy_score(downsampled_test_targets, downsampled_test_preds)

0.6819407008086253

In [150]:
# calculate the F-Score
print('F1:', f1_score(downsampled_test_targets, downsampled_test_preds))

F1: 0.6853333333333332


The model achieves an accuracy of 71.6% on the training set, 69.9% on the validation set and 68.2% on the test set.

The model's F1 score on the test set is 0.685.
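
For context, F1 is the harmonic mean of precision and recall for the positive class (here `Exited == 1`). The sketch below works this out by hand on made-up toy labels, not the notebook's data; `f1_by_hand` is a hypothetical helper name.

```python
def f1_by_hand(targets, preds, positive=1):
    """F1 for the positive class, computed from TP, FP and FN counts."""
    pairs = list(zip(targets, preds))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both 0
    precision = tp / (tp + fp)  # share of predicted positives that are right
    recall = tp / (tp + fn)     # share of actual positives that are found
    return 2 * precision * recall / (precision + recall)


# Toy labels for illustration only: 4 positives, 6 negatives.
toy_targets = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
toy_preds = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]

toy_f1 = f1_by_hand(toy_targets, toy_preds)
print(toy_f1)  # TP=3, FP=1, FN=1 -> precision = recall = 0.75 -> F1 = 0.75
```

Unlike accuracy, F1 ignores true negatives, which is why it is the more informative score when the positive class is the one of interest.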

We can use a confusion matrix to visualize the breakdown of correctly and incorrectly classified inputs:

In [130]:
#breakdown of correctly and incorrectly classified inputs in the train set
confusion_matrix(downsampled_train_targets, downsampled_train_preds, normalize='true')

array([[0.69654528, 0.30345472],
       [0.26539462, 0.73460538]])

In [131]:
#breakdown of correctly and incorrectly classified inputs in the validation set
confusion_matrix(downsampled_val_targets, downsampled_val_preds, normalize='true')

array([[0.67005076, 0.32994924],
       [0.26724138, 0.73275862]])

In [132]:
#breakdown of correctly and incorrectly classified inputs in the test set
confusion_matrix(downsampled_test_targets, downsampled_test_preds, normalize='true')

array([[0.64010283, 0.35989717],
       [0.27195467, 0.72804533]])
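
With `normalize='true'`, each row of the matrix is divided by the number of samples whose true label is that row's class, so every row sums to 1 and the off-diagonal entries are the false positive rate (top right) and false negative rate (bottom left). The sketch below reproduces this from raw counts. The counts are an assumption: they are inferred from the printed test-set ratios and scores, not values shown in the notebook.

```python
# Assumed raw test-set counts for the logistic regression model,
# inferred from the printed ratios (they are not shown in the notebook).
# Rows are true classes, columns are predicted classes.
counts = [
    [249, 140],  # true 0: TN, FP
    [96, 257],   # true 1: FN, TP
]

# normalize='true' divides each cell by its row total.
normalized = [[cell / sum(row) for cell in row] for row in counts]

(tn, fp), (fn, tp) = counts
accuracy = (tn + tp) / (tn + fp + fn + tp)
f1 = 2 * tp / (2 * tp + fp + fn)  # equivalent to the harmonic-mean form

for row in normalized:
    print([round(value, 4) for value in row])
print("accuracy:", round(accuracy, 4))
print("F1:", round(f1, 4))
```

Under these assumed counts, the normalized rates, the accuracy (0.6819) and the F1 score (0.6853) agree with the test-set values printed above.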


`Training a Random Forest Model using the downsampled dataset`

---




In [135]:
d_random_model = RandomForestClassifier(n_jobs=-1, random_state=12345)

In [136]:
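# Note: this call fits on train_inputs/train_targets, not on
# downsampled_train_inputs/downsampled_train_targets.
d_random_model.fit(train_inputs[numeric_cols + encoded_cols], train_targets)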

RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=None, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=-1, oob_score=False, random_state=12345,
                       verbose=0, warm_start=False)

### Making Predictions and Evaluating the Random Forest Model

We can now use the trained model to make predictions on the training, validation and test sets.

In [137]:
Random_X_train = downsampled_train_inputs[numeric_cols + encoded_cols]
Random_X_val = downsampled_val_inputs[numeric_cols + encoded_cols]
Random_X_test = downsampled_test_inputs[numeric_cols + encoded_cols]


In [138]:
d_random_train_preds = d_random_model.predict(Random_X_train)
d_random_val_preds = d_random_model.predict(Random_X_val)
d_random_test_preds = d_random_model.predict(Random_X_test)

In [139]:
d_random_train_preds

array([0, 1, 0, ..., 1, 0, 0])

In [140]:
downsampled_train_targets

4541    0
516     1
3394    0
3370    1
6428    1
       ..
144     1
5213    0
3083    1
5685    1
618     0
Name: Exited, Length: 2224, dtype: int64

In [141]:
d_random_model.classes_

array([0, 1])

We can test the accuracy of the model's predictions by computing the percentage of matching values in `d_random_train_preds` and `downsampled_train_targets`.

This can be done using the `accuracy_score` function from `sklearn.metrics`.

In [142]:
accuracy_score(downsampled_train_targets, d_random_train_preds)

0.864658273381295

In [143]:
#accuracy of the model's predictions on the validation set
accuracy_score(downsampled_val_targets, d_random_val_preds)

0.8854447439353099

In [144]:
#accuracy of the model's predictions on the test set
accuracy_score(downsampled_test_targets, d_random_test_preds)

0.8733153638814016

In [265]:
# calculate the F-Score
print('F1:', f1_score(downsampled_test_targets, d_random_test_preds))

F1: 0.8507936507936507


The model achieves an accuracy of 86.5% on the training set, 88.5% on the validation set and 87.3% on the test set.

The Random Forest model's F1 score on the test set is 0.85, an improvement over the 0.52 that the same model scored on the imbalanced dataset.

We can use a confusion matrix to visualize the breakdown of correctly and incorrectly classified inputs:

In [145]:
#breakdown of correctly and incorrectly classified inputs in the train set
confusion_matrix(downsampled_train_targets, d_random_train_preds, normalize='true')

array([[0.98225957, 0.01774043],
       [0.24457936, 0.75542064]])

In [146]:
#breakdown of correctly and incorrectly classified inputs in the validation set
confusion_matrix(downsampled_val_targets, d_random_val_preds, normalize='true')

array([[0.98730964, 0.01269036],
       [0.22988506, 0.77011494]])

In [147]:
#breakdown of correctly and incorrectly classified inputs in the test set
confusion_matrix(downsampled_test_targets, d_random_test_preds, normalize='true')

array([[0.97686375, 0.02313625],
       [0.2407932 , 0.7592068 ]])

### Findings

The Random Forest model makes fewer prediction errors than the Logistic Regression model on the downsampled test set: its false negative rate is 24% and its false positive rate is 2.3%. This is an improvement over the imbalanced dataset, where the false negative rate was 59% and the false positive rate was 3.5%.

The Logistic Regression model's false negative rate improved on the downsampled dataset, falling from 78% on the imbalanced dataset to 27%. Its false positive rate, however, rose from 2.4% to 36%.

For the downsampled dataset, the Random Forest model gives better results and is therefore the better model for predicting the `Exited` value.
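
The error rates quoted in these findings can be read directly off the two test-set confusion matrices. The sketch below copies the printed row-normalized matrices and extracts the off-diagonal entries; `error_rates` is a hypothetical helper name, and ranking by the unweighted sum of the two rates is an illustrative choice, not something the notebook does.

```python
# Row-normalized test-set confusion matrices copied from the outputs above.
# Rows are true classes (0, 1); columns are predicted classes (0, 1).
test_matrices = {
    "logistic regression": [[0.64010283, 0.35989717],
                            [0.27195467, 0.72804533]],
    "random forest": [[0.97686375, 0.02313625],
                      [0.2407932, 0.7592068]],
}


def error_rates(matrix):
    """Return (false_negative_rate, false_positive_rate)."""
    return matrix[1][0], matrix[0][1]


rates = {name: error_rates(m) for name, m in test_matrices.items()}
for name, (fnr, fpr) in rates.items():
    print(f"{name}: FN rate {fnr:.1%}, FP rate {fpr:.1%}")

# Model with the smallest combined error rate (simple unweighted sum).
best = min(rates, key=lambda name: sum(rates[name]))
print("lowest combined error:", best)
```

If false negatives (missed churners) cost more than false positives, the two rates could be weighted differently before comparing.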

