# Binary Classification with a Bank Churn Dataset
(Playground Series - Season 4, Episode 1)

https://www.kaggle.com/competitions/playground-series-s4e1/data

![Image](./data/picture.png)

__About the Tabular Playground Series__
The goal of the Tabular Playground Series is to provide the Kaggle community with a variety of fairly light-weight challenges that can be used to learn and sharpen skills in different aspects of machine learning and data science. The duration of each competition will generally only last a few weeks, and may have longer or shorter durations depending on the challenge. The challenges will generally use fairly light-weight datasets that are synthetically generated from real-world data, and will provide an opportunity to quickly iterate through various model and feature engineering ideas, create visualizations, etc.

__Synthetically-Generated Datasets__
Using synthetic data for Playground competitions allows us to strike a balance between having real-world data (with named features) and ensuring test labels are not publicly available. This allows us to host competitions with more interesting datasets than in the past. While there are still challenges with synthetic data generation, the state-of-the-art is much better now than when we started the Tabular Playground Series two years ago, and the goal is to produce datasets that have far fewer artifacts. Please feel free to give us feedback on the datasets for the different competitions so that we can continue to improve!

__Dataset Description__
The dataset for this competition (both train and test) was generated from a deep learning model trained on the Bank Customer Churn Prediction dataset. Feature distributions are close to, but not exactly the same as, the original. Feel free to use the original dataset as part of this competition, both to explore differences and to see whether incorporating the original in training improves model performance.

Files
* train.csv - the training dataset; Exited is the binary target
* test.csv - the test dataset; your objective is to predict the probability of Exited
* sample_submission.csv - a sample submission file in the correct format

Models

1. K-Nearest Neighbor Model             x
2. Gaussian Naive Bayes Model           x
3. Logistic Regressor                   x
4. Support Vector Classification Model  x
5. Decision Tree Model                  x
6. Random Forest Model                  x
7. Linear Discriminant Analysis Model   x
8. Gradient Boosting Classifier Model   x
9. Neural Network Classifier Model      x

In [1]:
%pip install scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [45]:
# Import all libraries

import time
import numpy as np
import pandas as pd
import pickle as pkl
import datetime as dt
import warnings as wn
import matplotlib.pyplot as plt


from sklearn.pipeline import Pipeline
from sklearn.metrics import accuracy_score
from sklearn.preprocessing import LabelEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import QuantileTransformer

In [3]:
# Ignore all warnings
wn.filterwarnings('ignore')

In [4]:
# Set all file paths

_plots = './plots/'
_test = './data/test.csv'
_train = './data/train.csv'
_model = './model/model.pkl'
_info = './model/model.docx'
_submission = './data/submission.csv'


In [5]:
# Read the datasets
test = pd.read_csv(_test)
train = pd.read_csv(_train)

In [6]:
# Display the first n rows of training
train.head(n=10)

Unnamed: 0,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,15674932,Okwudilichukwu,668,France,Male,33.0,3,0.0,2,1.0,0.0,181449.97,0
1,1,15749177,Okwudiliolisa,627,France,Male,33.0,1,0.0,2,1.0,1.0,49503.5,0
2,2,15694510,Hsueh,678,France,Male,40.0,10,0.0,2,1.0,0.0,184866.69,0
3,3,15741417,Kao,581,France,Male,34.0,2,148882.54,1,1.0,1.0,84560.88,0
4,4,15766172,Chiemenam,716,Spain,Male,33.0,5,0.0,2,1.0,1.0,15068.83,0
5,5,15771669,Genovese,588,Germany,Male,36.0,4,131778.58,1,1.0,0.0,136024.31,1
6,6,15692819,Ch'ang,593,France,Female,30.0,8,144772.69,1,1.0,0.0,29792.11,0
7,7,15669611,Chukwuebuka,678,Spain,Male,37.0,1,138476.41,1,1.0,0.0,106851.6,0
8,8,15691707,Manna,676,France,Male,43.0,4,0.0,2,1.0,0.0,142917.13,0
9,9,15591721,Cattaneo,583,Germany,Male,40.0,4,81274.33,1,1.0,1.0,170843.07,0


In [7]:
# Describe the data
train.describe()

Unnamed: 0,id,CustomerId,CreditScore,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
count,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0
mean,82516.5,15692010.0,656.454373,38.125888,5.020353,55478.086689,1.554455,0.753954,0.49777,112574.822734,0.211599
std,47641.3565,71397.82,80.10334,8.867205,2.806159,62817.663278,0.547154,0.430707,0.499997,50292.865585,0.408443
min,0.0,15565700.0,350.0,18.0,0.0,0.0,1.0,0.0,0.0,11.58,0.0
25%,41258.25,15633140.0,597.0,32.0,3.0,0.0,1.0,1.0,0.0,74637.57,0.0
50%,82516.5,15690170.0,659.0,37.0,5.0,0.0,2.0,1.0,0.0,117948.0,0.0
75%,123774.75,15756820.0,710.0,42.0,7.0,119939.5175,2.0,1.0,1.0,155152.4675,0.0
max,165033.0,15815690.0,850.0,92.0,10.0,250898.09,4.0,1.0,1.0,199992.48,1.0


In [8]:
# Display more info about the data
train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 165034 entries, 0 to 165033
Data columns (total 14 columns):
 #   Column           Non-Null Count   Dtype  
---  ------           --------------   -----  
 0   id               165034 non-null  int64  
 1   CustomerId       165034 non-null  int64  
 2   Surname          165034 non-null  object 
 3   CreditScore      165034 non-null  int64  
 4   Geography        165034 non-null  object 
 5   Gender           165034 non-null  object 
 6   Age              165034 non-null  float64
 7   Tenure           165034 non-null  int64  
 8   Balance          165034 non-null  float64
 9   NumOfProducts    165034 non-null  int64  
 10  HasCrCard        165034 non-null  float64
 11  IsActiveMember   165034 non-null  float64
 12  EstimatedSalary  165034 non-null  float64
 13  Exited           165034 non-null  int64  
dtypes: float64(5), int64(6), object(3)
memory usage: 17.6+ MB


In [9]:
# Set the columns names
names = train.columns
names

Index(['id', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender',
       'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [10]:
# Check that every column contains at least one non-null value
print(train.notna().any())

id                 True
CustomerId         True
Surname            True
CreditScore        True
Geography          True
Gender             True
Age                True
Tenure             True
Balance            True
NumOfProducts      True
HasCrCard          True
IsActiveMember     True
EstimatedSalary    True
Exited             True
dtype: bool
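
Note that `notna().any()` only confirms that each column holds at least one value; it does not show whether some rows are missing. A minimal sketch of a per-column missing-value count is given below. The `toy` frame and its values are hypothetical stand-ins for `train`, used only so the example runs on its own.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for train.csv
toy = pd.DataFrame({
    "CreditScore": [668, 627, np.nan, 581],
    "Geography": ["France", None, "Spain", "Germany"],
    "Exited": [0, 0, 1, 0],
})

# True as soon as a column has a single non-null value,
# so gaps in the column stay hidden
has_any_value = toy.notna().any()

# Number of missing entries per column
missing_counts = toy.isna().sum()

print(has_any_value)
print(missing_counts)
```

On the real data, `train.isna().sum()` would give the same kind of per-column count.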


In [11]:
# Remove unnecessary columns
train = train.drop(['id'], axis=1)
test = test.drop(['id'], axis=1)

In [12]:
# Split the training data into training and validation sets

# The target is the 'Exited' column (the last one); all other columns are features
y_data = train['Exited']
X_data = train.drop(columns=['Exited'])
X_train, X_test, y_train, y_test = train_test_split(X_data, y_data, test_size=0.2, random_state=42)
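
Roughly 21% of the rows have `Exited == 1`, so a plain random split can leave the two parts with slightly different churn rates. The sketch below shows one possible alternative, a stratified split, on hypothetical random data (`X_toy` and `y_toy` are illustrative and not part of the notebook).

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced toy data (about 21% positives)
rng = np.random.default_rng(42)
n = 1000
X_toy = pd.DataFrame({
    "CreditScore": rng.integers(350, 851, size=n),
    "Age": rng.integers(18, 93, size=n),
})
y_toy = pd.Series((rng.random(n) < 0.21).astype(int), name="Exited")

# stratify=y_toy keeps the class ratio nearly equal in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

print("train churn rate:", y_tr.mean())
print("validation churn rate:", y_te.mean())
```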

In [13]:
# Fill the null values with zero

X_train = X_train.fillna(0)
X_test = X_test.fillna(0)

In [14]:
# Encode the string columns as 1-based integer codes
# Note: a separate encoder is fitted on each split, so the same category
# is not guaranteed to receive the same code in X_train and X_test.

for column in X_train.columns:
    if X_train[column].dtype == 'O':
        encoder = LabelEncoder()
        X_train[column] = encoder.fit_transform(X_train[column]) + 1
        # Category -> code lookup for this column (overwritten on each iteration)
        mapping_dict = dict(zip(encoder.classes_, encoder.transform(encoder.classes_) + 1))

for column in X_test.columns:
    if X_test[column].dtype == 'O':
        encoder = LabelEncoder()
        X_test[column] = encoder.fit_transform(X_test[column]) + 1
        mapping_dict = dict(zip(encoder.classes_, encoder.transform(encoder.classes_) + 1))
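
Because the loop above fits a fresh `LabelEncoder` on `X_test`, a value such as a given surname may be mapped to different integers in the two splits. One possible alternative, sketched here on hypothetical toy frames, is to fit a single `OrdinalEncoder` on the training part and reuse it on the other part, mapping unseen categories to `-1`. This is a swapped-in technique, not what the notebook does, and the scores reported in this notebook were not produced with it.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical toy splits; 'Italy' does not occur in the training part
train_part = pd.DataFrame({
    "Geography": ["France", "Spain", "Germany", "France"],
    "Gender": ["Male", "Female", "Male", "Male"],
})
test_part = pd.DataFrame({
    "Geography": ["Spain", "Italy"],
    "Gender": ["Female", "Male"],
})

cat_cols = ["Geography", "Gender"]

# Unknown categories are encoded as -1 instead of raising an error
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)

train_enc = train_part.copy()
test_enc = test_part.copy()

# Fit on the training part only, then apply the same mapping to both parts
train_enc[cat_cols] = encoder.fit_transform(train_part[cat_cols])
test_enc[cat_cols] = encoder.transform(test_part[cat_cols])

print(train_enc)
print(test_enc)
```

With this approach "Spain" receives the same code in both frames, which is what distance-based and linear models implicitly rely on.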


In [15]:
# Inspect the encoded training features
X_train

Unnamed: 0,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
149380,15780088,2704,679,1,2,41.0,9,0.00,2,1.0,1.0,103560.98
164766,15679760,2303,721,1,2,46.0,9,115764.32,2,1.0,0.0,102950.79
155569,15637678,1511,551,1,2,35.0,5,0.00,1,1.0,0.0,155394.52
124304,15728693,889,743,3,1,31.0,3,57866.03,2,1.0,1.0,107428.42
108008,15613673,1506,675,1,2,28.0,2,0.00,2,1.0,0.0,134110.93
...,...,...,...,...,...,...,...,...,...,...,...,...
119879,15730673,1274,668,2,2,45.0,6,104576.80,1,1.0,0.0,113081.42
103694,15731166,1274,751,1,2,43.0,7,0.00,2,1.0,0.0,88866.39
131932,15573741,2395,753,1,1,39.0,7,0.00,2,1.0,0.0,167973.63
146867,15754574,200,685,1,1,48.0,4,0.00,2,1.0,1.0,24998.75


In [16]:
# Plot the distribution of each numerical feature by target class

def plot():
    numerical_columns = train.select_dtypes(include=['float64', 'int64']).columns
    # Skip the last numerical column, which is the target 'Exited'
    for column in numerical_columns[:-1]:
        plt.hist(train[train['Exited'] == 0][column], label='NotExited', color='blue', alpha=0.7, density=False)
        plt.hist(train[train['Exited'] == 1][column], label='Exited', color='red', alpha=0.7, density=False)
        plt.legend()
        plt.title(column)
        plt.xlabel(column)
        plt.ylabel('Count')
        plt.savefig(f'{_plots}{column}.png')
        plt.show()
        
# plot()

# K-Nearest Neighbor Model

In [17]:
from sklearn.neighbors import KNeighborsClassifier

In [18]:
knn_model = KNeighborsClassifier(n_neighbors=6)
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print(classification_report(y_test, knn_pred))

              precision    recall  f1-score   support

           0       0.79      0.97      0.87     26052
           1       0.27      0.04      0.07      6955

    accuracy                           0.78     33007
   macro avg       0.53      0.50      0.47     33007
weighted avg       0.68      0.78      0.70     33007



# Gaussian Naive Bayes Model 

In [19]:
from sklearn.naive_bayes import GaussianNB

In [20]:
gnb_model = GaussianNB()
gnb_model.fit(X_train, y_train)
gnb_pred = gnb_model.predict(X_test)
print(classification_report(y_test, gnb_pred))

              precision    recall  f1-score   support

           0       0.82      0.95      0.88     26052
           1       0.52      0.20      0.29      6955

    accuracy                           0.79     33007
   macro avg       0.67      0.57      0.58     33007
weighted avg       0.75      0.79      0.75     33007



# Logistic Regression Model

In [21]:
from sklearn.linear_model import LogisticRegression

In [22]:
log_model = LogisticRegression()
log_model.fit(X_train, y_train)
log_pred = log_model.predict(X_test)
print(classification_report(y_test, log_pred))

              precision    recall  f1-score   support

           0       0.79      1.00      0.88     26052
           1       0.00      0.00      0.00      6955

    accuracy                           0.79     33007
   macro avg       0.39      0.50      0.44     33007
weighted avg       0.62      0.79      0.70     33007
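
The report above shows that the model predicts class 0 for every row. One plausible cause (an assumption, not verified in this notebook) is that columns on very different scales, such as `CustomerId` and `Balance`, are fed to the solver unscaled. The sketch below, on hypothetical synthetic data, shows how a `StandardScaler` could be placed in front of `LogisticRegression`.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Hypothetical imbalanced data standing in for the churn features
X_toy, y_toy = make_classification(
    n_samples=500, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.79, 0.21], random_state=0,
)
# Blow up one column to mimic an ID-like feature with huge values
X_toy[:, 0] = X_toy[:, 0] * 1e4 + 1.5e7

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Standardize every feature before it reaches the linear model
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_lr.fit(X_tr, y_tr)
lr_pred = scaled_lr.predict(X_te)

print(classification_report(y_te, lr_pred, zero_division=0))
```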



# Support Vector Classifier Model

In [26]:
from sklearn.svm import SVC

In [27]:
svc_model = SVC()
svc_model.fit(X_train, y_train)
svc_pred = svc_model.predict(X_test)
print(classification_report(y_test, svc_pred))

# Decision Tree Classifier Model

In [23]:
from sklearn.tree import DecisionTreeClassifier

In [24]:
dtc_model = DecisionTreeClassifier()
dtc_model.fit(X_train, y_train)
dtc_pred = dtc_model.predict(X_test)
print(classification_report(y_test, dtc_pred))

              precision    recall  f1-score   support

           0       0.88      0.87      0.87     26052
           1       0.52      0.54      0.53      6955

    accuracy                           0.80     33007
   macro avg       0.70      0.70      0.70     33007
weighted avg       0.80      0.80      0.80     33007



# Random Forest Classifier Model

In [25]:
from sklearn.ensemble import RandomForestClassifier

In [26]:
rfc_model = RandomForestClassifier()
rfc_model.fit(X_train, y_train)
rfc_pred = rfc_model.predict(X_test)
print(classification_report(y_test, rfc_pred))

              precision    recall  f1-score   support

           0       0.88      0.95      0.92     26052
           1       0.74      0.54      0.62      6955

    accuracy                           0.86     33007
   macro avg       0.81      0.74      0.77     33007
weighted avg       0.85      0.86      0.85     33007



# Linear Discriminant Analysis Model

In [27]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

In [28]:
lda_model = LinearDiscriminantAnalysis()
lda_model.fit(X_train, y_train)
lda_pred = lda_model.predict(X_test)
print(classification_report(y_test, lda_pred))

              precision    recall  f1-score   support

           0       0.85      0.95      0.90     26052
           1       0.66      0.37      0.48      6955

    accuracy                           0.83     33007
   macro avg       0.75      0.66      0.69     33007
weighted avg       0.81      0.83      0.81     33007



# Gradient Boosting Classifier Model

In [29]:
from sklearn.ensemble import GradientBoostingClassifier

In [30]:
gbc_model = GradientBoostingClassifier()
gbc_model.fit(X_train, y_train)
gbc_pred = gbc_model.predict(X_test)
print(classification_report(y_test, gbc_pred))

              precision    recall  f1-score   support

           0       0.88      0.95      0.92     26052
           1       0.76      0.53      0.62      6955

    accuracy                           0.86     33007
   macro avg       0.82      0.74      0.77     33007
weighted avg       0.86      0.86      0.86     33007



# Neural Network Classifier Model

In [31]:
from sklearn.neural_network import MLPClassifier

In [32]:
nnc_model = MLPClassifier()
nnc_model.fit(X_train, y_train)
nnc_pred = nnc_model.predict(X_test)
print(classification_report(y_test, nnc_pred))

              precision    recall  f1-score   support

           0       0.79      1.00      0.88     26052
           1       0.00      0.00      0.00      6955

    accuracy                           0.79     33007
   macro avg       0.39      0.50      0.44     33007
weighted avg       0.62      0.79      0.70     33007



In [38]:
def Statistics():
    print("1. KNeighborsClassifier Score: \t\t\t", accuracy_score(y_test, knn_pred))
    print("2. Gaussian Naive Bayes Score: \t\t\t", accuracy_score(y_test, gnb_pred))
    print("3. Logistic Regressor Score: \t\t\t", accuracy_score(y_test, log_pred))
    # print("4. Support Vector Classification Score: ", accuracy_score(y_test, svc_pred))
    print("5. Decision Tree Score: \t\t\t", accuracy_score(y_test, dtc_pred))
    print("6. Random Forest Score: \t\t\t", accuracy_score(y_test, rfc_pred))
    print("7. Linear Discriminant Analysis Score: \t\t", accuracy_score(y_test, lda_pred))
    print("8. Gradient Boost Classifier Score: \t\t", accuracy_score(y_test, gbc_pred))
    print("9. Neural Network Score: \t\t\t", accuracy_score(y_test, nnc_pred))

Statistics()

1. KNeighborsClassifier Score: 			 0.7754415729996668
2. Gaussian Naive Bayes Score: 			 0.7931650861938376
3. Logistic Regressor Score: 			 0.7892871209137455
5. Decision Tree Score: 			 0.796800678643924
6. Random Forest Score: 			 0.8630290544429969
7. Linear Discriminant Analysis Score: 		 0.8267034265458842
8. Gradient Boost Classifier Score: 		 0.864968037083043
9. Neural Network Score: 			 0.7892871209137455
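
The scores above are accuracies on hard 0/1 predictions, while the task description asks for the probability of `Exited`. A threshold-free metric such as ROC AUC may therefore be a useful complement. The choice of ROC AUC is an assumption here, since the scoring metric is not stated in this notebook, and the data below is synthetic.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data standing in for the churn features
X_toy, y_toy = make_classification(
    n_samples=1000, n_features=8, n_informative=4,
    weights=[0.79, 0.21], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

clf = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Column 1 of predict_proba holds the probability of the positive class
churn_proba = clf.predict_proba(X_te)[:, 1]

acc = accuracy_score(y_te, clf.predict(X_te))
auc = roc_auc_score(y_te, churn_proba)

print("Accuracy:", acc)
print("ROC AUC: ", auc)
```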


# Pipeline Model & Scaler

In [42]:
# Create the pipeline

model = GradientBoostingClassifier()
pipeline = Pipeline([
    ('scaler', QuantileTransformer()),
    ('model', model)
])

pipeline.fit(X_train, y_train)
y_pipe = pipeline.predict(X_test)
print(classification_report(y_test, y_pipe))
print("Final Score: ", accuracy_score(y_test, y_pipe))

              precision    recall  f1-score   support

           0       0.88      0.95      0.92     26052
           1       0.75      0.53      0.62      6955

    accuracy                           0.87     33007
   macro avg       0.82      0.74      0.77     33007
weighted avg       0.86      0.87      0.86     33007

Final Score:  0.8650286302905444
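
The final score comes from a single 80/20 split, and `GradientBoostingClassifier` is created without a `random_state`, so the number can shift slightly between runs. A minimal sketch of k-fold cross-validation on a pipeline of the same shape is shown below; the data is synthetic, and `cv_pipeline` and `folds` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Hypothetical imbalanced data standing in for the churn features
X_toy, y_toy = make_classification(
    n_samples=600, n_features=8, n_informative=4,
    weights=[0.79, 0.21], random_state=0,
)

# Same structure as the notebook's pipeline, with fixed seeds
cv_pipeline = Pipeline([
    ("scaler", QuantileTransformer(n_quantiles=100, random_state=0)),
    ("model", GradientBoostingClassifier(random_state=0)),
])

# Five stratified folds; the scaler is refitted inside every fold
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(cv_pipeline, X_toy, y_toy, cv=folds, scoring="accuracy")

print("Fold accuracies:", scores)
print("Mean accuracy:  ", scores.mean())
```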


In [43]:
# Write the model info to a file

def update_info():
    # Refit the pipeline and evaluate it on the held-out split
    pipeline.fit(X_train, y_train)
    y_pipe = pipeline.predict(X_test)
    model_info = [f"Accuracy Score: {accuracy_score(y_test, y_pipe)}\n", 
                  f"Model Name: {pipeline.named_steps['model']}\n", 
                  f"Time:   {dt.datetime.now()}\n\n",
                  f"Report: {classification_report(y_test, y_pipe)}"]
    # The context manager closes the file even if writing fails
    with open(_info, "w") as info:
        info.writelines(model_info)
    print(''.join(model_info))

In [46]:
# Save the model to disk

val = input('Are you sure you want to save the last model: ')
if val == 'y':
    print('Saving . . .')
    update_info()
    # Pickle the fitted pipeline (scaler + classifier)
    with open(_model, 'wb') as model_file:
        pkl.dump(pipeline, model_file)


Saving . . .
Accuracy Score: 0.8649983336867937
Model Name: GradientBoostingClassifier()
Time:   2024-01-31 17:04:54.358229

Report:               precision    recall  f1-score   support

           0       0.88      0.95      0.92     26052
           1       0.75      0.53      0.62      6955

    accuracy                           0.86     33007
   macro avg       0.82      0.74      0.77     33007
weighted avg       0.86      0.86      0.86     33007
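
The notebook defines `_submission` but stops after pickling the model. The sketch below shows one way the saved pipeline could be loaded again and used to write a submission file with predicted probabilities. It runs on hypothetical toy data in a temporary directory, and the `id`/`Exited` column layout is an assumption about `sample_submission.csv`, not something confirmed above.

```python
import os
import pickle
import tempfile

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import QuantileTransformer

# Hypothetical training data and test rows (ids are made up)
rng = np.random.default_rng(0)
X_fit = pd.DataFrame({
    "CreditScore": rng.integers(350, 851, 200),
    "Age": rng.integers(18, 93, 200),
})
y_fit = (X_fit["Age"] > 50).astype(int)
toy_test = pd.DataFrame({
    "id": [1001, 1002, 1003],
    "CreditScore": [600, 720, 510],
    "Age": [30, 61, 45],
})

pipe = Pipeline([
    ("scaler", QuantileTransformer(n_quantiles=50, random_state=0)),
    ("model", GradientBoostingClassifier(random_state=0)),
])
pipe.fit(X_fit, y_fit)

with tempfile.TemporaryDirectory() as tmp:
    model_path = os.path.join(tmp, "model.pkl")
    with open(model_path, "wb") as fh:
        pickle.dump(pipe, fh)

    # Load the pipeline back, as a later session would
    with open(model_path, "rb") as fh:
        loaded = pickle.load(fh)

    # Keep the ids aside; the model only sees the feature columns
    ids = toy_test["id"]
    proba = loaded.predict_proba(toy_test.drop(columns=["id"]))[:, 1]

    submission = pd.DataFrame({"id": ids, "Exited": proba})
    submission_path = os.path.join(tmp, "submission.csv")
    submission.to_csv(submission_path, index=False)
    written = pd.read_csv(submission_path)

print(written)
```

In the notebook itself the `id` column is dropped from `test` early on, so the ids would have to be stored before that step, and `test` would need the same encoding as the training features.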

