# DTSA5511 - Bank Churn Classification

**Author** - Korkrid Akepanidtaworn, University of Colorado Boulder, Masters in Data Science

**Date** - August 21, 2024

## Project Goal

- My goal is to predict customer churn in the banking industry using machine learning.
- Dataset: [Bank Customer Churn Prediction](https://www.kaggle.com/datasets/shubhammeshram579/bank-customer-churn-prediction)

## Dataset Description

The bank customer churn dataset is a commonly used dataset for predicting customer churn in the banking industry. It contains information on bank customers who either left the bank or continue to be a customer. The dataset includes the following attributes:

- **Customer ID**: A unique identifier for each customer
- **Surname**: The customer's surname or last name
- **Credit Score**: A numerical value representing the customer's credit score
- **Geography**: The country where the customer resides (France, Spain or Germany)
- **Gender**: The customer's gender (Male or Female)
- **Age**: The customer's age
- **Tenure**: The number of years the customer has been with the bank
- **Balance**: The customer's account balance
- **NumOfProducts**: The number of bank products the customer uses (e.g., savings account, credit card)
- **HasCrCard**: Whether the customer has a credit card (1 = yes, 0 = no)
- **IsActiveMember**: Whether the customer is an active member (1 = yes, 0 = no)
- **EstimatedSalary**: The estimated salary of the customer
- **Exited**: Whether the customer has churned (1 = yes, 0 = no)


In [1]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

/kaggle/input/binary-classification-bank-churn-dataset-cleaned/train_cleaned.csv
/kaggle/input/binary-classification-bank-churn-dataset-cleaned/test_cleaned.csv


In [3]:
# List all the files in the specified directory
os.listdir('../input/binary-classification-bank-churn-dataset-cleaned/')

['train_cleaned.csv', 'test_cleaned.csv']

In [4]:
# Load data
import pandas as pd
train_df = pd.read_csv('/kaggle/input/binary-classification-bank-churn-dataset-cleaned/train_cleaned.csv')
test_df = pd.read_csv('/kaggle/input/binary-classification-bank-churn-dataset-cleaned/test_cleaned.csv')

In [9]:
train_df.head()

Unnamed: 0,Gender,Balance,NumOfProducts,IsActiveMember,Geography_France,Geography_Germany,Geography_Spain,Age_bin,Exited
0,0,0.0,2,0.0,1,0,0,1,0
1,0,0.0,2,1.0,1,0,0,1,0
2,0,0.0,2,0.0,1,0,0,3,0
3,0,148882.54,1,1.0,1,0,0,1,0
4,0,0.0,2,1.0,0,0,1,1,0


In [10]:
test_df.head()

Unnamed: 0,Gender,Balance,NumOfProducts,IsActiveMember,Geography_France,Geography_Germany,Geography_Spain,Age_bin
0,1,0.0,2,1.0,1,0,0,0
1,1,0.0,1,0.0,1,0,0,4
2,1,0.0,2,0.0,1,0,0,1
3,0,0.0,1,0.0,1,0,0,2
4,0,121263.62,1,0.0,0,1,0,2


## EDA Procedure

Inspect, Visualize, and Clean the Data

In [6]:
# Display basic information about the datasets
print("Training Data Info:")
print(train_df.info())
print("\nTest Data Info:")
print(test_df.info())

# Display first few rows of the datasets
print("\nTraining Data Sample:")
print(train_df.head())
print("\nTest Data Sample:")
print(test_df.head())

Training Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 260226 entries, 0 to 260225
Data columns (total 9 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Gender             260226 non-null  int64  
 1   Balance            260226 non-null  float64
 2   NumOfProducts      260226 non-null  int64  
 3   IsActiveMember     260226 non-null  float64
 4   Geography_France   260226 non-null  int64  
 5   Geography_Germany  260226 non-null  int64  
 6   Geography_Spain    260226 non-null  int64  
 7   Age_bin            260226 non-null  int64  
 8   Exited             260226 non-null  int64  
dtypes: float64(2), int64(7)
memory usage: 17.9 MB
None

Test Data Info:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 110023 entries, 0 to 110022
Data columns (total 8 columns):
 #   Column             Non-Null Count   Dtype  
---  ------             --------------   -----  
 0   Gender             110023 non-null  int64  
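
Beyond `info()`, the inspection step usually includes a look at the class balance of the target and a count of missing values. The following is a minimal sketch of that check, not part of the original notebook. The helper `summarize_target` and the frame `toy_df` are hypothetical names, and the values in `toy_df` are made up for illustration; in the notebook the same call would be applied to `train_df`.

```python
import pandas as pd


def summarize_target(df: pd.DataFrame, target: str = "Exited") -> pd.DataFrame:
    """Return the count and share of each class in the target column."""
    counts = df[target].value_counts().sort_index()
    shares = counts / counts.sum()
    return pd.DataFrame({"count": counts, "share": shares})


# Tiny made-up frame standing in for train_df (illustrative values only)
toy_df = pd.DataFrame({
    "Balance": [0.0, 148882.54, 0.0, 121263.62, 0.0],
    "Exited": [0, 0, 1, 0, 0],
})

summary = summarize_target(toy_df)   # class counts and shares
missing = toy_df.isna().sum()        # missing values per column

print(summary)
print(missing)
```

A strongly skewed `share` column would suggest that accuracy alone is not a reliable metric for the model trained later.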
 

### Declare Model Architecture

In [11]:
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Normalization, Dense, Dropout

# Every column except the target 'Exited' is a model input
feature_cols = [col for col in train_df.columns if col != 'Exited']
X = train_df[feature_cols].values.astype('float32')
y = train_df['Exited'].values

# Rescale the inputs inside the model so that 'Balance' does not dominate the other features
# (the normalization statistics are computed on all training rows)
normalizer = Normalization()
normalizer.adapt(X)

# Define model architecture: a small feed-forward network for tabular data
model = Sequential()
model.add(Input(shape=(len(feature_cols),)))
model.add(normalizer)
model.add(Dense(64, activation='relu'))
model.add(Dropout(0.2))
model.add(Dense(1, activation='sigmoid'))

# Compile the model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

# Print the model summary
model.summary()


In [14]:
from sklearn.model_selection import train_test_split

# Split the features and target into training and validation sets
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=42)

# Train the model
history = model.fit(X_train, y_train, epochs=5, batch_size=64, validation_data=(X_val, y_val), verbose=2)
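
If the churned class turns out to be much rarer than the retained class (common in churn data, but not verified for this dataset), the training step can weight the classes. This is a minimal sketch of computing balanced class weights with scikit-learn. The array `y_toy` is made up for illustration and stands in for `y_train`.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Made-up labels standing in for y_train: 8 retained customers, 2 churned
y_toy = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])

# 'balanced' weights are n_samples / (n_classes * count_of_class)
classes = np.unique(y_toy)
weights = compute_class_weight(class_weight="balanced", classes=classes, y=y_toy)

# Keras expects a {class_index: weight} dictionary
class_weight = {int(c): float(w) for c, w in zip(classes, weights)}
print(class_weight)
```

The resulting dictionary could be passed to `model.fit` through its `class_weight` argument, so that errors on the rarer class contribute more to the loss.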

In [None]:
# Evaluate the model
val_loss, val_accuracy = model.evaluate(X_val, y_val, verbose=0)
print(f"Validation Loss: {val_loss}")
print(f"Validation Accuracy: {val_accuracy}")
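
Accuracy can look good on churn data even when few churners are caught, so it may help to report precision, recall, and ROC AUC on the validation set as well. This is a minimal sketch using scikit-learn. The arrays `y_true` and `y_prob` are made-up stand-ins for `y_val` and the output of `model.predict(X_val)`.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, roc_auc_score

# Made-up validation labels and predicted churn probabilities
y_true = np.array([0, 0, 0, 0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.4, 0.2, 0.6, 0.8, 0.3, 0.05, 0.9])

# Hard labels at the same 0.5 threshold the notebook uses for test predictions
y_pred = (y_prob > 0.5).astype(int)

precision = precision_score(y_true, y_pred)  # share of predicted churners that really churned
recall = recall_score(y_true, y_pred)        # share of real churners that were caught
auc = roc_auc_score(y_true, y_prob)          # threshold-independent ranking quality

print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print(f"ROC AUC: {auc:.3f}")
```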

In [None]:
import matplotlib.pyplot as plt

# Plot accuracy
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epochs')
plt.ylabel('Accuracy')
plt.legend()
plt.show()

# Plot loss
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()

In [None]:
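# Make predictions on the test data, using the same feature columns as the training set
feature_cols = [col for col in train_df.columns if col != 'Exited']
test_predictions = model.predict(test_df[feature_cols].values.astype('float32'))
test_predictions = (test_predictions > 0.5).astype(int).flatten()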
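
The thresholding step above can be isolated in a small helper, which makes the cut-off explicit and easy to change. This is a sketch only: `to_labels` is a hypothetical name, and the probabilities are made up to mimic the `(n, 1)` array that a sigmoid output layer returns from `predict`.

```python
import numpy as np


def to_labels(probs: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Convert predicted probabilities into a flat array of 0/1 labels."""
    # Values strictly above the threshold become 1, everything else 0
    return (np.asarray(probs) > threshold).astype(int).flatten()


# Made-up probabilities shaped like Keras predict() output for a single sigmoid unit
toy_probs = np.array([[0.2], [0.7], [0.5]])

labels_default = to_labels(toy_probs)        # threshold 0.5
labels_lenient = to_labels(toy_probs, 0.15)  # lower threshold flags more customers

print(labels_default)
print(labels_lenient)
```

A lower threshold trades precision for recall, which may be preferable when missing a churner costs more than contacting a loyal customer.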

In [None]:
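# Prepare the submission file
# No sample submission file is loaded in this notebook, so the frame is built
# directly from the predictions; the column name 'Exited' mirrors the training target
submission = pd.DataFrame({'Exited': test_predictions}, index=test_df.index)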

In [None]:
# Check the first few rows of the submission file
print(submission.head())

# Save the file
submission.to_csv('/kaggle/working/submission_a4.csv', index=False)