## 1. Overview, Description of Problem

Predict whether a customer continues with their account or closes it. 

The performance metric is the area under the ROC curve (ROC AUC), computed from the predicted probabilities and the observed target.
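
As a quick illustration of what the metric measures, here is a minimal sketch on toy values (the labels and probabilities below are made up for illustration, not competition data). It computes ROC AUC with `roc_auc_score` and then again by hand as the fraction of (positive, negative) pairs in which the positive example receives the higher score; the hand version ignores ties for simplicity.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels and predicted probabilities (illustrative values only)
y_true = np.array([0, 0, 1, 1])
y_prob = np.array([0.1, 0.4, 0.35, 0.8])

# Library computation
auc = roc_auc_score(y_true, y_prob)

# Same quantity by hand: share of (positive, negative) pairs ranked correctly
pos = y_prob[y_true == 1]
neg = y_prob[y_true == 0]
pairwise_auc = np.mean(pos[:, None] > neg[None, :])

print(auc, pairwise_auc)  # 0.75 0.75
```

Because only the ranking of the scores matters, the metric does not require well-calibrated probabilities.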

For each id in the test set, you must predict the probability of the target value "Exited". 
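
A minimal sketch of what such a prediction file might look like, assuming the submission has an `id` column and an `Exited` column holding probabilities (check the competition's sample submission for the exact layout). The probabilities here are hypothetical placeholders, not model output.

```python
import io

import numpy as np
import pandas as pd

# Toy stand-in for the test set ids (the first test rows start at id 165034)
test_ids = pd.Series([165034, 165035, 165036], name="id")

# Hypothetical predicted probabilities of Exited = 1 (placeholders only)
exit_prob = np.array([0.05, 0.62, 0.11])

submission = pd.DataFrame({"id": test_ids, "Exited": exit_prob})

# Written to an in-memory buffer here; a real run would pass a file path instead
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
print(buffer.getvalue())
```

With a fitted scikit-learn classifier, the probabilities would typically come from the positive-class column of `predict_proba`.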

Citation:
Walter Reade, Ashley Chow. (2024). Binary Classification with a Bank Churn Dataset. Kaggle. https://kaggle.com/competitions/playground-series-s4e1

## 2. Import the Data

Import libraries

In [2]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import roc_auc_score

In [3]:
train_df = pd.read_csv('train.csv')
test_df = pd.read_csv('test.csv')

## 3. Explore the Data

In [4]:
train_df.head()

Unnamed: 0,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary,Exited
0,0,15674932,Okwudilichukwu,668,France,Male,33.0,3,0.0,2,1.0,0.0,181449.97,0
1,1,15749177,Okwudiliolisa,627,France,Male,33.0,1,0.0,2,1.0,1.0,49503.5,0
2,2,15694510,Hsueh,678,France,Male,40.0,10,0.0,2,1.0,0.0,184866.69,0
3,3,15741417,Kao,581,France,Male,34.0,2,148882.54,1,1.0,1.0,84560.88,0
4,4,15766172,Chiemenam,716,Spain,Male,33.0,5,0.0,2,1.0,1.0,15068.83,0


First rows of the test set (same columns, without `Exited`):

Unnamed: 0,id,CustomerId,Surname,CreditScore,Geography,Gender,Age,Tenure,Balance,NumOfProducts,HasCrCard,IsActiveMember,EstimatedSalary
0,165034,15773898,Lucchese,586,France,Female,23.0,2,0.0,2,0.0,1.0,160976.75
1,165035,15782418,Nott,683,France,Female,46.0,2,0.0,1,1.0,0.0,72549.27
2,165036,15807120,K?,656,France,Female,34.0,7,0.0,2,1.0,0.0,138882.09
3,165037,15808905,O'Donnell,681,France,Male,36.0,8,0.0,1,1.0,0.0,113931.57
4,165038,15607314,Higgins,752,Germany,Male,38.0,10,121263.62,1,1.0,0.0,139431.0


In [9]:
# Column names
train_df.columns

Index(['id', 'CustomerId', 'Surname', 'CreditScore', 'Geography', 'Gender',
       'Age', 'Tenure', 'Balance', 'NumOfProducts', 'HasCrCard',
       'IsActiveMember', 'EstimatedSalary', 'Exited'],
      dtype='object')

In [16]:
# Data types
train_df.dtypes

# Group the columns by dtype for the summaries below
int_types = ['id', 'CustomerId', 'CreditScore', 'Tenure', 'NumOfProducts', 'Exited']
cat_types = ['Surname', 'Geography', 'Gender']
float_types = ['Age', 'Balance', 'HasCrCard', 'IsActiveMember', 'EstimatedSalary']

# Target column: 1 = exited, 0 = stayed
target = train_df['Exited']
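
The lists in this cell are written by hand. As an alternative (not what the notebook does), the same grouping could be derived from the dtypes with `select_dtypes`. The sketch below runs on a small hypothetical frame built from a few columns of the first two training rows, so the names `sample`, `int_cols`, `float_cols`, and `cat_cols` are illustrative only.

```python
import numpy as np
import pandas as pd

# Small stand-in frame using values from the first two training rows
sample = pd.DataFrame({
    "id": [0, 1],
    "Surname": ["Okwudilichukwu", "Okwudiliolisa"],
    "CreditScore": [668, 627],
    "Geography": ["France", "France"],
    "Age": [33.0, 33.0],
    "Balance": [0.0, 0.0],
    "Exited": [0, 0],
})

# Integer and floating-point columns, selected by NumPy dtype family
int_cols = sample.select_dtypes(include=np.integer).columns.tolist()
float_cols = sample.select_dtypes(include=np.floating).columns.tolist()

# Everything non-numeric is treated as categorical here
cat_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(int_cols, float_cols, cat_cols)
```

Deriving the lists this way keeps them in sync with the data if columns are added or re-typed later.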

In [11]:
#Missing Values
train_df.isna().sum()

id                 0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

### Visualize Data

In [14]:
target.value_counts()
# 0 = No (stayed), 1 = Yes (exited)

Exited
0    130113
1     34921
Name: count, dtype: int64
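
To put the counts in proportion, here is a short sketch that recomputes the churn rate from the two reported counts. On the full data, `target.value_counts(normalize=True)` should give the same shares directly; the `counts` Series below is just a self-contained stand-in.

```python
import pandas as pd

# Class counts as reported by target.value_counts()
counts = pd.Series({0: 130113, 1: 34921}, name="count")

# Share of customers who exited
churn_rate = counts[1] / counts.sum()
print(round(churn_rate, 4))  # 0.2116
```

Roughly one customer in five exited, so the classes are imbalanced but not extremely so.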

In [17]:
train_df[int_types].describe()

Unnamed: 0,id,CustomerId,CreditScore,Tenure,NumOfProducts,Exited
count,165034.0,165034.0,165034.0,165034.0,165034.0,165034.0
mean,82516.5,15692010.0,656.454373,5.020353,1.554455,0.211599
std,47641.3565,71397.82,80.10334,2.806159,0.547154,0.408443
min,0.0,15565700.0,350.0,0.0,1.0,0.0
25%,41258.25,15633140.0,597.0,3.0,1.0,0.0
50%,82516.5,15690170.0,659.0,5.0,2.0,0.0
75%,123774.75,15756820.0,710.0,7.0,2.0,0.0
max,165033.0,15815690.0,850.0,10.0,4.0,1.0


`id` and `CustomerId` are identifiers and don't really provide any predictive information.
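
One way to act on this, sketched on a small hypothetical frame (the notebook itself has not dropped anything at this point), is to leave the identifier columns and the target out of the feature matrix:

```python
import pandas as pd

# Small stand-in frame using values from the first three training rows
sample = pd.DataFrame({
    "id": [0, 1, 2],
    "CustomerId": [15674932, 15749177, 15694510],
    "CreditScore": [668, 627, 678],
    "Age": [33.0, 33.0, 40.0],
    "Exited": [0, 0, 0],
})

# Identifier columns carry no signal; the target must not be used as a feature
id_cols = ["id", "CustomerId"]
features = sample.drop(columns=id_cols + ["Exited"])

print(features.columns.tolist())  # ['CreditScore', 'Age']
```

The test set's `id` column is still needed later to label the predictions, so it is worth keeping a copy of it before dropping.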

CreditScore - 350 to 850, with a mean of about 656 and a median of 659, which looks plausible for a credit score.

NumOfProducts - 1 to 4, with a mean of 1.55 and a median of 2 (a customer can't hold 1.5 products, so the median is the more natural summary).

Exited is the target: 1 = yes, 0 = no. Its mean of 0.21 means about 21% of customers exited.

In [18]:
train_df[float_types].describe()

Unnamed: 0,Age,Balance,HasCrCard,IsActiveMember,EstimatedSalary
count,165034.0,165034.0,165034.0,165034.0,165034.0
mean,38.125888,55478.086689,0.753954,0.49777,112574.822734
std,8.867205,62817.663278,0.430707,0.499997,50292.865585
min,18.0,0.0,0.0,0.0,11.58
25%,32.0,0.0,1.0,0.0,74637.57
50%,37.0,0.0,1.0,0.0,117948.0
75%,42.0,119939.5175,1.0,1.0,155152.4675
max,92.0,250898.09,1.0,1.0,199992.48


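Age - 18 to 92, with a mean of about 38 and a median of 37.

Balance - 0 to 250,898, with a mean of 55,478 and a median of 0, so at least half of the customers have a zero balance. Worth checking.

HasCrCard, IsActiveMember - binary flags (1 = yes, 0 = no) stored as floats. About 75% of customers have a credit card and about 50% are active members.

EstimatedSalary - 11.58 to 199,992.48, with a mean of 112,575 and a median of 117,948.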

In [20]:
train_df['Balance'].value_counts().sort_values()

Balance
62321.36         1
84483.05         1
122723.67        1
121323.19        1
118711.57        1
             ...  
129855.32       59
122314.50       63
127864.40       64
124577.33       88
0.00         89648
Name: count, Length: 30075, dtype: int64
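
The zero balances dominate this output. The sketch below quantifies that from the reported counts, then shows on a toy Series (the `Balance` values of the first five training rows) how the zero share, the median of the non-zero balances, and a binary indicator could be computed. The `has_balance` indicator is only an illustrative idea, not something the notebook uses.

```python
import pandas as pd

# Share of zero balances, from the counts reported above
zero_share_reported = 89648 / 165034
print(round(zero_share_reported, 4))  # 0.5432

# Toy Series: Balance values from the first five training rows
balance = pd.Series([0.0, 0.0, 0.0, 148882.54, 0.0], name="Balance")

# Fraction of rows with a zero balance
zero_share = (balance == 0).mean()

# Median among customers who actually hold a balance
nonzero_median = balance[balance > 0].median()

# Hypothetical indicator feature: 1 if the customer has a positive balance
has_balance = (balance > 0).astype(int)

print(zero_share, nonzero_median, has_balance.tolist())
```

About 54% of the training rows have a zero balance, which explains why the median is 0 while the mean is above 55,000.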