## Your Goal
For this Episode of the Series, your task is to predict whether a customer continues with their account or closes it (i.e., churns). Good luck!

## Evaluation
Submissions are evaluated on area under the ROC curve between the predicted probability and the observed target.
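
To make the metric concrete, here is a minimal sketch of ROC AUC computed from its pairwise definition: the fraction of (positive, negative) pairs in which the positive example gets the higher score. The function name `roc_auc` and the toy labels and scores are illustrative assumptions; in practice `sklearn.metrics.roc_auc_score` computes the same quantity far more efficiently.

```python
def roc_auc(y_true, y_score):
    """ROC AUC as the chance that a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, y_score) if t == 1]
    neg = [s for t, s in zip(y_true, y_score) if t == 0]
    if not pos or not neg:
        raise ValueError("ROC AUC needs at least one positive and one negative label")
    # Count pairs where the positive scores higher; a tie counts as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data (made up): two customers who stayed (0) and two who exited (1).
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, y_score)
print(auc)  # 3 of the 4 positive/negative pairs are ordered correctly -> 0.75
```

Because only the ordering of the scores matters, any monotonic rescaling of the predictions leaves the AUC unchanged.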

## Submission File
For each id in the test set, you must predict the probability for the target variable Exited. The file should contain a header and have the following format:
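
A minimal sketch of building such a file with pandas, assuming the two columns are `id` and `Exited` as the description above implies. The ids and probabilities below are made up for illustration.

```python
import io

import pandas as pd

# Hypothetical test ids and predicted churn probabilities (illustrative values only).
test_ids = [100, 101, 102]
churn_proba = [0.12, 0.87, 0.50]

submission = pd.DataFrame({"id": test_ids, "Exited": churn_proba})

# Write to an in-memory buffer here; index=False keeps the row index out of the file.
buffer = io.StringIO()
submission.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
print(csv_text)
```

To write an actual file, the same call with a path works, for example `submission.to_csv("submission.csv", index=False)` (the filename is an assumption).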

## Features
* Balance-related Features: We’ve introduced features such as IsBalanceZero, Low_Balance, Mid_Balance, and High_Balance to categorize customers based on their account balances.
* Salary-related Features: Similarly, we’ve categorized customers into Low_salary, Mid_salary, and High_salary groups based on their estimated salaries.
* Interaction Features: We’ve created interaction features like HasCard&Active, Interaction_Score, and Gender_Balance to capture relationships between different attributes.
* Credit Score and Age Groups: By binning credit scores and ages into categories, we’ve added credit_score_cat and Age_Group features, giving a coarser, categorical view of customers' credit standing and age.
* Utilization Ratios: We’ve calculated the Credit_Utilization_Ratio and Balance_to_Salary_Ratio to gauge customers' credit and financial behaviors.
* Other Features: Additionally, we’ve incorporated features such as Inactive_Flag, Full_Utilization_Flag, Salary_Credit_Score_Interaction, and more to capture diverse aspects of customer interactions and behaviors.
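
As one way to make the feature descriptions above concrete, here is a minimal pandas sketch for a handful of them. The cut points (50,000 and 100,000 for balance and salary, plus the age and credit-score bin edges) and the bin labels are assumptions chosen for illustration, not the values used to build the features, and `add_features` is a hypothetical helper name. Interaction_Score, Gender_Balance, Credit_Utilization_Ratio and the remaining features are left out because their definitions are not given.

```python
import numpy as np
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with a few of the engineered columns (assumed definitions)."""
    out = df.copy()

    # Balance flags; the 50k / 100k thresholds are assumptions.
    out["IsBalanceZero"] = (out["Balance"] == 0).astype(int)
    out["Low_Balance"] = ((out["Balance"] > 0) & (out["Balance"] < 50_000)).astype(int)
    out["Mid_Balance"] = out["Balance"].between(50_000, 100_000).astype(int)
    out["High_Balance"] = (out["Balance"] > 100_000).astype(int)

    # Salary flags, using the same assumed thresholds.
    out["Low_salary"] = (out["EstimatedSalary"] < 50_000).astype(int)
    out["Mid_salary"] = out["EstimatedSalary"].between(50_000, 100_000).astype(int)
    out["High_salary"] = (out["EstimatedSalary"] > 100_000).astype(int)

    # Interaction: 1 only if the customer has a card AND is an active member.
    out["HasCard&Active"] = (out["HasCrCard"] * out["IsActiveMember"]).astype(int)

    # Binned age and credit score; bin edges and labels are assumptions.
    out["Age_Group"] = pd.cut(
        out["Age"], bins=[0, 30, 45, 60, np.inf],
        labels=["<=30", "31-45", "46-60", "60+"],
    )
    out["credit_score_cat"] = pd.cut(
        out["CreditScore"], bins=[0, 580, 670, 740, 800, np.inf],
        labels=["poor", "fair", "good", "very_good", "excellent"],
    )

    # Ratio feature; a salary of exactly 0 would yield inf or NaN and need handling.
    out["Balance_to_Salary_Ratio"] = out["Balance"] / out["EstimatedSalary"]
    return out


# Three rows taken from the df.head() output of the training data.
sample = pd.DataFrame({
    "CreditScore": [668, 581, 716],
    "Age": [33.0, 34.0, 33.0],
    "Balance": [0.0, 148882.54, 0.0],
    "HasCrCard": [1.0, 1.0, 1.0],
    "IsActiveMember": [0.0, 1.0, 1.0],
    "EstimatedSalary": [181449.97, 84560.88, 15068.83],
})
features = add_features(sample)
print(features.filter(["IsBalanceZero", "High_Balance", "HasCard&Active", "Age_Group"]))
```

Keeping the transformations in one function makes it straightforward to apply identical logic to the training and test sets.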

In [22]:
import pandas as pd
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import SGDClassifier
from sklearn.kernel_approximation import RBFSampler
from sklearn.model_selection import train_test_split

In [2]:
df = pd.read_csv("../data/Binary Classification with a Bank Churn Dataset/train.csv")
df.head()

| | id | CustomerId | Surname | CreditScore | Geography | Gender | Age | Tenure | Balance | NumOfProducts | HasCrCard | IsActiveMember | EstimatedSalary | Exited |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 15674932 | Okwudilichukwu | 668 | France | Male | 33.0 | 3 | 0.0 | 2 | 1.0 | 0.0 | 181449.97 | 0 |
| 1 | 1 | 15749177 | Okwudiliolisa | 627 | France | Male | 33.0 | 1 | 0.0 | 2 | 1.0 | 1.0 | 49503.5 | 0 |
| 2 | 2 | 15694510 | Hsueh | 678 | France | Male | 40.0 | 10 | 0.0 | 2 | 1.0 | 0.0 | 184866.69 | 0 |
| 3 | 3 | 15741417 | Kao | 581 | France | Male | 34.0 | 2 | 148882.54 | 1 | 1.0 | 1.0 | 84560.88 | 0 |
| 4 | 4 | 15766172 | Chiemenam | 716 | Spain | Male | 33.0 | 5 | 0.0 | 2 | 1.0 | 1.0 | 15068.83 | 0 |


In [3]:
len(df)

165034

In [5]:
df.dtypes

id                   int64
CustomerId           int64
Surname             object
CreditScore          int64
Geography           object
Gender              object
Age                float64
Tenure               int64
Balance            float64
NumOfProducts        int64
HasCrCard          float64
IsActiveMember     float64
EstimatedSalary    float64
Exited               int64
dtype: object

In [6]:
df.isna().sum()

id                 0
CustomerId         0
Surname            0
CreditScore        0
Geography          0
Gender             0
Age                0
Tenure             0
Balance            0
NumOfProducts      0
HasCrCard          0
IsActiveMember     0
EstimatedSalary    0
Exited             0
dtype: int64

In [18]:
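# Separate the target from the features; id, CustomerId and Surname stay in X.
X = df.drop(columns="Exited")
y = df["Exited"]

# Hold out 20% of the rows; the split differs between runs because no random_state is set.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)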

In [19]:
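# One-hot encode the text columns (Surname, Geography, Gender); numeric columns pass through unchanged.
object_columns = X.select_dtypes(include=["object"]).columns
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), object_columns), remainder="passthrough")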

In [20]:
sgd_model = make_pipeline(preprocessor, SGDClassifier())
sgd_model.fit(X_train, y_train)

In [21]:
# Pipeline.score reports mean accuracy, here on the training split (not ROC AUC on held-out data).
sgd_model.score(X_train, y_train)

0.788399342558719
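
The score above is training accuracy, while the competition metric is ROC AUC. Below is a minimal sketch of scoring a similar pipeline on a held-out split with `roc_auc_score`. It runs on a small synthetic stand-in rather than the real data, so the column values, the rule that generates `Exited`, and the added `StandardScaler` step are all assumptions for illustration. Scaling is included because SGD-trained linear models tend to be sensitive to feature scale.

```python
import numpy as np
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in for the competition data (made-up values, not the real file).
rng = np.random.default_rng(0)
n = 2000
demo = pd.DataFrame({
    "Geography": rng.choice(["France", "Spain", "Germany"], size=n),
    "Age": rng.normal(40, 10, size=n),
    "Balance": rng.uniform(0, 200_000, size=n),
})
# Made-up rule: older customers are more likely to exit.
p_exit = 1 / (1 + np.exp(-(demo["Age"].to_numpy() - 45) / 5))
demo["Exited"] = (rng.random(n) < p_exit).astype(int)

X = demo.drop(columns="Exited")
y = demo["Exited"]
# stratify keeps the class balance equal in both splits; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), ["Geography"]),
    (StandardScaler(), ["Age", "Balance"]),
)
model = make_pipeline(preprocessor, SGDClassifier(random_state=0))
model.fit(X_train, y_train)

# The default hinge loss offers no predict_proba, so rank customers by decision_function.
test_auc = roc_auc_score(y_test, model.decision_function(X_test))
print(f"hold-out ROC AUC: {test_auc:.3f}")
```

Since AUC depends only on the ranking, `decision_function` scores are enough for evaluation. A submission asks for probabilities, though, which in recent scikit-learn versions would mean training with `SGDClassifier(loss="log_loss")` and calling `predict_proba`.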