# <b>Churn Prediction: Credit Card Customers</b>

The dataset describes a bank whose credit card customers are leaving the service, so the primary objective is to reduce churn. This is a binary classification problem: each customer either stays or leaves. If I can predict which customers are likely to leave, the bank will have a better opportunity to retain them.

## <b>Data preprocessing</b>

In [1]:
import pandas as pd

In [2]:
df = pd.read_csv("data/data.csv")
df.columns

Index(['CLIENTNUM', 'Attrition_Flag', 'Customer_Age', 'Gender',
       'Dependent_count', 'Education_Level', 'Marital_Status',
       'Income_Category', 'Card_Category', 'Months_on_book',
       'Total_Relationship_Count', 'Months_Inactive_12_mon',
       'Contacts_Count_12_mon', 'Credit_Limit', 'Total_Revolving_Bal',
       'Avg_Open_To_Buy', 'Total_Amt_Chng_Q4_Q1', 'Total_Trans_Amt',
       'Total_Trans_Ct', 'Total_Ct_Chng_Q4_Q1', 'Avg_Utilization_Ratio',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1',
       'Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2'],
      dtype='object')

### <b>Drop irrelevant columns</b>

In [3]:
df["Churn"] = (df["Attrition_Flag"] == "Attrited Customer").astype(int)
dropping_columns = [
    "CLIENTNUM",
    "Attrition_Flag",
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_1",
    "Naive_Bayes_Classifier_Attrition_Flag_Card_Category_Contacts_Count_12_mon_Dependent_count_Education_Level_Months_Inactive_12_mon_2",
]
df = df.drop(columns=dropping_columns)

In [4]:
df.select_dtypes(include="object").columns

Index(['Gender', 'Education_Level', 'Marital_Status', 'Income_Category',
       'Card_Category'],
      dtype='object')

In [5]:
df = pd.get_dummies(df, columns=[
    "Gender",
    "Education_Level",
    "Marital_Status",
    "Income_Category",
    "Card_Category"
])
df

Unnamed: 0,Customer_Age,Dependent_count,Months_on_book,Total_Relationship_Count,Months_Inactive_12_mon,Contacts_Count_12_mon,Credit_Limit,Total_Revolving_Bal,Avg_Open_To_Buy,Total_Amt_Chng_Q4_Q1,...,Income_Category_$120K +,Income_Category_$40K - $60K,Income_Category_$60K - $80K,Income_Category_$80K - $120K,Income_Category_Less than $40K,Income_Category_Unknown,Card_Category_Blue,Card_Category_Gold,Card_Category_Platinum,Card_Category_Silver
0,45,3,39,5,1,3,12691.0,777,11914.0,1.335,...,False,False,True,False,False,False,True,False,False,False
1,49,5,44,6,1,2,8256.0,864,7392.0,1.541,...,False,False,False,False,True,False,True,False,False,False
2,51,3,36,4,1,0,3418.0,0,3418.0,2.594,...,False,False,False,True,False,False,True,False,False,False
3,40,4,34,3,4,1,3313.0,2517,796.0,1.405,...,False,False,False,False,True,False,True,False,False,False
4,40,3,21,5,1,0,4716.0,0,4716.0,2.175,...,False,False,True,False,False,False,True,False,False,False
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
10122,50,2,40,3,2,3,4003.0,1851,2152.0,0.703,...,False,True,False,False,False,False,True,False,False,False
10123,41,2,25,4,2,3,4277.0,2186,2091.0,0.804,...,False,True,False,False,False,False,True,False,False,False
10124,44,1,36,5,3,4,5409.0,0,5409.0,0.819,...,False,False,False,False,True,False,True,False,False,False
10125,30,2,36,4,3,3,5281.0,0,5281.0,0.535,...,False,True,False,False,False,False,True,False,False,False
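
As called above, `pd.get_dummies` keeps one column per category level, so each group of dummy columns always sums to one (for example `Gender_F` and `Gender_M`). For a linear model that is redundant information. The following is a minimal sketch on a hypothetical toy frame (not the bank data), assuming one wanted to drop the first level of each variable via `drop_first=True`; the notebook itself keeps all levels.

```python
import pandas as pd

# Hypothetical toy frame, used only to show the shape of the encoding.
toy = pd.DataFrame({
    "Gender": ["F", "M", "F"],
    "Card_Category": ["Blue", "Gold", "Blue"],
})

# One column per level (what the notebook does).
full = pd.get_dummies(toy, columns=["Gender", "Card_Category"])

# One column fewer per variable: the first level becomes the implicit baseline.
reduced = pd.get_dummies(toy, columns=["Gender", "Card_Category"], drop_first=True)

print(list(full.columns))
print(list(reduced.columns))
```

Whether dropping a level helps here is an open question: with a regularized logistic regression the redundant columns are usually tolerated, but the coefficients are easier to interpret against a baseline level.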


### <b>Split Data</b>

In [6]:
x = df.drop(columns=["Churn"])
y = df["Churn"]

In [7]:
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)
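
The `stratify=y` argument matters because churners are the minority class: it keeps the churn rate the same in the training and test sets. A small sketch with hypothetical labels (illustrative numbers, not the bank data) shows the effect:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical labels: 1000 customers, 16% churners (illustrative only).
y_toy = np.array([1] * 160 + [0] * 840)
x_toy = np.arange(len(y_toy)).reshape(-1, 1)  # placeholder feature

_, _, y_tr, y_te = train_test_split(
    x_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# With stratification, all three rates should agree.
print(f"churn rate overall: {y_toy.mean():.3f}")
print(f"churn rate train:   {y_tr.mean():.3f}")
print(f"churn rate test:    {y_te.mean():.3f}")
```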

### <b>Standardize Data</b>

In [8]:
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

### <b>Train & Test the Model</b>

In [9]:
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [10]:
def test_model(model):
    """Evaluate a fitted model on the global x_test_scaled / y_test split."""
    # Predict on the (scaled) test set
    y_pred = model.predict(x_test_scaled)

    # Evaluation
    accuracy = accuracy_score(y_test, y_pred)
    conf_matrix = confusion_matrix(y_test, y_pred)
    report = classification_report(y_test, y_pred)

    print(f"Accuracy: {accuracy}")
    print(f"\nConfusion Matrix:\n{conf_matrix}")
    print(f"\nClassification Report:\n{report}")

In [11]:
model_1 = LogisticRegression(solver='liblinear')
model_1.fit(x_train_scaled, y_train)
test_model(model=model_1)

Accuracy: 0.8998025666337611

Confusion Matrix:
[[1648   53]
 [ 150  175]]

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.97      0.94      1701
           1       0.77      0.54      0.63       325

    accuracy                           0.90      2026
   macro avg       0.84      0.75      0.79      2026
weighted avg       0.89      0.90      0.89      2026
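
The 0.90 accuracy is less impressive than it looks, because the test set is dominated by customers who stay. The numbers in the report can be re-derived from the confusion matrix by hand; the sketch below uses only the values printed above and adds the accuracy of a trivial "nobody churns" baseline for comparison.

```python
# Confusion matrix of model_1 as printed above (rows = actual, columns = predicted).
tn, fp = 1648, 53
fn, tp = 150, 175
total = tn + fp + fn + tp

accuracy = (tn + tp) / total
majority_baseline = (tn + fp) / total  # always predicting class 0 ("stays")
precision = tp / (tp + fp)             # share of predicted churners who really churn
recall = tp / (tp + fn)                # share of actual churners that are caught
f1 = 2 * precision * recall / (precision + recall)

print(f"accuracy:          {accuracy:.4f}")
print(f"majority baseline: {majority_baseline:.4f}")
print(f"precision (churn): {precision:.2f}")
print(f"recall (churn):    {recall:.2f}")
print(f"f1 (churn):        {f1:.2f}")
```

The model beats the baseline by roughly six percentage points of accuracy while missing almost half of the churners, which is why recall on class 1 is the more useful number to watch here.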



In [12]:
model_2 = LogisticRegression(solver='liblinear', class_weight="balanced")
model_2.fit(x_train_scaled, y_train)
test_model(model=model_2)

Accuracy: 0.8553800592300099

Confusion Matrix:
[[1467  234]
 [  59  266]]

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.86      0.91      1701
           1       0.53      0.82      0.64       325

    accuracy                           0.86      2026
   macro avg       0.75      0.84      0.78      2026
weighted avg       0.89      0.86      0.87      2026
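
With `class_weight="balanced"`, recall on churners rises from 0.54 to 0.82 while precision falls from 0.77 to 0.53. scikit-learn documents the "balanced" heuristic as `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces that formula on hypothetical labels (the 84/16 split is an illustrative assumption, not the actual training counts):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels: 84 customers who stay, 16 who churn (illustrative only).
y_toy = np.array([0] * 84 + [1] * 16)

# Weights as computed by scikit-learn's "balanced" heuristic.
weights = compute_class_weight(
    class_weight="balanced", classes=np.array([0, 1]), y=y_toy
)

# The same weights from the documented formula.
manual = len(y_toy) / (2 * np.bincount(y_toy))

print(dict(zip([0, 1], weights.round(3))))
print(manual.round(3))
```

Each misclassified churner costs several times more in the loss than a misclassified non-churner, which pushes the decision boundary toward flagging more customers as likely to leave.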



In [13]:
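# Note: fitted on the unscaled x_train, but test_model predicts on x_test_scaled,
# so the score below reflects a train/test preprocessing mismatch.
model_3 = LogisticRegression(solver="liblinear", class_weight="balanced")
model_3.fit(x_train, y_train)
test_model(model=model_3)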

Accuracy: 0.29318854886475815

Confusion Matrix:
[[ 277 1424]
 [   8  317]]

Classification Report:
              precision    recall  f1-score   support

           0       0.97      0.16      0.28      1701
           1       0.18      0.98      0.31       325

    accuracy                           0.29      2026
   macro avg       0.58      0.57      0.29      2026
weighted avg       0.85      0.29      0.28      2026
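
`model_3` is fitted on the raw `x_train` while `test_model` predicts on `x_test_scaled`, so the collapse to 0.29 accuracy most likely comes from feeding the model inputs on a different scale than it was trained on. One way to rule out this kind of mismatch is to bundle the scaler and the classifier in a scikit-learn pipeline. The following is a sketch on synthetic stand-in data (the feature construction and variable names are assumptions for illustration, not the notebook's data):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: one large-valued feature whose level depends on
# the label, plus one pure-noise feature.
rng = np.random.default_rng(0)
y_demo = rng.integers(0, 2, size=1000)
x_demo = np.column_stack([
    5000 + 100 * y_demo + rng.normal(0, 10, size=1000),
    rng.normal(0, 1, size=1000),
])

x_tr, x_te, y_tr, y_te = train_test_split(
    x_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

# The pipeline fits the scaler on the training data only and re-applies it
# inside predict(), so raw features go in on both the train and test side.
pipe = make_pipeline(StandardScaler(), LogisticRegression(solver="liblinear"))
pipe.fit(x_tr, y_tr)

pipe_accuracy = accuracy_score(y_te, pipe.predict(x_te))
print(f"pipeline accuracy: {pipe_accuracy:.3f}")
```

With this design there is no separate scaled copy of the data to keep track of, so a fitted model cannot be evaluated on inputs in the wrong representation.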





In [14]:
# Same configuration as model_2 (scaled features, balanced class weights).
model_4 = LogisticRegression(solver="liblinear", class_weight="balanced")
model_4.fit(x_train_scaled, y_train)
test_model(model=model_4)

Accuracy: 0.8553800592300099

Confusion Matrix:
[[1467  234]
 [  59  266]]

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.86      0.91      1701
           1       0.53      0.82      0.64       325

    accuracy                           0.86      2026
   macro avg       0.75      0.84      0.78      2026
weighted avg       0.89      0.86      0.87      2026



In [15]:
# Same configuration as model_1 (scaled features, no class weighting).
model_5 = LogisticRegression(solver="liblinear")
model_5.fit(x_train_scaled, y_train)
test_model(model=model_5)

Accuracy: 0.8998025666337611

Confusion Matrix:
[[1648   53]
 [ 150  175]]

Classification Report:
              precision    recall  f1-score   support

           0       0.92      0.97      0.94      1701
           1       0.77      0.54      0.63       325

    accuracy                           0.90      2026
   macro avg       0.84      0.75      0.79      2026
weighted avg       0.89      0.90      0.89      2026



### <b>Improving the Model</b>

In [16]:
for idx, col in enumerate(df.columns):
    print(f"{idx + 1}. {col}")

1. Customer_Age
2. Dependent_count
3. Months_on_book
4. Total_Relationship_Count
5. Months_Inactive_12_mon
6. Contacts_Count_12_mon
7. Credit_Limit
8. Total_Revolving_Bal
9. Avg_Open_To_Buy
10. Total_Amt_Chng_Q4_Q1
11. Total_Trans_Amt
12. Total_Trans_Ct
13. Total_Ct_Chng_Q4_Q1
14. Avg_Utilization_Ratio
15. Churn
16. Gender_F
17. Gender_M
18. Education_Level_College
19. Education_Level_Doctorate
20. Education_Level_Graduate
21. Education_Level_High School
22. Education_Level_Post-Graduate
23. Education_Level_Uneducated
24. Education_Level_Unknown
25. Marital_Status_Divorced
26. Marital_Status_Married
27. Marital_Status_Single
28. Marital_Status_Unknown
29. Income_Category_$120K +
30. Income_Category_$40K - $60K
31. Income_Category_$60K - $80K
32. Income_Category_$80K - $120K
33. Income_Category_Less than $40K
34. Income_Category_Unknown
35. Card_Category_Blue
36. Card_Category_Gold
37. Card_Category_Platinum
38. Card_Category_Silver


#### <b>Drop irrelevant columns</b>

In [17]:
df_filtered = df[[
    "Churn",
    "Total_Relationship_Count",
    "Months_Inactive_12_mon",
    "Contacts_Count_12_mon",
    "Total_Revolving_Bal",
    "Total_Amt_Chng_Q4_Q1",
    "Total_Trans_Amt",
    "Total_Trans_Ct",
    "Total_Ct_Chng_Q4_Q1",
    "Avg_Utilization_Ratio"
]]

In [18]:
x = df_filtered.drop(columns=["Churn"])
y = df_filtered["Churn"]

x_train, x_test, y_train, y_test = train_test_split(
    x,
    y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

scaler = StandardScaler()
x_train_scaled = scaler.fit_transform(x_train)
x_test_scaled = scaler.transform(x_test)

In [19]:
model_6 = LogisticRegression(solver='liblinear')
model_6.fit(x_train_scaled, y_train)
test_model(model=model_6)

Accuracy: 0.8973346495557749

Confusion Matrix:
[[1650   51]
 [ 157  168]]

Classification Report:
              precision    recall  f1-score   support

           0       0.91      0.97      0.94      1701
           1       0.77      0.52      0.62       325

    accuracy                           0.90      2026
   macro avg       0.84      0.74      0.78      2026
weighted avg       0.89      0.90      0.89      2026



In [20]:
model_7 = LogisticRegression(solver='liblinear', class_weight="balanced")
model_7.fit(x_train_scaled, y_train)
test_model(model=model_7)

Accuracy: 0.8509378084896347

Confusion Matrix:
[[1461  240]
 [  62  263]]

Classification Report:
              precision    recall  f1-score   support

           0       0.96      0.86      0.91      1701
           1       0.52      0.81      0.64       325

    accuracy                           0.85      2026
   macro avg       0.74      0.83      0.77      2026
weighted avg       0.89      0.85      0.86      2026
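
To see what the reduced feature set changed, the churn-class metrics of the four distinct configurations can be put side by side. The sketch below only re-uses the confusion matrices printed in this notebook; the helper name `churn_metrics` is hypothetical.

```python
# (tn, fp, fn, tp) read off the confusion matrices printed in this notebook.
matrices = {
    "model_1": (1648, 53, 150, 175),   # all features
    "model_2": (1467, 234, 59, 266),   # all features, balanced weights
    "model_6": (1650, 51, 157, 168),   # 9 selected features
    "model_7": (1461, 240, 62, 263),   # 9 selected features, balanced weights
}


def churn_metrics(tn, fp, fn, tp):
    """Precision, recall and F1 for the churn class (label 1)."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


results = {name: churn_metrics(*cm) for name, cm in matrices.items()}

for name, m in results.items():
    print(
        f"{name}: precision={m['precision']:.2f} "
        f"recall={m['recall']:.2f} f1={m['f1']:.2f}"
    )
```

Read this way, cutting the inputs down to nine columns costs only a few correctly identified churners (266 versus 263 with balanced weights), so the filtered models are simpler rather than clearly better on this test split.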

