This project builds an end-to-end machine learning pipeline to predict credit card default using historical customer data. The goal is to identify customers at risk of defaulting so that preventive action can be taken proactively.



<mark>This is a supervised learning problem: we train on feature columns together with a known target column, and the model learns to predict that outcome.</mark>

In [None]:
import pandas as pd
import numpy as np

import matplotlib.pyplot as plt
import seaborn as sns

pd.set_option("display.max_columns", None)

# Load data
df = pd.read_csv("../data/credit_default.csv")

df.head()


Let's understand the data. We will view:
1. size
2. columns (names & data types)
3. summary statistics
4. columns with NULL values

In [None]:
df.shape

In [None]:
df.info()
df.columns

In [20]:
df.describe()

statistic,ID,LIMIT_BAL,SEX,EDUCATION,MARRIAGE,AGE,PAY_0,PAY_2,PAY_3,PAY_4,PAY_5,PAY_6,BILL_AMT1,BILL_AMT2,BILL_AMT3,BILL_AMT4,BILL_AMT5,BILL_AMT6,PAY_AMT1,PAY_AMT2,PAY_AMT3,PAY_AMT4,PAY_AMT5,PAY_AMT6,default.payment.next.month
count,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0,30000.0
mean,15000.5,167484.322667,1.603733,1.853133,1.551867,35.4855,-0.0167,-0.133767,-0.1662,-0.220667,-0.2662,-0.2911,51223.3309,49179.075167,47013.15,43262.948967,40311.400967,38871.7604,5663.5805,5921.163,5225.6815,4826.076867,4799.387633,5215.502567,0.2212
std,8660.398374,129747.661567,0.489129,0.790349,0.52197,9.217904,1.123802,1.197186,1.196868,1.169139,1.133187,1.149988,73635.860576,71173.768783,69349.39,64332.856134,60797.15577,59554.107537,16563.280354,23040.87,17606.96147,15666.159744,15278.305679,17777.465775,0.415062
min,1.0,10000.0,1.0,0.0,0.0,21.0,-2.0,-2.0,-2.0,-2.0,-2.0,-2.0,-165580.0,-69777.0,-157264.0,-170000.0,-81334.0,-339603.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
25%,7500.75,50000.0,1.0,1.0,1.0,28.0,-1.0,-1.0,-1.0,-1.0,-1.0,-1.0,3558.75,2984.75,2666.25,2326.75,1763.0,1256.0,1000.0,833.0,390.0,296.0,252.5,117.75,0.0
50%,15000.5,140000.0,2.0,2.0,2.0,34.0,0.0,0.0,0.0,0.0,0.0,0.0,22381.5,21200.0,20088.5,19052.0,18104.5,17071.0,2100.0,2009.0,1800.0,1500.0,1500.0,1500.0,0.0
75%,22500.25,240000.0,2.0,2.0,2.0,41.0,0.0,0.0,0.0,0.0,0.0,0.0,67091.0,64006.25,60164.75,54506.0,50190.5,49198.25,5006.0,5000.0,4505.0,4013.25,4031.5,4000.0,0.0
max,30000.0,1000000.0,2.0,6.0,3.0,79.0,8.0,8.0,8.0,8.0,8.0,8.0,964511.0,983931.0,1664089.0,891586.0,927171.0,961664.0,873552.0,1684259.0,896040.0,621000.0,426529.0,528666.0,1.0


In [None]:
df.isnull().sum()

Looking at the data, let's ask ourselves these questions:
1. What is the target column? 
2. Are most features numeric? 
3. Any obvious ID column?
4. Any weird values?

Note:  
target: The specific column in a dataset that the model is being trained to predict  
features: All the other columns that provide input information used by the model to make its predictions
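
To make the target/feature distinction concrete, here is a minimal sketch on a tiny made-up DataFrame. The helper name `split_features_target` and the sample values are illustrative assumptions; only the column names mirror the dataset.

```python
import pandas as pd


def split_features_target(frame: pd.DataFrame, target: str):
    """Return (X, y): every column except `target`, and the `target` column itself."""
    X = frame.drop(columns=[target])
    y = frame[target]
    return X, y


# Tiny made-up sample (values are illustrative, not taken from the real data)
toy = pd.DataFrame(
    {
        "LIMIT_BAL": [20000, 120000, 90000],
        "AGE": [24, 26, 34],
        "default.payment.next.month": [1, 1, 0],
    }
)

X, y = split_features_target(toy, "default.payment.next.month")
print(list(X.columns))  # feature names
print(y.tolist())       # target labels
```

The same call should work on the full DataFrame once the target column name is known.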

Observations:  
1. <mark>**default.payment.next.month**</mark> is the target variable  
    * We know this because it represents a future outcome that we want to predict
    * It represents a binary value (1 for yes and 0 for no)
    * This makes it a **binary classification problem**
    * The task definition:  
    <mark>**Given customer information and payment history, predict whether a customer will default next month.**</mark>

2. All the feature columns are numeric
3. The ID column is explicitly defined
4. There are no:  
    * NaNs / nulls
    * impossible numeric ranges
    * corrupted rows
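
Summary statistics only show the minimum and maximum of each column, so a cheap extra check for "weird values" is to list the distinct codes in the low-cardinality columns and compare them with the codes you expect. The sketch below runs on a made-up frame; the helper `unexpected_values` and the `expected_codes` mapping are hypothetical and should be replaced with whatever the dataset's documentation specifies.

```python
import pandas as pd


def unexpected_values(frame: pd.DataFrame, expected: dict) -> dict:
    """Map each column to the sorted values found outside its expected set."""
    report = {}
    for column, allowed in expected.items():
        observed = set(frame[column].unique().tolist())
        extras = sorted(observed - set(allowed))
        if extras:
            report[column] = extras
    return report


# Made-up sample; the real check would be run on df
toy = pd.DataFrame(
    {
        "SEX": [1, 2, 2, 1],
        "EDUCATION": [1, 2, 0, 6],
        "MARRIAGE": [1, 2, 3, 1],
    }
)

# Hypothetical expected codes (an assumption, not taken from the notebook)
expected_codes = {
    "SEX": {1, 2},
    "EDUCATION": {1, 2, 3, 4},
    "MARRIAGE": {1, 2, 3},
}

report = unexpected_values(toy, expected_codes)
print(report)
```

An empty report means every observed code is in the expected set. Anything listed is worth a closer look before modeling.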



This dataset is fairly clean.

Let's set the target column to **default.payment.next.month**:

* default.payment.next.month = 1 → the customer did default  
* default.payment.next.month = 0 → the customer did NOT default


In [None]:
target_col = "default.payment.next.month"

Let's answer the questions:  
1. Is the dataset balanced or imbalanced?
2. Roughly what percentage of customers default versus not?

In [None]:
# Use value_counts to get a Series containing the count of each unique value
# We are looking for the number of defaults and non-defaults
df[target_col].value_counts()

In [None]:
# normalize=True returns proportions instead of raw counts
df[target_col].value_counts(normalize=True)

default.payment.next.month
0    0.7788
1    0.2212
Name: proportion, dtype: float64

The dataset is moderately imbalanced, with the majority of customers not defaulting (78% vs 22%). Because accuracy is dominated by the majority class, precision and recall are more informative than accuracy alone.
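
To see why accuracy alone can mislead here, consider a hypothetical "model" that predicts "no default" for every customer. The counts below are derived from the figures above (30,000 rows, 22.12% defaults). The baseline itself is only an illustration and is not part of the pipeline.

```python
# Class counts implied by the summary above: 30,000 rows, 22.12% defaults
n_total = 30_000
n_default = round(n_total * 0.2212)
n_no_default = n_total - n_default

# Hypothetical baseline: always predict 0 (no default)
correct = n_no_default          # every non-defaulter is classified correctly
accuracy = correct / n_total
defaults_caught = 0             # the baseline never predicts 1

print(f"accuracy = {accuracy:.4f}")
print(f"defaults caught = {defaults_caught} of {n_default}")
```

The baseline scores roughly 78% accuracy while identifying none of the defaulters, which is exactly the failure that accuracy hides.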

Note (definitions):  
**Precision** and **recall** are key metrics for evaluating classification models, each measuring a different aspect of performance.  
Precision: Of all the items the model predicted as positive, how many were actually positive?  
Recall: Of all the actual positive items, how many did the model find?
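
The two definitions translate directly into counts of true positives, false positives, and false negatives. Below is a minimal plain-Python sketch; the function name `precision_recall` and the label lists are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Compute precision and recall for the positive class (label 1)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up labels: 4 actual defaulters, 3 predicted defaulters
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall = precision_recall(y_true, y_pred)
print(f"precision = {precision:.3f}, recall = {recall:.3f}")
```

In practice, `sklearn.metrics.precision_score` and `sklearn.metrics.recall_score` compute the same quantities. The hand-rolled version is only meant to show what is being counted.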

Let's visualize the data:

In [None]:
sns.countplot(x=target_col, data=df)
plt.title("Target Distribution: Credit Card Default")
plt.show()

Now that we've seen the target distribution, let's set our priorities before optimizing.

ML models aren't perfect and will make mistakes. We must decide how best to control those errors by answering the question:

**Which kind of mistake hurts the business more?**
* false positive (predicting default for a customer who won't default)?
* false negative (missing a real default)?

<mark>In this context, false negatives are costly because failing to identify a true defaulter may lead to financial loss. This suggests **recall** may be more important than precision, depending on business strategy. </mark>
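
One way to make this trade-off tangible is to attach a cost to each error type and compare decision thresholds. The sketch below uses made-up scores and hypothetical costs (`COST_FP`, `COST_FN`); the real values depend on business strategy and are not given in this notebook.

```python
def errors_at_threshold(y_true, scores, threshold):
    """Return (false_positives, false_negatives) when predicting 1 for score >= threshold."""
    pairs = list(zip(y_true, scores))
    fp = sum(1 for t, s in pairs if t == 0 and s >= threshold)
    fn = sum(1 for t, s in pairs if t == 1 and s < threshold)
    return fp, fn


# Made-up labels and predicted default probabilities
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.6, 0.35, 0.7, 0.4, 0.3, 0.2, 0.1]

# Hypothetical costs: a missed default hurts five times more than a false alarm
COST_FP = 1
COST_FN = 5

costs = {}
for threshold in (0.5, 0.3):
    fp, fn = errors_at_threshold(y_true, scores, threshold)
    costs[threshold] = fp * COST_FP + fn * COST_FN
    print(f"threshold={threshold}: FP={fp}, FN={fn}, cost={costs[threshold]}")
```

Under these assumed costs, lowering the threshold trades extra false positives for fewer missed defaults and a lower total cost. With a different cost ratio, the preferred threshold could change.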

Let's remove the ID column because it is non-informative. We don't want the model to accidentally train on the column.

In [27]:
df.drop(columns=["ID"], inplace=True, errors="ignore")
df.columns

Index(['LIMIT_BAL', 'SEX', 'EDUCATION', 'MARRIAGE', 'AGE', 'PAY_0', 'PAY_2',
       'PAY_3', 'PAY_4', 'PAY_5', 'PAY_6', 'BILL_AMT1', 'BILL_AMT2',
       'BILL_AMT3', 'BILL_AMT4', 'BILL_AMT5', 'BILL_AMT6', 'PAY_AMT1',
       'PAY_AMT2', 'PAY_AMT3', 'PAY_AMT4', 'PAY_AMT5', 'PAY_AMT6',
       'default.payment.next.month'],
      dtype='object')
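
A quick sanity check that could be run before a drop like the one above is to confirm that the identifier really is just a row label, with one distinct value per row. The summary statistics (count 30,000, min 1, max 30,000) are consistent with that, but they do not prove it. The helper `looks_like_row_id` and the toy frame below are illustrative assumptions.

```python
import pandas as pd


def looks_like_row_id(frame: pd.DataFrame, column: str) -> bool:
    """True if `column` has exactly one distinct, non-null value per row."""
    series = frame[column]
    return bool(series.notna().all() and series.is_unique)


# Made-up sample; on the real data this would be run on df before dropping ID
toy = pd.DataFrame({"ID": [1, 2, 3, 4], "AGE": [24, 26, 34, 26]})

id_is_row_label = looks_like_row_id(toy, "ID")
age_is_row_label = looks_like_row_id(toy, "AGE")
print(id_is_row_label, age_is_row_label)

# Non-mutating alternative to inplace=True: keep the original frame intact
features = toy.drop(columns=["ID"], errors="ignore")
print(list(features.columns))
```

Returning a new frame instead of using `inplace=True` is a style choice. It makes the cell safe to re-run and leaves the raw data available for later inspection.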