# CREDIT CARD APPROVAL PROJECT

This notebook presents a complete machine learning project to predict **credit card approval** using client application data and historical credit behavior.

We'll work with two datasets:

- `application_record.csv`: Contains demographic and financial attributes for each client (one row per client).
- `credit_record.csv`: Contains monthly credit status history per client (multiple rows per client).

The goal is to build a **model** that predicts whether a client should be approved for a credit card, based on their profile and past credit behavior.

### Key steps in this notebook:

1. **Load and inspect the data**
2. **Clean and merge the datasets**
3. **Define a meaningful target variable** (for example, late-payment history as a proxy for credit risk)
4. **Explore and visualize the data**
5. **Engineer useful features**
6. **Train and evaluate classification models**
7. **Interpret model results and feature importance**

This project will highlight best practices in real-world data science, including dealing with missing values, imbalanced classes, and model evaluation beyond accuracy.


Let's get to it.

In [1]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

In [2]:
# Now let's load the datasets
df_application = pd.read_csv("data/application_record.csv")
df_credit = pd.read_csv("data/credit_record.csv")

In [3]:
df_application

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438552,6840104,M,N,Y,0,135000.0,Pensioner,Secondary / secondary special,Separated,House / apartment,-22717,365243,1,0,0,0,,1.0
438553,6840222,F,N,N,0,103500.0,Working,Secondary / secondary special,Single / not married,House / apartment,-15939,-3007,1,0,0,0,Laborers,1.0
438554,6841878,F,N,N,0,54000.0,Commercial associate,Higher education,Single / not married,With parents,-8169,-372,1,1,0,0,Sales staff,1.0
438555,6842765,F,N,Y,0,72000.0,Pensioner,Secondary / secondary special,Married,House / apartment,-21673,365243,1,0,0,0,,2.0


In [4]:
df_credit

Unnamed: 0,ID,MONTHS_BALANCE,STATUS
0,5001711,0,X
1,5001711,-1,0
2,5001711,-2,0
3,5001711,-3,0
4,5001712,0,C
...,...,...,...
1048570,5150487,-25,C
1048571,5150487,-26,C
1048572,5150487,-27,C
1048573,5150487,-28,C


First, we'll clean the datasets and generate our target column.

In [5]:
df_application.drop_duplicates(subset=["ID"], inplace=True)
df_application

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438552,6840104,M,N,Y,0,135000.0,Pensioner,Secondary / secondary special,Separated,House / apartment,-22717,365243,1,0,0,0,,1.0
438553,6840222,F,N,N,0,103500.0,Working,Secondary / secondary special,Single / not married,House / apartment,-15939,-3007,1,0,0,0,Laborers,1.0
438554,6841878,F,N,N,0,54000.0,Commercial associate,Higher education,Single / not married,With parents,-8169,-372,1,1,0,0,Sales staff,1.0
438555,6842765,F,N,Y,0,72000.0,Pensioner,Secondary / secondary special,Married,House / apartment,-21673,365243,1,0,0,0,,2.0
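Before dropping rows, it can help to count how many IDs are actually repeated. The following is a minimal sketch on a toy frame (the rows and the names `toy`, `n_dupes` and `deduped` are hypothetical, not taken from `application_record.csv`); it uses the non-inplace form of `drop_duplicates` so the original frame stays available for comparison.

```python
import pandas as pd

# Hypothetical rows: ID 10 appears twice with different attributes
toy = pd.DataFrame({"ID": [10, 10, 11], "CODE_GENDER": ["M", "F", "F"]})

# Count rows whose ID was already seen earlier in the frame
n_dupes = toy.duplicated(subset=["ID"]).sum()
print(f"duplicated IDs: {n_dupes}")

# Keep the first row per ID and return a new frame instead of mutating in place
deduped = toy.drop_duplicates(subset=["ID"], keep="first")
print(deduped)
```

Note that `keep="first"` silently discards the later row even when its attributes differ, so it is worth inspecting a few duplicates by hand.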


In [6]:
df_application.isna().sum()

ID                          0
CODE_GENDER                 0
FLAG_OWN_CAR                0
FLAG_OWN_REALTY             0
CNT_CHILDREN                0
AMT_INCOME_TOTAL            0
NAME_INCOME_TYPE            0
NAME_EDUCATION_TYPE         0
NAME_FAMILY_STATUS          0
NAME_HOUSING_TYPE           0
DAYS_BIRTH                  0
DAYS_EMPLOYED               0
FLAG_MOBIL                  0
FLAG_WORK_PHONE             0
FLAG_PHONE                  0
FLAG_EMAIL                  0
OCCUPATION_TYPE        134193
CNT_FAM_MEMBERS             0
dtype: int64

`OCCUPATION_TYPE` is the only column with missing values (134,193 rows). We'll deal with it in the preprocessing step.

In [7]:
df_credit.STATUS.value_counts()

STATUS
C    442031
0    383120
X    209230
1     11090
5      1693
2       868
3       320
4       223
Name: count, dtype: int64

In [8]:
# Map the letter codes C and X to negative integers so STATUS is fully numeric
df_credit["STATUS"] = df_credit["STATUS"].replace(
    {"C": -2, "X": -1}
).astype(int)

# Flag a monthly record as bad when STATUS >= 2, and use it as our target
df_credit["BAD_CREDIT"] = (df_credit["STATUS"] >= 2).astype(int)
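To make the labeling rule concrete, here is a small sketch on a toy history. It assumes the coding usually documented for the public version of this dataset (0 to 5 are increasingly long overdue buckets, `C` means paid off that month, `X` means no loan that month); the notebook itself does not state this, so treat it as an assumption. The rows are hypothetical.

```python
import pandas as pd

# Toy credit history (hypothetical rows, not taken from credit_record.csv)
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["C", "0", "2", "X", "1"],
})

# Map the two letter codes to negative numbers (as strings), then cast to int
status_num = toy["STATUS"].replace({"C": "-2", "X": "-1"}).astype(int)

# A month counts as "bad" when the numeric status is 2 or higher
toy["BAD_CREDIT"] = (status_num >= 2).astype(int)
print(toy)
```

Only the month with status `2` is flagged; statuses `0` and `1` (shorter delays under the assumed coding) stay on the good side of the threshold.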

In [9]:
df_credit.BAD_CREDIT.value_counts()

BAD_CREDIT
0    1045471
1       3104
Name: count, dtype: int64

In [10]:
df_credit.isna().sum()

ID                0
MONTHS_BALANCE    0
STATUS            0
BAD_CREDIT        0
dtype: int64

Perfect! No missing values here.

Now let's aggregate the credit data to get one `BAD_CREDIT` label per `ID`.

In [20]:
# Taking the max per ID labels a client as bad if any month in their history is bad
df_credit_agg = df_credit.groupby("ID")["BAD_CREDIT"].max().reset_index()

# Merge with application data
df_merged = df_application.merge(df_credit_agg, on="ID", how="inner")

df_merged.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 36457 entries, 0 to 36456
Data columns (total 19 columns):
 #   Column               Non-Null Count  Dtype  
---  ------               --------------  -----  
 0   ID                   36457 non-null  int64  
 1   CODE_GENDER          36457 non-null  object 
 2   FLAG_OWN_CAR         36457 non-null  object 
 3   FLAG_OWN_REALTY      36457 non-null  object 
 4   CNT_CHILDREN         36457 non-null  int64  
 5   AMT_INCOME_TOTAL     36457 non-null  float64
 6   NAME_INCOME_TYPE     36457 non-null  object 
 7   NAME_EDUCATION_TYPE  36457 non-null  object 
 8   NAME_FAMILY_STATUS   36457 non-null  object 
 9   NAME_HOUSING_TYPE    36457 non-null  object 
 10  DAYS_BIRTH           36457 non-null  int64  
 11  DAYS_EMPLOYED        36457 non-null  int64  
 12  FLAG_MOBIL           36457 non-null  int64  
 13  FLAG_WORK_PHONE      36457 non-null  int64  
 14  FLAG_PHONE           36457 non-null  int64  
 15  FLAG_EMAIL           36457 non-null 
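The aggregation and merge can be traced on a toy example. This is a sketch with hypothetical rows and names (`monthly`, `per_client`, `applications`), showing that the per-ID maximum marks a client as bad when at least one month is bad, and that the inner join keeps only clients present in both tables, which is why the merged frame has just 36,457 rows.

```python
import pandas as pd

# Hypothetical monthly records: client 1 has one bad month, clients 2 and 3 have none
monthly = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "BAD_CREDIT": [0, 0, 1, 0, 0, 0],
})

# One row per client: max is 1 if any month was flagged
per_client = monthly.groupby("ID")["BAD_CREDIT"].max().reset_index()

# Hypothetical applications: client 4 has no credit history, client 3 never applied
applications = pd.DataFrame({
    "ID": [1, 2, 4],
    "AMT_INCOME_TOTAL": [100.0, 200.0, 300.0],
})

# Inner join keeps only IDs found in both frames
merged = applications.merge(per_client, on="ID", how="inner")
print(merged)
```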

Now it's time to encode our data and handle missing values. We're going to do this with a `scikit-learn` `Pipeline`.

In [30]:
# In this cell we're going to:
# 1- Fill missing data
# 2- Convert data to numbers

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import StandardScaler

# Handle missing data
df = df_merged.copy()
df["OCCUPATION_TYPE"] = df["OCCUPATION_TYPE"].fillna("Unknown")

# Define Features
# Categorical
categorical_features = df.select_dtypes(include=["object"]).columns.tolist()
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
    ("onehot", OneHotEncoder(handle_unknown="ignore"))
])

# Numerical
numerical_features = df.select_dtypes(include=["int64", "float64"]).columns.tolist()
numerical_features.remove("ID")  # ID is just an identifier, not a feature
numerical_features.remove("BAD_CREDIT")  # Target variable
numerical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("scaler", StandardScaler())
])

# Setup the preprocessing steps
preprocessor = ColumnTransformer(
    transformers=[
        ("cat", categorical_transformer, categorical_features),
        ("num", numerical_transformer, numerical_features)
    ]
)
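To see what a preprocessor of this shape produces, here is a minimal sketch on a four-row toy frame. The values and the name `toy_preprocessor` are hypothetical; the transformers mirror the ones defined above, applied to one categorical and one numerical column.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical rows with one missing value in each column
toy = pd.DataFrame({
    "OCCUPATION_TYPE": ["Laborers", np.nan, "Sales staff", "Laborers"],
    "AMT_INCOME_TOTAL": [100.0, 200.0, np.nan, 400.0],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # Fill missing categories with a placeholder, then one-hot encode
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="constant", fill_value="missing")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["OCCUPATION_TYPE"]),
    # Fill missing numbers with the column mean, then standardize
    ("num", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="mean")),
        ("scaler", StandardScaler()),
    ]), ["AMT_INCOME_TOTAL"]),
])

encoded = toy_preprocessor.fit_transform(toy)
print(encoded.shape)
```

The result has three one-hot columns (two occupations plus the `missing` placeholder) and one scaled numeric column. Columns not listed in any transformer are dropped by default, which is how `ID` stays out of the model inputs even if it remains in the feature frame.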


Once this is done, we can build and test a model. We don't yet know which model will perform best, so we have to run some tests.

In [None]:
# Importing the models
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Importing the cross-validation helper
from sklearn.model_selection import cross_val_score

# Let's split into X & y first
X = df.drop("BAD_CREDIT", axis=1)
y = df["BAD_CREDIT"]

# Now let's split into train and test data
# (this hold-out split is not used by the cross-validation loop below)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Dictionary to map the models
models = {
    "Random Forest": RandomForestClassifier(random_state=42, class_weight="balanced"),
    "Logistic Regression": LogisticRegression(random_state=42, max_iter=5000, class_weight="balanced"),
    "KNN": KNeighborsClassifier()
}

for name, base_model in models.items():
    # Chain preprocessing and the estimator so each CV fold fits both on its own training part
    model = Pipeline(steps=[
        ("preprocessor", preprocessor),
        ("model", base_model)
    ])

    # 5-fold cross-validated ROC AUC on the full X and y
    score = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
    print(f"MODEL: {name} - AUC-MEAN: {np.mean(score)} - ALL SCORES: {score}")

MODEL: Random Forest - AUC-MEAN: 0.4997936596942235 - ALL SCORES: [0.45142591 0.38198782 0.53204282 0.49420187 0.63930987]
MODEL: Logistic Regression - AUC-MEAN: 0.5151355147324505 - ALL SCORES: [0.45208197 0.50965424 0.5242065  0.54376157 0.5459733 ]
MODEL: KNN - AUC-MEAN: 0.6020496301393014 - ALL SCORES: [0.59033134 0.60044137 0.62481852 0.5488871  0.64576982]
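Random Forest and Logistic Regression score close to 0.5, which is near chance, and the fold scores vary widely. One thing that may be worth checking (an assumption, not something this notebook verified) is whether the fold layout contributes: passing `cv=5` uses stratified folds without shuffling, so each fold is a contiguous block of rows. The sketch below shows one possible setup with a stratified hold-out split and shuffled stratified folds, run on synthetic data from `make_classification` rather than on the credit data, so its scores say nothing about this project.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic, imbalanced stand-in for the credit data (about 5% positives)
X_toy, y_toy = make_classification(
    n_samples=2000, n_features=10, weights=[0.95, 0.05], random_state=42
)

# Stratify the hold-out split so both parts keep the same class ratio
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, stratify=y_toy, random_state=42
)

# Shuffled, stratified folds; cross-validate on the training part only
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
clf = LogisticRegression(max_iter=5000, class_weight="balanced")
scores = cross_val_score(clf, X_tr, y_tr, cv=cv, scoring="roc_auc")
print(f"AUC mean: {scores.mean():.3f}, folds: {np.round(scores, 3)}")
```

Keeping `X_te` and `y_te` out of cross-validation leaves them available for a final, untouched evaluation of whichever model is selected.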


In [29]:
df["BAD_CREDIT"].value_counts(normalize=True)

BAD_CREDIT
0    0.983103
1    0.016897
Name: proportion, dtype: float64
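With only about 1.7% of clients labeled bad, accuracy alone is misleading: a model that always predicts "good" would already be right about 98.3% of the time. The sketch below illustrates this with scikit-learn's `DummyClassifier` on hypothetical labels that roughly match the class ratio above (17 positives out of 1,000).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels with roughly the notebook's class ratio
y_toy = np.array([1] * 17 + [0] * 983)
X_toy = np.zeros((len(y_toy), 1))  # features are irrelevant for this baseline

# Always predict the majority class
baseline = DummyClassifier(strategy="most_frequent").fit(X_toy, y_toy)
pred = baseline.predict(X_toy)

print("accuracy:", accuracy_score(y_toy, pred))
print("recall on bad credit:", recall_score(y_toy, pred))
```

The baseline reaches high accuracy while catching none of the bad-credit clients, which is why this notebook scores models with ROC AUC rather than accuracy.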