# CREDIT CARD APPROVAL PROJECT

This notebook presents a complete machine learning project to predict **credit card approval** using client application data and historical credit behavior.

We'll work with two datasets:

- `application_record.csv`: Contains demographic and financial attributes for each client (one row per client).
- `credit_record.csv`: Contains monthly credit status history per client (multiple rows per client).

The goal is to build a **model** that predicts whether a client should be approved for a credit card, based on their profile and past credit behavior.

### Key steps in this notebook:

1. **Load and inspect the data**
2. **Clean and merge the datasets**
3. **Define a meaningful target variable** (for example, late payment history as a proxy for credit risk)
4. **Explore and visualize the data**
5. **Engineer useful features**
6. **Train and evaluate classification models**
7. **Interpret model results and feature importance**
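
Steps 2 and 3 can be sketched on a toy example before touching the real files. The snippet assumes the usual reading of the `STATUS` codes for this dataset (`0` to `5` for increasingly long overdue periods, `C` for paid off that month, `X` for no loan that month). It also assumes a client counts as risky once any month shows status `2` or worse. The threshold and the names `BAD_STATUSES`, `target`, `toy_application` and `toy_credit` are illustrative choices, not something fixed by the data.

```python
import pandas as pd

# Toy stand-ins for the two real tables (values are made up for illustration).
toy_application = pd.DataFrame({
    "ID": [1, 2, 3],
    "AMT_INCOME_TOTAL": [100000.0, 50000.0, 75000.0],
})
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["0", "2", "C", "X", "0"],
})

# Assumption: statuses "2".."5" mean 60+ days overdue and mark a client as risky.
BAD_STATUSES = {"2", "3", "4", "5"}

# Collapse the monthly history to one row per client:
# target = 1 if the client was ever seriously overdue, else 0.
target = (
    toy_credit.assign(is_bad=toy_credit["STATUS"].isin(BAD_STATUSES))
    .groupby("ID")["is_bad"]
    .max()
    .astype(int)
    .rename("target")
    .reset_index()
)

# An inner join keeps only clients present in both tables.
toy_merged = toy_application.merge(target, on="ID", how="inner")
print(toy_merged)
```

Client 3 has no credit history in the toy data, so the inner join drops it. Whether to drop or keep such clients in the real data is a separate modelling decision.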

This project will highlight best practices in real-world data science, including dealing with missing values, imbalanced classes, and model evaluation beyond accuracy.


Let's get to it.

In [1]:
# Importing libraries
import pandas as pd
import matplotlib.pyplot as plt

In [2]:
# Now let's load the datasets
df_application = pd.read_csv("data/application_record.csv")
df_credit = pd.read_csv("data/credit_record.csv")

With that done, let's take a look at the two DataFrames we've just loaded.

In [3]:
df_application

,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
438552,6840104,M,N,Y,0,135000.0,Pensioner,Secondary / secondary special,Separated,House / apartment,-22717,365243,1,0,0,0,,1.0
438553,6840222,F,N,N,0,103500.0,Working,Secondary / secondary special,Single / not married,House / apartment,-15939,-3007,1,0,0,0,Laborers,1.0
438554,6841878,F,N,N,0,54000.0,Commercial associate,Higher education,Single / not married,With parents,-8169,-372,1,1,0,0,Sales staff,1.0
438555,6842765,F,N,Y,0,72000.0,Pensioner,Secondary / secondary special,Married,House / apartment,-21673,365243,1,0,0,0,,2.0


As we can see, it is a big dataset, with 438557 rows and 18 columns. Now let's see if it has any missing values.

In [4]:
df_application.isna().sum()

ID                          0
CODE_GENDER                 0
FLAG_OWN_CAR                0
FLAG_OWN_REALTY             0
CNT_CHILDREN                0
AMT_INCOME_TOTAL            0
NAME_INCOME_TYPE            0
NAME_EDUCATION_TYPE         0
NAME_FAMILY_STATUS          0
NAME_HOUSING_TYPE           0
DAYS_BIRTH                  0
DAYS_EMPLOYED               0
FLAG_MOBIL                  0
FLAG_WORK_PHONE             0
FLAG_PHONE                  0
FLAG_EMAIL                  0
OCCUPATION_TYPE        134203
CNT_FAM_MEMBERS             0
dtype: int64

Great, almost every column is free of missing values; only the `OCCUPATION_TYPE` column has any. We'll take care of it later.
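
With 134203 missing entries out of 438557 rows, roughly 30.6% of clients have no occupation recorded. A minimal sketch of how to get that share directly, shown on a toy frame with made-up values (on the real data the equivalent call would be `df_application.isna().mean()`):

```python
import pandas as pd

# Toy frame standing in for df_application (values are illustrative only).
toy = pd.DataFrame({
    "ID": [1, 2, 3, 4],
    "OCCUPATION_TYPE": ["Laborers", None, "Sales staff", None],
})

# isna() gives a boolean frame; the column mean is the fraction of missing values.
missing_share = toy.isna().mean()
print(missing_share)
```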

Now let's take a look at the other dataset and see how many missing values it has.

In [5]:
df_credit

,ID,MONTHS_BALANCE,STATUS
0,5001711,0,X
1,5001711,-1,0
2,5001711,-2,0
3,5001711,-3,0
4,5001712,0,C
...,...,...,...
1048570,5150487,-25,C
1048571,5150487,-26,C
1048572,5150487,-27,C
1048573,5150487,-28,C


This is also a big dataset, even bigger than the last one, with over 1 million rows. Let's look for missing values now.

In [6]:
df_credit.isna().sum()

ID                0
MONTHS_BALANCE    0
STATUS            0
dtype: int64

That is convenient: we have no missing values in this dataset.
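
Since `credit_record.csv` holds several rows per client, it can also help to check how long each client's history is before merging. The sketch below uses a toy frame with made-up values and assumes `MONTHS_BALANCE` counts months backwards from 0 (the most recent month). The column names `n_months` and `oldest_month` are hypothetical.

```python
import pandas as pd

# Toy frame standing in for df_credit (values are illustrative only).
toy_credit = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1],
    "STATUS": ["X", "0", "0", "C", "C"],
})

# One row per client: number of monthly records and the oldest month on file.
history = toy_credit.groupby("ID")["MONTHS_BALANCE"].agg(
    n_months="count",
    oldest_month="min",
)
print(history)
```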

Now let's go back to the `OCCUPATION_TYPE` column in the first dataset.  
Let's see what we're dealing with.

In [7]:
df_application["OCCUPATION_TYPE"]

0                    NaN
1                    NaN
2         Security staff
3            Sales staff
4            Sales staff
               ...      
438552               NaN
438553          Laborers
438554       Sales staff
438555               NaN
438556       Sales staff
Name: OCCUPATION_TYPE, Length: 438557, dtype: object

In [8]:
df_application["OCCUPATION_TYPE"].value_counts()

OCCUPATION_TYPE
Laborers                 78240
Core staff               43007
Sales staff              41098
Managers                 35487
Drivers                  26090
High skill tech staff    17289
Accountants              15985
Medicine staff           13520
Cooking staff             8076
Security staff            7993
Cleaning staff            5845
Private service staff     3456
Low-skill Laborers        2140
Secretaries               2044
Waiters/barmen staff      1665
Realty agents             1041
HR staff                   774
IT staff                   604
Name: count, dtype: int64
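
Note that `value_counts()` skips missing values by default, so the 134203 `NaN` entries do not appear in the list above. One possible way to handle them, sketched here on toy data, is to count them explicitly with `dropna=False` and then replace them with a placeholder category. The label `"Unknown"` is an assumption chosen for illustration, not a decision the notebook has made yet.

```python
import pandas as pd

# Toy series standing in for df_application["OCCUPATION_TYPE"] (made-up values).
occupation = pd.Series(
    ["Laborers", None, "Sales staff", None, "Laborers"],
    name="OCCUPATION_TYPE",
)

# dropna=False keeps the missing values as their own row in the counts.
counts_with_nan = occupation.value_counts(dropna=False)
print(counts_with_nan)

# Assumption: treat a missing occupation as its own category.
filled = occupation.fillna("Unknown")
print(filled.value_counts())
```

Keeping the missing values as a separate category preserves all rows, which may matter here because the gap covers a large part of the dataset.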