# About the Dataset

A Credit Card Dataset for Machine Learning!  
The original source of this data is unknown.

## Context

Credit score cards are a widely used risk control tool in the financial industry. They utilize the personal and application data provided by credit card applicants to estimate the likelihood of future defaults or credit borrowing. Banks use these predictions to determine whether to issue a card, with credit scores quantifying applicant risk.

Traditionally, credit scoring models rely on historical data. However, major economic shifts can reduce the accuracy of these models. Logistic regression is a classical method for credit scoring because it suits binary classification and produces one coefficient per feature. For scorecards, these coefficients are often scaled (e.g., by 100) and rounded.
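As a rough illustration of the scaling and rounding step, the sketch below turns a few coefficients into integer scorecard points. The feature names and coefficient values are invented for the example and are not fitted to this dataset. Multiplying by 100 is just one reading of "scaled by 100".

```python
# Hypothetical logistic-regression coefficients (log-odds per unit of each feature).
# These numbers are invented for illustration; they are NOT fitted to this dataset.
coefficients = {
    "owns_realty": 0.3127,
    "has_overdue_history": -1.2764,
    "years_employed": 0.0481,
}

SCALE = 100  # scaling factor from the text above

# Scale each coefficient and round it to a whole number of points.
scorecard_points = {name: round(coef * SCALE) for name, coef in coefficients.items()}


def score(applicant: dict) -> int:
    """Sum the scorecard points for one applicant's feature values."""
    return sum(scorecard_points[name] * value for name, value in applicant.items())


# Hypothetical applicant: owns property, no overdue history, 5 years employed.
applicant = {"owns_realty": 1, "has_overdue_history": 0, "years_employed": 5}
print(scorecard_points)
print(score(applicant))
```

The integer points make it easy to explain a decision feature by feature, which is the transparency advantage that scorecards are known for.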

With advances in machine learning, techniques such as boosting, random forests, and support vector machines have also been applied to credit scoring. While powerful, these models tend to lack transparency, which makes it hard to explain approval or rejection decisions to customers and regulators.

## Task

The goal is to build a machine learning model to predict whether an applicant is a 'good' or 'bad' client. Unlike some other datasets, the specific definition of 'good' or 'bad' is not provided. You should use methods such as vintage analysis to create your target labels. Be aware that class imbalance is a significant issue in this problem.
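Since the dataset ships no target, one possible labelling is sketched below on a toy frame with the same columns as `credit_record.csv`. The rows are invented, and the rule "60 or more days overdue counts as bad" is an assumption, not a definition given by the dataset. The curve at the end is a simplified, vintage-style view: it pools all clients instead of splitting them by opening cohort, and it uses every client as the denominator.

```python
import pandas as pd

# Toy stand-in for credit_record.csv (same column names, invented rows).
record = pd.DataFrame({
    "ID":             [1, 1, 1, 2, 2, 2, 3, 3],
    "MONTHS_BALANCE": [-2, -1, 0, -2, -1, 0, -1, 0],
    "STATUS":         ["0", "2", "C", "0", "0", "X", "C", "3"],
})

# ASSUMPTION: a month is "bad" when the account is 60+ days overdue (STATUS 2 to 5).
BAD_STATUSES = {"2", "3", "4", "5"}
record["is_bad_month"] = record["STATUS"].isin(BAD_STATUSES)

# Month on book: 0 for the first observed month of each account.
record["month_on_book"] = (
    record["MONTHS_BALANCE"] - record.groupby("ID")["MONTHS_BALANCE"].transform("min")
)

# Per-client target: 1 if the client was ever in a bad status, else 0.
target = record.groupby("ID")["is_bad_month"].max().astype(int).rename("target")

# Simplified vintage-style curve: share of all clients that have gone bad
# by each month on book.
first_bad = record[record["is_bad_month"]].groupby("ID")["month_on_book"].min()
n_clients = record["ID"].nunique()
max_mob = int(record["month_on_book"].max())
cumulative_bad_rate = pd.Series(
    {mob: (first_bad <= mob).sum() / n_clients for mob in range(max_mob + 1)},
    name="cumulative_bad_rate",
)

print(target.value_counts(normalize=True))  # quick check of class balance
print(cumulative_bad_rate)
```

On the real data, the flattening point of such a curve is one way to choose an observation window, and the `value_counts` line gives a first look at how imbalanced the resulting classes are.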

## Data Content

There are two tables, which can be merged using the `ID` column:

### `application_record.csv`

| Feature Name        | Explanation               | Remarks |
|---------------------|---------------------------|---------|
| ID                  | Client number             |         |
| CODE_GENDER         | Gender                    |         |
| FLAG_OWN_CAR        | Owns a car                |         |
| FLAG_OWN_REALTY     | Owns property             |         |
| CNT_CHILDREN        | Number of children        |         |
| AMT_INCOME_TOTAL    | Annual income             |         |
| NAME_INCOME_TYPE    | Income category           |         |
| NAME_EDUCATION_TYPE | Education level           |         |
| NAME_FAMILY_STATUS  | Marital status            |         |
| NAME_HOUSING_TYPE   | Housing type              |         |
| DAYS_BIRTH          | Date of birth             | Days counted backwards from the current day (0); -1 means yesterday. |
| DAYS_EMPLOYED       | Start date of employment  | Days counted backwards from the current day (0). A positive value means the person is currently unemployed. |
| FLAG_MOBIL          | Has a mobile phone        |         |
| FLAG_WORK_PHONE     | Has a work phone          |         |
| FLAG_PHONE          | Has a phone               |         |
| FLAG_EMAIL          | Has an email address      |         |
| OCCUPATION_TYPE     | Occupation                |         |
| CNT_FAM_MEMBERS     | Family size               |         |
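The backward day counts in `DAYS_BIRTH` and `DAYS_EMPLOYED` are easier to read once converted to years. The sketch below shows one way to do that on a toy frame. The derived column names and the 365.25-day year are assumptions, and the positive `DAYS_EMPLOYED` value is invented to represent an unemployed applicant.

```python
import pandas as pd

# Toy rows: the negative values mirror the sample rows shown in this notebook,
# the positive DAYS_EMPLOYED value is invented to represent "unemployed".
data = pd.DataFrame({
    "DAYS_BIRTH":    [-12005, -21474, -19110],
    "DAYS_EMPLOYED": [-4542, -1134, 1000],
})

DAYS_PER_YEAR = 365.25  # ASSUMPTION: average year length including leap years

# Days are counted backwards from today, so negate them to get elapsed time.
data["age_years"] = (-data["DAYS_BIRTH"] / DAYS_PER_YEAR).round(1)

# A positive DAYS_EMPLOYED marks a currently unemployed person.
data["is_unemployed"] = data["DAYS_EMPLOYED"] > 0

# Years in the current job; left as NaN for unemployed applicants.
data["years_employed"] = (
    (-data["DAYS_EMPLOYED"]).where(~data["is_unemployed"]) / DAYS_PER_YEAR
).round(1)

print(data)
```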

### `credit_record.csv`

| Feature Name   | Explanation   | Remarks |
|----------------|---------------|---------|
| ID             | Client number |         |
| MONTHS_BALANCE | Record month  | Counted backwards from the month the data was extracted: 0 is the current month, -1 is the previous month, and so on. |
| STATUS         | Status        | 0: 1–29 days past due<br>1: 30–59 days past due<br>2: 60–89 days past due<br>3: 90–119 days past due<br>4: 120–149 days past due<br>5: More than 150 days past due, bad debt, or written off<br>C: Paid off that month<br>X: No loan for the month |
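Because `credit_record.csv` holds one row per client per month, it usually has to be reduced to one row per `ID` before it can be joined to `application_record.csv`. The sketch below shows that pattern on toy frames with invented rows. The ordinal `STATUS_SEVERITY` mapping and the summary column names are assumptions made for the example.

```python
import pandas as pd

# Toy stand-ins for the two CSV files (column names from the tables, rows invented).
application = pd.DataFrame({
    "ID": [10, 11, 12],
    "AMT_INCOME_TOTAL": [427500.0, 112500.0, 270000.0],
})
record = pd.DataFrame({
    "ID":             [10, 10, 10, 11, 99],
    "MONTHS_BALANCE": [0, -1, -2, 0, 0],
    "STATUS":         ["X", "0", "1", "C", "0"],
})

# ASSUMPTION: an ordinal encoding in which C and X (nothing overdue) rank below "0".
STATUS_SEVERITY = {"X": -1, "C": -1, "0": 0, "1": 1, "2": 2, "3": 3, "4": 4, "5": 5}
record["severity"] = record["STATUS"].map(STATUS_SEVERITY)

# Reduce the monthly history to one summary row per client.
per_client = (
    record.groupby("ID")
    .agg(n_months=("MONTHS_BALANCE", "count"), worst_severity=("severity", "max"))
    .reset_index()
)

# An inner join keeps only clients that appear in both tables.
merged = application.merge(per_client, on="ID", how="inner")
print(merged)
```

In this toy data, client 12 has no credit history and client 99 has no application, so both drop out of the inner join. Whether to keep such clients on the real data is a modelling choice.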

---


In [0]:
%pip install xgboost lightgbm catboost  --quiet

Note: you may need to restart the kernel using %restart_python or dbutils.library.restartPython() to use updated packages.


In [0]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

In [0]:
record = pd.read_csv("Data/credit_record.csv")
display(record.head())

ID,MONTHS_BALANCE,STATUS
5001711,0,X
5001711,-1,0
5001711,-2,0
5001711,-3,0
5001712,0,C


In [0]:
data = pd.read_csv("Data/application_record.csv")
data.head()

Unnamed: 0,ID,CODE_GENDER,FLAG_OWN_CAR,FLAG_OWN_REALTY,CNT_CHILDREN,AMT_INCOME_TOTAL,NAME_INCOME_TYPE,NAME_EDUCATION_TYPE,NAME_FAMILY_STATUS,NAME_HOUSING_TYPE,DAYS_BIRTH,DAYS_EMPLOYED,FLAG_MOBIL,FLAG_WORK_PHONE,FLAG_PHONE,FLAG_EMAIL,OCCUPATION_TYPE,CNT_FAM_MEMBERS
0,5008804,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
1,5008805,M,Y,Y,0,427500.0,Working,Higher education,Civil marriage,Rented apartment,-12005,-4542,1,1,0,0,,2.0
2,5008806,M,Y,Y,0,112500.0,Working,Secondary / secondary special,Married,House / apartment,-21474,-1134,1,0,0,0,Security staff,2.0
3,5008808,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0
4,5008809,F,N,Y,0,270000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-19110,-3051,1,0,1,1,Sales staff,1.0


In [0]:
record.isnull().sum()

ID                0
MONTHS_BALANCE    0
STATUS            0
dtype: int64

In [0]:
for col in record.columns:
    print(f"\n Column: {col}")
    print(record[col].unique())


 Column: ID
[5001711 5001712 5001713 ... 5150484 5150485 5150487]

 Column: MONTHS_BALANCE
[  0  -1  -2  -3  -4  -5  -6  -7  -8  -9 -10 -11 -12 -13 -14 -15 -16 -17
 -18 -19 -20 -21 -22 -23 -24 -25 -26 -27 -28 -29 -30 -31 -32 -33 -34 -35
 -36 -37 -38 -39 -40 -41 -42 -43 -44 -45 -46 -47 -48 -49 -50 -51 -52 -53
 -54 -55 -56 -57 -58 -59 -60]

 Column: STATUS
['X' '0' 'C' '1' '2' '3' '4' '5']


In [0]:
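# Earliest recorded month per client (most negative MONTHS_BALANCE),
# i.e. how many months back each client's history starts.
begin_month = (
    record.groupby("ID")['MONTHS_BALANCE']
    .min()
    .rename("begin_month")
    .to_frame()
)
begin_month.head()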

ID,begin_month
5001711,-3
5001712,-18
5001713,-21
5001714,-14
5001715,-59
