## Data Dictionary

Below is a detailed description of each feature in both tables included in the Credit Card Approval Prediction dataset:

### application_record.csv

| Feature Name         | Explanation                | Remarks                                                                                           |
|----------------------|---------------------------|---------------------------------------------------------------------------------------------------|
| ID                   | Client number             | Unique identifier for each client                                                                  |
| CODE_GENDER          | Gender                    | Values: M (male), F (female)                                                                      |
| FLAG_OWN_CAR         | Is there a car            | Y if the client owns a car, N otherwise                                                           |
| FLAG_OWN_REALTY      | Is there a property       | Y if the client owns property, N otherwise                                                        |
| CNT_CHILDREN         | Number of children        | Number of children the client has                                                                  |
| AMT_INCOME_TOTAL     | Annual income             | Client's annual income                                                                            |
| NAME_INCOME_TYPE     | Income category           | Type of income: e.g., Working, Commercial associate, Pensioner, State servant, Unemployed         |
| NAME_EDUCATION_TYPE  | Education level           | Level of education: e.g., Higher education, Secondary, Academic degree, Lower secondary           |
| NAME_FAMILY_STATUS   | Marital status            | Client's family status: e.g., Single, Married, Civil marriage, Separated, Widow                   |
| NAME_HOUSING_TYPE    | Way of living             | Type of housing: e.g., Own house/apartment, With parents, Municipal, Co-op, Rented, Office apt    |
| DAYS_BIRTH           | Birthday                  | Counted backwards from current day (0); negative values, -1 means yesterday                       |
| DAYS_EMPLOYED        | Start date of employment  | Counted backwards from current day (0). A positive value (365243 in this data) means the client is currently unemployed |
| FLAG_MOBIL           | Is there a mobile phone   | 1 if the client has a mobile phone, 0 otherwise                                                   |
| FLAG_WORK_PHONE      | Is there a work phone     | 1 if the client has a work phone, 0 otherwise                                                     |
| FLAG_PHONE           | Is there a phone          | 1 if the client has a phone, 0 otherwise                                                          |
| FLAG_EMAIL           | Is there an email         | 1 if the client has an email address, 0 otherwise                                                 |
| OCCUPATION_TYPE      | Occupation                | Client's occupation                                                                               |
| CNT_FAM_MEMBERS      | Family size               | Number of family members                                                                          |
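
The backward-counted day fields can be hard to read at a glance. The sketch below, on a few invented rows (not taken from the dataset), shows one way to turn `DAYS_BIRTH` and `DAYS_EMPLOYED` into whole years. The column names `age_years`, `is_employed`, and `years_employed` are hypothetical, and dividing by 365 ignores leap years.

```python
import numpy as np
import pandas as pd

# Toy rows for illustration only; these values are invented.
toy = pd.DataFrame({
    "DAYS_BIRTH": [-12005, -21474, -365],
    "DAYS_EMPLOYED": [-4542, 365243, -100],
})

# DAYS_BIRTH is negative (days before the extraction date), so negate it first.
toy["age_years"] = (-toy["DAYS_BIRTH"]) // 365

# A positive DAYS_EMPLOYED marks a client who is not currently employed.
toy["is_employed"] = toy["DAYS_EMPLOYED"] < 0

# Employed clients get whole years of employment; everyone else gets 0.
toy["years_employed"] = np.where(
    toy["is_employed"], (-toy["DAYS_EMPLOYED"]) // 365, 0
)

print(toy)
```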

### credit_record.csv

| Feature Name    | Explanation              | Remarks                                                                                                                                                                                      |
|-----------------|-------------------------|----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| ID              | Client number           | Unique identifier for each client                                                                                                                      |
| MONTHS_BALANCE  | Record month            | The month of the extracted data is the starting point; 0 is the current month, -1 is the previous month, and so on                                    |
| STATUS          | Status                  | Payment status: <br>0 = 1–29 days past due <br>1 = 30–59 days past due <br>2 = 60–89 days past due <br>3 = 90–119 days past due <br>4 = 120–149 days past due <br>5 = More than 150 days past due, bad debt, or written off <br>C = Paid off that month <br>X = No loan for the month |
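
As a rough illustration of how `MONTHS_BALANCE` and `STATUS` combine, the sketch below maps each status code to the lower bound of its days-past-due bucket and summarises each client. The rows are invented, the names `STATUS_MIN_DAYS`, `min_days_past_due`, `first_month`, and `worst_days_past_due` are hypothetical, and treating `C` and `X` as 0 days is an assumption made for this example.

```python
import pandas as pd

# Lower bound of days past due for each STATUS code, following the table above.
# "C" (paid off) and "X" (no loan) are treated as 0 days; this is an assumption.
STATUS_MIN_DAYS = {
    "0": 1, "1": 30, "2": 60, "3": 90, "4": 120, "5": 150, "C": 0, "X": 0,
}

# Toy credit history for three clients; values are invented.
toy = pd.DataFrame({
    "ID": [1, 1, 1, 2, 2, 3],
    "MONTHS_BALANCE": [0, -1, -2, 0, -1, 0],
    "STATUS": ["C", "0", "2", "X", "0", "5"],
})

toy["min_days_past_due"] = toy["STATUS"].map(STATUS_MIN_DAYS)

# Per client: the earliest recorded month and the worst delinquency bucket seen.
summary = toy.groupby("ID").agg(
    first_month=("MONTHS_BALANCE", "min"),
    worst_days_past_due=("min_days_past_due", "max"),
)

print(summary)
```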

---

This data dictionary provides clarity on the attributes available for modeling and feature engineering in the credit approval prediction task.


In [0]:
%pip install xgboost lightgbm catboost  --quiet

In [0]:
%pip install pycaret --quiet

In [0]:
import warnings
warnings.filterwarnings('ignore')
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import LabelEncoder

from xgboost import XGBClassifier
from lightgbm import LGBMClassifier
from catboost import CatBoostClassifier

import pycaret
from pycaret.classification import *

In [0]:
record = pd.read_csv("Data/credit_record.csv")
display(record.head())

In [0]:
data = pd.read_csv("Data/application_record.csv")
data.head()

In [0]:
record.isnull().sum()

In [0]:
for col in record.columns:
    print(f"\n Column: {col}")
    print(record[col].unique())


# Pre-Processing

In [0]:
begin_month = (
    record.groupby("ID")['MONTHS_BALANCE']
    .min()
    .rename("begin_month")
    .to_frame()
)
begin_month.head()

In [0]:

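# Flag months in which the client was 60 or more days past due (STATUS 2-5).
record['dep_value'] = None
record.loc[record['STATUS'].isin(['2', '3', '4', '5']), 'dep_value'] = 'Yes'

# A client is labelled 'Yes' if any of their months was flagged, otherwise 'No'.
cpunt = record.groupby('ID')['dep_value'].apply(lambda x: 'Yes' if (x == 'Yes').any() else 'No')
cpunt = cpunt.to_frame()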


In [0]:
print(cpunt['dep_value'].value_counts())
cpunt['dep_value'].value_counts(normalize=True)

In [0]:
data = pd.merge(data, begin_month, how = "left", on = "ID")
data = pd.merge(data, cpunt, how = "inner", on = "ID" )
data['label'] = data['dep_value'].map({'Yes': 1, 'No': 0})
display(data.head())

In [0]:
data.info()

In [0]:
data = data.rename(columns={'CODE_GENDER':'Gender','FLAG_OWN_CAR':'Car','FLAG_OWN_REALTY':'Reality',
                         'CNT_CHILDREN':'ChldNo','AMT_INCOME_TOTAL':'inc',
                         'NAME_EDUCATION_TYPE':'edutp','NAME_FAMILY_STATUS':'famtp',
                        'NAME_HOUSING_TYPE':'houtp','FLAG_EMAIL':'email',
                         'NAME_INCOME_TYPE':'inctp','FLAG_WORK_PHONE':'wkphone',
                         'FLAG_PHONE':'phone','CNT_FAM_MEMBERS':'famsize',
                        'OCCUPATION_TYPE':'occyp'
                        })

In [0]:
data.isnull().sum()

Don't forget to change `DAYS_EMPLOYED`.

In [0]:
data = data.drop(columns = ['dep_value', 'email', 'phone', 'wkphone', 'FLAG_MOBIL'])

In [0]:
data.info()

In [0]:
display(data.head()) 

In [0]:
data['occyp'] = data['occyp'].fillna('Unknown')
print("\n Gender :", data['Gender'].unique())
print("\n Car :", data['Car'].unique())
print("\n Reality :", data['Reality'].unique())
print("\n ChldNo :", data['ChldNo'].unique())
print("\n inctp :", data['inctp'].unique())
print("\n edutp :", data['edutp'].unique())
print("\n famtp :", data['famtp'].unique())
print("\n houtp :", data['houtp'].unique())
print("\n occyp :", data['occyp'].unique())

In [0]:


data['Gender'] = data['Gender'].map({'F': 0, 'M': 1})
data['Car'] = data['Car'].map({'N': 0, 'Y': 1})
data['Reality'] = data['Reality'].map({'N': 0, 'Y': 1})

data = pd.get_dummies(data, columns = ['inctp', 'edutp', 'famtp', 'houtp', 'occyp'], dtype='int32')


In [0]:
for col in ['Gender', 'Car', 'Reality']:
    data[col] = data[col].astype('int32')


In [0]:
display(data.head())

In [0]:
# DAYS_EMPLOYED == 365243 is the placeholder used for clients who are not employed.
data['EMPLOYED_FLAG'] = (data['DAYS_EMPLOYED'] != 365243).astype(int)

In [0]:
data['AGE_YEARS'] = (-data['DAYS_BIRTH']) // 365
data['YEARS_EMPLOYED'] = np.where(data['DAYS_EMPLOYED'] == 365243, 0, -data['DAYS_EMPLOYED'] // 365)
data = data.drop(columns = ['DAYS_BIRTH', 'DAYS_EMPLOYED'])

print("\n YEARS_EMPLOYED: ", data['YEARS_EMPLOYED'].unique())

In [0]:
display(data.head())

In [0]:
print(data['label'].unique()) # 0 = good client (never 60+ days past due), 1 = bad client

In [0]:
data.info()

# Model

In [0]:
stp = setup(
    data = data,
    target= 'label',
    train_size=0.8,
    ignore_features=['ID'],
    fix_imbalance=True,
    session_id=123
)

In [0]:
best_model = compare_models()

# Model Evaluation Summary

Several classification models were tested to predict the target variable. The main metrics used to compare models were Accuracy, AUC (Area Under the ROC Curve), Recall, Precision, F1-Score, Kappa, and MCC (Matthews Correlation Coefficient).

- **Accuracy:** Measures the overall percentage of correct predictions but can be misleading if the data is imbalanced.
- **AUC:** Reflects the model’s ability to distinguish between classes; values closer to 1 are better.
- **Recall:** Indicates how well the model detects positive cases.
- **Precision:** Measures how many predicted positives are actually correct.
- **F1-Score:** Balances Precision and Recall into a single metric.
- **Kappa and MCC:** Assess the model's performance accounting for chance agreement and class imbalance.
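
To make these definitions concrete, the following sketch computes each metric with scikit-learn on a small set of invented labels and scores. The 0.5 decision threshold and the names `y_true`, `y_score`, `y_pred`, and `metrics` are assumptions for illustration; these are not results from this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    cohen_kappa_score,
    f1_score,
    matthews_corrcoef,
    precision_score,
    recall_score,
    roc_auc_score,
)

# Toy labels and predicted scores, invented for illustration (1 = positive class).
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.4, 0.6, 0.2, 0.9, 0.8, 0.4, 0.3]

# Hard predictions at an assumed threshold of 0.5.
y_pred = [int(s >= 0.5) for s in y_score]

metrics = {
    "Accuracy": accuracy_score(y_true, y_pred),
    "AUC": roc_auc_score(y_true, y_score),  # AUC uses the scores, not the hard labels
    "Recall": recall_score(y_true, y_pred),
    "Precision": precision_score(y_true, y_pred),
    "F1": f1_score(y_true, y_pred),
    "Kappa": cohen_kappa_score(y_true, y_pred),
    "MCC": matthews_corrcoef(y_true, y_pred),
}

for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```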

The best-performing model on this dataset was the **CatBoost Classifier**, which offered the best balance of Recall, Precision, and overall discrimination (AUC). The Random Forest and Extra Trees classifiers also performed well.

Models with high accuracy but low recall or precision (like the Dummy classifier) are not reliable because they tend to predict only the majority class.
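
The effect is easy to reproduce on synthetic data. The sketch below fits scikit-learn's `DummyClassifier` with the `most_frequent` strategy on an invented 95/5 class split; the data and the names `acc` and `rec` are assumptions for illustration and do not come from this dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Synthetic, heavily imbalanced labels: 95 negatives and 5 positives.
X = np.zeros((100, 1))  # the dummy model ignores the features
y = np.array([0] * 95 + [1] * 5)

# Always predict the most frequent class seen during fit.
dummy = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = dummy.predict(X)

acc = accuracy_score(y, y_pred)
rec = recall_score(y, y_pred)

print(f"Accuracy: {acc:.2f}")
print(f"Recall:   {rec:.2f}")
```

Accuracy looks strong only because the majority class dominates; the model never identifies a single positive case.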


In [0]:
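# Train CatBoost with cross-validation on the training split.
catboost_model = create_model('catboost')

# finalize_model refits on the full dataset, including the hold-out set,
# so keep it under a separate name and evaluate the non-finalized model below.
final_model = finalize_model(catboost_model)

print(final_model)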

In [0]:
evaluate_model(catboost_model)

In [0]:
plot_model(catboost_model, plot = 'auc')

In [0]:
plot_model(catboost_model, plot = 'pr')

In [0]:
plot_model(catboost_model, plot='feature')

In [0]:
plot_model(catboost_model, plot = 'confusion_matrix')