In [1]:
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# **CONTRACT table (6,234,093 rows)**
- `SKP_CREDIT_CASE`: ID for credit (4,280,377 distinct) -> rows with a duplicated ID are dropped (they carry too little information), keeping one row per ID -> **Primary Key**
- `SKP_CLIENT`: ID for customer (1,000,528 distinct) -> **Foreign Key to CUSTOMER table**
- `NAME_EDUCATION_TYPE`: Education level (`Elementary (primary) school`, `Junior school education`, `High school education`, `Bachelor's degree`, `Master's degree`, `XNA`)
- `CNT_CHILDREN`: Number of children (range 0-100; the upper end looks implausible and should be checked)
- `AMT_INCOME_MAIN`: Main personal income (VND)
- `AMT_INCOME_HOUSEHOLD`: Total household income (VND)
- `NAME_INCOME_TYPE`: Type of income (`Employed person`, `Self-employed person / business owner`, `Retired person`, `Person in household`, `Student`, `Unemployed`, `XNA`)
- `CODE_PROFESSION`: Profession code (`WORKER`, `SALESMAN`, `FARMER`, `ADMINISTRATIVE`, `SERVICES`, `ENGINEER`, `OTHER`, `XNA`)
- `NAME_CREDIT_STATUS`: Credit status of the application

| Status            | Description                                                                                               | Status group (`STATUS_GROUP`) |
| ----------------- | --------------------------------------------------------------------------------------------------------- | ----------------------------- |
| **In Preprocess** | Application is being prepared before formal processing (pre-assessment, document preparation).            | In Progress                   |
| **In Process**    | Application is being processed (assessment, approval review, ...).                                        | In Progress                   |
| **Rejected**      | Loan application was rejected and not approved.                                                           | Risky                         |
| **Cancelled**     | The customer or the bank cancelled the application before disbursement or contract signing.               | Risky                         |
| **Approved**      | Loan has been approved but not necessarily disbursed.                                                     | Positive                      |
| **Signed**        | Customer has signed the loan contract, but the loan has not been disbursed or is pending processing.      | Positive                      |
| **Active**        | Loan is in force; the customer is repaying in installments (an outstanding balance remains).              | Active                        |
| **Finished**      | Loan is completed (fully repaid or ended on schedule).                                                    | Positive                      |
| **Paid off**      | Loan has been fully settled (usually repaid early, before maturity).                                      | Positive                      |
| **Written off**   | Loan has been written off (treated as a total loss) because the customer cannot repay.                    | Risky                         |
| **Sold**          | Loan has been sold to a third party (for example, a debt collection company or an investment fund).       | Risky                         |

- `PRODUCT`: Home Credit product name/type (missing 1%)

| Product code (`PRODUCT`) | Meaning                     | Details                                                                                  |
| ------------------------ | --------------------------- | ---------------------------------------------------------------------------------------- |
| **CL**                   | **Cash Loan**               | Unsecured cash loan; no collateral required.                                             |
| **CD**                   | **Consumer Durables**       | Loan for purchasing electronics and household appliances.                                |
| **BNPL**                 | **Buy Now Pay Later**       | Buy now, pay later; typically used for e-commerce transactions and installment purchases. |
| **CC**                   | **Credit Card**             | Credit card; the user spends first and repays later, within a credit limit.              |
| **TW**                   | **Two-Wheeler Loan**        | Loan for purchasing a two-wheeled vehicle (motorbike), common in developing countries.   |
| **CW**                   | **Car/Consumer Wheel Loan** | Loan for purchasing a car or a vehicle larger than a motorbike (new product line).       |
| **IN**                   | **Insurance Loan/Product**  | Insurance-related product.                                                               |

- `AMT_CREDIT`: Loan amount requested or approved (VND) (missing 1%)
- `PAYMENT_NUM`: Number of payments/installments (tenor) (missing 9%)
- `INIT_PAY`: Initial payment or down payment (VND) (missing 1%)
- `ANNUITY`: Monthly installment (VND) (missing 9%)
- `SKP_SALESROOM`: ID for salesroom (34176 distinct)
- `APPLY_CONTRACT_TIME`: Timestamp when the contract was applied for
- `APPROVE_CONTRACT_TIME`: Timestamp when the contract was approved (missing 29%) (related to `NAME_CREDIT_STATUS`)
- `SIGN_CONTRACT_TIME`: Timestamp when the contract was signed (missing 36%) (related to `NAME_CREDIT_STATUS`)
- `APPLY_EMPLOYEE`: Staff code who created the application (`vn_api_fin`, `OpenAPI_User`, `Koyal_User`, `BNPL_ONBOARDING_TECH_USER`, `EPOS2_User`, `EPOS_User`, `BNPL_Onboarding`,...) -> **Foreign Key to EMPLOYEE table**
- `SIGN_EMPLOYEE`: Staff code who processed the contract signing (same codes as `APPLY_EMPLOYEE`) (missing 36%) (related to `NAME_CREDIT_STATUS`) -> **Foreign Key to EMPLOYEE table**
- `TRANSAC`: Transaction or installment channel code (missing 95%) (meaning unconfirmed)
- `FIRST_DUE`: First overdue (1 = overdue, 0 = no overdue)
- `SECOND_DUE`: Second overdue (0/1)
- `THIRD_DUE`: Third overdue (0/1)
- `FOURTH_DUE`: Fourth overdue (0/1)
- `AMT_BILLING`: First billed amount for revolving loans (BNPL or CC) (missing 95%)
- `FLAG_INS`: Insurance flag (1 = has insurance, 0 = no insurance)
- `APPLY_EMPLOYEE_LEVEL`: Seniority of the staff member who created the application (`JUNIOR`, `SENIOR`, `WARRIOR`, `MASTER`) (missing 88%)
- `SIGN_EMPLOYEE_LEVEL`: Seniority of the staff member who processed the contract signing (same levels as `APPLY_EMPLOYEE_LEVEL`) (missing 92%)
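
The status table above assigns each `NAME_CREDIT_STATUS` value to a `STATUS_GROUP`. A minimal sketch of applying that mapping with pandas is shown below; the helper name `add_status_group` and the toy frame are illustrative and not part of the original notebook.

```python
import pandas as pd

# Mapping copied from the status table in the data dictionary
STATUS_GROUP = {
    "In Preprocess": "In Progress",
    "In Process": "In Progress",
    "Rejected": "Risky",
    "Cancelled": "Risky",
    "Approved": "Positive",
    "Signed": "Positive",
    "Active": "Active",
    "Finished": "Positive",
    "Paid off": "Positive",
    "Written off": "Risky",
    "Sold": "Risky",
}


def add_status_group(df, status_col="NAME_CREDIT_STATUS"):
    """Return a copy of df with a STATUS_GROUP column (hypothetical helper)."""
    out = df.copy()
    out["STATUS_GROUP"] = out[status_col].map(STATUS_GROUP)
    return out


# Toy data, not taken from CONTRACT.csv
demo = add_status_group(
    pd.DataFrame({"NAME_CREDIT_STATUS": ["Finished", "Rejected", "Active", "In Process"]})
)
print(demo)
```

Statuses that are absent from the dictionary map to `NaN`, which makes unexpected labels easy to spot.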

In [2]:
contract_df = pd.read_csv(r'data/CONTRACT.csv')
contract_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6234093 entries, 0 to 6234092
Data columns (total 29 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SKP_CREDIT_CASE        int64  
 1   SKP_CLIENT             int64  
 2   NAME_EDUCATION_TYPE    object 
 3   CNT_CHILDREN           int64  
 4   AMT_INCOME_MAIN        float64
 5   AMT_INCOME_HOUSEHOLD   int64  
 6   NAME_INCOME_TYPE       object 
 7   CODE_PROFESSION        object 
 8   NAME_CREDIT_STATUS     object 
 9   PRODUCT                object 
 10  AMT_CREDIT             float64
 11  PAYMENT_NUM            float64
 12  INIT_PAY               float64
 13  ANNUITY                float64
 14  SKP_SALESROOM          int64  
 15  APPLY_CONTRACT_TIME    object 
 16  APPROVE_CONTRACT_TIME  object 
 17  SIGN_CONTRACT_TIME     object 
 18  APPLY_EMPLOYEE         object 
 19  SIGN_EMPLOYEE          object 
 20  TRANSAC                object 
 21  FIRST_DUE              int64  
 22  SECOND_DUE        
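
`info()` reports the three contract timestamps as `object`, that is, plain strings. The sketch below converts them with `pd.to_datetime` and derives a processing time. It assumes the values are in a format pandas can infer (the toy values use `YYYY-MM-DD HH:MM:SS`, which is an assumption because these columns are elided in the preview), and `HOURS_TO_APPROVE` is an illustrative column name.

```python
import pandas as pd

TIME_COLS = ["APPLY_CONTRACT_TIME", "APPROVE_CONTRACT_TIME", "SIGN_CONTRACT_TIME"]


def parse_contract_times(df):
    """Parse the timestamp columns and add hours from application to approval."""
    out = df.copy()
    for col in TIME_COLS:
        # errors="coerce" turns unparseable or missing values into NaT
        out[col] = pd.to_datetime(out[col], errors="coerce")
    delta = out["APPROVE_CONTRACT_TIME"] - out["APPLY_CONTRACT_TIME"]
    out["HOURS_TO_APPROVE"] = delta.dt.total_seconds() / 3600
    return out


# Toy data with an assumed timestamp format
toy = pd.DataFrame({
    "APPLY_CONTRACT_TIME": ["2024-01-01 08:00:00", "2024-01-02 09:00:00"],
    "APPROVE_CONTRACT_TIME": ["2024-01-01 10:30:00", None],
    "SIGN_CONTRACT_TIME": ["2024-01-01 12:00:00", None],
})
parsed = parse_contract_times(toy)
print(parsed[["APPLY_CONTRACT_TIME", "APPROVE_CONTRACT_TIME", "HOURS_TO_APPROVE"]])
```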

In [3]:
contract_df

Unnamed: 0,SKP_CREDIT_CASE,SKP_CLIENT,NAME_EDUCATION_TYPE,CNT_CHILDREN,AMT_INCOME_MAIN,AMT_INCOME_HOUSEHOLD,NAME_INCOME_TYPE,CODE_PROFESSION,NAME_CREDIT_STATUS,PRODUCT,...,SIGN_EMPLOYEE,TRANSAC,FIRST_DUE,SECOND_DUE,THIRD_DUE,FOURTH_DUE,AMT_BILLING,FLAG_INS,APPLY_EMPLOYEE_LEVEL,SIGN_EMPLOYEE_LEVEL
0,46804792,13152490,XNA,0,0.0,0,XNA,XNA,Finished,CD,...,,,0,0,0,0,,1,,
1,54355719,14915647,XNA,0,0.0,0,XNA,XNA,Finished,CD,...,,,0,0,0,0,,1,,
2,43078344,11009338,XNA,0,0.0,0,XNA,XNA,Rejected,CD,...,,,0,0,0,0,,0,,
3,199270422,3489209,XNA,0,9000000.0,0,XNA,XNA,Rejected,CD,...,,,0,0,0,0,,0,,
4,199467152,62216381,Bachelor's degree,0,10000000.0,0,Employed person,OTHER,Finished,CD,...,0027039,,0,0,0,0,,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6234088,117314500,4244312,Elementary (primary) school,1,7000000.0,0,Employed person,WORKER,Cancelled,CL,...,,,0,0,0,0,,0,,
6234089,321433839,11854609,XNA,0,10000000.0,0,Employed person,OTHER,Finished,CL,...,00049441,,0,0,0,0,,0,,WARRIOR
6234090,81343486,12603259,XNA,0,0.0,0,XNA,XNA,Finished,CD,...,0034562,,0,0,0,0,,1,,
6234091,278360564,40677477,Junior school education,0,10000000.0,0,Self-employed person / business owner,WORKER,Cancelled,BNPL,...,,,0,0,0,0,,0,,


In [4]:
# Percentage of missing values per column (only columns that have any)
missing_values = contract_df.isnull().mean() * 100
missing_values = missing_values[missing_values > 0].sort_values(ascending=False)
missing_values_df = pd.DataFrame(missing_values).reset_index()
missing_values_df.columns = ['Column', 'Percentage of Missing Values']
missing_values_df

Unnamed: 0,Column,Percentage of Missing Values
0,AMT_BILLING,93.521784
1,TRANSAC,93.52172
2,SIGN_EMPLOYEE_LEVEL,91.132086
3,APPLY_EMPLOYEE_LEVEL,88.403381
4,SIGN_CONTRACT_TIME,32.198525
5,SIGN_EMPLOYEE,32.198525
6,APPROVE_CONTRACT_TIME,23.012522
7,ANNUITY,10.646088
8,PAYMENT_NUM,10.555698
9,PRODUCT,0.454725
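
The data dictionary notes that missing approval and signing fields are related to `NAME_CREDIT_STATUS`. One way to check this is to compute the missing rate per status, sketched below on toy data; the helper `missing_rate_by_group` is hypothetical.

```python
import pandas as pd


def missing_rate_by_group(df, group_col, value_cols):
    """Percentage of missing values in value_cols for each level of group_col."""
    return df[value_cols].isna().groupby(df[group_col]).mean() * 100


# Toy data, not taken from CONTRACT.csv
toy = pd.DataFrame({
    "NAME_CREDIT_STATUS": ["Rejected", "Rejected", "Finished", "Finished"],
    "SIGN_CONTRACT_TIME": [None, None, "2024-01-02 10:00:00", "2024-02-03 09:30:00"],
    "APPROVE_CONTRACT_TIME": [None, "2024-01-01 09:00:00", "2024-01-01 12:00:00", "2024-02-01 08:00:00"],
})
rates = missing_rate_by_group(
    toy, "NAME_CREDIT_STATUS", ["SIGN_CONTRACT_TIME", "APPROVE_CONTRACT_TIME"]
)
print(rates)
```

If the relationship holds in the real table, statuses such as `Rejected` and `Cancelled` would be expected to show rates near 100% for the signing fields; that expectation is an assumption to verify, not a result.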


In [5]:
# Count the number of distinct contract id
distinct_contracts = contract_df['SKP_CREDIT_CASE'].nunique()
print(f"Number of distinct contracts: {distinct_contracts}")

Number of distinct contracts: 4280377


In [6]:
# Flag every row whose contract ID occurs more than once
is_duplicated = contract_df['SKP_CREDIT_CASE'].duplicated(keep=False)

# Rows whose contract ID occurs exactly once
non_duplicated_contracts = contract_df[~is_duplicated]
print(f"Number of non-duplicated contracts: {len(non_duplicated_contracts)}")

# Rows whose contract ID occurs more than once
duplicated_contracts = contract_df[is_duplicated]
print(f"Number of duplicated contracts: {len(duplicated_contracts)}")

Number of non-duplicated contracts: 2326915
Number of duplicated contracts: 3907178


In [7]:
# Drop rows that are exact copies of another row across all columns
duplicated_contracts = duplicated_contracts.drop_duplicates()
print(f"Number of duplicated contracts: {len(duplicated_contracts)}")

Number of duplicated contracts: 3906868


In [8]:
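# Keep one row per contract ID: the highest AMT_INCOME_MAIN wins; on ties, the
# alphabetically first education label is kept, so 'XNA' has the lowest priority
duplicated_contracts = (
    duplicated_contracts
    .sort_values(by=['SKP_CREDIT_CASE', 'AMT_INCOME_MAIN', 'NAME_EDUCATION_TYPE'], ascending=[True, True, False])
    .drop_duplicates(subset=['SKP_CREDIT_CASE'], keep='last')
)
print(f"Number of duplicated contracts after sorting and dropping: {len(duplicated_contracts)}")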

Number of duplicated contracts after sorting and dropping: 1953462


In [9]:
# Append the non-duplicated contracts to the duplicated contracts
contract_df = pd.concat([non_duplicated_contracts, duplicated_contracts], ignore_index=True)
contract_df

Unnamed: 0,SKP_CREDIT_CASE,SKP_CLIENT,NAME_EDUCATION_TYPE,CNT_CHILDREN,AMT_INCOME_MAIN,AMT_INCOME_HOUSEHOLD,NAME_INCOME_TYPE,CODE_PROFESSION,NAME_CREDIT_STATUS,PRODUCT,...,SIGN_EMPLOYEE,TRANSAC,FIRST_DUE,SECOND_DUE,THIRD_DUE,FOURTH_DUE,AMT_BILLING,FLAG_INS,APPLY_EMPLOYEE_LEVEL,SIGN_EMPLOYEE_LEVEL
0,199270422,3489209,XNA,0,9000000.0,0,XNA,XNA,Rejected,CD,...,,,0,0,0,0,,0,,
1,199467152,62216381,Bachelor's degree,0,10000000.0,0,Employed person,OTHER,Finished,CD,...,0027039,,0,0,0,0,,1,,
2,202732373,12665970,Bachelor's degree,0,7000000.0,0,Employed person,OTHER,Finished,CD,...,0023074,,0,0,0,0,,1,,
3,203031337,38026388,Elementary (primary) school,0,7000000.0,0,Employed person,WORKER,Finished,CD,...,0041262,,0,0,0,0,,1,,
4,149179096,14975594,Elementary (primary) school,1,4000000.0,0,Employed person,SALESMAN,Finished,CD,...,0031089,,0,0,0,0,,1,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4280372,342935768,77354359,High school education,0,15000000.0,0,Employed person,ADMINISTRATIVE,Active,CL,...,Koyal_User,,0,0,0,0,,0,,
4280373,342935857,20795524,High school education,0,12000000.0,0,Employed person,WORKER,Active,CC,...,Koyal_User,,0,0,0,0,,1,,
4280374,342935889,14994675,High school education,0,50000000.0,0,Employed person,OTHER,Active,CL,...,0033883,,0,0,0,0,,1,,
4280375,342936178,67269333,High school education,0,20000000.0,0,Self-employed person / business owner,OTHER,Active,CL,...,Koyal_User,,0,0,0,0,,0,,
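
The deduplication above keeps, for each `SKP_CREDIT_CASE`, the row with the highest `AMT_INCOME_MAIN` and, on ties, the alphabetically first `NAME_EDUCATION_TYPE`. The toy example below illustrates which row survives. It applies the same sort and `drop_duplicates` call to the whole frame in one pass, which should select the same rows as the split-and-concat approach apart from row order and ties on both sort keys; the helper name `dedupe_contracts` is hypothetical.

```python
import pandas as pd


def dedupe_contracts(df):
    """Keep one row per SKP_CREDIT_CASE using the notebook's sort rule."""
    return (
        df.sort_values(
            by=["SKP_CREDIT_CASE", "AMT_INCOME_MAIN", "NAME_EDUCATION_TYPE"],
            ascending=[True, True, False],
        )
        .drop_duplicates(subset=["SKP_CREDIT_CASE"], keep="last")
        .reset_index(drop=True)
    )


# Toy data: IDs 1 and 3 are duplicated, ID 2 is unique
toy = pd.DataFrame({
    "SKP_CREDIT_CASE": [1, 1, 2, 3, 3],
    "AMT_INCOME_MAIN": [0.0, 9000000.0, 5000000.0, 7000000.0, 7000000.0],
    "NAME_EDUCATION_TYPE": ["XNA", "XNA", "Bachelor's degree", "XNA", "High school education"],
})
deduped = dedupe_contracts(toy)
print(deduped)
```

For ID 1 the row with the higher income is kept; for ID 3 the incomes tie, so the row with a known education level is kept over `XNA`.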


In [10]:
contract_df['NAME_EDUCATION_TYPE'].value_counts()

NAME_EDUCATION_TYPE
High school education          1471797
XNA                            1093326
Junior school education         823065
Bachelor's degree               494245
Elementary (primary) school     339835
Master's degree                  58109
Name: count, dtype: int64

In [11]:
contract_df['NAME_INCOME_TYPE'].value_counts()

NAME_INCOME_TYPE
Employed person                          2178272
Self-employed person / business owner    1124455
XNA                                       938119
Student                                    22109
Person in household                        11248
Retired person                              4120
Unemployed                                  2054
Name: count, dtype: int64

In [12]:
contract_df['CODE_PROFESSION'].value_counts()

CODE_PROFESSION
WORKER            1213477
XNA                954327
OTHER              912944
SALESMAN           379242
FARMER             264773
ADMINISTRATIVE     241476
SERVICES           210532
ENGINEER           103606
Name: count, dtype: int64

In [13]:
contract_df['NAME_CREDIT_STATUS'].value_counts()

NAME_CREDIT_STATUS
Finished         2017785
Rejected          914467
Active            847515
Cancelled         477677
Written off        14520
Paid off            5598
Signed              1772
Approved             483
Sold                 387
In Preprocess        159
In Process            14
Name: count, dtype: int64

In [14]:
contract_df['PRODUCT'].value_counts()

PRODUCT
CD      3095684
CL       565521
BNPL     211481
CC       135823
TW       133928
CW       109591
IN            1
Name: count, dtype: int64

In [15]:
contract_df['APPLY_EMPLOYEE'].value_counts()

APPLY_EMPLOYEE
vn_api_fin                   1181407
OpenAPI_User                  562252
Koyal_User                    229929
BNPL_ONBOARDING_TECH_USER     125633
EPOS2_User                     72649
                              ...   
FPT107902                          1
00059787                           1
R00027808                          1
R00028096                          1
00060448                           1
Name: count, Length: 61678, dtype: int64

In [17]:
contract_df['APPLY_EMPLOYEE_LEVEL'].value_counts()

APPLY_EMPLOYEE_LEVEL
SENIOR     164273
JUNIOR     138165
MASTER     107650
WARRIOR     99291
Name: count, dtype: int64

# **CUSTOMER table (1,000,528 rows)**
- `SKP_CLIENT`: ID for customer -> **Primary Key**
- `NAME_GENDER`: Gender (Male/Female)
- `NAME_EDUCATION_TYPE`: Education level
- `DATE_BIRTH`: Date of birth
- `CNT_CHILDREN`: Number of children
- `FLAG_CAR_OWNER`: Car ownership flag (documented as 1 = yes, 0 = no) -> the column contains only the value `X`, so it is uninformative
- `NAME_SALARY_FREQUENCY`: Frequency of salary payments (`Monthly paid`, `Irregular salary frequency`, `Every week paid`, `Once a two week paid`, `XNA`) -> `XNA` in more than 99% of rows -> not useful
- `CNT_PERSON_DEPENDENT`: Number of dependents (missing 99%) -> not useful
- `ADDRESS`: Residential address (missing 15%)
- `AVG_SESSION_PER_WEEK_2025`: Average number of sessions per week in the online application (GMA) in 2025


In [18]:
customer_df = pd.read_csv(r'data/CUSTOMER.csv')
customer_df.info()


<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000528 entries, 0 to 1000527
Data columns (total 10 columns):
 #   Column                     Non-Null Count    Dtype  
---  ------                     --------------    -----  
 0   SKP_CLIENT                 1000528 non-null  int64  
 1   NAME_GENDER                1000528 non-null  object 
 2   NAME_EDUCATION_TYPE        1000528 non-null  object 
 3   DATE_BIRTH                 1000528 non-null  object 
 4   CNT_CHILDREN               874457 non-null   float64
 5   FLAG_CAR_OWNER             1000528 non-null  object 
 6   NAME_SALARY_FREQUENCY      1000528 non-null  object 
 7   CNT_PERSON_DEPENDENT       14869 non-null    float64
 8   ADDRESS                    851286 non-null   object 
 9   AVG_SESSION_PER_WEEK_2025  1000528 non-null  float64
dtypes: float64(3), int64(1), object(6)
memory usage: 76.3+ MB
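
The data dictionary marks `SKP_CLIENT` in CONTRACT as a foreign key to CUSTOMER, and both report 1,000,528 clients, which suggests but does not prove a full match. A small referential-integrity check could look like the sketch below; `orphan_keys` is a hypothetical helper and the frames are toy data.

```python
import pandas as pd


def orphan_keys(child, parent, key):
    """Return the distinct key values in child that have no match in parent."""
    has_no_match = ~child[key].isin(parent[key])
    return sorted(child.loc[has_no_match, key].unique().tolist())


# Toy data: client 999 appears in the contracts but not in the customers
contracts = pd.DataFrame({"SKP_CLIENT": [676, 1470, 1470, 999]})
customers = pd.DataFrame({"SKP_CLIENT": [676, 1470, 7123]})
orphans = orphan_keys(contracts, customers, "SKP_CLIENT")
print(orphans)  # [999]
```

An empty list on the real tables would confirm that every contract points at a known customer.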


In [19]:
customer_df

Unnamed: 0,SKP_CLIENT,NAME_GENDER,NAME_EDUCATION_TYPE,DATE_BIRTH,CNT_CHILDREN,FLAG_CAR_OWNER,NAME_SALARY_FREQUENCY,CNT_PERSON_DEPENDENT,ADDRESS,AVG_SESSION_PER_WEEK_2025
0,676,Female,High school education,1966-03-18 00:00,0.0,X,XNA,,"P. 2,TP. Tan An,Long An",0.0
1,1470,Male,High school education,1967-12-27 00:00,1.0,X,XNA,,"P. 9,Q. 8,TP. HCM",0.0
2,7123,Male,Junior school education,1972-10-03 00:00,2.0,X,XNA,6.0,"TT.BUON TRAP,H. Krong Ana,Dak Lak",0.0
3,10711,Female,Elementary (primary) school,1978-01-01 00:00,1.0,X,XNA,1.0,"X. An Thai Dong,H. Cai Be,Tien Giang",0.0
4,12407,Male,High school education,1986-03-14 00:00,1.0,X,XNA,,"P. Yen The,TP. Pleiku,Gia Lai",0.0
...,...,...,...,...,...,...,...,...,...,...
1000523,126809534,Male,XNA,1990-04-02 00:00,0.0,X,XNA,,,0.0
1000524,127462754,Male,XNA,2005-04-30 00:00,,X,XNA,,,0.0
1000525,126394510,Male,XNA,1995-07-24 00:00,0.0,X,XNA,,,0.0
1000526,118415150,Female,XNA,2000-05-09 00:00,,X,XNA,,,0.0


In [20]:
customer_df['NAME_SALARY_FREQUENCY'].value_counts()

NAME_SALARY_FREQUENCY
XNA                           991399
Monthly paid                    7696
Irregular salary frequency       884
Every week paid                  321
Once a two week paid             228
Name: count, dtype: int64
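
Several CUSTOMER columns are described as not useful because they are constant, mostly `XNA`, or mostly missing. The sketch below profiles each column by the share of unknown values (missing or equal to the placeholder) and the number of distinct known values. The helper `column_profile` and the choice to treat `XNA` as unknown are assumptions made for illustration.

```python
import pandas as pd


def column_profile(df, placeholder="XNA"):
    """Per column: percentage of unknown values and count of distinct known values."""
    records = {}
    for col in df.columns:
        # Treat the placeholder like a missing value
        known = df[col].mask(df[col] == placeholder)
        records[col] = {
            "pct_unknown": known.isna().mean() * 100,
            "n_distinct": known.nunique(),
        }
    return pd.DataFrame.from_dict(records, orient="index")


# Toy data shaped like the CUSTOMER columns discussed above
toy = pd.DataFrame({
    "FLAG_CAR_OWNER": ["X", "X", "X", "X"],
    "NAME_SALARY_FREQUENCY": ["XNA", "XNA", "XNA", "Monthly paid"],
    "CNT_CHILDREN": [0.0, 1.0, None, 2.0],
})
profile = column_profile(toy)
print(profile)
```

Columns with a single distinct value or a very high unknown share are candidates for dropping.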

# **EMPLOYEE table (11,383 rows)**
- `CODE_EMPLOYEE`: ID for Employee -> **Primary Key**
- `HIRING_DATE`: Hiring date of employee
- `LEAVING_DATE`: Termination date of employee (missing 34%) -> only about 34% of employees are still employed
- `MANAGER_CODE_EMPLOYEE`: ID for manager
- `GENDER`: Gender of employee (Male/Female)
- `BIRTH_DATE`: Birth date of employee
- `ADDRESS`: Home address of employee
- `LEVEL_SA`: Seniority level of the sales agent (missing 37%). Employees who joined on or after March 31 do not yet have a `LEVEL_SA` assigned; by default they are treated as Junior, the lowest level

In [21]:
employee_df = pd.read_csv(r'data/EMPLOYEE.csv')
employee_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 11383 entries, 0 to 11382
Data columns (total 8 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   CODE_EMPLOYEE          11383 non-null  int64 
 1   HIRING_DATE            11383 non-null  object
 2   LEAVING_DATE           7565 non-null   object
 3   MANAGER_CODE_EMPLOYEE  11383 non-null  int64 
 4   GENDER                 11383 non-null  object
 5   BIRTH_DATE             11383 non-null  object
 6   LEVEL_SA               7220 non-null   object
 7   ADDRESS                11383 non-null  object
dtypes: int64(2), object(6)
memory usage: 711.6+ KB
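
`CODE_EMPLOYEE` is stored as `int64` here, while `APPLY_EMPLOYEE` and `SIGN_EMPLOYEE` in CONTRACT are strings such as `0027039` or `Koyal_User`. The sketch below converts purely numeric staff codes to integers so the two can be compared. It rests on the unverified assumption that zero-padded numeric codes correspond to `CODE_EMPLOYEE`; system accounts and prefixed codes (for example `R00027808`) are left unmatched. The function name is hypothetical.

```python
from typing import Optional


def employee_code_to_int(code) -> Optional[int]:
    """Convert an all-digit staff code to int; return None for anything else."""
    if code is None:
        return None
    text = str(code).strip()
    # Only plain ASCII digits are accepted, so '0027039' becomes 27039
    if text.isascii() and text.isdigit():
        return int(text)
    return None


# Example codes taken from the CONTRACT previews above
for raw in ["0027039", "00049441", "Koyal_User", "R00027808"]:
    print(raw, "->", employee_code_to_int(raw))
```

If the assumption holds, mapping this function over `APPLY_EMPLOYEE` would give a column that can be compared with `CODE_EMPLOYEE` before a merge.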


In [22]:
employee_df

Unnamed: 0,CODE_EMPLOYEE,HIRING_DATE,LEAVING_DATE,MANAGER_CODE_EMPLOYEE,GENDER,BIRTH_DATE,LEVEL_SA,ADDRESS
0,46911,2021-01-04 00:00:00,2021-10-01 00:00:00,115169,Female,2001-11-28 00:00:00,,"Phuong Dong Hai, Tp. Phan Rang-thap Cham, Ninh..."
1,46945,2021-01-11 00:00:00,2021-07-26 00:00:00,35377,Female,1999-12-23 00:00:00,,"Phuong Xuan An, Thanh Pho Long Khanh, Dong Nai"
2,47297,2021-02-22 00:00:00,2024-03-13 00:00:00,115717,Female,1999-06-10 00:00:00,JUNIOR,"Phuong 08, Quan 6, TP Ho Chi Minh"
3,47428,2021-03-01 00:00:00,2021-04-11 00:00:00,2101,Female,1998-10-10 00:00:00,,"Thi Tran Lien Huong, Huyen Tuy Phong, Binh Thuan"
4,47435,2021-03-01 00:00:00,2021-11-11 00:00:00,115642,Female,1988-02-18 00:00:00,,"Xa Binh Hung, Huyen Binh Chanh, TP Ho Chi Minh"
...,...,...,...,...,...,...,...,...
11378,118286,2013-07-24 00:00:00,2021-09-19 00:00:00,120745,Female,1992-11-05 00:00:00,,"Thi Tran Tien Hai, Huyen Tien Hai, Thai Binh"
11379,120002,2013-10-14 00:00:00,2022-06-01 00:00:00,17466,Female,1992-05-01 00:00:00,,"Xa Long Hau, Huyen Lai Vung, Dong Thap"
11380,120825,2013-11-27 00:00:00,2021-08-15 00:00:00,23979,Female,1991-03-26 00:00:00,,"Phuong Binh Tri Dong, Quan Binh Tan, TP Ho Chi..."
11381,121209,2013-12-16 00:00:00,,16010,Male,1988-03-11 00:00:00,JUNIOR,"Thi Tran Thanh Binh, Huyen Thanh Binh, Dong Thap"


# **LEADS table (2,143,891 rows)**
- `DTIME_CREATED`: Timestamp when the lead or record was created
- `DATE_ASSIGNED`: Date when the lead was assigned. When the assignment date is missing (for example, because of a system error during lead assignment), the placeholder `3000-01-01 00:00:00` is stored
- `DATE_ACCEPTED`: Date when the lead was accepted. For leads from the SOB process, the SA supports the customer directly without clicking 'Accept Leads', so the placeholder `3000-01-01` is stored
- `DESC_FIN_REASON`: Description of the financial reason for the lead or outcome (`Not Interest`, `Expire Lifetime`, `Not Applicable`, `Deal`, `Existing Lead On LDS`, `No Cluster`, `Rejected`, `Expired Acceptance`, `Replaced`, `Expired Offer`, `Expired Campaign`) (missing 2%)
- `CODE_POS`: Code representing the point of sale or staff position -> **Foreign Key to SHOP table**
- `CODE_PRODUCT_TYPE`: Code for the type of financial product (`BNP`, `CLX`, `CCX`, `SCW`, `VCC`, `CD`, `TW`, `ACL`, `EPOS`, `CCW`)
- `FLAG_ASSIGNED`: Binary flag indicating whether the lead was assigned (1) or not (0)
- `CODE_SA`: Code of the sales agent (SA) or staff member handling the lead (missing 28%). For leads from the SOB process, the SA supports the customer directly without clicking 'Accept Leads', so `CODE_SA` is null -> **Foreign Key to EMPLOYEE table**
- `SKP_CLIENT`: ID for customer
- `MAX_OFFER`: Maximum loan offer amount approved for the customer (missing 1%)
- `LEAD_SOURCE`: Lead source (RA, SA, TLS, SOB, GMA, LANDING PAGE, RTDM)
- `PROCESS`: Lead approach process - Standard Process (STD), Proactive Process (PRO)

In [23]:
leads_df = pd.read_csv(r'data/LEADS.csv')
leads_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2143891 entries, 0 to 2143890
Data columns (total 12 columns):
 #   Column             Dtype  
---  ------             -----  
 0   DTIME_CREATED      object 
 1   DATE_ASSIGNED      object 
 2   DATE_ACCEPTED      object 
 3   DESC_FIN_REASON    object 
 4   CODE_POS           object 
 5   CODE_PRODUCT_TYPE  object 
 6   FLAG_ASSIGNED      int64  
 7   CODE_SA            object 
 8   SKP_CLIENT         int64  
 9   MAX_OFFER          float64
 10  LEAD_SOURCE        object 
 11  PROCESS            object 
dtypes: float64(1), int64(2), object(9)
memory usage: 196.3+ MB


In [24]:
leads_df

Unnamed: 0,DTIME_CREATED,DATE_ASSIGNED,DATE_ACCEPTED,DESC_FIN_REASON,CODE_POS,CODE_PRODUCT_TYPE,FLAG_ASSIGNED,CODE_SA,SKP_CLIENT,MAX_OFFER,LEAD_SOURCE,PROCESS
0,2019-10-16 13:38:42,3000-01-01 00:00:00,3000-01-01 00:00:00,Not Applicable,011417,CLX,0,,12393659,20000000.0,TLS,PRO
1,2024-09-26 18:22:11,2024-09-26 18:22:11,2024-09-26 18:22:21,Not Interest,440055,BNP,1,0026865,93813570,8000000.0,SA,STD
2,2024-11-06 20:08:00,2024-11-07 08:33:46,2024-11-07 08:34:32,Not Interest,1W0499,CCX,1,00058616,1748865,25000000.0,RA,STD
3,2022-09-29 11:56:46,2022-09-29 11:56:46,2022-09-29 18:48:50,Not Interest,400554,CCX,1,0027520,22148819,5000000.0,SA,STD
4,2022-03-30 20:17:34,2022-03-30 20:17:34,2022-03-31 08:48:42,Not Interest,100284,CLX,1,0023262,54145376,40000000.0,SA,STD
...,...,...,...,...,...,...,...,...,...,...,...,...
2143886,2022-11-04 19:28:06,2022-11-04 19:28:06,2022-11-06 12:33:10,Not Interest,030298,CLX,1,0031265,57165785,72657000.0,SA,STD
2143887,2022-11-11 19:50:40,2022-11-11 19:50:40,2022-11-12 08:09:48,Not Interest,221486,VCC,1,0027690,42788141,3000000.0,SA,STD
2143888,2025-03-19 14:38:46,2025-03-20 08:00:50,2025-03-20 08:03:59,Not Interest,013355,CCX,1,00047460,104196078,35000000.0,RA,STD
2143889,2022-11-11 16:12:23,2022-11-11 16:12:23,2022-11-11 16:12:48,Expire Lifetime,170508,VCC,1,0023679,12577773,5500000.0,RA,STD
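
`DATE_ASSIGNED` and `DATE_ACCEPTED` use `3000-01-01` as a placeholder, which lies beyond the year 2262 limit of pandas' nanosecond-resolution timestamps. The sketch below replaces the placeholder with a missing value before parsing, so that durations are computed only for real dates. The helper `parse_lead_dates` is hypothetical and the rows are modeled on the preview above.

```python
import pandas as pd

SENTINEL = "3000-01-01 00:00:00"  # placeholder described in the data dictionary


def parse_lead_dates(df, cols=("DATE_ASSIGNED", "DATE_ACCEPTED")):
    """Turn placeholder dates into NaT and parse the rest as datetimes."""
    out = df.copy()
    for col in cols:
        cleaned = out[col].mask(out[col] == SENTINEL)
        out[col] = pd.to_datetime(cleaned, errors="coerce")
    return out


toy = pd.DataFrame({
    "DATE_ASSIGNED": ["3000-01-01 00:00:00", "2024-09-26 18:22:11"],
    "DATE_ACCEPTED": ["3000-01-01 00:00:00", "2024-09-26 18:22:21"],
})
parsed = parse_lead_dates(toy)
seconds_to_accept = (parsed["DATE_ACCEPTED"] - parsed["DATE_ASSIGNED"]).dt.total_seconds()
print(seconds_to_accept)
```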


In [25]:
leads_df['DESC_FIN_REASON'].value_counts()

DESC_FIN_REASON
Not Interest            1247039
Expire Lifetime          463192
Not Applicable           189805
Deal                      57537
Existing Lead On LDS      55081
No Cluster                43800
Rejected                  23004
Expired Acceptance        15761
Replaced                   8902
Expired Offer              3120
Expired Campaign             65
Name: count, dtype: int64

In [26]:
leads_df['CODE_PRODUCT_TYPE'].value_counts()

CODE_PRODUCT_TYPE
BNP     734197
CLX     627256
CCX     315572
SCW     265964
VCC     130229
CD       32955
TW       19097
ACL      18375
EPOS       123
CCW        123
Name: count, dtype: int64

In [27]:
leads_df['LEAD_SOURCE'].value_counts()

LEAD_SOURCE
RA              1145648
SA               876587
TLS               42733
SOB               38450
GMA               25461
LANDING PAGE       9641
RTDM               5371
Name: count, dtype: int64

# **PAST_BEHAVIOR table (4,412,610 rows)**
- `DTIME_EVENT`: Timestamp when the event or action occurred
- `SKP_CLIENT`: ID for customer
- `ACTION`: Description of the action taken (`APP SCORING`, `SA SCORING`, `WEB SCORING`, `TLS SCORING`, `Partner Site SCORING`, `SOB_CD_QR`, `SOB_TW_LINK`, `SOB_TW_QR`, `TW0BOD`, `SOB_CD_LINK`, `WALK_IN`, `SA, TLS SCORING`, `APP, SA SCORING`, `SA, WEB SCORING`, `APP, Partner Site SCORING`, `SOB_CD_HAPP`)
- `PRODUCT_CODE`: Code representing the product involved in the event (`BNPL`, `CLX`, `CCX`, `ACL`, `TW`, `CC_SC`, `CD`, `CC_CC`, `SAI`, `CC_VCC`, `ACLX`)

In [28]:
past_behavior_df = pd.read_csv(r'data/PAST_BEHAVIOR.csv')
past_behavior_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4412610 entries, 0 to 4412609
Data columns (total 4 columns):
 #   Column        Dtype 
---  ------        ----- 
 0   DTIME_EVENT   object
 1   SKP_CLIENT    int64 
 2   ACTION        object
 3   PRODUCT_CODE  object
dtypes: int64(1), object(3)
memory usage: 134.7+ MB


In [29]:
past_behavior_df

Unnamed: 0,DTIME_EVENT,SKP_CLIENT,ACTION,PRODUCT_CODE
0,2022-12-01 01:40:38,15368837,APP SCORING,CLX
1,2022-12-01 02:04:46,7658238,APP SCORING,BNPL
2,2022-12-01 02:07:22,12686267,APP SCORING,BNPL
3,2022-12-01 06:01:48,16715460,APP SCORING,BNPL
4,2022-12-01 06:01:48,16715460,APP SCORING,CCX
...,...,...,...,...
4412605,2025-06-03 22:08:07,67685864,APP SCORING,BNPL
4412606,2025-06-03 22:09:07,39273226,APP SCORING,CCX
4412607,2025-06-03 22:15:54,127338235,WEB SCORING,BNPL
4412608,2025-06-03 22:27:18,43180366,APP SCORING,CLX


In [32]:
past_behavior_df['ACTION'].value_counts()

ACTION
APP SCORING                  1630349
SA SCORING                    867787
WEB SCORING                   746021
TLS SCORING                   726233
Partner Site SCORING          396829
SOB_CD_QR                      37813
SOB_TW_LINK                     5415
SOB_TW_QR                       1136
TW0BOD                           594
SOB_CD_LINK                      258
WALK_IN                          167
SA, TLS SCORING                    3
APP, SA SCORING                    2
SA, WEB SCORING                    1
APP, Partner Site SCORING          1
SOB_CD_HAPP                        1
Name: count, dtype: int64
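
A handful of `ACTION` values, such as `SA, TLS SCORING`, appear to combine several scoring channels in one label. The sketch below splits such labels into individual channels; reading the comma as a separator between channels is an assumption, and `split_action_channels` is a hypothetical helper.

```python
from typing import List


def split_action_channels(action: str) -> List[str]:
    """Split a '... SCORING' label into channels; other labels are returned as-is."""
    suffix = " SCORING"
    if not action.endswith(suffix):
        return [action]
    prefix = action[: -len(suffix)]
    return [part.strip() for part in prefix.split(",")]


# Labels taken from the value counts above
for label in ["APP SCORING", "APP, Partner Site SCORING", "SOB_CD_QR"]:
    print(label, "->", split_action_channels(label))
```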

In [33]:
past_behavior_df['PRODUCT_CODE'].value_counts()

PRODUCT_CODE
BNPL      1990361
CLX       1250701
CCX        590115
ACL        374434
TW          89346
CC_SC       48101
CD          38180
CC_CC       30845
SAI           289
CC_VCC        150
ACLX           88
Name: count, dtype: int64

# **SHOP table (64,525 rows)**
- `CODE_POS`: ID for the point of sale or staff position (POS code)
- `SALESROOM_TOWN`: Name of the ward/commune where the salesroom is located
- `SALESROOM_DISTRICT`: Name of the district where the salesroom is located
- `SALESROOM_PROVINCE`: Name of the province/city where the salesroom is located
- `ADDRESS`: Detailed address of the salesroom or registered location

In [30]:
shop_df = pd.read_csv(r'data/SHOP.csv')
shop_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 64525 entries, 0 to 64524
Data columns (total 5 columns):
 #   Column              Non-Null Count  Dtype 
---  ------              --------------  ----- 
 0   CODE_POS            64525 non-null  object
 1   SALESROOM_TOWN      64096 non-null  object
 2   SALESROOM_DISTRICT  64096 non-null  object
 3   SALESROOM_PROVINCE  64525 non-null  object
 4   ADDRESS             64525 non-null  object
dtypes: object(5)
memory usage: 2.5+ MB


In [31]:
shop_df

Unnamed: 0,CODE_POS,SALESROOM_TOWN,SALESROOM_DISTRICT,SALESROOM_PROVINCE,ADDRESS
0,410163,P. LẠCH TRAY,Q. Ngo Quyen,Hai Phong,"P. LẠCH TRAY, Q. Ngo Quyen, Hai Phong"
1,400714,P. Truong Thi,TP. Vinh,Nghe An,"P. Truong Thi, TP. Vinh, Nghe An"
2,240068,P. THAC MO,TX. Phuoc Long,Binh Phuoc,"P. THAC MO, TX. Phuoc Long, Binh Phuoc"
3,170525,X. Ham Ninh,TP. Phu Quoc,Kien Giang,"X. Ham Ninh, TP. Phu Quoc, Kien Giang"
4,540196,TT.Bac Ha,H. Bac Ha,Lao Cai,"TT.Bac Ha, H. Bac Ha, Lao Cai"
...,...,...,...,...,...
64520,570492,X. Ky Phong,H. Ky Anh,Ha Tinh,"X. Ky Phong, H. Ky Anh, Ha Tinh"
64521,224132,THI TRAN XUAN MAI,H. Chuong My,TP. Ha Noi,"THI TRAN XUAN MAI, H. Chuong My, TP. Ha Noi"
64522,210940,P.5,TP. Da Lat,Lam Dong,"P.5, TP. Da Lat, Lam Dong"
64523,014513,Phuong 9,Q. 5,TP. HCM,"Phuong 9, Q. 5, TP. HCM"


In [34]:
contract_train = pd.read_csv(r'data/contract_train.csv')
contract_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4080568 entries, 0 to 4080567
Data columns (total 29 columns):
 #   Column                 Dtype  
---  ------                 -----  
 0   SKP_CREDIT_CASE        int64  
 1   SKP_CLIENT             int64  
 2   NAME_EDUCATION_TYPE    object 
 3   CNT_CHILDREN           int64  
 4   AMT_INCOME_MAIN        float64
 5   AMT_INCOME_HOUSEHOLD   int64  
 6   NAME_INCOME_TYPE       object 
 7   CODE_PROFESSION        object 
 8   NAME_CREDIT_STATUS     object 
 9   PRODUCT                object 
 10  AMT_CREDIT             float64
 11  PAYMENT_NUM            float64
 12  INIT_PAY               float64
 13  ANNUITY                float64
 14  SKP_SALESROOM          int64  
 15  APPLY_CONTRACT_TIME    object 
 16  APPROVE_CONTRACT_TIME  object 
 17  SIGN_CONTRACT_TIME     object 
 18  APPLY_EMPLOYEE         object 
 19  SIGN_EMPLOYEE          object 
 20  TRANSAC                object 
 21  FIRST_DUE              int64  
 22  SECOND_DUE        

In [35]:
contract_test = pd.read_csv(r'data/contract_test.csv')
contract_test.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 290094 entries, 0 to 290093
Data columns (total 29 columns):
 #   Column                 Non-Null Count   Dtype  
---  ------                 --------------   -----  
 0   SKP_CREDIT_CASE        290094 non-null  int64  
 1   SKP_CLIENT             290094 non-null  int64  
 2   NAME_EDUCATION_TYPE    290094 non-null  object 
 3   CNT_CHILDREN           290094 non-null  int64  
 4   AMT_INCOME_MAIN        290094 non-null  float64
 5   AMT_INCOME_HOUSEHOLD   290094 non-null  int64  
 6   NAME_INCOME_TYPE       290094 non-null  object 
 7   CODE_PROFESSION        290094 non-null  object 
 8   NAME_CREDIT_STATUS     290094 non-null  object 
 9   PRODUCT                289376 non-null  object 
 10  AMT_CREDIT             290094 non-null  float64
 11  PAYMENT_NUM            234551 non-null  float64
 12  INIT_PAY               289376 non-null  float64
 13  ANNUITY                234551 non-null  float64
 14  SKP_SALESROOM          290094 non-nu