### Introduction

For most financial institutions, such as banks and multi-finance companies, lending is the main source of revenue. Lending also exposes the lender to credit risk: a borrower may stop repaying the loan, leaving the lender with a loss. To limit such losses, the lender needs to decide carefully who qualifies for a loan, at what interest rate, and for what amount.

### Data Used

The dataset comes from a multi-finance company.

The columns are as follows:
- `LN_ID`, Loan ID
- `TARGET`, Target variable (1 - client with late payment more than X days, 0 - all other cases)
- `CONTRACT_TYPE`, Identification if loan is cash or revolving
- `GENDER`, Gender of the client
- `NUM_CHILDREN`, Number of children the client has
- `INCOME`, Monthly income of the client
- `APPROVED_CREDIT`, Approved credit amount of the loan
- `ANNUITY`, Loan annuity (amount that must be paid monthly)
- `PRICE`, For consumer loans it is the price of the goods for which the loan is given
- `INCOME_TYPE`, Client's income type (businessman, working, maternity leave, ...)
- `EDUCATION`, The client's highest education level
- `FAMILY_STATUS`, Family status of the client
- `HOUSING_TYPE`, The client's housing situation (renting, living with parents, ...)
- `DAYS_AGE`, Client's age in days at the time of application
- `DAYS_WORK`, How many days before the application the person started current job
- `DAYS_REGISTRATION`, How many days before the application the client changed their registration
- `DAYS_ID_CHANGE`, How many days before the application the client changed the identity document used to apply for the loan
- `WEEKDAYS_APPLY`, On which day of the week did the client apply for the loan
- `HOUR_APPLY`, Approximately at what hour did the client apply for the loan
- `ORGANIZATION_TYPE`, Type of organization where client works
- `EXT_SCORE_1`, Normalized score from external data source
- `EXT_SCORE_2`, Normalized score from external data source
- `EXT_SCORE_3`, Normalized score from external data source
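
A data dictionary like the one above can double as a quick schema check after loading. The following is a minimal sketch, not part of the original notebook: the helper `missing_columns` and the toy frame are hypothetical, and the comparison is case-insensitive because the notebook lowercases column names after loading.

```python
import pandas as pd

# Column names copied from the data dictionary above.
EXPECTED_COLUMNS = [
    "LN_ID", "TARGET", "CONTRACT_TYPE", "GENDER", "NUM_CHILDREN", "INCOME",
    "APPROVED_CREDIT", "ANNUITY", "PRICE", "INCOME_TYPE", "EDUCATION",
    "FAMILY_STATUS", "HOUSING_TYPE", "DAYS_AGE", "DAYS_WORK",
    "DAYS_REGISTRATION", "DAYS_ID_CHANGE", "WEEKDAYS_APPLY", "HOUR_APPLY",
    "ORGANIZATION_TYPE", "EXT_SCORE_1", "EXT_SCORE_2", "EXT_SCORE_3",
]


def missing_columns(df, expected=EXPECTED_COLUMNS):
    """Return the expected columns absent from df (hypothetical helper, case-insensitive)."""
    present = {str(c).upper() for c in df.columns}
    return [c for c in expected if c not in present]


# Toy frame standing in for the real dataset (made-up values).
toy = pd.DataFrame({"ln_id": [1], "target": [0], "gender": ["F"]})
absent = missing_columns(toy)
print(absent)
```

On the real data, an empty list would indicate that every documented column is present.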

### Import Libraries

In [16]:
import re
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import MinMaxScaler, LabelEncoder, OneHotEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import GridSearchCV, StratifiedKFold
from sklearn.metrics import confusion_matrix, classification_report, f1_score, roc_curve, roc_auc_score

sns.set(style='darkgrid')
pd.options.display.max_columns = 50
import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

In [43]:
loan = pd.read_csv('./Dataset/app_train.csv')
loan.columns = loan.columns.str.lower()
loan.head()

Unnamed: 0,unnamed: 0,ln_id,target,contract_type,gender,num_children,income,approved_credit,annuity,price,income_type,education,family_status,housing_type,days_age,days_work,days_registration,days_id_change,weekdays_apply,hour_apply,organization_type,ext_score_1,ext_score_2,ext_score_3
0,201468,333538,0,Revolving loans,F,1,67500.0,202500.0,10125.0,202500.0,Working,Secondary / secondary special,Married,With parents,-11539,-921,-119.0,-2757,TUESDAY,18,Business Entity Type 3,0.572805,0.608276,
1,264803,406644,0,Cash loans,F,1,202500.0,976711.5,49869.0,873000.0,Commercial associate,Secondary / secondary special,Married,House / apartment,-15743,-4482,-1797.0,-2455,TUESDAY,14,Other,0.6556,0.684298,
2,137208,259130,0,Cash loans,F,0,180000.0,407520.0,25060.5,360000.0,Pensioner,Secondary / secondary special,Married,House / apartment,-20775,365243,-8737.0,-4312,THURSDAY,14,NA1,,0.580687,0.749022
3,269220,411997,0,Cash loans,M,0,225000.0,808650.0,26086.5,675000.0,State servant,Higher education,Married,House / apartment,-20659,-10455,-4998.0,-4010,WEDNESDAY,10,Culture,,0.62374,0.710674
4,122096,241559,0,Revolving loans,M,0,135000.0,180000.0,9000.0,180000.0,Commercial associate,Secondary / secondary special,Single / not married,House / apartment,-9013,-1190,-3524.0,-1644,SUNDAY,11,Construction,0.175511,0.492994,0.085595
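
The `unnamed: 0` column in the output above looks like a row index that was written to the CSV when the file was saved; that is an assumption, since the notebook does not say. Under that assumption, the sketch below shows two common ways to handle it, using a small in-memory CSV (only the first three columns of the first two rows shown above) in place of the real file.

```python
import io

import pandas as pd

# Tiny stand-in for app_train.csv: the first header field is empty,
# which is how a saved DataFrame index usually appears in a CSV.
csv_text = ",LN_ID,TARGET\n201468,333538,0\n264803,406644,0\n"

# Option 1: load as the notebook does, then drop the leftover column.
raw = pd.read_csv(io.StringIO(csv_text))
raw.columns = raw.columns.str.lower()
dropped = raw.drop(columns=["unnamed: 0"])

# Option 2: tell pandas that the first column is the index.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(dropped.columns.tolist())
print(indexed.columns.tolist())
```

Either way, the leftover column stays out of the feature set, where it would otherwise act as a meaningless numeric feature.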


### Data Cleansing

In [22]:
loan.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 61503 entries, 0 to 61502
Data columns (total 24 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   unnamed: 0         61503 non-null  int64  
 1   ln_id              61503 non-null  int64  
 2   target             61503 non-null  int64  
 3   contract_type      61503 non-null  object 
 4   gender             61503 non-null  object 
 5   num_children       61503 non-null  int64  
 6   income             61503 non-null  float64
 7   approved_credit    61503 non-null  float64
 8   annuity            61502 non-null  float64
 9   price              61441 non-null  float64
 10  income_type        61503 non-null  object 
 11  education          61503 non-null  object 
 12  family_status      61503 non-null  object 
 13  housing_type       61503 non-null  object 
 14  days_age           61503 non-null  int64  
 15  days_work          61503 non-null  int64  
 16  days_registration  615

In [23]:
loan.isnull().sum()[loan.isnull().sum() > 0].reset_index(name='count').sort_values(
    by='count',
    ascending=False
).set_index('index')

index,count
ext_score_1,34845
ext_score_3,12239
ext_score_2,134
price,62
annuity,1
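
Raw counts are easier to judge as a share of all rows. With 61,503 rows, `ext_score_1` is missing in roughly 57% of them and `ext_score_3` in about 20%, while the other three columns are missing in well under 1%. The sketch below computes both figures in one table; `missing_summary` is a hypothetical helper and the toy frame uses made-up values.

```python
import numpy as np
import pandas as pd


def missing_summary(df):
    """Count and percentage of missing values per column, largest first."""
    counts = df.isnull().sum()
    summary = pd.DataFrame({
        "count": counts,
        "percent": (counts / len(df) * 100).round(2),
    })
    # Keep only the columns that actually have missing values.
    return summary[summary["count"] > 0].sort_values("count", ascending=False)


# Toy frame with made-up values, standing in for `loan`.
toy = pd.DataFrame({
    "ext_score_1": [0.5, np.nan, np.nan, np.nan],
    "price": [100.0, 200.0, np.nan, 400.0],
    "income": [1.0, 2.0, 3.0, 4.0],
})
summary = missing_summary(toy)
print(summary)
```

Calling `missing_summary(loan)` on the real frame should reproduce the counts above alongside their percentages.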


In [24]:
loan.duplicated().sum()

0
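
`loan.duplicated()` only flags rows that are identical in every column. A stricter check, sketched below on made-up data, is whether the loan identifier itself repeats. Treating `ln_id` as a unique key is an assumption based on its description as the loan ID.

```python
import pandas as pd

# Toy frame: loan 102 appears twice with different income values.
toy = pd.DataFrame({
    "ln_id": [101, 102, 102, 103],
    "income": [1000.0, 2000.0, 2500.0, 3000.0],
})

# Rows identical across all columns (none here).
full_row_dupes = int(toy.duplicated().sum())

# Rows whose ln_id has already been seen earlier in the frame.
id_dupes = int(toy.duplicated(subset=["ln_id"]).sum())

# All ids that occur more than once.
repeated_ids = (
    toy.loc[toy.duplicated(subset=["ln_id"], keep=False), "ln_id"]
    .unique()
    .tolist()
)

print(full_row_dupes, id_dupes, repeated_ids)
```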

In [25]:
for col_name in loan.select_dtypes(include=['object']).columns.tolist():
    print('Variable ',col_name)
    print(loan[col_name].value_counts())

Variable  contract_type
Cash loans         55699
Revolving loans     5804
Name: contract_type, dtype: int64
Variable  gender
F    40549
M    20954
Name: gender, dtype: int64
Variable  income_type
Working                 31621
Commercial associate    14217
Pensioner               11249
State servant            4407
Unemployed                  5
Student                     3
Businessman                 1
Name: income_type, dtype: int64
Variable  education
Secondary / secondary special    43777
Higher education                 14887
Incomplete higher                 2045
Lower secondary                    760
Academic degree                     34
Name: education, dtype: int64
Variable  family_status
Married                 39370
Single / not married     9029
Civil marriage           5881
Separated                3970
Widow                    3253
Name: family_status, dtype: int64
Variable  housing_type
House / apartment      54648
With parents            2891
Municipal apartment     2203

In [44]:
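# Collapse detailed labels to their first word, e.g. 'Business Entity Type 3' -> 'Business';
# re.findall drops punctuation and '-'.join restores hyphens ('Self-employed' is unchanged).
loan['organization_type'] = loan['organization_type'].apply(
    lambda x: '-'.join(re.findall(r'\w+', x.split()[0]))
)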

loan['organization_type'].unique()

array(['Business', 'Other', 'NA1', 'Culture', 'Construction',
       'Self-employed', 'University', 'Kindergarten', 'Restaurant',
       'Trade', 'Services', 'Housing', 'Industry', 'Transport',
       'Medicine', 'School', 'Security', 'Government', 'Agriculture',
       'Police', 'Emergency', 'Realtor', 'Electricity', 'Military',
       'Postal', 'Hotel', 'Cleaning', 'Bank', 'Telecom', 'Insurance',
       'Mobile', 'Advertising', 'Legal', 'Religion'], dtype=object)
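
To see what the transformation does to individual labels, the sketch below applies the same expression to a few strings. `Business Entity Type 3` and `Self-employed` appear in the data shown above; the labels with a colon are hypothetical examples of punctuation being stripped, and `first_token` is a name introduced here for illustration.

```python
import re


def first_token(label):
    """Keep the first whitespace-separated token, joining its word parts with '-'."""
    return "-".join(re.findall(r"\w+", label.split()[0]))


samples = [
    "Business Entity Type 3",  # seen in the data
    "Self-employed",           # seen in the data
    "Trade: type 7",           # hypothetical raw label
    "Industry: type 11",       # hypothetical raw label
    "NA1",                     # seen in the data
]
cleaned = [first_token(s) for s in samples]
print(cleaned)
```

Note that the function expects strings. A missing value (`NaN`) in the column would raise an `AttributeError` on `.split()`, so it would need to be filled or skipped first.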

In [49]:
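# The days_* columns count days before the application and are mostly negative;
# convert them to positive magnitudes.
day_cols = [col for col in loan.columns if col.startswith('days')]
loan[day_cols] = loan[day_cols].abs()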
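
Taking the absolute value does not help with one value visible in the sample rows: `days_work = 365243` for a pensioner, which corresponds to about 1,000 years. This looks like a placeholder for "not employed" rather than a real duration, although the notebook does not confirm that. Under that assumption, the sketch below flags the placeholder and replaces it with `NaN` before taking magnitudes. The names `PLACEHOLDER`, `days_work_missing` and `days_work_clean` are introduced here for illustration.

```python
import pandas as pd

# days_work values from the first four sample rows shown earlier.
toy = pd.DataFrame({"days_work": [-921, -4482, 365243, -10455]})

# Assumption: 365243 marks "no current job" and is not a real day count.
PLACEHOLDER = 365243

# Keep a flag, since "placeholder present" may itself be informative.
toy["days_work_missing"] = toy["days_work"].eq(PLACEHOLDER)

# Replace the placeholder with NaN, then convert to positive magnitudes.
toy["days_work_clean"] = toy["days_work"].where(~toy["days_work_missing"]).abs()

print(toy)
```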

In [50]:
# Age in whole years; note this uses a 360-day year, so ages are slightly overstated
loan['age'] = (loan['days_age']/360).astype(int)
loan['age'].unique()

array([32, 43, 57, 25, 29, 53, 46, 27, 28, 39, 36, 42, 30, 37, 63, 61, 54,
       55, 66, 50, 51, 26, 64, 38, 62, 59, 35, 65, 34, 31, 58, 40, 45, 33,
       60, 23, 44, 49, 41, 52, 56, 47, 69, 48, 22, 24, 68, 67, 21, 70])
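
Dividing by 360 rather than by the length of a calendar year shifts some ages up by one. The sketch below compares both divisors on the `days_age` values of the five sample rows shown earlier, using absolute values. The 365.25 divisor is a suggestion, not what the notebook uses.

```python
import pandas as pd

# Absolute days_age values from the five sample rows.
days_age = pd.Series([11539, 15743, 20775, 20659, 9013])

# The notebook's conversion (360-day year).
age_360 = (days_age / 360).astype(int)

# Alternative: average calendar year, including leap years.
age_365 = (days_age / 365.25).astype(int)

print(age_360.tolist())  # [32, 43, 57, 57, 25]
print(age_365.tolist())  # [31, 43, 56, 56, 24]
```

Four of the five sample ages differ by one year, which matters mainly for clients close to a bin edge in the age grouping.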

In [60]:
loan['age_group'] = pd.cut(
    loan['age'], 
    bins=[19, 25, 35, 60, 100], 
    labels=['very_young','young','middle_age','senior_citizen']
)

loan[['age','age_group']].head()

Unnamed: 0,age,age_group
0,32,young
1,43,middle_age
2,57,middle_age
3,57,middle_age
4,25,very_young
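
By default `pd.cut` builds intervals that are open on the left and closed on the right, so the bins above are (19, 25], (25, 35], (35, 60] and (60, 100]. This is why the client aged 25 falls into `very_young` rather than `young`. The sketch below checks the edge cases on a few made-up ages.

```python
import pandas as pd

# Ages chosen to sit on and just past each bin edge.
ages = pd.Series([19, 20, 25, 26, 35, 36, 60, 61])

groups = pd.cut(
    ages,
    bins=[19, 25, 35, 60, 100],
    labels=["very_young", "young", "middle_age", "senior_citizen"],
)

# Age 19 falls outside the first interval (19, 25] and becomes NaN.
print(pd.DataFrame({"age": ages, "age_group": groups}))
```

An age of exactly 19 is left unassigned with these bins. The ages present in this dataset start at 21, so that does not occur here, but passing `include_lowest=True` would cover it.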


https://www.analyticsvidhya.com/blog/2022/03/exploratory-data-analysis-eda-credit-card-fraud-detection-case-study/