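## Data Preprocessing and Baseline Models

1. Preprocessing
    - 1-1 Remove rows with NA features to build the **base data set**. (DATA: NA rows removed)
    - 1-2 Find the **best scaling method**. (StandardScaler)

2. Baseline model analysis
    - Build **baseline models with default parameters, compare the scaling methods by roc_auc_score**, and decide how the final model will be built.<br>
      (With standard scaling, the validation roc_auc_score ranks rf > knn > svc > tree.)

3. Feature importance analysis and new preprocessing variants
    - Rank the features with RandomForest's `feature_importances_` attribute, then drop low-importance features to create the new data sets Data1, Data2, Data3.
    - Evaluate each new data set with the best scaler (StandardScaler) and the baseline models to get its validation roc_auc_score.  
    (Decision: use the original Data.)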

In [1]:
import pandas as pd
import numpy as np
# Forward slash avoids the invalid "\c" escape sequence in the path string
data = pd.read_csv("preprocessing/cs_data.csv",
                   header=0,
                   index_col=0)

In [2]:
data.head()
#SeriousDlqin2yrs / RevolvingUtilizationOfUnsecuredLines /age /NumberOfTime30-59DaysPastDueNotWorse/
#DebtRatio/MonthlyIncome/NumberOfOpenCreditLinesAndLoans/NumberOfTimes90DaysLate/NumberRealEstateLoansOrLines/
#NumberOfTime60-89DaysPastDueNotWorse/NumberOfDependents

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [3]:
print(data.count())
print(data.isnull().sum())

SeriousDlqin2yrs                        150000
RevolvingUtilizationOfUnsecuredLines    150000
age                                     150000
NumberOfTime30-59DaysPastDueNotWorse    150000
DebtRatio                               150000
MonthlyIncome                           120269
NumberOfOpenCreditLinesAndLoans         150000
NumberOfTimes90DaysLate                 150000
NumberRealEstateLoansOrLines            150000
NumberOfTime60-89DaysPastDueNotWorse    150000
NumberOfDependents                      146076
dtype: int64
SeriousDlqin2yrs                            0
RevolvingUtilizationOfUnsecuredLines        0
age                                         0
NumberOfTime30-59DaysPastDueNotWorse        0
DebtRatio                                   0
MonthlyIncome                           29731
NumberOfOpenCreditLinesAndLoans             0
NumberOfTimes90DaysLate                     0
NumberRealEstateLoansOrLines                0
NumberOfTime60-89DaysPastDueNotWorse        0
NumberOfDe

In [4]:
data.head()
data.shape

(150000, 11)

In [5]:
# 1-1 Remove rows that contain NA values
data.dropna(inplace=True)
data.head()

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
3,0,0.65818,38,1,0.085113,3042.0,2,1,0,0,0.0
4,0,0.23381,30,0,0.03605,3300.0,5,0,0,0,0.0
5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0


In [6]:
data.isnull().sum()
data.shape

(120269, 11)
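
The drop from 150000 to 120269 rows follows from `dropna` removing every row with at least one missing value. As a minimal sketch, assuming a made-up toy frame rather than the credit data, the number of affected rows can be checked before dropping:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame, only to illustrate the check; not the credit data.
toy = pd.DataFrame({
    "MonthlyIncome": [9120.0, np.nan, 3042.0, np.nan, 63588.0],
    "NumberOfDependents": [2.0, np.nan, 0.0, 1.0, 0.0],
})

# Rows that contain at least one missing value.
rows_with_na = toy.isnull().any(axis=1)
n_dropped = int(rows_with_na.sum())

# dropna() without inplace=True returns a new frame and leaves `toy` untouched.
clean = toy.dropna()
print(n_dropped, clean.shape)
```

In the real data, 150000 - 120269 = 29731 rows are removed, which equals the number of missing `MonthlyIncome` values, so every row that lacks `NumberOfDependents` also lacks `MonthlyIncome`.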

In [10]:
data

Unnamed: 0,SeriousDlqin2yrs,RevolvingUtilizationOfUnsecuredLines,age,NumberOfTime30-59DaysPastDueNotWorse,DebtRatio,MonthlyIncome,NumberOfOpenCreditLinesAndLoans,NumberOfTimes90DaysLate,NumberRealEstateLoansOrLines,NumberOfTime60-89DaysPastDueNotWorse,NumberOfDependents
1,1,0.766127,45,2,0.802982,9120.0,13,0,6,0,2.0
2,0,0.957151,40,0,0.121876,2600.0,4,0,0,0,1.0
3,0,0.658180,38,1,0.085113,3042.0,2,1,0,0,0.0
4,0,0.233810,30,0,0.036050,3300.0,5,0,0,0,0.0
5,0,0.907239,49,1,0.024926,63588.0,7,0,1,0,0.0
...,...,...,...,...,...,...,...,...,...,...,...
149995,0,0.385742,50,0,0.404293,3400.0,7,0,0,0,0.0
149996,0,0.040674,74,0,0.225131,2100.0,4,0,1,0,0.0
149997,0,0.299745,44,0,0.716562,5584.0,4,0,1,0,2.0
149999,0,0.000000,30,0,0.000000,5716.0,4,0,0,0,0.0


In [None]:
# Build the baseline models

In [12]:
# Split the data set into features X and target y
X = data.drop(columns='SeriousDlqin2yrs')
y = data['SeriousDlqin2yrs']
X.shape , y.shape

((120269, 10), (120269,))

In [13]:
# Split the data set into TRAIN, VAL and TEST (stratified on the target)
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, stratify=y, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, test_size=0.2, stratify=y_train, random_state=0)
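
Two consecutive 80/20 splits leave roughly 64% of the rows for training, 16% for validation and 20% for testing, and `stratify` keeps the positive rate the same in each part. The sketch below illustrates this on synthetic stand-in data; the `*_demo` names and the 10% positive rate are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 1000 rows, 10% positives.
X_demo = np.arange(1000).reshape(-1, 1)
y_demo = np.array([1] * 100 + [0] * 900)

# First split: hold out 20% as the test set.
X_tmp, X_te, y_tmp, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)
# Second split: take 20% of the remainder as the validation set.
X_tr, X_va, y_tr, y_va = train_test_split(
    X_tmp, y_tmp, test_size=0.2, stratify=y_tmp, random_state=0)

print(len(X_tr), len(X_va), len(X_te))        # sizes of the three subsets
print(y_tr.mean(), y_va.mean(), y_te.mean())  # positive rate in each subset
```

On the 120269 rows used here, the same two splits yield the 19243 validation rows that the later shape check reports.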

In [14]:
# Scaling 
# Scaled1 : StandardScaler
# Scaled2 : MinMaxScaler
from sklearn.preprocessing import StandardScaler, MinMaxScaler
sc= StandardScaler()
X_train_scaled1=sc.fit_transform(X_train)
X_val_scaled1=sc.transform(X_val)
X_test_scaled1=sc.transform(X_test)


mn = MinMaxScaler()
X_train_scaled2 =mn.fit_transform(X_train)
X_val_scaled2 =mn.transform(X_val)
X_test_scaled2=mn.transform(X_test)
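
Both scalers are fitted on the training split only and then reused on the validation and test splits, so no statistics leak from held-out data. A small sketch with made-up arrays (not the credit data) shows what that means in practice:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Tiny made-up arrays standing in for the train and validation splits.
train = np.array([[1.0], [2.0], [3.0], [4.0]])
val = np.array([[2.5], [6.0]])

std = StandardScaler().fit(train)   # learns mean and std from train only
mm = MinMaxScaler().fit(train)      # learns min and max from train only

train_std = std.transform(train)    # zero mean, unit variance on train
val_std = std.transform(val)        # reuses the training statistics
val_mm = mm.transform(val)          # may fall outside [0, 1] on unseen data
print(train_std.ravel(), val_std.ravel(), val_mm.ravel())
```

Because the validation value 6.0 lies above the training maximum, its min-max scaled value exceeds 1, which is expected when the scaler is not refitted on held-out data.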

In [16]:
# Training 
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import roc_curve, roc_auc_score  # plot_roc_curve no longer exists in scikit-learn >= 1.2

knn =KNeighborsClassifier()
tree=DecisionTreeClassifier(random_state=0)
rf=RandomForestClassifier(random_state=0)
svc= SVC(random_state=0, probability =True)

In [17]:
knn.fit(X_train_scaled1,y_train)
tree.fit(X_train_scaled1,y_train)
rf.fit(X_train_scaled1,y_train)
svc.fit(X_train_scaled1,y_train)

SVC(probability=True, random_state=0)

In [18]:
# Train probabilities on X_train_scaled1 (StandardScaler)
prob_knn_train=knn.predict_proba(X_train_scaled1)[:,1]
prob_tree_train= tree.predict_proba(X_train_scaled1)[:,1]
prob_rf_train=rf.predict_proba(X_train_scaled1)[:,1]
prob_svc_train=svc.predict_proba(X_train_scaled1)[:,1]

In [31]:
# Train roc_auc_score on X_train_scaled1 (StandardScaler)
print("Train_StandardScaler_roc_auc_score")
print("knn:{} | Decisiontree:{} | RandomForest:{} | SVC:{}".format(roc_auc_score(y_train, prob_knn_train),roc_auc_score(y_train,prob_tree_train),roc_auc_score(y_train,prob_rf_train),roc_auc_score(y_train, prob_svc_train)))

Train_StandardScaler_roc_auc_score
knn:0.9438702150745297 | Decisiontree:0.9999999921693801 | RandomForest:0.9999955313261903 | SVC:0.6550256904762071


In [20]:
# Validation probabilities on X_val_scaled1 (StandardScaler)
prob_knn_val=knn.predict_proba(X_val_scaled1)[:,1]
prob_tree_val= tree.predict_proba(X_val_scaled1)[:,1]
prob_rf_val=rf.predict_proba(X_val_scaled1)[:,1]
prob_svc_val=svc.predict_proba(X_val_scaled1)[:,1]

In [54]:
y_val.shape , prob_knn_val.shape

((19243,), (19243,))

In [32]:
print("Val_StandardScaler_roc_auc_Score")
print("knn:{} | Decisiontree:{} | RandomForest:{} | SVC:{}".format(roc_auc_score(y_val, prob_knn_val),roc_auc_score(y_val,prob_tree_val),roc_auc_score(y_val,prob_rf_val),roc_auc_score(y_val, prob_svc_val)))

Val_StandardScaler_roc_auc_Score
knn:0.6999108449752681 | Decisiontree:0.5981121724260852 | RandomForest:0.8389143220379409 | SVC:0.6346719146049915
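
The tree and random forest reach a training score near 1.0 but only 0.598 and 0.839 on validation, which points to overfitting with default parameters. The repeated fit, predict and score steps can be wrapped in one helper that reports both scores side by side. This is a minimal sketch: `auc_report` is a hypothetical helper, and the data comes from `make_classification`, not from the credit data set.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def auc_report(models, X_tr, y_tr, X_va, y_va):
    """Fit each model and return {name: (train_auc, val_auc)}."""
    report = {}
    for name, model in models.items():
        model.fit(X_tr, y_tr)
        train_auc = roc_auc_score(y_tr, model.predict_proba(X_tr)[:, 1])
        val_auc = roc_auc_score(y_va, model.predict_proba(X_va)[:, 1])
        report[name] = (train_auc, val_auc)
    return report


# Synthetic, imbalanced stand-in data (an assumption for illustration).
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=10, weights=[0.9], flip_y=0.05, random_state=0)
X_tr, X_va, y_tr, y_va = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0)

demo_models = {
    "tree": DecisionTreeClassifier(random_state=0),
    "rf": RandomForestClassifier(n_estimators=50, random_state=0),
}
report = auc_report(demo_models, X_tr, y_tr, X_va, y_va)
for name, (train_auc, val_auc) in report.items():
    print(f"{name}: train={train_auc:.3f} val={val_auc:.3f} gap={train_auc - val_auc:.3f}")
```

A large gap between the two columns is the signal to look at; the same helper could be called once per scaler or per data set variant.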


In [30]:
# Refit all models on X_train_scaled2 (MinMaxScaler)
knn.fit(X_train_scaled2,y_train)
tree.fit(X_train_scaled2,y_train)
rf.fit(X_train_scaled2,y_train)
svc.fit(X_train_scaled2,y_train)

SVC(probability=True, random_state=0)

In [37]:
# Train probabilities on X_train_scaled2 (MinMaxScaler)
prob_knn_train=knn.predict_proba(X_train_scaled2)[:,1]
prob_tree_train= tree.predict_proba(X_train_scaled2)[:,1]
prob_rf_train=rf.predict_proba(X_train_scaled2)[:,1]
prob_svc_train=svc.predict_proba(X_train_scaled2)[:,1]

In [38]:
# Train roc_auc_score on X_train_scaled2 (MinMaxScaler)
print("Train_MinMaxScaler_roc_auc_score")
print("knn:{} | Decisiontree:{} | RandomForest:{} | SVC:{}".format(roc_auc_score(y_train, prob_knn_train),roc_auc_score(y_train,prob_tree_train),roc_auc_score(y_train,prob_rf_train),roc_auc_score(y_train, prob_svc_train)))

Train_MinMaxScaler_roc_auc_score
knn:0.9443915333801829 | Decisiontree:0.9999999856438633 | RandomForest:0.9999946516865444 | SVC:0.6137663729878775


In [35]:
prob_knn_val2=knn.predict_proba(X_val_scaled2)[:,1]
prob_tree_val2= tree.predict_proba(X_val_scaled2)[:,1]
prob_rf_val2=rf.predict_proba(X_val_scaled2)[:,1]
prob_svc_val2=svc.predict_proba(X_val_scaled2)[:,1]

In [36]:
print("Val_MinMaxScaler_roc_auc_Score")
print("knn:{} | Decisiontree:{} | RandomForest:{} | SVC:{}".format(roc_auc_score(y_val, prob_knn_val2), roc_auc_score(y_val,prob_tree_val2), roc_auc_score(y_val,prob_rf_val2), roc_auc_score(y_val, prob_svc_val2)))

Val_MinMaxScaler_roc_auc_Score
knn:0.6643361563808541 | Decisiontree:0.6108506393523028 | RandomForest:0.8393583845697647 | SVC:0.6095620184223086


In [None]:
# Scaler comparison: on validation, StandardScaler is clearly better for knn and SVC.
# The tree-based models barely change (MinMaxScaler is marginally higher for Decisiontree and RandomForest).

# Train_StandardScaler_roc_auc_score
# knn:0.9438702150745297 | Decisiontree:0.9999999921693801 | RandomForest:0.9999955313261903 | SVC:0.6550256904762071

# Val_StandardScaler_roc_auc_Score
# knn:0.6999108449752681 | Decisiontree:0.5981121724260852 | RandomForest:0.8389143220379409 | SVC:0.6346719146049915

# Train_MinMaxScaler_roc_auc_score
# knn:0.9443915333801829 | Decisiontree:0.9999999856438633 | RandomForest:0.9999946516865444 | SVC:0.6137663729878775

# Val_MinMaxScaler_roc_auc_Score
# knn:0.6643361563808541 | Decisiontree:0.6108506393523028 | RandomForest:0.8393583845697647 | SVC:0.6095620184223086
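
The validation scores above are easier to compare side by side. The sketch below puts the recorded values, rounded to four decimals, into a table and computes the per-model difference; the `val_auc` name is introduced only for this illustration.

```python
import pandas as pd

# Validation ROC AUC values copied from the outputs above (rounded to 4 decimals).
val_auc = pd.DataFrame(
    {
        "StandardScaler": [0.6999, 0.5981, 0.8389, 0.6347],
        "MinMaxScaler": [0.6643, 0.6109, 0.8394, 0.6096],
    },
    index=["knn", "tree", "rf", "svc"],
)

# Positive values mean StandardScaler scored higher for that model.
val_auc["diff"] = val_auc["StandardScaler"] - val_auc["MinMaxScaler"]
print(val_auc.sort_values("diff", ascending=False))
```

The distance-based models (knn, SVC) gain from StandardScaler, while the tree-based models are largely insensitive to the scaling choice, so their small differences are likely noise.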


In [None]:
# With default parameters, standard scaling does better overall; validation ranking of the models: rf > knn > svc > tree

In [19]:
rf.feature_importances_
fi =pd.Series(rf.feature_importances_, index=X.columns)
fi.sort_values()

NumberRealEstateLoansOrLines            0.035682
NumberOfTime60-89DaysPastDueNotWorse    0.044924
NumberOfDependents                      0.046968
NumberOfTime30-59DaysPastDueNotWorse    0.051149
NumberOfTimes90DaysLate                 0.086852
NumberOfOpenCreditLinesAndLoans         0.089606
age                                     0.122492
MonthlyIncome                           0.162533
DebtRatio                               0.172253
RevolvingUtilizationOfUnsecuredLines    0.187541
dtype: float64
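
The least important features can be picked programmatically from this ranking. As a sketch, the importances printed above are copied into a Series; the cut-off `k = 2` is an assumption for illustration, not a value fixed by the notebook.

```python
import pandas as pd

# Importances copied from the output above.
fi_demo = pd.Series({
    "NumberRealEstateLoansOrLines": 0.035682,
    "NumberOfTime60-89DaysPastDueNotWorse": 0.044924,
    "NumberOfDependents": 0.046968,
    "NumberOfTime30-59DaysPastDueNotWorse": 0.051149,
    "NumberOfTimes90DaysLate": 0.086852,
    "NumberOfOpenCreditLinesAndLoans": 0.089606,
    "age": 0.122492,
    "MonthlyIncome": 0.162533,
    "DebtRatio": 0.172253,
    "RevolvingUtilizationOfUnsecuredLines": 0.187541,
})

# Hypothetical cut: take the k least important features as drop candidates.
k = 2
drop_candidates = fi_demo.nsmallest(k).index.tolist()
total = fi_demo.sum()  # random forest importances are normalized to sum to 1
print(drop_candidates, round(total, 6))
```

The two candidates found this way are the columns that the following cells remove to build the reduced data sets.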

In [39]:
# Data1: drop the least important feature, 'NumberRealEstateLoansOrLines'
data1 = data.drop(columns='NumberRealEstateLoansOrLines')
data1.head()
data1.to_csv('preprocessing/cs_data1.csv')


In [40]:
# Data2: drop the second least important feature, 'NumberOfTime60-89DaysPastDueNotWorse'
data2 = data.drop(columns='NumberOfTime60-89DaysPastDueNotWorse')
data2.head()
data2.to_csv('preprocessing/cs_data2.csv')

In [41]:
# Data3: drop both 'NumberRealEstateLoansOrLines' and 'NumberOfTime60-89DaysPastDueNotWorse'
data3 = data.drop(columns=['NumberRealEstateLoansOrLines','NumberOfTime60-89DaysPastDueNotWorse'])
data3.head()
data3.to_csv('preprocessing/cs_data3.csv')

In [42]:
data.shape, data1.shape, data2.shape, data3.shape  


((120269, 11), (120269, 10), (120269, 10), (120269, 9))

In [None]:
# data1, data2 and data3 are each split separately and scored (roc_auc_score) with StandardScaler and the baseline models.

# Data (original): NA rows removed
# Val_StandardScaler_roc_auc_Score
# knn:0.6999108449752681 | Decisiontree:0.5981121724260852 | RandomForest:0.8389143220379409 | SVC:0.6346719146049915

# Data1: NA rows removed, NumberRealEstateLoansOrLines dropped
# Val_StandardScaler_roc_auc_Score
# knn:0.7055634214109568 | Decisiontree:0.6085227884570642 | RandomForest:0.8374181642168389 | SVC:0.5989294546664828

# Data2: NA rows removed, NumberOfTime60-89DaysPastDueNotWorse dropped
# Val_StandardScaler_roc_auc_score
# knn:0.6886573413674218 | Decisiontree:0.5998815513007719 | RandomForest:0.8263761030448962 | SVC:0.6240056002588437

# Data3: NA rows removed, NumberRealEstateLoansOrLines and NumberOfTime60-89DaysPastDueNotWorse dropped
# Val_StandardScaler_roc_auc_score
# knn:0.6927513756916053 | Decisiontree:0.5998646342350784 | RandomForest:0.8308144100985777 | SVC:0.6484756554235152


# Data1 has the highest validation roc_auc_score for knn and Decisiontree, but the gap to the original Data is small.
# ==> RandomForest will be the main model, so the original Data, on which RF scores highest, is used to build the model.

################## For details on these results, see documents 1-1, 1-2, 1-3
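
The choice between the four data sets can be checked by tabulating the recorded validation scores. The sketch below copies them, rounded to four decimals, into a frame; `variants`, `best_per_model` and `rf_gap` are names introduced only for this illustration.

```python
import pandas as pd

# Validation ROC AUC per data set, copied from the comments above (4 decimals).
variants = pd.DataFrame(
    {
        "knn": [0.6999, 0.7056, 0.6887, 0.6928],
        "tree": [0.5981, 0.6085, 0.5999, 0.5999],
        "rf": [0.8389, 0.8374, 0.8264, 0.8308],
        "svc": [0.6347, 0.5989, 0.6240, 0.6485],
    },
    index=["Data", "Data1", "Data2", "Data3"],
)

best_per_model = variants.idxmax()              # best data set for each model
rf_gap = variants["rf"].max() - variants["rf"]  # distance from the best RF score
print(best_per_model)
print(rf_gap)
```

No single data set wins for every model, which is why the decision rests on the planned main model, RandomForest, rather than on an overall ranking.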


