# note2
> Machine learning summary, feature engineering
- toc: true
- branch: master
- badges: false
- comments: true
- author: pinkocto
- categories: [python]

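## The Machine Learning Procedure
1. Remove or otherwise handle missing values and outliers in the data (visualization, hypothesis testing, ...)
2. Declare $X$ (explanatory variables) and $Y$ (target variable)
3. Split the data into training and validation sets
4. Train an algorithm on the training data
5. Evaluate the model on the validation data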

In [1]:
import pandas as pd

In [5]:
#hide
df1 = pd.read_csv('./data/Data01.csv', encoding='cp949')

In [8]:
df1.shape

(51304, 18)

In [7]:
df1.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 51304 entries, 0 to 51303
Data columns (total 18 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   Unnamed: 0         51304 non-null  int64  
 1   id                 51304 non-null  int64  
 2   type_of_contract   51300 non-null  object 
 3   type_of_contract2  51303 non-null  object 
 4   channel            51304 non-null  object 
 5   datetime           51304 non-null  object 
 6   Term               51304 non-null  int64  
 7   payment_type       51304 non-null  object 
 8   product            51303 non-null  object 
 9   amount             51304 non-null  int64  
 10  state              51304 non-null  object 
 11  overdue_count      51304 non-null  int64  
 12  overdue            51302 non-null  object 
 13  credit rating      42521 non-null  float64
 14  bank               48544 non-null  object 
 15  cancellation       51279 non-null  object 
 16  age                405

In [9]:
df1.isnull().sum()

Unnamed: 0               0
id                       0
type_of_contract         4
type_of_contract2        1
channel                  0
datetime                 0
Term                     0
payment_type             0
product                  1
amount                   0
state                    0
overdue_count            0
overdue                  2
credit rating         8783
bank                  2760
cancellation            25
age                  10795
Mileage              10795
dtype: int64

In [10]:
df2 = df1.dropna() # drop every row that has a missing value in any column

In [11]:
df2.isnull().sum()

Unnamed: 0           0
id                   0
type_of_contract     0
type_of_contract2    0
channel              0
datetime             0
Term                 0
payment_type         0
product              0
amount               0
state                0
overdue_count        0
overdue              0
credit rating        0
bank                 0
cancellation         0
age                  0
Mileage              0
dtype: int64
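Note that `dropna()` with no arguments drops a row if any of its columns is missing, including columns such as `bank` that are not used as features later in this note. As a minimal sketch, assuming only the modeling columns matter, the `subset` argument restricts the check to those columns. The toy frame and its values below are hypothetical, not taken from `Data01.csv`.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values): 'bank' has gaps but is not a model feature.
toy = pd.DataFrame({
    "Term": [60, 60, 12, 12],
    "age": [43.0, np.nan, 60.0, 51.0],
    "bank": [None, "A", None, "B"],
    "state": ["계약확정", "계약확정", "해약확정", "계약확정"],
})

# Drops a row if ANY column is missing.
dropped_any = toy.dropna()

# Drops a row only if one of the listed columns is missing.
dropped_subset = toy.dropna(subset=["Term", "age", "state"])

print(len(dropped_any), len(dropped_subset))
```

Whether to keep rows with gaps in unused columns is a judgment call; keeping them preserves more training data.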

In [14]:
df2['state'].value_counts()

계약확정     39776
해약확정       667
기간만료        25
해약진행중       12
Name: state, dtype: int64

In [17]:
# Target: 0 = kept (계약확정 confirmed, 기간만료 expired), 1 = cancelled (해약확정 cancelled, 해약진행중 cancelling)
Y = df2['state'].replace({'계약확정': 0, '기간만료': 0, '해약확정': 1, '해약진행중': 1})
X = df2[['Term', 'amount','age','overdue_count','credit rating']]
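Before modeling, it helps to see how imbalanced the target is. The sketch below recomputes the class shares from the `value_counts()` output shown earlier, using the same 0/1 grouping as the encoding in this note; the variable names are illustrative.

```python
import pandas as pd

# Counts copied from the value_counts() output earlier in this note.
counts = pd.Series({"계약확정": 39776, "해약확정": 667, "기간만료": 25, "해약진행중": 12})

# States encoded as 1 (cancelled or cancellation in progress).
positive_states = ["해약확정", "해약진행중"]
n_positive = counts[positive_states].sum()
n_total = counts.sum()

positive_share = n_positive / n_total
majority_share = 1 - positive_share
print(f"positive share: {positive_share:.4f}, majority share: {majority_share:.4f}")
```

Fewer than 2% of the rows are cancellations, so any accuracy figure should be read against that majority share.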

In [19]:
X.head()

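|   | Term | amount | age | overdue_count | credit rating |
|---|------|--------|-----|---------------|---------------|
| 0 | 60 | 96900 | 43.0 | 0 | 9.0 |
| 1 | 60 | 102900 | 62.0 | 0 | 2.0 |
| 2 | 60 | 96900 | 60.0 | 0 | 8.0 |
| 3 | 12 | 66900 | 60.0 | 0 | 5.0 |
| 4 | 12 | 66900 | 51.0 | 12 | 8.0 |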


- Given the contract term, contract amount, customer age, overdue count, and customer credit rating, let's build an algorithm that predicts whether or not the customer will cancel.

In [21]:
# Scikit Learn
# Scipy + Learning Tool kit
# feature engineering + algorithms

In [20]:
from sklearn.model_selection import train_test_split # train/validation split
from sklearn.tree import DecisionTreeClassifier # algorithm
from sklearn.metrics import accuracy_score # accuracy metric

In [22]:
X_train, X_test, Y_train, Y_test = train_test_split(X, Y,
                                                    test_size=0.3)
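The split above is random and unseeded, so the scores change from run to run, and with a rare positive class the two splits can end up with slightly different class ratios. A minimal sketch of one common alternative, using the `stratify` and `random_state` arguments of `train_test_split`, is shown below on synthetic data; the arrays and the seed value are assumptions for illustration only.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced toy data (hypothetical): 1000 rows, 2% positives.
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(1000, 5))
y_toy = np.zeros(1000, dtype=int)
y_toy[:20] = 1

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy,
    test_size=0.3,
    stratify=y_toy,    # keep the class ratio the same in both splits
    random_state=42,   # make the split reproducible
)

print(y_tr.mean(), y_te.mean())
```

With stratification both splits keep the 2% positive rate of the toy labels.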

In [23]:
model = DecisionTreeClassifier()
model.fit(X_train, Y_train)
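With default settings, `DecisionTreeClassifier` keeps splitting until the leaves are pure or too small to split, which can overfit. The sketch below compares a depth-limited tree with an unrestricted one on synthetic data from `make_classification`; the data, the depth of 3, and the seeds are assumptions for illustration, not settings used in this note.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (hypothetical), not the contract dataset.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
X_a, X_b, y_a, y_b = train_test_split(X_syn, y_syn, test_size=0.3, random_state=0)

scores = {}
for depth in (3, None):  # None = no depth limit (the default)
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_a, y_a)
    # Store (train accuracy, test accuracy) for this depth.
    scores[depth] = (tree.score(X_a, y_a), tree.score(X_b, y_b))

for depth, (train_acc, test_acc) in scores.items():
    print(f"max_depth={depth}: train={train_acc:.3f}, test={test_acc:.3f}")
```

A large gap between training and test accuracy for the unrestricted tree is a sign of overfitting.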

In [24]:
# Check how well the model fit the training data
# Check how well it predicts the validation data (generalization)
Y_train_pred = model.predict(X_train)
Y_test_pred = model.predict(X_test)

In [25]:
accuracy_score(Y_train, Y_train_pred)

0.9865542066629023

In [26]:
accuracy_score(Y_test, Y_test_pred)

0.980566534914361
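Accuracy alone can be misleading with this target. Going by the state counts shown earlier, roughly 98.3% of the rows belong to class 0, so a model that always predicts 0 would score about that much, and the test accuracy above is not clearly better than that baseline. The sketch below shows such a baseline with `DummyClassifier`, together with recall and a confusion matrix, on hypothetical labels with a similar imbalance; it is not run on the contract data.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 1000 rows with 2% positives.
y_true = np.zeros(1000, dtype=int)
y_true[:20] = 1
X_dummy = np.zeros((1000, 1))  # features are ignored by DummyClassifier

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy="most_frequent").fit(X_dummy, y_true)
y_base = baseline.predict(X_dummy)

base_acc = accuracy_score(y_true, y_base)
base_recall = recall_score(y_true, y_base)  # share of positives that were found
base_cm = confusion_matrix(y_true, y_base)  # rows: true class, columns: predicted class

print("baseline accuracy:", base_acc)
print("baseline recall:", base_recall)
print(base_cm)
```

The baseline reaches high accuracy while finding none of the positives, which is why recall or a confusion matrix is worth checking alongside accuracy for the cancellation model.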