## Introduction to scikit-learn
https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html

Models are implemented as classes. They are built from three main abstract interfaces.
1. Estimator
 - Has a `fit(X, y)` method, plus `get_params` and `set_params`.
2. Transformer
 - Has a `transform` method.
 - `fit` must be called before `transform`.
 - Every Transformer is also an Estimator.
3. Predictor
 - Has a `predict(X)` method.
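
The three interfaces above can be sketched on a tiny example. This is a minimal illustration, assuming `StandardScaler` as the transformer and `DecisionTreeClassifier` as the predictor; the toy array and the variable names are invented here and are not part of the original notebook.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Toy data, invented for illustration only
X_toy = np.array([[0.0, 1.0], [1.0, 3.0], [2.0, 5.0], [3.0, 7.0]])
y_toy = np.array([0, 0, 1, 1])

# Estimator interface: set_params / get_params / fit
clf = DecisionTreeClassifier()
clf.set_params(max_depth=2)
params = clf.get_params()  # dict of constructor parameters

# Transformer interface: fit first, then transform
scaler = StandardScaler().fit(X_toy)
X_scaled = scaler.transform(X_toy)

# Predictor interface: fit, then predict
clf.fit(X_scaled, y_toy)
y_hat = clf.predict(X_scaled)
print(params["max_depth"], X_scaled.shape, y_hat)
```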

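## Modeling process
1. Analysis planning
 - Objective
 - Identify and define the timeline, organization, cost, and data to be collected
2. Data collection and storage
 - [train / validation] / test
 - train: used to fit the model, i.e. to learn the weights w
 - validation: used to select tuning parameters (hyperparameters)
 - test: used to measure the model's actual performance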
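
One way to obtain the three subsets is to call `train_test_split` twice: first to hold out the test set, then to split the remainder into train and validation. The sketch below assumes the breast cancer dataset used in this notebook; the split ratios (0.2 and 0.25), `random_state=0`, and the variable names are illustrative choices, not values from the original.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Step 1: hold out the test set (assumed ratio: 20%)
X_trval, X_te, y_trval, y_te = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Step 2: split the remainder into train and validation (assumed ratio: 25%)
X_tr, X_val, y_tr, y_val = train_test_split(
    X_trval, y_trval, test_size=0.25, random_state=0)

print(X_tr.shape, X_val.shape, X_te.shape)
```

The model is fit on `X_tr`, hyperparameters are compared on `X_val`, and `X_te` is touched only once at the end.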

In [None]:
import sklearn

In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(data.data, data.target, random_state=42)

model = DecisionTreeClassifier(criterion='entropy')
model

$y \in \{0, 1\}$

$p(y=1) + p(y=0) = 1$

$$Loss(p(y=1)) = - y \log p(y=1) - (1-y) \log (1 - p(y=1))$$
$$Loss(p(y=1)) = 1 - y \, p(y=1)^2 - (1-y)(1 - p(y=1))^2$$
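
To get a feel for the two formulas above, they can be evaluated for a single label. The function names below are hypothetical helpers written for this illustration; the first formula is the usual cross-entropy (log) loss, and the second is presumably related to the Gini criterion, although the notebook does not say so explicitly.

```python
import numpy as np

def log_loss_single(y, p):
    """First formula: -y log p - (1 - y) log(1 - p), with p = p(y=1)."""
    return -y * np.log(p) - (1 - y) * np.log(1 - p)

def quadratic_loss_single(y, p):
    """Second formula: 1 - y p^2 - (1 - y)(1 - p)^2, with p = p(y=1)."""
    return 1 - y * p**2 - (1 - y) * (1 - p) ** 2

# For a true label y = 1, both losses shrink as p(y=1) approaches 1
for p in (0.5, 0.9, 0.99):
    print(p, log_loss_single(1, p), quadratic_loss_single(1, p))
```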

In [None]:
data.keys()

The input data X is stored in `data.data`, with N=569 samples and D=30 features.

In [None]:
data.data.shape

In [None]:
data.data.mean(axis=0)

In [None]:
data.data.std(axis=0)  # per-feature standard deviation, matching mean(axis=0) above

In [None]:
print(data.DESCR)

In [None]:
data.target[-10:]

In [None]:
data.feature_names

In [None]:
data.target_names

In [None]:
data.target.shape

In [None]:
print(X_train.shape, X_test.shape, y_train.shape, y_test.shape)

In [None]:
143*4

In [None]:
data.data[0]

In [None]:
data.target[0]

[Quiz] Count how often 1 and 0 occur in target.

In [None]:
import numpy as np
uniq, freq = np.unique(data.target, return_counts=True)
uniq, freq

In [None]:
import pandas as pd
pd.Series(data.target).value_counts()

Fit the model on the training data, then predict on the test data to produce the predictions y_pred.

In [None]:
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
y_pred

Compare these with the true labels of the test data.

In [None]:
y_test

Accuracy on the test data can be computed as follows.

In [None]:
(y_pred == y_test).mean()

The `score` method performs the steps above in a single call.

In [None]:
model.score(X_test, y_test)
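
As a self-contained check, the sketch below recomputes the accuracy three ways: by hand, with `score`, and with `accuracy_score` from `sklearn.metrics`. The `random_state=0` passed to the tree is an assumption added here so that the run is reproducible; the original model does not fix it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)

# random_state=0 is an assumption for reproducibility
model = DecisionTreeClassifier(criterion='entropy', random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

acc_manual = (y_pred == y_test).mean()       # fraction of correct predictions
acc_score = model.score(X_test, y_test)      # same quantity via the estimator
acc_metric = accuracy_score(y_test, y_pred)  # same quantity via sklearn.metrics
print(acc_manual, acc_score, acc_metric)
```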

What the model has learned is stored in its `tree_` attribute. The graphviz Python package can be installed simply as follows.
```bash
pip install graphviz
```

In [None]:
tree = model.tree_
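
The `tree_` object exposes the fitted structure as arrays. The following sketch, assuming the same data split as above and an added `random_state=0` for reproducibility, reads the node count, the depth, and the root split directly from it.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)
# random_state=0 is an assumption for reproducibility
model = DecisionTreeClassifier(criterion='entropy', random_state=0)
model.fit(X_train, y_train)

tree = model.tree_
print("nodes:", tree.node_count, "depth:", tree.max_depth)

# Node 0 is the root: look up the feature and threshold it splits on
root_feature = data.feature_names[tree.feature[0]]
root_threshold = tree.threshold[0]
print(f"root split: {root_feature} <= {root_threshold:.4f}")

# Leaves are the nodes whose left-child index is -1
n_leaves = int((tree.children_left == -1).sum())
print("leaves:", n_leaves)
```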

In [None]:
import graphviz
from sklearn.tree import export_graphviz, plot_tree

The `plot_tree` function draws the tree together with its split points, as shown below.

In [None]:
plot_tree(model)

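Prerequisites for the steps below:
1. pip install graphviz
2. Download the installer from the [graphviz site](https://graphviz.org/download/) and install it.
3. Add the `bin` folder of the graphviz installation directory to the PATH environment variable (also available as an installer option).
4. Restart the computer.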

In [None]:
dot_data = export_graphviz(model, out_file=None, 
                     feature_names=data.feature_names,  
                     class_names=data.target_names,  
                     filled=True, rounded=True,  
                     special_characters=True)  
graph = graphviz.Source(dot_data)  
graph

In [26]:
export_graphviz(model, out_file='tree.dot',
                feature_names=data.feature_names,  
                class_names=data.target_names,  
                filled=True, rounded=True,  
                special_characters=True)

with open("tree.dot") as f:
    dot_graph = f.read()
dot = graphviz.Source(dot_graph); dot.format='png'
dot.render(filename='tree')

'tree.png'

In [27]:
from sklearn.tree import export_text
treetext = export_text(model, decimals=4,
                       feature_names=list(data.feature_names))
print(treetext)

|--- mean concave points <= 0.0513
|   |--- worst radius <= 16.8300
|   |   |--- area error <= 48.7000
|   |   |   |--- worst texture <= 30.1450
|   |   |   |   |--- class: 1
|   |   |   |--- worst texture >  30.1450
|   |   |   |   |--- worst concavity <= 0.2006
|   |   |   |   |   |--- class: 1
|   |   |   |   |--- worst concavity >  0.2006
|   |   |   |   |   |--- mean concavity <= 0.0473
|   |   |   |   |   |   |--- class: 0
|   |   |   |   |   |--- mean concavity >  0.0473
|   |   |   |   |   |   |--- perimeter error <= 2.6285
|   |   |   |   |   |   |   |--- class: 1
|   |   |   |   |   |   |--- perimeter error >  2.6285
|   |   |   |   |   |   |   |--- class: 0
|   |   |--- area error >  48.7000
|   |   |   |--- concavity error <= 0.0200
|   |   |   |   |--- class: 0
|   |   |   |--- concavity error >  0.0200
|   |   |   |   |--- class: 1
|   |--- worst radius >  16.8300
|   |   |--- mean texture <= 16.1900
|   |   |   |--- class: 1
|   |   |--- mean texture >  16.1900
|   |   |

[Quiz] Adjust the options of DecisionTreeClassifier for the model above to improve accuracy on the test data.

In [28]:
m = DecisionTreeClassifier(criterion='entropy', max_depth=4, min_samples_leaf=5)
m.fit(X_train, y_train)
score = m.score(X_test, y_test)
score

0.965034965034965

In [29]:
best= {}
best['score'] = 0
mds = [3,4,5]
msls = range(2, 11)
mlns = range(7, 12)
for md in mds:
    for msl in msls:
        for mln in mlns:
            m = DecisionTreeClassifier(criterion='entropy', max_depth=md, 
                                       min_samples_leaf=msl, max_leaf_nodes=mln)
            m.fit(X_train, y_train)
            score = m.score(X_test, y_test)
            if score > best['score']:
                print(f"current best score is {score} at md={md}, msl={msl}, mln={mln}")
                best['score'], best['md'], best['msl'], best['mln'], best['model'] = score, md, msl, mln, m
print(best)

current best score is 0.958041958041958 at md=3, msl=2, mln=7
current best score is 0.965034965034965 at md=3, msl=2, mln=9
current best score is 0.972027972027972 at md=4, msl=4, mln=10
{'score': 0.972027972027972, 'md': 4, 'msl': 4, 'mln': 10, 'model': DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='entropy',
                       max_depth=4, max_features=None, max_leaf_nodes=10,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=4, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=None, splitter='best')}
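
The loop above selects parameters by their score on the test data, so the best score it reports is likely an optimistic estimate of real performance. A hedged alternative is to select on cross-validation folds of the training data and use the test data only for the final measurement. The sketch below uses `GridSearchCV` over the same parameter ranges; `cv=5` and `random_state=0` are assumptions made for this illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)

# Same ranges as the manual loop above
param_grid = {
    "max_depth": [3, 4, 5],
    "min_samples_leaf": list(range(2, 11)),
    "max_leaf_nodes": list(range(7, 12)),
}

search = GridSearchCV(
    DecisionTreeClassifier(criterion="entropy", random_state=0),
    param_grid,
    cv=5,                # assumed number of folds
    scoring="accuracy",
)
search.fit(X_train, y_train)  # model selection uses training data only

print(search.best_params_, search.best_score_)
test_acc = search.score(X_test, y_test)  # test data used once, at the end
print(test_acc)
```

By default `GridSearchCV` refits the best configuration on the full training set, so `search` can be used directly as a predictor afterwards.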


### Trying out a Scaler
The `fit_transform` method fits and transforms in one step. The two lines below produce the same result.
```python
X_tr_scaled = scaler.fit_transform(X_train)
X_tr_scaled = scaler.fit(X_train).transform(X_train)
```

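StandardScaler standardizes each feature to zero mean and unit standard deviation. It only shifts and rescales the data, so it does not change the shape of the distribution or make the data normally distributed.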

In [33]:
from sklearn.preprocessing import StandardScaler

print(X_train[:1])

scaler = StandardScaler()
X_tr_scaled = scaler.fit_transform(X_train)
print(X_tr_scaled[:1])
X_tst_scaled = scaler.transform(X_test)

[[1.289e+01 1.312e+01 8.189e+01 5.159e+02 6.955e-02 3.729e-02 2.260e-02
  1.171e-02 1.337e-01 5.581e-02 1.532e-01 4.690e-01 1.115e+00 1.268e+01
  4.731e-03 1.345e-02 1.652e-02 5.905e-03 1.619e-02 2.081e-03 1.362e+01
  1.554e+01 8.740e+01 5.770e+02 9.616e-02 1.147e-01 1.186e-01 5.366e-02
  2.309e-01 6.915e-02]]
[[-0.34913849 -1.43851335 -0.41172595 -0.39047943 -1.86366229 -1.26860704
  -0.82617052 -0.95286585 -1.72936805 -0.9415409  -0.86971355 -1.35865347
  -0.83481506 -0.57230673 -0.74586846 -0.65398319 -0.52583524 -0.94677147
  -0.53781728 -0.63449458 -0.54268486 -1.65565452 -0.58986401 -0.52555985
  -1.51066925 -0.89149994 -0.75021715 -0.91671059 -0.92508585 -0.80841115]]


In [31]:
means, stds = X_train.mean(axis=0), X_train.std(axis=0)
((X_train - means)/stds)[:1]

array([[-0.34913849, -1.43851335, -0.41172595, -0.39047943, -1.86366229,
        -1.26860704, -0.82617052, -0.95286585, -1.72936805, -0.9415409 ,
        -0.86971355, -1.35865347, -0.83481506, -0.57230673, -0.74586846,
        -0.65398319, -0.52583524, -0.94677147, -0.53781728, -0.63449458,
        -0.54268486, -1.65565452, -0.58986401, -0.52555985, -1.51066925,
        -0.89149994, -0.75021715, -0.91671059, -0.92508585, -0.80841115]])

In [34]:
X_tr_scaled.mean(0)

array([-1.14384245e-15, -3.57955710e-15, -3.22772586e-15, -1.17016464e-16,
       -2.68851191e-15,  7.47445924e-16,  7.68555798e-16,  4.09166702e-16,
       -3.85711286e-17, -7.70380108e-16,  5.29310555e-16,  4.66502163e-17,
       -2.30384308e-16, -6.49193792e-16, -2.26605380e-15, -2.03019656e-15,
        1.85037171e-17,  1.48524907e-15, -1.44615671e-15,  1.92855643e-16,
        1.53815406e-15,  1.26659246e-15, -1.70338443e-15, -1.86314188e-15,
        5.40777647e-16,  5.78566928e-16, -3.92487295e-16, -1.41514343e-16,
        6.83073837e-16, -2.73933197e-15])

In [35]:
X_tr_scaled.std(0)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [36]:
m2 = DecisionTreeClassifier(criterion='entropy', max_depth=5)
m2.fit(X_tr_scaled, y_train)
m2.score(X_tst_scaled, y_test)

0.958041958041958

In [None]:
export_graphviz(m2, out_file='tree2.dot',
                feature_names=data.feature_names,  
                class_names=data.target_names,  
                filled=True, rounded=True,  
                special_characters=True)

with open("tree2.dot") as f:
    dot_graph = f.read()
dot = graphviz.Source(dot_graph); dot.format='png'
dot.render(filename='tree2')

'tree2.png'

In [None]:
model.score(X_train, y_train)

1.0

In [None]:
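# Note: m2 was fit on the scaled features (X_tr_scaled), but this call scores it
# on the unscaled X_train, which explains the low value below.
# A consistent training score would use m2.score(X_tr_scaled, y_train).
m2.score(X_train, y_train)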

0.7535211267605634
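
The last score was computed on the unscaled `X_train` although `m2` was fit on scaled features, so the model saw inputs on a different scale than it was trained on. One way to avoid this kind of mismatch is to bundle the scaler and the model in a pipeline, so the same transformation is applied at fit and at scoring time. This is a minimal sketch using `make_pipeline`; `random_state=0` is an assumption added for reproducibility.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

data = load_breast_cancer()
X_train, X_test, y_train, y_test = train_test_split(
    data.data, data.target, random_state=42)

# The scaler is fit on the training data only, inside pipe.fit
pipe = make_pipeline(
    StandardScaler(),
    DecisionTreeClassifier(criterion="entropy", max_depth=5, random_state=0),
)
pipe.fit(X_train, y_train)

# Raw inputs are passed in; the pipeline applies the scaling itself
train_acc = pipe.score(X_train, y_train)
test_acc = pipe.score(X_test, y_test)
print(train_acc, test_acc)
```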