
# Python Machine Learning Practice
## Basic Concepts of Decision Trees for Classification


In [None]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

x_train, x_test, y_train, y_test = train_test_split(cancer.data,
                                                    cancer.target,
                                                    test_size=0.3,
                                                    random_state=12)


In [None]:
print(cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry 
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 3 is Mean Radius, f

In [None]:
print(cancer.data.shape)

(569, 30)
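For orientation, a minimal sketch of the class labels and the split sizes, using the same `test_size=0.3` and `random_state=12` as the split above. The name `class_counts` is illustrative and not part of the original notebook.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

cancer = load_breast_cancer()

# Label 0 is 'malignant', label 1 is 'benign'
print(cancer.target_names)

# Number of samples per class, indexed by label
class_counts = np.bincount(cancer.target)
print(class_counts)

x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=12)

# The test set receives 30% of the 569 samples, rounded up
print(x_train.shape, x_test.shape)
```

With 171 test samples, each misclassified test sample moves the test accuracy by roughly 0.006, which is worth keeping in mind when comparing models on this single split.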


In [None]:
print(cancer.feature_names)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
 'mean smoothness' 'mean compactness' 'mean concavity'
 'mean concave points' 'mean symmetry' 'mean fractal dimension'
 'radius error' 'texture error' 'perimeter error' 'area error'
 'smoothness error' 'compactness error' 'concavity error'
 'concave points error' 'symmetry error' 'fractal dimension error'
 'worst radius' 'worst texture' 'worst perimeter' 'worst area'
 'worst smoothness' 'worst compactness' 'worst concavity'
 'worst concave points' 'worst symmetry' 'worst fractal dimension']


The 30 feature names above are the mean, standard error, and "worst" value of these ten base measurements:

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

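### Methods of the DecisionTreeClassifier API
- fit(X, y) : trains the Decision Tree model on the training data
- predict(X) : returns the predictions the model computes for the input data
- score(X, y) : returns the model's performance metric (mean accuracy) on the given data and labels
- get_depth() : returns the depth of the trained Decision Tree
- get_n_leaves() : returns the number of leaf nodes in the trained Decision Tree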
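As a small illustration of how these methods relate, the sketch below checks that `score(X, y)` gives the same value as applying `accuracy_score` to the output of `predict(X)`. It reuses the split parameters from this notebook; `score_value` and `manual_value` are names chosen here for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=12)

dt = DecisionTreeClassifier(random_state=12)
dt.fit(x_train, y_train)  # fit returns the fitted estimator itself

# score(X, y) is the mean accuracy of predict(X) against y
score_value = dt.score(x_test, y_test)
manual_value = accuracy_score(y_test, dt.predict(x_test))
print(score_value, manual_value)
```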

In [None]:
# Import DecisionTreeClassifier
from sklearn.tree import DecisionTreeClassifier

# Create the DecisionTreeClassifier object
dt = DecisionTreeClassifier(random_state=12)

# Train the Decision Tree model with fit
dt.fit(x_train, y_train)

DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=None, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=12, splitter='best')

In [None]:
# Check the depth of the trained tree with get_depth()
print("Depth of tree: ", dt.get_depth())

Depth of tree:  6


In [None]:
# Check the number of leaf nodes of the trained tree with get_n_leaves()
print("Number of leaves: ", dt.get_n_leaves())

Number of leaves:  13


In [None]:
# Predict the test data set with predict
y_pred = dt.predict(x_test)
print(y_pred[0:3])

[0 1 1]


In [None]:
print(y_test[0:3])

[0 1 1]


In [None]:
# Compute the accuracy on the training data set
dt_train_score = dt.score(x_train, y_train)
print('Accuracy : ', dt_train_score)

Accuracy :  1.0


In [None]:
# Compute the accuracy on the test data set
dt_score = dt.score(x_test, y_test)
print('Accuracy : ', dt_score)

Accuracy :  0.9122807017543859


In [None]:
from sklearn.metrics import accuracy_score, precision_score, recall_score

# Compute accuracy, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}'
      .format(accuracy, precision, recall))

Accuracy: 0.9123, Precision: 0.9107, Recall: 0.9533
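To make the precision and recall values easier to interpret, here is a minimal sketch that recomputes them from the confusion matrix. By default `precision_score` and `recall_score` treat label 1 as the positive class, which in this dataset is `benign`. The names `precision_manual` and `recall_manual` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=12)

dt = DecisionTreeClassifier(random_state=12).fit(x_train, y_train)
y_pred = dt.predict(x_test)

# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp with label 1 as positive
tn, fp, fn, tp = confusion_matrix(y_test, y_pred).ravel()

# Of the samples predicted as 1, the share that really are 1
precision_manual = tp / (tp + fp)
# Of the samples that really are 1, the share predicted as 1
recall_manual = tp / (tp + fn)
print(precision_manual, recall_manual)
```

If the malignant class (label 0) is the one of interest, passing `pos_label=0` to `precision_score` and `recall_score` reports the metrics for that class instead.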


### Feature importances of the trained Decision Tree model
- The feature_importances_ attribute of DecisionTreeClassifier gives the importance of each feature

In [None]:
# Create the DecisionTreeClassifier object
dt = DecisionTreeClassifier(random_state=12)

# Train the Decision Tree model with fit
dt.fit(x_train, y_train)

# Print the importance of each feature next to its name
for name, importance in zip(cancer.feature_names, dt.feature_importances_):
    print('{0}: {1:.3f}'.format(name, importance))

mean radius: 0.000
mean texture: 0.000
mean perimeter: 0.000
mean area: 0.014
mean smoothness: 0.007
mean compactness: 0.000
mean concavity: 0.000
mean concave points: 0.041
mean symmetry: 0.000
mean fractal dimension: 0.000
radius error: 0.000
texture error: 0.000
perimeter error: 0.000
area error: 0.009
smoothness error: 0.000
compactness error: 0.000
concavity error: 0.000
concave points error: 0.024
symmetry error: 0.000
fractal dimension error: 0.000
worst radius: 0.011
worst texture: 0.024
worst perimeter: 0.777
worst area: 0.011
worst smoothness: 0.011
worst compactness: 0.000
worst concavity: 0.011
worst concave points: 0.061
worst symmetry: 0.000
worst fractal dimension: 0.000
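Most of the 30 importances above are zero, so a sorted view can be easier to read. The following is a sketch, not part of the original notebook, that ranks the features with `np.argsort` and prints only those with non-zero importance. The names `order` and `total` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=12)

dt = DecisionTreeClassifier(random_state=12).fit(x_train, y_train)
importances = dt.feature_importances_

# Indices of the features, most important first
order = np.argsort(importances)[::-1]

# Print only the features with non-zero importance
for idx in order:
    if importances[idx] > 0:
        print('{0}: {1:.3f}'.format(cancer.feature_names[idx], importances[idx]))

# The importances are normalized, so they sum to 1
total = importances.sum()
print(total)
```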


### Checking the depth and number of leaf nodes without setting hyperparameters
- get_depth() : returns the depth of the trained Decision Tree
- get_n_leaves() : returns the number of leaf nodes in the trained Decision Tree

In [None]:
# Check the depth of the trained tree with get_depth()
print("Depth of tree: ", dt.get_depth())

# Check the number of leaf nodes of the trained tree with get_n_leaves()
print("Number of leaves: ", dt.get_n_leaves())

Depth of tree:  6
Number of leaves:  13


### Checking the depth and number of leaf nodes after setting max_depth


In [None]:
# Create the DecisionTreeClassifier object
dt = DecisionTreeClassifier(max_depth=3, random_state=12)

# Train the Decision Tree model with fit
dt.fit(x_train, y_train)

# Predict the test data set with predict
y_pred = dt.predict(x_test)

In [None]:
# Check the depth of the trained tree with get_depth()
print("Depth of tree: ", dt.get_depth())

# Check the number of leaf nodes of the trained tree with get_n_leaves()
print("Number of leaves: ", dt.get_n_leaves())

# Compute accuracy, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}'
      .format(accuracy, precision, recall))

Depth of tree:  3
Number of leaves:  8
Accuracy: 0.9123, Precision: 0.9182, Recall: 0.9439


In [None]:
# Compute accuracy, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}'
      .format(accuracy, precision, recall))

Accuracy: 0.9123, Precision: 0.9182, Recall: 0.9439
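To see how `max_depth` trades training fit against test accuracy, one possible sketch is to sweep a few values and record the resulting tree size and both accuracies. The list of depths and the `results` structure are choices made here for illustration. `None` means no depth limit, as in the first model of this notebook.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=12)

results = []
for depth in [1, 2, 3, 4, 5, None]:
    dt = DecisionTreeClassifier(max_depth=depth, random_state=12)
    dt.fit(x_train, y_train)
    # Record the limit, the actual depth, the leaf count, and both accuracies
    results.append((depth, dt.get_depth(), dt.get_n_leaves(),
                    dt.score(x_train, y_train), dt.score(x_test, y_test)))

for depth, actual_depth, n_leaves, train_acc, test_acc in results:
    print('max_depth={0}: depth={1}, leaves={2}, train={3:.4f}, test={4:.4f}'
          .format(depth, actual_depth, n_leaves, train_acc, test_acc))
```

A tree of depth d has at most 2^d leaves, so a small `max_depth` also bounds the number of leaf nodes.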


### Checking the depth and number of leaf nodes after setting max_leaf_nodes


In [None]:
# Create the DecisionTreeClassifier object
dt = DecisionTreeClassifier(max_leaf_nodes=9, random_state=12)

# Train the Decision Tree model with fit
dt.fit(x_train, y_train)

# Predict the test data set with predict
y_pred = dt.predict(x_test)

In [None]:
# Check the depth of the trained tree with get_depth()
print("Depth of tree: ", dt.get_depth())

# Check the number of leaf nodes of the trained tree with get_n_leaves()
print("Number of leaves: ", dt.get_n_leaves())

Depth of tree:  4
Number of leaves:  9


In [None]:
# Compute accuracy, precision, and recall
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

print('Accuracy: {0:.4f}, Precision: {1:.4f}, Recall: {2:.4f}'
      .format(accuracy, precision, recall))

Accuracy: 0.9181, Precision: 0.9189, Recall: 0.9533
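The same kind of sweep can be sketched for `max_leaf_nodes`. When this parameter is set, scikit-learn grows the tree best-first and stops once the leaf limit is reached, so the leaf count never exceeds the limit while the depth is left free. The list of limits and the `leaf_results` name are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

cancer = load_breast_cancer()
x_train, x_test, y_train, y_test = train_test_split(
    cancer.data, cancer.target, test_size=0.3, random_state=12)

leaf_results = []
for max_leaves in [2, 4, 9, 13]:
    dt = DecisionTreeClassifier(max_leaf_nodes=max_leaves, random_state=12)
    dt.fit(x_train, y_train)
    # Record the limit, the actual leaf count, the depth, and the test accuracy
    leaf_results.append((max_leaves, dt.get_n_leaves(), dt.get_depth(),
                         dt.score(x_test, y_test)))

for max_leaves, n_leaves, depth, test_acc in leaf_results:
    print('max_leaf_nodes={0}: leaves={1}, depth={2}, test={3:.4f}'
          .format(max_leaves, n_leaves, depth, test_acc))
```

Differences of one or two test samples between settings are small on a single split of this size, so they are best read as a rough indication rather than a ranking.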
