## load_breast_cancer: Diagnosing Breast Cancer
Kernel info = base (Python 3.9.7)

### 0. Rubric
***

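|**Evaluation item**|**Detailed criteria**|
|------------|-------------|
|1. Were the three datasets composed in a reasonable way?|The data analysis used to select features and labels was carried out systematically|
|2. Were five models successfully applied to each of the three datasets?|Model training and testing ran correctly|
|3. Were appropriate evaluation metrics chosen for each of the three datasets?|The choice of metric and the reasoning behind it are sound|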


### 1. Importing the required modules

In [117]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

print('done')

done


### 2. Preparing the data
***

In [118]:
breast_cancer = load_breast_cancer() # load the breast_cancer dataset

print(dir(breast_cancer)) # dir() lists the attributes and methods an object has

breast_cancer.keys() # check what information breast_cancer holds

['DESCR', 'data', 'feature_names', 'filename', 'frame', 'target', 'target_names']


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename'])
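The `frame` key listed above is `None` unless the dataset is loaded with `as_frame=True`. As a minimal sketch (the names `cancer_frame`, `features_df`, and `labels` are illustrative and do not appear in the original notebook), the same data can be obtained directly as pandas objects with the feature names already attached as column labels:

```python
from sklearn.datasets import load_breast_cancer

# as_frame=True asks scikit-learn to return pandas objects instead of NumPy arrays
cancer_frame = load_breast_cancer(as_frame=True)

features_df = cancer_frame.data   # DataFrame with one column per feature
labels = cancer_frame.target      # Series holding the 0/1 labels

print(features_df.shape)
print(list(features_df.columns[:3]))
print(cancer_frame.frame.shape)   # features and target combined in one DataFrame
```

This is an alternative to building the DataFrame by hand; the rest of the notebook keeps working with the NumPy arrays.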

### 3. Understanding the data
***

#### 3.1 Selecting the feature data

In [119]:
breast_cancer_data = breast_cancer.data 

print(breast_cancer_data.shape)

breast_cancer_data[0] # inspect sample 0 of breast_cancer_data

(569, 30)


array([1.799e+01, 1.038e+01, 1.228e+02, 1.001e+03, 1.184e-01, 2.776e-01,
       3.001e-01, 1.471e-01, 2.419e-01, 7.871e-02, 1.095e+00, 9.053e-01,
       8.589e+00, 1.534e+02, 6.399e-03, 4.904e-02, 5.373e-02, 1.587e-02,
       3.003e-02, 6.193e-03, 2.538e+01, 1.733e+01, 1.846e+02, 2.019e+03,
       1.622e-01, 6.656e-01, 7.119e-01, 2.654e-01, 4.601e-01, 1.189e-01])

In [120]:
breast_cancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

#### 3.2 Selecting the label data

In [121]:
breast_cancer_label = breast_cancer.target
print(breast_cancer_label.shape)
breast_cancer_label

(569,)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

#### 3.3 Printing the target names

In [122]:
breast_cancer.target_names

array(['malignant', 'benign'], dtype='<U9')
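The order of `target_names` defines the label encoding: label 0 is `malignant` and label 1 is `benign`. This matters later when reading per-class metrics. A small sketch for counting the samples in each class with `np.bincount` (the names `counts` and `class_counts` are illustrative, not part of the original notebook):

```python
import numpy as np
from sklearn.datasets import load_breast_cancer

breast_cancer = load_breast_cancer()

# np.bincount counts how often each integer label (0 and 1) occurs
counts = np.bincount(breast_cancer.target)

# Index i of target_names is the name of label i
class_counts = {
    str(name): int(n) for name, n in zip(breast_cancer.target_names, counts)
}
print(class_counts)
```

The classes are not balanced, which is one reason accuracy alone can be a misleading summary for this dataset.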

#### 3.4 Reading the dataset description

In [123]:
print(breast_cancer.DESCR)

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

### 4. Preparing the training and test data
***

In [124]:
breast_cancer_df = pd.DataFrame(data=breast_cancer_data, columns=breast_cancer.feature_names) # convert to a pandas DataFrame
breast_cancer_df

| |mean radius|mean texture|mean perimeter|mean area|mean smoothness|mean compactness|mean concavity|mean concave points|mean symmetry|mean fractal dimension|...|worst radius|worst texture|worst perimeter|worst area|worst smoothness|worst compactness|worst concavity|worst concave points|worst symmetry|worst fractal dimension|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|17.99|10.38|122.80|1001.0|0.11840|0.27760|0.30010|0.14710|0.2419|0.07871|...|25.380|17.33|184.60|2019.0|0.16220|0.66560|0.7119|0.2654|0.4601|0.11890|
|1|20.57|17.77|132.90|1326.0|0.08474|0.07864|0.08690|0.07017|0.1812|0.05667|...|24.990|23.41|158.80|1956.0|0.12380|0.18660|0.2416|0.1860|0.2750|0.08902|
|2|19.69|21.25|130.00|1203.0|0.10960|0.15990|0.19740|0.12790|0.2069|0.05999|...|23.570|25.53|152.50|1709.0|0.14440|0.42450|0.4504|0.2430|0.3613|0.08758|
|3|11.42|20.38|77.58|386.1|0.14250|0.28390|0.24140|0.10520|0.2597|0.09744|...|14.910|26.50|98.87|567.7|0.20980|0.86630|0.6869|0.2575|0.6638|0.17300|
|4|20.29|14.34|135.10|1297.0|0.10030|0.13280|0.19800|0.10430|0.1809|0.05883|...|22.540|16.67|152.20|1575.0|0.13740|0.20500|0.4000|0.1625|0.2364|0.07678|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|564|21.56|22.39|142.00|1479.0|0.11100|0.11590|0.24390|0.13890|0.1726|0.05623|...|25.450|26.40|166.10|2027.0|0.14100|0.21130|0.4107|0.2216|0.2060|0.07115|
|565|20.13|28.25|131.20|1261.0|0.09780|0.10340|0.14400|0.09791|0.1752|0.05533|...|23.690|38.25|155.00|1731.0|0.11660|0.19220|0.3215|0.1628|0.2572|0.06637|
|566|16.60|28.08|108.30|858.1|0.08455|0.10230|0.09251|0.05302|0.1590|0.05648|...|18.980|34.12|126.70|1124.0|0.11390|0.30940|0.3403|0.1418|0.2218|0.07820|
|567|20.60|29.33|140.10|1265.0|0.11780|0.27700|0.35140|0.15200|0.2397|0.07016|...|25.740|39.42|184.60|1821.0|0.16500|0.86810|0.9387|0.2650|0.4087|0.12400|


#### 4.1 Splitting into training and test data

In [125]:
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data,  # feature matrix (model input)
                                                    breast_cancer_label, # labels the model must predict
                                                    test_size=0.25, # test set size: 25% of the data is held out for testing
                                                    random_state=7) # seed for the random shuffle, so the split is reproducible

print('X_train count: ', len(X_train),', X_test count: ', len(X_test))

X_train count:  426 , X_test count:  143


In [126]:
X_train.shape, y_train.shape

((426, 30), (426,))

In [127]:
X_test.shape, y_test.shape

((143, 30), (143,))
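The split above is purely random, so the malignant/benign ratio in the test set can drift from the ratio in the full data. One optional variation, not used in this notebook, is to pass `stratify=` to `train_test_split`. The sketch below assumes the same `test_size` and `random_state`; the variable names `X_tr`, `X_te`, `y_tr`, and `y_te` are illustrative.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# stratify=y keeps the class ratio roughly equal in both subsets
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=7, stratify=y
)

# y is 0/1, so its mean is the fraction of benign samples (label 1)
print(f"all: {y.mean():.3f}, train: {y_tr.mean():.3f}, test: {y_te.mean():.3f}")
```

With stratification the three fractions should be nearly identical, which makes per-class metrics on the test set easier to compare across runs.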

In [128]:
y_train, y_test

(array([1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 0,
        1, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1,
        0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 0, 1,
        0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0,
        1, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 0,
        1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 1, 1,
        1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 0,
        1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1,
        0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1,
        0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0, 1, 1, 0,
        1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 0,
        0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 0,
        0, 1, 0, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1,
        0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 

### 5. Training several models
***
Models used:

1. Decision Tree
2. Random Forest
3. SVM (Support Vector Machine)
4. SGD (Stochastic Gradient Descent)
5. Logistic Regression

In [129]:
# Decision Tree
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)

# Random Forest
random_forest = RandomForestClassifier(random_state=5)
random_forest.fit(X_train, y_train)

# SVM
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)

# SGD
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)

# Logistic Regression
logistic_model = LogisticRegression(max_iter=2300) # a ConvergenceWarning is raised when max_iter is below 2300
logistic_model.fit(X_train, y_train) 

print('done')

done
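SVM, SGD, and logistic regression are sensitive to feature scale, and the raw features here span several orders of magnitude (sample 0 ranges from about 6e-3 to 2e3). A common remedy, which the original notebook does not apply, is to standardize the features inside a pipeline. The following is a sketch under that assumption; `scaled_models`, `test_scores`, and the `random_state=0` for SGD are choices made for illustration only.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

# Each pipeline fits the scaler on the training data only, then the classifier
scaled_models = {
    "svm": make_pipeline(StandardScaler(), SVC()),
    "sgd": make_pipeline(StandardScaler(), SGDClassifier(random_state=0)),
    "logistic": make_pipeline(StandardScaler(), LogisticRegression()),
}

test_scores = {}
for name, model in scaled_models.items():
    model.fit(X_train, y_train)
    test_scores[name] = model.score(X_test, y_test)
    print(f"{name}: test accuracy = {test_scores[name]:.3f}")
```

Putting the scaler inside the pipeline means the test data never influences the scaling statistics. Tree-based models such as the decision tree and random forest do not depend on feature scale, so they are left out here.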


### 6. Model evaluation
***

#### 6.1 Interpreting each trained model's predictions on the test data

##### 6.1.1 Decision Tree

In [130]:
y_pred = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred))
print("Train set accuracy: {:.3f}".format(decision_tree.score(X_train,y_train))) # accuracy on the training set
print("Test set accuracy: {:.3f}".format(decision_tree.score(X_test,y_test))) # accuracy on the test set

              precision    recall  f1-score   support

           0       0.95      0.78      0.85        45
           1       0.91      0.98      0.94        98

    accuracy                           0.92       143
   macro avg       0.93      0.88      0.90       143
weighted avg       0.92      0.92      0.91       143

Train set accuracy: 1.000
Test set accuracy: 0.916


##### 6.1.2 Random Forest

In [131]:
y_pred = random_forest.predict(X_test)

print(classification_report(y_test, y_pred))
print("Train set accuracy: {:.3f}".format(random_forest.score(X_train,y_train))) # accuracy on the training set
print("Test set accuracy: {:.3f}".format(random_forest.score(X_test,y_test))) # accuracy on the test set

              precision    recall  f1-score   support

           0       1.00      0.89      0.94        45
           1       0.95      1.00      0.98        98

    accuracy                           0.97       143
   macro avg       0.98      0.94      0.96       143
weighted avg       0.97      0.97      0.96       143

Train set accuracy: 1.000
Test set accuracy: 0.965


##### 6.1.3 Support Vector Machine (SVM)

In [132]:
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred))
print("Train set accuracy: {:.3f}".format(svm_model.score(X_train,y_train))) # accuracy on the training set
print("Test set accuracy: {:.3f}".format(svm_model.score(X_test,y_test))) # accuracy on the test set

              precision    recall  f1-score   support

           0       1.00      0.78      0.88        45
           1       0.91      1.00      0.95        98

    accuracy                           0.93       143
   macro avg       0.95      0.89      0.91       143
weighted avg       0.94      0.93      0.93       143

Train set accuracy: 0.918
Test set accuracy: 0.930


##### 6.1.4 Stochastic Gradient Descent Classifier (SGDClassifier)

In [133]:
y_pred = sgd_model.predict(X_test)

print(classification_report(y_test, y_pred))
print("Train set accuracy: {:.3f}".format(sgd_model.score(X_train,y_train))) # accuracy on the training set
print("Test set accuracy: {:.3f}".format(sgd_model.score(X_test,y_test))) # accuracy on the test set

              precision    recall  f1-score   support

           0       1.00      0.71      0.83        45
           1       0.88      1.00      0.94        98

    accuracy                           0.91       143
   macro avg       0.94      0.86      0.88       143
weighted avg       0.92      0.91      0.90       143

Train set accuracy: 0.908
Test set accuracy: 0.909


##### 6.1.5 Logistic Regression

In [134]:
y_pred = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred))
print("Train set accuracy: {:.3f}".format(logistic_model.score(X_train,y_train))) # accuracy on the training set
print("Test set accuracy: {:.3f}".format(logistic_model.score(X_test,y_test))) # accuracy on the test set

              precision    recall  f1-score   support

           0       0.97      0.87      0.92        45
           1       0.94      0.99      0.97        98

    accuracy                           0.95       143
   macro avg       0.96      0.93      0.94       143
weighted avg       0.95      0.95      0.95       143

Train set accuracy: 0.967
Test set accuracy: 0.951


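#### 6.2 The most important evaluation metric for this task, and why
(chosen from the metrics in sklearn.metrics)

I think **recall** is the most important metric here.  
In cancer diagnosis, **diagnosing a patient who has cancer as not having cancer** (a false negative, FN) is far more dangerous than the other kinds of error, so I think a model that minimizes FN matters most. Because this dataset encodes malignant as label 0, the number to watch is the recall of class 0 in the reports above, not the recall of class 1.

Of the five models, random forest has the highest class 0 recall (0.89, against 0.87 for logistic regression, 0.78 for the decision tree and SVM, and 0.71 for SGD), so it looks like the best fit for a breast cancer diagnosis model.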
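scikit-learn's `recall_score` defaults to `pos_label=1`, which is the benign class in this dataset, so the malignant recall has to be requested explicitly. A minimal sketch, assuming the same split and random forest settings as above (`malignant_recall`, `cm`, and `false_negatives` are illustrative names, and no particular output is claimed):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, recall_score
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=7
)

model = RandomForestClassifier(random_state=5).fit(X_train, y_train)
y_pred = model.predict(X_test)

# pos_label=0 treats malignant (label 0) as the positive class
malignant_recall = recall_score(y_test, y_pred, pos_label=0)

# labels=[0, 1] fixes the order: row 0 = truly malignant, column 1 = predicted benign
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
false_negatives = cm[0, 1]  # malignant tumors that were predicted as benign

print(f"malignant recall: {malignant_recall:.3f}")
print(f"missed malignant cases: {false_negatives}")
```

Reporting the raw count of missed malignant cases alongside the recall makes the clinical cost of each model easier to compare than a single rounded score.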