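# Rubric
|Evaluation item|Detailed criteria|Self check|
|:-------|:-------|:-----------------------|
|1. Does the work examine the composition of the three datasets and include a process of understanding the data?|Carried out data analysis to select the features and label|OK|
|2. Were five models trained on each of the three datasets, and were results obtained?|Model training and testing ran correctly and produced results|OK|
|3. Was an evaluation metric chosen for each of the three datasets, with the reasons for the choice explained?|Chose evaluation metrics based on the training results and stated my own opinion|OK|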


### dataset
1. load_digits
2. load_wine
3. load_breast_cancer

### Algorithms applied
1. Decision Tree
2. Random Forest
3. SVM(Support Vector Machine)
4. SGD Classifier
5. Logistic Regression  

### Classification performance metrics
1. Accuracy
2. Confusion Matrix
3. Precision
4. Recall
5. F1 
6. ROC AUC
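
To make the six metrics listed above concrete, here is a minimal sketch that computes each of them with `sklearn.metrics` on a small binary example. The labels and scores are hypothetical toy values chosen for illustration, not data from this notebook.

```python
from sklearn.metrics import (
    accuracy_score, confusion_matrix, precision_score,
    recall_score, f1_score, roc_auc_score,
)

# Hypothetical toy labels (not from the notebook): 1 = positive, 0 = negative
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0, 0, 0]
# Hypothetical predicted probabilities of the positive class
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.2, 0.1, 0.1]

cm = confusion_matrix(y_true, y_pred)        # rows: actual, columns: predicted
accuracy = accuracy_score(y_true, y_pred)    # share of correct predictions
precision = precision_score(y_true, y_pred)  # TP / (TP + FP)
recall = recall_score(y_true, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_true, y_pred)                # harmonic mean of precision and recall
auc = roc_auc_score(y_true, y_score)         # needs scores, not hard predictions

print(cm)
print(accuracy, precision, recall, f1, auc)
```

Note that ROC AUC is computed from scores or probabilities, whereas the other metrics are computed from the predicted class labels.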

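## Problem definition  
> Apply five classifiers to three classification datasets and compare their predictive performance  
> Choose a performance metric for each of the three datasets according to the type of data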



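## Analysis results  
 **1. Handwritten digit classification**
 * Accuracy chosen as the performance metric  
  - This is a multiclass problem that predicts a digit label from 0 to 9, so Accuracy,   
 which measures how often the predictions match the actual labels, was chosen as the performance metric   
  - The classes are also balanced, each making up about 10% of the training data, which supports using Accuracy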
 
 
|Evaluation|DecisionTree|RandomForest|SVM|SGDClassifier|Logistic Regression|  
|:---------:|:----------:|:----------:|:--:|:----------:|:------------:|  
|Accuracy|87 %|98 %|99 %|96 %|97 %|
  
<hr/>

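 **2. Wine classification**
 * Accuracy chosen as the performance metric
  - This is a multiclass problem that predicts the wine class among the labels 0, 1, 2, so Accuracy,   
 which measures how often the predictions match the actual labels, was chosen as the performance metric   
  - However, the classes are not equally represented in the training data, so it would be worth checking Recall as well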
 
|Evaluation|DecisionTree|RandomForest|SVM|SGDClassifier|Logistic Regression|  
|:---------:|:----------:|:----------:|:--:|:----------:|:------------:|  
|Accuracy|94 %|97 %|64 %|53 %|92 %|

<hr/>  

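 **3. Breast cancer prediction**
 * Recall of the Positive class (cancer) chosen as the performance metric
  - Malignant tumors (cancer) are labeled Positive (1) and benign tumors (no cancer) are labeled Negative (0)  
  - This is a binary classification with imbalanced classes: Positive samples are fewer than Negative samples (roughly 40:60), so a model tends to lean toward the majority class it saw more often during training
  - Predicting an actual Positive (cancer) as Negative (no cancer) is more harmful than the opposite error of predicting a Negative (no cancer) as Positive (cancer)  
  - Recall, which focuses on how well the Positive (cancer) cases are detected, was therefore chosen as the performance metric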
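
The reasoning above can be illustrated with a small sketch. The labels follow the roughly 40:60 split described in the list, and the predictions come from a hypothetical model that misses a quarter of the cancer cases; none of these numbers are results from this notebook.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels with a 40:60 Positive:Negative split (1 = cancer)
y_true = [1] * 40 + [0] * 60
# Hypothetical model: misses 10 of the 40 cancer cases, raises no false alarms
y_pred = [1] * 30 + [0] * 10 + [0] * 60

# For binary labels, ravel() yields TN, FP, FN, TP in that order
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
accuracy = accuracy_score(y_true, y_pred)
recall = recall_score(y_true, y_pred, pos_label=1)

print(f"accuracy={accuracy:.2f}, recall={recall:.2f}, missed cancer cases={fn}")
```

Accuracy looks comfortable at 0.90, while Recall (0.75) exposes the ten missed cancer cases, which is the error this task most needs to avoid.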
 
|Evaluation|DecisionTree|RandomForest|SVM|SGDClassifier|Logistic Regression|  
|:---------:|:----------:|:----------:|:--:|:----------:|:------------:|  
|Recall|93 %|88 %|76 %|88 %|95 %|

<hr/>
Because the models were built with default values, without data preprocessing or hyperparameter tuning, the current results are not enough to single out the best-performing model.
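
As one illustration of how much the missing preprocessing can matter, the sketch below compares a default `SVC` on the raw wine features with the same model behind a `StandardScaler`. Feature scaling is an assumption introduced here as a plausible preprocessing step; it is not something the notebook ran. The split parameters mirror the ones used in the wine section.

```python
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = load_wine(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

# Default SVC on the raw features, as in the notebook
raw_svm = SVC().fit(X_train, y_train)

# Same model, but every feature is standardized first (assumed preprocessing)
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)

raw_acc = raw_svm.score(X_test, y_test)
scaled_acc = scaled_svm.score(X_test, y_test)
print(f"raw SVC accuracy:    {raw_acc:.2f}")
print(f"scaled SVC accuracy: {scaled_acc:.2f}")
```

Putting the scaler inside the pipeline means it is fitted on the training split only, so no information from the test split leaks into the preprocessing.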


## Analysis environment

In [1]:
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
from IPython.display import Image

import sklearn
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np

print(sklearn.__version__)

1.0


# (1) load_digits: Handwritten digit classification

## 1. Import required modules

In [2]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2. Prepare the data

In [3]:
digits = load_digits()
digits

{'data': array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
        [ 0.,  0.,  0., ..., 10.,  0.,  0.],
        [ 0.,  0.,  0., ..., 16.,  9.,  0.],
        ...,
        [ 0.,  0.,  1., ...,  6.,  0.,  0.],
        [ 0.,  0.,  2., ..., 12.,  0.,  0.],
        [ 0.,  0., 10., ..., 12.,  1.,  0.]]),
 'target': array([0, 1, 2, ..., 8, 9, 8]),
 'frame': None,
 'feature_names': ['pixel_0_0',
  'pixel_0_1',
  'pixel_0_2',
  'pixel_0_3',
  'pixel_0_4',
  'pixel_0_5',
  'pixel_0_6',
  'pixel_0_7',
  'pixel_1_0',
  'pixel_1_1',
  'pixel_1_2',
  'pixel_1_3',
  'pixel_1_4',
  'pixel_1_5',
  'pixel_1_6',
  'pixel_1_7',
  'pixel_2_0',
  'pixel_2_1',
  'pixel_2_2',
  'pixel_2_3',
  'pixel_2_4',
  'pixel_2_5',
  'pixel_2_6',
  'pixel_2_7',
  'pixel_3_0',
  'pixel_3_1',
  'pixel_3_2',
  'pixel_3_3',
  'pixel_3_4',
  'pixel_3_5',
  'pixel_3_6',
  'pixel_3_7',
  'pixel_4_0',
  'pixel_4_1',
  'pixel_4_2',
  'pixel_4_3',
  'pixel_4_4',
  'pixel_4_5',
  'pixel_4_6',
  'pixel_4_7',
  'pixel_5_0',
  'pixel_5_1',
 

## 3. Understanding the data

> Specify the feature data  
> Specify the label data  
> Print the target names  
> Describe the data

In [4]:
dir(digits)
digits.keys()

['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])

In [5]:
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

In [6]:
# Specify the feature data
digits_data = digits.data
digits_data
type(digits_data)
digits_data.shape

array([[ 0.,  0.,  5., ...,  0.,  0.,  0.],
       [ 0.,  0.,  0., ..., 10.,  0.,  0.],
       [ 0.,  0.,  0., ..., 16.,  9.,  0.],
       ...,
       [ 0.,  0.,  1., ...,  6.,  0.,  0.],
       [ 0.,  0.,  2., ..., 12.,  0.,  0.],
       [ 0.,  0., 10., ..., 12.,  1.,  0.]])

numpy.ndarray

(1797, 64)
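
The 64 feature columns are the 8x8 pixel grid of each image laid out as one row. A minimal sketch to confirm this, using the `images` key listed in the `dir(digits)` output above (the variable names are illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

first_image = digits.images[0]  # one sample as an 8x8 grid of pixel intensities
# Flatten every 8x8 image into a row of 64 values
flattened = digits.images.reshape(len(digits.images), -1)
# Compare the flattened images with the feature matrix
same = np.array_equal(flattened, digits.data)

print(first_image.shape, flattened.shape, same)
```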

In [7]:
# Specify the label data
digits_label = digits.target
digits_label
type(digits_label)
digits_label.shape

array([0, 1, 2, ..., 8, 9, 8])

numpy.ndarray

(1797,)

In [8]:
# Print the feature names
digits.feature_names

['pixel_0_0',
 'pixel_0_1',
 'pixel_0_2',
 'pixel_0_3',
 'pixel_0_4',
 'pixel_0_5',
 'pixel_0_6',
 'pixel_0_7',
 'pixel_1_0',
 'pixel_1_1',
 'pixel_1_2',
 'pixel_1_3',
 'pixel_1_4',
 'pixel_1_5',
 'pixel_1_6',
 'pixel_1_7',
 'pixel_2_0',
 'pixel_2_1',
 'pixel_2_2',
 'pixel_2_3',
 'pixel_2_4',
 'pixel_2_5',
 'pixel_2_6',
 'pixel_2_7',
 'pixel_3_0',
 'pixel_3_1',
 'pixel_3_2',
 'pixel_3_3',
 'pixel_3_4',
 'pixel_3_5',
 'pixel_3_6',
 'pixel_3_7',
 'pixel_4_0',
 'pixel_4_1',
 'pixel_4_2',
 'pixel_4_3',
 'pixel_4_4',
 'pixel_4_5',
 'pixel_4_6',
 'pixel_4_7',
 'pixel_5_0',
 'pixel_5_1',
 'pixel_5_2',
 'pixel_5_3',
 'pixel_5_4',
 'pixel_5_5',
 'pixel_5_6',
 'pixel_5_7',
 'pixel_6_0',
 'pixel_6_1',
 'pixel_6_2',
 'pixel_6_3',
 'pixel_6_4',
 'pixel_6_5',
 'pixel_6_6',
 'pixel_6_7',
 'pixel_7_0',
 'pixel_7_1',
 'pixel_7_2',
 'pixel_7_3',
 'pixel_7_4',
 'pixel_7_5',
 'pixel_7_6',
 'pixel_7_7']

In [9]:
# Print the target names
digits.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

#### Convert to a DataFrame

In [10]:
digits_df = pd.DataFrame(data=digits_data, columns=digits.feature_names)
digits_df['label'] = digits.target
digits_df

| |pixel_0_0|pixel_0_1|pixel_0_2|pixel_0_3|pixel_0_4|pixel_0_5|pixel_0_6|pixel_0_7|pixel_1_0|pixel_1_1|...|pixel_6_7|pixel_7_0|pixel_7_1|pixel_7_2|pixel_7_3|pixel_7_4|pixel_7_5|pixel_7_6|pixel_7_7|label|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|0.0|0.0|5.0|13.0|9.0|1.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|6.0|13.0|10.0|0.0|0.0|0.0|0|
|1|0.0|0.0|0.0|12.0|13.0|5.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|11.0|16.0|10.0|0.0|0.0|1|
|2|0.0|0.0|0.0|4.0|15.0|12.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|3.0|11.0|16.0|9.0|0.0|2|
|3|0.0|0.0|7.0|15.0|13.0|1.0|0.0|0.0|0.0|8.0|...|0.0|0.0|0.0|7.0|13.0|13.0|9.0|0.0|0.0|3|
|4|0.0|0.0|0.0|1.0|11.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|2.0|16.0|4.0|0.0|0.0|4|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|1792|0.0|0.0|4.0|10.0|13.0|6.0|0.0|0.0|0.0|1.0|...|0.0|0.0|0.0|2.0|14.0|15.0|9.0|0.0|0.0|9|
|1793|0.0|0.0|6.0|16.0|13.0|11.0|1.0|0.0|0.0|0.0|...|0.0|0.0|0.0|6.0|16.0|14.0|6.0|0.0|0.0|0|
|1794|0.0|0.0|1.0|11.0|15.0|1.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|2.0|9.0|13.0|6.0|0.0|0.0|8|
|1795|0.0|0.0|2.0|10.0|7.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|5.0|12.0|16.0|12.0|0.0|0.0|9|


In [11]:
# Check the label distribution - evenly spread at about 10% per class
digits_df['label'].value_counts().sort_index()/len(digits_df['label']) * 100

0     9.905398
1    10.127991
2     9.849750
3    10.183639
4    10.072343
5    10.127991
6    10.072343
7     9.961046
8     9.682805
9    10.016694
Name: label, dtype: float64
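
The same class shares can be obtained without dividing by the length manually, using the `normalize` parameter of `value_counts`. A minimal sketch that rebuilds the label column on its own (the names `labels` and `share` are illustrative):

```python
import pandas as pd
from sklearn.datasets import load_digits

labels = pd.Series(load_digits().target, name="label")

# normalize=True returns fractions instead of counts; multiply by 100 for percent
share = labels.value_counts(normalize=True).sort_index() * 100
print(share.round(2))
```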

## 4. Split into train and test data

In [12]:
from sklearn.model_selection import train_test_split
# Import train_test_split from sklearn's model_selection package

# stratify=digits_label keeps the class proportions the same in the train and test sets
X_train, X_test, y_train, y_test = train_test_split(digits_data,digits_label,test_size=0.2,random_state=7,stratify=digits_label)


print('X_train count: ', len(X_train),', X_test count: ', len(X_test))
# len returns the length of the array

X_train count:  1437 , X_test count:  360


In [13]:
# Check the training label distribution - evenly spread at about 10% per class
df_y_train = pd.DataFrame(y_train)
print("--Training data distribution--")
df_y_train.value_counts().sort_index()/len(df_y_train) * 100

--Training data distribution--


0     9.881698
1    10.090466
2     9.881698
3    10.160056
4    10.090466
5    10.160056
6    10.090466
7     9.951287
8     9.672930
9    10.020877
dtype: float64

In [14]:
# Check the test label distribution - evenly spread at about 10% per class
df_y_test = pd.DataFrame(y_test)
print("--Test data distribution--")
df_y_test.value_counts().sort_index()/len(df_y_test) * 100

--Test data distribution--


0    10.000000
1    10.277778
2     9.722222
3    10.277778
4    10.000000
5    10.000000
6    10.000000
7    10.000000
8     9.722222
9    10.000000
dtype: float64
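
The effect of `stratify` seen in the two distributions above can be checked numerically. The sketch below measures, in percentage points, the largest gap between the class shares of each split and those of the full label set. The helper `class_share` is a hypothetical name introduced for this example.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)


def class_share(labels):
    """Return the percentage of samples in each of the 10 digit classes."""
    return np.bincount(labels, minlength=10) / len(labels) * 100


# Same split parameters as the notebook cell above
_, _, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

max_train_gap = np.abs(class_share(y_train) - class_share(y)).max()
max_test_gap = np.abs(class_share(y_test) - class_share(y)).max()
print(f"largest gap, train: {max_train_gap:.3f} pp, test: {max_test_gap:.3f} pp")
```

Both gaps stay well under one percentage point, so metrics computed on the test split reflect the same class mix the models were trained on.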

## 5. Train and predict with several models

> Decision Tree    
> Random Forest   
> SVM  
> SGD Classifier  
> Logistic Regression  

In [15]:
from sklearn.tree import DecisionTreeClassifier # decision tree model
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn import svm # Support Vector Machine
from sklearn.linear_model import SGDClassifier # linear classifier trained with SGD
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.metrics import classification_report  # classification result report
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix,roc_auc_score

# Create the classifier objects
decision_tree = DecisionTreeClassifier(random_state=32)
random_forest = RandomForestClassifier(random_state=32)
svm_model = svm.SVC()
sgd_model = SGDClassifier()
logistic_model = LogisticRegression(solver='liblinear')

# Decision Tree: train and predict
decision_tree.fit(X_train, y_train) # train
y_pred_dt = decision_tree.predict(X_test) # predict

# Random Forest: train and predict
random_forest.fit(X_train, y_train) # train
y_pred_rf = random_forest.predict(X_test) # predict

# SVM: train and predict
svm_model.fit(X_train, y_train) # train
y_pred_svm = svm_model.predict(X_test) # predict

# SGD Classifier: train and predict
sgd_model.fit(X_train, y_train) # train
y_pred_sgd = sgd_model.predict(X_test) # predict

# Logistic Regression: train and predict
logistic_model.fit(X_train, y_train) # train
y_pred_lr = logistic_model.predict(X_test) # predict


# Compare the predictions of each model
pred_df = pd.DataFrame(y_test,columns=['label']) # target data
pred_df['DT']= y_pred_dt
pred_df['RF']= y_pred_rf
pred_df['SVM']= y_pred_svm
pred_df['sgd']= y_pred_sgd
pred_df['LR']= y_pred_lr
pred_df

DecisionTreeClassifier(random_state=32)

RandomForestClassifier(random_state=32)

SVC()

SGDClassifier()

LogisticRegression(solver='liblinear')

| |label|DT|RF|SVM|sgd|LR|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|9|3|9|9|9|9|
|1|4|4|4|4|4|4|
|2|6|6|6|6|6|6|
|3|4|4|4|4|4|4|
|4|8|8|8|8|8|8|
|...|...|...|...|...|...|...|
|355|3|3|3|3|3|3|
|356|2|2|2|2|2|2|
|357|0|0|0|0|0|0|
|358|4|4|4|4|4|4|
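
The five fit/predict pairs above follow the same pattern, so they could be condensed into a loop over a dictionary of models. This is a sketch under a few assumptions rather than the notebook's own code: `random_state=32` is added to `SGDClassifier` for repeatability, and the logistic regression is wrapped in `OneVsRestClassifier` to make the one-vs-rest scheme explicit for the multiclass case.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.multiclass import OneVsRestClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=7, stratify=y)

models = {
    "DT": DecisionTreeClassifier(random_state=32),
    "RF": RandomForestClassifier(random_state=32),
    "SVM": SVC(),
    # Assumption: random_state added so repeated runs give the same result
    "SGD": SGDClassifier(random_state=32),
    # Assumption: explicit one-vs-rest wrapper around the liblinear solver
    "LR": OneVsRestClassifier(LogisticRegression(solver="liblinear")),
}

# Fit each model on the training split and score it on the test split
accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    accuracy[name] = accuracy_score(y_test, model.predict(X_test))

for name, score in accuracy.items():
    print(f"{name}: {score:.2%}")
```

Keeping the models in one dictionary also makes it easy to add a model or swap the metric without repeating the fit/predict code.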


## 6. Evaluate model performance

In [16]:
# Evaluate DecisionTreeClassifier
print("\n---------- Decision Tree -------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_dt),'\n')
print(classification_report(y_test, y_pred_dt)) 


---------- Decision Tree -------------

confusion matrix
 [[33  0  0  0  2  0  0  0  1  0]
 [ 0 30  0  0  1  1  0  0  1  4]
 [ 0  0 31  1  0  0  0  0  3  0]
 [ 0  0  0 30  1  0  0  2  2  2]
 [ 1  1  0  0 32  0  0  0  1  1]
 [ 0  1  0  1  1 31  0  1  0  1]
 [ 0  0  0  0  4  0 32  0  0  0]
 [ 0  0  0  0  1  0  0 35  0  0]
 [ 1  2  1  1  0  2  0  0 27  1]
 [ 0  0  0  2  1  1  0  1  0 31]] 

              precision    recall  f1-score   support

           0       0.94      0.92      0.93        36
           1       0.88      0.81      0.85        37
           2       0.97      0.89      0.93        35
           3       0.86      0.81      0.83        37
           4       0.74      0.89      0.81        36
           5       0.89      0.86      0.87        36
           6       1.00      0.89      0.94        36
           7       0.90      0.97      0.93        36
           8       0.77      0.77      0.77        35
           9       0.78      0.86      0.82        36

    accuracy
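
The accuracy line of the report above is cut off, but the figure can be recovered from the confusion matrix: accuracy is the sum of the diagonal (correct predictions) divided by the total number of samples. A minimal sketch, with the matrix values copied by hand from the Decision Tree output above and illustrative variable names:

```python
import numpy as np

# Decision Tree confusion matrix from the output above
# (rows: actual digit, columns: predicted digit)
cm = np.array([
    [33, 0, 0, 0, 2, 0, 0, 0, 1, 0],
    [0, 30, 0, 0, 1, 1, 0, 0, 1, 4],
    [0, 0, 31, 1, 0, 0, 0, 0, 3, 0],
    [0, 0, 0, 30, 1, 0, 0, 2, 2, 2],
    [1, 1, 0, 0, 32, 0, 0, 0, 1, 1],
    [0, 1, 0, 1, 1, 31, 0, 1, 0, 1],
    [0, 0, 0, 0, 4, 0, 32, 0, 0, 0],
    [0, 0, 0, 0, 1, 0, 0, 35, 0, 0],
    [1, 2, 1, 1, 0, 2, 0, 0, 27, 1],
    [0, 0, 0, 2, 1, 1, 0, 1, 0, 31],
])

accuracy = np.trace(cm) / cm.sum()                  # correct / total
recall_per_class = np.diag(cm) / cm.sum(axis=1)     # correct / actual count
precision_per_class = np.diag(cm) / cm.sum(axis=0)  # correct / predicted count

print(f"accuracy: {accuracy:.2f}")
print("recall:   ", recall_per_class.round(2))
print("precision:", precision_per_class.round(2))
```

The result is about 0.87, in line with the 87 % listed for the Decision Tree in the summary table, and the per-class values agree with the precision and recall columns of the report.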

In [17]:
# Evaluate Random Forest
print("\n---------- Random Forest --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_rf),'\n')
print(classification_report(y_test, y_pred_rf)) 


---------- Random Forest --------------

confusion matrix
 [[35  0  0  0  1  0  0  0  0  0]
 [ 0 37  0  0  0  0  0  0  0  0]
 [ 0  0 35  0  0  0  0  0  0  0]
 [ 0  0  0 36  0  1  0  0  0  0]
 [ 0  0  0  0 36  0  0  0  0  0]
 [ 0  0  0  1  1 34  0  0  0  0]
 [ 0  0  0  0  0  0 36  0  0  0]
 [ 0  0  0  0  0  0  0 36  0  0]
 [ 0  1  0  0  0  1  0  0 33  0]
 [ 0  0  0  0  0  1  0  0  0 35]] 

              precision    recall  f1-score   support

           0       1.00      0.97      0.99        36
           1       0.97      1.00      0.99        37
           2       1.00      1.00      1.00        35
           3       0.97      0.97      0.97        37
           4       0.95      1.00      0.97        36
           5       0.92      0.94      0.93        36
           6       1.00      1.00      1.00        36
           7       1.00      1.00      1.00        36
           8       1.00      0.94      0.97        35
           9       1.00      0.97      0.99        36

    accurac

In [18]:
# Evaluate SVM
print("\n---------- SVM --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_svm),'\n')
print(classification_report(y_test, y_pred_svm)) 


---------- SVM --------------

confusion matrix
 [[36  0  0  0  0  0  0  0  0  0]
 [ 0 37  0  0  0  0  0  0  0  0]
 [ 0  0 35  0  0  0  0  0  0  0]
 [ 0  0  0 37  0  0  0  0  0  0]
 [ 0  0  0  0 36  0  0  0  0  0]
 [ 0  0  0  0  0 35  0  0  0  1]
 [ 0  1  0  0  0  0 35  0  0  0]
 [ 0  0  0  0  0  0  0 36  0  0]
 [ 0  0  0  0  0  0  0  0 35  0]
 [ 0  0  0  0  0  1  0  0  1 34]] 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        36
           1       0.97      1.00      0.99        37
           2       1.00      1.00      1.00        35
           3       1.00      1.00      1.00        37
           4       1.00      1.00      1.00        36
           5       0.97      0.97      0.97        36
           6       1.00      0.97      0.99        36
           7       1.00      1.00      1.00        36
           8       0.97      1.00      0.99        35
           9       0.97      0.94      0.96        36

    accuracy         

In [19]:
# Evaluate SGD Classifier
print("\n----------SGD Classifier--------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_sgd),'\n')
print(classification_report(y_test, y_pred_sgd)) 


----------SGD Classifier--------------

confusion matrix
 [[36  0  0  0  0  0  0  0  0  0]
 [ 0 34  1  1  0  0  0  0  1  0]
 [ 0  0 34  1  0  0  0  0  0  0]
 [ 0  0  0 37  0  0  0  0  0  0]
 [ 0  0  0  0 36  0  0  0  0  0]
 [ 0  0  0  1  0 34  0  0  1  0]
 [ 0  0  0  0  0  0 36  0  0  0]
 [ 0  0  0  0  0  0  0 35  1  0]
 [ 0  0  0  0  0  1  0  0 34  0]
 [ 0  0  0  1  0  1  0  0  2 32]] 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        36
           1       1.00      0.92      0.96        37
           2       0.97      0.97      0.97        35
           3       0.90      1.00      0.95        37
           4       1.00      1.00      1.00        36
           5       0.94      0.94      0.94        36
           6       1.00      1.00      1.00        36
           7       1.00      0.97      0.99        36
           8       0.87      0.97      0.92        35
           9       1.00      0.89      0.94        36

    accuracy

In [20]:
# Evaluate Logistic Regression
print("\n---------- Logistic Regression --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_lr),'\n')
print(classification_report(y_test, y_pred_lr)) 


---------- Logistic Regression --------------

confusion matrix
 [[36  0  0  0  0  0  0  0  0  0]
 [ 0 36  0  0  0  0  1  0  0  0]
 [ 0  0 34  1  0  0  0  0  0  0]
 [ 0  0  0 36  0  1  0  0  0  0]
 [ 0  0  0  0 36  0  0  0  0  0]
 [ 0  0  0  0  0 35  0  0  1  0]
 [ 0  0  0  0  0  0 36  0  0  0]
 [ 0  0  0  0  0  0  0 35  0  1]
 [ 0  2  0  0  0  1  0  0 31  1]
 [ 0  0  0  0  0  1  0  0  1 34]] 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        36
           1       0.95      0.97      0.96        37
           2       1.00      0.97      0.99        35
           3       0.97      0.97      0.97        37
           4       1.00      1.00      1.00        36
           5       0.92      0.97      0.95        36
           6       0.97      1.00      0.99        36
           7       1.00      0.97      0.99        36
           8       0.94      0.89      0.91        35
           9       0.94      0.94      0.94        36

    a

# (2) load_wine: Wine classification

## 1. Import required modules

In [21]:
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2. Prepare the data

In [22]:
wine = load_wine()
wine

{'data': array([[1.423e+01, 1.710e+00, 2.430e+00, ..., 1.040e+00, 3.920e+00,
         1.065e+03],
        [1.320e+01, 1.780e+00, 2.140e+00, ..., 1.050e+00, 3.400e+00,
         1.050e+03],
        [1.316e+01, 2.360e+00, 2.670e+00, ..., 1.030e+00, 3.170e+00,
         1.185e+03],
        ...,
        [1.327e+01, 4.280e+00, 2.260e+00, ..., 5.900e-01, 1.560e+00,
         8.350e+02],
        [1.317e+01, 2.590e+00, 2.370e+00, ..., 6.000e-01, 1.620e+00,
         8.400e+02],
        [1.413e+01, 4.100e+00, 2.740e+00, ..., 6.100e-01, 1.600e+00,
         5.600e+02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
        1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1

## 3. Understanding the data

> Specify the feature data  
> Specify the label data  
> Print the target names  
> Describe the data

In [23]:
dir(wine)
wine.keys()

['DESCR', 'data', 'feature_names', 'frame', 'target', 'target_names']

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])

In [24]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [25]:
# Print the feature names
wine.feature_names

['alcohol',
 'malic_acid',
 'ash',
 'alcalinity_of_ash',
 'magnesium',
 'total_phenols',
 'flavanoids',
 'nonflavanoid_phenols',
 'proanthocyanins',
 'color_intensity',
 'hue',
 'od280/od315_of_diluted_wines',
 'proline']

In [26]:
# Specify the feature data
wine_data = wine.data
wine_data.shape
wine_data[0]

(178, 13)

array([1.423e+01, 1.710e+00, 2.430e+00, 1.560e+01, 1.270e+02, 2.800e+00,
       3.060e+00, 2.800e-01, 2.290e+00, 5.640e+00, 1.040e+00, 3.920e+00,
       1.065e+03])

In [27]:
# Print the target names
wine.target_names

array(['class_0', 'class_1', 'class_2'], dtype='<U7')

In [28]:
# Specify the label data
wine_label = wine.target
wine_label.shape
wine_label # 0,1,2

(178,)

array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2])

#### Convert to a DataFrame

In [29]:
wine_df = pd.DataFrame(data=wine_data,columns=wine.feature_names)
wine_df['label'] = wine.target
wine_df

| |alcohol|malic_acid|ash|alcalinity_of_ash|magnesium|total_phenols|flavanoids|nonflavanoid_phenols|proanthocyanins|color_intensity|hue|od280/od315_of_diluted_wines|proline|label|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|14.23|1.71|2.43|15.6|127.0|2.80|3.06|0.28|2.29|5.64|1.04|3.92|1065.0|0|
|1|13.20|1.78|2.14|11.2|100.0|2.65|2.76|0.26|1.28|4.38|1.05|3.40|1050.0|0|
|2|13.16|2.36|2.67|18.6|101.0|2.80|3.24|0.30|2.81|5.68|1.03|3.17|1185.0|0|
|3|14.37|1.95|2.50|16.8|113.0|3.85|3.49|0.24|2.18|7.80|0.86|3.45|1480.0|0|
|4|13.24|2.59|2.87|21.0|118.0|2.80|2.69|0.39|1.82|4.32|1.04|2.93|735.0|0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|173|13.71|5.65|2.45|20.5|95.0|1.68|0.61|0.52|1.06|7.70|0.64|1.74|740.0|2|
|174|13.40|3.91|2.48|23.0|102.0|1.80|0.75|0.43|1.41|7.30|0.70|1.56|750.0|2|
|175|13.27|4.28|2.26|20.0|120.0|1.59|0.69|0.43|1.35|10.20|0.59|1.56|835.0|2|
|176|13.17|2.59|2.37|20.0|120.0|1.65|0.68|0.53|1.46|9.30|0.60|1.62|840.0|2|


In [30]:
# Check the label distribution: among classes 0, 1, 2, class 1 has the largest share
wine_df['label'].value_counts().sort_index()/len(wine_df['label']) * 100

0    33.146067
1    39.887640
2    26.966292
Name: label, dtype: float64

## 4. Split into train and test data

In [31]:
from sklearn.model_selection import train_test_split
# Import train_test_split from sklearn's model_selection package

# stratify=wine_label keeps the class proportions of the splits the same as in the full label data
X_train, X_test, y_train, y_test = train_test_split(wine_data,wine_label,test_size=0.2,random_state=7,stratify=wine_label)

print('X_train count: ', len(X_train),', X_test count: ', len(X_test))

X_train count:  142 , X_test count:  36


In [32]:
# Check the training label distribution - matches the distribution of the original labels
df_y_train = pd.DataFrame(y_train)
df_y_train.value_counts().sort_index()/len(df_y_train) * 100

0    33.098592
1    40.140845
2    26.760563
dtype: float64

In [33]:
# Check the test label distribution - matches the distribution of the original labels
df_y_test = pd.DataFrame(y_test)
df_y_test.value_counts().sort_index()/len(df_y_test) * 100

0    33.333333
1    38.888889
2    27.777778
dtype: float64

## 5. Train and predict with several models

> Decision Tree    
> Random Forest   
> SVM  
> SGD Classifier  
> Logistic Regression  

In [34]:
from sklearn.tree import DecisionTreeClassifier # decision tree model
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn import svm # Support Vector Machine
from sklearn.linear_model import SGDClassifier # linear classifier trained with SGD
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.metrics import classification_report  # classification result report
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix,roc_auc_score

# Create the five classifier objects
decision_tree = DecisionTreeClassifier(random_state=32)
random_forest = RandomForestClassifier(random_state=32)
svm_model = svm.SVC()
sgd_model = SGDClassifier(random_state=32,max_iter=100000)
logistic_model = LogisticRegression(solver='liblinear')

# Decision Tree: train and predict
decision_tree.fit(X_train, y_train) # train
y_pred_dt = decision_tree.predict(X_test) # predict

# Random Forest: train and predict
random_forest.fit(X_train, y_train) # train
y_pred_rf = random_forest.predict(X_test) # predict

# SVM: train and predict
svm_model.fit(X_train, y_train) # train
y_pred_svm = svm_model.predict(X_test) # predict

# SGD Classifier: train and predict
sgd_model.fit(X_train, y_train) # train
y_pred_sgd = sgd_model.predict(X_test) # predict

# Logistic Regression: train and predict
logistic_model.fit(X_train, y_train) # train
y_pred_lr = logistic_model.predict(X_test) # predict

# Compare the predictions of each model
pred_df = pd.DataFrame(y_test,columns=['label']) # target data
pred_df['DT']= y_pred_dt
pred_df['RF']= y_pred_rf
pred_df['SVM']= y_pred_svm
pred_df['sgd']= y_pred_sgd
pred_df['LR']= y_pred_lr
pred_df

DecisionTreeClassifier(random_state=32)

RandomForestClassifier(random_state=32)

SVC()

SGDClassifier(max_iter=100000, random_state=32)

LogisticRegression(solver='liblinear')

| |label|DT|RF|SVM|sgd|LR|
|:--:|:--:|:--:|:--:|:--:|:--:|:--:|
|0|1|1|1|1|2|1|
|1|0|0|0|2|2|0|
|2|0|0|0|0|0|0|
|3|0|0|0|0|0|0|
|4|0|1|0|2|2|1|
|5|0|0|0|0|0|0|
|6|1|1|1|1|2|1|
|7|2|2|2|1|2|2|
|8|2|2|2|2|0|2|
|9|2|2|2|1|2|2|


## 6. Evaluate model performance

In [35]:
# Evaluate DecisionTreeClassifier
print("\n---------- Decision Tree --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_dt),'\n')
print(classification_report(y_test, y_pred_dt)) 


---------- Decision Tree --------------

confusion matrix
 [[11  1  0]
 [ 0 13  1]
 [ 0  0 10]] 

              precision    recall  f1-score   support

           0       1.00      0.92      0.96        12
           1       0.93      0.93      0.93        14
           2       0.91      1.00      0.95        10

    accuracy                           0.94        36
   macro avg       0.95      0.95      0.95        36
weighted avg       0.95      0.94      0.94        36



In [36]:
# Evaluate Random Forest
print("\n---------- Random Forest --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_rf),'\n')
print(classification_report(y_test, y_pred_rf)) 


---------- Random Forest --------------

confusion matrix
 [[12  0  0]
 [ 0 13  1]
 [ 0  0 10]] 

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        12
           1       1.00      0.93      0.96        14
           2       0.91      1.00      0.95        10

    accuracy                           0.97        36
   macro avg       0.97      0.98      0.97        36
weighted avg       0.97      0.97      0.97        36



In [37]:
# Evaluate SVM
print("\n---------- SVM --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_svm),'\n')
print(classification_report(y_test, y_pred_svm)) 


---------- SVM --------------

confusion matrix
 [[ 9  0  3]
 [ 0 12  2]
 [ 0  8  2]] 

              precision    recall  f1-score   support

           0       1.00      0.75      0.86        12
           1       0.60      0.86      0.71        14
           2       0.29      0.20      0.24        10

    accuracy                           0.64        36
   macro avg       0.63      0.60      0.60        36
weighted avg       0.65      0.64      0.63        36
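
Accuracy alone hides how unevenly the SVM treats the three wine classes. The sketch below rebuilds label vectors from the confusion matrix above and asks `recall_score` for per-class and macro-averaged recall. The vectors are reconstructed for illustration only and do not follow the actual order of `y_test`.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# SVM confusion matrix for the wine test set, copied from the output above
# (rows: actual class, columns: predicted class)
cm = np.array([[9, 0, 3],
               [0, 12, 2],
               [0, 8, 2]])

# Rebuild label vectors that reproduce this matrix
y_true = np.repeat(np.arange(3), cm.sum(axis=1))
y_pred = np.concatenate([np.repeat(np.arange(3), row) for row in cm])

accuracy = accuracy_score(y_true, y_pred)
per_class_recall = recall_score(y_true, y_pred, average=None)
macro_recall = recall_score(y_true, y_pred, average="macro")

print(f"accuracy: {accuracy:.2f}")
print("recall per class:", per_class_recall.round(2))
print(f"macro recall: {macro_recall:.2f}")
```

An accuracy of 0.64 sits next to a recall of only 0.20 for class 2, which is the kind of gap that motivates reading recall alongside accuracy on this dataset.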



In [38]:
# Evaluate SGD Classifier
print("\n---------- SGD Classifier --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_sgd),'\n')
print(classification_report(y_test, y_pred_sgd)) 


---------- SGD Classifier --------------

confusion matrix
 [[ 9  0  3]
 [ 2  2 10]
 [ 2  0  8]] 

              precision    recall  f1-score   support

           0       0.69      0.75      0.72        12
           1       1.00      0.14      0.25        14
           2       0.38      0.80      0.52        10

    accuracy                           0.53        36
   macro avg       0.69      0.56      0.50        36
weighted avg       0.73      0.53      0.48        36



In [39]:
# Evaluate Logistic Regression
print("\n---------- Logistic Regression --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_lr),'\n')
print(classification_report(y_test, y_pred_lr))


---------- Logistic Regression --------------

confusion matrix
 [[10  2  0]
 [ 0 13  1]
 [ 0  0 10]] 

              precision    recall  f1-score   support

           0       1.00      0.83      0.91        12
           1       0.87      0.93      0.90        14
           2       0.91      1.00      0.95        10

    accuracy                           0.92        36
   macro avg       0.93      0.92      0.92        36
weighted avg       0.92      0.92      0.92        36



# (3) load_breast_cancer: Breast cancer prediction

## 1. Import required modules

In [40]:
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

## 2. Prepare the data

In [41]:
bcancer = load_breast_cancer()
bcancer

{'data': array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
         1.189e-01],
        [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
         8.902e-02],
        [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
         8.758e-02],
        ...,
        [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
         7.820e-02],
        [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
         1.240e-01],
        [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
         7.039e-02]]),
 'target': array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
        0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
        1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
        1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
        1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0

## 3. Understanding the data

> Specify the feature data  
> Specify the label data  
> Print the target names  
> Describe the data

In [42]:
dir(bcancer)
bcancer.keys()

['DESCR',
 'data',
 'data_module',
 'feature_names',
 'filename',
 'frame',
 'target',
 'target_names']

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [43]:
print(bcancer['DESCR'])

.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instance, field 0 is Mean Radi

In [44]:
# Print the feature names
bcancer.feature_names

array(['mean radius', 'mean texture', 'mean perimeter', 'mean area',
       'mean smoothness', 'mean compactness', 'mean concavity',
       'mean concave points', 'mean symmetry', 'mean fractal dimension',
       'radius error', 'texture error', 'perimeter error', 'area error',
       'smoothness error', 'compactness error', 'concavity error',
       'concave points error', 'symmetry error',
       'fractal dimension error', 'worst radius', 'worst texture',
       'worst perimeter', 'worst area', 'worst smoothness',
       'worst compactness', 'worst concavity', 'worst concave points',
       'worst symmetry', 'worst fractal dimension'], dtype='<U23')

In [45]:
# Assign the feature data
bcancer_data = bcancer.data
bcancer_data
bcancer_data.shape

array([[1.799e+01, 1.038e+01, 1.228e+02, ..., 2.654e-01, 4.601e-01,
        1.189e-01],
       [2.057e+01, 1.777e+01, 1.329e+02, ..., 1.860e-01, 2.750e-01,
        8.902e-02],
       [1.969e+01, 2.125e+01, 1.300e+02, ..., 2.430e-01, 3.613e-01,
        8.758e-02],
       ...,
       [1.660e+01, 2.808e+01, 1.083e+02, ..., 1.418e-01, 2.218e-01,
        7.820e-02],
       [2.060e+01, 2.933e+01, 1.401e+02, ..., 2.650e-01, 4.087e-01,
        1.240e-01],
       [7.760e+00, 2.454e+01, 4.792e+01, ..., 0.000e+00, 2.871e-01,
        7.039e-02]])

(569, 30)

In [46]:
# Print the target names
# 0 = malignant tumor, 1 = benign tumor
bcancer.target_names

array(['malignant', 'benign'], dtype='<U9')

In [47]:
# Assign the target data
# In the dataset, target is 0 for a malignant tumor and 1 for a benign tumor, i.e. it encodes the tumor type.
# For prediction, relabel so that breast cancer = 1 and no breast cancer = 0.
import numpy as np
bcancer_label = np.where(bcancer.target == 0 , 1, 0)
bcancer_label 
bcancer_label.shape

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 0, 1, 0, 1, 1,
       0, 0, 0, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 1, 1,
       0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0,
       0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 0, 1, 0, 1,
       1, 0, 1, 1, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 0,
       0, 1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0, 0, 1, 1,
       0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 1,

(569,)
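The relabeling above can be sanity-checked on a small array. This is a minimal sketch using a made-up toy `target` (not the real dataset); it only illustrates that, for a 0/1 array, `np.where(target == 0, 1, 0)` is the same as flipping each value with `1 - target`.

```python
import numpy as np

# Toy target in the dataset's original encoding: 0 = malignant, 1 = benign
# (values are invented for illustration)
target = np.array([0, 0, 1, 1, 1, 0, 1])

# Relabel so that cancer (malignant) becomes the positive class
label = np.where(target == 0, 1, 0)

# For a strictly 0/1 array this is equivalent to flipping each value
flipped = 1 - target

print(label)
print(np.array_equal(label, flipped))
```

Either form works; `np.where` makes the intent (malignant becomes 1) explicit, which is why it reads well in a notebook.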

#### Convert to a DataFrame

In [48]:
cancer_df = pd.DataFrame(data=bcancer_data,columns=bcancer.feature_names)
cancer_df['label']= bcancer_label
cancer_df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,1
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,1
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,1
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,1
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,1
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,1
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,1
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,1


In [49]:
cancer_df['label'].value_counts()

0    357
1    212
Name: label, dtype: int64

In [50]:
# Check the label distribution
# There are fewer breast cancer samples than non-cancer samples
cancer_df['label'].value_counts()/len(cancer_df['label']) * 100

0    62.741652
1    37.258348
Name: label, dtype: float64
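The same percentages can also be obtained with `value_counts(normalize=True)`. The sketch below rebuilds a stand-in label column from the counts printed above (357 and 212) rather than loading the dataset, so the `labels` series is an assumption made for illustration.

```python
import pandas as pd

# Stand-in label column rebuilt from the counts shown above:
# 357 negatives (0) and 212 positives (1)
labels = pd.Series([0] * 357 + [1] * 212, name="label")

# normalize=True returns fractions instead of raw counts
pct = labels.value_counts(normalize=True) * 100
print(pct.round(2))
```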

## 4. Splitting into Train and Test Data

In [51]:
from sklearn.model_selection import train_test_split
# Import train_test_split from sklearn's model_selection package

# stratify=bcancer_label keeps the class ratio of the training data equal to that of the original data
X_train, X_test, y_train, y_test = train_test_split(bcancer_data, bcancer_label, test_size=0.2, random_state=11, stratify=bcancer_label)

print('X_train count: ', len(X_train), ', X_test count: ', len(X_test))

X_train count:  455 , X_test count:  114


In [52]:
# Check the training label distribution - the ratio matches the original labels
# There are fewer breast cancer samples than non-cancer samples
df_y_train = pd.DataFrame(y_train)
df_y_train.value_counts()/len(df_y_train) * 100

0    62.637363
1    37.362637
dtype: float64

In [53]:
# Check the test label distribution - the ratio matches the original labels
# There are fewer breast cancer samples than non-cancer samples
df_y_test = pd.DataFrame(y_test)
df_y_test.value_counts()/len(df_y_test) * 100

0    63.157895
1    36.842105
dtype: float64

## 5. Training and Predicting with Several Models

> Decision Tree    
> Random Forest   
> SVM  
> SGD Classifier  
> Logistic Regression  

In [54]:
from sklearn.tree import DecisionTreeClassifier # decision tree model
from sklearn.ensemble import RandomForestClassifier # random forest classifier
from sklearn import svm # Support Vector Machine
from sklearn.linear_model import SGDClassifier # SGDClassifier, a linear classifier
from sklearn.linear_model import LogisticRegression # logistic regression
from sklearn.metrics import classification_report # classification report
from sklearn.metrics import accuracy_score, precision_score, recall_score, confusion_matrix,roc_auc_score

# Create the five classifier objects
decision_tree = DecisionTreeClassifier(random_state=52)
random_forest = RandomForestClassifier(random_state=52)
svm_model = svm.SVC()
sgd_model = SGDClassifier()
logistic_model = LogisticRegression(solver='liblinear')

# Decision Tree: train and predict
decision_tree.fit(X_train, y_train) # train
y_pred_dt = decision_tree.predict(X_test) # predict

# Random Forest: train and predict
random_forest.fit(X_train, y_train) # train
y_pred_rf = random_forest.predict(X_test) # predict

# SVM: train and predict
svm_model.fit(X_train, y_train) # train
y_pred_svm = svm_model.predict(X_test) # predict

# SGD Classifier: train and predict
sgd_model.fit(X_train, y_train) # train
y_pred_sgd = sgd_model.predict(X_test) # predict

# Logistic Regression: train and predict
logistic_model.fit(X_train, y_train) # train
y_pred_lr = logistic_model.predict(X_test) # predict

# Compare each model's predictions against the true labels
pred_df = pd.DataFrame(y_test,columns=['label']) # target data
pred_df['DT']= y_pred_dt
pred_df['RF']= y_pred_rf
pred_df['SVM']= y_pred_svm
pred_df['sgd']= y_pred_sgd
pred_df['LR']= y_pred_lr
pred_df

DecisionTreeClassifier(random_state=52)

RandomForestClassifier(random_state=52)

SVC()

SGDClassifier()

LogisticRegression(solver='liblinear')

Unnamed: 0,label,DT,RF,SVM,sgd,LR
0,1,1,1,1,1,1
1,1,1,1,1,1,1
2,0,0,0,0,0,0
3,0,0,0,0,0,0
4,0,0,0,0,0,0
...,...,...,...,...,...,...
109,1,1,1,1,0,1
110,0,0,0,0,0,0
111,0,0,0,0,0,0
112,0,0,0,0,0,0


## 6. 모델 성능 평가

In [55]:
# Evaluate the Decision Tree
print("\n---------- Decision Tree --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_dt),'\n')
print(classification_report(y_test, y_pred_dt)) 


---------- Decision Tree --------------

confusion matrix
 [[70  2]
 [ 3 39]] 

              precision    recall  f1-score   support

           0       0.96      0.97      0.97        72
           1       0.95      0.93      0.94        42

    accuracy                           0.96       114
   macro avg       0.96      0.95      0.95       114
weighted avg       0.96      0.96      0.96       114



In [56]:
# Evaluate the Random Forest
print("\n---------- Random Forest --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_rf),'\n')
print(classification_report(y_test, y_pred_rf)) 


---------- Random Forest --------------

confusion matrix
 [[72  0]
 [ 5 37]] 

              precision    recall  f1-score   support

           0       0.94      1.00      0.97        72
           1       1.00      0.88      0.94        42

    accuracy                           0.96       114
   macro avg       0.97      0.94      0.95       114
weighted avg       0.96      0.96      0.96       114



In [57]:
# Evaluate the SVM
print("\n---------- SVM --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_svm),'\n')
print(classification_report(y_test, y_pred_svm)) 


---------- SVM --------------

confusion matrix
 [[72  0]
 [10 32]] 

              precision    recall  f1-score   support

           0       0.88      1.00      0.94        72
           1       1.00      0.76      0.86        42

    accuracy                           0.91       114
   macro avg       0.94      0.88      0.90       114
weighted avg       0.92      0.91      0.91       114



In [58]:
# Evaluate the SGD Classifier
print("\n----------SGD Classifier--------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_sgd),'\n')
print(classification_report(y_test, y_pred_sgd)) 


----------SGD Classifier--------------

confusion matrix
 [[72  0]
 [ 8 34]] 

              precision    recall  f1-score   support

           0       0.90      1.00      0.95        72
           1       1.00      0.81      0.89        42

    accuracy                           0.93       114
   macro avg       0.95      0.90      0.92       114
weighted avg       0.94      0.93      0.93       114



In [59]:
# Evaluate Logistic Regression
print("\n---------- Logistic Regression --------------\n")
print("confusion matrix\n", confusion_matrix(y_test, y_pred_lr),'\n')
print(classification_report(y_test, y_pred_lr)) 


---------- Logistic Regression --------------

confusion matrix
 [[71  1]
 [ 2 40]] 

              precision    recall  f1-score   support

           0       0.97      0.99      0.98        72
           1       0.98      0.95      0.96        42

    accuracy                           0.97       114
   macro avg       0.97      0.97      0.97       114
weighted avg       0.97      0.97      0.97       114
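
Because recall on the positive class (cancer = 1) is the metric chosen for this dataset, it helps to line up that single number across the five models. The sketch below copies the confusion matrices from the outputs above and recomputes recall as TP / (TP + FN); the helper name `positive_recall` is introduced here for illustration only.

```python
# Confusion matrices copied from the outputs above.
# Rows = actual class, columns = predicted class, label order [0, 1],
# so each matrix is [[TN, FP], [FN, TP]].
confusion = {
    "Decision Tree": [[70, 2], [3, 39]],
    "Random Forest": [[72, 0], [5, 37]],
    "SVM": [[72, 0], [10, 32]],
    "SGD Classifier": [[72, 0], [8, 34]],
    "Logistic Regression": [[71, 1], [2, 40]],
}

def positive_recall(cm):
    """Recall for class 1: TP / (TP + FN)."""
    (tn, fp), (fn, tp) = cm
    return tp / (tp + fn)

recalls = {name: positive_recall(cm) for name, cm in confusion.items()}

# Print the models from highest to lowest positive-class recall
for name, r in sorted(recalls.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:20s} recall(1) = {r:.2f}")
```

On this particular split, Logistic Regression misses the fewest cancer cases (2 false negatives), while the SVM misses the most (10), even though both reach 1.00 or close to it on precision for class 1.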



## Retrospective

* Fitting the LogisticRegression model with its default settings produced the following warning:
> /opt/conda/lib/python3.9/site-packages/sklearn/linear_model/_logistic.py:814:  ConvergenceWarning: lbfgs failed to converge (status=1):  
> STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.  
> Increase the number of iterations (max_iter) or scale the data as shown in:  
>     https://scikit-learn.org/stable/modules/preprocessing.html  
> Please also refer to the documentation for alternative solver options:  
>     https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression  
> n_iter_i = _check_optimize_result(

 - The default solver is lbfgs with max_iter = 100; on this data lbfgs appears to need more than 100 iterations to converge.
 - The warning can be addressed by changing either the solver or max_iter; setting solver='liblinear' resolved it here.
  - lbfgs reportedly saves memory and can parallelize the optimization when many CPU cores are available (the LMS system has only 2 CPU cores, so there was no benefit here).
  - liblinear was chosen because it reportedly works well on small, high-dimensional datasets.  
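
The warning text itself points to another remedy besides switching solvers: scaling the data. The following is a minimal sketch, not the approach used in this notebook, that counts `ConvergenceWarning`s for a default `LogisticRegression` on the raw features and for the same model behind a `StandardScaler`. The helper `count_convergence_warnings` is a name made up for this example, and the model is fit on the full dataset purely to observe solver behavior.

```python
import warnings

from sklearn.datasets import load_breast_cancer
from sklearn.exceptions import ConvergenceWarning
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)

def count_convergence_warnings(model):
    """Fit the model and return how many ConvergenceWarnings were raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always", ConvergenceWarning)
        model.fit(X, y)
    return sum(issubclass(w.category, ConvergenceWarning) for w in caught)

# Default settings: solver='lbfgs', max_iter=100, unscaled features
raw_model = LogisticRegression()

# Same estimator, but features are standardized first
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())

n_raw = count_convergence_warnings(raw_model)
n_scaled = count_convergence_warnings(scaled_model)
print("warnings without scaling:", n_raw)
print("warnings with scaling:   ", n_scaled)
```

Standardizing puts the 30 features on comparable scales, which typically lets lbfgs converge in far fewer iterations; raising `max_iter` is the other option the warning mentions.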

<br/>
<br/>

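* Varying the random_state and stratify parameters when splitting the data into training and test sets showed that the class composition of the training data matters: performance was higher when the training labels followed the same class proportions as the full dataset.
 * Setting stratify to the label data keeps the class ratio of the training data the same as in the original data.
  * Changing random_state changes the scores, because performance depends on which samples the model is trained on.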
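
The effect of `stratify` described above can be seen without training any model. This sketch uses synthetic labels with the same class counts as this dataset (357 negative, 212 positive) and placeholder features, both of which are assumptions for illustration; it prints the share of positive labels in the training split with and without stratification for a few seeds.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic labels with the notebook's class counts: 357 negative, 212 positive
y = np.array([0] * 357 + [1] * 212)
# Placeholder features; only the labels matter for this check
X = np.arange(len(y)).reshape(-1, 1)

def positive_share(stratify, seed):
    """Percentage of positive labels in the training split."""
    _, _, y_tr, _ = train_test_split(
        X, y, test_size=0.2, random_state=seed, stratify=stratify
    )
    return y_tr.mean() * 100

for seed in (0, 11, 52):
    unstratified = positive_share(None, seed)
    stratified = positive_share(y, seed)
    print(f"seed={seed:2d}  unstratified={unstratified:.2f}%  stratified={stratified:.2f}%")
```

With `stratify=y` the positive share of the training split is fixed by the class counts regardless of the seed, whereas without it the share drifts from split to split. The seed still decides which individual rows land in each split, which is why scores can change with random_state even when stratification is on.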

<br/>
<br/>

* In the breast cancer prediction model, cancer is labeled 1 and no cancer is labeled 0.
 * In the target data, Malignant is 0 and Benign is 1. In a cancer diagnosis model, however, the outcome of primary interest is conventionally set as Positive(1), so the labels were reassigned: cancer is Positive(1) and no cancer is Negative(0).


## References
* https://en.wikipedia.org/wiki/Confusion_matrix
* https://bskyvision.com/entry/python-scikit-learn%EC%9D%98-confusion-matrix-%ED%95%B4%EC%84%9D%ED%95%98%EA%B8%B0
* https://datascienceschool.net/03%20machine%20learning/09.04%20%EB%B6%84%EB%A5%98%20%EC%84%B1%EB%8A%A5%ED%8F%89%EA%B0%80.html
* https://wikibook.co.kr/pymldg-rev/

