### scikit-learn

- keyword: sklearn, 지도학습/비지도학습, 분류/회귀, 수치형/범주형, decisiontree, randomforest, svm, sgd, rogisticregression, confusion matrix, TN/FP/FN/FP, Precision, Negative Predictive Value, Sensitivity(==recall), Specificity, Accuracy

Scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.


## EXPLORATION_02

### Classification of Digits, Wine, and Breast Cancer Dataset provided by sklearn.datasets

Scikit-learn 라이브러리에서는 Toy Datasts(연습용 데이터셋) 와 Real World Datasets(실제 데이터셋) 을 제공한다.

이번 챕터에서 Toy datasets 중 다음의 세가지 데이터 셋을 이용하여 여러 분류모델을 사용 해보고, 해당 데이터 마다 어떤 분류모델이 가장 높은 성능 을 보이는지 확인해보겠다.

- Optical recognition of handwritten digits dataset : 손글씨 이미지 데이터
- Wine recognition dataset: 와인 데이터
- Breast cancer wisconsin (diagnostic) dataset : 유방암 데이터

#### standard
1. 3가지 데이터셋의 구성이 합리적으로 진행되었는가? feature와 label 선정을 위한 데이터 분석과정이 체계적으로 전개됨
2. 3가지 데이터셋에 대해 각각 5가지 모델을 성공적으로 적용하였는가? 모델학습 및 테스트가 정상적으로 수행되었음
3. 3가지 데이터셋에 대해 모델의 평가지표가 적절히 선택되었는가? 평가지표 선택 및 이유 설명이 타당함

#### sequence
1. digits -> wine -> vreast cancer 
2. import 모듈 -> 데이터 준비 -> 데이터 확인 -> 데이터 분리
3. 다양한 모델로 학습 진행: Decision Tree/ RandomForestClassifier/ SVM/ SGD/ Logistic Regression
4. 모델 평가하기

##### trial and error


In [1]:
import sklearn
print("sklearn : {}".format(sklearn.__version__))

sklearn : 1.0


### Optical recognition of handwritten digits dataset

In [2]:
from sklearn.datasets import load_digits                          
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
digits = load_digits() 
digits_data = digits.data   #feature 데이터 지정
digits_label = digits.target #label 데이터 지정

# print(digits.target_names)  #Target Names 출력
# print(digits.DESCR)   #데이터 Describe

# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=7)

# print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

# (4) 모델 학습 및 예측 
decision_tree = DecisionTreeClassifier(random_state=7) 
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.95      0.98        43
           1       0.75      0.79      0.77        42
           2       0.77      0.82      0.80        40
           3       0.84      0.91      0.87        34
           4       0.83      0.92      0.87        37
           5       0.90      0.96      0.93        28
           6       0.90      0.93      0.91        28
           7       0.90      0.82      0.86        33
           8       0.90      0.63      0.74        43
           9       0.74      0.81      0.78        32

    accuracy                           0.85       360
   macro avg       0.85      0.85      0.85       360
weighted avg       0.85      0.85      0.85       360



In [38]:
from sklearn.datasets import load_digits                                
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
digits = load_digits() 
digits_data = digits.data   
digits_label = digits.target 


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=7)

# (4) 모델 학습 및 예측 
random_forest = RandomForestClassifier(random_state=7)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.98      0.99        43
           1       0.93      1.00      0.97        42
           2       1.00      1.00      1.00        40
           3       0.92      0.97      0.94        34
           4       0.90      0.97      0.94        37
           5       0.90      0.96      0.93        28
           6       1.00      0.96      0.98        28
           7       0.94      0.97      0.96        33
           8       1.00      0.81      0.90        43
           9       0.94      0.91      0.92        32

    accuracy                           0.95       360
   macro avg       0.95      0.95      0.95       360
weighted avg       0.96      0.95      0.95       360



In [3]:
from sklearn.datasets import load_digits                                
from sklearn.model_selection import train_test_split
from sklearn import svm    
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
digits = load_digits() 
digits_data = digits.data  
digits_label = digits.target 


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=7)

# (4) 모델 학습 및 예측 
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        43
           1       0.95      1.00      0.98        42
           2       1.00      1.00      1.00        40
           3       1.00      1.00      1.00        34
           4       1.00      1.00      1.00        37
           5       0.93      1.00      0.97        28
           6       1.00      1.00      1.00        28
           7       1.00      1.00      1.00        33
           8       1.00      0.93      0.96        43
           9       1.00      0.97      0.98        32

    accuracy                           0.99       360
   macro avg       0.99      0.99      0.99       360
weighted avg       0.99      0.99      0.99       360



In [4]:
from sklearn.datasets import load_digits                                
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


# (2) 데이터 준비
digits = load_digits() 
digits_data = digits.data   #feature 데이터 지정
digits_label = digits.target #label 데이터 지정


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=7)

# (4) 모델 학습 및 예측 
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        43
           1       0.95      1.00      0.98        42
           2       1.00      1.00      1.00        40
           3       1.00      1.00      1.00        34
           4       1.00      1.00      1.00        37
           5       0.93      1.00      0.97        28
           6       1.00      1.00      1.00        28
           7       1.00      1.00      1.00        33
           8       1.00      0.93      0.96        43
           9       1.00      0.97      0.98        32

    accuracy                           0.99       360
   macro avg       0.99      0.99      0.99       360
weighted avg       0.99      0.99      0.99       360



In [5]:
from sklearn.datasets import load_digits                                
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix


# (2) 데이터 준비
digits = load_digits() 
digits_data = digits.data   #feature 데이터 지정
digits_label = digits.target #label 데이터 지정


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(digits_data, 
                                                    digits_label, 
                                                    test_size=0.2, 
                                                    random_state=7)

# (4) 모델 학습 및 예측 
logistic_model = LogisticRegression(solver = 'liblinear')
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred))

#solver = 'liblinear'

              precision    recall  f1-score   support

           0       1.00      1.00      1.00        43
           1       0.90      0.90      0.90        42
           2       0.97      0.97      0.97        40
           3       0.86      0.94      0.90        34
           4       0.95      0.97      0.96        37
           5       0.93      1.00      0.97        28
           6       0.96      0.93      0.95        28
           7       0.97      0.97      0.97        33
           8       0.93      0.88      0.90        43
           9       0.97      0.88      0.92        32

    accuracy                           0.94       360
   macro avg       0.95      0.95      0.94       360
weighted avg       0.95      0.94      0.94       360



#### accuracy
- Decision Tree: 85%
- RandomForestClassifier: 95%
- SVM: 99%
- SGD: 99%
- Logistic Regression: 94%

### >>SVM, SGD

## Wine recognition datase

In [6]:
from sklearn.datasets import load_wine                          
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# (2) 데이터 준비
wine = load_wine() 
wine_data = wine.data   #feature 데이터 지정
wine_label = wine.target #label 데이터 지정

print(wine.target_names)  #Target Names 출력
print(wine.DESCR)   #데이터 Describe

['class_0' 'class_1' 'class_2']
.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
   

In [7]:
from sklearn.datasets import load_wine                          
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# (2) 데이터 준비
wine = load_wine() 
wine_data = wine.data   #feature 데이터 지정
wine_label = wine.target #label 데이터 지정

# print(wine.target_names)  #Target Names 출력
# print(wine.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=7)   #이건 어케 지정해야하는걸까

print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

X_train 개수:  142 , X_test 개수:  36


In [8]:
from sklearn.datasets import load_wine                          
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
wine = load_wine() 
wine_data = wine.data   #feature 데이터 지정
wine_label = wine.target #label 데이터 지정

# print(wine.target_names)  #Target Names 출력
# print(wine.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=7)   

# print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

# (4) 모델 학습 및 예측 
decision_tree = DecisionTreeClassifier(random_state=7) 
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.88      1.00      0.93         7
           1       0.89      0.94      0.91        17
           2       1.00      0.83      0.91        12

    accuracy                           0.92        36
   macro avg       0.92      0.92      0.92        36
weighted avg       0.92      0.92      0.92        36



In [9]:
from sklearn.datasets import load_wine                          
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
wine = load_wine() 
wine_data = wine.data   #feature 데이터 지정
wine_label = wine.target #label 데이터 지정


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=7)   


# (4) 모델 학습 및 예측 
random_forest = RandomForestClassifier(random_state=7)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       1.00      1.00      1.00        17
           2       1.00      1.00      1.00        12

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36



In [10]:
from sklearn.datasets import load_wine                          
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
wine = load_wine() 
wine_data = wine.data   #feature 데이터 지정
wine_label = wine.target #label 데이터 지정


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=7)   


# (4) 모델 학습 및 예측 
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86         7
           1       0.58      0.88      0.70        17
           2       0.33      0.08      0.13        12

    accuracy                           0.61        36
   macro avg       0.59      0.61      0.56        36
weighted avg       0.55      0.61      0.54        36



In [11]:
from sklearn.datasets import load_wine                          
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
wine = load_wine() 
wine_data = wine.data   #feature 데이터 지정
wine_label = wine.target #label 데이터 지정


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=7)   


# (4) 모델 학습 및 예측 
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred, zero_division = 0))

# zero_division = 0 ?

              precision    recall  f1-score   support

           0       0.86      0.86      0.86         7
           1       0.58      0.88      0.70        17
           2       0.33      0.08      0.13        12

    accuracy                           0.61        36
   macro avg       0.59      0.61      0.56        36
weighted avg       0.55      0.61      0.54        36



In [12]:
from sklearn.datasets import load_wine                          
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
wine = load_wine() 
wine_data = wine.data   #feature 데이터 지정
wine_label = wine.target #label 데이터 지정


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(wine_data, 
                                                    wine_label, 
                                                    test_size=0.2, 
                                                    random_state=7)   


# (4) 모델 학습 및 예측 
logistic_model = LogisticRegression(solver = 'liblinear')
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred))

# scaler = preprocessing.StandardScaler().fit(X_train)
# solver = 'liblinear'

              precision    recall  f1-score   support

           0       0.88      1.00      0.93         7
           1       1.00      0.94      0.97        17
           2       1.00      1.00      1.00        12

    accuracy                           0.97        36
   macro avg       0.96      0.98      0.97        36
weighted avg       0.98      0.97      0.97        36



#### accuracy
- Decision Tree: 92%
- RandomForestClassifier: 100%
- SVM: 61%
- SGD: 61%
- Logistic Regression: 97%

### >>RandomForestClassifier

## Breast cancer wisconsin (diagnostic) dataset

In [13]:
from sklearn.datasets import load_breast_cancer                      

breast_cancer  = load_breast_cancer()

print(dir(breast_cancer))
breast_cancer.keys()

['DESCR', 'data', 'data_module', 'feature_names', 'filename', 'frame', 'target', 'target_names']


dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])

In [14]:
breast_cancer_data = breast_cancer.data

print(breast_cancer_data.shape) 

(569, 30)


In [15]:
breast_cancer_label = breast_cancer.target
print(breast_cancer_label.shape)
breast_cancer_label

(569,)


array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0, 1, 0, 0,
       1, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 0, 0,
       1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1,
       1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 1, 0, 0, 1, 0, 1, 0,
       0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 1,
       1, 0, 1, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 0, 0,
       0, 0, 1, 0, 0, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 0, 0, 1, 1, 0, 0,
       1, 1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0, 1, 0, 1, 1,
       1, 1, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 1, 1, 1, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 0,

In [16]:
import pandas as pd

breast_cancer_df = pd.DataFrame(data=breast_cancer_data, columns=breast_cancer.feature_names)
breast_cancer_df

breast_cancer_df["label"] = breast_cancer.target
breast_cancer_df

Unnamed: 0,mean radius,mean texture,mean perimeter,mean area,mean smoothness,mean compactness,mean concavity,mean concave points,mean symmetry,mean fractal dimension,...,worst texture,worst perimeter,worst area,worst smoothness,worst compactness,worst concavity,worst concave points,worst symmetry,worst fractal dimension,label
0,17.99,10.38,122.80,1001.0,0.11840,0.27760,0.30010,0.14710,0.2419,0.07871,...,17.33,184.60,2019.0,0.16220,0.66560,0.7119,0.2654,0.4601,0.11890,0
1,20.57,17.77,132.90,1326.0,0.08474,0.07864,0.08690,0.07017,0.1812,0.05667,...,23.41,158.80,1956.0,0.12380,0.18660,0.2416,0.1860,0.2750,0.08902,0
2,19.69,21.25,130.00,1203.0,0.10960,0.15990,0.19740,0.12790,0.2069,0.05999,...,25.53,152.50,1709.0,0.14440,0.42450,0.4504,0.2430,0.3613,0.08758,0
3,11.42,20.38,77.58,386.1,0.14250,0.28390,0.24140,0.10520,0.2597,0.09744,...,26.50,98.87,567.7,0.20980,0.86630,0.6869,0.2575,0.6638,0.17300,0
4,20.29,14.34,135.10,1297.0,0.10030,0.13280,0.19800,0.10430,0.1809,0.05883,...,16.67,152.20,1575.0,0.13740,0.20500,0.4000,0.1625,0.2364,0.07678,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
564,21.56,22.39,142.00,1479.0,0.11100,0.11590,0.24390,0.13890,0.1726,0.05623,...,26.40,166.10,2027.0,0.14100,0.21130,0.4107,0.2216,0.2060,0.07115,0
565,20.13,28.25,131.20,1261.0,0.09780,0.10340,0.14400,0.09791,0.1752,0.05533,...,38.25,155.00,1731.0,0.11660,0.19220,0.3215,0.1628,0.2572,0.06637,0
566,16.60,28.08,108.30,858.1,0.08455,0.10230,0.09251,0.05302,0.1590,0.05648,...,34.12,126.70,1124.0,0.11390,0.30940,0.3403,0.1418,0.2218,0.07820,0
567,20.60,29.33,140.10,1265.0,0.11780,0.27700,0.35140,0.15200,0.2397,0.07016,...,39.42,184.60,1821.0,0.16500,0.86810,0.9387,0.2650,0.4087,0.12400,0


In [17]:
from sklearn.datasets import load_breast_cancer                        
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

# (2) 데이터 준비
breast_cancer  = load_breast_cancer()
breast_cancer_data = breast_cancer.data
breast_cancer_label = breast_cancer.target

print(breast_cancer.target_names)  #Target Names 출력
print(breast_cancer.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=7)  

print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

['malignant' 'benign']
.. _breast_cancer_dataset:

Breast cancer wisconsin (diagnostic) dataset
--------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 569

    :Number of Attributes: 30 numeric, predictive attributes and the class

    :Attribute Information:
        - radius (mean of distances from center to points on the perimeter)
        - texture (standard deviation of gray-scale values)
        - perimeter
        - area
        - smoothness (local variation in radius lengths)
        - compactness (perimeter^2 / area - 1.0)
        - concavity (severity of concave portions of the contour)
        - concave points (number of concave portions of the contour)
        - symmetry
        - fractal dimension ("coastline approximation" - 1)

        The mean, standard error, and "worst" or largest (mean of the three
        worst/largest values) of these features were computed for each image,
        resulting in 30 features.  For instanc

In [18]:
from sklearn.datasets import load_breast_cancer                        
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
breast_cancer  = load_breast_cancer()
breast_cancer_data = breast_cancer.data
breast_cancer_label = breast_cancer.target

# print(breast_cancer.target_names)  #Target Names 출력
# print(breast_cancer.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=7)  

# print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

# (4) 모델 학습 및 예측 
decision_tree = DecisionTreeClassifier(random_state=7) 
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.92      0.82      0.87        40
           1       0.91      0.96      0.93        74

    accuracy                           0.91       114
   macro avg       0.91      0.89      0.90       114
weighted avg       0.91      0.91      0.91       114



In [19]:
from sklearn.datasets import load_breast_cancer                        
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier 
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
breast_cancer  = load_breast_cancer()
breast_cancer_data = breast_cancer.data
breast_cancer_label = breast_cancer.target

# print(breast_cancer.target_names)  #Target Names 출력
# print(breast_cancer.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=7)  

# print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

# (4) 모델 학습 및 예측 
random_forest = RandomForestClassifier(random_state=7)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.93      0.96        40
           1       0.96      1.00      0.98        74

    accuracy                           0.97       114
   macro avg       0.98      0.96      0.97       114
weighted avg       0.97      0.97      0.97       114



In [20]:
from sklearn.datasets import load_breast_cancer                        
from sklearn.model_selection import train_test_split
from sklearn import svm
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
breast_cancer  = load_breast_cancer()
breast_cancer_data = breast_cancer.data
breast_cancer_label = breast_cancer.target

# print(breast_cancer.target_names)  #Target Names 출력
# print(breast_cancer.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=7)  

# print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

# (4) 모델 학습 및 예측 
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.72      0.84        40
           1       0.87      1.00      0.93        74

    accuracy                           0.90       114
   macro avg       0.94      0.86      0.89       114
weighted avg       0.92      0.90      0.90       114



In [21]:
from sklearn.datasets import load_breast_cancer                        
from sklearn.model_selection import train_test_split
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
breast_cancer  = load_breast_cancer()
breast_cancer_data = breast_cancer.data
breast_cancer_label = breast_cancer.target

# print(breast_cancer.target_names)  #Target Names 출력
# print(breast_cancer.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=7)  

# print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

# (4) 모델 학습 및 예측 
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.72      0.84        40
           1       0.87      1.00      0.93        74

    accuracy                           0.90       114
   macro avg       0.94      0.86      0.89       114
weighted avg       0.92      0.90      0.90       114



In [22]:
from sklearn.datasets import load_breast_cancer                        
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix

# (2) 데이터 준비
breast_cancer  = load_breast_cancer()
breast_cancer_data = breast_cancer.data
breast_cancer_label = breast_cancer.target

# print(breast_cancer.target_names)  #Target Names 출력
# print(breast_cancer.DESCR)   #데이터 Describe


# (3) train, test 데이터 분리
X_train, X_test, y_train, y_test = train_test_split(breast_cancer_data, 
                                                    breast_cancer_label, 
                                                    test_size=0.2, 
                                                    random_state=7)  

# print('X_train 개수: ', len(X_train),', X_test 개수: ', len(X_test))

# (4) 모델 학습 및 예측 
logistic_model = LogisticRegression(solver = 'liblinear')
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      0.85      0.92        40
           1       0.93      1.00      0.96        74

    accuracy                           0.95       114
   macro avg       0.96      0.93      0.94       114
weighted avg       0.95      0.95      0.95       114



#### accuracy
- Decision Tree: 91%
- RandomForestClassifier: 97%
- SVM: 90%
- SGD: 90%
- Logistic Regression: 95%

###  >> RandomForestClassifier

## Trial and error

사이킷 런을 처음 구현해 보면서 tensorflow 때처럼 열심히 했던 기억이 난다. 여기서의 관건은 모델을 만드는 것이 아니라 "데이터 셋에 따른 
최적의 평가 방법을 찾아라"인 것 같다. 다른 부분에선 문제가 없었지만 logistic regression에서 정규화 또는 solver를 적용해야만 limit을 풀 수 있었던 게 가장 기억에 남고 5개의 평가 모델에 대해 깊이 공부해 볼 수 있는 계기가 되었던 것 같다. confusion metrix에 대해선 더 공부해 볼 것.
