# E2-2. Classifying Wine
- The second project of the second AIFFEL exploration.
- Classifies the wine dataset from scikit-learn's toy datasets.
- Uses five models from the scikit-learn API (decision tree, random forest, SVM, SGD Classifier, Logistic Regression).

## 1. Preparing the data before model training

In [9]:
# Import modules
import sklearn
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix

In [13]:
# Load the wine data
wine = load_wine()

# Inspect the wine data
print(wine.keys()) # keys of the dict-like Bunch object
print(wine.data) # value stored under the 'data' key
print(wine.feature_names)
print(wine.target_names)

dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names'])
[[1.423e+01 1.710e+00 2.430e+00 ... 1.040e+00 3.920e+00 1.065e+03]
 [1.320e+01 1.780e+00 2.140e+00 ... 1.050e+00 3.400e+00 1.050e+03]
 [1.316e+01 2.360e+00 2.670e+00 ... 1.030e+00 3.170e+00 1.185e+03]
 ...
 [1.327e+01 4.280e+00 2.260e+00 ... 5.900e-01 1.560e+00 8.350e+02]
 [1.317e+01 2.590e+00 2.370e+00 ... 6.000e-01 1.620e+00 8.400e+02]
 [1.413e+01 4.100e+00 2.740e+00 ... 6.100e-01 1.600e+00 5.600e+02]]
['alcohol', 'malic_acid', 'ash', 'alcalinity_of_ash', 'magnesium', 'total_phenols', 'flavanoids', 'nonflavanoid_phenols', 'proanthocyanins', 'color_intensity', 'hue', 'od280/od315_of_diluted_wines', 'proline']
['class_0' 'class_1' 'class_2']


In [15]:
print(wine.DESCR)

.. _wine_dataset:

Wine recognition dataset
------------------------

**Data Set Characteristics:**

    :Number of Instances: 178 (50 in each of three classes)
    :Number of Attributes: 13 numeric, predictive attributes and the class
    :Attribute Information:
 		- Alcohol
 		- Malic acid
 		- Ash
		- Alcalinity of ash  
 		- Magnesium
		- Total phenols
 		- Flavanoids
 		- Nonflavanoid phenols
 		- Proanthocyanins
		- Color intensity
 		- Hue
 		- OD280/OD315 of diluted wines
 		- Proline

    - class:
            - class_0
            - class_1
            - class_2
		
    :Summary Statistics:
    
                                   Min   Max   Mean     SD
    Alcohol:                      11.0  14.8    13.0   0.8
    Malic Acid:                   0.74  5.80    2.34  1.12
    Ash:                          1.36  3.23    2.36  0.27
    Alcalinity of Ash:            10.6  30.0    19.5   3.3
    Magnesium:                    70.0 162.0    99.7  14.3
    Total Phenols:                0

In [12]:
# Create the feature data and label variables
wine_data = wine.data
wine_label = wine.target

# Inspect the feature data and labels
print(wine_data.shape)
print(wine_label.shape)

wine_df = pd.DataFrame(wine.data, columns=wine.feature_names)
wine_df.head()

|   | alcohol | malic_acid | ash | alcalinity_of_ash | magnesium | total_phenols | flavanoids | nonflavanoid_phenols | proanthocyanins | color_intensity | hue | od280/od315_of_diluted_wines | proline |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 14.23 | 1.71 | 2.43 | 15.6 | 127.0 | 2.8 | 3.06 | 0.28 | 2.29 | 5.64 | 1.04 | 3.92 | 1065.0 |
| 1 | 13.2 | 1.78 | 2.14 | 11.2 | 100.0 | 2.65 | 2.76 | 0.26 | 1.28 | 4.38 | 1.05 | 3.4 | 1050.0 |
| 2 | 13.16 | 2.36 | 2.67 | 18.6 | 101.0 | 2.8 | 3.24 | 0.3 | 2.81 | 5.68 | 1.03 | 3.17 | 1185.0 |
| 3 | 14.37 | 1.95 | 2.5 | 16.8 | 113.0 | 3.85 | 3.49 | 0.24 | 2.18 | 7.8 | 0.86 | 3.45 | 1480.0 |
| 4 | 13.24 | 2.59 | 2.87 | 21.0 | 118.0 | 2.8 | 2.69 | 0.39 | 1.82 | 4.32 | 1.04 | 2.93 | 735.0 |
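
Before splitting the data, it can be useful to check how many samples belong to each class, since per-class scores later depend on how many samples each class contributes. The following is a minimal sketch, assuming the same `load_wine()` data as above; `class_counts` is a name introduced here for illustration only.

```python
import numpy as np
from sklearn.datasets import load_wine

wine = load_wine()

# Count how many samples carry each integer label (0, 1, 2).
class_counts = np.bincount(wine.target)

# Print the count and the share of each class.
for name, count in zip(wine.target_names, class_counts):
    print(f"{name}: {count} samples ({count / len(wine.target):.1%})")
```

The three classes are not equally sized (59, 71, and 48 samples), which is worth keeping in mind when reading the `support` column of the classification reports.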


In [25]:
# Split into train and test data
X_train, X_test, y_train, y_test = train_test_split(wine_data, wine_label,
                                                   test_size=0.2, random_state=7)
print(X_train.shape)
print(X_test.shape)
print(y_train.shape)
print(y_test.shape)

(142, 13)
(36, 13)
(142,)
(36,)
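
With only 36 test samples, a purely random split can shift the class proportions between the train and test sets. One optional variation, not used in this notebook, is to pass `stratify=` to `train_test_split` so that each class keeps roughly its overall share in both sets. The sketch below assumes the same data and `random_state`; the `_s` suffixed names are hypothetical and chosen so they do not overwrite the variables above.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split

wine = load_wine()

# Same split settings as above, plus stratify= (an assumed variation):
# the label array is used to keep class proportions similar in both sets.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=7, stratify=wine.target
)

print(X_train_s.shape, X_test_s.shape)
print("train class counts:", np.bincount(y_train_s))
print("test class counts: ", np.bincount(y_test_s))
```

Results obtained on a stratified split would differ from the numbers reported in this notebook, so the two should not be compared directly.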


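## 2. Training, predicting with, and evaluating scikit-learn machine learning models
- Each model's classification results can be evaluated with precision, recall, f1-score, and accuracy.
- For the wine classifier, I consider a model better the higher its accuracy and its f1-score, the harmonic mean of precision and recall.  
  The reason is that precision = TP / (TP + FP) and recall = TP / (TP + FN): precision drops with false positives and recall drops with false negatives. In wine classification, rejecting the correct class and accepting a wrong one are equally bad mistakes, so both must be reflected. Put differently, neither kind of error matters more than the other here, so f1-score and accuracy are the two metrics used to evaluate the models.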
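
The formulas above can be worked through by hand on a confusion matrix. The sketch below uses the 3-class matrix that the Decision Tree cell in this section prints and derives per-class precision, recall, and f1-score with NumPy. The variable names (`cm`, `tp`, `fp`, `fn`) are chosen here for illustration; in a multi-class setting each class is treated in turn as the "positive" class.

```python
import numpy as np

# Confusion matrix in scikit-learn's layout:
# rows = true class, columns = predicted class.
cm = np.array([[7, 0, 0],
               [0, 17, 0],
               [0, 2, 10]])

tp = np.diag(cm)            # samples of each class predicted correctly
fp = cm.sum(axis=0) - tp    # predicted as the class, but actually another class
fn = cm.sum(axis=1) - tp    # actually the class, but predicted as another class

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
accuracy = tp.sum() / cm.sum()

for k in range(len(cm)):
    print(f"class {k}: precision={precision[k]:.2f} "
          f"recall={recall[k]:.2f} f1={f1[k]:.2f}")
print(f"accuracy={accuracy:.2f}  macro f1={f1.mean():.2f}")
```

The rounded values agree with the `classification_report` output for the Decision Tree: the two class 2 samples predicted as class 1 lower the recall of class 2 and the precision of class 1.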

In [26]:
# Train a Decision Tree model
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)
y_pred = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.89      1.00      0.94        17
           2       1.00      0.83      0.91        12

    accuracy                           0.94        36
   macro avg       0.96      0.94      0.95        36
weighted avg       0.95      0.94      0.94        36

[[ 7  0  0]
 [ 0 17  0]
 [ 0  2 10]]


In [27]:
# Train a Random Forest model
random_forest = RandomForestClassifier(random_state=32)
random_forest.fit(X_train, y_train)
y_pred = random_forest.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       1.00      1.00      1.00        17
           2       1.00      1.00      1.00        12

    accuracy                           1.00        36
   macro avg       1.00      1.00      1.00        36
weighted avg       1.00      1.00      1.00        36

[[ 7  0  0]
 [ 0 17  0]
 [ 0  0 12]]


In [28]:
# Train an SVM model
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86         7
           1       0.58      0.88      0.70        17
           2       0.33      0.08      0.13        12

    accuracy                           0.61        36
   macro avg       0.59      0.61      0.56        36
weighted avg       0.55      0.61      0.54        36

[[ 6  0  1]
 [ 1 15  1]
 [ 0 11  1]]
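
One plausible reason for the low SVM score is feature scale. In the `head()` output above, `proline` is in the hundreds to thousands while `hue` is around 1, and the default RBF-kernel `SVC` relies on distances between samples, so the large-valued features dominate. The following is a minimal sketch of a variation that is not part of the original experiment: it standardizes the features with `StandardScaler` inside a pipeline and compares test accuracy against the unscaled model. The names `raw_svm` and `scaled_svm` are hypothetical.

```python
from sklearn.datasets import load_wine
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

wine = load_wine()
X_train, X_test, y_train, y_test = train_test_split(
    wine.data, wine.target, test_size=0.2, random_state=7
)

# Baseline: SVC on the raw features, as in the cell above.
raw_svm = SVC().fit(X_train, y_train)
raw_acc = accuracy_score(y_test, raw_svm.predict(X_test))

# Assumed variation: standardize each feature (zero mean, unit variance),
# fitting the scaler on the training data only, then train the SVC.
scaled_svm = make_pipeline(StandardScaler(), SVC()).fit(X_train, y_train)
scaled_acc = accuracy_score(y_test, scaled_svm.predict(X_test))

print(f"SVC without scaling: {raw_acc:.2f}")
print(f"SVC with scaling:    {scaled_acc:.2f}")
```

Using a pipeline keeps the scaler's statistics from leaking test-set information into training. The same preprocessing would likely matter for `SGDClassifier` as well, since gradient-based linear models are also sensitive to feature scale.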


In [29]:
# Train an SGD Classifier model
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)
y_pred = sgd_model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       0.86      0.86      0.86         7
           1       0.71      0.88      0.79        17
           2       0.75      0.50      0.60        12

    accuracy                           0.75        36
   macro avg       0.77      0.75      0.75        36
weighted avg       0.75      0.75      0.74        36

[[ 6  0  1]
 [ 1 15  1]
 [ 0  6  6]]


In [37]:
# Train a Logistic Regression model
logistic_model = LogisticRegression(solver='newton-cg')
logistic_model.fit(X_train, y_train)
y_pred = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))

              precision    recall  f1-score   support

           0       1.00      1.00      1.00         7
           1       0.94      1.00      0.97        17
           2       1.00      0.92      0.96        12

    accuracy                           0.97        36
   macro avg       0.98      0.97      0.98        36
weighted avg       0.97      0.97      0.97        36

[[ 7  0  0]
 [ 0 17  0]
 [ 0  1 11]]


## 3. Retrospective
- The results of training and testing the five models on the wine dataset are as follows.
- Ranked from highest to lowest by macro-average f1-score and accuracy:  
  Random Forest Classifier (f1-score: 100%, accuracy: 100%),  
  Logistic Regression (f1-score: 98%, accuracy: 97%),  
  Decision Tree Classifier (f1-score: 95%, accuracy: 94%),  
  SGD Classifier (f1-score: 75%, accuracy: 75%),  
  SVM (f1-score: 56%, accuracy: 61%)
- On this test split, the Random Forest Classifier appears to classify the wine dataset best.
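
Because the ranking above rests on a single test split of 36 samples, one or two misclassified samples can change a score by several percentage points. A possible follow-up, sketched here under assumptions and not part of the original experiment, is to compare the same five models with 5-fold cross-validation using `cross_val_score`. The `models` dictionary, the `cv_scores` name, and the `random_state=32` given to `SGDClassifier` (added only for repeatability) are choices made for this illustration.

```python
from sklearn.datasets import load_wine
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

wine = load_wine()

# The five models from this notebook, with the same settings where given.
models = {
    "Decision Tree": DecisionTreeClassifier(random_state=32),
    "Random Forest": RandomForestClassifier(random_state=32),
    "SVM": SVC(),
    "SGD Classifier": SGDClassifier(random_state=32),
    "Logistic Regression": LogisticRegression(solver="newton-cg"),
}

# Mean macro-average f1-score over 5 stratified folds for each model.
cv_scores = {}
for name, model in models.items():
    scores = cross_val_score(model, wine.data, wine.target,
                             cv=5, scoring="f1_macro")
    cv_scores[name] = scores.mean()

# Print the models from best to worst mean score.
for name, score in sorted(cv_scores.items(), key=lambda item: item[1],
                          reverse=True):
    print(f"{name}: {score:.3f}")
```

Averaging over folds gives a less split-dependent comparison, though the exact numbers will vary with the scikit-learn version and may not reproduce the ordering seen on the single split.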