## load_digits: Classifying Handwritten Digits
Kernel: base (Python 3.9.7)

### 0. Rubric
***

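|**Evaluation item**|**Detailed criteria**|
|------------|-------------|
|1. Were the three datasets set up in a reasonable way?|The data analysis used to select features and labels was carried out systematically|
|2. Were five models each applied successfully to the three datasets?|Model training and testing ran correctly|
|3. Were appropriate evaluation metrics chosen for each of the three datasets?|The choice of metric and the reasoning behind it are sound|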


### 1. Importing the required modules

In [26]:
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
import numpy as np

print('done')

done


### 2. Preparing the data
***

In [27]:
digits = load_digits() # load the digits dataset

print(dir(digits)) # dir() lists the attributes and methods an object has

digits.keys() # check what information digits holds

['DESCR', 'data', 'feature_names', 'frame', 'images', 'target', 'target_names']


dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])

### 3. Understanding the data
***

#### 3.1 Selecting the feature data

In [28]:
digits_data = digits.data 

print(digits_data.shape)

digits_data[0] # inspect sample 0 of digits_data

(1797, 64)


array([ 0.,  0.,  5., 13.,  9.,  1.,  0.,  0.,  0.,  0., 13., 15., 10.,
       15.,  5.,  0.,  0.,  3., 15.,  2.,  0., 11.,  8.,  0.,  0.,  4.,
       12.,  0.,  0.,  8.,  8.,  0.,  0.,  5.,  8.,  0.,  0.,  9.,  8.,
        0.,  0.,  4., 11.,  0.,  1., 12.,  7.,  0.,  0.,  2., 14.,  5.,
       10., 12.,  0.,  0.,  0.,  0.,  6., 13., 10.,  0.,  0.,  0.])
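
The 64 values above are one 8x8 image flattened row by row, which matches the `pixel_<row>_<col>` naming in `digits.feature_names`. The following is a minimal sketch, assuming only the standard `load_digits` attributes listed in section 2, that reshapes sample 0 back into its grid and compares it with `digits.images[0]`. The variable names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Each row of digits.data is one 8x8 image flattened row by row into 64 values.
sample_image = digits.data[0].reshape(8, 8)

# digits.images holds the same pixels already arranged as 8x8 arrays.
same_as_images = np.array_equal(sample_image, digits.images[0])

print(sample_image.shape, same_as_images)
```

Reshaping this way is handy when a sample needs to be inspected visually, while the flat 64-column form is what the classifiers below consume.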

In [29]:
digits.feature_names

['pixel_0_0',
 'pixel_0_1',
 'pixel_0_2',
 'pixel_0_3',
 'pixel_0_4',
 'pixel_0_5',
 'pixel_0_6',
 'pixel_0_7',
 'pixel_1_0',
 'pixel_1_1',
 'pixel_1_2',
 'pixel_1_3',
 'pixel_1_4',
 'pixel_1_5',
 'pixel_1_6',
 'pixel_1_7',
 'pixel_2_0',
 'pixel_2_1',
 'pixel_2_2',
 'pixel_2_3',
 'pixel_2_4',
 'pixel_2_5',
 'pixel_2_6',
 'pixel_2_7',
 'pixel_3_0',
 'pixel_3_1',
 'pixel_3_2',
 'pixel_3_3',
 'pixel_3_4',
 'pixel_3_5',
 'pixel_3_6',
 'pixel_3_7',
 'pixel_4_0',
 'pixel_4_1',
 'pixel_4_2',
 'pixel_4_3',
 'pixel_4_4',
 'pixel_4_5',
 'pixel_4_6',
 'pixel_4_7',
 'pixel_5_0',
 'pixel_5_1',
 'pixel_5_2',
 'pixel_5_3',
 'pixel_5_4',
 'pixel_5_5',
 'pixel_5_6',
 'pixel_5_7',
 'pixel_6_0',
 'pixel_6_1',
 'pixel_6_2',
 'pixel_6_3',
 'pixel_6_4',
 'pixel_6_5',
 'pixel_6_6',
 'pixel_6_7',
 'pixel_7_0',
 'pixel_7_1',
 'pixel_7_2',
 'pixel_7_3',
 'pixel_7_4',
 'pixel_7_5',
 'pixel_7_6',
 'pixel_7_7']

#### 3.2 Selecting the label data

In [30]:
digits_label = digits.target
print(digits_label.shape)
digits_label

(1797,)


array([0, 1, 2, ..., 8, 9, 8])

#### 3.3 Printing the target names

In [31]:
digits.target_names

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

#### 3.4 Reading the dataset description

In [32]:
print(digits.DESCR)

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 1797
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo

### 4. Preparing the training and test data
***

In [33]:
digits_df = pd.DataFrame(data=digits_data, columns=digits.feature_names) # convert to a pandas DataFrame (pass the name list directly; wrapping it in another list creates a MultiIndex)
digits_df

| |pixel_0_0|pixel_0_1|pixel_0_2|pixel_0_3|pixel_0_4|pixel_0_5|pixel_0_6|pixel_0_7|pixel_1_0|pixel_1_1|...|pixel_6_6|pixel_6_7|pixel_7_0|pixel_7_1|pixel_7_2|pixel_7_3|pixel_7_4|pixel_7_5|pixel_7_6|pixel_7_7|
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
|0|0.0|0.0|5.0|13.0|9.0|1.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|6.0|13.0|10.0|0.0|0.0|0.0|
|1|0.0|0.0|0.0|12.0|13.0|5.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|11.0|16.0|10.0|0.0|0.0|
|2|0.0|0.0|0.0|4.0|15.0|12.0|0.0|0.0|0.0|0.0|...|5.0|0.0|0.0|0.0|0.0|3.0|11.0|16.0|9.0|0.0|
|3|0.0|0.0|7.0|15.0|13.0|1.0|0.0|0.0|0.0|8.0|...|9.0|0.0|0.0|0.0|7.0|13.0|13.0|9.0|0.0|0.0|
|4|0.0|0.0|0.0|1.0|11.0|0.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|0.0|2.0|16.0|4.0|0.0|0.0|
|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|...|
|1792|0.0|0.0|4.0|10.0|13.0|6.0|0.0|0.0|0.0|1.0|...|4.0|0.0|0.0|0.0|2.0|14.0|15.0|9.0|0.0|0.0|
|1793|0.0|0.0|6.0|16.0|13.0|11.0|1.0|0.0|0.0|0.0|...|1.0|0.0|0.0|0.0|6.0|16.0|14.0|6.0|0.0|0.0|
|1794|0.0|0.0|1.0|11.0|15.0|1.0|0.0|0.0|0.0|0.0|...|0.0|0.0|0.0|0.0|2.0|9.0|13.0|6.0|0.0|0.0|
|1795|0.0|0.0|2.0|10.0|7.0|0.0|0.0|0.0|0.0|0.0|...|2.0|0.0|0.0|0.0|5.0|12.0|16.0|12.0|0.0|0.0|


#### 4.1 Splitting into training and test data

In [34]:
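X_train, X_test, y_train, y_test = train_test_split(digits_data,  # feature data the model learns from
                                                    digits_label, # labels the model must predict
                                                    test_size=0.2, # size of the test dataset: hold out 20% of the data for evaluation
                                                    random_state=9) # seed for the random shuffling applied before splitting

print('Number of X_train samples: ', len(X_train),', number of X_test samples: ', len(X_test))

Number of X_train samples:  1437 , number of X_test samples:  360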


In [35]:
X_train.shape, y_train.shape

((1437, 64), (1437,))

In [36]:
X_test.shape, y_test.shape

((360, 64), (360,))

In [37]:
y_train, y_test

(array([4, 8, 4, ..., 8, 9, 0]),
 array([1, 1, 7, 2, 4, 0, 1, 8, 8, 3, 1, 0, 5, 3, 6, 2, 3, 8, 2, 5, 3, 5,
        0, 0, 6, 8, 3, 2, 3, 8, 0, 1, 3, 2, 8, 0, 1, 7, 1, 3, 9, 2, 1, 4,
        1, 1, 2, 8, 4, 4, 0, 2, 8, 4, 8, 5, 7, 3, 8, 8, 9, 2, 4, 1, 5, 2,
        0, 5, 1, 4, 8, 4, 7, 6, 1, 9, 5, 1, 7, 6, 4, 0, 2, 5, 9, 1, 9, 7,
        8, 7, 6, 4, 1, 5, 3, 4, 8, 7, 2, 6, 2, 9, 4, 1, 6, 4, 0, 5, 7, 8,
        1, 3, 4, 3, 1, 3, 8, 6, 2, 5, 0, 7, 8, 9, 0, 1, 9, 7, 5, 6, 7, 9,
        9, 2, 4, 3, 8, 9, 0, 5, 2, 2, 1, 5, 4, 0, 1, 8, 5, 5, 4, 5, 2, 5,
        1, 7, 5, 5, 7, 4, 9, 3, 5, 4, 6, 9, 0, 3, 4, 1, 6, 0, 6, 3, 2, 8,
        3, 9, 2, 2, 2, 8, 3, 4, 2, 2, 8, 3, 7, 4, 2, 8, 5, 0, 1, 8, 9, 0,
        7, 5, 1, 6, 9, 0, 7, 5, 1, 3, 7, 3, 0, 9, 2, 9, 9, 8, 9, 4, 0, 7,
        8, 3, 5, 3, 4, 6, 6, 5, 0, 9, 6, 0, 6, 9, 4, 1, 5, 5, 0, 4, 2, 2,
        2, 3, 4, 0, 8, 0, 9, 4, 5, 1, 4, 1, 3, 8, 4, 9, 2, 8, 2, 2, 7, 1,
        8, 2, 0, 2, 9, 6, 2, 9, 3, 7, 4, 5, 7, 4, 9, 5, 6, 4, 5, 9, 2, 9,
     

### 5. Training a variety of models
***
Models to use:  
    1. Decision Tree  
    2. Random Forest  
    3. SVM (Support Vector Machine)  
    4. SGD (Stochastic Gradient Descent)  
    5. Logistic Regression  

In [38]:
# Decision Tree
decision_tree = DecisionTreeClassifier(random_state=32)
decision_tree.fit(X_train, y_train)

# Random Forest
random_forest = RandomForestClassifier(random_state=5)
random_forest.fit(X_train, y_train)

# SVM
svm_model = svm.SVC()
svm_model.fit(X_train, y_train)

# SGD
sgd_model = SGDClassifier()
sgd_model.fit(X_train, y_train)

# Logistic Regression
logistic_model = LogisticRegression(max_iter=2300) # a ConvergenceWarning is raised when max_iter is below 2300
logistic_model.fit(X_train, y_train) 

print('done')

done
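
The large `max_iter` needed for `LogisticRegression` is typical when the solver works on unscaled features. The following is a hedged sketch, not part of the original notebook: it assumes that standardizing the pixels with `StandardScaler` in a pipeline is acceptable for this exercise and reports how many iterations the solver then uses. The names `scaled_logistic` and `scaled_accuracy` are illustrative, and `max_iter=1000` is an arbitrary ceiling chosen for the sketch.

```python
from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9)  # same split settings as section 4.1

# Standardizing puts all pixel features on a comparable scale before the solver runs.
scaled_logistic = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
scaled_logistic.fit(X_train, y_train)

scaled_accuracy = scaled_logistic.score(X_test, y_test)
print("iterations used:", scaled_logistic[-1].n_iter_)
print("test accuracy: {:.3f}".format(scaled_accuracy))
```

Whether scaling helps accuracy on this dataset is something to check rather than assume; the sketch only makes the iteration count and the score visible side by side.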


### 6. Evaluating the models
***

#### 6.1 Interpreting each trained model's predictions on the test data

##### 6.1.1 Decision Tree

In [39]:
y_pred = decision_tree.predict(X_test)

print(classification_report(y_test, y_pred))
print("Training set accuracy: {:.3f}".format(decision_tree.score(X_train,y_train))) # report training accuracy
print("Test set accuracy: {:.3f}".format(decision_tree.score(X_test,y_test))) # report test accuracy

              precision    recall  f1-score   support

           0       1.00      0.88      0.94        33
           1       0.79      0.93      0.85        40
           2       0.85      0.87      0.86        45
           3       0.89      0.97      0.93        34
           4       0.91      0.73      0.81        41
           5       0.92      0.87      0.89        39
           6       0.82      0.97      0.89        29
           7       0.75      0.86      0.80        28
           8       0.78      0.78      0.78        36
           9       0.90      0.74      0.81        35

    accuracy                           0.86       360
   macro avg       0.86      0.86      0.86       360
weighted avg       0.86      0.86      0.86       360

Training set accuracy: 1.000
Test set accuracy: 0.856
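
The per-class report shows how often each digit is missed but not which digit it is mistaken for. The following is a minimal sketch, assuming the same split and tree settings as above, that uses `sklearn.metrics.confusion_matrix` to expose those pairs. The names `tree` and `cm` are illustrative.

```python
from sklearn.datasets import load_digits
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9)  # same split settings as section 4.1

tree = DecisionTreeClassifier(random_state=32).fit(X_train, y_train)

# Rows are true digits, columns are predicted digits; off-diagonal cells are errors.
cm = confusion_matrix(y_test, tree.predict(X_test))
print(cm)
```

Reading along a row shows where the samples of one true digit ended up, which is a useful complement to the recall column.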


##### 6.1.2 Random Forest

In [40]:
y_pred = random_forest.predict(X_test)

print(classification_report(y_test, y_pred))
print("Training set accuracy: {:.3f}".format(random_forest.score(X_train,y_train))) # report training accuracy
print("Test set accuracy: {:.3f}".format(random_forest.score(X_test,y_test))) # report test accuracy

              precision    recall  f1-score   support

           0       1.00      0.97      0.98        33
           1       0.98      1.00      0.99        40
           2       1.00      0.98      0.99        45
           3       1.00      1.00      1.00        34
           4       0.95      0.90      0.92        41
           5       1.00      0.97      0.99        39
           6       1.00      1.00      1.00        29
           7       0.90      0.96      0.93        28
           8       0.97      0.94      0.96        36
           9       0.89      0.97      0.93        35

    accuracy                           0.97       360
   macro avg       0.97      0.97      0.97       360
weighted avg       0.97      0.97      0.97       360

Training set accuracy: 1.000
Test set accuracy: 0.969


##### 6.1.3 Support Vector Machine (SVM)

In [41]:
y_pred = svm_model.predict(X_test)

print(classification_report(y_test, y_pred, labels=np.unique(y_pred))) # labels=np.unique(y_pred) limits the report to predicted classes, which suppresses the warning
print("Training set accuracy: {:.3f}".format(svm_model.score(X_train,y_train))) # report training accuracy
print("Test set accuracy: {:.3f}".format(svm_model.score(X_test,y_test))) # report test accuracy

              precision    recall  f1-score   support

           0       1.00      0.97      0.98        33
           1       1.00      1.00      1.00        40
           2       1.00      1.00      1.00        45
           3       1.00      1.00      1.00        34
           4       0.98      0.98      0.98        41
           5       1.00      0.97      0.99        39
           6       1.00      1.00      1.00        29
           7       1.00      0.96      0.98        28
           8       1.00      1.00      1.00        36
           9       0.92      1.00      0.96        35

    accuracy                           0.99       360
   macro avg       0.99      0.99      0.99       360
weighted avg       0.99      0.99      0.99       360

Training set accuracy: 0.995
Test set accuracy: 0.989
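
Passing `labels=np.unique(y_pred)` silences the warning by dropping any class that was never predicted from the report. All ten digits were predicted here, so the report is still complete. A hedged alternative, shown on made-up toy labels rather than the notebook's data, is the `zero_division` parameter of `classification_report`, which keeps every class in the report and fixes the value used for undefined metrics.

```python
from sklearn.metrics import classification_report

# Toy labels invented for illustration: class 2 exists but is never predicted.
y_true = [0, 1, 2, 2, 1, 0]
y_pred = [0, 1, 1, 1, 1, 0]

# zero_division=0 reports undefined precision as 0.0 instead of emitting a warning.
report = classification_report(y_true, y_pred, zero_division=0, output_dict=True)
print(report["2"])
```

With this approach a class the model never predicts stays visible with zero precision, which is usually the more informative outcome.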


##### 6.1.4 Stochastic Gradient Descent Classifier (SGDClassifier)

In [42]:
y_pred = sgd_model.predict(X_test)

print(classification_report(y_test, y_pred, labels=np.unique(y_pred))) # labels=np.unique(y_pred) limits the report to predicted classes, which suppresses the warning
print("Training set accuracy: {:.3f}".format(sgd_model.score(X_train,y_train))) # report training accuracy
print("Test set accuracy: {:.3f}".format(sgd_model.score(X_test,y_test))) # report test accuracy

              precision    recall  f1-score   support

           0       1.00      0.94      0.97        33
           1       0.95      0.97      0.96        40
           2       0.98      1.00      0.99        45
           3       1.00      0.97      0.99        34
           4       0.95      0.95      0.95        41
           5       1.00      0.95      0.97        39
           6       1.00      1.00      1.00        29
           7       0.89      0.89      0.89        28
           8       0.97      0.89      0.93        36
           9       0.85      1.00      0.92        35

    accuracy                           0.96       360
   macro avg       0.96      0.96      0.96       360
weighted avg       0.96      0.96      0.96       360

Training set accuracy: 0.982
Test set accuracy: 0.958


##### 6.1.5 Logistic Regression

In [43]:
y_pred = logistic_model.predict(X_test)

print(classification_report(y_test, y_pred))
print("Training set accuracy: {:.3f}".format(logistic_model.score(X_train,y_train))) # report training accuracy
print("Test set accuracy: {:.3f}".format(logistic_model.score(X_test,y_test))) # report test accuracy

              precision    recall  f1-score   support

           0       1.00      0.97      0.98        33
           1       0.93      0.95      0.94        40
           2       1.00      0.98      0.99        45
           3       0.97      0.94      0.96        34
           4       0.95      0.95      0.95        41
           5       0.95      0.95      0.95        39
           6       0.97      1.00      0.98        29
           7       0.96      0.93      0.95        28
           8       0.94      0.94      0.94        36
           9       0.92      0.97      0.94        35

    accuracy                           0.96       360
   macro avg       0.96      0.96      0.96       360
weighted avg       0.96      0.96      0.96       360

Training set accuracy: 1.000
Test set accuracy: 0.958


#### 6.2 The most important evaluation metric for this task, and why
(chosen from the metrics in sklearn.metrics)

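I consider **accuracy** the most important metric here.  
In the load_digits test data the 10 classes are spread fairly evenly (per-class support in the reports above ranges from 28 to 45), so the data cannot be called imbalanced, and for what the model predicts, overall correctness is what matters. I therefore use accuracy, rather than recall or f1-score, as the criterion for evaluating performance.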
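
The balance claim can be checked directly rather than read off the support column. The following is a minimal sketch, assuming the same split settings as section 4.1, that counts the test samples per digit with `np.bincount`. The name `class_counts` is illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
_, _, _, y_test = train_test_split(X, y, test_size=0.2, random_state=9)

# Number of test samples for each digit 0..9.
class_counts = np.bincount(y_test, minlength=10)
print(class_counts)
print("largest / smallest class: {:.2f}".format(class_counts.max() / class_counts.min()))
```

If an exactly proportional split were wanted, `train_test_split` also accepts `stratify=digits_label`, though the original notebook does not use it.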

Of the five models, the SVM, which has the highest test accuracy (0.989), looks like the best fit for classifying handwritten digits.
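
To make the comparison easier to repeat, the five evaluations can be collected in one loop. The following is a sketch under stated assumptions, not the notebook's original code: it reuses the same split and model settings, except that `SGDClassifier(random_state=0)` is given a seed for reproducibility (the original cell sets none, so its SGD numbers can differ between runs). The dictionary names are illustrative.

```python
from sklearn import svm
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_digits(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=9)  # same split settings as section 4.1

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=32),
    "Random Forest": RandomForestClassifier(random_state=5),
    "SVM": svm.SVC(),
    "SGD": SGDClassifier(random_state=0),  # seed is an assumption, see above
    "Logistic Regression": LogisticRegression(max_iter=2300),
}

# Fit each model and record its accuracy on the held-out test set.
test_accuracy = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    test_accuracy[name] = accuracy_score(y_test, model.predict(X_test))

# Print the models from highest to lowest test accuracy.
for name, acc in sorted(test_accuracy.items(), key=lambda item: item[1], reverse=True):
    print("{:<20s} {:.3f}".format(name, acc))
```

A single split gives only one estimate per model, so small gaps between the top models should be read with caution.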