# [Project 1] load_digits

#### * Goal
Classify handwritten digit images into ten categories (0 to 9).

### 1. Import modules

In [1]:
from sklearn.datasets import load_digits 
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.linear_model import SGDClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix # confusion matrix

print("Import Success")

Import Success


### 2. Prepare the data

In [2]:
digits = load_digits()

print(digits.keys()) # 7 keys

dict_keys(['data', 'target', 'frame', 'feature_names', 'target_names', 'images', 'DESCR'])


### 3. Understand the data

#### 1) Assign the feature data

In [3]:
digits_feature = digits.data
print(digits_feature.shape) # 1797 samples x 64 features

(1797, 64)
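
Each row of `digits.data` is an 8x8 image flattened to 64 values. As a small check (a sketch, not part of the original notebook), the first row can be reshaped and compared with the matching entry in `digits.images`. The names `first_image` and `matches_images` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Reshape the first 64-value feature row back into an 8x8 pixel grid.
first_image = digits.data[0].reshape(8, 8)

# digits.images stores the same pixels in their original 8x8 layout.
matches_images = np.array_equal(first_image, digits.images[0])

print(first_image.shape)
print(matches_images)
```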


#### 2) Assign the label data

In [4]:
digits_label = digits.target
print(digits_label.shape) # 1797 samples

(1797,)


In [5]:
# Inspect the feature and label data as a DataFrame
digits_df = pd.DataFrame(data = digits_feature, columns = digits.feature_names)
digits_df["label"] = digits.target
digits_df

Unnamed: 0,pixel_0_0,pixel_0_1,pixel_0_2,pixel_0_3,pixel_0_4,pixel_0_5,pixel_0_6,pixel_0_7,pixel_1_0,pixel_1_1,...,pixel_6_7,pixel_7_0,pixel_7_1,pixel_7_2,pixel_7_3,pixel_7_4,pixel_7_5,pixel_7_6,pixel_7_7,label
0,0.0,0.0,5.0,13.0,9.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.0,13.0,10.0,0.0,0.0,0.0,0
1,0.0,0.0,0.0,12.0,13.0,5.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,11.0,16.0,10.0,0.0,0.0,1
2,0.0,0.0,0.0,4.0,15.0,12.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,3.0,11.0,16.0,9.0,0.0,2
3,0.0,0.0,7.0,15.0,13.0,1.0,0.0,0.0,0.0,8.0,...,0.0,0.0,0.0,7.0,13.0,13.0,9.0,0.0,0.0,3
4,0.0,0.0,0.0,1.0,11.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,2.0,16.0,4.0,0.0,0.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1792,0.0,0.0,4.0,10.0,13.0,6.0,0.0,0.0,0.0,1.0,...,0.0,0.0,0.0,2.0,14.0,15.0,9.0,0.0,0.0,9
1793,0.0,0.0,6.0,16.0,13.0,11.0,1.0,0.0,0.0,0.0,...,0.0,0.0,0.0,6.0,16.0,14.0,6.0,0.0,0.0,0
1794,0.0,0.0,1.0,11.0,15.0,1.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,2.0,9.0,13.0,6.0,0.0,0.0,8
1795,0.0,0.0,2.0,10.0,7.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,5.0,12.0,16.0,12.0,0.0,0.0,9
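
scikit-learn can also build a similar table directly. The sketch below assumes a scikit-learn version that supports `as_frame=True` (the `frame` key printed in section 2 suggests it does). In that frame the label column is named `target` rather than `label`.

```python
from sklearn.datasets import load_digits

# as_frame=True returns pandas objects instead of NumPy arrays.
digits_as_frame = load_digits(as_frame=True)

# .frame holds the 64 pixel columns plus a final "target" column.
frame = digits_as_frame.frame
print(frame.shape)
print(frame.columns[-1])
```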


#### 3) Print the target names

In [6]:
print(digits.target_names) # digits 0 to 9

[0 1 2 3 4 5 6 7 8 9]
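
Knowing how many samples each digit has helps when choosing an evaluation metric later. A minimal sketch that counts the labels with `np.bincount` (the variable name `class_counts` is illustrative):

```python
import numpy as np
from sklearn.datasets import load_digits

digits = load_digits()

# Count how many samples carry each label 0..9.
class_counts = np.bincount(digits.target)

for digit, count in zip(digits.target_names, class_counts):
    print(f"digit {digit}: {count} samples")
```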


#### 4) Dataset description

In [7]:
print(digits.DESCR) # dataset description

.. _digits_dataset:

Optical recognition of handwritten digits dataset
--------------------------------------------------

**Data Set Characteristics:**

    :Number of Instances: 5620
    :Number of Attributes: 64
    :Attribute Information: 8x8 image of integer pixels in the range 0..16.
    :Missing Attribute Values: None
    :Creator: E. Alpaydin (alpaydin '@' boun.edu.tr)
    :Date: July; 1998

This is a copy of the test set of the UCI ML hand-written digits datasets
https://archive.ics.uci.edu/ml/datasets/Optical+Recognition+of+Handwritten+Digits

The data set contains images of hand-written digits: 10 classes where
each class refers to a digit.

Preprocessing programs made available by NIST were used to extract
normalized bitmaps of handwritten digits from a preprinted form. From a
total of 43 people, 30 contributed to the training set and different 13
to the test set. 32x32 bitmaps are divided into nonoverlapping blocks of
4x4 and the number of on pixels are counted in each blo
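
The description explains where the 0..16 pixel range comes from: each 32x32 bitmap is cut into non-overlapping 4x4 blocks, and the "on" pixels in each block are counted. A rough illustration of that step on a random bitmap (the bitmap is synthetic, not real digit data, and the NumPy reshape is just one way to express the block count):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 32x32 binary bitmap standing in for a scanned digit.
bitmap = rng.integers(0, 2, size=(32, 32))

# View the bitmap as an 8x8 grid of 4x4 blocks, then count "on" pixels per block.
blocks = bitmap.reshape(8, 4, 8, 4).sum(axis=(1, 3))

print(blocks.shape)
print(blocks.min(), blocks.max())  # each count lies between 0 and 16
```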

### 4. Split into train and test data

In [8]:
x_train, x_test, y_train, y_test = train_test_split(digits_feature, 
                                                    digits_label, 
                                                    test_size = 0.2, 
                                                    random_state = 7)

print('x_train count: ', len(x_train), ', x_test count: ', len(x_test))
print(x_train.shape, y_train.shape)
print(x_test.shape, y_test.shape)

x_train count:  1437 , x_test count:  360
(1437, 64) (1437,)
(360, 64) (360,)
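
`train_test_split` shuffles the rows but does not, by default, keep the class proportions equal in both parts. A possible variation, not used in this notebook, passes `stratify` so that every digit appears in the test set in roughly the same proportion as in the full data. The `xs_*` and `ys_*` names are illustrative.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split

digits = load_digits()

# Same split settings as above, plus stratify to keep class ratios similar.
xs_train, xs_test, ys_train, ys_test = train_test_split(
    digits.data,
    digits.target,
    test_size=0.2,
    random_state=7,
    stratify=digits.target,
)

print(xs_train.shape, xs_test.shape)
print(np.bincount(ys_test))  # per-class counts in the test set
```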


### 5. Train several models

#### 1) Decision Tree

In [9]:
decision_tree = DecisionTreeClassifier(random_state = 32) # store the model in a variable; random_state fixes the random seed
decision_tree.fit(x_train, y_train) # train the model
y_pred = decision_tree.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[42  0  0  1  0  0  0  0  0  0]
 [ 0 34  3  1  0  1  1  0  0  2]
 [ 0  0 33  2  0  0  1  1  2  1]
 [ 0  1  0 31  0  0  0  0  1  1]
 [ 0  0  1  0 35  0  0  0  1  0]
 [ 0  1  0  0  0 27  0  0  0  0]
 [ 0  0  0  0  2  0 26  0  0  0]
 [ 0  0  0  1  2  1  0 27  0  2]
 [ 0  5  4  1  1  0  3  0 28  1]
 [ 0  1  1  2  2  1  0  0  0 25]]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99        43
           1       0.81      0.81      0.81        42
           2       0.79      0.82      0.80        40
           3       0.79      0.91      0.85        34
           4       0.83      0.95      0.89        37
           5       0.90      0.96      0.93        28
           6       0.84      0.93      0.88        28
           7       0.96      0.82      0.89        33
           8       0.88      0.65      0.75        43
           9       0.78      0.78      0.78        32

    accuracy                           0.86       360
   macro avg       
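
The figures in the report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. A minimal sketch in plain Python, using the Decision Tree matrix printed above (the helper names `recall` and `precision` are illustrative):

```python
# Decision Tree confusion matrix copied from the output above
# (rows = true digit, columns = predicted digit).
cm = [
    [42, 0, 0, 1, 0, 0, 0, 0, 0, 0],
    [0, 34, 3, 1, 0, 1, 1, 0, 0, 2],
    [0, 0, 33, 2, 0, 0, 1, 1, 2, 1],
    [0, 1, 0, 31, 0, 0, 0, 0, 1, 1],
    [0, 0, 1, 0, 35, 0, 0, 0, 1, 0],
    [0, 1, 0, 0, 0, 27, 0, 0, 0, 0],
    [0, 0, 0, 0, 2, 0, 26, 0, 0, 0],
    [0, 0, 0, 1, 2, 1, 0, 27, 0, 2],
    [0, 5, 4, 1, 1, 0, 3, 0, 28, 1],
    [0, 1, 1, 2, 2, 1, 0, 0, 0, 25],
]

total = sum(sum(row) for row in cm)
correct = sum(cm[i][i] for i in range(len(cm)))
accuracy = correct / total


def recall(k):
    # Correct predictions for class k divided by all true samples of class k.
    return cm[k][k] / sum(cm[k])


def precision(k):
    # Correct predictions for class k divided by everything predicted as k.
    return cm[k][k] / sum(row[k] for row in cm)


print(f"accuracy     = {accuracy:.2f}")
print(f"recall(8)    = {recall(8):.2f}")
print(f"precision(8) = {precision(8):.2f}")
```

Digit 8 is a useful class to inspect here because it has the lowest recall in the report.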

#### 2) Random Forest

In [10]:
random_forest = RandomForestClassifier(random_state = 32) # store the model in a variable; random_state fixes the random seed
random_forest.fit(x_train, y_train) # train the model
y_pred = random_forest.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[42  0  0  0  1  0  0  0  0  0]
 [ 0 42  0  0  0  0  0  0  0  0]
 [ 0  0 40  0  0  0  0  0  0  0]
 [ 0  0  0 34  0  0  0  0  0  0]
 [ 0  0  0  0 37  0  0  0  0  0]
 [ 0  0  0  0  0 27  0  0  0  1]
 [ 0  0  0  0  1  0 27  0  0  0]
 [ 0  0  0  0  0  0  0 32  0  1]
 [ 0  3  0  0  1  1  0  2 36  0]
 [ 0  0  0  0  0  2  0  0  0 30]]
              precision    recall  f1-score   support

           0       1.00      0.98      0.99        43
           1       0.93      1.00      0.97        42
           2       1.00      1.00      1.00        40
           3       1.00      1.00      1.00        34
           4       0.93      1.00      0.96        37
           5       0.90      0.96      0.93        28
           6       1.00      0.96      0.98        28
           7       0.94      0.97      0.96        33
           8       1.00      0.84      0.91        43
           9       0.94      0.94      0.94        32

    accuracy                           0.96       360
   macro avg       
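
A fitted random forest exposes `feature_importances_`, one value per input feature. Since the 64 features are pixels, reshaping the importances to 8x8 gives a rough picture of which image regions the forest relies on. This is an optional exploration, not part of the original notebook, and `importance_grid` is an illustrative name.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=7
)

forest = RandomForestClassifier(random_state=32)
forest.fit(x_train, y_train)

# One importance value per pixel; reshape to match the 8x8 image layout.
importance_grid = forest.feature_importances_.reshape(8, 8)
print(importance_grid.round(3))
```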

#### 3) SVM

In [11]:
_svm = svm.SVC() # store the model in a variable
_svm.fit(x_train, y_train) # train the model
y_pred = _svm.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[43  0  0  0  0  0  0  0  0  0]
 [ 0 42  0  0  0  0  0  0  0  0]
 [ 0  0 40  0  0  0  0  0  0  0]
 [ 0  0  0 34  0  0  0  0  0  0]
 [ 0  0  0  0 37  0  0  0  0  0]
 [ 0  0  0  0  0 28  0  0  0  0]
 [ 0  0  0  0  0  0 28  0  0  0]
 [ 0  0  0  0  0  0  0 33  0  0]
 [ 0  2  0  0  0  1  0  0 40  0]
 [ 0  0  0  0  0  1  0  0  0 31]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        43
           1       0.95      1.00      0.98        42
           2       1.00      1.00      1.00        40
           3       1.00      1.00      1.00        34
           4       1.00      1.00      1.00        37
           5       0.93      1.00      0.97        28
           6       1.00      1.00      1.00        28
           7       1.00      1.00      1.00        33
           8       1.00      0.93      0.96        43
           9       1.00      0.97      0.98        32

    accuracy                           0.99       360
   macro avg       
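
`svm.SVC()` is created with no arguments, so it runs with scikit-learn's defaults. In recent scikit-learn versions that means an RBF kernel with `C=1.0`; the quick check below prints the values for whichever version is installed.

```python
from sklearn import svm

default_svc = svm.SVC()

# get_params() returns the constructor arguments, including defaults.
params = default_svc.get_params()
print(params["kernel"], params["C"], params["gamma"])
```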

#### 4) SGD Classifier

In [12]:
sgd = SGDClassifier() # store the model in a variable
sgd.fit(x_train, y_train) # train the model
y_pred = sgd.predict(x_test)

print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[41  0  0  0  1  0  0  1  0  0]
 [ 0 31  1  1  0  0  0  0  7  2]
 [ 0  0 40  0  0  0  0  0  0  0]
 [ 0  0  0 30  0  1  0  1  2  0]
 [ 0  0  0  0 36  0  1  0  0  0]
 [ 0  0  0  0  0 28  0  0  0  0]
 [ 0  1  0  0  0  0 26  0  1  0]
 [ 0  0  0  0  1  0  0 32  0  0]
 [ 0  2  1  0  1  0  0  1 38  0]
 [ 0  0  0  0  0  1  0  1  1 29]]
              precision    recall  f1-score   support

           0       1.00      0.95      0.98        43
           1       0.91      0.74      0.82        42
           2       0.95      1.00      0.98        40
           3       0.97      0.88      0.92        34
           4       0.92      0.97      0.95        37
           5       0.93      1.00      0.97        28
           6       0.96      0.93      0.95        28
           7       0.89      0.97      0.93        33
           8       0.78      0.88      0.83        43
           9       0.94      0.91      0.92        32

    accuracy                           0.92       360
   macro avg       
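
`SGDClassifier()` is created here without `random_state`, so its data shuffling differs between runs and the numbers above may not reproduce exactly. A sketch of a reproducible variant follows; the seed value 32 is an assumption that simply mirrors the tree models above.

```python
import numpy as np
from sklearn.datasets import load_digits
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=7
)

# Two models with the same seed should make identical predictions.
pred_a = SGDClassifier(random_state=32).fit(x_train, y_train).predict(x_test)
pred_b = SGDClassifier(random_state=32).fit(x_train, y_train).predict(x_test)

print(np.array_equal(pred_a, pred_b))
```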

#### 5) Logistic Regression

In [13]:
logistic_regression = LogisticRegression(max_iter = 3000) # store the model in a variable; max_iter sets the maximum number of training iterations
logistic_regression.fit(x_train, y_train) # train the model
y_pred = logistic_regression.predict(x_test)


print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

[[43  0  0  0  0  0  0  0  0  0]
 [ 0 40  0  0  0  0  0  0  1  1]
 [ 0  0 40  0  0  0  0  0  0  0]
 [ 0  0  0 33  0  0  0  1  0  0]
 [ 0  0  0  0 37  0  0  0  0  0]
 [ 0  0  0  0  0 27  0  0  1  0]
 [ 0  0  0  0  0  0 27  0  1  0]
 [ 0  0  0  1  0  0  0 32  0  0]
 [ 0  2  1  0  0  4  0  1 35  0]
 [ 0  0  0  1  0  3  0  0  0 28]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        43
           1       0.95      0.95      0.95        42
           2       0.98      1.00      0.99        40
           3       0.94      0.97      0.96        34
           4       1.00      1.00      1.00        37
           5       0.79      0.96      0.87        28
           6       1.00      0.96      0.98        28
           7       0.94      0.97      0.96        33
           8       0.92      0.81      0.86        43
           9       0.97      0.88      0.92        32

    accuracy                           0.95       360
   macro avg       
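
The five cells above repeat the same fit and predict steps. One way to condense them is a loop over a dictionary of models that records `accuracy_score` for each. Note one assumption: the SGD classifier is given `random_state=32` here for repeatability, which the cell above does not set, so its score may differ from the printed report.

```python
from sklearn.datasets import load_digits
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

digits = load_digits()
x_train, x_test, y_train, y_test = train_test_split(
    digits.data, digits.target, test_size=0.2, random_state=7
)

models = {
    "Decision Tree": DecisionTreeClassifier(random_state=32),
    "Random Forest": RandomForestClassifier(random_state=32),
    "SVM": SVC(),
    "SGD Classifier": SGDClassifier(random_state=32),  # seed added (assumption)
    "Logistic Regression": LogisticRegression(max_iter=3000),
}

# Fit each model on the training split and score it on the test split.
accuracies = {}
for name, model in models.items():
    model.fit(x_train, y_train)
    accuracies[name] = accuracy_score(y_test, model.predict(x_test))

for name, acc in accuracies.items():
    print(f"{name}: {acc:.2f}")
```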

### 6. Model evaluation

#### * Prediction results (classification_report)

* Comparing the classification reports, the SVM model achieved the highest accuracy of the five models, at 99%.

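* An SVM is said to map the data into a higher-dimensional space in which the classes are easier to separate.
This approach seems to be effective for a problem with many classes, such as this one.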
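
The accuracy comparison can be reproduced from the confusion matrices alone: each model's correct predictions are the sum of the matrix diagonal, out of 360 test samples. The counts below were summed from the matrices printed in section 5.

```python
# Correct predictions per model = sum of the confusion-matrix diagonal
# from section 5; every model was tested on the same 360 samples.
TEST_SIZE = 360
correct_counts = {
    "Decision Tree": 308,
    "Random Forest": 347,
    "SVM": 356,
    "SGD Classifier": 331,
    "Logistic Regression": 342,
}

accuracies = {name: n / TEST_SIZE for name, n in correct_counts.items()}
best_model = max(accuracies, key=accuracies.get)

# Print the models from most to least accurate.
for name, acc in sorted(accuracies.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {acc:.2f}")
print("best:", best_model)
```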

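#### * Evaluation metric (Confusion Matrix)

* The labels are roughly balanced (28 to 43 test samples per class), and correctly identifying every digit matters equally, so accuracy was used as the evaluation metric.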
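
The balance claim can be checked against the `support` column of the reports. The sketch below also computes the accuracy of a trivial model that always predicts the most common digit, as a rough reference point for reading the scores above. This baseline is an added illustration, not something the notebook computes.

```python
# Test-set support per digit (0..9), copied from the "support" column above.
support = [43, 42, 40, 34, 37, 28, 28, 33, 43, 32]

total = sum(support)
shares = [count / total for count in support]

# Accuracy of a trivial model that always predicts the most common digit.
majority_baseline = max(support) / total

print(f"total test samples: {total}")
print(f"smallest class share: {min(shares):.3f}")
print(f"largest class share:  {max(shares):.3f}")
print(f"majority-class baseline accuracy: {majority_baseline:.3f}")
```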