# 我們與數字辨識的距離
How Far Are We from Digit Recognition?  
林晉宏

These slides were made with Jupyter.  
The source notebook is available at the link below:  
https://github.com/jephianlin/outreach/NSYSU-digits/NSYSU-digits.ipynb

- [MNIST handwritten digit database](#MNIST-database)
- [NSYSU-digits handwritten digit dataset](#NSYSU-digits-dataset)

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()

## MNIST database
The MNIST database of handwritten digits

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

![Webpage of MNIST dataset](MNIST-webpage.png "Webpage of MNIST dataset")

http://yann.lecun.com/exdb/mnist/

#### Dataset contents
- Training set: 60,000 images
- Test set: 10,000 images
- Stored in the `idx` format
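
The `idx` layout is simple enough to parse by hand. The sketch below assumes the usual layout for unsigned-byte data: two zero bytes, a type code (`0x08`), the number of dimensions, then one 4-byte big-endian size per dimension, followed by the raw bytes. The name `parse_idx` is made up for this example, and the blob is built in memory rather than read from a real MNIST file (which is additionally gzip-compressed).

```python
import struct
import numpy as np

def parse_idx(buf):
    """Parse an IDX byte string holding unsigned-byte data into a NumPy array."""
    # Header: two zero bytes, a type code, and the number of dimensions.
    zero, type_code, ndim = struct.unpack(">HBB", buf[:4])
    if zero != 0 or type_code != 0x08:
        raise ValueError("this sketch only handles unsigned-byte IDX data")
    # Each dimension size is a 4-byte big-endian unsigned integer.
    shape = struct.unpack(">" + "I" * ndim, buf[4:4 + 4 * ndim])
    data = np.frombuffer(buf, dtype=np.uint8, offset=4 + 4 * ndim)
    return data.reshape(shape)

# Build a tiny IDX blob in memory: 2 "images" of size 3x3.
images = np.arange(18, dtype=np.uint8).reshape(2, 3, 3)
blob = struct.pack(">HBBIII", 0, 0x08, 3, 2, 3, 3) + images.tobytes()

parsed = parse_idx(blob)
print(parsed.shape)
```

In practice `mnist.load_data()` hides all of this; the parser is only meant to show what the files contain.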

![MNIST examples](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png "MNIST examples")

(Source: [Wikipedia article on the MNIST database](https://en.wikipedia.org/wiki/MNIST_database)  
Author: Josef Steppan)

#### Data sources
- Drawn from two [NIST](https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology) datasets
- Special Database 3 (SD3): written by government employees
- Special Database 1 (SD1): written by high school students
- MNIST training = 30,000 SD3 + 30,000 SD1
- MNIST testing = 5,000 SD3 + 5,000 SD1

NIST originally used SD3 as the training set  
and SD1 as the test set.  
Because the two groups of writers differ so much,  
MNIST remixed the two datasets.

The SD3 writers were Census Bureau employees.

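#### Preprocessing
- Each image is 28x28
- The digit itself fits inside a 20x20 box
- Pixel values run from 0 (white, background) to 255 (black, ink)
- The digit is centered by the center of mass of its ink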
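
The steps above can be sketched in plain NumPy. This is only an approximation under stated assumptions: it resamples with nearest-neighbor indexing, whereas the original MNIST pipeline used anti-aliased resizing (which is where the gray levels come from), and it clips the placement so the digit stays on the canvas. The function name `center_digit` and the test image are made up for illustration.

```python
import numpy as np

def center_digit(img, box=20, size=28):
    """Fit the inked region into a box x box square and center it by ink mass."""
    rows, cols = np.where(img > 0)
    crop = img[rows.min():rows.max() + 1, cols.min():cols.max() + 1]
    # Scale so the longer side becomes `box` pixels (nearest-neighbor resampling).
    h, w = crop.shape
    scale = box / max(h, w)
    new_h, new_w = max(1, round(h * scale)), max(1, round(w * scale))
    r_idx = np.minimum((np.arange(new_h) / scale).astype(int), h - 1)
    c_idx = np.minimum((np.arange(new_w) / scale).astype(int), w - 1)
    small = crop[np.ix_(r_idx, c_idx)].astype(float)
    # Center of mass of the resized digit, in its own coordinates.
    total = small.sum()
    cm_r = (small.sum(axis=1) * np.arange(new_h)).sum() / total
    cm_c = (small.sum(axis=0) * np.arange(new_w)).sum() / total
    # Place the digit so its mass center lands near the middle of the canvas.
    center = (size - 1) / 2
    top = int(np.clip(round(center - cm_r), 0, size - new_h))
    left = int(np.clip(round(center - cm_c), 0, size - new_w))
    out = np.zeros((size, size))
    out[top:top + new_h, left:left + new_w] = small
    return out

# Hypothetical input: a tall filled rectangle on a 40x60 canvas.
demo = np.zeros((40, 60), dtype=np.uint8)
demo[5:35, 10:20] = 255
out = center_digit(demo)
print(out.shape)
```

Because the offset is rounded to whole pixels, the resulting mass center can sit up to half a pixel away from 13.5.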

In [None]:
### check shape, bounding box, and mass center
i = 0
arr = X_train[i]
print("image shape is", arr.shape)
### np.where returns (row indices, column indices) of the inked pixels
ink_rows, ink_cols = np.where(arr > 0)
print("vertical ink range is", ink_rows.min(), "~", ink_rows.max())
print("horizontal ink range is", ink_cols.min(), "~", ink_cols.max())
### ink-weighted average of the row / column index; the image center is 13.5
row_sum = np.sum(arr, axis=1)
print("vertical mass center at", (row_sum * np.arange(28)).sum() / row_sum.sum()) # ~ 13.5
col_sum = np.sum(arr, axis=0)
print("horizontal mass center at", (col_sum * np.arange(28)).sum() / col_sum.sum()) # ~ 13.5

![Records for models](MNIST-models.png "Records for models")

http://yann.lecun.com/exdb/mnist/

#### Last-resort baselines
- Random guessing: ~10%
- Guessing from the amount of ink: ~22%
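
The ~10% figure for random guessing does not depend on how the digits are distributed. A short derivation, assuming the guess $G$ is uniform on $\{0, \dots, 9\}$ and independent of the true label $Y$:

$$
P(G = Y) = \sum_{k=0}^{9} P(Y = k)\,P(G = k) = \frac{1}{10} \sum_{k=0}^{9} P(Y = k) = \frac{1}{10}.
$$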

In [None]:
from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay

In [None]:
### random guess
guess = np.random.randint(0, 10, y_test.shape)
accuracy_score(y_test, guess)

In [None]:
### ink density histogram
fig = plt.figure(figsize=(12,4))
flat_train = X_train.reshape(len(X_train), -1)
for i in range(10):
    mask = (y_train == i)
    wanted = flat_train[mask].mean(axis=1)
    plt.hist(wanted, bins=100, label=str(i))
plt.legend()

In [None]:
### ink density guess
### one center per digit: the mean pixel value over all training images of that digit
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (y_train == i)
    centers[i,0] = X_train[mask].mean()

### 1-nearest-neighbor on the ten centers = pick the digit whose mean ink is closest
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = X_test.reshape(len(X_test), -1).mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(y_test, guess)

In [None]:
### confusion matrix
### need to run the previous cell first
### plot_confusion_matrix was removed in scikit-learn 1.2;
### ConfusionMatrixDisplay.from_estimator (scikit-learn >= 1.0) replaces it
fig = plt.figure(figsize=(8,8))
ax = plt.gca()
ConfusionMatrixDisplay.from_estimator(model, inks, y_test, normalize='true', values_format='.2f', ax=ax)

![](ink-density.png)

![](ink-confusion-matrix.png)
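
The ink-density classifier above is a 1-nearest-neighbor model fitted on ten class means, so (ties aside) it just picks the digit whose mean ink level is closest. A minimal NumPy sketch of that rule follows. The helper name `nearest_center_predict` and the toy numbers are hypothetical and were not measured from MNIST.

```python
import numpy as np

def nearest_center_predict(values, centers):
    """Assign each scalar value to the index of the closest center."""
    values = np.asarray(values, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # Distance table of shape (n_values, n_centers); pick the closest column.
    return np.abs(values[:, None] - centers[None, :]).argmin(axis=1)

# Hypothetical per-class mean ink levels and test values.
toy_centers = [10.0, 20.0, 40.0]
toy_inks = [8.0, 16.0, 29.0, 31.0, 100.0]
pred = nearest_center_predict(toy_inks, toy_centers)
print(pred)  # [0 1 1 2 2]
```

Since the feature is one-dimensional, the decision boundaries are simply the midpoints between neighboring centers, which is why digits with similar ink usage are hard to tell apart this way.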

## NSYSU-digits dataset
The NSYSU-digits dataset of handwritten digits

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)