# 我們與數字辨識的距離
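(The Distance Between Us and Digit Recognition)  
林晉宏 (Jephian Lin)

These slides were made with Jupyter.  
The source file is available at the link below.  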
https://github.com/jephianlin/outreach/NSYSU-digits/NSYSU-digits.ipynb

- [MNIST handwritten digit database](#MNIST-database)
- [NSYSU-digits handwritten digit dataset](#NSYSU-digits-dataset)
- [Difficulties encountered and mistakes made in the project](#Difficulties/Mistakes-in-NSYSU-digits-project)
- [Questions](#Questions)

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from joblib import load, dump
from sklearn.model_selection import train_test_split
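from sklearn.metrics import accuracy_score, confusion_matrix
try:
    from sklearn.metrics import plot_confusion_matrix
except ImportError:  ### plot_confusion_matrix was removed in scikit-learn 1.2
    plot_confusion_matrix = None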
from skimage.feature import hog

In [None]:
import tensorflow as tf
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
Xf_train = X_train.reshape(60000, -1)
Xf_test = X_test.reshape(10000, -1)
Xcnn_train = X_train.reshape(60000, 28, 28, 1)
Xcnn_test = X_test.reshape(10000, 28, 28, 1)
yone_train = tf.keras.utils.to_categorical(y_train)
yone_test = tf.keras.utils.to_categorical(y_test)

## MNIST database
MNIST handwritten digit database

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

![Webpage of MNIST dataset](MNIST-webpage.png "Webpage of MNIST dataset")

http://yann.lecun.com/exdb/mnist/

#### Dataset contents
- Training set: 60,000 images
- Test set: 10,000 images
- Stored in the `idx` format

![MNIST examples](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png "MNIST examples")

(Source: [Wikipedia of MNIST database](https://en.wikipedia.org/wiki/MNIST_database)  
author: Josef Steppan)

#### Data source
- Drawn from two datasets of [NIST](https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology)
- Special Database 3 (SD3): written by government employees
- Special Database 1 (SD1): written by high school students
- MNIST training = 30,000 SD3 + 30,000 SD1
- MNIST testing = 5,000 SD3 + 5,000 SD1

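NIST originally used SD3 as the training set  
and SD1 as the test set.  
Because the two groups of writers were so different,  
MNIST remixed the two datasets.

The government employees were from the Census Bureau.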

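#### Data preprocessing
- Each image is 28x28
- The digit itself fits inside a 20x20 box
- Pixel values: 0 (white) to 255 (black)
- Centered by the center of mass of the ink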

In [None]:
### check shape, bounding box, and mass center
i = 0
arr = X_train[i]
print("image shape is", arr.shape)
ink_x,ink_y = np.where(arr > 0)
print("vertical ink range is", ink_x.min(), "~", ink_x.max())
print("horizontal ink range is", ink_y.min(), "~", ink_y.max())
row_sum = np.sum(arr, axis=1)
print("vertical mass center at", (row_sum * np.arange(28)).sum() / row_sum.sum()) # ~ 13.5
col_sum = np.sum(arr, axis=0)
print("horizontal mass center at", (col_sum * np.arange(28)).sum() / col_sum.sum()) # ~ 13.5

![Records for models](MNIST-models.png "Records for models")

http://yann.lecun.com/exdb/mnist/

#### Last-resort methods
- Random guessing: ~10%
- Looking at the amount of ink: ~22%

![Distribution of ink densities](ink-density.png "Distribution of ink densities")

![Confusion matrix of ink estimator](ink-confusion-matrix.png "Confusion matrix of ink estimator")

In [None]:
### random guess
guess = np.random.randint(0, 10, (10000,))
accuracy_score(y_test, guess)

In [None]:
### ink density histogram
fig = plt.figure(figsize=(12,4))
for i in range(10):
    mask = (y_train == i)
    wanted = X_train.reshape(60000, -1)[mask].mean(axis=1)
    plt.hist(wanted, bins=100, label='%s'%i)
plt.legend()

In [None]:
### ink density guess
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (y_train == i)
    centers[i,0] = X_train[mask].mean()

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = X_test.reshape(10000,-1).mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(y_test, guess)

In [None]:
### confusion matrix
### need to run the previous cell first
fig = plt.figure(figsize=(8,8))
ax = plt.gca()
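### plot_confusion_matrix was removed in scikit-learn 1.2; ConfusionMatrixDisplay replaces it
from sklearn.metrics import ConfusionMatrixDisplay
ConfusionMatrixDisplay.from_estimator(model, inks, y_test, normalize='true', values_format='.2f', ax=ax)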

#### Flattening the image

```python
a = np.array([[1,2],
              [2,3]])
a.reshape(-1)
```
which gives  
```
[1,2,2,3]
```
Every image can be viewed as a vector of dimension 28x28 = 784.

#### Distance
If $x = (x_1, \ldots, x_n)$ and  
$y = (y_1, \ldots, y_n)$,  

then the distance between the two points is  
$\|x-y\| = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$
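
As a small check, the sketch below computes the Euclidean distance (the square root of the sum of squared coordinate differences) in two ways. The vectors `u` and `v` are made-up examples, not taken from the dataset.

```python
import numpy as np

# Two made-up vectors for illustration (not from MNIST)
u = np.array([1.0, 2.0, 2.0])
v = np.array([0.0, 0.0, 0.0])

# Square root of the sum of squared coordinate differences
dist_formula = np.sqrt(np.sum((u - v) ** 2))
# The same quantity via NumPy's built-in norm
dist_norm = np.linalg.norm(u - v)

print(dist_formula, dist_norm)
```

When applying this to MNIST images, which are stored as `uint8`, it is safer to cast to a signed or floating-point type first, since unsigned subtraction wraps around.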

#### k-nearest neighbors
??% ~ 99.37%

![k-nearest neighbors](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/220px-KnnClassification.svg.png "k-nearest neighbors")

(Source: [Wikipedia of k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)  
author: Antti Ajanki)
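
The idea in the figure can be sketched in a few lines of NumPy: find the k training points closest to a query and take a majority vote. The helper `knn_predict` and the toy points are hypothetical and only for illustration; the notebook itself relies on scikit-learn's `KNeighborsClassifier`.

```python
import numpy as np

def knn_predict(points, labels, query, k=3):
    """Hypothetical helper: majority vote among the k nearest points."""
    # Euclidean distance from the query to every stored point
    dists = np.linalg.norm(points - query, axis=1)
    # Indices of the k smallest distances
    nearest = np.argsort(dists)[:k]
    # Most common label among those neighbors
    return np.bincount(labels[nearest]).argmax()

# Toy 2-D data: two clusters labeled 0 and 1
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

pred_a = knn_predict(X_toy, y_toy, np.array([0.5, 0.5]))
pred_b = knn_predict(X_toy, y_toy, np.array([5.5, 5.5]))
print(pred_a, pred_b)
```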

In [None]:
### kNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(Xf_train, y_train)
guess = model.predict(Xf_test)
accuracy_score(y_test, guess)

#### Linear classifier
88% ~ 92.4%

![Linear classifier](https://raw.githubusercontent.com/jephianlin/ModularPython/master/linear_classifier.png "Linear classifier")

In [None]:
### linear
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(), activation='identity')
model.fit(Xf_train, y_train)
guess = model.predict(Xf_test)
accuracy_score(y_test, guess)

#### Support vector machine
??% ~ 99.44%

![Kernel function](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/500px-Kernel_Machine.svg.png "Kernel function")

(Source: [Wikipedia of Support vector machine](https://en.wikipedia.org/wiki/Support_vector_machine)  
author: Alisneaky)

In [None]:
### SVM
from sklearn.svm import SVC
model = SVC()
model.fit(Xf_train, y_train)
guess = model.predict(Xf_test)
accuracy_score(y_test, guess)

#### Neural network
92% ~ 99.17%

- Each neural network layer = a matrix $W$, a vector $b$, and a nonlinear function $\sigma$
- The input $x$ and the output $y$ are related by  
$y = \sigma(xW +b)$
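
A minimal sketch of one layer computing $y = \sigma(xW + b)$ in NumPy, assuming ReLU as the nonlinearity and sizes of 784 inputs and 32 outputs. The weights here are random placeholders; in a trained network they would be learned.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(t):
    # Elementwise nonlinearity: negative entries become 0
    return np.maximum(t, 0)

x = rng.random((1, 784))         # one flattened image as a row vector
W = rng.normal(size=(784, 32))   # weight matrix (random placeholder)
b = np.zeros(32)                 # bias vector

y = relu(x @ W + b)
print(y.shape)  # (1, 32)
```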

In [None]:
### neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(784,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xf_train, yone_train, epochs=10, batch_size=100, validation_data=(Xf_test, yone_test))

#### Distance?
Compute the distances between $x$, $y$, and $z$:
```python
x = [0,1,0,0,1,0,0,1,0]
y = [0,0,1,0,0,1,0,0,1]
z = [1,1,1,1,0,1,1,1,1]
```

![Distance between pictures](xyz-distance.png "Distance between pictures")
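
The three distances can be checked directly. Because every entry is 0 or 1, each squared distance simply counts the pixels where two pictures differ.

```python
import numpy as np

x = np.array([0,1,0,0,1,0,0,1,0])
y = np.array([0,0,1,0,0,1,0,0,1])
z = np.array([1,1,1,1,0,1,1,1,1])

# Squared Euclidean distances between each pair
d_xy = np.sum((x - y) ** 2)
d_xz = np.sum((x - z) ** 2)
d_yz = np.sum((y - z) ** 2)

print(d_xy, d_xz, d_yz)  # 6 7 5
```

So the two vertical strokes `x` and `y` are at distance $\sqrt{6}$, which is farther apart than the stroke `y` and the ring `z` at distance $\sqrt{5}$.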

In [None]:
### draw xyz-distance.png
x = np.array([0,1,0,0,1,0,0,1,0]).reshape(3,3)
y = np.array([0,0,1,0,0,1,0,0,1]).reshape(3,3)
z = np.array([1,1,1,1,0,1,1,1,1]).reshape(3,3)


fig = plt.figure(figsize=(5,5))
back = fig.add_axes([0,0,1,1])
ax1 = fig.add_axes([0.1,0.2,0.2,0.2])
ax1.imshow(x, cmap='Greys')
ax2 = fig.add_axes([0.7,0.2,0.2,0.2])
ax2.imshow(y, cmap='Greys')
ax3 = fig.add_axes([0.4,0.7,0.2,0.2])
ax3.imshow(z, cmap='Greys')
back.text(0.3, 0.5, r'$\sqrt{7}$', size='xx-large', usetex=True, ha='center')
back.text(0.7, 0.5, r'$\sqrt{5}$', size='xx-large', usetex=True, ha='center')
back.text(0.5, 0.3, r'$\sqrt{6}$', size='xx-large', usetex=True, ha='center')

for ax in [back, ax1, ax2, ax3]:
    ax.axis('off')

#### Convolution
![Convolution](convolution.png "Convolution")
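
A minimal sketch of a horizontal `[-1, 0, 1]` filter on a tiny made-up image `img` (not from the dataset). Each output pixel is the right neighbor minus the left neighbor, so the response is nonzero where the brightness changes from left to right.

```python
import numpy as np

# Made-up 3x5 image: blank on the left, ink on the right
img = np.array([[0, 0, 0, 9, 9],
                [0, 0, 0, 9, 9],
                [0, 0, 0, 9, 9]])

# [-1, 0, 1] filter along each row, written with slices
edges = img[:, 2:] - img[:, :-2]

# The same computation on the first row with a sliding window
check = np.correlate(img[0], [-1, 0, 1], mode='valid')

print(edges)
print(check)
```

The sliding window here is applied without flipping the filter, which is also how convolution layers in neural networks are usually computed.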

In [None]:
### draw convolution.png
img = X_train[1].astype(int) ### cast from uint8 so the subtraction cannot wrap around
arr = np.abs(img[:,2:] - img[:,:-2])
fea,vis = hog(X_train[1], pixels_per_cell=(4,4), cells_per_block=(2,2), visualize=True)

fig = plt.figure(figsize=(9,3))
axs = fig.subplots(1,3)
axs[0].imshow(X_train[1], cmap='Greys')
axs[0].set_title('original')
axs[1].imshow(arr, cmap='Greys')
axs[1].set_title('[-1,0,1] filter')
axs[2].imshow(vis, cmap='Greys')
axs[2].set_title('HOG feature')

#### Convolutional neural network
97% ~ 99.77%

In [None]:
### convolution neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), 
                                 activation='relu', 
                                 input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xcnn_train, yone_train, epochs=1, batch_size=100, validation_data=(Xcnn_test, yone_test))

## NSYSU-digits dataset
NSYSU-digits handwritten digit dataset

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

In [None]:
### load nsysu

import urllib.request ### the submodule must be imported explicitly
import numpy as np

base = r"https://github.com/SageLabTW/auto-grading/raw/master/nsysu-digits/"
urllib.request.urlretrieve(base + "X.csv", "nsysu-digits-X.csv")
urllib.request.urlretrieve(base + "y.csv", "nsysu-digits-y.csv")

Xsys = np.genfromtxt('nsysu-digits-X.csv', dtype=int, delimiter=',') ### flattened already
ysys = np.genfromtxt('nsysu-digits-y.csv', dtype=int, delimiter=',')

In [None]:
num = Xsys.shape[0] ### 552
Xsyscnn = Xsys.reshape(num, 28, 28, 1)
ysysone = tf.keras.utils.to_categorical(ysys)

#### Dataset contents
- 552 images in total
- Not split into training and test sets
- Stored in the `png` format

![NSYSU digits examples](nsysu-digits-examples.png "NSYSU digits examples")

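#### Data source
- Taken from students' quiz papers
- Before class, students were asked whether they agreed to contribute their data anonymously
- Labeled by research assistants
- Public on GitHub: [SageLabTW/auto-grading.git](https://github.com/SageLabTW/auto-grading)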

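#### Data preprocessing
- Each image is 28x28
- Digit sizes are inconsistent
- Pixel values: 0 (white) to 255 (black), with the ink on the light side
- Not centered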
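
One difference from MNIST is that these images are not centered. Below is a minimal sketch of centering by center of mass, assuming integer shifts with `np.roll` are acceptable. Pixels wrap around the border, which is harmless only when the border is blank. The helper `center_by_mass` is hypothetical and not part of the project.

```python
import numpy as np

def center_by_mass(img):
    """Hypothetical helper: shift img so its ink center of mass is near the middle."""
    rows, cols = img.shape
    total = img.sum()
    if total == 0:
        return img.copy()  # blank image: nothing to center
    # Center of mass along each axis
    r_center = (img.sum(axis=1) * np.arange(rows)).sum() / total
    c_center = (img.sum(axis=0) * np.arange(cols)).sum() / total
    # Integer shift that moves the center of mass toward the geometric center
    dr = int(round((rows - 1) / 2 - r_center))
    dc = int(round((cols - 1) / 2 - c_center))
    return np.roll(img, (dr, dc), axis=(0, 1))

# Made-up example: a single ink pixel near the top-left corner
demo = np.zeros((28, 28), dtype=int)
demo[3, 5] = 255
centered = center_by_mass(demo)
print(np.argwhere(centered > 0))
```

This handles only translation; the inconsistent digit sizes and light ink would need separate steps.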

#### Train on MNIST, test on NSYSU-digits
- random: ~10%
- ink: 6%
- kNN:
- linear: 21%
- SVM: 
- NN: 28%
- CNN: 44% (one epoch) ~ 58%


![Les Misérables](https://upload.wikimedia.org/wikipedia/en/6/67/LesMisLogo.png "Les Misérables")

(Source: [Wikipedia of Les Misérables (musical)](https://en.wikipedia.org/wiki/Les_Mis%C3%A9rables_(musical))  
[Details of copyright](https://en.wikipedia.org/wiki/File:LesMisLogo.png))

A miserable world

In [None]:
### random guess
guess = np.random.randint(0, 10, (num,))
accuracy_score(ysys, guess)

In [None]:
### ink density guess
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (y_train == i)
    centers[i,0] = X_train[mask].mean()

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = Xsys.mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(ysys, guess)

In [None]:
### kNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### linear
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(), activation='identity')
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### SVM
from sklearn.svm import SVC
model = SVC()
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(784,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xf_train, yone_train, epochs=10, batch_size=100, validation_data=(Xsys, ysysone))

In [None]:
### convolution neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), 
                                 activation='relu', 
                                 input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xcnn_train, yone_train, epochs=1, batch_size=100, validation_data=(Xsyscnn, ysysone))

#### Train on 3/4 of NSYSU-digits, test on 1/4
- random: ~10%
- ink: 19%
- kNN: 45%
- linear: **34%**
- SVM: 52%
- NN: **25%**
- CNN: **32%** (one epoch) ~ 75%


In [None]:
### split the data
Xsys_train, Xsys_test, ysys_train, ysys_test = train_test_split(Xsys, ysys)
num_train,num_test =  Xsys_train.shape[0], Xsys_test.shape[0]
Xsyscnn_train = Xsys_train.reshape(num_train, 28, 28, 1)
Xsyscnn_test = Xsys_test.reshape(num_test, 28, 28, 1)
ysysone_train = tf.keras.utils.to_categorical(ysys_train)
ysysone_test = tf.keras.utils.to_categorical(ysys_test)
print(num_train, num_test)

In [None]:
### random guess
guess = np.random.randint(0, 10, (num_test,))
accuracy_score(ysys_test, guess)

In [None]:
### ink density guess
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (ysys_train == i)
    centers[i,0] = Xsys_train[mask].mean()

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = Xsys_test.mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(ysys_test, guess)

In [None]:
### kNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(Xsys_train, ysys_train)
guess = model.predict(Xsys_test)
accuracy_score(ysys_test, guess)

In [None]:
### linear
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(), activation='identity')
model.fit(Xsys_train, ysys_train)
guess = model.predict(Xsys_test)
accuracy_score(ysys_test, guess)

In [None]:
### SVM
from sklearn.svm import SVC
model = SVC()
model.fit(Xsys_train, ysys_train)
guess = model.predict(Xsys_test)
accuracy_score(ysys_test, guess)

In [None]:
### neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(784,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xsys_train, ysysone_train, epochs=10, batch_size=100, validation_data=(Xsys_test, ysysone_test))

In [None]:
### convolution neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), 
                                 activation='relu', 
                                 input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xsyscnn_train, ysysone_train, epochs=20, batch_size=100, validation_data=(Xsyscnn_test, ysysone_test))

## Difficulties/Mistakes in NSYSU-digits project
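Difficulties encountered and mistakes made in the project

Mechanical errors

Human errors

Wrong image sizes

Bugs in the code

Obstacles we did not know how to overcome: data cleaning

Obstacles we could not overcome: ID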

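#### Summary
- Machines are learning and so are humans; every project should be a process of continual revision.
- Whether the data has been preprocessed greatly affects the training results.
- What you learn should be put to use in everyday life; if you have an idea, try it.
- Only by finishing one thing at a time can substantial results accumulate.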

## Questions
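1. Is it reasonable to train on MNIST and test on NSYSU-digits?
2. If I both train and test on NSYSU-digits, can I expect the same accuracy on the handwriting of the next batch of students?
3. If it were up to you, how would you preprocess the NSYSU-digits data?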