# 我們與數字辨識的距離
(The Distance Between Us and Digit Recognition)  
林晉宏 (Jephian Lin)

These slides were made with Jupyter.  
The source notebook is available at the link below:  
https://github.com/jephianlin/outreach/blob/master/NSYSU-digits/NSYSU-digits.ipynb

- [MNIST handwritten digit database](#MNIST-database)
- [NSYSU-digits handwritten digit dataset](#NSYSU-digits-dataset)
- [Data preprocessing](#Data-processing)
- [Difficulties and mistakes in the project](#Difficulties/Mistakes-in-the-NSYSU-digits-project)
- [Summary](#Summary)

In [None]:
import os
import numpy as np
import matplotlib.pyplot as plt
from joblib import load, dump
from sklearn.model_selection import train_test_split
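from sklearn.metrics import accuracy_score, confusion_matrix, ConfusionMatrixDisplay
### plot_confusion_matrix was removed in scikit-learn 1.2;
### ConfusionMatrixDisplay.from_estimator accepts the same arguments used in this notebook
plot_confusion_matrix = ConfusionMatrixDisplay.from_estimator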
from skimage.feature import hog

In [None]:
### load MNIST

import tensorflow as tf
from tensorflow.keras.datasets import mnist
(X_train, y_train), (X_test, y_test) = mnist.load_data()
Xf_train = X_train.reshape(60000, -1)
Xf_test = X_test.reshape(10000, -1)
Xcnn_train = X_train.reshape(60000, 28, 28, 1)
Xcnn_test = X_test.reshape(10000, 28, 28, 1)
yone_train = tf.keras.utils.to_categorical(y_train)
yone_test = tf.keras.utils.to_categorical(y_test)

In [None]:
### load nsysu

import urllib.request ### plain "import urllib" does not guarantee that urllib.request is loaded
import numpy as np

base = r"https://github.com/SageLabTW/auto-grading/raw/master/nsysu-digits/"
for c in ['X', 'y']:
    filename = "nsysu-digits-%s.csv"%c
    if filename not in os.listdir('.'):
        print(filename, 'not found --- will download')
        urllib.request.urlretrieve(base + c + ".csv", filename)

Xsys = np.genfromtxt('nsysu-digits-X.csv', dtype=int, delimiter=',') ### flattened already
ysys = np.genfromtxt('nsysu-digits-y.csv', dtype=int, delimiter=',')

In [None]:
### adjust data format

num = Xsys.shape[0] ### 552
Xsyscnn = Xsys.reshape(num, 28, 28, 1)
ysysone = tf.keras.utils.to_categorical(ysys)

## MNIST database
The MNIST database of handwritten digits

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

![Webpage of MNIST dataset](MNIST-webpage.png "Webpage of MNIST dataset")

http://yann.lecun.com/exdb/mnist/

#### Dataset contents
- Training set: 60,000 images
- Test set: 10,000 images
- Stored in the `idx` format

![MNIST examples](https://upload.wikimedia.org/wikipedia/commons/2/27/MnistExamples.png "MNIST examples")

(Source: [Wikipedia of MNIST database](https://en.wikipedia.org/wiki/MNIST_database)  
author: Josef Steppan)

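#### Data sources
- Drawn from two datasets of [NIST](https://en.wikipedia.org/wiki/National_Institute_of_Standards_and_Technology)
- Special Database 3 (SD3): written by government employees
- Special Database 1 (SD1): written by high-school students
- MNIST training = 30,000 SD3 + 30,000 SD1
- MNIST testing = 5,000 SD3 + 5,000 SD1

NIST originally used SD3 as the training set  
and SD1 as the test set.  
Because the two groups of writers differ so much,  
MNIST remixed the two datasets.

(The government employees worked for the Census Bureau.)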

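#### Preprocessing
- Each image is 28x28
- The digit fits inside a 20x20 box
- Pixel values run from 0 (white) to 255 (black)
- Centered at the center of mass of the ink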

In [None]:
### check shape, bounding box, and mass center
i = 0
arr = X_train[i]
print("image shape is", arr.shape)
ink_x,ink_y = np.where(arr > 0)
print("vertical ink range is", ink_x.min(), "~", ink_x.max())
print("horizontal ink range is", ink_y.min(), "~", ink_y.max())
row_sum = np.sum(arr, axis=1)
print("vertical mass center at", (row_sum * np.arange(28)).sum() / row_sum.sum()) # ~ 13.5
col_sum = np.sum(arr, axis=0)
print("horizontal mass center at", (col_sum * np.arange(28)).sum() / col_sum.sum()) # ~ 13.5

![Records for models](MNIST-models.png "Records for models")

http://yann.lecun.com/exdb/mnist/

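#### When nothing better is available
- Random guessing: ~10%
- Guessing from the amount of ink: ~22%

Distribution of the amount of ink for each digit  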
![Distribution of ink densities](ink-density.png "Distribution of ink densities")

Accuracy of guessing from the amount of ink (confusion matrix)

![Confusion matrix of ink estimator](ink-confusion-matrix.png "Confusion matrix of ink estimator")

In [None]:
### random guess
guess = np.random.randint(0, 10, (10000,))
accuracy_score(y_test, guess)

In [None]:
### ink density histogram
fig = plt.figure(figsize=(12,4))
for i in range(10):
    mask = (y_train == i)
    wanted = X_train.reshape(60000, -1)[mask].mean(axis=1)
    plt.hist(wanted, bins=100, label='%s'%i)
plt.legend()

In [None]:
### ink density guess
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (y_train == i)
    centers[i,0] = X_train[mask].mean()

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = X_test.reshape(10000,-1).mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(y_test, guess)

In [None]:
### confusion matrix
### need to run the previous cell first
fig = plt.figure(figsize=(8,8))
ax = plt.gca()
plot_confusion_matrix(model, inks, y_test, normalize='true', values_format='.2f', ax=ax)

#### Flattening an image

```python
a = np.array([[1,2],
              [2,3]])
a.reshape(-1)
```
gives  
```
array([1, 2, 2, 3])
```
Every image can be viewed as a vector with 28x28 = 784 entries.

#### Distance
If $x = (x_1, \ldots, x_n)$ and  
$y = (y_1, \ldots, y_n)$,  

then the distance between the two points is  
$\|x-y\| = \sqrt{\sum_{i=1}^n (x_i - y_i)^2}$

#### k-nearest neighbors
96.8% ~ 99.37%  
(10 minutes on an i7-8700, 6 cores / 12 threads)

![k-nearest neighbors](https://upload.wikimedia.org/wikipedia/commons/thumb/e/e7/KnnClassification.svg/220px-KnnClassification.svg.png "k-nearest neighbors")

(Source: [Wikipedia of k-nearest neighbors algorithm](https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm)  
author: Antti Ajanki)
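
The idea in the picture can be written out in a few lines. The following is only a minimal sketch of the k-nearest-neighbors rule, not scikit-learn's implementation; the function name `knn_predict`, the toy points, and the choice `k=3` are assumptions made up for illustration.

```python
import numpy as np

def knn_predict(points, labels, query, k=3):
    """Hypothetical helper: majority vote among the k training points nearest to query."""
    # Euclidean distance from the query to every training point
    dists = np.sqrt(((points - query) ** 2).sum(axis=1))
    nearest = np.argsort(dists)[:k]        # indices of the k closest points
    votes = np.bincount(labels[nearest])   # how often each label occurs among them
    return int(votes.argmax())

# toy data (made up): two small clusters in the plane
X_toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [5.0, 5.0], [5.0, 6.0], [6.0, 5.0]])
y_toy = np.array([0, 0, 0, 1, 1, 1])

print(knn_predict(X_toy, y_toy, np.array([0.5, 0.5])))  # query next to the first cluster
print(knn_predict(X_toy, y_toy, np.array([5.5, 5.5])))  # query next to the second cluster
```

For MNIST the points would be the flattened 784-entry images, which is why prediction is slow: every query is compared with all 60,000 training images.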

In [None]:
### kNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(Xf_train, y_train)
guess = model.predict(Xf_test)
accuracy_score(y_test, guess)

#### Linear classifier
88% ~ 92.4%

![Linear classifier](https://raw.githubusercontent.com/jephianlin/ModularPython/master/linear_classifier.png "Linear classifier")

In [None]:
### linear
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(), activation='identity')
model.fit(Xf_train, y_train)
guess = model.predict(Xf_test)
accuracy_score(y_test, guess)

#### Support vector machine
97.9% ~ 99.44%  
(7 minutes on an i7-8700, 6 cores / 12 threads)

![Kernel function](https://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Kernel_Machine.svg/500px-Kernel_Machine.svg.png "Kernel function")

(Source: [Wikipedia of Support vector machine](https://en.wikipedia.org/wiki/Support_vector_machine)  
author: Alisneaky)

In [None]:
### SVM
from sklearn.svm import SVC
model = SVC()
model.fit(Xf_train, y_train)
guess = model.predict(Xf_test)
accuracy_score(y_test, guess)

#### Neural network
92% ~ 99.17%

- Each layer of a neural network = a matrix $W$, a vector $b$, and a nonlinear function $\sigma$
- The input $x$ and the output $y$ are related by  
$y = \sigma(xW + b)$
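
As a rough illustration of the formula, a single layer can be evaluated with NumPy. The values of `x`, `W`, and `b` below are made up, and the helper names `relu` and `dense` are assumptions for this sketch (Keras does the same computation internally, with `W` and `b` learned from data).

```python
import numpy as np

def relu(t):
    # a common choice for the nonlinear function sigma
    return np.maximum(t, 0)

def dense(x, W, b, sigma=relu):
    """One layer: y = sigma(x W + b), with x treated as a row vector."""
    return sigma(x @ W + b)

# made-up numbers: 3 inputs, 2 outputs
x = np.array([1.0, 2.0, 3.0])
W = np.array([[1.0, -1.0],
              [0.0,  1.0],
              [1.0, -1.0]])
b = np.array([0.5, 0.5])

print(dense(x, W, b))  # x W + b = [4.5, -1.5], and relu sets the negative entry to 0
```

Stacking layers means feeding the output of one `dense` call into the next.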

In [None]:
### neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(784,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xf_train, yone_train, epochs=10, batch_size=100, validation_data=(Xf_test, yone_test))

#### Distance?
Compute the distances among $x$, $y$, and $z$:
```python
x = [0,1,0,0,1,0,0,1,0]
y = [0,0,1,0,0,1,0,0,1]
z = [1,1,1,1,0,1,1,1,1]
```

![Distance between pictures](xyz-distance.png "Distance between pictures")
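
The three distances can be checked directly. This is a small sketch using the vectors defined above; the helper name `sq_dist` is an assumption for illustration.

```python
import numpy as np

x = np.array([0,1,0,0,1,0,0,1,0])
y = np.array([0,0,1,0,0,1,0,0,1])
z = np.array([1,1,1,1,0,1,1,1,1])

def sq_dist(u, v):
    """Squared Euclidean distance (hypothetical helper)."""
    return int(((u - v) ** 2).sum())

# the distances themselves are the square roots of these numbers
print(sq_dist(x, y), sq_dist(x, z), sq_dist(y, z))  # 6 7 5
```

The two vertical strokes `x` and `y` look alike to a human, yet `y` is closer to the ring `z` (distance $\sqrt{5}$) than to `x` (distance $\sqrt{6}$).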

In [None]:
### draw xyz-distance.png
x = np.array([0,1,0,0,1,0,0,1,0]).reshape(3,3)
y = np.array([0,0,1,0,0,1,0,0,1]).reshape(3,3)
z = np.array([1,1,1,1,0,1,1,1,1]).reshape(3,3)


fig = plt.figure(figsize=(5,5))
back = fig.add_axes([0,0,1,1])
ax1 = fig.add_axes([0.1,0.2,0.2,0.2])
ax1.imshow(x, cmap='Greys')
ax2 = fig.add_axes([0.7,0.2,0.2,0.2])
ax2.imshow(y, cmap='Greys')
ax3 = fig.add_axes([0.4,0.7,0.2,0.2])
ax3.imshow(z, cmap='Greys')
back.text(0.3, 0.5, r'$\sqrt{7}$', size='xx-large', usetex=True, ha='center')
back.text(0.7, 0.5, r'$\sqrt{5}$', size='xx-large', usetex=True, ha='center')
back.text(0.5, 0.3, r'$\sqrt{6}$', size='xx-large', usetex=True, ha='center')

back.set_axis_off()
for ax in [ax1, ax2, ax3]:
    ax.set_xticks([])
    ax.set_yticks([])

#### Convolution
![Convolution](convolution.png "Convolution")
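
To see what the `[-1,0,1]` filter in the middle panel does, it helps to apply it to a single row of pixels. The row below is made up; this is a sketch with `np.correlate`, not the code that produced the figure.

```python
import numpy as np

# one made-up row of pixels: white, then a black stroke, then white again
row = np.array([0, 0, 0, 255, 255, 255, 0, 0], dtype=np.uint8)

# cast to a signed type first: differences of uint8 values cannot go negative
signed = row.astype(int)

kernel = np.array([-1, 0, 1])
# response[k] = signed[k+2] - signed[k]
response = np.correlate(signed, kernel, mode='valid')
edges = np.abs(response)

print(response)  # positive where ink starts, negative where it ends
print(edges)     # large at both borders of the stroke, zero elsewhere
```

The filter responds only where the brightness changes, so it marks the left and right borders of the stroke and ignores both the background and the stroke's interior.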

In [None]:
### draw convolution.png
img = X_train[1].astype(int) ### uint8 subtraction wraps around instead of going negative
arr = np.abs(img[:,2:] - img[:,:-2])
fea,vis = hog(X_train[1], pixels_per_cell=(4,4), cells_per_block=(2,2), visualize=True)

fig = plt.figure(figsize=(9,3))
axs = fig.subplots(1,3)
axs[0].imshow(X_train[1], cmap='Greys')
axs[0].set_title('original')
axs[1].imshow(arr, cmap='Greys')
axs[1].set_title('[-1,0,1] filter')
axs[2].imshow(vis, cmap='Greys')
axs[2].set_title('HOG feature')

#### Convolutional neural network
97% ~ 99.77%

In [None]:
### convolution neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), 
                                 activation='relu', 
                                 input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xcnn_train, yone_train, epochs=1, batch_size=100, validation_data=(Xcnn_test, yone_test))

## NSYSU-digits dataset
The NSYSU-digits dataset of handwritten digits

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

#### Dataset contents
- 552 images in total
- Not split into training and test sets
- Stored in the `png` format

![NSYSU digits examples](nsysu-digits-examples.png "NSYSU digits examples")

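#### Data sources
- Taken from students' quiz papers
- Students were asked before class whether they agreed to contribute their data anonymously
- Labeled by a research assistant
- Public on GitHub: [SageLabTW/auto-grading.git](https://github.com/SageLabTW/auto-grading) [ [LICENSE](https://github.com/SageLabTW/auto-grading/blob/master/nsysu-digits/LICENSE) ]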

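#### Preprocessing
- Each image is 28x28
- The digits vary in size
- Pixel values run from 0 (white) to 255 (black), and the ink is on the light side
- Not centered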

#### Train on MNIST, test on NSYSU-digits
- random: ~10%
- ink: 6%
- kNN: 13% (1 minute)
- linear: 21%
- SVM: 27% (5 minutes)
- NN: 28%
- CNN: 44% (one epoch) ~ 58%


![Les Misérables](https://upload.wikimedia.org/wikipedia/en/6/67/LesMisLogo.png "Les Misérables")

(Source: [Wikipedia of Les Misérables (musical)](https://en.wikipedia.org/wiki/Les_Mis%C3%A9rables_(musical))  
[Details of copyright](https://en.wikipedia.org/wiki/File:LesMisLogo.png))

A miserable world (the Chinese title of Les Misérables)

In [None]:
### random guess
guess = np.random.randint(0, 10, (num,))
accuracy_score(ysys, guess)

In [None]:
### ink density guess
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (y_train == i)
    centers[i,0] = X_train[mask].mean()

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = Xsys.mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(ysys, guess)

In [None]:
### kNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### linear
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(), activation='identity')
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### SVM
from sklearn.svm import SVC
model = SVC()
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(784,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xf_train, yone_train, epochs=10, batch_size=100, validation_data=(Xsys, ysysone))

In [None]:
### convolution neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), 
                                 activation='relu', 
                                 input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xcnn_train, yone_train, epochs=1, batch_size=100, validation_data=(Xsyscnn, ysysone))

#### Train on 3/4 of NSYSU-digits, test on 1/4
- random: ~10%
- ink: 19%
- kNN: 45%
- linear: **34%**
- SVM: 52%
- NN: **25%**
- CNN: **32%** (one epoch) ~ 75%
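
With only 552 images, the numbers above change noticeably from one random split to the next. One way to make a run repeatable is sketched below with scikit-learn's `train_test_split`; the arrays `X_demo` and `y_demo` are placeholders of the same size as NSYSU-digits, and the seed `0` and the use of `stratify` are assumptions, not settings taken from the original experiments.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# placeholder data with the same size as NSYSU-digits (values are made up)
X_demo = np.zeros((552, 784), dtype=int)
y_demo = np.arange(552) % 10

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo,
    test_size=0.25,      # 1/4 for testing (also the scikit-learn default)
    stratify=y_demo,     # keep the proportion of each digit the same in both parts
    random_state=0,      # fixed seed: the same split on every run
)
print(X_tr.shape[0], X_te.shape[0])  # 414 138
```

Averaging the accuracy over several seeds would give a steadier estimate than any single split.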

In [None]:
### split the data
Xsys_train, Xsys_test, ysys_train, ysys_test = train_test_split(Xsys, ysys)
num_train,num_test =  Xsys_train.shape[0], Xsys_test.shape[0]
Xsyscnn_train = Xsys_train.reshape(num_train, 28, 28, 1)
Xsyscnn_test = Xsys_test.reshape(num_test, 28, 28, 1)
ysysone_train = tf.keras.utils.to_categorical(ysys_train)
ysysone_test = tf.keras.utils.to_categorical(ysys_test)
print(num_train, num_test)

In [None]:
### random guess
guess = np.random.randint(0, 10, (num_test,))
accuracy_score(ysys_test, guess)

In [None]:
### ink density guess
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (ysys_train == i)
    centers[i,0] = Xsys_train[mask].mean()

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = Xsys_test.mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(ysys_test, guess)

In [None]:
### kNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(Xsys_train, ysys_train)
guess = model.predict(Xsys_test)
accuracy_score(ysys_test, guess)

In [None]:
### linear
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(), activation='identity')
model.fit(Xsys_train, ysys_train)
guess = model.predict(Xsys_test)
accuracy_score(ysys_test, guess)

In [None]:
### SVM
from sklearn.svm import SVC
model = SVC()
model.fit(Xsys_train, ysys_train)
guess = model.predict(Xsys_test)
accuracy_score(ysys_test, guess)

In [None]:
### neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(784,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xsys_train, ysysone_train, epochs=10, batch_size=100, validation_data=(Xsys_test, ysysone_test))

In [None]:
### convolution neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), 
                                 activation='relu', 
                                 input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xsyscnn_train, ysysone_train, epochs=20, batch_size=100, validation_data=(Xsyscnn_test, ysysone_test))

#### Overall comparison (accuracy in %)
| | M to M | M to N | N to N |
|-------|-------|-------|-------|
|random| 10 | 10 | 10 |
|ink   | 22 | 6 | 19 |
|kNN   | 96.8 | 13 | 45 |
|linear| 88 | 21 | 34 |
|SVM   | 97 | 27 | 52 |
|NN    | 92 | 28 | 25 |
|CNN   | 97 | 44 | 32 |

M: MNIST  
N: NSYSU-digits

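The point of data science is not only to use models;  
more importantly, it is **to understand the possible causes when the results fall short of expectations**.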

## Data processing
Preprocessing the data

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

**The code in this section needs the code at the [beginning](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2) to load MNIST and download NSYSU-digits,**  
**and it resets variables such as `Xsys` and `ysys`.**

In [None]:
### required functions

from PIL import Image

def shift_n_scale(arr, shift, scale=1):
    p,q = shift
    pad = max(abs(p), abs(q))
    a,b = arr.shape
    new = np.zeros((a + 2*pad, b + 2*pad), dtype=arr.dtype)
    new[pad+p:pad+p+a, pad+q:pad+q+b] = scale*arr
    return new[pad:pad+a, pad:pad+b]

def thicken(arr, rad=2, decay=0.8):
    moves = []
    for i in range(-rad, rad+1):
        for j in range(-rad, rad+1):
            dist = abs(i) + abs(j)
            if dist <= rad - 1:
                moves.append(shift_n_scale(arr, (i,j), scale=decay**dist))
    new_arr = np.array(moves)
    return new_arr.max(axis=0)

def level(arr, thres=10, a=1, b=100):
    m,n = arr.shape
    new_arr = arr.copy()
    new_arr[arr > thres] = new_arr[arr > thres] * a + b
    upd = np.zeros_like(arr) + 255
    thick = np.concatenate([new_arr[np.newaxis,:,:], upd[np.newaxis,:,:]], axis=0)
    return thick.min(axis=0)

def bounding_box(arr, thres=10, out='subarray'):
    """
    out can be 'bounds' or 'subarray'
    """
    xs,ys = np.where(arr > thres) ### use the thres argument instead of a hard-coded 10
    if out == 'bounds':
        return xs.min(), xs.max(), ys.min(), ys.max()
    if out == 'subarray':
        return arr[xs.min():xs.max()+1, ys.min():ys.max()+1]
    
def out_size(in_size, target=20):
    x,y = in_size
    big = max(x,y)
    ratio = float(target) / big 
    out_x = target if x == big else int(np.ceil(x*ratio))
    out_y = target if y == big else int(np.ceil(y*ratio))
    return (out_x, out_y)

def arr_centers(arr):
    m,n = arr.shape
    row_sum = np.sum(arr, axis=1)
    v_cen = (row_sum * np.arange(m)).sum() / row_sum.sum()
    col_sum = np.sum(arr, axis=0)
    h_cen = (col_sum * np.arange(n)).sum() / col_sum.sum()
    return (v_cen, h_cen)

def centerize(arr, target=20):
    m,n = arr.shape
    new_arr = np.zeros((m + 2*target, n + 2*target), dtype=arr.dtype)
    
    img = Image.fromarray(bounding_box(arr).astype('uint8'))
    o_size = out_size(img.size, target=target)   
    re_arr = np.array(img.resize(o_size), dtype=arr.dtype)
    v_cen,h_cen = arr_centers(re_arr)
    v = target + int(np.round(0.5*m - v_cen)) ### vertical offset uses the number of rows m
    h = target + int(np.round(0.5*n - h_cen)) ### horizontal offset uses the number of columns n
    vp = v + re_arr.shape[0]
    hp = h + re_arr.shape[1]
    
    new_arr[v:vp, h:hp] = re_arr
    return new_arr[target:target+m, target:target+n]

In [None]:
### load nsysu

raw_Xsys = np.genfromtxt('nsysu-digits-X.csv', dtype=int, delimiter=',') ### flattened already
ysys = np.genfromtxt('nsysu-digits-y.csv', dtype=int, delimiter=',')

![Comparison of two datasets](comparison.png "Comparison of two datasets")

Possible options: thicken, darken, unify the size, center

In [None]:
### some zeros and ones
### required for generating pictures

Ninds = np.concatenate([np.where(ysys == 0)[0][:6], 
                np.where(ysys == 1)[0][:6]])
Minds = np.concatenate([np.where(y_train == 0)[0][:6], 
                np.where(y_train == 1)[0][:6]])

In [None]:
### generate comparison.png

fig = plt.figure(figsize=(7.2,4.8))
back = fig.add_axes([0,0,1,1])
back.set_axis_off()
back.text(0.3,0.9, 'NSYSU-digits', horizontalalignment='center')
back.text(0.7,0.9, 'MNIST', horizontalalignment='center')
axs = fig.subplots(4,6)
for i in range(4):
    for j in range(6):
        ax = axs[i,j]
        ax.set_axis_off()
        if j < 3:
            ax.imshow(raw_Xsys[Ninds[3*i+j]].reshape(28,28), cmap='Greys')
        else:
            ax.imshow(X_train[Minds[3*i+j-3]], cmap='Greys')

#### Thicken

![Thicken the data](thicken.png "Thicken the data")

In [None]:
### for generating thicken.png

fig,axs = plt.subplots(3,4,figsize=(8,6))
for i in range(3):
    for j in range(4):
        ax = axs[i,j]
        ax.set_axis_off()
        if j == 0:
            ax.imshow(raw_Xsys[Ninds[i]].reshape(28,28), cmap='Greys')
            if i == 0:
                ax.set_title('rad=1')
        elif j == 1:
            ax.imshow(thicken(raw_Xsys[Ninds[i]].reshape(28,28), rad=2), cmap='Greys')
            if i == 0:
                ax.set_title('rad=2')
        elif j == 2:
            ax.imshow(thicken(raw_Xsys[Ninds[i]].reshape(28,28), rad=3), cmap='Greys')
            if i == 0:
                ax.set_title('rad=3')
        else:
            ax.imshow(X_train[Minds[i]], cmap='Greys')
            if i == 0:
                ax.set_title('MNIST')

Parameters: `radius`, `decay`

![Thicken illustration](thicken-param.png "Thicken illustration")  
`decay = 0.8`

In [None]:
### for generating thicken-param.png

arr = np.zeros((5,5), dtype=float)
arr[2][2] = 255

fig,axs = plt.subplots(1,3, figsize=(6,2))
for i in range(3):
    axs[i].set_xticks([])
    axs[i].set_yticks([])
    axs[i].set_title('rad=%d'%(i+1))
    axs[i].imshow(thicken(arr, rad=i+1, decay=0.8), cmap='Greys') ### decay matches the caption of thicken-param.png

#### Darken

![Darken the data](darken.png "Darken the data")

In [None]:
### for generating darken.png

fig,axs = plt.subplots(3,4,figsize=(8,6))
for i in range(3):
    for j in range(4):
        ax = axs[i,j]
        ax.set_axis_off()
        if j <= 2:
            ax.imshow(level(raw_Xsys[Ninds[i]].reshape(28,28), b=50*j), cmap='Greys')
            if i == 0:
                ax.set_title('b=%d'%(50*j))
        else:
            ax.imshow(X_train[Minds[i]], cmap='Greys')
            if i == 0:
                ax.set_title('MNIST')

Parameters: `thres`, `ax + b` (leveling function)

```python
new_arr[arr > thres] = new_arr[arr > thres] * a + b
```
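
A tiny worked example may make the leveling step concrete. The function `level_sketch` below is an assumed, condensed restatement of `level` for illustration (same defaults `thres=10`, `a=1`, `b=100`), and the input values are made up.

```python
import numpy as np

def level_sketch(arr, thres=10, a=1, b=100):
    """Hypothetical condensed version of level(): darken the ink, cap at 255."""
    out = arr.astype(int)          # work on an integer copy to avoid overflow
    ink = out > thres              # pixels at or below thres count as background
    out[ink] = out[ink] * a + b    # the leveling function a*x + b
    return np.minimum(out, 255)    # 255 is the darkest allowed value

demo = np.array([[0,   5,  20],
                 [100, 200, 250]])
print(level_sketch(demo))
# [[  0   5 120]
#  [200 255 255]]
```

Background pixels (0 and 5) are untouched, faint ink is pushed up by `b`, and anything that would exceed 255 is capped.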

#### Unify the size

![Resize the data](centerize.png "Resize the data")

In [None]:
### for generating resize.png

fig,axs = plt.subplots(3,3,figsize=(6,6))
for i in range(3):
    for j in range(3):
        ax = axs[i,j]
        ax.set_xticks([])
        ax.set_yticks([])
        if j == 0:
            ax.imshow(raw_Xsys[Ninds[i]].reshape(28,28), cmap='Greys')
            if i == 0:
                ax.set_title('original')
        elif j == 1:
            ax.imshow(centerize(raw_Xsys[Ninds[i]].reshape(28,28)), cmap='Greys')
            if i == 0:
                ax.set_title('scaled & centerized')
        else:
            ax.imshow(X_train[Minds[i]], cmap='Greys')
            if i == 0:
                ax.set_title('MNIST')

Parameters: `thres`, `target` (size of the bounding box)

![Threshold illustration](thres.png "Threshold illustration")

In [None]:
### for generating thres.png

fig,axs = plt.subplots(3,4,figsize=(8,6))
for i in range(3):
    for j in range(4):
        ax = axs[i,j]
        ax.set_xticks([])
        ax.set_yticks([])
        thres = 10 * j + 10
        if j <= 2:
            ax.imshow(raw_Xsys[Ninds[i]].reshape(28,28) >= thres, cmap='Greys')
            if i == 0:
                ax.set_title('thres=%d'%thres)
        else:
            ax.imshow(X_train[Minds[i]], cmap='Greys')
            if i == 0:
                ax.set_title('MNIST')

Parameters: `thres`, `target` (size of the bounding box)

![Resize illustration](resize-param.png "Resize illustration")
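
The role of `target` can be illustrated with the size computation alone: the longer side of the bounding box is scaled to `target` and the shorter side keeps the aspect ratio. The function `out_size_sketch` is an assumed standalone restatement for illustration, and the box sizes are made up.

```python
import math

def out_size_sketch(in_size, target=20):
    """Hypothetical helper: scale a (width, height) box so its longer side equals target."""
    big = max(in_size)
    ratio = target / big
    # the longer side is set to target exactly, which avoids rounding surprises
    return tuple(target if s == big else math.ceil(s * ratio) for s in in_size)

print(out_size_sketch((10, 40)))  # tall, thin digit such as a 1 -> (5, 20)
print(out_size_sketch((40, 7)))   # wide, flat stroke -> (20, 4)
print(out_size_sketch((14, 14)))  # square box -> (20, 20)
```

Rounding the shorter side up guarantees it never collapses to zero pixels, even for a very thin digit.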

In [None]:
### for generating resize-param.png

arr = raw_Xsys[Ninds[0]].reshape(28,28)
bounds = bounding_box(arr, out='bounds')
height,width = bounds[1]-bounds[0], bounds[3]-bounds[2]
h_aug,w_aug = (20-height)/2, (20-width)/2

plt.imshow(arr, cmap='Greys')
in_rec = plt.Rectangle((bounds[2]-1, bounds[0]-1), width+1, height+1, 
                       edgecolor='red', lw=2, fill=False)
out_rec = plt.Rectangle((bounds[2]-1-w_aug, bounds[0]-1-h_aug), 21, 21, 
                        edgecolor='blue', lw=2, fill=False)
ax = plt.gca()
ax.add_patch(in_rec)
ax.add_patch(out_rec)

#### Center

![Centerize the data](centerize.png "Centerize the data")

Parameter: `thres`

    index:          0    1    2    3    4    5    6    7    8    9   10   11   12   13
    column sum:     0    0    0    0    0    0    0    0  233 1080  550  310  295  240

    index:         14   15   16   17   18   19   20   21   22   23   24   25   26   27
    column sum:   289  402  542  119    0    0    0    0    0    0    0    0    0    0

In [None]:
arr = raw_Xsys[Ninds[0]].reshape(28,28)
col_sum = np.sum(arr, axis=0)
print(np.concatenate([np.arange(28)[np.newaxis,:], col_sum[np.newaxis,:]], axis=0))
print("horizontal mass center at", (col_sum * np.arange(28)).sum() / col_sum.sum()) # shift to 13.5

#### Thicken, darken, unify the size, and center
| | M to N | M to thicken | M to dark | M to center | M to ??? |
|-------|-------|-------|-------|-------|-------|
|random| 10 | 10 | 10 | 10 | 10 |
|ink   | 6 | 9 | 7 | 7 | 12 |
|kNN   | 13 | 27 | 26 | 52 | 87 |
|linear| 21 | 22 | 25 | 74 | 76 |
|SVM   | 27 | 38 | 40 | 75 | 91 |
|NN    | 28 | 28 | 29 | 70 | 82 |
|CNN   | 44 | 45 | 49 | 90 | 95 |

M: MNIST  
N: NSYSU-digits  
thicken: thicken NSYSU-digits (rad=2, decay=0.8)  
dark: darkened NSYSU-digits (+100)  
center: centerized NSYSU-digits (fit to 20x20 and centered by mass)  
???: some formula

In [None]:
Xsys = np.zeros_like(raw_Xsys)
for i in range(raw_Xsys.shape[0]):
    arr = raw_Xsys[i].reshape(28,28)
    arr = thicken(arr) ### decide whether to thicken
#     arr = level(arr) ### decide whether to darken
#     arr = centerize(arr) ### decide whether to center
    Xsys[i] = arr.reshape(784)    

In [None]:
### check your new data

fig,axs = plt.subplots(4,3, figsize=(6,8))
for i in range(4):
    for j in range(3):
        ax = axs[i,j]
        ax.set_axis_off()
        if j == 0:
            ax.imshow(raw_Xsys[i].reshape(28,28), cmap='Greys')
        elif j == 1:
            ax.imshow(Xsys[i].reshape(28,28), cmap='Greys')
        else:
            ax.imshow(X_train[i], cmap='Greys')

In [None]:
num = Xsys.shape[0] ### 552
Xsyscnn = Xsys.reshape(num, 28, 28, 1)
ysysone = tf.keras.utils.to_categorical(ysys)

In [None]:
### random guess
guess = np.random.randint(0, 10, (num,))
accuracy_score(ysys, guess)

In [None]:
### ink density guess
centers = np.zeros((10,1))
label = np.arange(10)
for i in range(10):
    mask = (y_train == i)
    centers[i,0] = X_train[mask].mean()

from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier(1)
model.fit(centers, label)
inks = Xsys.mean(axis=1)[:,np.newaxis]
guess = model.predict(inks)
accuracy_score(ysys, guess)

In [None]:
### kNN
from sklearn.neighbors import KNeighborsClassifier
model = KNeighborsClassifier()
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### linear
from sklearn.neural_network import MLPClassifier
model = MLPClassifier(hidden_layer_sizes=(), activation='identity')
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### SVM
from sklearn.svm import SVC
model = SVC()
model.fit(Xf_train, y_train)
guess = model.predict(Xsys)
accuracy_score(ysys, guess)

In [None]:
### neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.Input(shape=(784,)))
model.add(tf.keras.layers.Dense(32, activation='relu'))
model.add(tf.keras.layers.Dense(10, activation='softmax'))
model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xf_train, yone_train, epochs=10, batch_size=100, validation_data=(Xsys, ysysone))

In [None]:
### convolution neural network
model = tf.keras.models.Sequential()
model.add(tf.keras.layers.Conv2D(32, kernel_size=(3, 3), 
                                 activation='relu', 
                                 input_shape=(28, 28, 1)))
model.add(tf.keras.layers.Conv2D(64, (3, 3), activation='relu'))
model.add(tf.keras.layers.MaxPool2D(pool_size=(2, 2)))
model.add(tf.keras.layers.Dropout(0.25))
model.add(tf.keras.layers.Flatten())
model.add(tf.keras.layers.Dense(128, activation='relu'))
model.add(tf.keras.layers.Dropout(0.5))
model.add(tf.keras.layers.Dense(10, activation='softmax'))

model.compile(loss='categorical_crossentropy', 
              optimizer='adam', 
              metrics=['accuracy'])
model.fit(Xcnn_train, yone_train, epochs=1, batch_size=100, validation_data=(Xsyscnn, ysysone))

#### Parameters that need tuning
- `thres`
- `ax + b` (leveling function)
- `radius`
- `decay`
- `target` (size of the bounding box) 

These parameter settings look good for now.  
**Do they work in every situation?** (scanner, pen, ...)

In [None]:
### for finding bad examples

start = 175
fig,axs = plt.subplots(5,5, figsize=(10,10))
for i in range(5):
    for j in range(5):
        ind = start + 5*i + j
        ax = axs[i,j]
        ax.set_axis_off()
        ax.set_title(ind)
        ax.imshow(raw_Xsys[ind].reshape(28,28), cmap='Greys')

In [None]:
### for generating 
### out_box.png, unusual_writing.png, extra_dot.png

ob = [72, 94, 179]
uw = [3, 54, 104, 147, 125, 166, 155]
ed = [22, 49, 131, 64, 140, 131, 176]
l = ob
n = len(l)

fig,axs = plt.subplots(1,n, figsize=(n*1.3,1.3))
for j in range(n):
    ax = axs[j]
    ax.set_axis_off()
    ax.imshow(raw_Xsys[l[j]].reshape(28,28), cmap='Greys')

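#### Other problems
Writing outside the box, slanted handwriting, an extra dot, unexpected answers or blanks, ...

Writing outside the box  

![Out of box](out_box.png "Out of box")

Slanted handwriting  

![Unusual writing](unusual_writing.png "Unusual writing")

An extra dot  

![Extra dot](extra_dot.png "Extra dot")

Unexpected answers or blanks  

- Negative numbers
- Three-digit numbers
- Blank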

## Difficulties/Mistakes in the NSYSU-digits project
Difficulties encountered and mistakes made in the project

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

#### I have a dream — an auto-grading system

![](https://media.giphy.com/media/3oriffF0Lt6ioJie08/giphy.gif)

(Source [Giphy](https://giphy.com/gifs/season-3-the-simpsons-3x16-3oriffF0Lt6ioJie08))

#### Automating quiz generation
[jephianlin/QuizGenerator.git](https://github.com/jephianlin/QuizGenerator.git) [ [sample](http://www.math.nsysu.edu.tw/~chlin/2020FMath203/SampleQuiz1.pdf) ]

![Sample of quiz](quiz_p1.png "Sample of quiz")

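Grading is all-or-nothing, with a full score of 5 points.  
Quiz score = the average of the in-class quiz and all retakes  

If the in-class quiz scored 0 but you want a 4.5,  
you need to get 9 more quizzes right  
**without a single mistake along the way**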
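
The number 9 follows from the averaging rule: after a 0 and $k$ perfect retakes the average is $5k/(k+1)$, which first reaches 4.5 at $k = 9$. A short check, where the function name `average_after_retakes` is an assumption for illustration:

```python
def average_after_retakes(first_score, perfect_retakes, full=5):
    """Average of the in-class quiz and a number of retakes that all earn full marks."""
    total = first_score + full * perfect_retakes
    return total / (1 + perfect_retakes)

# smallest number of perfect retakes that lifts a 0 to at least 4.5
k = 0
while average_after_retakes(0, k) < 4.5:
    k += 1

print(k, average_after_retakes(0, k))  # 9 4.5
```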

The **verification-code area** sits (1cm, 1cm) up and to the left of the bottom-right corner  
and measures (2cm, 2cm)

![Bottom of a quiz paper](quiz_bottom.png "Bottom of a quiz paper")

The **QR code area** sits (1cm, 1cm) up and to the right of the bottom-left corner  
and measures (2cm, 2cm)

![Bottom of a quiz paper](quiz_bottom.png "Bottom of a quiz paper")

![Workflow 1](workflow1.jpg "Workflow 1")

#### 2019F ~ 2020S
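Trial run of the quiz system in a linear algebra course  

The teaching assistant ran the program for me and found:
- The digits vary in size, are off-center, and the crop sometimes catches the box lines
- The QR code sometimes cannot be read
- Prediction accuracy was about 60% (he was using the FF design of CNN at the time)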

C.-C. Jay Kuo, Min Zhang, Siyang Li, Jiali Duan, Yueru Chen  
[Interpretable convolutional neural networks via feedforward design](https://www.sciencedirect.com/science/article/abs/pii/S104732031930104X)  
Journal of Visual Communication and Image Representation, 60: 346–359, 2019

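At the time I believed that, as long as the neural network was done right,
> digits varying in size, being off-center, and the crop sometimes catching the box lines  

would not be a problem.    

So things sat like that for a year.  
The whole year of quizzes was in fact graded by hand by the teaching assistant.  

_A teacher with a good teaching assistant is truly blessed_  
My heartfelt thanks `<(_ _)>`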

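#### Mechanical errors
- Printer
- Scanner
- The angle is roughly right, but positions are off by ~2mm
- Each scanner also produces a different darkness

#### Human errors
People do not necessarily write in the center, and their digits differ in size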

#### Auto-boxing by 潘昶余
![Auto-boxing](auto-boxing.png "Auto-boxing")

#### 2020 Summer

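A year later I started preparing the next year's course  
and was determined to get auto-grading working before the semester began

#### Wrong dimensions
I kept telling the teaching assistant that the verification code and the QR code should sit at the (1cm, 1cm) positions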

![Wrong paper size](badbad_quiz_bottom.png "Wrong paper size")

The files produced by LaTeX default to US letter,  
and the printer automatically scaled them to fit A4 paper  

![Wrong paper size](bad_quiz_bottom.png "Wrong paper size")

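(More precisely: I had set the source of each page to A4 but forgot to do so when merging, so the A4 pages were embedded in US letter pages, which were then printed on A4 paper...)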

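#### A bug in the code
Once the images were cropped out,  
I started trying all kinds of models for digit recognition.  

Every model had an **accuracy of only ~10%!**

When the data was imported,  
the images were reshuffled once.  

The program was badly written...  
The images and the answers were shuffled separately.  

**The answers were no better than random numbers, so 10% is exactly what to expect**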
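
The bug and its fix can be reproduced in miniature. This is a sketch on made-up data, not the project's actual import code: row `i` of `images` is filled with the value `i`, so an image matches its label exactly when the two numbers agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# made-up data: 10 "images" of 4 pixels each; image i has label i
images = np.arange(10).reshape(10, 1) * np.ones((1, 4), dtype=int)
labels = np.arange(10)

# buggy version: two independent shuffles break the pairing
bad_images = images[rng.permutation(10)]
bad_labels = labels[rng.permutation(10)]

# correct version: draw one permutation and apply it to both arrays
perm = rng.permutation(10)
good_images = images[perm]
good_labels = labels[perm]

print((good_images[:, 0] == good_labels).all())   # True: every image keeps its label
print((bad_images[:, 0] == bad_labels).mean())    # fraction of pairs that survived by chance
```

A quick sanity check of this kind, run right after loading, would have exposed the problem before any model was trained.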

#### QR codes not detected
- The pyzbar package is used to read the QR code  
- It often fails to read it
- Increasing the image contrast helped  
(let `bright = 255` if `bright > 100`)
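
The contrast rule in the last bullet amounts to one line of NumPy. The sketch below applies it to a made-up grayscale patch before it would be handed to the QR reader; the function name `boost_contrast` is an assumption, and the cutoff 100 is the value quoted above.

```python
import numpy as np

def boost_contrast(gray, cutoff=100):
    """Hypothetical helper: set every pixel brighter than cutoff to 255."""
    out = gray.copy()          # leave the scanned image untouched
    out[out > cutoff] = 255
    return out

# made-up grayscale patch
patch = np.array([[30,  90, 101],
                  [150, 100, 255]], dtype=np.uint8)
print(boost_contrast(patch))
# [[ 30  90 255]
#  [255 100 255]]
```

Pixels at or below the cutoff keep their value, so the light-gray noise is saturated while the rest of the code is left as scanned.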




#### QR code fixer by 潘昶余
![QR code fixer](bug4.png "QR code fixer")

Darken and fill the gaps  
![Fill gaps in the QR code](bug1.png "Fill gaps in the QR code")

#### An obstacle not yet overcome
Students' names or student IDs  
still have to be typed in by the teaching assistant for now `T_T`

#### 2020F
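The automatic grading system  
scan → QR & recognition → email the students  
is almost complete

**Without data cleaning, the recognition success rate is only 60%**

↪ The data was cleaned in earnest only to make these slides  
(but recognition is still done by hand for now)

**Student IDs still cannot be handled**

↪ Still done by hand for now

**QR code**

↪ One failure so far; the program was debugged

Still being revised and improved :)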

## Summary
Concluding remarks

[back to top](#%E6%88%91%E5%80%91%E8%88%87%E6%95%B8%E5%AD%97%E8%BE%A8%E8%AD%98%E7%9A%84%E8%B7%9D%E9%9B%A2)

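#### About artificial intelligence
- Whether the data has been preprocessed greatly affects the training results.
- Preprocessing still relies on parameters tuned by hand.
- The tuning depends on the equipment (the scanner) and on the nature of the data (the pens students use).

#### About projects
- What you learn should be used in real life; if you have an idea, go and try it.
- Finishing one thing at a time is what makes it possible to build up to something bigger.
- Machines learn and so do humans; every project should be a process of continual revision.
- Raising accuracy is good, but handling the (inevitable) misclassifications matters just as much.
- Being able to switch between manual and automatic operation makes the product more robust.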

#### Recommended reading
- [Neural Networks and Deep Learning](http://neuralnetworksanddeeplearning.com/) by Michael Nielsen  
Building a neural network from scratch, the universality theorem
- [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/) by Jake VanderPlas  
A detailed introduction to data processing (NumPy, pandas, matplotlib), broadcasting in NumPy, face detection

#### Questions
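1. Is it reasonable to train on MNIST and test on NSYSU-digits?
2. After manual preprocessing, the current data can yield a model with 95% accuracy; can I expect the same accuracy on the digits written by the next batch of students?
3. If it were you, how would you preprocess the NSYSU-digits data? Is there a smarter way to tune the preprocessing parameters?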