# COMP5318 - Machine Learning and Data Mining: Assignment 2

## CONTENTS

-  [0. Set up](#0)
-  [1. Obtain data](#1)
    -  [1.1. Data info](#1.1)
    -  [1.2. Load data](#1.2)
    -  [1.3. Set train/test data](#1.3)
-  [2. Pre-process data](#2)
    -  [2.1. Standardized data](#2.1)
    -  [2.2. PCA](#2.2)
-  [3. Algorithms](#3)
    -  [3.1. Nearest Neighbor](#3.1)
    -  [3.2. MLP](#3.2)
    -  [3.3. CNN](#3.3)
-  [4. Classifier comparisons](#4)
-  [5. Computer details](#5)
-  [6. Easy to use](#6)

## 0. Set up <a id='0'></a>

In [1]:
import sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix, classification_report
from sklearn.model_selection import GridSearchCV
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import os
import time
import cv2

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# used to compare the results of all models
all_res = {'algorithms': ['knn', 'MLP', 'cnn'],
           'accuracy_score':[],
           'precision_score':[],
           'recall_score':[], 
           'f1_score':[]}

## 1. Obtain data <a id='1'></a>

### 1.1. Data info <a id='1.1'></a>

The input data is the "Fnt" folder of the Chars74K dataset (downloaded from http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/). It contains 62 class folders, each holding 1016 PNG files:

    EnglishFnt.tgz (51.1 MB): characters from computer fonts with 4 variations (combinations of italic, bold and normal).

### 1.2. Load data <a id='1.2'></a>

In [2]:
# read image and resize image
def img_resizing(i_path):
    img = cv2.imread(i_path, cv2.IMREAD_GRAYSCALE)
    data = cv2.resize(img, (28, 28), interpolation=cv2.INTER_CUBIC)
    return data

In [6]:
# converting the images into sets of data
data = []
label = []

for i in range(62):
    path = 'English/Fnt/Sample%03d/' % (i+1)
    for filename in os.listdir(path):
        img = img_resizing(path+filename)
        tmp = img.reshape([1, img.shape[0]*img.shape[1]])
        data.append(np.asarray(tmp, dtype = "int32"))
        label.append(i)


data_img = []
label_img = []

for i in range(62):
    path = 'English/Img/GoodImg/Bmp/Sample%03d/' % (i+1)
    for filename in os.listdir(path):
        img = img_resizing(path+filename)
        tmp = img.reshape([1, img.shape[0]*img.shape[1]])
        data_img.append(np.asarray(tmp, dtype = "int32"))
        label_img.append(i)
        
for i in range(62):
    path = 'English/Img/BadImag/Bmp/Sample%03d/' % (i+1)
    for filename in os.listdir(path):
        img = img_resizing(path+filename)
        tmp = img.reshape([1, img.shape[0]*img.shape[1]])
        data_img.append(np.asarray(tmp, dtype = "int32"))
        label_img.append(i)

In [7]:
# convert data type
data_array = np.asarray(data)
new_data = np.asarray(data_array).reshape(data_array.shape[0],-1)
print("The data shape is: ", new_data.shape)
new_label = np.asarray(label)
print("The label shape is: ", new_label.shape)

# this is for img data, check the data shape
data_array_img = np.asarray(data_img)
new_data_img = data_array_img.reshape(data_array_img.shape[0],-1)
print("The IMG data shape is: ", new_data_img.shape)
new_label_img = np.asarray(label_img)
print("The IMG label shape is: ", new_label_img.shape)

The data shape is:  (62992, 784)
The label shape is:  (62992,)
The IMG data shape is:  (12503, 784)
The IMG label shape is:  (12503,)


### 1.3. Set train/test data <a id='1.3'></a>

In [8]:
from sklearn.model_selection import train_test_split
X = new_data
y = new_label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=500)

# print train and test data set shape
print("X_train shape:", X_train.shape, "y_train shape:", y_train.shape)
print("X_test shape:", X_test.shape, "y_test shape:", y_test.shape)

X_train shape: (50393, 784) y_train shape: (50393,)
X_test shape: (12599, 784) y_test shape: (12599,)
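
The split sizes printed above can be reproduced by hand. The sketch assumes that a fractional `test_size` is rounded up to a whole number of test samples and that the remainder goes to training, which is consistent with the shapes shown. `split_sizes` is an illustrative helper, not part of scikit-learn.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test count is rounded up and everything else is training data.
    """
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(62992, 0.2)
print(n_train, n_test)
```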


## 2. Pre-process data <a id='2'></a>

### 2.1. Standardized data <a id='2.1'></a>

In [9]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # create the scaler
scaler.fit(X_train)  # learn the per-feature mean and standard deviation from the training data only
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)
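
What the cell above does for a single pixel column can be sketched in plain Python: the mean and (population) standard deviation come from the training values only, and the same two numbers are then reused for the test values. The pixel values and helper names are hypothetical, and the zero-variance guard mirrors how constant features are usually handled rather than quoting the library's code.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return (mean, std) of one feature column, using the population std."""
    mean = fmean(column)
    std = pstdev(column)
    # A constant column (e.g. an always-white border pixel) has std 0;
    # fall back to 1 so the division below is safe.
    return mean, (std if std > 0 else 1.0)


def standardize(column, mean, std):
    """Apply z = (x - mean) / std to every value in the column."""
    return [(x - mean) / std for x in column]


train_pixel = [0, 255, 255, 0]  # hypothetical values of one pixel in 4 training images
test_pixel = [255, 128]         # the same pixel in 2 test images

mean, std = fit_standardizer(train_pixel)
train_std = standardize(train_pixel, mean, std)
test_std = standardize(test_pixel, mean, std)  # uses the training statistics
print(mean, std, train_std, test_std)
```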

### 2.2. PCA <a id='2.2'></a>

In [10]:
from sklearn.decomposition import PCA
pca=PCA(n_components=0.9)
X_train_reduced = pca.fit_transform(X_train_std)
X_test_reduced = pca.transform(X_test_std)
print("Original shape of training data: {}".format(str(X_train_std.shape)))
print("Reduced shape of training data: {}".format(str(X_train_reduced.shape)))
print("Original shape of testing data: {}".format(str(X_test_std.shape)))
print("Reduced shape of testing data: {}".format(str(X_test_reduced.shape)))

Original shape of training data: (50393, 784)
Reduced shape of training data: (50393, 162)
Original shape of testing data: (12599, 784)
Reduced shape of testing data: (12599, 162)
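
Passing `n_components=0.9` asks PCA to keep just enough components to explain more than 90% of the variance, which here turned out to be 162 of 784. The selection rule can be sketched as a cumulative sum over the sorted eigenvalues. The eigenvalues below are made up for illustration, and `components_for_variance` is a hypothetical helper rather than the library's implementation.

```python
from itertools import accumulate


def components_for_variance(eigenvalues, target):
    """Smallest k whose top-k eigenvalues explain more than `target` of the total variance."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    for k, cumulative in enumerate(accumulate(ordered), start=1):
        if cumulative / total > target:
            return k
    return len(ordered)


toy_eigenvalues = [5.0, 3.0, 1.5, 0.3, 0.2]  # hypothetical variances along each component
k = components_for_variance(toy_eigenvalues, 0.9)
print(k)
```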


## 3. Algorithms<a id='3'></a>

### 3.1. Nearest Neighbor<a id='3.1'></a>

In [11]:
from sklearn.neighbors import KNeighborsClassifier

In [12]:
# No Parameter Tuning
knn_start = time.time()

knn = KNeighborsClassifier(n_neighbors=1,p=1)
knn.fit(X_train_reduced, y_train)
y_pred = knn.predict(X_test_reduced)
print("Knn - accuracy on test set: {:.3f}".format(accuracy_score(y_test, y_pred)))

print('################ time used ##############')
knn_end = time.time()
print("knn no parameter tuning time used: ", knn_end - knn_start)

Knn - accuracy on test set: 0.872
################ time used ##############
knn no parameter tuning time used:  270.1841676235199
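
With `n_neighbors=1` and `p=1`, each test point receives the label of its single closest training point under Manhattan (L1) distance. A minimal pure-Python sketch of that rule follows, using hypothetical 2-D points instead of the 162 PCA features. The brute-force search is for illustration only.

```python
def manhattan(a, b):
    """L1 distance between two equal-length vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def predict_1nn(train_X, train_y, query):
    """Label of the training point closest to `query` under L1 distance."""
    best_index = min(range(len(train_X)), key=lambda i: manhattan(train_X[i], query))
    return train_y[best_index]


train_X = [[0.0, 0.0], [1.0, 1.0], [4.0, 4.0]]  # hypothetical training points
train_y = [0, 0, 1]
prediction = predict_1nn(train_X, train_y, [3.0, 3.5])
print(prediction)
```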


In [28]:
# Parameter Tuning
pknn_start = time.time()

# set the parameters that we want to tune
param_grid = {'n_neighbors': [1, 5, 10, 15], 'p':[1,2]}
print("Parameter grid:\n{}".format(param_grid))

# use GridSearchCV with cv=10
grid_search = GridSearchCV(KNeighborsClassifier(), param_grid, cv=10)

grid_search.fit(X_train_reduced, y_train)

print("Test set score: {:.2f}".format(grid_search.score(X_test_reduced, y_test)))
print("Best parameters: {}".format(grid_search.best_params_))
print("Best cross-validation score: {:.2f}".format(grid_search.best_score_))
print("Best estimator:\n{}".format(grid_search.best_estimator_))

print('################ time used ##############')
pknn_end = time.time()
print("knn with parameter tuning time used: ", pknn_end - pknn_start)
y_pred = grid_search.predict(X_test_reduced)

print("Knn - accuracy score on test set: {:.3f}".format(accuracy_score(y_test, y_pred)))
print("Knn - precision score on test set: {:.3f}".format(precision_score(y_test, y_pred,average='macro')))
print("Knn - recall score on test set: {:.3f}".format(recall_score(y_test, y_pred,average='macro')))
print("Knn - f1 score on test set: {:.3f}".format(f1_score(y_test, y_pred,average='macro')))
print("Knn - confusion matrix on test set: \n",confusion_matrix(y_test, y_pred))
all_res['accuracy_score'].append(accuracy_score(y_test, y_pred))
all_res['precision_score'].append(precision_score(y_test, y_pred,average='macro'))
all_res['recall_score'].append(recall_score(y_test, y_pred,average='macro'))
all_res['f1_score'].append(f1_score(y_test, y_pred,average='macro'))

Parameter grid:
{'n_neighbors': [1, 5, 10, 15], 'p': [1, 2]}
Test set score: 0.87
Best parameters: {'n_neighbors': 1, 'p': 1}
Best cross-validation score: 0.87
Best estimator:
KNeighborsClassifier(n_neighbors=1, p=1)
################ time used ##############
knn with parameter tuning time used:  1852.4000704288483
Knn - accuracy score on test set: 0.872
Knn - precision score on test set: 0.873
Knn - recall score on test set: 0.872
Knn - f1 score on test set: 0.872
Knn - confusion matrix on test set: 
 [[154   0   0 ...   0   0   0]
 [  0 172   0 ...   0   0   0]
 [  0   0 211 ...   0   0   0]
 ...
 [  0   1   0 ... 153   0   1]
 [  0   0   0 ...   0 183   0]
 [  0   0   0 ...   0   0 155]]
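
The precision, recall and F1 values above use `average='macro'`: each score is computed per class and the 62 per-class values are then averaged with equal weight. The sketch below shows that computation on a hypothetical three-class example. `macro_scores` is an illustrative helper, not a scikit-learn function.

```python
def macro_scores(y_true, y_pred):
    """Return macro-averaged (precision, recall, f1) over all labels that occur."""
    labels = sorted(set(y_true) | set(y_pred))
    precisions, recalls, f1s = [], [], []
    for c in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == c and p == c)
        fp = sum(1 for t, p in pairs if t != c and p == c)
        fn = sum(1 for t, p in pairs if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


y_true_toy = [0, 0, 1, 1, 2, 2]  # hypothetical labels
y_pred_toy = [0, 1, 1, 1, 2, 0]
macro_p, macro_r, macro_f1 = macro_scores(y_true_toy, y_pred_toy)
print(round(macro_p, 3), round(macro_r, 3), round(macro_f1, 3))
```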


### 3.2. MLP<a id='3.2'></a>

In [14]:
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator

In [15]:
# build a fresh MLP (three hidden layers, no dropout) so that each hyperparameter setting starts from the same initial state
def mlp():
    # Clear any existing TensorFlow graph from memory and set random seeds.
    keras.backend.clear_session()
    np.random.seed(42)
    tf.random.set_seed(42)

    # set model
    model= keras.models.Sequential([
        keras.layers.Flatten(input_shape=X_train_std.shape[1:]),
        keras.layers.Dense(400, activation="relu"),
        keras.layers.Dense(200, activation="relu"),
        keras.layers.Dense(100, activation="relu"),
        keras.layers.Dense(62, activation="softmax")
    ])

    return model
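
The size of this network can be worked out by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The sketch below applies that formula to the layer widths used in `mlp()`, with 784 input features from the flattened 28x28 images.

```python
def dense_params(n_in, n_out):
    """Trainable parameters of a fully connected layer: weights plus biases."""
    return n_in * n_out + n_out


layer_sizes = [784, 400, 200, 100, 62]  # input, three hidden layers, output
per_layer = [dense_params(a, b) for a, b in zip(layer_sizes, layer_sizes[1:])]
total_params = sum(per_layer)
print(per_layer, total_params)
```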

In [16]:
# we would like to try different learning rate and epochs
optimizer = ['Adam', 'SGD']

lr_adam = [0.0001, 0.001]
epoch_adam = [50, 150]

lr_sgd = [0.001, 0.01]
epoch_sgd = [50, 150]


count = 1
mlp_res = {'tech': [], 'lr':[], 'epoch':[], 'accuracy_score':[], 'precision_score':[], 'recall_score':[], 'f1_score':[]}
for i in lr_adam:
    for j in epoch_adam:
        print("Adam: %d / %d ###################################################" % (count, len(lr_adam)*len(epoch_adam)))
        count += 1
        model = mlp()
        model.compile(loss="sparse_categorical_crossentropy", 
                      optimizer=keras.optimizers.Adam(learning_rate=i), 
                      metrics=["accuracy"])
        model.fit(X_train_std, y_train, epochs=j,
                           validation_split=0.2) # hold out 20% as the validation set
        y_prob = model.predict(X_test_std)
        y_pred = np.argmax(y_prob,axis=1)
        mlp_res['tech'].append('Adam')
        mlp_res['lr'].append(i)
        mlp_res['epoch'].append(j)
        mlp_res['accuracy_score'].append(accuracy_score(y_test, y_pred))
        mlp_res['precision_score'].append(precision_score(y_test, y_pred, average='macro'))
        mlp_res['recall_score'].append(recall_score(y_test, y_pred, average='macro'))
        mlp_res['f1_score'].append(f1_score(y_test, y_pred, average='macro'))


for i in lr_sgd:
    for j in epoch_sgd:
        print(" SGD: %d / %d ###################################################" % (count, len(lr_sgd)*len(epoch_sgd)))
        count += 1
        model = mlp()
        decay_rate = i / j
        sgd = keras.optimizers.SGD(learning_rate=i, momentum=0.8, decay=decay_rate, nesterov=False)
        model.compile(loss="sparse_categorical_crossentropy", 
                      optimizer=sgd, 
                      metrics=["accuracy"])
        model.fit(X_train_std, y_train, epochs=j,
                           validation_split=0.2) # hold out 20% as the validation set
        y_prob = model.predict(X_test_std)
        y_pred = np.argmax(y_prob,axis=1)
        mlp_res['tech'].append('SGD')
        mlp_res['lr'].append(i)
        mlp_res['epoch'].append(j)
        mlp_res['accuracy_score'].append(accuracy_score(y_test, y_pred))
        mlp_res['precision_score'].append(precision_score(y_test, y_pred, average='macro'))
        mlp_res['recall_score'].append(recall_score(y_test, y_pred, average='macro'))
        mlp_res['f1_score'].append(f1_score(y_test, y_pred, average='macro'))


Adam: 1 / 4 ###################################################
Epoch 1/50
...
Epoch 50/50
Adam: 2 / 4 ###################################################
Epoch 1/150
...
Epoch 150/150
Adam: 3 / 4 ###################################################
Epoch 1/50
...
Epoch 50/50
Adam: 4 / 4 ###################################################
Epoch 1/150
...
Epoch 150/150
 SGD: 5 / 4 ###################################################
Epoch 1/50
...
Epoch 50/50
 SGD: 6 / 4 ###################################################
Epoch 1/150
...
Epoch 150/150
 SGD: 7 / 4 ###################################################
Epoch 1/50
...
Epoch 50/50
 SGD: 8 / 4 ###################################################
Epoch 1/150
...
Epoch 150/150


In [17]:
# print result
new_mlp_res = pd.DataFrame(mlp_res)
new_mlp_res

|   | tech | lr | epoch | accuracy_score | precision_score | recall_score | f1_score |
|---|------|--------|-----|----------|----------|----------|----------|
| 0 | Adam | 0.0001 | 50  | 0.855147 | 0.856422 | 0.854501 | 0.854316 |
| 1 | Adam | 0.0001 | 150 | 0.855385 | 0.858622 | 0.855343 | 0.854669 |
| 2 | Adam | 0.001  | 50  | 0.853004 | 0.855207 | 0.852915 | 0.851792 |
| 3 | Adam | 0.001  | 150 | 0.856497 | 0.857484 | 0.856461 | 0.855428 |
| 4 | SGD  | 0.001  | 50  | 0.849036 | 0.850419 | 0.848542 | 0.848234 |
| 5 | SGD  | 0.001  | 150 | 0.854115 | 0.854951 | 0.853697 | 0.853511 |
| 6 | SGD  | 0.01   | 50  | 0.859195 | 0.859238 | 0.85873  | 0.858608 |
| 7 | SGD  | 0.01   | 150 | 0.863402 | 0.863466 | 0.863336 | 0.863026 |
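
The SGD runs in the grid set `decay = lr / epochs`. The sketch below estimates the learning rate at the end of each SGD run under three assumptions that the notebook does not state: that the legacy Keras `decay` argument shrinks the rate as `lr / (1 + decay * step)` with `step` counted in batch updates, that `fit` uses its default batch size of 32, and that `validation_split=0.2` leaves the first 80% of the 50393 training rows for fitting. `decayed_lr` is an illustrative helper. Newer Keras releases replace this argument with learning-rate schedules, so treat the numbers as indicative only.

```python
import math


def decayed_lr(initial_lr, decay, step):
    """Time-based decay: the rate shrinks as 1 / (1 + decay * step)."""
    return initial_lr / (1.0 + decay * step)


BATCH_SIZE = 32                      # assumed Keras default
n_fit = int(50393 * (1 - 0.2))       # rows left for fitting after the validation split
steps_per_epoch = math.ceil(n_fit / BATCH_SIZE)

final_lrs = {}
for lr, epochs in [(0.001, 50), (0.001, 150), (0.01, 50), (0.01, 150)]:
    decay = lr / epochs              # same rule as in the grid
    final_lrs[(lr, epochs)] = decayed_lr(lr, decay, steps_per_epoch * epochs)
    print(f"lr={lr}, epochs={epochs}: final lr ~ {final_lrs[(lr, epochs)]:.6f}")
```

Because `decay` is `lr / epochs`, the product of `decay` and the total number of steps equals `lr * steps_per_epoch`. Under these assumptions the final rate therefore depends on the initial rate but not on the number of epochs.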


In [26]:
# find the best model (highest test accuracy) and save its scores to our final result list
best = int(np.argmax(mlp_res['accuracy_score']))  # row 7 in the table above: SGD, lr=0.01, 150 epochs
all_res['accuracy_score'].append(mlp_res['accuracy_score'][best])
all_res['precision_score'].append(mlp_res['precision_score'][best])
all_res['recall_score'].append(mlp_res['recall_score'][best])
all_res['f1_score'].append(mlp_res['f1_score'][best])

In [27]:
# pd.DataFrame(history.history).plot(figsize=(8,5))
# plt.grid(True)
# plt.gca().set_ylim(0,1)
# plt.show()

### 3.3. CNN<a id='3.3'></a>

In [29]:
# Clear any existing TensorFlow graph from memory and set random seeds.
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

In [30]:
X_train_keras = X_train_std.reshape(X_train_std.shape[0], 28, 28)
X_test_keras = X_test_std.reshape(X_test_std.shape[0], 28, 28)

X_train_keras = np.expand_dims(X_train_keras, axis=3)
X_test_keras = np.expand_dims(X_test_keras, axis=3)

y_train_keras = keras.utils.to_categorical(y_train, 62).astype('int32')
y_test_keras = keras.utils.to_categorical(y_test, 62).astype('int32')

# print shape of data and label
print("The shape of train data: ", X_train_keras.shape)
print("The shape of test data: ", X_test_keras.shape)
print("The shape of train label: ", y_train_keras.shape)
print("The shape of test label: ", y_test_keras.shape)

The shape of train data:  (50393, 28, 28, 1)
The shape of test data:  (12599, 28, 28, 1)
The shape of train label:  (50393, 62)
The shape of test label:  (12599, 62)
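
The label arrays above have 62 columns because each integer label is one-hot encoded, and the later evaluation cell recovers the integer labels with an argmax. A minimal pure-Python sketch of that round trip follows. `one_hot` and `argmax` are illustrative helpers, not the Keras or NumPy functions.

```python
def one_hot(label, num_classes):
    """Return a list of zeros with a single 1 at position `label`."""
    row = [0] * num_classes
    row[label] = 1
    return row


def argmax(row):
    """Index of the largest entry (the first one if there are ties)."""
    return max(range(len(row)), key=row.__getitem__)


labels = [0, 10, 61]  # hypothetical class labels
encoded = [one_hot(label, 62) for label in labels]
decoded = [argmax(row) for row in encoded]
print(decoded)
```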


In [31]:
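model = load_model("best_algorithm.h5")  # load the pretrained CNN
model.summary()  # summary() prints the table itself; wrapping it in print() would also print None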

Model: "sequential"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
conv2d (Conv2D)              (None, 28, 28, 64)        3200      
_________________________________________________________________
batch_normalization (BatchNo (None, 28, 28, 64)        256       
_________________________________________________________________
max_pooling2d (MaxPooling2D) (None, 13, 13, 64)        0         
_________________________________________________________________
conv2d_1 (Conv2D)            (None, 13, 13, 128)       204928    
_________________________________________________________________
batch_normalization_1 (Batch (None, 13, 13, 128)       512       
_________________________________________________________________
max_pooling2d_1 (MaxPooling2 (None, 6, 6, 128)         0         
_________________________________________________________________
conv2d_2 (Conv2D)            (None, 6, 6, 256)         2
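
The parameter counts in the summary hint at kernel sizes that the notebook never states. Assuming square kernels with a bias term, a Conv2D layer has `(k * k * in_channels + 1) * filters` parameters, and a batch-normalisation layer has 4 values per channel. The sketch below solves for `k` in the first two convolution layers. `square_kernel_size` is a hypothetical helper, and the result is an inference from the printed numbers rather than a confirmed architecture.

```python
import math


def square_kernel_size(params, in_channels, filters):
    """Solve params = (k*k*in_channels + 1) * filters for k; None if no integer k fits."""
    per_filter, remainder = divmod(params, filters)
    if remainder:
        return None
    k_squared, remainder = divmod(per_filter - 1, in_channels)
    if remainder:
        return None
    k = math.isqrt(k_squared)
    return k if k * k == k_squared else None


def batchnorm_params(channels):
    """gamma, beta, moving mean and moving variance: 4 values per channel."""
    return 4 * channels


k_conv2d = square_kernel_size(3200, in_channels=1, filters=64)
k_conv2d_1 = square_kernel_size(204928, in_channels=64, filters=128)
print(k_conv2d, k_conv2d_1, batchnorm_params(64), batchnorm_params(128))
```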

In [32]:
# Check pretrained model performance on test set
y_prob = model.predict(X_test_keras)
y_pred = np.argmax(y_prob,axis=1)
ground_truth = np.argmax(y_test_keras,axis=1)

print("CNN - accuracy score on test set: {:.3f}".format(accuracy_score(ground_truth, y_pred)))
print("CNN - precision score on test set: {:.3f}".format(precision_score(ground_truth, y_pred, average='macro')))
print("CNN - recall score on test set: {:.3f}".format(recall_score(ground_truth, y_pred, average='macro')))
print("CNN - f1 score on test set: {:.3f}".format(f1_score(ground_truth, y_pred, average='macro')))
print("CNN - confusion matrix on test set: \n", confusion_matrix(ground_truth, y_pred))
all_res['accuracy_score'].append(accuracy_score(ground_truth, y_pred))
all_res['precision_score'].append(precision_score(ground_truth, y_pred, average='macro'))
all_res['recall_score'].append(recall_score(ground_truth, y_pred, average='macro'))
all_res['f1_score'].append(f1_score(ground_truth, y_pred, average='macro'))

CNN - accuracy score on test set: 0.902
CNN - precision score on test set: 0.902
CNN - recall score on test set: 0.901
CNN - f1 score on test set: 0.900
CNN - confusion matrix on test set: 
 [[170   0   0 ...   0   0   0]
 [  0 179   0 ...   0   0   0]
 [  0   0 211 ...   0   1   0]
 ...
 [  0   0   0 ... 173   0   0]
 [  0   1   0 ...   0 191   0]
 [  0   0   0 ...   0   0 138]]
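
The printed 62x62 confusion matrix is truncated, so the class pairs that are actually confused are not visible. One way to read such a matrix is to list its largest off-diagonal entries. The sketch below does this for a hypothetical 3-class matrix, where rows are true classes and columns are predicted classes. `top_confusions` is an illustrative helper.

```python
def top_confusions(matrix, k=3):
    """Return up to k (true_class, predicted_class, count) triples with the largest error counts."""
    errors = [
        (count, true_class, pred_class)
        for true_class, row in enumerate(matrix)
        for pred_class, count in enumerate(row)
        if true_class != pred_class and count > 0
    ]
    errors.sort(reverse=True)
    return [(t, p, count) for count, t, p in errors[:k]]


toy_matrix = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 3, 7],
]  # hypothetical counts
worst = top_confusions(toy_matrix, k=2)
print(worst)
```

With a NumPy array from `confusion_matrix`, the same helper could be called on `matrix.tolist()`.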


In [24]:
# pd.DataFrame(history.history).plot(figsize=(8,5))
# plt.grid(True)
# plt.gca().set_ylim(0,1)
# plt.show()

## 4. Classifier comparisons<a id='4'></a>

In [None]:
all_res_n = pd.DataFrame(all_res)
all_res_n

## 5. Computer details<a id='5'></a>

- Hardware
    - OS: Windows 10 (64-bit)
    - CPU: Intel(R) Core(TM) i7-9700KF
    - GPU: NVIDIA GeForce RTX 2070
    - RAM: 16.0 GB
    
- Software
    - Python 3.8.3
    - notebook 6.0.3

## 6. Easy to use<a id='6'></a>

In [39]:
import sklearn
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
import numpy as np
import pandas as pd
import os
import time
import cv2

%matplotlib inline
import matplotlib as mpl
import matplotlib.pyplot as plt

# used to compare the results of all models
all_res = {'algorithms': ['knn', 'MLP', 'cnn'],
           'accuracy_score':[],
           'precision_score':[],
           'recall_score':[], 
           'f1_score':[]}

# read image and resize image
def img_resizing(i_path):
    img = cv2.imread(i_path, cv2.IMREAD_GRAYSCALE)
    data = cv2.resize(img, (28, 28), interpolation=cv2.INTER_CUBIC)
    return data

# converting the images into sets of data
data = []
label = []

print("############### Data Path Input ############### \n")
print("Default path is: 'English/Fnt/'")
print("*** Press Enter without typing anything to use the default path ***")
print("Windows Path Example: C:/Users/xxx/Downloads/5318/English/Fnt/")
print("MacOS Path Example: /Users/xxx/Downloads/5318/English/Fnt/")

data_path = input("Please input the data path (the folder containing the Samplexxx folders), ending with '/': ").strip()
if data_path == '':
    data_path = 'English/Fnt/'
elif not data_path.endswith('/'):
    data_path += '/'  # the Sample folder name is appended directly, so the separator is required

for i in range(62):
    path = data_path + 'Sample%03d/' % (i+1)
    for filename in os.listdir(path):
        try:
            img = img_resizing(path+filename)
            tmp = img.reshape([1, img.shape[0]*img.shape[1]])
            data.append(np.asarray(tmp, dtype = "int32"))
            label.append(i)
        except Exception:
            # cv2.imread returns None for unreadable files, which makes the resize step fail
            print("failed: ", path + filename)
print("############### Data Loaded ############### \n ")
# convert data type
data_array = np.asarray(data)
new_data = np.asarray(data_array).reshape(data_array.shape[0],-1)
new_label = np.asarray(label)

from sklearn.model_selection import train_test_split
X = new_data
y = new_label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=500)
print("############### Train/Test Split Done ############### \n ")

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()  # create the scaler
scaler.fit(X_train)  # compute the per-feature mean and standard deviation of the training data
X_train_std = scaler.transform(X_train)  # standardise the training set
X_test_std = scaler.transform(X_test)  # standardise the test set with the training statistics
print("############### Data Standardised ############### \n ")

from sklearn.decomposition import PCA
pca=PCA(n_components=0.9)
X_train_reduced = pca.fit_transform(X_train_std)
X_test_reduced = pca.transform(X_test_std)
print("############### PCA Done ############### \n ")

# knn
print("############### KNN Model Loading ############### \n ")
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=1,p=1)
knn.fit(X_train_reduced, y_train)
print("############### KNN Model Loaded ############### \n ")
y_pred = knn.predict(X_test_reduced)

print("Knn - accuracy score on test set: {:.3f}".format(accuracy_score(y_test, y_pred)))
print("Knn - precision score on test set: {:.3f}".format(precision_score(y_test, y_pred, average='macro')))
print("Knn - recall score on test set: {:.3f}".format(recall_score(y_test, y_pred, average='macro')))
print("Knn - f1 score on test set: {:.3f}".format(f1_score(y_test, y_pred, average='macro')))
print("Knn - confusion matrix on test set: \n", confusion_matrix(y_test, y_pred))
all_res['accuracy_score'].append(accuracy_score(y_test, y_pred))
all_res['precision_score'].append(precision_score(y_test, y_pred, average='macro'))
all_res['recall_score'].append(recall_score(y_test, y_pred, average='macro'))
all_res['f1_score'].append(f1_score(y_test, y_pred, average='macro'))

# MLP
from tensorflow.keras.models import Sequential, load_model
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.image import ImageDataGenerator
print("############### MLP Model Loading ############### \n ")
# Clear any existing TensorFlow graph from memory and set random seeds.
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

model= keras.models.Sequential([
    keras.layers.Flatten(input_shape=X_train_std.shape[1:]),
    keras.layers.Dense(400, activation="relu"),
    keras.layers.Dense(200, activation="relu"),
    keras.layers.Dense(100, activation="relu"),
    keras.layers.Dense(62, activation="softmax")
])

sgd = keras.optimizers.SGD(learning_rate=0.01, momentum=0.8, decay=0.01/150, nesterov=False)
model.compile(loss="sparse_categorical_crossentropy",
              optimizer=sgd,
              metrics=["accuracy"])
model.fit(X_train_std, y_train, epochs=150,
          validation_split=0.2)  # hold out 20% of the training data as a validation set
print("############### MLP Model Loaded ###############")
# Check the performance of the MLP trained above on the test set
y_prob = model.predict(X_test_std)
y_pred = np.argmax(y_prob,axis=1)

print("MLP - accuracy score on test set: {:.3f}".format(accuracy_score(y_test, y_pred)))
print("MLP - precision score on test set: {:.3f}".format(precision_score(y_test, y_pred, average='macro')))
print("MLP - recall score on test set: {:.3f}".format(recall_score(y_test, y_pred, average='macro')))
print("MLP - f1 score on test set: {:.3f}".format(f1_score(y_test, y_pred, average='macro')))
print("MLP - confusion matrix on test set: \n", confusion_matrix(y_test, y_pred))
all_res['accuracy_score'].append(accuracy_score(y_test, y_pred))
all_res['precision_score'].append(precision_score(y_test, y_pred, average='macro'))
all_res['recall_score'].append(recall_score(y_test, y_pred, average='macro'))
all_res['f1_score'].append(f1_score(y_test, y_pred, average='macro'))

# cnn
# Clear any existing TensorFlow graph from memory and set random seeds.
keras.backend.clear_session()
np.random.seed(42)
tf.random.set_seed(42)

X_train_keras = X_train_std.reshape(X_train_std.shape[0], 28, 28)
X_test_keras = X_test_std.reshape(X_test_std.shape[0], 28, 28)

X_train_keras = np.expand_dims(X_train_keras, axis=3)
X_test_keras = np.expand_dims(X_test_keras, axis=3)

y_train_keras = keras.utils.to_categorical(y_train, 62).astype('int32')
y_test_keras = keras.utils.to_categorical(y_test, 62).astype('int32')

# load model
print("############### CNN Model Path Input ###############")
model_path = 'best_algorithm.h5'
model = load_model(model_path)
print("############### CNN model loaded ###############")
# Check pretrained model performance on test set
y_prob = model.predict(X_test_keras)
y_pred = np.argmax(y_prob,axis=1)
ground_truth = np.argmax(y_test_keras,axis=1)

print("CNN - accuracy score on test set: {:.3f}".format(accuracy_score(ground_truth, y_pred)))
print("CNN - precision score on test set: {:.3f}".format(precision_score(ground_truth, y_pred, average='macro')))
print("CNN - recall score on test set: {:.3f}".format(recall_score(ground_truth, y_pred, average='macro')))
print("CNN - f1 score on test set: {:.3f}".format(f1_score(ground_truth, y_pred, average='macro')))
print("CNN - confusion matrix on test set: \n", confusion_matrix(ground_truth, y_pred))
all_res['accuracy_score'].append(accuracy_score(ground_truth, y_pred))
all_res['precision_score'].append(precision_score(ground_truth, y_pred, average='macro'))
all_res['recall_score'].append(recall_score(ground_truth, y_pred, average='macro'))
all_res['f1_score'].append(f1_score(ground_truth, y_pred, average='macro'))

print("############### Model Comparison ############### \n ")
all_res_n = pd.DataFrame(all_res)
print(all_res_n)
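
The three evaluation blocks above compute every metric twice, once for printing and once for `all_res`. A minimal sketch of a shared helper is given below; it assumes the same macro averaging and the same `all_res`-style dictionary of lists. The name `record_scores` and the toy labels are hypothetical and are not part of the original notebook.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score


def record_scores(name, y_true, y_pred, results):
    """Compute each metric once, print it, and append it to `results`.

    `results` is assumed to be a dict of lists keyed like `all_res`.
    """
    scores = {
        'accuracy_score': accuracy_score(y_true, y_pred),
        'precision_score': precision_score(y_true, y_pred, average='macro'),
        'recall_score': recall_score(y_true, y_pred, average='macro'),
        'f1_score': f1_score(y_true, y_pred, average='macro'),
    }
    for metric, value in scores.items():
        label = metric.replace('_', ' ')
        print("{} - {} on test set: {:.3f}".format(name, label, value))
        results[metric].append(value)
    return scores


# Toy labels, made up for illustration only (not the assignment data).
demo_res = {'accuracy_score': [], 'precision_score': [],
            'recall_score': [], 'f1_score': []}
demo_scores = record_scores('demo', [0, 1, 2, 2], [0, 1, 2, 1], demo_res)
```

With such a helper, each model's block would reduce to a single call such as `record_scores('Knn', y_test, y_pred, all_res)`, followed by the confusion matrix print.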

############### Data Path Input ############### 

Default path is: 'English/Fnt/'
*** DO NOT ENTER anything if using the above path ***
Windows Path Example: C:/Users/xxx/Downloads/5318/English/Fnt/
MacOS Path Example: /Users/xxx/Downloads/5318/English/Fnt/
Please input the data path (before Samplexxx folders) end with '/'C:/Users/95839/Desktop/English/Fnt/
############### Data Loaded ############### 
 
############### Train/Test Splited ############### 
 
############### Standardise Data ############### 
 
############### PCA Done ############### 
 
############### KNN Model Loading ############### 
 
############### KNN Model Loaded ############### 
 
Knn - accuracy score on test set: 0.872
Knn - precision score on test set: 0.873
Knn - recall score on test set: 0.872
Knn - f1 score on test set: 0.872
Knn - confusion matrix on test set: 
 [[154   0   0 ...   0   0   0]
 [  0 172   0 ...   0   0   0]
 [  0   0 211 ...   0   0   0]
 ...
 [  0   1   0 ... 153   0   1]
 [  0   0   0 ... 

Epoch 50/150
...
Epoch 150/150
############### MLP Model Loaded ###############
MLP - accuracy score on test set: 0.863
MLP - precision score on test set: 0.863
MLP - recall score on test set: 0.863
MLP - f1 score on test set: 0.863
MLP - confusion matrix on test set: 
 [[160   0   0 ...   0   0   0]
 [  0 174   0 ...   0   0   0]
 [  0   1 205 ...   0   0   1]
 ...
 [  0   0   0 ... 149   0   2]
 