This is an HTML page that takes an image containing English words and numbers and returns the text in it. An English word image generator produces character images in different fonts and font sizes. These generated images, together with other character image datasets, are used to train a neural network model. The idea of the preprocessing method and model is from here. The trained classification model is then used to retrieve the text from the image.
The index page:
The result page:
Running the page requires Python 3. It also requires packages including
Packages can be installed through pip.
To start the page, run Flask_web.py with Python 3 using the following command:
python Flask_web.py
"Running on http://127.0.0.1:5000/" will be displayed.
Then, the page can be browsed at the address http://127.0.0.1:5000/
An image with English words or digits on it can be uploaded. The text will be retrieved and displayed in the result image after processing.
Flow diagram
An English word generator produces images of English characters and digits for each font and each font size on a black background. The character set covers 26 upper case letters, 26 lower case letters and the digits 0 to 9, which is 62 characters in total. The 53 font files are in the "src_smallSubset/Font/" folder. Four font sizes (8, 12, 36 and 40) are used. In total, 13144 images (62 characters x 53 fonts x 4 sizes) are produced and saved to the folder "words_generated".
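The image count can be checked with a short enumeration. This is only a sketch of the bookkeeping, not the generator itself: the names `GLYPHS`, `FONT_SIZES`, `NUM_FONTS` and `enumerate_jobs` are hypothetical, and fonts are represented by an index rather than by the actual font files.

```python
import string
from itertools import product

# 26 upper case + 26 lower case letters + 10 digits = 62 characters
GLYPHS = string.ascii_uppercase + string.ascii_lowercase + string.digits
FONT_SIZES = (8, 12, 36, 40)  # the four sizes listed in the text
NUM_FONTS = 53                # number of font files in "src_smallSubset/Font/"


def enumerate_jobs(num_fonts=NUM_FONTS, sizes=FONT_SIZES, glyphs=GLYPHS):
    """Yield one (font_index, font_size, character) tuple per image to render."""
    return product(range(num_fonts), sizes, glyphs)


total = sum(1 for _ in enumerate_jobs())
print(len(GLYPHS), total)  # 62 13144
```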
Apart from the English Word Generator, other English character and digit datasets are also used as input to improve the performance of the model. These datasets, which cover different variations of characters, include
- 62992 computer font character images from Chars74k dataset
- A subset of 6283 street view character images from Chars74k dataset in here
- A handwritten character dataset of 6979 images from EMNIST
Moreover, a non-text object image dataset of 14000 images from CIFAR-10 is used so that objects other than characters and digits can be recognized as non-text.
Images are preprocessed and features are then extracted before they are fed to the models. The street view character, handwritten character and non-text object datasets consist of noisy images taken from Google Street View photos or from scans of handwritten documents, so preprocessing is needed. The Scikit-image package is used to preprocess the images and extract features.
Images are converted to grayscale and a median filter is then applied to reduce noise. Intensity rescaling is also applied to stretch the contrast. After the images are resized to 32x32 pixels, their feature vectors are computed with the histogram of oriented gradients (HOG), which computes a gradient image and gradient histograms. Each feature vector is the concatenation of the features of 16 cells (8x8 pixels per cell) in 9 orientations.
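The steps above can be sketched with Scikit-image as follows. This is a minimal illustration rather than the project's actual code: the function name `extract_features` is hypothetical, the median filter uses the library's default 3x3 neighbourhood, and `cells_per_block=(1, 1)` is an assumption chosen so that the vector length is 16 cells x 9 orientations = 144 (the block layout is not stated in the text).

```python
import numpy as np
from skimage.color import rgb2gray
from skimage.exposure import rescale_intensity
from skimage.feature import hog
from skimage.filters import median
from skimage.transform import resize


def extract_features(rgb_image):
    """Grayscale -> median filter -> contrast stretch -> 32x32 -> HOG."""
    gray = rgb2gray(rgb_image)               # float image in [0, 1]
    denoised = median(gray)                  # default 3x3 neighbourhood
    stretched = rescale_intensity(denoised)  # stretch to the full range
    small = resize(stretched, (32, 32))
    # 32x32 pixels with 8x8-pixel cells gives 4x4 = 16 cells.
    # cells_per_block=(1, 1) is an assumption, see the note above.
    return hog(small, orientations=9, pixels_per_cell=(8, 8),
               cells_per_block=(1, 1))


# A random array stands in for a real photo.
demo_image = np.random.default_rng(0).random((64, 48, 3))
features = extract_features(demo_image)
print(features.shape)
```

With Scikit-image's default of 3x3 cells per block the vector would be longer, so the block setting should match whatever was used when the models were trained.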
Computer font character images and images from the English Word Generator are also converted to grayscale and resized to 32x32 pixels before features are extracted with the histogram of oriented gradients.
The lists of features for each image are saved to a .txt file.
The Support Vector Machine Classifier (SVC) distinguishes characters from non-text objects. The Scikit-learn SVM package is used to build the model, with a radial basis function (RBF) as the kernel function.
The following features are loaded, shuffled randomly and inputted to train the SVC model.
- Street view character and
- Non-text object images
The average recall of the model under cross-validation is 89%. After being fitted to the data, the model is saved to the "svmmodel.pkl" file.
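A minimal sketch of this training step with Scikit-learn is shown below. The data are synthetic: random 144-dimensional vectors stand in for the real HOG features, the fold count of 5 is an assumption, and the model is pickled in memory instead of being written to "svmmodel.pkl".

```python
import pickle

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC
from sklearn.utils import shuffle

rng = np.random.default_rng(0)
# Synthetic stand-ins for the two feature sets (label 1 = text, 0 = non-text).
text_feats = rng.normal(0.6, 0.1, size=(100, 144))
other_feats = rng.normal(0.4, 0.1, size=(100, 144))
X = np.vstack([text_feats, other_feats])
y = np.array([1] * 100 + [0] * 100)
X, y = shuffle(X, y, random_state=0)  # shuffle features and labels together

clf = SVC(kernel="rbf")
# Recall on the text class, estimated with 5-fold cross-validation.
recalls = cross_val_score(clf, X, y, cv=5, scoring="recall")
print("mean recall:", recalls.mean())

clf.fit(X, y)
restored = pickle.loads(pickle.dumps(clf))  # same round trip as a .pkl file
```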
The Multi-layer Perceptron (MLP) neural network model is trained with backpropagation and classifies an image into one of 62 classes: upper case characters, lower case characters and digits. The Scikit-learn MLP package is used to build the model. The parameters are as follows:
- Solver: stochastic gradient-based optimizer
- Activation function: hyperbolic tangent (tanh) function
- Learning rate, "invscaling": gradually decreases the learning rate
- 5 hidden layers with 300 neurons per layer, plus one input layer and one output layer
The following features are combined, shuffled randomly and inputted to train the MLP model.
- Street view character image,
- Handwritten character image,
- Computer font character image and
- Image generated by English Word Generator
The average accuracy of the model under cross-validation is 68%. After being fitted to the data, the model is saved to the "mlpfullmodel.pkl" file.
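The parameters listed above map onto Scikit-learn's `MLPClassifier` roughly as in the sketch below. The training data are random 144-dimensional vectors used purely for illustration, and `max_iter=3` is an assumption that keeps the example fast (Scikit-learn will warn that the optimizer has not converged); the real model would be trained on the combined feature sets for many more iterations.

```python
import string

import numpy as np
from sklearn.neural_network import MLPClassifier

# 62 class labels: upper case, lower case and digits.
GLYPHS = string.ascii_uppercase + string.ascii_lowercase + string.digits

rng = np.random.default_rng(0)
X = rng.random((2 * len(GLYPHS), 144))  # synthetic stand-in features
y = np.array(list(GLYPHS) * 2)          # two samples per class

mlp = MLPClassifier(
    solver="sgd",                      # stochastic gradient descent
    activation="tanh",                 # hyperbolic tangent
    learning_rate="invscaling",        # learning rate decays over time
    hidden_layer_sizes=(300,) * 5,     # 5 hidden layers, 300 neurons each
    max_iter=3,                        # assumption: kept tiny for the demo
    random_state=0,
)
mlp.fit(X, y)
print(mlp.n_layers_, len(mlp.classes_))  # layers include input and output
```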
The HTML page is a simple interface that lets the user upload an image, processes it by predicting the text, and displays the output image. It is built with Flask, a Python web framework that serves the page and runs Python functions in response to requests.
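A minimal sketch of such a Flask upload endpoint is shown below. It is not the contents of Flask_web.py: the route, the form field name `image`, and the placeholder `predict_text` are all hypothetical, and the upload is written to a temporary directory here rather than to the project's "static" folder.

```python
import io
import os
import tempfile

from flask import Flask, request
from werkzeug.utils import secure_filename

app = Flask(__name__)
UPLOAD_DIR = tempfile.mkdtemp()  # assumption: temp dir instead of "static"


def predict_text(path):
    """Hypothetical placeholder for the text-recognition function."""
    return "HELLO"


@app.route("/", methods=["GET", "POST"])
def index():
    if request.method == "POST":
        upload = request.files["image"]  # hypothetical form field name
        path = os.path.join(UPLOAD_DIR, secure_filename(upload.filename))
        upload.save(path)
        return "Recognized: " + predict_text(path)
    # A bare upload form; the real page uses HTML templates.
    return ('<form method="post" enctype="multipart/form-data">'
            '<input type="file" name="image"><input type="submit"></form>')


# Exercise the endpoint without starting a server.
client = app.test_client()
response = client.post(
    "/",
    data={"image": (io.BytesIO(b"fake image bytes"), "sample.png")},
    content_type="multipart/form-data",
)
print(response.data)
```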
The image is first uploaded by the user on the index.html page and saved to the "static" folder. It is then read by the function that predicts the text in the image and prints and saves that text on a blank background. The function looks for objects in the image by searching for contours at level 0.45 with find_contours from the Scikit-image package.
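The contour search can be sketched as below. The helper `object_boxes` is hypothetical: it turns each contour returned by `find_contours` into a bounding box that could be used to crop one candidate object, which is an assumption about how the crops are derived. A synthetic image with two bright rectangles stands in for an uploaded picture.

```python
import numpy as np
from skimage.measure import find_contours


def object_boxes(gray, level=0.45):
    """Return (min_row, min_col, max_row, max_col) for each contour found."""
    boxes = []
    for contour in find_contours(gray, level):
        rows, cols = contour[:, 0], contour[:, 1]
        boxes.append((int(np.floor(rows.min())), int(np.floor(cols.min())),
                      int(np.ceil(rows.max())), int(np.ceil(cols.max()))))
    return boxes


# Two bright rectangles on a black background stand in for two characters.
image = np.zeros((20, 40))
image[5:15, 5:12] = 1.0
image[5:15, 25:33] = 1.0

# Sort left to right so the crops follow reading order.
boxes = sorted(object_boxes(image), key=lambda box: box[1])
crops = [image[r0:r1, c0:c1] for r0, c0, r1, c1 in boxes]
print(len(boxes))
```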
After the contours are found, the SVC model is loaded and takes the cropped images of the objects found in the image to filter out those containing non-text objects. The MLP model is then loaded and takes the filtered images to identify each character and digit. The text is printed on a blank background image, and this result image is saved to the "static" folder. The text, the original image and the result image are displayed on the result.html page.
Images in "src_smallSubset/examples_for_test/" folder are example images to be uploaded to the page for demonstrating the result.
A convolutional neural network is a deep, feed-forward artificial neural network for image processing and recognition. It uses multiple convolutional and pooling layers to learn directly from the image pixels. Compared to my model, it has the advantage of requiring fewer preprocessing steps.
- A full Chars74k dataset
The full Chars74k dataset contains over 74000 character images from street view, computer font and handwritten text. Using the whole dataset provides a more comprehensive set of examples of characters in different sizes, fonts and contexts, which may benefit the machine learning model.
If you have any enquiries or suggestions, please feel free to contact me at kftam@connect.ust.hk.