Kanji recognition

Introduction

This project was inspired by the TensorFlow tutorial on MNIST handwritten digit recognition, which I followed while learning about Convolutional Neural Networks.

This project demonstrates building a CNN to recognize Japanese kanji characters.

Write-up

Labels

Moving away from the MNIST example, my first problem was the labels. As I was learning RTK (Remembering the Kanji) at the time, I thought those characters would be a good starting point since they fit my needs as a Japanese learner. I spent a few days writing some scrapers to collect those characters from a Memrise course and Wikipedia.

Data

While writing those scrapers, I realized that I had no dataset for training. Because of that, I built a drawing/note-taking app with Cordova to collect some (unlabeled) data. That took a few weeks, and I was happy with it because it was one of my first mobile app experiences.

A few weeks later, I realized that I was generating nowhere near enough data for training. The MNIST dataset has several thousand records per label. I had ~2,000 labels and fewer than 10 records for 20% of them. I needed to find another way to create data, and fonts came to mind. While learning Japanese with Anki, I had noticed that the default font for rendering Japanese was pretty bad for learners - the characters were not rendered the way we are supposed to write them - so I had already gotten my hands on some community-recommended Japanese fonts suitable for learners. That gave me the idea of using Japanese fonts to generate image data. It took me one to two months to complete that part of the project.
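The font-based generation can be sketched roughly as below. This is a minimal example assuming Pillow for rendering; the font path, image size, and sample character are placeholders rather than the exact values used in the project.

```python
# Minimal sketch: render one kanji glyph to a grayscale image with Pillow.
# The font path is a placeholder - any Japanese TrueType/OpenType font works.
from PIL import Image, ImageDraw, ImageFont

def render_glyph(char, font_path, image_size=64, font_size=56):
    font = ImageFont.truetype(font_path, font_size)
    image = Image.new('L', (image_size, image_size), color=0)
    draw = ImageDraw.Draw(image)
    # Center the glyph inside the image.
    left, top, right, bottom = draw.textbbox((0, 0), char, font=font)
    x = (image_size - (right - left)) / 2 - left
    y = (image_size - (bottom - top)) / 2 - top
    draw.text((x, y), char, fill=255, font=font)
    return image

if __name__ == '__main__':
    # 'fonts/example.ttf' is a hypothetical path.
    render_glyph('漢', 'fonts/example.ttf').save('kanji_sample.png')
```

Repeating this over every label and several fonts, optionally with small random shifts and distortions, yields a synthetic training set.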

Training

With the data ready, I was able to train reasonably good models within a few weeks. I spent the next few months building some applications that use the model - a web app demo, an Android app, and a desktop app for labeling the handwriting data I had collected.
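The models follow the same pattern as the MNIST tutorial. The sketch below is an assumption about the general shape rather than the exact architecture: a small stack of convolution and pooling layers followed by dense layers, with the input size and the ~2,000-class output taken from the numbers above.

```python
# Minimal CNN sketch in the style of the TensorFlow MNIST tutorial.
# Input shape (64x64 grayscale) and num_classes are assumptions.
import tensorflow as tf

def build_model(num_classes=2000, input_shape=(64, 64, 1)):
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(32, 3, activation='relu', input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(64, 3, activation='relu'),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(256, activation='relu'),
        tf.keras.layers.Dropout(0.5),
        tf.keras.layers.Dense(num_classes, activation='softmax'),
    ])

model = build_model()
model.compile(
    optimizer='adam',
    loss='sparse_categorical_crossentropy',
    metrics=['accuracy'],
)
# model.fit(train_images, train_labels, epochs=10, validation_split=0.1)
```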

New data

One day, I stumbled upon the ETL Character Database - an image dataset that was perfect for my needs. It contains more data than I could write by hand in the next five years, and all of it is labeled. It took me a few weeks to process one part of the dataset, and it was the first time I had to research text encodings (ASCII, UTF-8, SHIFT-JIS, UTF-16, etc.). Trained on this dataset, the model performed significantly better than when trained on the font-generated data.
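The encoding issue boils down to the fact that the same bytes decode to different characters under different codecs. The snippet below only illustrates that point; the actual ETL record layouts vary between parts of the database and are described in its documentation.

```python
# The same byte sequence means different things under different codecs.
raw = '漢'.encode('shift_jis')                # b'\x8a\xbf'

print(raw.decode('shift_jis'))                # '漢' - correct codec
print(raw.decode('utf-8', errors='replace'))  # mojibake - wrong codec
```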

Implementations

  • TensorFlow - Python - Train the model with Python and TensorFlow.

  • tfjs (on gh-pages branch) - Use the trained model to recognize handwriting from an HTML canvas with JavaScript.

  • TensorFlow Lite - Use the trained model to build a handwriting input app on Android devices with Java/Kotlin (a model export sketch follows this list).
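For the web and Android targets, the trained Keras model has to be exported first. The sketch below shows the TensorFlow Lite conversion in Python, assuming the model is saved under a placeholder file name; the tfjs model is normally produced separately with the tensorflowjs_converter command-line tool.

```python
# Minimal sketch: convert a saved Keras model to TensorFlow Lite.
# 'kanji_model.h5' and 'kanji_model.tflite' are placeholder file names.
import tensorflow as tf

model = tf.keras.models.load_model('kanji_model.h5')

converter = tf.lite.TFLiteConverter.from_keras_model(model)
tflite_model = converter.convert()

with open('kanji_model.tflite', 'wb') as f:
    f.write(tflite_model)
```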

References