# Kuzushiji Recognition with Natural Language Inference
# Group 23: Hualin Liu, Anran Hao, Victor Soh

# Part I. Project Motivation


## Background and Motivation

This project is based on a Kaggle competition on extracting handwritten text from ancient Japanese books. These books are written in Kuzushiji, a cursive script that most modern readers cannot read. By building models that recognize Kuzushiji and transcribe it into modern Japanese, we can make more documents accessible, which helps us understand ancient Japanese culture and history. Several observations about the data are worth mentioning. First, the dataset is highly imbalanced, with several classes having only one occurrence. Second, the dataset is very noisy: characters are written in varying styles across documents, and besides Kuzushiji there are annotation characters and sometimes characters from another page that show through the current page. Third, the characters all come from paintings and books. Fourth, we notice that most existing methods focus only on detecting characters from the image, without using any textual context to infer characters that have low confidence scores. Motivated by these observations, we propose a method that extracts text from documents effectively despite the noise. In our initial proposal, a fine-tuned language model is leveraged to infer and improve detection and classification results using syntactic and semantic information.

## Existing methods

Traditional handwritten optical character recognition (OCR) methods emphasize character segmentation, feature extraction and classification. Segmentation attempts to separate the characters from the rest of the image. Feature extraction derives geometric features, such as bends and lines, or statistical features, such as moments, and feeds them to the classifier. Classification uses neural networks, Bayesian classifiers or nearest-neighbour methods to cluster and classify the extracted features into their classes.

Recently, deep learning techniques have shown strong performance in handwritten character recognition. The main tasks for handwritten OCR using deep learning are detection and classification. For object detection, popular methods include Regions with CNN (RCNN)/MaskRCNN, Single Shot Detector (SSD)/YOLO [2] and Region-based Fully Convolutional Network (RFCN)/UNET. RCNN uses selective search to propose regions of interest, from which bounding boxes are derived. The bounded image regions are subsequently extracted for classification. Some of the teams on Kaggle used UNET for segmentation and detection of the characters. UNET and RFCN use skip connections to train a network that maps an image to segments. The segments are then used for classification of objects, similar to RCNN. SSD and YOLO are fast detection methods that can also carry out classification. For most object classification tasks, state-of-the-art classification networks such as ResNet and Inception are commonly used. For recognising sequences of text in scenes, the Convolutional Recurrent Neural Network (CRNN) has also been shown to improve text recognition accuracy.

## Proposed method

We propose a two-phase approach for the task: first, a character detection model scans through an image to identify candidate characters; second, a character classification model classifies the detected characters into different classes, each of which is mapped to a Unicode character. We aim to leverage contextual information about the characters to enhance recognition. To this end, we plan to integrate a language model (LM) with pre-trained weights into our detection-classification architecture. More specifically, we intend to make use of the inter-character relationships captured by the LM in both the classification and the detection model.

For classification, where characters in a sentence are predicted with low confidence, we can mask these characters and use the LM to infer their classes from the high-confidence predictions of the remaining characters in the sentence. Based on its knowledge of the co-occurrence probabilities of characters, the LM can help classify the characters correctly when visual cues alone are insufficient. For character detection, the LM may also help the model rule out candidates that are malformed. For example, if two consecutive characters written in a condensed manner are recognised by the detection model as one single character, the LM will assign the resulting text span a lower score than other candidate predictions, as it is less likely to form a sensible sentence. When predictions are uncertain, the LM can help to distinguish the right candidate from the rest.
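
The idea of re-deciding low-confidence characters from their context can be illustrated with a minimal sketch. It is not the project's implementation: a toy add-one smoothed bigram model stands in for the pre-trained LM, the Latin letters are placeholders for Kuzushiji classes, and the names `train_bigram`, `rescore` and the `threshold` value are hypothetical. Restricting the choice to the classifier's own candidates is also an assumption made for illustration.

```python
import math
from collections import Counter


def train_bigram(corpus):
    """Count character bigrams and how often each character starts a bigram."""
    bigrams, firsts = Counter(), Counter()
    for sentence in corpus:
        for a, b in zip(sentence, sentence[1:]):
            bigrams[(a, b)] += 1
            firsts[a] += 1
    return bigrams, firsts


def bigram_logprob(prev, cur, bigrams, firsts, vocab_size):
    """Add-one smoothed log P(cur | prev)."""
    return math.log((bigrams[(prev, cur)] + 1) / (firsts[prev] + vocab_size))


def context_score(chars, i, candidate, bigrams, firsts, vocab_size):
    """Score a candidate at position i using its left and right neighbours."""
    score = 0.0
    if i > 0:
        score += bigram_logprob(chars[i - 1], candidate, bigrams, firsts, vocab_size)
    if i + 1 < len(chars):
        score += bigram_logprob(candidate, chars[i + 1], bigrams, firsts, vocab_size)
    return score


def rescore(predictions, bigrams, firsts, vocab_size, threshold=0.6):
    """predictions: one list per position of (char, visual_confidence), best first.

    Positions whose top confidence is below `threshold` are treated as masked:
    the visual score is ignored and the candidate that best fits the
    neighbouring characters under the toy LM is chosen instead.
    """
    chars = [cands[0][0] for cands in predictions]
    for i, cands in enumerate(predictions):
        if cands[0][1] >= threshold:
            continue  # trust the classifier on confident positions
        chars[i] = max(
            (c for c, _ in cands),
            key=lambda c: context_score(chars, i, c, bigrams, firsts, vocab_size),
        )
    return "".join(chars)


# Toy data: placeholder "sentences" and one uncertain middle character.
corpus = ["abc", "abd", "abc"]
bigrams, firsts = train_bigram(corpus)
vocab_size = len(set("".join(corpus)))
predictions = [[("a", 0.9)], [("x", 0.4), ("b", 0.35)], [("c", 0.95)]]
result = rescore(predictions, bigrams, firsts, vocab_size)
```

Here the classifier slightly prefers `x` in the middle position, but the toy LM has only seen `a` followed by `b`, so the rescored string is `abc`. The sketch simplifies in one respect worth noting: a low-confidence neighbour is still used as context, whereas a masked LM would condition only on the confident positions.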

# Part II. Data Acquisition

## Kuzushiji Dataset

The Kuzushiji dataset can be downloaded from <a href="https://www.kaggle.com/c/kuzushiji-recognition/data">here</a>. 

You will need to install the Kaggle command-line tool and use the following commands to download the data:

<code>pip install kaggle
kaggle competitions download -c kuzushiji-recognition
</code>

Please put all downloaded content inside a folder called <code>data</code> and place that folder alongside the other source code folders.

An example of the resulting folder structure:

<code>
./ (local folder)
│   ... (other source code folders)
└───data
    │ sample_submission.csv
    │ train_images
    │ unicode_translation.csv
    │ train.csv
    │ ...
</code>

The dataset contains the following:

train.csv - the training set labels and bounding boxes.
- image_id: the id code for the image.
- labels: a string of all labels for the image. The string should be read as a space-separated series of values where Unicode character, X, Y, Width, and Height are repeated as many times as necessary.
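
As a minimal sketch of reading this format, the following splits a `labels` string into groups of five values. The names `CharBox` and `parse_train_labels` are hypothetical, and the example string is made up for illustration rather than taken from the dataset.

```python
from typing import List, NamedTuple


class CharBox(NamedTuple):
    unicode_id: str  # e.g. "U+306F"
    x: int
    y: int
    width: int
    height: int


def parse_train_labels(labels: str) -> List[CharBox]:
    """Split a train.csv `labels` string into one CharBox per character."""
    tokens = labels.split()
    if len(tokens) % 5 != 0:
        raise ValueError("expected groups of 5 values, got %d tokens" % len(tokens))
    boxes = []
    for i in range(0, len(tokens), 5):
        code, x, y, w, h = tokens[i:i + 5]
        boxes.append(CharBox(code, int(x), int(y), int(w), int(h)))
    return boxes


# Made-up example with two characters.
example = "U+306F 1231 3465 133 53 U+304C 275 1652 84 69"
boxes = parse_train_labels(example)
```

An empty string yields an empty list, which is convenient if a page has no annotated characters; a missing value read through a dataframe library would need to be converted to an empty string first.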

sample_submission.csv - a sample submission file in the correct format.

- image_id: the id code for the image
- labels: a string of all labels for the image. The string should be read as a space-separated series of values where Unicode character, X, and Y are repeated as many times as necessary. The default label predicts that there are the same two characters on every page, centered at pixels (1,1) and (2,2).
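
A submission string in this format could be produced along the following lines. This is a sketch with hypothetical helper names (`box_center`, `format_submission_labels`); it assumes that X and Y in a bounding box refer to the top-left corner, which the description above does not state, and the Unicode IDs in the example are placeholders.

```python
from typing import Iterable, Tuple


def box_center(x: int, y: int, width: int, height: int) -> Tuple[int, int]:
    """Centre of a box, ASSUMING (x, y) is its top-left corner."""
    return x + width // 2, y + height // 2


def format_submission_labels(points: Iterable[Tuple[str, int, int]]) -> str:
    """Join (unicode_id, x, y) triples into one space-separated string."""
    parts = []
    for code, x, y in points:
        parts.extend([code, str(int(x)), str(int(y))])
    return " ".join(parts)


# Two placeholder predictions centred at (1, 1) and (2, 2).
labels = format_submission_labels([("U+003F", 1, 1), ("U+FF2F", 2, 2)])
center = box_center(10, 20, 5, 8)
```

With these inputs, `labels` is `"U+003F 1 1 U+FF2F 2 2"` and `center` is `(12, 24)`.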

unicode_translation.csv - supplemental file mapping Unicode IDs to Japanese characters.
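
A minimal sketch of loading such a mapping into a dictionary follows. It assumes the file has a header row and that the first column holds the Unicode ID and the second the character; the header names in the in-memory sample are illustrative, and `load_unicode_map` is a hypothetical helper.

```python
import csv
import io


def load_unicode_map(csv_file):
    """Read (Unicode ID, character) pairs from an open CSV file into a dict."""
    reader = csv.reader(csv_file)
    next(reader, None)  # skip the header row (assumed to be present)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


# In-memory stand-in for the real file, with assumed column order.
sample = io.StringIO("Unicode,char\nU+306F,は\nU+304C,が\n")
mapping = load_unicode_map(sample)
```

For the real file, the same function could be called on `open("data/unicode_translation.csv", encoding="utf-8", newline="")`, given the folder layout described earlier.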

[train/test]_images.zip - the images.
