Code adapted from YouTube user @aniquemaniac. The `.ipynb` is optimized for Google Colab. The original pipeline trains Tesseract 4.1 on a custom English font family, not Japanese handwriting, so I made several modifications to make it work for this purpose.

Download these required files from GitHub and upload them to Google Drive.

**1. Tesseract 4.1**


https://github.com/tesseract-ocr/tesseract/tree/4.1

**2. Download jpn.traineddata**

https://github.com/tesseract-ocr/tessdata_best

After downloading it, put `jpn.traineddata` inside the downloaded tesseract folder `tesseract/tessdata`

**3. Langdata lstm**

https://github.com/tesseract-ocr/langdata_lstm

Only a few directories and files are needed from this repository.
To get them, install this Chrome extension, which lets you select and download specific directories and files: https://chrome.google.com/webstore/detail/gitzip-for-github/ffabmkklhbepgcgfonabamgnfafbdlkn


Watch this video to locate where to put the files in the `langdata_lstm` folder: https://www.youtube.com/watch?v=V2chutR7RZo

1. Language-specific folders: `jpn`, `chi_sim`, `chi_tra`
2. `Licence`
3. `desired_bigrams.txt`
4. `font_properties`
5. `radical-stroke.txt`
6. `forbidden_characters_default`
7. `Katakana.unicharset`, `Katakana.xheights`, `Hiragana.unicharset`, `Hiragana.xheights`

Folder Structure ▶

    ------tesseract
    |         |----tessdata
    |                |---jpn.traineddata
    |
    ------langdata_lstm

Finally, upload everything to the root folder of your Google Drive.
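
Once Drive is mounted (step 3), a quick check can confirm that the layout above is in place before any training starts. The following is only a minimal sketch: the helper name `missing_paths` and the `EXPECTED` list are made up for illustration, and `/content/drive/MyDrive` is assumed to be the Drive root, as in the commands later in this notebook.

```python
from pathlib import Path

# Paths the notebook expects under the Drive root (from the folder structure above).
EXPECTED = [
    "tesseract/tessdata/jpn.traineddata",
    "langdata_lstm",
]


def missing_paths(drive_root, expected=EXPECTED):
    """Return the expected relative paths that do not exist under drive_root."""
    root = Path(drive_root)
    return [rel for rel in expected if not (root / rel).exists()]


# Assumption: Drive is mounted at /content/drive and MyDrive is its root.
missing = missing_paths("/content/drive/MyDrive")
print("Missing:", missing if missing else "nothing")
```

If anything is listed as missing, upload it to Drive again before continuing.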






# 1. Install Tesseract



In [None]:
!sudo apt install tesseract-ocr

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
The following additional packages will be installed:
  tesseract-ocr-eng tesseract-ocr-osd
The following NEW packages will be installed:
  tesseract-ocr tesseract-ocr-eng tesseract-ocr-osd
0 upgraded, 3 newly installed, 0 to remove and 45 not upgraded.
Need to get 4,816 kB of archives.
After this operation, 15.6 MB of additional disk space will be used.
Get:1 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-eng all 1:4.00~git30-7274cfa-1.1 [1,591 kB]
Get:2 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr-osd all 1:4.00~git30-7274cfa-1.1 [2,990 kB]
Get:3 http://archive.ubuntu.com/ubuntu jammy/universe amd64 tesseract-ocr amd64 4.1.1-2.1build1 [236 kB]
Fetched 4,816 kB in 1s (5,097 kB/s)
debconf: unable to initialize frontend: Dialog
debconf: (No usable dialog-like program is installed, so the dialog based frontend cannot be used. at /usr/share/perl5/Debc

# 2. Create Folders
**train**   ▶    &nbsp;&nbsp;&nbsp;&nbsp; Generated training data<br>
**output**  ▶    &nbsp; Output of the trained model


In [None]:
!mkdir output train

# 3. Add Google Drive

In [None]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


# 4. Grant Permission to Tesseract Folder


In [None]:
!chmod 755 -R /content/drive/MyDrive/tesseract/src/training/tesstrain.sh

# 5. Create `lstm.train` Files
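First generate the `<index>.tif` and `<index>.box` files using the `.ipynb` notebook in our repo. Running Tesseract with the `lstm.train` config then produces one `<index>.lstmf` file per image. You can modify the range in the loop to generate more `.lstmf` files.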

In [None]:
%%shell
# Run Tesseract with the lstm.train config on images 0-9; each run writes /content/train/<i>.lstmf
for i in {0..9}; do tesseract -l jpn /content/train/$i.tif /content/train/$i --psm 6 lstm.train; done

Error opening data file /usr/share/tesseract-ocr/4.00/tessdata/jpn_vert.traineddata
Please make sure the TESSDATA_PREFIX environment variable is set to your "tessdata" directory.
Failed loading language 'jpn_vert'
Tesseract Open Source OCR Engine v4.1.1 with Leptonica



Create `jpn.training_files.txt`.
In it, list the `.lstmf` files you want to train on, one path per line. Put the text file in the `train` folder.
```
train/0.lstmf
train/1.lstmf
train/2.lstmf
```
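
Writing this list by hand gets tedious as the number of images grows. As a rough sketch, it could be generated from whatever `.lstmf` files are present. The helper name `write_training_list` is hypothetical, and the code assumes the files are named `<index>.lstmf` and that the working directory is the parent of `train` (`/content` in Colab).

```python
from pathlib import Path


def write_training_list(train_dir="train", list_name="jpn.training_files.txt"):
    """Write one `<dir>/<file>.lstmf` path per line and return the lines."""
    train = Path(train_dir)

    def sort_key(path):
        # Numeric stems sort by value (2 before 10); other names sort after them.
        return (0, int(path.stem)) if path.stem.isdigit() else (1, path.stem)

    files = sorted(train.glob("*.lstmf"), key=sort_key)
    # Paths are written relative to the parent of the train folder.
    lines = [f"{train.name}/{p.name}" for p in files]
    (train / list_name).write_text("\n".join(lines) + ("\n" if lines else ""))
    return lines


# Only run when the train folder exists (created in step 2).
if Path("train").is_dir():
    print(write_training_list("train"))
```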

# 6. Extract `jpn.lstm` from `jpn.traineddata`

`jpn.lstm` will be extracted from `jpn.traineddata`. We will use it as the pretrained model.

In [None]:
!combine_tessdata -e /content/drive/MyDrive/tesseract/tessdata/jpn.traineddata jpn.lstm

Extracting tessdata components from /content/drive/MyDrive/tesseract/tessdata/jpn.traineddata
Wrote jpn.lstm
Version string:4.00.00alpha:jpn:synth20170629:[1,48,0,1Ct3,3,16Mp3,3Lfys64Lfx96Lrx96Lfx512O1c1]
0:config:size=2563, offset=192
17:lstm:size=12936715, offset=2755
18:lstm-punc-dawg:size=2602, offset=12939470
19:lstm-word-dawg:size=1167978, offset=12942072
20:lstm-number-dawg:size=50, offset=14110050
21:lstm-unicharset:size=173324, offset=14110100
22:lstm-recoder:size=46601, offset=14283424
23:version:size=80, offset=14330025


# 7. Train the LSTM Model

Trained data will be inside the `/content/output` folder.

Adjust `--max_iterations` as needed. Increasing it generally lowers the training error, at the cost of a longer run.

In [None]:
%%shell

rm -rf output/*
OMP_THREAD_LIMIT=16 lstmtraining \
	--continue_from jpn.lstm \
	--model_output output/handwriting \
	--traineddata /content/drive/MyDrive/tesseract/tessdata/jpn.traineddata \
	--train_listfile train/jpn.training_files.txt \
	--max_iterations 3000

Loaded file jpn.lstm, unpacking...
Continuing from jpn.lstm
Loaded 30/30 lines (1-30) of document train/0.lstmf
Loaded 30/30 lines (1-30) of document train/5.lstmf
Loaded 30/30 lines (1-30) of document train/1.lstmf
Loaded 30/30 lines (1-30) of document train/2.lstmf
Loaded 30/30 lines (1-30) of document train/4.lstmf
Loaded 30/30 lines (1-30) of document train/3.lstmf
Loaded 30/30 lines (1-30) of document train/7.lstmf
Loaded 30/30 lines (1-30) of document train/8.lstmf
Loaded 30/30 lines (1-30) of document train/6.lstmf
Loaded 30/30 lines (1-30) of document train/9.lstmf
2 Percent improvement time=100, best error was 100 @ 0
At iteration 100/100/100, Mean rms=1.597%, delta=14.413%, char train=60.1%, word train=100%, skip ratio=0%,  New best char error = 60.1 wrote best model:output/font_name60.1_100.checkpoint wrote checkpoint.

2 Percent improvement time=100, best error was 60.1 @ 100
At iteration 200/200/200, Mean rms=1.469%, delta=12.3%, char train=51.167%, word train=100%, skip r
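
To judge whether more iterations are still helping, the progress lines printed above can be parsed and compared. This is a small sketch based only on the line format shown in this output. `parse_progress` is a made-up helper, and treating the first of the three counters as the iteration number is an assumption.

```python
import re

# Matches progress lines such as:
# "At iteration 100/100/100, Mean rms=1.597%, ..., char train=60.1%, word train=100%, ..."
PROGRESS_RE = re.compile(
    r"At iteration (\d+)/\d+/\d+, .*?char train=([\d.]+)%, word train=([\d.]+)%"
)


def parse_progress(line):
    """Return iteration and error rates from one progress line, or None if it does not match."""
    match = PROGRESS_RE.search(line)
    if match is None:
        return None
    return {
        "iteration": int(match.group(1)),    # assumption: first counter is the iteration
        "char_error": float(match.group(2)),  # "char train" percentage
        "word_error": float(match.group(3)),  # "word train" percentage
    }


sample = (
    "At iteration 100/100/100, Mean rms=1.597%, delta=14.413%, "
    "char train=60.1%, word train=100%, skip ratio=0%,"
)
print(parse_progress(sample))
```

If the character error stops falling between checkpoints, raising `--max_iterations` further is unlikely to help much.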



# 8. Get the Trained Model

This command creates the final trained data from the `handwriting_checkpoint` file written during training.
The result will be inside the `/content/output` folder.

In [None]:
%%shell

lstmtraining --stop_training \
	--continue_from output/handwriting_checkpoint \
	--traineddata /content/drive/MyDrive/tesseract/tessdata/jpn.traineddata \
	--model_output output/handwriting.traineddata

Loaded file output/font_name_checkpoint, unpacking...




# 9. Model Inference

Download `handwriting.traineddata` and place it in this folder on a Linux machine:

/usr/share/tesseract-ocr/4.00/tessdata

You can put your `.jpg` image anywhere in the Colab filesystem. Modify the paths in the following command to generate the prediction `.txt` file.

For different `psm` flag options, see: https://pyimagesearch.com/2021/11/15/tesseract-page-segmentation-modes-psms-explained-how-to-improve-your-ocr-accuracy/

In [None]:
!tesseract /content/test/test.jpg /content/test/predict1 -l handwriting --psm 6

Tesseract Open Source OCR Engine v4.1.1 with Leptonica
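
The same call can be assembled from Python, which is convenient when looping over many images or PSM values. The sketch below only builds the argument list. The helper names are hypothetical, and pointing `--tessdata-dir` at `/content/output` (instead of copying the model into the system tessdata folder) is an assumption about where `handwriting.traineddata` lives.

```python
import subprocess


def build_ocr_command(image, out_base, lang="handwriting", psm=6, tessdata_dir=None):
    """Build the tesseract CLI arguments used in the cell above as a list."""
    cmd = ["tesseract", str(image), str(out_base), "-l", lang, "--psm", str(psm)]
    if tessdata_dir is not None:
        # Optional: tell tesseract where to find <lang>.traineddata.
        cmd += ["--tessdata-dir", str(tessdata_dir)]
    return cmd


def run_ocr(cmd):
    """Run the command; raises CalledProcessError if tesseract exits with an error."""
    subprocess.run(cmd, check=True)


cmd = build_ocr_command(
    "/content/test/test.jpg",
    "/content/test/predict1",
    tessdata_dir="/content/output",  # assumption: model left in the output folder
)
print(" ".join(cmd))
# run_ocr(cmd)  # uncomment on a machine where tesseract and the image are available
```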
