hesta-io/Zhir-Bootcamp

ZhirAI Bootcamp

This repo serves as the starting point and documentation for training new models. Because Tesseract ships with a large number of pre-trained models (.traineddata), we don't need to train models from scratch; we can fine-tune an existing model to improve its accuracy. To learn more about the training options, read the official documentation.

Steps

  1. Install Tesseract 4.1 with the training binaries: Linux/Windows (on Windows, also add it to PATH). Note: version 5-alpha appears to have a bug and cannot be used for training yet.

    sudo apt install tesseract-ocr
    sudo apt install libtesseract-dev
    
  2. Prepare the raw data and put it in the langdata folder.

  3. Place all fonts in the fonts folder.

  4. Run find . -type f -print0 | xargs -0 dos2unix in a terminal to convert all files to Unix line endings.

  5. Generate ground truth (about 18 hours for 13.5 million lines in total):

    python3 -m pip install image
    python3 -m pip install python-bidi
    sudo chmod +x *.sh
    nohup ./1-txt2lstmf.sh ckb > 1-ckb.log &
    
  6. Run training (at least 24 hours). Note: make sure the number of characters in the unicharset matches the number specified in the training script (2-train-layer.sh). More information.

    nice --20 nohup ./2-train-layer.sh ckb > training.log &
    
  7. Create best and fast .traineddata files from each .checkpoint file:

    make traineddata MODEL_NAME=ckb
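
For the check in step 6, the number of entries in a unicharset file is stored on its first line, so it can be read without opening the whole file. The following is a minimal sketch, not part of this repo: the function names are hypothetical, and both the path to the unicharset file and the expected number (whatever 2-train-layer.sh specifies) must be supplied by you.

```shell
# Print the entry count stored on the first line of a unicharset file.
# $1: path to the unicharset file (location depends on your setup).
unicharset_size() {
    head -n 1 "$1" | tr -d '[:space:]'
}

# Succeed only if the file's entry count equals the expected number.
# $1: path to the unicharset file, $2: number taken from the training script.
check_unicharset_size() {
    actual=$(unicharset_size "$1")
    if [ "$actual" = "$2" ]; then
        echo "OK: $actual entries"
    else
        echo "MISMATCH: unicharset has $actual entries, expected $2" >&2
        return 1
    fi
}
```

Call it as check_unicharset_size path/to/unicharset N before starting the training run, where N is the number you read from the script.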
    

Useful scripts:

# See available fonts in a folder
text2image --fonts_dir path/to/fonts --list_available_fonts

# Open a log file and scroll to the end:
less +G ./1-ckb.log

# Run 1-txt2lstmf in background:
nohup ./1-txt2lstmf.sh ckb > 1-ckb.log &

# To kill a process and see system resource usage:
htop

# Run eight txt2lstmf jobs in the background, one per fonts-8 subfolder:
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-1 > logs/1.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-2 > logs/2.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-3 > logs/3.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-4 > logs/4.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-5 > logs/5.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-6 > logs/6.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-7 > logs/7.log &
nice --20 nohup ./1-txt2lstmf.sh ckb fonts-8/ckb-8 > logs/8.log &

# Get disk info:
df

# Get directory size:
du -sh bootcamp

# Count files in a directory (matching a name pattern, then all files):
find bootcamp/gt -name '01_Sarchia_Abdulkareem.200.4*' | wc -l
find bootcamp/gt -type f | wc -l
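
The eight near-identical txt2lstmf commands above differ only in the subfolder and log number, so they could be generated by a loop. This is a sketch under assumptions: launch_jobs and the DRY_RUN variable are hypothetical names that do not exist in this repo, and the fonts-8/LANG-N and logs/N.log layout is copied from the commands above.

```shell
# Sketch: start one 1-txt2lstmf.sh job per font subfolder.
# Usage: launch_jobs LANG COUNT
# Set DRY_RUN=1 to print the commands instead of running them.
launch_jobs() {
    lang=$1
    count=$2
    i=1
    while [ "$i" -le "$count" ]; do
        fonts="fonts-8/$lang-$i"
        log="logs/$i.log"
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "nice --20 nohup ./1-txt2lstmf.sh $lang $fonts > $log &"
        else
            # Same command as the hand-written lines, run in the background.
            nice --20 nohup ./1-txt2lstmf.sh "$lang" "$fonts" > "$log" &
        fi
        i=$((i + 1))
    done
}
```

Running DRY_RUN=1 launch_jobs ckb 8 first shows the commands that would be started, which makes it easy to compare them against the list above before launching anything.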

Notes:

  1. Make sure langdata/ckb/ckb.fontslist.txt has at least one font and an empty line at the end!
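
The note above can be checked automatically before a long run. The sketch below assumes that "an empty line at the end" means the file ends with a newline character, and that each non-blank line is one font name; check_fontslist is a hypothetical helper, not something the repo provides.

```shell
# Verify a fontslist file: at least one font, and a trailing newline.
# $1: path to the file, e.g. langdata/ckb/ckb.fontslist.txt
check_fontslist() {
    file=$1
    # At least one non-blank line, i.e. at least one font name.
    if ! grep -q '[^[:space:]]' "$file"; then
        echo "no fonts listed in $file" >&2
        return 1
    fi
    # The last byte must be a newline. Command substitution strips a
    # trailing newline, so a non-empty result means it is missing.
    if [ -n "$(tail -c 1 "$file")" ]; then
        echo "$file does not end with a newline" >&2
        return 1
    fi
    echo "OK"
}
```

For example, check_fontslist langdata/ckb/ckb.fontslist.txt prints OK when both conditions hold and returns a non-zero status otherwise.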

Troubleshooting

  1. If you have trouble pushing your changes to the repository, increase Git's HTTP post buffer to 500 MiB by running git config http.postBuffer 524288000.
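
The value 524288000 is a byte count: 500 × 1024 × 1024. If a different buffer size is needed, the number can be computed rather than typed by hand. A small sketch, where postbuffer_bytes is a hypothetical helper name:

```shell
# Convert a size in MiB to bytes, the unit http.postBuffer expects.
# $1: size in MiB
postbuffer_bytes() {
    echo $(( $1 * 1024 * 1024 ))
}
```

It could be used as git config http.postBuffer "$(postbuffer_bytes 500)", and git config --get http.postBuffer shows the value currently set.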

Read more