The aim of this repository is to generate datasets (image & its label) for OCR training.

hiyali/pilgen

Pilgen - (WIP: optimizing)

Python Image-Label dataset Generator for OCR

Generate

python3 generate.py --lang ug --count 100 --out-dir data/

This command writes 100 images to the folder data/images/. The filename pattern is 'word_{}.jpg'.format(line_num), for example:

data/images/word_1.jpg
data/images/word_2.jpg
...
data/images/word_100.jpg

It also writes a gt.txt file in which each line follows the pattern '{}\t{}'.format(filepath, word), like below:

data/images/word_1.jpg	ئانا
data/images/word_2.jpg	تىلىم
...
data/images/word_100.jpg	گۈللە
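
The tab-separated layout above can be read back into (filepath, word) pairs before feeding it to an OCR training pipeline. The snippet below is a minimal sketch, not part of pilgen itself: the helper names parse_gt_lines and load_gt are hypothetical, and the location data/gt.txt is an assumption, since the README does not say where gt.txt is written.

```python
from pathlib import Path
from typing import Iterable, List, Tuple


def parse_gt_lines(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Split '{filepath}\\t{word}' lines into (filepath, word) pairs.

    Hypothetical helper; blank lines are skipped, and a line without
    a tab raises ValueError so malformed entries are not silently lost.
    """
    pairs = []
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue
        # Split on the first tab only, in case the label itself contains one.
        filepath, sep, word = line.partition("\t")
        if not sep:
            raise ValueError("line {} has no tab separator: {!r}".format(number, line))
        pairs.append((filepath, word))
    return pairs


def load_gt(gt_path) -> List[Tuple[str, str]]:
    """Read a gt.txt file as UTF-8 (the labels are Uyghur text)."""
    with open(gt_path, encoding="utf-8") as handle:
        return parse_gt_lines(handle)


if __name__ == "__main__":
    gt_file = Path("data/gt.txt")  # assumed location, adjust to your output
    if gt_file.exists():
        for filepath, word in load_gt(gt_file)[:5]:
            print(filepath, word)
```

Opening the file with an explicit encoding="utf-8" avoids depending on the platform default, which matters for Arabic-script labels.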

Supported languages

  • ug - Uyghur (Uighur)
  • other languages may be added later

FAQ

  1. How do I use my own corpus?

Ref: #2

  2. Are Uyghur words separated in the image?

Ref: #2

Test

python3 test.py

Development environment

  • Ubuntu 18.04.1
  • Python 3.6.9

Author

Salam Hiyali

Contribute

Feel free to contribute.

License

MIT
