This is a small script that creates OCR training data for Japanese addresses.
The address data comes from Japanese government sites:
City data (data/cities.txt) comes from Japan Post: https://www.post.japanpost.jp/zipcode/download.html
Company address data comes from the Japan National Tax Agency: https://www.houjin-bangou.nta.go.jp/download/zenken/. I downloaded that data and processed it to generate alldata.txt. It contains 4,943,568 addresses, so it covers almost every character used in Japanese addresses.
Run gen_jp_addresses.py. It generates three files:
- areas.txt, which contains the city and town names
- numbers.txt, which contains the numbered part of the address, such as chome (丁目) and banchi (番地)
- details.txt, which contains the remaining part of the address, such as the building name, floor, or room number
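To illustrate how one address maps onto these three parts, here is a minimal sketch. It is not the logic used in gen_jp_addresses.py; the function name split_address and the regular expression are assumptions made only for this example, and the pattern ignores kanji numerals and many other real-world variations.

```python
import re

# Hypothetical pattern: the numbered part starts at the first digit and runs
# through digits, hyphens, and the units 丁目 / 番地 / 番 / 号.
# \d also matches full-width digits in Python 3, but not kanji numerals.
NUMBERED = re.compile(r"\d[\d丁目番地号\-]*")


def split_address(address):
    """Split an address into (area, numbers, details) in a simplified way."""
    match = NUMBERED.search(address)
    if match is None:
        # No numbered part found: treat the whole string as the area.
        return address, "", ""
    area = address[:match.start()]
    numbers = match.group(0)
    details = address[match.end():]
    return area, numbers, details


if __name__ == "__main__":
    print(split_address("東京都千代田区丸の内1丁目9番1号グラントウキョウノースタワー"))
```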
You can read more about the Japanese address system here
Once the files are in the output folder, you can generate the training data with your favorite tool.
I prefer text_renderer. Below is a short introduction to using text_renderer to generate the training data.
First, install text_renderer by following its README file.
Copy data/cities.txt, output/areas.txt, output/numbers.txt, and output/details.txt to example_data/text, and copy the chars.txt provided in this repo to example_data/chars.
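If you would rather script this copy step, a minimal sketch follows. The function name stage_files is hypothetical, and it assumes chars.txt sits at the root of this repo; adjust the paths to match your checkout.

```python
import shutil
from pathlib import Path

# Text files named in the step above, relative to this repo.
TEXT_FILES = [
    "data/cities.txt",
    "output/areas.txt",
    "output/numbers.txt",
    "output/details.txt",
]


def stage_files(repo_dir, example_data_dir):
    """Copy the text corpora and chars.txt into text_renderer's example_data."""
    repo = Path(repo_dir)
    dest = Path(example_data_dir)
    text_dir = dest / "text"
    chars_dir = dest / "chars"
    text_dir.mkdir(parents=True, exist_ok=True)
    chars_dir.mkdir(parents=True, exist_ok=True)

    copied = []
    for rel in TEXT_FILES:
        copied.append(Path(shutil.copy(repo / rel, text_dir)))
    # Assumption: chars.txt is in the repo root.
    copied.append(Path(shutil.copy(repo / "chars.txt", chars_dir)))
    return copied
```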
Copy jpn_addr.py to the example_data folder. You can adjust this file; currently it generates 550,000 training images and 55,000 evaluation images. I think that is the minimum needed to produce sufficient data if you use about 10 fonts.
Copy the font files to example_data/font, and write the names of the fonts you want to use in the example_data/font_list/font_list.txt file.
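The font list can also be written by a small script. The sketch below assumes that font_list.txt holds one font file name per line and that every font in the folder should be used; the function name write_font_list and the extension filter are hypothetical, so check the text_renderer documentation for the exact format it expects.

```python
from pathlib import Path


def write_font_list(font_dir, font_list_path, extensions=(".ttf", ".otf", ".ttc")):
    """Write the names of all font files in font_dir, one per line."""
    names = sorted(
        p.name for p in Path(font_dir).iterdir()
        if p.is_file() and p.suffix.lower() in extensions
    )
    out = Path(font_list_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(names) + "\n", encoding="utf-8")
    return names


if __name__ == "__main__":
    import sys
    # Usage: python write_font_list.py example_data/font example_data/font_list/font_list.txt
    if len(sys.argv) == 3:
        print(write_font_list(sys.argv[1], sys.argv[2]))
```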
You can also redefine the background of the training data; please read the text_renderer manual.
Now you can just run
python3 main.py --config example_data/jpn_addr.py --dataset img --num_processes 8 --log_period 5
to generate the training data. Set --num_processes to at most the number of your CPU cores. If you want to use the lmdb data format, change img to lmdb.
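As a rough sketch of choosing --num_processes, the helper below caps the requested value at the core count reported by Python. The function name build_command is hypothetical; the flags are the ones shown in the command above.

```python
import os


def build_command(config, dataset="img", requested_processes=8):
    """Build the text_renderer command, capping processes at the CPU core count."""
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    num_processes = max(1, min(requested_processes, cores))
    return [
        "python3", "main.py",
        "--config", config,
        "--dataset", dataset,  # "img" or "lmdb"
        "--num_processes", str(num_processes),
        "--log_period", "5",
    ]


if __name__ == "__main__":
    print(" ".join(build_command("example_data/jpn_addr.py")))
```

The resulting list can be passed to subprocess.run from the text_renderer directory.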
I use PaddleOCR to train the OCR model. You need to make some changes so that the training data you just generated can be used by PaddleOCR.
PaddleOCR uses a plain-text labels file, but the labels file generated by text_renderer is in JSON format, so you need to convert it. I provide trans_labels.py to convert the JSON file to a text file.
Assuming you generated the training data at example_data/jpn_addr/training, run
python trans_labels.py xxx/example_data/jpn_addr/training/labels.json xxx/example_data/jpn_addr/training/train_list.txt
to get train_list.txt.
Then do the same for the evaluation labels file.
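For reference, the kind of conversion involved can be sketched as follows. This is not the code of trans_labels.py. It assumes that labels.json has a top-level "labels" object mapping image IDs to text, that images are stored as images/<id>.jpg, and that PaddleOCR expects one tab-separated "image path, label" pair per line; verify each of these against your generated data and the PaddleOCR documentation.

```python
import json


def labels_to_lines(labels_json, image_dir="images", ext=".jpg"):
    """Turn a parsed labels.json dict into 'path<TAB>label' lines."""
    # Assumption: {"labels": {"000000000": "some text", ...}, ...}
    labels = labels_json["labels"]
    return [
        f"{image_dir}/{image_id}{ext}\t{text}"
        for image_id, text in sorted(labels.items())
    ]


def convert(json_path, txt_path):
    """Read a JSON labels file and write the text labels file."""
    with open(json_path, encoding="utf-8") as f:
        data = json.load(f)
    lines = labels_to_lines(data)
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write("\n".join(lines) + "\n")
    return len(lines)
```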
Then you can train your own OCR model with PaddleOCR.
If you have any questions, please open an issue. You can write in Chinese, Japanese, or English.