Introduction

This is a small script that generates OCR training data for Japanese addresses.

The address data comes from Japanese government sites:

The city data (data/cities.txt) is fetched from Japan Post: https://www.post.japanpost.jp/zipcode/download.html

The company address data is fetched from the Japan National Tax Agency: https://www.houjin-bangou.nta.go.jp/download/zenken/. I downloaded the data and processed it to generate alldata.txt. It contains 4,943,568 addresses, so it covers almost every character used in Japanese addresses.
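The character coverage described above can be checked by collecting the distinct characters in an address file. The following is a minimal sketch, not necessarily how the chars.txt in this repo was produced; the function name collect_chars and the one-address-per-line layout are assumptions made for illustration.

```python
def collect_chars(lines):
    """Return the sorted list of distinct characters found in the given lines.

    Assumption: each line holds one address; surrounding whitespace is ignored.
    """
    chars = set()
    for line in lines:
        chars.update(line.strip())
    return sorted(chars)


# Hypothetical usage, assuming alldata.txt holds one address per line:
# with open("alldata.txt", encoding="utf-8") as f:
#     charset = collect_chars(f)
# print(len(charset), "distinct characters")
```

Comparing the result against chars.txt is one way to confirm that no character in the addresses is missing from the recognition dictionary.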

Usage

Run gen_jp_addresses.py. It generates three files:

areas.txt, which contains the city and town names
numbers.txt, which contains the numbered part of the address, such as chome (丁目) and banchi (番地)
details.txt, which contains the remaining part of the address, such as the building name, floor, or room number
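To make the three-way split concrete, here is a rough sketch of how one address could be divided into an area, a numbered part, and details. It is an illustration only, not the logic of gen_jp_addresses.py: the function split_address, the regular expression, and the sample address are all assumptions, and real addresses have many more variants than this pattern handles.

```python
import re

# Assumption: the numbered part starts at the first digit and continues over
# digits, hyphens, and the markers 丁目 / 番地 / 番 / 号.
# In Python 3, \d also matches full-width digits such as "１".
_NUMBERED = re.compile(r"\d+(?:丁目|番地|番|号|[-−]|\d)*")


def split_address(address):
    """Split an address into (area, numbers, details)."""
    match = _NUMBERED.search(address)
    if match is None:
        # No numbered part found: treat the whole string as the area.
        return address, "", ""
    area = address[:match.start()]
    numbers = match.group()
    details = address[match.end():].strip()
    return area, numbers, details


# Illustrative address, made up for this example.
area, numbers, details = split_address("東京都千代田区丸の内1丁目9番地1号丸の内ビル3階")
print(area)     # goes to areas.txt in this sketch
print(numbers)  # goes to numbers.txt in this sketch
print(details)  # goes to details.txt in this sketch
```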

You can read more about the Japanese address system here.

Once the files are in the output folder, you can generate the training data with your favorite tool.

I prefer text_renderer.

Below is a brief introduction to using text_renderer to generate the training data.

Generate training data with text_renderer

First, follow the text_renderer README to install it.

Copy data/cities.txt, output/areas.txt, output/numbers.txt, and output/details.txt to example_data/text, and copy the chars.txt provided in this repo to example_data/chars.

Copy jpn_addr.py to the example_data folder. You can adjust this file; currently it generates 550,000 training images and 55,000 evaluation images. I think this is the minimum needed to produce sufficient data if you use about 10 fonts.

Copy the font files to example_data/font, and list the names of the fonts you want to use in example_data/font_list/font_list.txt.
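The font list can also be written by a short script instead of by hand. This is a minimal sketch that assumes font_list.txt holds one font file name per line and that fonts have a .ttf, .otf, or .ttc extension; the helper write_font_list is hypothetical, so check the text_renderer documentation for the exact format it expects.

```python
from pathlib import Path

# Assumption: these are the font file extensions in use.
FONT_SUFFIXES = {".ttf", ".otf", ".ttc"}


def write_font_list(font_dir, list_path):
    """Write the names of all font files in font_dir to list_path, one per line."""
    names = sorted(
        p.name
        for p in Path(font_dir).iterdir()
        if p.is_file() and p.suffix.lower() in FONT_SUFFIXES
    )
    list_path = Path(list_path)
    list_path.parent.mkdir(parents=True, exist_ok=True)
    list_path.write_text("".join(name + "\n" for name in names), encoding="utf-8")
    return names


# Hypothetical usage with the paths named in this README:
# write_font_list("example_data/font", "example_data/font_list/font_list.txt")
```

Edit the resulting file afterwards if you want to use only a subset of the copied fonts.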

You can also change the background of the training images; see the text_renderer manual for details.

Now you can just run

python3 main.py --config example_data/jpn_addr.py --dataset img --num_processes 8 --log_period 5

to generate the training data. Set --num_processes to a value less than or equal to the number of CPU cores. If you want to use the lmdb data format, change img to lmdb.
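If you are unsure how many cores the machine has, the process count can be derived in Python. The sketch below only builds the command line shown above, using the flags already named in this README; the helper names pick_num_processes and build_command and the reserve parameter are illustrative assumptions.

```python
import os


def pick_num_processes(reserve=0):
    """Return a process count that never exceeds the number of CPU cores.

    reserve is a hypothetical knob to leave some cores free for other work.
    """
    cores = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cores - reserve)


def build_command(config="example_data/jpn_addr.py", dataset="img", reserve=0):
    """Build the text_renderer command line as a list of arguments."""
    return [
        "python3", "main.py",
        "--config", config,
        "--dataset", dataset,  # "img" or "lmdb"
        "--num_processes", str(pick_num_processes(reserve)),
        "--log_period", "5",
    ]


print(" ".join(build_command()))
```

The returned list can be passed to subprocess.run from the text_renderer directory, or the printed line can be pasted into a shell.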

Training with PaddleOCR

I use PaddleOCR to train the OCR model. You need to make some changes so that PaddleOCR can use the training data you just generated.

PaddleOCR uses a plain-text label file, but the label file generated by text_renderer is in JSON format, so you need to convert it. I provide trans_labels.py to convert the JSON file to a text file.

Assuming you generated the training data at example_data/jpn_addr/training, run

python trans_labels.py xxx/example_data/jpn_addr/training/labels.json xxx/example_data/jpn_addr/training/train_list.txt

to get train_list.txt.

Repeat this for the evaluation label file.
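For readers who want to see what such a conversion involves, here is a minimal sketch. It is not the trans_labels.py shipped in this repo. It assumes that labels.json has a top-level "labels" object mapping image names to text, that images are stored as images/<name>.jpg, and that PaddleOCR expects one "path<TAB>text" pair per line; verify all three against your generated data and the PaddleOCR documentation.

```python
import json
from pathlib import Path


def convert_labels(json_path, txt_path, image_dir="images", ext=".jpg"):
    """Convert a JSON label file into a tab-separated text label file.

    Assumptions: the JSON has a "labels" dict of {image_name: text}, and
    image_dir / ext describe where the rendered images live.
    """
    data = json.loads(Path(json_path).read_text(encoding="utf-8"))
    labels = data["labels"]
    lines = [
        f"{image_dir}/{name}{ext}\t{text}"
        for name, text in sorted(labels.items())
    ]
    Path(txt_path).write_text(
        "".join(line + "\n" for line in lines), encoding="utf-8"
    )
    return len(lines)


# Hypothetical usage:
# convert_labels("labels.json", "train_list.txt")
```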

You can then train your own OCR model with PaddleOCR.

Q&A

If you have any questions, please open an issue. You can write in Chinese, Japanese, or English.

mikeshi80/gen_japan_address_train_data
