A simple example for training a machine learning model

This example demonstrates the process of collecting data and training an image classification model.

We are going to train a Hiragana (ひらがな) classifier (single character only).

The dataset will be generated from fonts.

Install dependencies

This project requires Python 3.6+ because the code uses f-strings.

pip install -r requirements.txt

What do we want to classify?

We need to specify the limited set of characters that we want to classify. I put a list of basic Hiragana (without variants) in hiragana.txt. You may add more characters there. Remember to save the file in UTF-8 encoding.

Because this is a classification problem, we need to define the output format. Specifically, we want to pin down which character takes which index position in the output. If this order is not kept consistent, we might interpret the model output incorrectly. We will generate a JSON file that specifies the output format: an array of characters whose order defines the index of each class. This is useful when we export the model to TensorFlow Lite or TensorFlow.js, where the model doesn't contain any information about what each output represents.
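To make the role of the label order concrete, here is a minimal sketch of how a model output could be mapped back to a character. The function names `load_labels` and `decode_prediction` are made up for this illustration and are not part of the project; the sketch only assumes that the JSON file is a plain array of characters and that the model returns one score per class.

```python
import json
from typing import List, Sequence


def load_labels(path: str) -> List[str]:
    """Load the ordered label array; index i is the class for output i."""
    with open(path, encoding='utf-8') as f:
        return json.load(f)


def decode_prediction(scores: Sequence[float], labels: Sequence[str]) -> str:
    """Return the character whose index has the highest score."""
    if len(scores) != len(labels):
        # A length mismatch means the label file does not belong to this model.
        raise ValueError(f'expected {len(labels)} scores, got {len(scores)}')
    best_index = max(range(len(scores)), key=lambda i: scores[i])
    return labels[best_index]
```

If the label array were reordered after training, `decode_prediction` would still return a character without any error, just the wrong one. That is why the order has to be fixed once and shipped together with the exported model.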

notebook

This notebook will guide you through generating the labels.json file. The order of the characters is simply the order in which they appear in the hiragana.txt file. There are more suitable ways to order these characters, but for the sake of simplicity, this ordering method gets the job done.
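The ordering rule described above can be sketched in a few lines. This is not the notebook's code. It assumes that every non-whitespace character in hiragana.txt is a label (the actual file layout is not described here), and the names `read_labels` and `write_labels` are hypothetical.

```python
import json
from typing import List


def read_labels(text: str) -> List[str]:
    """Return the unique non-whitespace characters of `text` in first-seen order."""
    seen = set()
    labels = []
    for ch in text:
        if ch.isspace() or ch in seen:
            continue
        seen.add(ch)
        labels.append(ch)
    return labels


def write_labels(source_path: str, output_path: str) -> List[str]:
    """Read the character list and save it as a JSON array."""
    with open(source_path, encoding='utf-8') as f:
        labels = read_labels(f.read())
    with open(output_path, 'w', encoding='utf-8') as f:
        # ensure_ascii=False keeps the characters readable in the JSON file.
        json.dump(labels, f, ensure_ascii=False)
    return labels
```

Calling `write_labels('hiragana.txt', 'labels.json')` would then produce the label file. Duplicates are dropped so that each character gets exactly one index.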

Where do we get the dataset?

When I got into machine learning, all the tutorials had a nice, clean dataset ready for us to press a button and be done. In this sample, we have to create or find the dataset and clean it ourselves. Well, I did clean the dataset to make this sample work. But I hope this example can help beginners who haven't been able to do anything new beyond running through the tutorials.

We have to collect Hiragana-compatible fonts, put them in the fonts directory, and use them to generate the character images.
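As a rough illustration of what "generating a dataset from fonts" involves, the sketch below lists the font files in a directory and pairs each font with each label. It assumes one image per (font, character) pair and a guessed set of font file extensions. Rendering the images is left out because it needs an imaging library. `find_fonts` and `plan_samples` are hypothetical names, not functions from create-dataset.py.

```python
from itertools import product
from pathlib import Path
from typing import Iterable, List, Tuple

# Assumed set of font file types; adjust to the fonts you collected.
FONT_EXTENSIONS = {'.ttf', '.otf', '.ttc'}


def find_fonts(font_dir: str) -> List[Path]:
    """Return the font files found under `font_dir`, sorted for reproducibility."""
    return sorted(
        p for p in Path(font_dir).rglob('*')
        if p.is_file() and p.suffix.lower() in FONT_EXTENSIONS
    )


def plan_samples(fonts: Iterable[Path], labels: Iterable[str]) -> List[Tuple[Path, str]]:
    """Pair every font with every label, assuming one image per pair."""
    return list(product(fonts, labels))
```

With 10 fonts and 46 labels this plan would contain 460 samples, so the number of fonts directly controls the dataset size.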

Create the dataset

python3 create-dataset.py

By default, this script will take labels.json as input. Add -h for usage information.

Inspect the dataset

python3 inspection-server.py

This will start a local web server for inspecting and validating the dataset in a browser at http://localhost:3000/index.html.

You can click a label to show its records or to mark that label as done (all the records for that label have been reviewed). You can click an image to mark the record as invalid, or to mark all the records with the same font as invalid. This process cannot and should not be automated.
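The bookkeeping behind these clicks could look like the sketch below. The `Record` class and its fields are assumptions made for illustration and do not reflect the project's actual storage format. The decision about which record or font is invalid still comes from a person looking at the images; the code only records that decision.

```python
from typing import List


class Record:
    """One generated image: the character it should show and its source font."""

    def __init__(self, label: str, font: str, invalid: bool = False) -> None:
        self.label = label
        self.font = font
        self.invalid = invalid


def mark_font_invalid(records: List[Record], font: str) -> int:
    """Flag every record rendered with `font`; return how many were newly flagged."""
    count = 0
    for record in records:
        if record.font == font and not record.invalid:
            record.invalid = True
            count += 1
    return count


def valid_records(records: List[Record]) -> List[Record]:
    """Return only the records that have not been marked invalid."""
    return [r for r in records if not r.invalid]
```

Marking by font is useful because a font that lacks Hiragana glyphs tends to produce bad images for every character, not just one.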

Note

I use type hints in Python so that, most of the time, you or future me can tell where something comes from.
I also have type hinting for JavaScript with jshint. I didn't choose TypeScript because I wanted to reduce dependencies.

References

argparse - Python Documentation
decorator - Corey Schafer's video
classmethod - Python Documentation

