tatoeba_tinysegmenter

A small experiment using both MeCab and TinySegmenter to create a tokenized list of Japanese sentences in JSON, taken from the Tatoeba corpus. In this experiment, MeCab appears to be noticeably more accurate at tokenization than TinySegmenter.

Input: a CSV file with a column of Japanese sentences. The first (header) row should be 'sentence'.
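For example, a minimal input file might look like this (the two sentences are placeholders, not taken from the corpus):

    sentence
    私の名前は中野です。
    今日はいい天気ですね。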

Output: a JSON object with the key 'sentences', whose value is an array of arrays; each inner array holds the tokens of one sentence.
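Running MeCab over the example input above would yield output along these lines (the exact token boundaries are illustrative and depend on the installed dictionary):

    {"sentences": [["私", "の", "名前", "は", "中野", "です", "。"],
                   ["今日", "は", "いい", "天気", "です", "ね", "。"]]}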

Setup:

  1. Download the CSV file of the Tatoeba Japanese corpus: https://tatoeba.org/eng/downloads
  2. Add a first (header) row and enter 'sentence' in the topmost cell.
  3. Install TinySegmenter: pip install tinysegmenter
  4. Install MeCab: pip install mecab-python3 (see the Windows directions at https://github.com/SamuraiT/mecab-python3). A sketch of the full tokenization pipeline follows this list.
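
The sketch below shows how the pieces can be wired together. It is a minimal example under stated assumptions, not the repository's actual script: the file names sentences.csv and sentences_tokenized.json are placeholders, and the MeCab tagger assumes a dictionary is installed.

    import csv
    import json

    import MeCab
    import tinysegmenter

    segmenter = tinysegmenter.TinySegmenter()
    # MeCab in wakati (space-separated) output mode; this assumes a
    # dictionary is available (e.g. via 'pip install unidic-lite').
    tagger = MeCab.Tagger("-Owakati")

    def tokenize_mecab(text):
        """Tokenize with MeCab and return a list of token strings."""
        return tagger.parse(text).split()

    def tokenize_tinysegmenter(text):
        """Tokenize with TinySegmenter and return a list of token strings."""
        return segmenter.tokenize(text)

    # Read the 'sentence' column from the input CSV (header row required).
    with open("sentences.csv", encoding="utf-8", newline="") as f:
        sentences = [row["sentence"] for row in csv.DictReader(f)]

    # Tokenize every sentence and wrap the result in the output format
    # described above: {"sentences": [[token, ...], ...]}.
    output = {"sentences": [tokenize_mecab(s) for s in sentences]}

    with open("sentences_tokenized.json", "w", encoding="utf-8") as f:
        json.dump(output, f, ensure_ascii=False)

Swapping tokenize_mecab for tokenize_tinysegmenter in the list comprehension switches the tokenizer, which makes it easy to compare the two libraries' output side by side.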

The Python port of TinySegmenter is courtesy of Masato Hagiwara; the original TinySegmenter is courtesy of Taku Kudo. https://pypi.org/project/tinysegmenter/ http://tinysegmenter.tuxfamily.org/

MeCab official documentation (in Japanese): https://taku910.github.io/mecab/

The Tatoeba corpus is licensed under CC BY 2.0 FR. https://creativecommons.org/licenses/by/2.0/fr/ https://tatoeba.org/eng/downloads
