Compare the speed of various Japanese tokenizers in Python.
# Tokenizer Benchmarks

This repository has scripts for benchmarking Japanese tokenizers. It was originally created to make sure that fugashi wasn't slower than mecab-python3.

The benchmark task is to build a word frequency count from the Aozora Bunko edition of "I Am a Cat", stored in the wagahai.txt file, using a Counter object.

I suggest using hyperfine for benchmarking, though anything that can run the scripts is adequate.

To run:

```sh
# install mecab, unidic, and hyperfine with your OS package manager
pip install fugashi mecab-python3 sudachipy natto-py
# sudachipy needs its own dictionary
pip install sudachidict_core
hyperfine -w 10 ./bench*.py
```
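If hyperfine is not available, a simple timing harness is enough to reproduce the mean/min/max columns below. This is a minimal sketch, not part of the repository; the script path passed to it is a placeholder.

```python
# Fallback timing harness when hyperfine is unavailable: run a command
# several times and report mean/min/max wall-clock time in milliseconds.
import statistics
import subprocess
import sys
import time


def bench(cmd, runs=5):
    """Run cmd `runs` times; return (mean, min, max) in milliseconds."""
    times = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(cmd, check=True)
        times.append((time.perf_counter() - start) * 1000)
    return statistics.mean(times), min(times), max(times)


# Demo on a trivial command; replace with e.g. [sys.executable, "bench_x.py"]
mean_ms, min_ms, max_ms = bench([sys.executable, "-c", "pass"], runs=3)
print(f"mean {mean_ms:.1f} ms, min {min_ms:.1f} ms, max {max_ms:.1f} ms")
```

Unlike hyperfine, this does no warmup runs (hyperfine's `-w 10` above discards ten runs first), so expect noisier numbers.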

Results on my machine:

| Command | Mean [ms] | Min [ms] | Max [ms] | Relative |
|:---|---:|---:|---:|---:|
| `./` | 266.8 ± 1.2 | 265.0 | 269.1 | 1.0 |
| `./` | 255.6 ± 2.3 | 251.9 | 259.7 | 1.0 |
| `./` | 1178.3 ± 27.8 | 1153.9 | 1230.5 | 4.6 |
| `./` | 58495.8 ± 283.2 | 58157.2 | 58898.5 | 228.9 |