
TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).

Usage

Python

tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter

import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")

With the GIL released on the Rust side, multi-threading is also possible.

from concurrent.futures import ThreadPoolExecutor
import functools

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    tokens = list(e.map(tokenizer, documents))

Rust

Add the crate to your project: cargo add tiniestsegmenter

Usage:

use tiniestsegmenter as ts;

fn main() {
    let tokens: Result<Vec<&str>, ts::TokenizeError> = ts::tokenize("ジャガイモが好きです。");
}
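tokenize returns a Result, so the error case can be handled explicitly. A minimal sketch of doing so, assuming ts::TokenizeError implements Debug:

use tiniestsegmenter as ts;

fn main() {
    match ts::tokenize("ジャガイモが好きです。") {
        // On success, tokens is a Vec<&str> borrowed from the input string.
        Ok(tokens) => println!("{} tokens: {:?}", tokens.len(), tokens),
        // Error formatting assumes TokenizeError derives Debug.
        Err(e) => eprintln!("tokenization failed: {:?}", e),
    }
}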

Performance

tiniestsegmenter can process 2 GB of text in less than 90 seconds on a MacBook Pro, at around 20 MB/s on a single thread.
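To get a rough sense of that throughput on your own data, the sketch below streams a file line by line and reports MB/s. The file name corpus.txt and the measurement code are illustrative; only the tokenize call comes from the crate.

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::Instant;

use tiniestsegmenter as ts;

fn main() -> std::io::Result<()> {
    // Hypothetical input file; substitute any large Japanese text corpus.
    let reader = BufReader::new(File::open("corpus.txt")?);

    let start = Instant::now();
    let (mut bytes, mut tokens) = (0usize, 0usize);
    for line in reader.lines() {
        let line = line?;
        bytes += line.len();
        // Count tokens per line; lines that fail to tokenize are skipped here.
        if let Ok(toks) = ts::tokenize(&line) {
            tokens += toks.len();
        }
    }
    let secs = start.elapsed().as_secs_f64();
    println!("{} tokens, {:.1} MB/s", tokens, bytes as f64 / 1e6 / secs);
    Ok(())
}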

Comparison with similar codebases

Each codebase was benchmarked using the timemachineu8j dataset, a Japanese translation of The Time Machine by Herbert George Wells.

Repo                           Lang     Time (ms)
jwnz/tiniestsegmenter          Rust       11.996
jwnz/tiniestsegmenter          Python     14.803
nyarla/go-japanese-segmenter   Go         36.869
woxtu/rust-tinysegmenter       Rust       44.535
JuliaStrings/TinySegmenter.jl  Julia      45.691
ikawaha/tinysegmenter.go       Go         58.694
SamuraiT/tinysegmenter         Python    219.604

System:
Chip: Apple M2 Pro (MacBook Pro 14-inch, 2023)
Cores: 10
Memory: 16 GB
