
TiniestSegmenter

A port of TinySegmenter written in pure, safe Rust with no dependencies. Bindings are available for both Rust and Python.

TinySegmenter is an n-gram word tokenizer for Japanese text originally built by Taku Kudo (2008).

Usage

Python

tiniestsegmenter can be installed from PyPI: pip install tiniestsegmenter

import tiniestsegmenter

tokens = tiniestsegmenter.tokenize("ジャガイモが好きです。")

With the GIL released on the Rust side, multi-threading is also possible.

from concurrent.futures import ThreadPoolExecutor
import functools

import tiniestsegmenter

tokenizer = functools.partial(tiniestsegmenter.tokenize)

documents = ["ジャガイモが好きです。"] * 10_000
with ThreadPoolExecutor(4) as e:
    tokens = list(e.map(tokenizer, documents))

Rust

Add the crate to your project: cargo add tiniestsegmenter

Usage:

use tiniestsegmenter as ts;

fn main() {
    let tokens: Result<Vec<&str>, ts::TokenizeError> = ts::tokenize("ジャガイモが好きです。");
}
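tokenize returns a Result, so the error case can be handled explicitly. A minimal sketch of doing so, assuming ts::TokenizeError implements Debug:

use tiniestsegmenter as ts;

fn main() {
    match ts::tokenize("ジャガイモが好きです。") {
        // On success, tokens is a Vec<&str> borrowed from the input string.
        Ok(tokens) => println!("{} tokens: {:?}", tokens.len(), tokens),
        // Error formatting assumes TokenizeError derives Debug.
        Err(e) => eprintln!("tokenization failed: {:?}", e),
    }
}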

Performance

tiniestsegmenter can process 2 GB of text in less than 90 seconds on a MacBook Pro, at around 20 MB/s on a single thread.
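To get a rough sense of that throughput on your own data, the sketch below streams a file line by line and reports MB/s. The file name corpus.txt and the measurement code are illustrative; only the tokenize call comes from the crate.

use std::fs::File;
use std::io::{BufRead, BufReader};
use std::time::Instant;

use tiniestsegmenter as ts;

fn main() -> std::io::Result<()> {
    // Hypothetical input file; substitute any large Japanese text corpus.
    let reader = BufReader::new(File::open("corpus.txt")?);

    let start = Instant::now();
    let (mut bytes, mut tokens) = (0usize, 0usize);
    for line in reader.lines() {
        let line = line?;
        bytes += line.len();
        // Count tokens per line; lines that fail to tokenize are skipped here.
        if let Ok(toks) = ts::tokenize(&line) {
            tokens += toks.len();
        }
    }
    let secs = start.elapsed().as_secs_f64();
    println!("{} tokens, {:.1} MB/s", tokens, bytes as f64 / 1e6 / secs);
    Ok(())
}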

Comparison with similar codebases

Each codebase was benchmarked using the timemachineu8j dataset, a Japanese translation of The Time Machine by Herbert George Wells.

Repo                           Lang     Time (ms)
jwnz/tiniestsegmenter          Rust       11.996
jwnz/tiniestsegmenter          Python     14.803
nyarla/go-japanese-segmenter   Go         36.869
woxtu/rust-tinysegmenter       Rust       44.535
JuliaStrings/TinySegmenter.jl  Julia      45.691
ikawaha/tinysegmenter.go       Go         58.694
SamuraiT/tinysegmenter         Python    219.604

System:
Chip: Apple M2 Pro (MacBook Pro 14-inch, 2023)
Cores: 10
Memory: 16 GB
