Morphologically Biased Byte-Pair Encoding

mBPE is an extension to the huggingface/tokenizers library, designed to improve the segmentations produced by the byte-pair encoding (BPE) tokenization algorithm¹. Byte-pair encoding has been shown to approximate morphological boundaries poorly², which is especially problematic for morphologically rich languages. By incorporating morphological knowledge into the pre-tokenization process, we aim to improve segmentation quality by biasing the tokenizer towards morphologically motivated sub-word boundaries.

Pre-trained tokenizers and models are available on Hugging Face.

Pre-Tokenizers

External

The external pre-tokenizer enables the integration of custom pre-tokenization algorithms via a socket connection. Tokenization parallelism should be disabled by setting TOKENIZERS_PARALLELISM=false. Note that disabling parallelism will slow down tokenization significantly. See jonasknobloch/unimorph for a reference server implementation.
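When the tokenizer is driven from Python, one way to disable parallelism is to set the environment variable from within the process before any tokenization takes place. The snippet below is a minimal sketch of that step only; it does not show the socket protocol, which is defined by the reference server.

```python
import os

# Disable parallel tokenization in huggingface/tokenizers.
# Set this before the tokenizer is first used in the process.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```

Exporting the variable in the shell that launches the process has the same effect.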

Tree-Split

The tree-split pre-tokenizer introduces additional boundaries by clustering inflected word forms retrieved from UniMorph³ dictionaries. The forms in each cluster are aligned by constructing a suffix tree for that cluster. The trees are then traversed, and a new boundary is placed at every node that has multiple children.
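The idea of splitting at branching nodes can be illustrated with a simplified sketch. It assumes a plain character trie over the forms of one cluster instead of a true suffix tree, and it treats the end of a word as a child of its own; the function names `build_trie` and `split_form` are hypothetical and are not taken from the mBPE code base.

```python
def build_trie(forms):
    """Build a character trie over all word forms of one cluster."""
    root = {}
    for form in forms:
        node = root
        for ch in form:
            node = node.setdefault(ch, {})
        node["$"] = {}  # end-of-word marker, counted as a child
    return root


def split_form(trie, form):
    """Cut `form` after every trie node that has more than one child."""
    segments, start, node = [], 0, trie
    for i, ch in enumerate(form):
        node = node[ch]
        # A branching node means the cluster's forms diverge here.
        # No boundary is placed after the final character.
        if len(node) > 1 and i + 1 < len(form):
            segments.append(form[start:i + 1])
            start = i + 1
    segments.append(form[start:])
    return segments


cluster = ["walk", "walks", "walked", "walking"]
trie = build_trie(cluster)
for form in cluster:
    print(form, split_form(trie, form))
```

In this toy cluster all forms share the path `w-a-l-k` and diverge afterwards, so the only boundary introduced is the one after `walk`.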

Morfessor

The Morfessor pre-tokenizer introduces additional boundaries retrieved using an arbitrary Morfessor⁴⁵ model. Trained Morfessor models need to be converted using the provided protobuf definition and conversion script.

Intrinsic Metrics

Tokenizer Fertility

tokenizer	compounds	fertility
gpt2_cx-en_00000-00000_50k	4992469	1.32
gpt2+ts_cx-en_00000-00000_50k	4923123	1.40
gpt2+morf_u0-30-50-x_cx-en_00000-00000_50k	3630703	1.42
gpt2+morf_s0-30-x-2_cx-en_00000-00000_50k	99191	1.69
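The README does not define fertility. A common definition, assumed here, is the average number of tokens a tokenizer produces per word. The sketch below computes that quantity; `toy_tokenize` is a hypothetical stand-in for a real tokenizer, and the evaluation script behind the table above may differ in detail.

```python
def fertility(words, tokenize):
    """Average number of tokens per word (assumed definition)."""
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


def toy_tokenize(word):
    # Hypothetical tokenizer: splits a word into chunks of at most 4 characters.
    return [word[i:i + 4] for i in range(0, len(word), 4)]


words = ["walk", "walking", "tokenization"]
print(fertility(words, toy_tokenize))
```

A fertility of 1.0 would mean every word is kept as a single token; higher values mean words are split into more pieces on average.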

Boundary Precision and Recall

tokenizer	P	R	F1
gpt2_cx-en_00000-00000_50k	0.33	0.56	0.42
gpt2+ts_cx-en_00000-00000_50k	0.40	0.58	0.47
gpt2+morf_u0-30-50-x_cx-en_00000-00000_50k	0.45	0.61	0.52
gpt2+morf_s0-30-x-2_cx-en_00000-00000_50k	0.56	0.59	0.57
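The F1 column is consistent with the harmonic mean of P and R. As an illustration of how such scores can be obtained, the sketch below compares predicted boundary positions with gold boundary positions word by word. The micro-averaging over all words and the helper names are assumptions and are not taken from the mBPE evaluation code.

```python
def boundaries(segments):
    """Character offsets of the internal boundaries of a segmented word."""
    positions, offset = set(), 0
    for segment in segments[:-1]:
        offset += len(segment)
        positions.add(offset)
    return positions


def boundary_prf(predicted, gold):
    """Micro-averaged boundary precision, recall and F1 over paired words."""
    tp = fp = fn = 0
    for pred_segments, gold_segments in zip(predicted, gold):
        pred, ref = boundaries(pred_segments), boundaries(gold_segments)
        tp += len(pred & ref)   # boundaries found in both
        fp += len(pred - ref)   # predicted but not in gold
        fn += len(ref - pred)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


predicted = [["un", "believ", "able"], ["walk", "ing"]]
gold = [["un", "believable"], ["walk", "ing"]]
print(boundary_prf(predicted, gold))
```

Here the tokenizer finds both gold boundaries but adds one spurious boundary inside `believable`, so recall is perfect while precision drops.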

