FastTokenizer

FastTokenizer is a tokenizer meant to perform language-agnostic tokenization using Unicode information.

While the initial goal was to design a tokenizer for machine translation, FastTokenizer is generic enough to be adapted to a wide range of NLP tasks because of its ability to handle a wide range of languages and writing systems.

Some of the notable features of FastTokenizer are

  • Provides just the right amount of tokenization.
    • Segmentation is designed to be intuitive and rule based. The format is ideal for downstream NLP models such as subword models, RNNs or transformers.
    • Also designed to not be overly aggressive, which keeps the number of tokens down and allows models to run faster.
  • Works for any and every language/writing system.
  • Available across programming languages.
  • Performs format-retaining Unicode normalization.
  • Performance matches or exceeds the Moses tokenizer on tasks such as WMT and GLUE.
  • Tokenization can be reversed.
    • However, a custom desegmenter should be used to achieve the desired formatting, as desegmentation is highly use-case driven.

Comparison with other tokenizers from the web

Source:          他的表现遭到《天空体育》评论员内维尔的批评。
Segmenter:       ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
Moses:           ['他的表现遭到《天空体育》评论员内维尔的批评。']
Spacy Tokenizer: ['他的表现遭到《天空体育》评论员内维尔的批评。']
Tweet Tokenizer: ['他的表现遭到', '《', '天空体育', '》', '评论员内维尔的批评', '。']
NLTK Tokenizer:  ['他的表现遭到《天空体育》评论员内维尔的批评。']

Source:          AirPods耳機套
Segmenter:       ['AirPods', '耳機套']
Moses:           ['AirPods耳機套']
Spacy Tokenizer: ['AirPods耳機套']
Tweet Tokenizer: ['AirPods耳機套']
NLTK Tokenizer:  ['AirPods耳機套']

Source:          A typical master's programme has a duration of 1-1.5 years.
Segmenter:       ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Moses:           ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1', '@-@', '1.5', 'years', '.']
Spacy Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years.']
Tweet Tokenizer: ['A', 'typical', "master's", 'programme', 'has', 'a', 'duration', 'of', '1-1', '.', '5', 'years', '.']
NLTK Tokenizer:  ['A', 'typical', 'master', "'s", 'programme', 'has', 'a', 'duration', 'of', '1-1.5', 'years', '.']
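
The Segmenter rows above can be reproduced with the Python API described in the Usage section below. A minimal sketch, assuming the package has been installed via pip (the exact handling of the dash in "1-1.5" may depend on the protected dash split option):

import fasttokenizer

segmenter = fasttokenizer.Segmenter()

sources = [
    "他的表现遭到《天空体育》评论员内维尔的批评。",
    "AirPods耳機套",
    "A typical master's programme has a duration of 1-1.5 years.",
]

for text in sources:
    # normalize_and_segment returns a whitespace-delimited str;
    # splitting on whitespace yields the token lists shown above.
    print(segmenter.normalize_and_segment(text).split())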

Installation

C++

Coming soon.

Python

pip install fasttokenizer

Usage

C++

#include <string>

#include <fasttokenizer/segmenter.h>

// The constructor takes a flag that controls protected dash splitting
bool protected_dash_split = false;
Segmenter segmenter = Segmenter(protected_dash_split);

std::string text = "Hello World!";
std::string output;

// Normalize
output = segmenter.normalize(text);

// Segment
output = segmenter.segment(text);

// Normalize and segment in one function
// Reduces the std::string to icu::UnicodeString conversion overhead
output = segmenter.normalize_and_segment(text);

// Desegment the segmented output
output = segmenter.desegment(output);

Python

import fasttokenizer

segmenter = fasttokenizer.Segmenter()

text = "Hello World!"

# Normalize
output: str = segmenter.normalize(text)

# Segment
output: str = segmenter.segment(text)

# Normalize and segment
output: str = segmenter.normalize_and_segment(text)

# Output of segment is str.
# To get tokens, you can split by whitespace.
tokens = output.split()

# Desegment the segmented output
output: str = segmenter.desegment(output)
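
Since tokenization can be reversed, a segmented string can be passed back to desegment. A minimal round-trip sketch; the restored formatting may not exactly match the input, which is why a custom desegmenter is recommended for use-case specific formatting:

import fasttokenizer

segmenter = fasttokenizer.Segmenter()

text = "A typical master's programme has a duration of 1-1.5 years."

# Segment, then reverse the segmentation
segmented: str = segmenter.normalize_and_segment(text)
restored: str = segmenter.desegment(segmented)

print(segmented)
print(restored)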