A fast, simple, multilingual tokenizer

Tok-tok

Tok-tok is a fast, simple, multilingual tokenizer. Python and Perl implementations are provided.

Example

For example, given an input of:

  They thought, "Is 9.5 or 525,600 my favorite number?",  before seeing Dr. Bob's dog talk on www.foo.ninja/index.html?nobeta=1&noalpha=0.

The output is:

  They thought , " Is 9.5 or 525,600 my favorite number ? " , before seeing Dr. Bob ' s dog talk on www.foo.ninja/index.html?nobeta=1&noalpha=0 .

Usage

Python

python3 toktok.py [options] < text.txt > text.tok.txt

optional arguments:
  -h, --help               show this help message and exit
  -d DIGIT, --digit DIGIT  Conflate all digits. For example "3.14" -> "5.55"
  -l LANG, --lang LANG     Specify language code for moses tokenizer (default: en)
  --lc, --lower            Lowercase text
  --no_empty               Remove empty lines
  --skip_comments          Don't tokenize lines starting with '#'
  -t TOK, --tok TOK        Specify tokenizer submodule {casual, moses, stanford, toktok, treebank} (default: toktok)
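As a rough illustration of what the `--digit`, `--lower`, and `--no_empty` options do to the text, here is a minimal sketch. It is not taken from `toktok.py`: the function names `conflate_digits` and `preprocess` are hypothetical, and the assumption that only ASCII digits are conflated may not match the real script.

```python
import re


def conflate_digits(line: str, digit: str = "5") -> str:
    """Replace every ASCII digit in `line` with `digit` (assumed behaviour)."""
    return re.sub(r"[0-9]", digit, line)


def preprocess(lines, digit=None, lower=False, no_empty=False):
    """Yield lines with the selected options applied (hypothetical helper)."""
    for line in lines:
        line = line.rstrip("\n")
        if no_empty and not line.strip():
            continue  # drop empty or whitespace-only lines
        if lower:
            line = line.lower()
        if digit is not None:
            line = conflate_digits(line, digit)
        yield line


print(conflate_digits("3.14"))  # 5.55
```

Conflating digits this way keeps the shape of a number (its length and punctuation) while discarding its value, which is the effect the `"3.14" -> "5.55"` example in the help text describes.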

A big thanks to Liling Tan for porting the regexes to Python.

Perl

perl tok-tok.pl [options] < text.txt > text.tok.txt

Options:
 -h, --help           Print this usage
 -d, --digit <u>      Conflate all digits to <u>. Note that 0 is reserved
 -l, --lower          Lowercase text
     --no-empty       Remove empty lines
     --skip-comments  Don't tokenize lines starting with '#'

Languages

Tok-tok has been tested on, and gives reasonably good results for, English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few others. The input should be UTF-8 encoded. You can use the command-line tool iconv to convert other encodings to UTF-8.
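If iconv is not available, the same conversion can be done with the Python standard library. The sketch below is only an illustration and is not part of Tok-tok: the helper name `to_utf8` is hypothetical, and it assumes you already know the source encoding of your text.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode `raw` from `source_encoding` and re-encode it as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Example: a Latin-1 byte string converted to UTF-8.
latin1_bytes = "café".encode("latin-1")
print(to_utf8(latin1_bytes, "latin-1"))  # b'caf\xc3\xa9'
```

Decoding raises `UnicodeDecodeError` if the bytes are not valid in the stated encoding, which is usually preferable to silently passing mis-encoded text to the tokenizer.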

Name

The name of the software means "language" or "speech" in Bislama, a Melanesian Creole language. Bislama is quite possibly the world's perfect language.