
Tok-tok is a fast, simple, multilingual tokenizer. Python and Perl implementations are provided.


For example, given an input of:

  They thought, "Is 9.5 or 525,600 my favorite number?",  before seeing Dr. Bob's dog talk on

The output is:

  They thought , " Is 9.5 or 525,600 my favorite number ? " , before seeing Dr. Bob ' s dog talk on .



python3 [options] < text.txt > text.tok.txt

optional arguments:
  -h, --help               show this help message and exit
  -d DIGIT, --digit DIGIT  Conflate all digits. For example "3.14" -> "5.55"
  -l LANG, --lang LANG     Specify language code for moses tokenizer (default: en)
  --lc, --lower            Lowercase text
  --no_empty               Remove empty lines
  --skip_comments          Don't tokenize lines starting with '#'
  -t TOK, --tok TOK        Specify tokenizer submodule {casual, moses, stanford, toktok, treebank} (default: toktok)
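
As a rough illustration of what the -d, --lc and --no_empty options describe, the sketch below applies the same kinds of transformations in plain Python. It is not the code used by Tok-tok: the function names conflate_digits and filter_lines are hypothetical, only ASCII digits are handled, and the order in which the real tool applies these steps is an assumption.

```python
import re


def conflate_digits(text, digit="5"):
    """Replace every ASCII digit in text with one placeholder digit."""
    return re.sub(r"[0-9]", digit, text)


def filter_lines(lines, lower=False, no_empty=False, digit=None):
    """Yield lines after optional empty-line removal, lowercasing and digit conflation.

    The order of the steps is an assumption made for this sketch.
    """
    for line in lines:
        line = line.rstrip("\n")
        if no_empty and not line.strip():
            continue  # drop empty or whitespace-only lines
        if lower:
            line = line.lower()
        if digit is not None:
            line = conflate_digits(line, digit)
        yield line


print(conflate_digits("3.14"))  # 5.55
```

Conflating digits this way keeps the shape of a number (its length and punctuation) while collapsing all numeric values into one pattern, which reduces vocabulary size for downstream models.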

A big thanks to Liling Tan for porting the regexes to Python.


perl [options] < text.txt > text.tok.txt

 -h, --help           Print this usage
 -d, --digit <u>      Conflate all digits to <u>. Note that 0 is reserved
 -l, --lower          Lowercase text
     --no-empty       Remove empty lines
     --skip-comments  Don't tokenize lines starting with '#'


Tok-tok has been tested on, and gives reasonably good results for, English, Persian, Russian, Czech, French, German, Vietnamese, Tajik, and a few other languages. The input should be UTF-8 encoded. You can use the command-line tool iconv to convert other encodings to UTF-8.
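
Where iconv is not available, the same conversion can be sketched in Python. This is a minimal example, not part of Tok-tok: the function name convert_to_utf8 and the file names in the usage comment are hypothetical, and the source encoding has to be known in advance because it is not detected automatically.

```python
def convert_to_utf8(src_path, dst_path, src_encoding):
    """Re-encode a text file from src_encoding to UTF-8, line by line."""
    # newline="" disables newline translation so line endings are preserved.
    with open(src_path, "r", encoding=src_encoding, newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Hypothetical usage, assuming text.latin1.txt is ISO-8859-1 encoded:
# convert_to_utf8("text.latin1.txt", "text.txt", "iso-8859-1")
```

A file that does not actually match src_encoding may raise UnicodeDecodeError or be silently mis-decoded, so it is worth spot-checking the converted text before tokenizing it.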


The name of the software means "language" or "speech" in Bislama, a Melanesian Creole language. Bislama is quite possibly the world's perfect language.