Habeas Corpus

This is a collection of command-line corpus tools. For convenience it also includes submodules from Ken's preprocess and Rico's BPE repos. To include submodules when cloning, add the --recursive flag:

    git clone --recursive

Many of the scripts accept a --help argument that prints usage information, so you can often type the following for more specific help:

    ./ --help

Most of these scripts read their input from stdin and write text to stdout, so typical Unix command-line usage is:

    ./ < input.txt > output.txt

You can also chain these scripts with other commands in a pipeline.
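
To make the stdin/stdout convention concrete, here is a minimal Python sketch of a filter in that style. It is not code from this repository: the names reverse_words and filter_stream are hypothetical, and the transformation is simply the word-order reversal described in the tool list, chosen because the README gives an example for it.

```python
import io


def reverse_words(line: str) -> str:
    """Reverse the order of whitespace-separated tokens in one line."""
    return " ".join(reversed(line.split()))


def filter_stream(src, dst) -> None:
    """Apply reverse_words to each line read from src, writing to dst."""
    for line in src:
        dst.write(reverse_words(line) + "\n")


# Demonstration on in-memory streams. A real command-line script would
# instead call filter_stream(sys.stdin, sys.stdout).
demo_out = io.StringIO()
filter_stream(io.StringIO("how are you?\n"), demo_out)
print(demo_out.getvalue(), end="")
```

Passing the streams in as arguments keeps the transformation testable without a shell, while a script that wires in sys.stdin and sys.stdout can still be used with < input.txt > output.txt or inside a pipeline.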

  • allcat - Works like cat regardless of whether the file is plaintext, .gz, .bz2, .lzma, or .xz (best)
  • - Tabulates the frequency of characters in a text
  • - Builds a corpus from a remote website, recursively downloading all webpages
  • - Randomly generates text, given a language model and vocabulary
  • - Converts MediaWiki dumps to a bilingual dictionary. Use the wikidump from the smaller of the two languages
  • - Maps a command in parallel over either a single file or multiple files (i.e. parallelizes a command)
  • - Reverses word order in each line. For example "how are you?" becomes "you? are how"
  • Preprocessing:
  • Vocabulary extraction:
    • - Lists the vocabulary (set of unique words) from a text corpus
    • - Lists a frequency-sorted vocabulary (set of unique words) from a text corpus
    • - Replaces infrequent tokens with <unk>
    • - Converts words to integers, online
  • Experiment management:
    • - Generates train/dev/test splits from a whole corpus. Every n lines go to the training set, then one line goes to the development set, then one to the test set
    • - Builds subcorpora from a whole corpus, increasing in size exponentially
  • Penn Treebank formatting:
  • Character set encoding:
  • Classical cryptanalysis:
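
A few of the descriptions above are easier to follow with a small example. The Python sketch below is illustrative only and is not the repository's implementation. The function names (split_corpus, replace_infrequent, words_to_ints) are hypothetical, and several details are assumptions: the split repeats in cycles of n + 2 lines, "infrequent" means a count below a min_count threshold, and integer IDs start at 0 and are assigned in order of first appearance.

```python
from collections import Counter


def split_corpus(lines, n):
    """Send n lines to train, then one to dev, then one to test, repeatedly."""
    train, dev, test = [], [], []
    for i, line in enumerate(lines):
        pos = i % (n + 2)  # position within one train/dev/test cycle
        if pos < n:
            train.append(line)
        elif pos == n:
            dev.append(line)
        else:
            test.append(line)
    return train, dev, test


def replace_infrequent(lines, min_count):
    """Replace tokens seen fewer than min_count times with <unk>.

    The threshold semantics are an assumption made for this sketch.
    """
    counts = Counter(tok for line in lines for tok in line.split())
    return [
        " ".join(tok if counts[tok] >= min_count else "<unk>"
                 for tok in line.split())
        for line in lines
    ]


def words_to_ints(lines):
    """Map each word to an integer ID assigned the first time it is seen."""
    ids = {}
    for line in lines:
        # setdefault returns the existing ID, or stores len(ids) as a new one
        yield [ids.setdefault(tok, len(ids)) for tok in line.split()]


corpus = ["a b a", "a c", "b b d"]
print(split_corpus(list(range(12)), n=4))
print(replace_infrequent(corpus, min_count=2))
print(list(words_to_ints(corpus)))
```

Note that words_to_ints is a generator that never needs the whole corpus in memory, which is one reading of "online" in the description above. replace_infrequent, by contrast, needs a full pass to count tokens before it can rewrite any line.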

Other Useful Commands

  • grep - Searches for a pattern in text. All of the command-line arguments are useful, especially -r, -i, -c, -e, -v, -o, -w (spells ricevow)
  • shuf - Randomizes the lines of the input. For reproducible pseudo-randomization, try --random-source=input.txt
  • sort - Sorts the lines of the input. For large corpora, use LANG=C sort --buffer-size=4000M --temporary-directory=./
  • tac - Reverses line-order of the input
  • wc - Counts number of lines, words (tokens), and characters. The argument --max-line-length is also useful.

Other Useful Corpus Tools