This is a collection of command-line corpus tools.
For convenience, it also includes submodules from Ken's preprocess and Rico's BPE repos. To include submodules when cloning, add the --recursive argument:
git clone --recursive https://github.com/jonsafari/habeas-corpus
Many of the scripts have a command-line argument --help for usage information, so often you can type the following for more specific help:
./myscript.sh --help
Most of these scripts take their input from stdin, and output text to stdout, so the Unix command-line usage for many of these scripts is:
./myscript.sh < input.txt > output.txt
Or you can pipe these commands with other commands.
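As a rough sketch of this stdin/stdout convention, the following uses only standard Unix tools. The function lowercase_filter is a hypothetical stand-in for one of the scripts, not part of this repo, and the file names are made up for illustration.

```shell
# Hypothetical stand-in for a script in this repo: reads stdin, writes stdout.
lowercase_filter() {
    tr '[:upper:]' '[:lower:]'
}

tmpdir=$(mktemp -d)
printf 'Hello World\nhello again\n' > "$tmpdir/input.txt"

# Redirect form: filter < input > output
lowercase_filter < "$tmpdir/input.txt" > "$tmpdir/output.txt"
cat "$tmpdir/output.txt"

# Pipe form: chain the filter with other commands to count word frequencies
lowercase_filter < "$tmpdir/input.txt" | tr -s ' ' '\n' | sort | uniq -c | sort -rn

rm -rf "$tmpdir"
```

Because each stage only reads stdin and writes stdout, any stage can be swapped for a real script from this repo without changing the rest of the pipeline.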
- allcat - Works like cat regardless of whether the file is plaintext, .gz, .bz2, .lzma, or .xz (best)
- char_freq.sh - Tabulates the frequency of characters in a text
- corpus_get.sh - Builds a corpus from a remote website, recursively downloading all webpages
- generate_language.sh - Randomly generates text, given a language model and vocabulary
- mediawiki_dict.sh - Converts MediaWiki dumps to a bilingual dictionary. You should use the wikidump from the smaller language
- par_map.sh - Parallel Map a command for either a single file or multiple files (i.e. parallelize a command)
- rev_words.pl - Reverses word order in each line. For example "how are you?" becomes "you? are how"
- digit_conflate.pl - Conflates all numerical digits to a single digit. For example 48,250.75 -> 55,555.55
- lowercase.pl - Lowercases all texts. Works on almost all bicameral orthographies
- Tok-tok - General tokenizer, suitable for many languages
- uppercase.pl - Uppercases all texts. Works on almost all bicameral orthographies
- Vocabulary extraction:
- Experiment management:
- Penn Treebank formatting:
- Character set encoding:
- Classical cryptanalysis:
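To make the two worked examples in the list above concrete, here is a minimal sketch that reproduces them with sed and awk. These functions are illustrative stand-ins with hypothetical names; the repo's digit_conflate.pl and rev_words.pl are Perl scripts and may handle cases (such as non-ASCII digits) that these do not.

```shell
# Stand-in for digit_conflate.pl: map every ASCII digit to 5.
digit_conflate_sketch() {
    sed 's/[0-9]/5/g'
}

# Stand-in for rev_words.pl: reverse whitespace-separated words on each line.
rev_words_sketch() {
    awk '{
        out = ""
        for (i = NF; i >= 1; i--) out = out $i (i > 1 ? " " : "")
        print out
    }'
}

echo '48,250.75' | digit_conflate_sketch      # prints 55,555.55
echo 'how are you?' | rev_words_sketch        # prints you? are how
```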
Other Useful Commands
- grep - Search for a pattern in text. All of the command-line arguments are useful, especially
- shuf - Randomizes the lines of the input. For reproducible pseudo-randomization, try
- sort - Sorts the lines of the input. For large corpora, use:
LANG=C sort --buffer-size=4000M --temporary-directory=./
- tac - Reverses line-order of the input
- wc - Counts number of lines, words (tokens), and characters. The argument --max-line-length is also useful.
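Here is a short sketch of sort, tac, and wc on a toy file, assuming GNU coreutils (tac and wc --max-line-length are GNU extensions). The file name, its contents, and the small buffer size are made up for illustration.

```shell
tmpdir=$(mktemp -d)
printf 'banana split\nApple\ncherry\n' > "$tmpdir/corpus.txt"

# LANG=C requests byte-order collation (unless LC_ALL or LC_COLLATE is set);
# the buffer size and temporary directory are given explicitly, as for large corpora
sorted=$(LANG=C sort --buffer-size=10M --temporary-directory="$tmpdir" "$tmpdir/corpus.txt")

# Last line first
reversed=$(tac "$tmpdir/corpus.txt")

# Length of the longest line
longest=$(wc --max-line-length < "$tmpdir/corpus.txt")

printf '%s\n' "$sorted" "$reversed" "$longest"
rm -rf "$tmpdir"
```

On a real corpus, the buffer size would be raised to a value like the one shown in the sort entry above, and the temporary directory pointed at a disk with enough free space.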