
Habeas Corpus

This is a collection of command-line corpus tools. For convenience, it also includes submodules of Ken Heafield's preprocess and Rico Sennrich's BPE (subword-nmt) repos. To include the submodules when cloning, add the --recursive flag:

    git clone --recursive https://github.com/jonsafari/habeas-corpus

Many of the scripts accept a --help argument, so you can often get usage information for a specific script by typing:

    ./myscript.sh --help

Most of these scripts read their input from stdin and write text to stdout, so the typical Unix command-line usage is:

    ./myscript.sh < input.txt > output.txt

You can also pipe these commands together with other commands.
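For example, a minimal sketch (corpus.txt.gz is a hypothetical filename) that uses allcat, described below, to decompress a corpus and preview the first few processed lines:

    allcat corpus.txt.gz | ./myscript.sh | head -n 10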

  • allcat - Works like cat regardless of whether the file is plaintext, .gz, .bz2, .lzma, or .xz (the best of these)
  • char_freq.sh - Tabulates the frequency of characters in a text
  • corpus_get.sh - Builds a corpus from a remote website, recursively downloading all webpages
  • generate_language.sh - Randomly generates text, given a language model and vocabulary
  • mediawiki_dict.sh - Converts MediaWiki dumps to a bilingual dictionary. You should use the wiki dump from the smaller language
  • par_map.sh - Maps a command in parallel over either a single file or multiple files (i.e., parallelizes a command)
  • rev_words.pl - Reverses word order in each line. For example "how are you?" becomes "you? are how"
  • Preprocessing:
  • Vocabulary extraction (see the example after this list):
    • vocab.sh - Lists the vocabulary (set of unique words) from a text corpus
    • vocab_top.sh - Lists a frequency-sorted vocabulary (set of unique words) from a text corpus
    • vocab_filter.py - Replaces infrequent tokens with <unk>
    • word2int.py - Converts words to integers in a streaming (online) fashion
  • Experiment management:
    • generate_splits.pl - Generates train/dev/test splits from a whole corpus. Every n lines go to the training set, then one line to the development set, then one to the test set
    • subcorpora.pl - Builds subcorpora from a whole corpus, increasing in size exponentially
  • Penn Treebank formatting:
  • Character set encoding:
  • Classical cryptanalysis:
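As a sketch of how the vocabulary-extraction scripts compose under the stdin/stdout convention (corpus.txt.gz is a hypothetical filename; check each script's --help for its actual options):

    # List the 20 most frequent words of a compressed corpus
    allcat corpus.txt.gz | ./vocab_top.sh | head -n 20

    # Count the vocabulary size (number of unique words)
    allcat corpus.txt.gz | ./vocab.sh | wc -l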

Other Useful Commands

  • grep - Search for a pattern in text. All of the command-line arguments are useful, especially -r, -i, -c, -e, -v, -o, -w (spells ricevow)
  • shuf - Randomizes the lines of the input. For reproducible pseudo-randomization, try --random-source=input.txt
  • sort - Sorts the lines of the input. For large corpora, use LANG=C sort --buffer-size=4000M --temporary-directory=./ (see the example after this list)
  • tac - Reverses line-order of the input
  • wc - Counts the number of lines, words (tokens), and characters. The --max-line-length argument is also useful.
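For instance, a minimal sketch (input.txt is a hypothetical filename) combining the flags suggested above:

    # Reproducibly pseudo-randomize a corpus, using the input itself as the random source
    shuf --random-source=input.txt input.txt > shuffled.txt

    # Sort a large corpus with byte-order collation, a big buffer, and local temporary files
    LANG=C sort --buffer-size=4000M --temporary-directory=./ input.txt > sorted.txt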

Other Useful Corpus Tools