
Habeas Corpus

This is a collection of command-line corpus tools. For convenience it also includes submodules of Kenneth Heafield's preprocess and Rico Sennrich's subword-nmt (BPE) repos. To include the submodules when cloning, add the --recursive flag:

    git clone --recursive https://github.com/jonsafari/habeas-corpus
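
If you have already cloned the repository without --recursive, the submodules can still be fetched afterwards with standard Git commands:

    git submodule update --init --recursive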

Many of the scripts accept a --help command-line argument, so you can often get usage information with:

    ./myscript.sh --help

Most of these scripts read their input from stdin and write their output to stdout, so the typical Unix command-line usage is:

    ./myscript.sh < input.txt > output.txt

You can also pipe these commands together with other commands.
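
For example, a small sketch of piping two of the scripts together, assuming lowercase.pl and char_freq.sh follow the stdin/stdout convention above:

    # lowercase the corpus, then tabulate character frequencies
    ./lowercase.pl < corpus.txt | ./char_freq.sh > char_freq.txt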

  • allcat - Works like cat regardless of whether the file is plaintext, .gz, .bz2, .lzma, or .xz (best compression)
  • char_freq.sh - Tabulates the frequency of characters in a text
  • corpus_get.sh - Builds a corpus from a remote website, recursively downloading all webpages
  • generate_language.sh - Randomly generates text, given a language model and vocabulary
  • mediawiki_dict.sh - Converts MediaWiki dumps to a bilingual dictionary. Use the wiki dump from the smaller of the two languages
  • par_map.sh - Maps a command in parallel over a single file or over multiple files (i.e. parallelizes a command)
  • rev_words.pl - Reverses word order in each line. For example "how are you?" becomes "you? are how"
  • Preprocessing:
    • lowercase.pl - Lowercases text
    • uppercase.pl - Uppercases text
    • digit_conflate.pl - Conflates digits
    • mediawiki2text.sh - Converts MediaWiki dumps to plain text
  • Vocabulary extraction (a usage sketch follows this list):
    • vocab.sh - Lists the vocabulary (set of unique words) from a text corpus
    • vocab_top.sh - Lists a frequency-sorted vocabulary (set of unique words) from a text corpus
    • vocab_filter.py - Replaces infrequent tokens with <unk>
    • word2int.py - Converts words to integers, online
  • Experiment management:
    • generate_splits.pl - Generates train/dev/test splits from a whole corpus: every n lines go to the training set, then one line to the development set, then one line to the test set
    • subcorpora.pl - Builds subcorpora from a whole corpus, increasing in size exponentially
  • Penn Treebank formatting:
    • penn2conll.sh - Converts Penn Treebank trees to CoNLL format
    • penn2plain.pl - Converts Penn Treebank trees to plain text
    • penn2qtree.sh - Converts Penn Treebank trees to qtree format (for typesetting trees in LaTeX)
  • Character set encoding:
    • buckwalter2unicode.pl - Converts Buckwalter-transliterated Arabic to Unicode
    • win1256_2_roman.tcl - Converts Windows-1256 (Arabic) text to a romanized form
  • Classical cryptanalysis:
    • playfair_digraph_freq.sh - Tabulates digraph frequencies, as used in Playfair cipher analysis
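
A rough usage sketch for the vocabulary tools, assuming they follow the stdin/stdout convention above (the exact arguments of vocab_filter.py and word2int.py may differ, so check their --help output):

    # frequency-sorted vocabulary of the training data
    ./vocab_top.sh < train.txt > vocab_freq.txt

    # replace infrequent tokens with <unk>, then map words to integers
    ./vocab_filter.py < train.txt | ./word2int.py > train.int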

Other Useful Commands

  • grep - Searches for a pattern in text. All of the command-line arguments are useful, especially -r, -i, -c, -e, -v, -o, -w (spells ricevow)
  • shuf - Randomizes the lines of the input. For reproducible pseudo-randomization, try --random-source=input.txt
  • sort - Sorts the lines of the input. For large corpora, use LANG=C sort --buffer-size=4000M --temporary-directory=./ (see the pipeline sketch after this list)
  • tac - Reverses line-order of the input
  • wc - Counts number of lines, words (tokens), and characters. The argument --max-line-length is also useful.
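
These standard commands combine well with each other. For example, a rough sketch of building a frequency-sorted word list with standard tools alone, assuming a whitespace-tokenized corpus (vocab_top.sh above does much the same in one step):

    # one word per line, then sort, count duplicates, and sort by count
    tr -s ' ' '\n' < corpus.txt | LANG=C sort | uniq -c | sort -rn > word_freq.txt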

Other Useful Corpus Tools