
Habeas Corpus

This is a collection of command-line corpus tools. For convenience, it also includes submodules of Ken Heafield's preprocess and Rico Sennrich's BPE (subword-nmt) repos. To include the submodules when cloning, add the --recursive flag:

    git clone --recursive https://github.com/jonsafari/habeas-corpus

Many of the scripts accept a --help argument, so you can often get usage information for a specific script by typing:

    ./myscript.sh --help

Most of these scripts read their input from stdin and write text to stdout, so the typical Unix command-line usage is:

    ./myscript.sh < input.txt > output.txt

You can also pipe these commands together with other commands.
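For example, a minimal sketch (corpus.txt.gz is a hypothetical filename) that uses allcat, described below, to decompress a corpus and preview the first few processed lines:

    allcat corpus.txt.gz | ./myscript.sh | head -n 10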

  • allcat - Works like cat regardless of whether the file is plaintext, .gz, .bz2, .lzma, or .xz (the best of these)
  • char_freq.sh - Tabulates the frequency of characters in a text
  • corpus_get.sh - Builds a corpus from a remote website, recursively downloading all webpages
  • generate_language.sh - Randomly generates text, given a language model and vocabulary
  • mediawiki_dict.sh - Converts MediaWiki dumps to a bilingual dictionary. You should use the wiki dump from the smaller language
  • par_map.sh - Maps a command in parallel over either a single file or multiple files (i.e., parallelizes a command)
  • rev_words.pl - Reverses word order in each line. For example "how are you?" becomes "you? are how"
  • Preprocessing:
  • Vocabulary extraction (see the example after this list):
    • vocab.sh - Lists the vocabulary (set of unique words) from a text corpus
    • vocab_top.sh - Lists a frequency-sorted vocabulary (set of unique words) from a text corpus
    • vocab_filter.py - Replaces infrequent tokens with <unk>
    • word2int.py - Converts words to integers in a streaming (online) fashion
  • Experiment management:
    • generate_splits.pl - Generates train/dev/test splits from a whole corpus. Every n lines go to the training set, then one line to the development set, then one to the test set
    • subcorpora.pl - Builds subcorpora from a whole corpus, increasing in size exponentially
  • Penn Treebank formatting:
  • Character set encoding:
  • Classical cryptanalysis:
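As a sketch of how the vocabulary-extraction scripts compose under the stdin/stdout convention (corpus.txt.gz is a hypothetical filename; check each script's --help for its actual options):

    # List the 20 most frequent words of a compressed corpus
    allcat corpus.txt.gz | ./vocab_top.sh | head -n 20

    # Count the vocabulary size (number of unique words)
    allcat corpus.txt.gz | ./vocab.sh | wc -l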

Other Useful Commands

  • grep - Search for a pattern in text. All of the command-line arguments are useful, especially -r, -i, -c, -e, -v, -o, -w (spells ricevow)
  • shuf - Randomizes the lines of the input. For reproducible pseudo-randomization, try --random-source=input.txt
  • sort - Sorts the lines of the input. For large corpora, use LANG=C sort --buffer-size=4000M --temporary-directory=./ (see the example after this list)
  • tac - Reverses line-order of the input
  • wc - Counts the number of lines, words (tokens), and characters. The --max-line-length argument is also useful.
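For instance, a minimal sketch (input.txt is a hypothetical filename) combining the flags suggested above:

    # Reproducibly pseudo-randomize a corpus, using the input itself as the random source
    shuf --random-source=input.txt input.txt > shuffled.txt

    # Sort a large corpus with byte-order collation, a big buffer, and local temporary files
    LANG=C sort --buffer-size=4000M --temporary-directory=./ input.txt > sorted.txt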

Other Useful Corpus Tools