This program is a utility for extracting n-grams from texts. It extracts every contiguous string from a collection of texts, from length 1 up to a maximum length set by the user. The included scripts then sort the resulting n-gram files to determine which strings occur most frequently. This method is especially useful for texts written in languages that do not put orthographic spaces between individual words.
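The idea described above can be sketched in a few lines of Python. This is only an illustration, not the code used by aks or processmasters: the function names extract_ngrams and most_frequent are hypothetical, and the sketch assumes that an n-gram is a run of n characters, which may differ from how the actual scripts segment each supported language.

```python
from collections import Counter


def extract_ngrams(text, max_n):
    """Count every contiguous substring of text with length 1..max_n.

    Assumption: the unit of an n-gram is a single character.
    """
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n across the text.
        for start in range(len(text) - n + 1):
            counts[text[start:start + n]] += 1
    return counts


def most_frequent(texts, max_n, top=10):
    """Merge counts over several texts and return the top entries."""
    totals = Counter()
    for text in texts:
        totals.update(extract_ngrams(text, max_n))
    # Sort by descending count, then by the string itself for a stable order.
    ranked = sorted(totals.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top]


if __name__ == "__main__":
    for ngram, count in most_frequent(["abab", "abba"], max_n=2, top=5):
        print(count, ngram)
```

Because every window position is visited for every length, the work grows with both the size of the texts and the maximum n value, so a large maximum on a large collection can produce very big n-gram files.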
usage:
./aks [language] [maximum n value] [source directory]
./processmasters [maximum n value] [source directory]
aks tibetan_roman 32 /home/handyc/texts
aks tibetan_uchen 32 /home/handyc/texts
aks chinese 32 /home/handyc/texts
aks sanskrit_unicode 32 /home/handyc/texts
You may need to change the permissions on the scripts to make them executable before you can run them (for example, chmod +x aks processmasters).
For questions or comments, please write to firstname.lastname@example.org, or search for my Leiden University address.