% ngram(1) | N-Gram Generator
NGRAM - An interface to the ngram library
ngram -h
THIS IS A WORK IN PROGRESS AND HAS BUGS.
- Fix bugs, there are some egregious ones.
- Turn into header only library with a driver.
- Enumerate use cases, such as finding dictionaries for compression (put together a pipeline and an example that builds a good dictionary from command line tools, including this one).
- Test cases.
- Fuzzing.
- This program does not work with UTF-8, only ANSI/8-bit character sets.
Project: ngram - Generate ngrams
Author: Richard James Howe
License: Public Domain
Email: howe.r.j.89@gmail.com
Website: <https://github.com/howerj/ngram>
Generate an n-gram list from an input. The program acts as a Unix filter. It is built around a library that can read from an arbitrary source (and print to one), allowing it to be embedded in other applications.
Input is read from standard input and results are written to standard output. The program works on textual or binary data. Output is in the form of escaped strings. Non-zero is returned on error.
-h print this help message and exit successfully
-i ignore case by converting upper to lower case
-b run built in self tests
-t tree print instead of one n-gram per line
-v increase verbosity
-d string use a list of delimiters, binary values are in hex, '\xHH'
-s char set the output separator for printing results
-w use white space as a set of delimiters
-W use any character that is not alphanumeric as a delimiter
-l # minimum n-gram count to print, maximum if -H not used
-H # maximum n-gram count to generate
-n # instead of using a delimiter, read # in bytes at a time
This program returns zero on success and non-zero on failure.
Placed in the public domain, effectively, with the Unlicense, do what thou wilt.
Build requirements are a C compiler and Make. To build:
make
To run the tests:
make test
For a full list of command line options, run "./ngram -h" on Unix-like systems or "ngram.exe -h" on Windows.
To save n-grams to a file:
./ngram < file.ext > file.ngrams
By default n-grams of length 1 are constructed with their counts prefixed, and the input is split into single characters. This produces a frequency count of all the symbols that occur in the input. A range of n-gram lengths can be selected with the "-l" and "-H" options. Instead of splitting the input character by character, it is possible to split it on a set of delimiters: either one of the built-in sets (such as white-space, selected with "-w") or a custom set given with the "-d" option, which accepts a set of characters to delimit the n-grams on.
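As a rough cross-check of the default behaviour described above, a per-character frequency count can be approximated with standard Unix tools. This is only a sketch and not how ngram is implemented: "char_counts" is a hypothetical helper name, newlines are dropped rather than counted, and the output is not escaped the way ngram's output is, so it is only suitable for plain printable ASCII text without tabs.

```shell
# char_counts: hypothetical helper, not part of ngram.
# Reads standard input and prints one "count character" line per symbol.
char_counts() {
    # fold -w 1 puts each character on its own line (newlines are dropped),
    # sort groups identical characters together,
    # uniq -c prefixes each distinct character with its count.
    fold -w 1 | LC_ALL=C sort | uniq -c
}

# Usage: char_counts < file.ext
```

The counts should agree with the default output of ngram for such input, although the layout of each line will most likely differ.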
To print out all n-grams of length 2:
./ngram -l 2 < file.ext > file.ngrams
Or
./ngram -l 2 -H 2 < file.ext > file.ngrams
To print out all n-grams of length 2-4:
./ngram -l 2 -H 4 < file.ext > file.ngrams
To split on white-space instead of character by character:
./ngram -l 3 -w < file.ext > file.ngrams
Alternatively:
./ngram -l 3 -d " \x09\n\r\v" < file.ext > file.ngrams
Output can also be printed as a tree with the "-t" option.
This tool does not support ignoring a set of characters when constructing n-grams. Rather than adding that feature to this tool, the program tr can be used to delete the unwanted characters before they reach ngram.
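A minimal sketch of that approach follows. The helper name "strip_chars" is hypothetical, and the choice of punctuation and digits as the characters to ignore is only an assumption for illustration; any set accepted by tr can be substituted.

```shell
# strip_chars: hypothetical helper, not part of ngram.
# Deletes punctuation and digits from standard input.
strip_chars() {
    tr -d '[:punct:][:digit:]'
}

# Usage, assuming ./ngram has been built in the current directory:
#   strip_chars < file.ext | ./ngram -l 2 -w > file.ngrams
```

Because the filtering happens in a separate process, ngram itself never sees the deleted characters, and the same pipeline works with any of the splitting options.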