A C++ package for character and word ngram analysis. It uses a ternary search tree instead of a hash table for faster ngram frequency counting. Words are converted to unique IDs and encoded as more compact base-256 integers. It is a partial implementation of Dr. Vlado Keselj's Text-Ngrams 1.6, a very flexible ngram package written in Perl.
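To illustrate the idea of counting ngram frequencies with a ternary search tree, here is a minimal sketch. It is not the code from `ternarySearchTree.h`; the class name `TstCounter` and its `add`/`count` methods are hypothetical and chosen only for this example. Each node holds one character and three children (lower, equal, higher), and the node for the last character of a key carries that key's count.

```cpp
#include <cstddef>
#include <memory>
#include <string>

// Hypothetical minimal ternary search tree mapping a key (an ngram) to a count.
// Illustrative only; not the implementation shipped in this package.
class TstCounter {
public:
    // Increment the count for key and return the new count.
    // Empty keys are ignored and report 0.
    std::size_t add(const std::string& key) {
        if (key.empty()) return 0;
        Node* node = findOrCreate(key);
        return ++node->count;
    }

    // Return the count stored for key, or 0 if it was never added.
    std::size_t count(const std::string& key) const {
        if (key.empty()) return 0;
        const Node* node = root_.get();
        std::size_t i = 0;
        while (node) {
            const unsigned char c = static_cast<unsigned char>(key[i]);
            if (c < node->ch) {
                node = node->lo.get();
            } else if (c > node->ch) {
                node = node->hi.get();
            } else if (i + 1 == key.size()) {
                return node->count;      // reached the key's last character
            } else {
                node = node->eq.get();   // matched; move to the next character
                ++i;
            }
        }
        return 0;
    }

private:
    struct Node {
        explicit Node(unsigned char c) : ch(c), count(0) {}
        unsigned char ch;                 // character stored at this node
        std::size_t count;                // occurrences of the key ending here
        std::unique_ptr<Node> lo, eq, hi; // lower / equal / higher children
    };

    // Walk the tree, creating nodes as needed, and return the node
    // that corresponds to the last character of key (key is non-empty).
    Node* findOrCreate(const std::string& key) {
        std::unique_ptr<Node>* slot = &root_;
        std::size_t i = 0;
        for (;;) {
            const unsigned char c = static_cast<unsigned char>(key[i]);
            if (!*slot) slot->reset(new Node(c));
            Node* n = slot->get();
            if (c < n->ch) {
                slot = &n->lo;
            } else if (c > n->ch) {
                slot = &n->hi;
            } else if (i + 1 == key.size()) {
                return n;
            } else {
                slot = &n->eq;
                ++i;
            }
        }
    }

    std::unique_ptr<Node> root_;
};
```

Calling `add("the")` twice and then `count("the")` returns 2, while `count("th")` returns 0 because that prefix was never added as a key of its own. Keys that share a prefix share nodes, which is one reason a structure like this can be compact for ngram data.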
ByteNgrams.cpp Fix 0x00 byte value in byte ngrams. byte values are stored with hex v… Apr 10, 2015
ByteNgrams.h Add Byte NGrams for extracting ngrams from binary file Apr 9, 2015
CharNgrams.cpp Add Byte NGrams for extracting ngrams from binary file Apr 9, 2015
CharNgrams.h
INgrams.cpp Remove DOS characters Mar 25, 2008
INgrams.h
Makefile Add Byte NGrams for extracting ngrams from binary file Apr 9, 2015
README.md Add Byte NGrams for extracting ngrams from binary file Apr 9, 2015
WordNgrams.cpp Add Byte NGrams for extracting ngrams from binary file Apr 9, 2015
WordNgrams.h Remove DOS characters Dec 7, 2009
config.h Add Byte NGrams for extracting ngrams from binary file Apr 9, 2015
mystring.h include new headers in mystring.h to be able to compile in Linux Nov 12, 2009
ngrams.cpp Remove DOS characters Dec 7, 2009
ngrams.h Minor changes for improving performance Mar 26, 2008
ngrams.sln
ngrams.vcproj Remove DOS characters Dec 7, 2009
sample.txt
string.cpp Fix 0x00 byte value in byte ngrams. byte values are stored with hex v… Apr 10, 2015
ternarySearchTree.h
text2wfreq.cpp
text2wfreq.h Add Byte NGrams for extracting ngrams from binary file Apr 9, 2015
vector.h Change the constructor vector(unsigned) to behave same as std:vector May 11, 2015

README.md

More information is available at http://users.cs.dal.ca/~vlado/srcperl/Ngrams/Ngrams.html

How to use it:

  1. Download and save the source code.

  2. $ make

  3. $ ngrams --type=word --n=3 --in=sample.txt

    or

    $ ngrams --type=character --n=3 --in=sample.txt

    or, for byte ngrams (e.g., to extract ngrams from a binary file):

    $ ngrams --type=byte --n=3 --in=sample.txt

That's it.
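
For readers who want to see what character ngram counting amounts to, the following is a rough sketch of a sliding-window count. It is an assumption-laden illustration, not the package's code: the function name `countCharNgrams` is hypothetical, it uses `std::map` rather than the ternary search tree, and it does no text normalization, so its results may differ from what `ngrams --type=character` reports.

```cpp
#include <cstddef>
#include <map>
#include <string>

// Illustrative helper (hypothetical name): count every run of n consecutive
// characters in text. No normalization of whitespace or case is performed.
std::map<std::string, std::size_t> countCharNgrams(const std::string& text,
                                                   std::size_t n) {
    std::map<std::string, std::size_t> counts;
    if (n == 0 || text.size() < n) {
        return counts;  // no ngram of that length fits in the text
    }
    // Slide a window of width n across the text, one character at a time.
    for (std::size_t i = 0; i + n <= text.size(); ++i) {
        ++counts[text.substr(i, n)];
    }
    return counts;
}
```

For the input `"abcabc"` with `n = 3`, the windows are `abc`, `bca`, `cab`, `abc`, so the map holds three distinct ngrams and `abc` has a count of 2.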

If you find a bug or have a suggestion, please email me at jerryy@gmail.com. Thanks.

Zheyuan Yu, Feb 18, 2006