LibIndic N-gram Generator
An n-gram generator for Indic languages.
What is an n-gram?
An n-gram model is a type of probabilistic model for predicting the next item in a sequence. n-grams are used in various areas of statistical natural language processing and genetic sequence analysis.
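To make the idea of predicting the next item concrete, here is a minimal sketch of a bigram frequency model in plain Python. It is not part of LibIndic; the function names `train_bigram_model` and `predict_next` are hypothetical and exist only for illustration. The sketch simply counts which item follows which and predicts the most frequent follower.

```python
from collections import Counter, defaultdict


def train_bigram_model(tokens):
    """Count, for each item, how often every other item follows it."""
    follower_counts = defaultdict(Counter)
    for current, following in zip(tokens, tokens[1:]):
        follower_counts[current][following] += 1
    return follower_counts


def predict_next(model, item):
    """Return the most frequent follower of `item`, or None if it was never seen."""
    followers = model.get(item)
    if not followers:
        return None
    return followers.most_common(1)[0][0]


# Illustrative data only: 'cat' follows 'the' twice, 'mat' follows it once.
words = "the cat sat on the mat and the cat slept".split()
model = train_bigram_model(words)
print(predict_next(model, "the"))  # cat
```

A practical model would normalise the counts into probabilities and smooth them for unseen pairs; the sketch leaves that out to keep the counting step visible.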
An n-gram is a subsequence of n items from a given sequence. The items in question can be phonemes, syllables, letters, words or base pairs according to the application.
An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram" (or, less commonly, a "digram"); size 3 is a "trigram"; and size 4 or more is simply called an "n-gram".
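The definition above amounts to sliding a window of size n over the sequence. The following is a rough sketch in plain Python, assuming a hypothetical helper named `ngrams`; it is not the LibIndic implementation, only an illustration of the concept for strings and word lists.

```python
def ngrams(sequence, n=2):
    """Return every contiguous subsequence of length n (hypothetical helper)."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sequence of length L has L - n + 1 windows; none if n > L.
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]


print(ngrams("Languages", 1))  # unigrams: single letters
print(ngrams("Languages", 2))  # bigrams
print(ngrams("Languages", 3))  # trigrams

# The same helper works on a list of words.
print(ngrams(["to", "be", "or", "not"], 2))
```

Because the helper relies only on slicing, the items can be letters, words, or any other elements of a sliceable sequence.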
- Clone the repository
git clone https://github.com/libindic/indicngram.git
- Change to the cloned directory
cd indicngram
- Run setup.py to create installable source
python setup.py sdist
- Install using pip
pip install dist/libindic-ngram*.tar.gz
Input parameters: the text and the value of N (default value 2)
Output: a list of grams
>>> from libindic.ngram import Ngram
>>> ngram_generator = Ngram()
>>> ngram_generator.letterNgram(<text>, <window size>)
>>> from libindic.ngram import Ngram
>>> ngram_generator = Ngram()
>>> text = "Languages"
>>> grams = ngram_generator.letterNgram(text, 3)
>>> print(grams)
['Lan', 'ang', 'ngu', 'gua', 'uag', 'age', 'ges']
>>> for gram in grams:
...     print("".join(gram))
Lan
ang
ngu
gua
uag
age
ges
Run tests with
python setup.py test
Read the docs for more.