In [2]:
import sys
if "../" not in sys.path: sys.path.append ("../")
from modules import utils
from modules.normalize import NGramsNormalizer

Calling the factory method `NGramsNormalizer.fromFiles` loads the dictionaries from local files and creates the `NGramsNormalizer` object. Note that this step consumes memory and can take a long time when the dictionaries are large, as with [Google Ngrams](http://storage.googleapis.com/books/ngrams/books/datasetsv2.html). In the run below, roughly five minutes elapse between the "Bigrams loaded" and "Trigrams loaded" log lines.

In [3]:
grams = NGramsNormalizer.fromFiles ("/hg191/corpora/google-ngrams/en.1M.filtered.1g",
                                    "/hg191/corpora/google-ngrams/en.1M.filtered.2g",
                                    "/hg191/corpora/google-ngrams/en.1M.filtered.3g",
                                    verbose=True)

2019-03-06 14:18:46,013 INFO Unigrams loaded
2019-03-06 14:19:51,277 INFO Bigrams loaded
2019-03-06 14:24:45,720 INFO Trigrams loaded


Now, let's try to segment a few terms using different methods. The simplest method is `byLikelihoodRatio`, which calculates the likelihood ratio using only unigram and bigram statistics.
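To make the idea concrete, here is a minimal sketch of a likelihood-ratio split on toy counts. It is not the code inside `NGramsNormalizer`: the function names, the toy dictionaries, and the exact form of the ratio (smoothed probability of the two-word reading divided by the smoothed probability of the unsplit term, both normalized by the same total) are assumptions made for illustration.

```python
def smoothed_prob(count, total, vocab_size, smoothing):
    """Additively smoothed probability estimate."""
    return (count + smoothing) / (total + smoothing * vocab_size)


def split_by_likelihood_ratio(term, unigrams, bigrams, smoothing=0.1, threshold=1.0):
    """Return "left right" if some split is more likely than the unsplit term.

    Assumption: unigram and bigram counts come from the same corpus, so both
    estimates share one normalizer (total token count, vocabulary size).
    """
    total = sum(unigrams.values())
    vocab_size = len(unigrams)
    p_whole = smoothed_prob(unigrams.get(term, 0), total, vocab_size, smoothing)

    best_ratio, best_split = 0.0, None
    # Try every position at which the term can be cut in two.
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        p_split = smoothed_prob(bigrams.get((left, right), 0), total, vocab_size, smoothing)
        ratio = p_split / p_whole
        if ratio > best_ratio:
            best_ratio, best_split = ratio, (left, right)

    # Split only when the best ratio strictly exceeds the threshold.
    if best_split is not None and best_ratio > threshold:
        return " ".join(best_split)
    return term


# Hypothetical toy counts, not taken from Google Ngrams.
toy_unigrams = {"aid": 50, "and": 1000, "often": 400, "of": 900, "ten": 60}
toy_bigrams = {("aid", "and"): 30, ("of", "ten"): 5}

for word in ["aidand", "often", "xyz"]:
    print(split_by_likelihood_ratio(word, toy_unigrams, toy_bigrams))
```

In this sketch a term that is unseen both as a unigram and in every split gets a ratio of exactly 1, so with `threshold=1` it is left untouched.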

In [6]:
examples = [
            "aidand", #aid and
            "often", #often
            "themselvesdown", #themselves down
            "Bespeak", #Bespeak
            "Senatoradmits", #Senator admits
            "Safeguard" #Safeguard
           ]

In [7]:
for example in examples:
    print (grams.byLikelihoodRatio (example, smoothing=0.1, threshold=1))

aid and
often
themselves down
Bespeak
Senator admits
Safe guard


The output looks reasonable, though there can be false positives (e.g. "Safeguard" is split into "Safe guard" when it shouldn't be). Sometimes the surrounding context can help. The `byContextualLikelihoodRatio` method makes use of the left and right context words.

In [11]:
examples = [
        ("our", "aidand", "our"), #aid and
        ("memory", "often", "years"),#of ten
        ("it", "often", "feels"), #often
        ("let", "themselvesdown", "into"), #themselves down
        ("cloud", "Bespeak", "the"), #Bespeak
        ("The", "Senatoradmits", "that"), #Senator admits
        ("a", "Safeguard", "against") #Safeguard
    ]

In [12]:
for lc, example, rc in examples:
    print (grams.byContextualLikelihoodRatio(example, lc, rc, interpolation=False, smoothing=0.1, threshold=1))

aid and
of ten
often
themselves down
Bespeak
Senator admits
Safeguard


With the extra context, the term "Safeguard" is no longer incorrectly split. Note also the example "often": without context it would usually remain as is, but it is now correctly split into "of" and "ten" in the span "memory often years", while the original form "often" is kept in the span "it often feels".
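The effect of context can be illustrated with a small sketch on toy counts. This is one plausible formulation, not the code behind `byContextualLikelihoodRatio`: the helper names, the toy counts, and the chain-rule scoring (a bigram estimate for the first word after the left context, trigram estimates afterwards) are all assumptions.

```python
def cond_prob(numerator, denominator, vocab_size, smoothing):
    """Additively smoothed conditional probability estimate."""
    return (numerator + smoothing) / (denominator + smoothing * vocab_size)


def score(words, unigrams, bigrams, trigrams, smoothing):
    """Chain-rule probability of words[1:] given words[0]."""
    v = len(unigrams)
    # P(w1 | w0) from bigram and unigram counts.
    p = cond_prob(bigrams.get((words[0], words[1]), 0), unigrams.get(words[0], 0), v, smoothing)
    # P(w_i | w_{i-2}, w_{i-1}) from trigram and bigram counts.
    for a, b, c in zip(words, words[1:], words[2:]):
        p *= cond_prob(trigrams.get((a, b, c), 0), bigrams.get((a, b), 0), v, smoothing)
    return p


def split_with_context(term, lc, rc, unigrams, bigrams, trigrams, smoothing=0.1, threshold=1.0):
    """Split `term` if "lc left right rc" is more likely than "lc term rc"."""
    p_whole = score([lc, term, rc], unigrams, bigrams, trigrams, smoothing)
    best_ratio, best_split = 0.0, None
    for i in range(1, len(term)):
        left, right = term[:i], term[i:]
        ratio = score([lc, left, right, rc], unigrams, bigrams, trigrams, smoothing) / p_whole
        if ratio > best_ratio:
            best_ratio, best_split = ratio, (left, right)
    if best_split is not None and best_ratio > threshold:
        return " ".join(best_split)
    return term


# Hypothetical toy counts, not taken from Google Ngrams.
ctx_unigrams = {"memory": 20, "it": 500, "of": 100, "ten": 10,
                "often": 50, "years": 30, "feels": 15}
ctx_bigrams = {("memory", "of"): 8, ("of", "ten"): 5, ("ten", "years"): 4,
               ("it", "often"): 12, ("often", "feels"): 6}
ctx_trigrams = {("memory", "of", "ten"): 3, ("of", "ten", "years"): 2,
                ("it", "often", "feels"): 4}

print(split_with_context("often", "memory", "years", ctx_unigrams, ctx_bigrams, ctx_trigrams))
print(split_with_context("often", "it", "feels", ctx_unigrams, ctx_bigrams, ctx_trigrams))
```

Because the split reading multiplies in one more probability than the unsplit reading, this formulation is biased against splitting unless the counts clearly support it.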

Internally, the method uses trigram and bigram statistics. Trigram statistics can suffer from sparsity, so the method can optionally compute interpolated probabilities and calculate the likelihood ratios from those instead. To enable this, set the `interpolation` flag to `True`. The interpolation weights for bigrams and trigrams can be specified with the `bigram_lambdas` and `trigram_lambdas` tuples, respectively.
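As a rough illustration of what interpolation does, the sketch below mixes maximum-likelihood trigram, bigram, and unigram estimates with fixed weights. The function names and toy counts are hypothetical, and the assumed weight order (highest-order model first, so `(0.7, 0.2, 0.1)` means trigram, bigram, unigram) is a guess rather than something this notebook states.

```python
def mle(numerator, denominator):
    """Maximum-likelihood estimate; 0.0 when the history was never seen."""
    return numerator / denominator if denominator > 0 else 0.0


def interpolated_bigram_prob(w1, w2, unigrams, bigrams, lambdas=(0.9, 0.1)):
    """P(w2 | w1) as a weighted mix of bigram and unigram estimates."""
    l2, l1 = lambdas
    total = sum(unigrams.values())
    p_bi = mle(bigrams.get((w1, w2), 0), unigrams.get(w1, 0))
    p_uni = mle(unigrams.get(w2, 0), total)
    return l2 * p_bi + l1 * p_uni


def interpolated_trigram_prob(w1, w2, w3, unigrams, bigrams, trigrams,
                              lambdas=(0.7, 0.2, 0.1)):
    """P(w3 | w1, w2) as a weighted mix of trigram, bigram and unigram estimates."""
    l3, l2, l1 = lambdas
    total = sum(unigrams.values())
    p_tri = mle(trigrams.get((w1, w2, w3), 0), bigrams.get((w1, w2), 0))
    p_bi = mle(bigrams.get((w2, w3), 0), unigrams.get(w2, 0))
    p_uni = mle(unigrams.get(w3, 0), total)
    return l3 * p_tri + l2 * p_bi + l1 * p_uni


# Hypothetical toy counts, not taken from Google Ngrams.
ip_unigrams = {"memory": 20, "of": 100, "ten": 10, "years": 30}
ip_bigrams = {("memory", "of"): 8, ("of", "ten"): 5, ("ten", "years"): 4}
ip_trigrams = {("of", "ten", "years"): 2}

# Seen trigram: all three estimates contribute.
seen = interpolated_trigram_prob("of", "ten", "years", ip_unigrams, ip_bigrams, ip_trigrams)
# Unseen trigram: the lower-order estimates keep the probability above zero.
unseen = interpolated_trigram_prob("memory", "of", "ten", ip_unigrams, ip_bigrams, ip_trigrams)
print(seen, unseen)
```

The point of the mix is visible in the second call: the trigram "memory of ten" has a zero count, yet its interpolated probability stays positive because the bigram and unigram terms back it up.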

In [14]:
for lc, example, rc in examples:
    print (grams.byContextualLikelihoodRatio(example, lc, rc, interpolation=True, smoothing=0.1, threshold=1, 
                                             bigram_lambdas=(0.9,0.1),
                                             trigram_lambdas=(0.7, 0.2, 0.1)))

aid and
of ten
often
themselves down
Bespeak
Senator admits
Safeguard
