In [None]:
from tqdm import tqdm
import nltk
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('cess_cat')

Load a corpus in Catalan or English. Each nltk corpus is a large collection of text that has already been tokenized and segmented into sentences.

The ``gutenberg`` corpus is a set of English literature classics. The ``cess_cat`` corpus comes from https://www.cs.upc.edu/~nlp/wikicorpus/, the "120 Million Word Spanish Corpus", which has a Catalan subset of 50 million words scraped from the Catalan Wikipedia in 2006.

In [None]:
name_corpus = 'cess_cat'

if name_corpus=='cess_cat':
    from nltk.corpus import cess_cat as corpus
    # clean the corpus of strange words
    words = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    for w in tqdm(corpus.words()):
        if w not in words_to_remove:
            words.append(w)

elif name_corpus=='gutenberg':
    from nltk.corpus import gutenberg as corpus
    print(corpus.fileids())
    words = corpus.words()
else:
    raise ValueError('unknown corpus: {}'.format(name_corpus))

print('corpus {} : {} words, {} sentences'
      .format(name_corpus, len(words), len(corpus.sents())))

Build a language model from bigrams. Here a language model is just a dictionary
whose keys are the conditions (one word each) and whose values are ``FreqDist``
objects, that is, dictionaries that map each possible next word to the number
of times it follows the condition.
This is adapted from https://www.nltk.org/book/ch02.html, section 2.4.
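
To make this structure concrete, here is a minimal sketch that builds the same kind of mapping with the standard library only, on a made-up toy sentence. The helper name ``build_bigram_counts`` and the toy words are assumptions for illustration; ``nltk.ConditionalFreqDist`` offers the same dictionary-of-counters behaviour plus extra methods.

```python
from collections import Counter, defaultdict

def build_bigram_counts(words):
    """Map each word to a Counter of the words that follow it."""
    counts = defaultdict(Counter)
    for previous, following in zip(words, words[1:]):
        counts[previous][following] += 1
    return counts

# Toy data, not taken from any nltk corpus.
toy_words = 'the cat sat on the mat and the cat slept'.split()
toy_cfd = build_bigram_counts(toy_words)

# Words seen after 'the', most frequent first.
print(toy_cfd['the'].most_common())  # [('cat', 2), ('mat', 1)]
```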


In [None]:
grams = list(nltk.bigrams(words))
# also trigrams, ngrams, everygrams(max_len)
cfd = nltk.ConditionalFreqDist(grams)
conditions = cfd.conditions() # compute the list of conditions only once
print(conditions)
for i in [100, 200, 300, 400]:
    print(conditions[i])
    print(cfd[conditions[i]].most_common())
    print('--------------')

if name_corpus == 'cess_cat':
    freq_dist = cfd['Una']
else:
    freq_dist = cfd['The']

print(freq_dist.items())
print(freq_dist.max())
print(list(freq_dist.elements()))

Sample text from the language model
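
One possible way to sample is to start from a word and repeatedly draw the next word with probability proportional to its bigram count. The sketch below is only an illustration under assumptions: it works on a toy dictionary of ``Counter`` objects instead of the real ``cfd``, the name ``sample_from_counts`` is hypothetical, and stopping when a word has no recorded successor is one design choice among several.

```python
import random
from collections import Counter, defaultdict

def sample_from_counts(counts, first_word, num_words=15, seed=None):
    """Draw up to num_words words, each one conditioned on the previous word."""
    rng = random.Random(seed)
    generated = [first_word]
    while len(generated) < num_words:
        followers = counts.get(generated[-1])
        if not followers:
            break  # dead end: this word was never followed by anything
        candidates = list(followers.keys())
        weights = list(followers.values())
        # random.choices draws proportionally to the given weights
        generated.append(rng.choices(candidates, weights=weights)[0])
    return ' '.join(generated)

# Toy bigram counts, built from a made-up sentence.
toy_words = 'the cat sat on the mat and the cat slept'.split()
toy_counts = defaultdict(Counter)
for previous, following in zip(toy_words, toy_words[1:]):
    toy_counts[previous][following] += 1

print(sample_from_counts(toy_counts, 'the', num_words=8, seed=1))
```

Since ``FreqDist`` is a subclass of ``Counter``, the same loop should carry over to the bigram ``cfd`` built above.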

In [None]:
import random

def sample_bigram_model(cfd_bigrams, last_word, num_words=15):
    pass
    # TODO


if name_corpus=='cess_cat':
    print(sample_bigram_model(cfd, 'El', 100))
    print(sample_bigram_model(cfd, 'La', 100))
    print(sample_bigram_model(cfd, 'Per', 100))
else:
    print(sample_bigram_model(cfd, 'The', 100))
    print(sample_bigram_model(cfd, 'For', 100))

Extending the previous function to trigrams, 4-grams, ..., n-grams is long and complicated
because the conditions of the cfd are no longer single words but pairs, triplets, ..., tuples of n-1 words. In addition, the probability of not finding the previous 2, 3, ... generated words among the conditions is very high. So it is better to rely
on the ``lm`` package of nltk. It also supports adding ``<s>`` and ``</s>`` symbols to sentences (padding), different types of smoothing and backoff, and sampling text.

Build a proper language model with support for ``<s>``, ``</s>``, smoothing, backoff, sampling and computation of perplexity. See how here
https://www.nltk.org/api/nltk.lm.html
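
As a rough illustration of that API, the sketch below trains an ``MLE`` and a ``Laplace`` trigram model on a made-up toy corpus and compares one conditional probability and one perplexity. The toy sentences and the helper name ``train`` are assumptions for this example only; on the real corpus the numbers will of course differ.

```python
import math
from nltk.lm.models import MLE, Laplace
from nltk.lm.preprocessing import pad_both_ends, padded_everygram_pipeline
from nltk.util import ngrams

# Made-up toy corpus: a list of tokenized sentences.
toy_text = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'cat', 'ate', 'the', 'fish'],
    ['the', 'dog', 'sat', 'on', 'the', 'rug'],
]
n = 3

def train(model_class, order, sentences):
    # The pipeline returns lazy iterators, so build them anew for each model.
    train_data, vocab = padded_everygram_pipeline(order, sentences)
    model = model_class(order)
    model.fit(train_data, vocab)
    return model

mle = train(MLE, n, toy_text)
laplace = train(Laplace, n, toy_text)

# P(sat | the cat): relative frequency for MLE, add-one smoothed for Laplace.
p_mle = mle.score('sat', ['the', 'cat'])
p_laplace = laplace.score('sat', ['the', 'cat'])

# Perplexity of an unseen sentence over its padded trigrams. Only the smoothed
# model is used here because MLE gives zero probability to unseen trigrams.
unseen = ['the', 'dog', 'ate', 'the', 'fish']
test_trigrams = list(ngrams(pad_both_ends(unseen, n=n), n))
ppl_laplace = laplace.perplexity(test_trigrams)

print(p_mle, p_laplace, ppl_laplace)
```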

In [None]:
if name_corpus=='cess_cat':
    text = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    #for s in tqdm(corpus.sents()[:1000]): # debug or quickly train the model
    for s in tqdm(corpus.sents()):
        new_s = [w for w in s if w not in words_to_remove]
        text.append(new_s[:-1]) # drop the sentence-final period
else:
    text = []
    for s in tqdm(corpus.sents()):
        text.append(s[:-1]) # drop the sentence-final period
    
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm.models import MLE, Laplace, StupidBackoff

n = 3
# TODO:
# for each of the three types of language model in last import:
#     for n=3, 4, 5 (tri-grams, 4-grams, 5-grams)
#         create a model instance
#         pad sentences in text
#         train the model
#         sample a text with 100 words
# Hint: join the sampled words as in ' '.join(['These', 'are', 'some', 'words'])
#
# Compare the results: which combination seems most realistic?


## Hints:

1. Once ``text`` is the list of sentences, you have to insert the sentence start and end symbols with ``padded_everygram_pipeline()``.

2. But then, this won't do: 

    ```
    lm.fit(train, vocab)
    print(' '.join(lm.generate(100, random_seed=4)))
    ```

    because after ``</s>`` the language model keeps generating ``</s>`` all the time:

    ``Arroyo va eludir relacionar els nous sortejos per millorar la imatge d' unitat , malgrat les darreres pluges , que han refusat obertament la invitació per considerar -lo un organisme compensatori pel transvasament de l' Ebre que es contempla al PHN ' </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>``


    The reason is explained [here](https://stackoverflow.com/questions/60295058/nltk-mle-model-clarification-trigrams-and-greater) for the case of tri-grams. In short:

    *Each sentence is completely independent of each other. The model doesn't know what came 
    before that sentence nor what comes after. Also, remember that you're training a trigram model, so the last two words in every sentence are ``</s>, </s>``. Therefore, the model learns that ``</s>`` is followed by ``</s>`` with a very high probability but it never learns that ``</s>`` can sometimes be followed by ``<s>``. So the easiest solution to your problem is just to manually start a new sentence (i.e. call generate() again) every time you see ``</s>``*.

3. In practical terms, for each model you have to do something like the next cell:

In [None]:
num_sentences_to_generate = 5 # example value, change at will
sentences = []
# instantiate the model
# pad the sentences of text with padded_everygram_pipeline()
# train the model (once is enough, so do it before the loop)
for i in range(num_sentences_to_generate):
    # make the first sentence with text_seed '<s>' (plus random_seed) and the
    # following ones with the last words of the previous sentence plus
    # </s> <s> at the end
    if i == 0: # first sentence
        text_seed = ...
        generated_words = ...
        # get many words, more than those in a sentence
    else: # 2nd, 3rd... sentences are conditioned on a context = ending of the previous one
        text_seed = ...
        generated_words = ...
    # remove '<s>', '</s>' from the generated words, this is the new sentence
    # add '.' as the last word of the new sentence

# print sentences = list of lists of words
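
For reference, a simplified variant of this procedure can be sketched on a toy corpus. Note the differences, which are assumptions of this sketch and not the approach outlined in the cell above: every sentence is started afresh from the ``<s>`` padding (the "call generate() again" advice in the quoted answer) instead of being conditioned on the ending of the previous sentence, and the toy sentences and the name ``generate_sentences`` are made up for illustration.

```python
from nltk.lm.models import MLE
from nltk.lm.preprocessing import padded_everygram_pipeline

# Made-up toy corpus: a list of tokenized sentences.
toy_text = [
    ['the', 'cat', 'sat', 'on', 'the', 'mat'],
    ['the', 'cat', 'ate', 'the', 'fish'],
    ['the', 'dog', 'sat', 'on', 'the', 'rug'],
]
n = 3
train_data, vocab = padded_everygram_pipeline(n, toy_text)
lm = MLE(n)
lm.fit(train_data, vocab)  # train once, then generate as often as needed

def generate_sentences(model, num_sentences, max_words=20, seed=0):
    """Generate sentences one at a time, cutting each at its first '</s>'."""
    start = ['<s>'] * (model.order - 1)  # same left padding as in training
    sentences = []
    for i in range(num_sentences):
        # Ask for more words than a sentence needs; a different seed per
        # sentence avoids repeating the same output.
        words = model.generate(max_words, text_seed=start, random_seed=seed + i)
        sentence = []
        for w in words:
            if w == '</s>':
                break  # everything after the first '</s>' is more '</s>'
            if w != '<s>':
                sentence.append(w)
        sentences.append(sentence + ['.'])
    return sentences

for s in generate_sentences(lm, 3):
    print(' '.join(s))
```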