Load a corpus in Catalan or English. The nltk corpora are large collections of text that have already been tokenized and segmented into sentences.

The ``gutenberg`` corpus comes from a set of classics of English literature. The ``cess_cat`` corpus comes from https://www.cs.upc.edu/~nlp/wikicorpus/, the "120 Million Word Spanish Corpus", which has a Catalan subset of 50 million words scraped from the Catalan Wikipedia in 2006.

In [None]:
from tqdm import tqdm
import nltk
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('cess_cat')

In [None]:
name_corpus = 'cess_cat'

if name_corpus=='cess_cat':
    from nltk.corpus import cess_cat as corpus
    # clean the corpus of strange words
    words = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    for w in tqdm(corpus.words()):
        if w not in words_to_remove:
            words.append(w)

elif name_corpus=='gutenberg':
    from nltk.corpus import gutenberg as corpus
    print(corpus.fileids())
    words = corpus.words()
else:
    raise ValueError('unknown corpus: {}'.format(name_corpus))

print('corpus {} : {} words, {} sentences'
      .format(name_corpus, len(words), len(corpus.sents())))

Build a language model from bigrams. Here the LM is just a dictionary:
each key is a condition (one word) and each value is a ``FreqDist``
object, itself a dictionary whose keys are the possible next words and
whose values are their numbers of occurrences.
This is adapted from https://www.nltk.org/book/ch02.html, section 2.4.
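
As a rough illustration of that dictionary-of-dictionaries structure, the sketch below builds the same kind of mapping with plain Python containers on a made-up toy sentence. The names ``toy_words`` and ``toy_model`` are invented here for illustration and are not part of nltk.

```python
from collections import Counter, defaultdict

# Toy token list (an assumption for illustration, not taken from the corpus)
toy_words = "the cat sat on the mat and the cat slept".split()

# key = condition (previous word), value = Counter of the words that follow it
toy_model = defaultdict(Counter)
for prev_word, next_word in zip(toy_words, toy_words[1:]):
    toy_model[prev_word][next_word] += 1

# Counts of the words seen right after 'the'
print(toy_model["the"].most_common())

# Relative frequency estimate of p('cat' | 'the')
total = sum(toy_model["the"].values())
p_cat_given_the = toy_model["the"]["cat"] / total
print(p_cat_given_the)
```

``nltk.ConditionalFreqDist`` plays the role of ``toy_model`` here, with each ``FreqDist`` behaving much like a ``Counter``.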


In [None]:
grams = list(nltk.bigrams(words))
# also trigrams, ngrams, everygrams(max_len)
cfd = nltk.ConditionalFreqDist(grams)
print(cfd.conditions())
for i in [100, 200, 300, 400]:
    print(cfd.conditions()[i])
    print(cfd[cfd.conditions()[i]].most_common())
    print('--------------')

if name_corpus == 'cess_cat':
    freq_dist = cfd['Una']
else:
    freq_dist = cfd['The']

print(freq_dist.items())
print(freq_dist.max())
print(list(freq_dist.elements()))

Sample text from the language model

In [None]:
import random

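def sample_bigram_model(cfd_bigrams, last_word, num=15):
    """Return up to num words sampled from the bigram model, starting at last_word."""
    generated = []
    for _ in range(num):
        generated.append(last_word)
        # if we do w_k = \arg \max_{w \in V} p(w | w_{k-1}) with
        #     next_word = cfd_bigrams[last_word].max()
        # we get caught in a cycle, repeating again and again
        # the same few words. It is better to sample from the
        # probability distribution:
        candidates = list(cfd_bigrams[last_word].elements())
        if not candidates:  # last_word never appears as a condition
            break
        last_word = random.choice(candidates)
    # return the text so that the print() calls below show it
    return ' '.join(generated)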


if name_corpus=='cess_cat':
    print(sample_bigram_model(cfd, 'El', 100))
    print(sample_bigram_model(cfd, 'La', 100))
    print(sample_bigram_model(cfd, 'Per', 100))
else:
    print(sample_bigram_model(cfd, 'The', 100))
    print(sample_bigram_model(cfd, 'For', 100))
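
The comment inside ``sample_bigram_model`` says that always taking the most frequent next word ends in a cycle. A minimal sketch on made-up toy data makes the difference visible; ``generate_greedy``, ``generate_sampled`` and ``toy_model`` are hypothetical names used only for this illustration.

```python
import random
from collections import Counter, defaultdict

toy_words = "a b a b a c d a b".split()  # made-up toy data
toy_model = defaultdict(Counter)
for prev_word, next_word in zip(toy_words, toy_words[1:]):
    toy_model[prev_word][next_word] += 1

def generate_greedy(model, word, num):
    """Always follow the most frequent next word (the arg max)."""
    out = []
    for _ in range(num):
        out.append(word)
        if not model[word]:
            break
        word = model[word].most_common(1)[0][0]
    return out

def generate_sampled(model, word, num, rng):
    """Draw the next word with probability proportional to its count."""
    out = []
    for _ in range(num):
        out.append(word)
        followers = model[word]
        if not followers:
            break
        word = rng.choices(list(followers), weights=list(followers.values()))[0]
    return out

greedy = generate_greedy(toy_model, "a", 8)
sampled = generate_sampled(toy_model, "a", 8, random.Random(0))
print(" ".join(greedy))   # alternates between 'a' and 'b'
print(" ".join(sampled))
```

Weighted sampling with ``choices`` is equivalent to picking uniformly from ``FreqDist.elements()``, which repeats each word as many times as it was counted.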

Extending the previous function to trigrams, 4-grams, ... n-grams is long and complicated
because the conditions of the cfd are no longer single words but tuples of 2, 3, ... n-1 words. In addition, the probability that the last 2, 3, ... n-1
generated words do not appear among the conditions is very high. So it is better to rely
on the ``lm`` package of nltk. It also supports padding sentences with the ``<s>`` and ``</s>`` symbols, different types of smoothing and backoff, and sampling text.
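
To make the difficulty concrete, here is a small sketch of what a hand-written n-gram sampler needs: conditions that are tuples of words, and a fallback to shorter conditions when the current one was never seen. It is a simplification under stated assumptions (plain backoff without any discounting, made-up toy data, hypothetical helper names ``build_counts`` and ``sample_next``) and not how ``nltk.lm`` is implemented.

```python
import random
from collections import Counter, defaultdict

def build_counts(tokens, max_order):
    """Map each context tuple of length 0..max_order-1 to a Counter of next words."""
    counts = defaultdict(Counter)
    for i, word in enumerate(tokens):
        for k in range(max_order):
            if i - k < 0:
                break
            context = tuple(tokens[i - k:i])
            counts[context][word] += 1
    return counts

def sample_next(counts, history, max_order, rng):
    """Sample the next word, shortening the context until it has been seen."""
    context = tuple(history[-(max_order - 1):]) if max_order > 1 else ()
    while context and context not in counts:
        context = context[1:]  # drop the oldest word and try again
    followers = counts[context]  # the empty context holds unigram counts
    return rng.choices(list(followers), weights=list(followers.values()))[0]

toy_tokens = "the cat sat on the mat".split()  # made-up toy data
counts = build_counts(toy_tokens, max_order=3)
rng = random.Random(0)

# ('the', 'cat') was seen, so the two-word condition is used directly
seen = sample_next(counts, ["the", "cat"], 3, rng)
# ('dog', 'cat') was never seen, so the sampler backs off to ('cat',)
backed_off = sample_next(counts, ["dog", "cat"], 3, rng)
print(seen, backed_off)
```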

Build a proper language model with support for ``<s>``, ``</s>``, smoothing, backoff, sampling and computation of perplexity. The API is documented at
https://www.nltk.org/api/nltk.lm.html
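
Before using the package, it may help to see what the ``<s>``, ``</s>`` padding does to the counts. The sketch below pads two made-up sentences by hand, assuming n-1 symbols on each side, and collects what follows each condition. ``pad``, ``ngrams`` and ``followers`` are names invented for this illustration; in nltk the padding is done by ``padded_everygram_pipeline``.

```python
from collections import Counter, defaultdict

def pad(sentence, n):
    """Add n-1 start and n-1 end symbols around a sentence."""
    return ["<s>"] * (n - 1) + list(sentence) + ["</s>"] * (n - 1)

def ngrams(tokens, n):
    """All contiguous n-token tuples of a list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

n = 3
toy_sentences = [["el", "gat", "dorm"], ["la", "gata", "menja"]]  # made-up data

# condition (n-1 words) -> Counter of next words
followers = defaultdict(Counter)
for sentence in toy_sentences:
    for *context, word in ngrams(pad(sentence, n), n):
        followers[tuple(context)][word] += 1

print(ngrams(pad(toy_sentences[0], n), n))
# Words that can start a sentence
print(followers[("<s>", "<s>")])
# What follows a condition ending in the end symbol
print(followers[("dorm", "</s>")])
```

In these counts the end symbol is only ever followed by another end symbol, and ``<s>`` never appears as a next word, which is why generation has to be restarted by hand after each ``</s>``.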

In [None]:
if name_corpus=='cess_cat':
    text = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    #for s in tqdm(corpus.sents()[:1000]): # to debug or to train the model quickly
    for s in tqdm(corpus.sents()):
        new_s = [w for w in s if w not in words_to_remove]
        text.append(new_s[:-1]) # drop the last token, the final full stop
else:
    text = []
    for s in tqdm(corpus.sents()):
        text.append(s[:-1]) # drop the last token, the final full stop
    


In [20]:

from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm.models import MLE, Laplace, StupidBackoff

n = 4
lm1 = MLE(n)
lm2 = Laplace(n)
lm3 = StupidBackoff(alpha=0.4, order=n)
for lm in [lm1, lm2, lm3]:
    print(lm)
    train, vocab = padded_everygram_pipeline(n, text)
    # can not reuse the same pair of train, vocab!

    # train is a generator of n-grams
    #   print('train', list(list(train)[0])[:20])
    # once printed can not be used again!
    # vocab is the sentences in text padded with <s>, </s> and
    # put in a single list
    #   print('vocab', list(vocab)[:30])
   
    # this won't do:
    #   lm.fit(train, vocab)
    #   print(' '.join(lm.generate(100, random_seed=4)))
    #
    # because after </s> the language model keeps generating </s> all the time.
    # The reason is explained here for the case of trigrams:
    # https://stackoverflow.com/questions/60295058/nltk-mle-model-clarification-trigrams-and-greater
    # In short: each sentence is completely independent of the others. The model
    # doesn't know what came before a sentence nor what comes after it. With a
    # trigram model, the last two tokens of every padded sentence are
    # ('</s>', '</s>'). Therefore the model learns that '</s>' is followed by
    # '</s>' with a very high probability, but it never learns that '</s>' can
    # sometimes be followed by '<s>'.
    # So the easiest solution is to manually start a new sentence (i.e. call
    # generate() again) every time '</s>' appears.

    lm.fit(train, vocab)
    num_sentences_to_generate = 3
    sentences = []
    generated_words = None
    for i in range(num_sentences_to_generate):
        # this produces always the same sentence
        #   generated_words = lm.generate(100, random_seed=4) 
        # this produces sentences that don't start like a sentence
        #   generated_words = lm.generate(100, text_seed='<s>')
        # this also returns always the same sentence
        #   generated_words = lm.generate(100, text_seed='<s>', random_seed=4) 

        # instead, make first sentence with seed '<s>' and following ones
        # with the last words of previous sentence plus </s> <s>
        if i==0: # first sentence
            # text_seed must be a list of tokens: a bare string '<s>' is
            # iterated character by character, giving '<', 's', '>'
            generated_words = lm.generate(100, text_seed=['<s>'] * (n - 1), random_seed=4)
            # get many words, more than those in a sentence
        else: # 2nd, 3rd... sentences are conditioned on the ending of the previous one
            text_seed = new_sentence[-n+1:] + ['</s>', '<s>']
            # print('text seed', text_seed)
            generated_words = lm.generate(100, text_seed=text_seed)
        

        # print('generated words', generated_words)
        new_sentence = [w for w in generated_words if w not in ['<s>', '</s>']]
        # print('new sentence', new_sentence)
        sentences += new_sentence + ["."]

    print('Generated text')
    print(' '.join(sentences))
    print('-------------')
    
    #print(lm.vocab.lookup(text[0]))
    #print(lm.vocab.lookup(['beeeee', 'muuuu']))
    #print(lm.counts)
    #print(lm.score('El'), lm.score('el'), lm.score('dia'), lm.score("<UNK>"))
    #print(lm.perplexity([('relació', 'amb', 'les', 'empreses')]))
    

<nltk.lm.models.MLE object at 0x7f3e19f6ada0>
<s> <s> <s> Aquí , es riuen de nosaltres " </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
Generated text
Aquí , es riuen de nosaltres " . El dret al vot " . El fiscal demana penes de 28 i 11 anys de presó , vuit anys i sis mesos de presó i 120.000 pessetes de multa un agricultor que tenia emmagatzemats purins a la_seva granja .
-------------
<nltk.lm.models.Laplace object at 0x7f3e523d2050>
<s> <s> <s> Arroyo va eludir relacionar els nous sortejos per millorar la imatge d' unitat , malgrat les darreres pluges , que han refusat obertament la invi