Load a corpus in Catalan or English. The nltk corpora are large collections of text that have already been tokenized and segmented into sentences.

The ``gutenberg`` corpus is a selection of English literature classics. The ``cess_cat`` corpus comes from https://www.cs.upc.edu/~nlp/wikicorpus/, the "120 Million Word Spanish Corpus", which includes a 50-million-word Catalan subset scraped from Viquipèdia (the Catalan Wikipedia) in 2006. The version loaded below contains about 500,000 tokens.

In [3]:
from tqdm import tqdm
import nltk
nltk.download('punkt')
nltk.download('gutenberg')
nltk.download('cess_cat')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.
[nltk_data] Downloading package gutenberg to /root/nltk_data...
[nltk_data]   Unzipping corpora/gutenberg.zip.
[nltk_data] Downloading package cess_cat to /root/nltk_data...
[nltk_data]   Unzipping corpora/cess_cat.zip.


True

In [4]:
name_corpus = 'cess_cat'

if name_corpus=='cess_cat':
    from nltk.corpus import cess_cat as corpus
    # clean the corpus of strange words
    words = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    for w in tqdm(corpus.words()):
        if w not in words_to_remove:
            words.append(w)

elif name_corpus=='gutenberg':
    from nltk.corpus import gutenberg as corpus
    print(corpus.fileids())
    words = corpus.words()
else:
    raise ValueError('unknown corpus: {}'.format(name_corpus))

print('corpus {} : {} words, {} sentences'
      .format(name_corpus, len(words), len(corpus.sents())))

100%|██████████| 503853/503853 [00:31<00:00, 16246.33it/s]


corpus cess_cat : 492004 words, 17104 sentences


Build a language model from bigrams. Such a model is just a dictionary
whose keys (the conditions) are single words and whose values are
``FreqDist`` objects, that is, dictionaries that map each possible
next word to its number of occurrences.
This is adapted from section 2.4 of https://www.nltk.org/book/ch02.html
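
To make the structure concrete, here is a minimal plain-Python sketch of the same idea, with ``collections.Counter`` standing in for ``FreqDist``. The toy sentence and the names ``toy_words`` and ``toy_model`` are made up for illustration and are not part of the notebook.

```python
from collections import Counter, defaultdict

# Hypothetical toy corpus, only for illustration.
toy_words = 'the cat sat on the mat and the cat slept'.split()

# key = previous word, value = Counter of the words that follow it
toy_model = defaultdict(Counter)
for prev_word, next_word in zip(toy_words, toy_words[1:]):
    toy_model[prev_word][next_word] += 1

# counts of the words seen after 'the'
print(dict(toy_model['the']))
# most frequent successor of 'the'
print(toy_model['the'].most_common(1))
```

``nltk.ConditionalFreqDist`` offers the same kind of lookup, ``cfd[condition][next_word]``, plus helpers such as ``conditions()`` that are used in the next cell.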


In [5]:
grams = list(nltk.bigrams(words))
# nltk also provides trigrams, ngrams and everygrams(max_len)
cfd = nltk.ConditionalFreqDist(grams)
conditions = cfd.conditions()
print(conditions)
for i in [100, 200, 300, 400]:
    print(conditions[i])
    print(cfd[conditions[i]].most_common())
    print('--------------')

if name_corpus == 'cess_cat':
    freq_dist = cfd['Una']
else:
    freq_dist = cfd['The']

print(freq_dist.items())
print(freq_dist.max())
print(list(freq_dist.elements()))

['El', 'Tribunal_Suprem', 'TS', 'ha', 'confirmat', 'la', 'condemna', 'a', 'quatre', 'anys', "d'", 'inhabilitació', 'especial', 'i', 'una', 'multa', 'de', '3,6', 'milions', 'pessetes', 'per', 'veterinaris', 'gironins', ',', 'haver', '-se', 'beneficiat', 'dels', 'càrrecs', 'públics', 'que', 'desenvolupaven', 'la_seva', 'relació', 'amb', 'les', 'empreses', 'càrniques', 'zona', 'en', 'oferir', '-los', 'serveis', 'particulars', '.', 'La', 'sentència', 'qual', 'tingut', 'accés', 'Intra-ACN', 'desestima', 'els', 'recursos', 'interposats', 'pels', 'processats', 'Albert_Bramón', 'president', 'del', 'Col·legi_de_Veterinaris_de_Girona', 'el', 'moment', 'fets', 'Josefina_J.', 'Pere_C.', 'Mateu_B.', 'actuaven', 'com_a', 'inspectors', 'Generalitat', 'van', 'ser', 'condemnats', "l'", 'Audiència_de_Girona', 'un', 'delicte', 'negociacions', 'prohibides', 'funcionaris', 'Els', "s'", 'encarregaven', 'control', 'higiene', 'La_Garrotxa', 'aquest', 'motiu', 'feien', 'visites', 'periòdiques', 'analítiques', 

Sample text from the language model

In [6]:
import random

def sample_bigram_model(cfd_bigrams, last_word, num=15):
    """Return up to num words sampled from the bigram model as one string."""
    generated = []
    for _ in range(num):
        generated.append(last_word)
        freq_dist = cfd_bigrams[last_word]
        if not freq_dist:
            # last_word was never seen as a condition: nothing to sample
            break
        # if we do w_k = \arg \max_{w \in V} p(w | w_{k-1}) with
        #     next_word = cfd_bigrams[last_word].max()
        # we get caught in a cycle, repeating again and again
        # the same few words. It is better to sample from the
        # probability distribution, weighting each word by its count
        next_words, counts = zip(*freq_dist.items())
        last_word = random.choices(next_words, weights=counts)[0]
    # return the text instead of printing it, so that print() at the
    # call site does not show a stray None
    return ' '.join(generated)


if name_corpus=='cess_cat':
    print(sample_bigram_model(cfd, 'El', 100))
    print(sample_bigram_model(cfd, 'La', 100))
    print(sample_bigram_model(cfd, 'Per', 100))
else:
    print(sample_bigram_model(cfd, 'The', 100))
    print(sample_bigram_model(cfd, 'For', 100))

El que els expositors , més respecte_al PP seria garantia de 65.000 persones immigrades , actualment en l' oportunitat única a l' edifici . Durant un bosc de l' Estat només han fracassat , s' ha declarat recentment aprovada pel nomenament de Nova_York sense que aquesta discussió i 10 anys - què també va estavellar el 1999 un fòrum de l' objectiu de migdia , a la unitat del conseller de finals del descans , quin número 3 i fusta de Catalunya hi hagi centrat l' avió ja els errors en el sector i " Les_seves memòries , i '
La mecànica falla de magribins i animació per estudiar i afegeix que representa la Venda d' ETA aclareix que la possibilitat que travessava tota la precarietat de Bering , d'_acord . En les primeres paraules serveixen de l' any 50 matèries obligatòries , el compromís del Col·legi_Jardí competirà en català , que se sent gairebé centenari del treballador , l' Estat espanyol , celebrarà al seu pare , el trauma de 4 i funcionàries " . Aquest estudi de les millors en sub
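
The remark in the comments of ``sample_bigram_model`` about getting caught in a cycle can be checked on a tiny example. The sketch below is only an illustration: the three-word model ``toy_model`` and the helper ``generate`` are hypothetical, and plain ``Counter`` objects stand in for ``FreqDist``.

```python
import random
from collections import Counter

# Hypothetical toy bigram counts: previous word -> Counter of next words.
toy_model = {
    'a': Counter({'b': 3, 'c': 2}),
    'b': Counter({'a': 4, 'c': 1}),
    'c': Counter({'a': 1, 'b': 1}),
}

def generate(model, word, num, greedy, rng):
    """Generate num words, picking each next word greedily or by sampling."""
    out = []
    for _ in range(num):
        out.append(word)
        counts = model[word]
        if greedy:
            # arg max: always take the most frequent successor
            word = counts.most_common(1)[0][0]
        else:
            # sample a successor with probability proportional to its count
            candidates = list(counts)
            weights = [counts[w] for w in candidates]
            word = rng.choices(candidates, weights=weights)[0]
    return out

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
greedy_text = generate(toy_model, 'a', 8, greedy=True, rng=rng)
sampled_text = generate(toy_model, 'a', 8, greedy=False, rng=rng)
print(' '.join(greedy_text))   # alternates between 'a' and 'b'
print(' '.join(sampled_text))
```

The greedy variant never reaches ``'c'`` because it is never the most frequent successor, whereas sampling can visit every word that has a nonzero count.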

Extending the previous function to trigrams, 4-grams, ..., n-grams is long and complicated
because the conditions of the ``cfd`` are no longer single words but tuples of 2, 3, ..., n-1 words. In addition, the longer the context, the higher the probability that the previous 2, 3, ..., n-1
generated words are not found among the conditions. It is therefore better to rely
on the ``lm`` package of nltk, which also supports padding sentences with ``<s>`` and ``</s>`` symbols, several types of smoothing and backoff, and text sampling.
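
As a rough plain-Python illustration of what changes with longer contexts, the sketch below builds trigram counts whose keys are tuples of two words. The helper ``padded_ngrams``, the toy sentences and the choice of padding each side with n-1 boundary symbols are assumptions of this sketch, not code taken from nltk.

```python
from collections import Counter, defaultdict

def padded_ngrams(sentence, n):
    """Yield the n-grams of a sentence padded with n-1 <s> and </s> symbols."""
    padded = ['<s>'] * (n - 1) + list(sentence) + ['</s>'] * (n - 1)
    for i in range(len(padded) - n + 1):
        yield tuple(padded[i:i + n])

# Hypothetical toy sentences, only for illustration.
toy_sents = [['el', 'gat', 'dorm'], ['el', 'gat', 'menja']]

# key = tuple with the n-1 previous words, value = Counter of next words
n = 3
toy_trigrams = defaultdict(Counter)
for sent in toy_sents:
    for *context, word in padded_ngrams(sent, n):
        toy_trigrams[tuple(context)][word] += 1

# successors of the context ('el', 'gat')
print(dict(toy_trigrams[('el', 'gat')]))
# an unseen context has no counts to sample from
print(('gat', 'el') in toy_trigrams)
```

Even with identical sentence openings, only a handful of two-word contexts exist here, which hints at how sparse the conditions become as n grows.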

Build a proper language model with support for ``<s>``, ``</s>``, smoothing, backoff, sampling and computation of perplexity. See
https://www.nltk.org/api/nltk.lm.html for details.
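
Before using the library, it may help to see why smoothing matters for perplexity. The following is a minimal plain-Python sketch on hypothetical toy data; the functions ``mle_score``, ``laplace_score`` and ``perplexity`` are illustrative names and only approximate what the corresponding nltk models compute.

```python
import math
from collections import Counter, defaultdict

# Hypothetical toy data: two tiny sentences that are already padded.
toy_sents = [['<s>', 'a', 'b', '</s>'], ['<s>', 'a', 'c', '</s>']]
counts = defaultdict(Counter)
vocab = set()
for sent in toy_sents:
    vocab.update(sent)
    for prev_word, word in zip(sent, sent[1:]):
        counts[prev_word][word] += 1

def mle_score(word, context):
    """Relative frequency; 0 for anything unseen."""
    total = sum(counts[context].values())
    return counts[context][word] / total if total else 0.0

def laplace_score(word, context):
    """Add-one smoothing: every vocabulary word gets one extra count."""
    total = sum(counts[context].values())
    return (counts[context][word] + 1) / (total + len(vocab))

def perplexity(score, bigrams):
    """2 ** (average negative log2 probability) over the given bigrams."""
    log_probs = []
    for context, word in bigrams:
        p = score(word, context)
        log_probs.append(math.log2(p) if p > 0 else -math.inf)
    return 2 ** (-sum(log_probs) / len(log_probs))

seen = [('<s>', 'a'), ('a', 'b')]
unseen = [('<s>', 'a'), ('b', 'c')]
print(perplexity(mle_score, seen))        # finite
print(perplexity(mle_score, unseen))      # inf: one bigram has probability 0
print(perplexity(laplace_score, unseen))  # finite thanks to smoothing
```

A single unseen n-gram is enough to make the unsmoothed perplexity infinite, which is one reason to prefer a smoothed or backoff model when scoring new text.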

In [7]:
if name_corpus=='cess_cat':
    text = []
    words_to_remove = ['*0*', '-Fpa-', '-Fpt-']
    # for s in tqdm(corpus.sents()[:1000]):  # subset to debug or train the model quickly
    for s in tqdm(corpus.sents()):
        new_s = [w for w in s if w not in words_to_remove]
        text.append(new_s[:-1])  # drop the last token, normally the final period
else:
    text = []
    for s in tqdm(corpus.sents()):
        text.append(s[:-1])  # drop the last token, normally the final period
    
from nltk.lm.preprocessing import padded_everygram_pipeline
from nltk.lm.models import MLE, Laplace, StupidBackoff

n = 3
lm1 = MLE(n)
lm2 = Laplace(n)
lm3 = StupidBackoff(alpha=0.4, order=n)
for lm in [lm1, lm2, lm3]:
    print(lm)
    # padded_everygram_pipeline returns lazy iterators that fit() consumes,
    # so train and vocab must be rebuilt for each model
    train, vocab = padded_everygram_pipeline(n, text)
    lm.fit(train, vocab)

    #print(lm.vocab.lookup(text[0]))
    #print(lm.vocab.lookup(['beeeee', 'muuuu']))
    print(lm.counts)
    #print(lm.score('El'), lm.score('el'), lm.score('dia'), lm.score("<UNK>"))
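    # NOTE: the model order is 3 but this tuple has 4 words, so its 3-word
    # context is never found among the counts: MLE gives probability 0
    # (perplexity inf) and Laplace gives 1/len(lm.vocab)
    print(lm.perplexity([('relació', 'amb', 'les', 'empreses')]))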
    print(' '.join(lm.generate(100, random_seed=4)))

100%|██████████| 17104/17104 [00:26<00:00, 638.78it/s]


<nltk.lm.models.MLE object at 0x7f1b9dfc4f40>
<NgramCounter with 3 ngram orders and 1578636 ngrams>
inf
El Girona_Convention_Bureau facilitarà , a_través_del bucle loop repetint successivament el mateix col·lapse que es fa una reivindicació al reconeixement social i l' evolució geològica ; el president del Grup_Popular al Parlament_de_Catalunya , on tothom s' hi va haver un gran nombre de palestins que han deixat la_seva cita de dilluns a l' atenció al públic </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s> </s>
<nltk.lm.models.Laplace object at 0x7f1b9dfc4fa0>
<NgramCounter with 3 ngram orders and 1578636 ngrams>
39686.0
En aquest sentit , Maragall ha volgut posar_de_relleu l' esforç econòmic de Catalunya , a qui també acusa la fiscal Márquez_de_Prado i el ridícul dels que més programes ha situat en 11321,94 en augme