Before going through this notebook, you should read the first 7 pages of https://web.stanford.edu/~jurafsky/slp3/3.pdf

## Language modelling

Language modelling means assigning probabilities to words and sequences of words. On its own, a language model is not very useful, but it can be applied to almost any other NLP task. Recent progress in NLP is in large part driven by language models (BERT, ELMo, GPT-2).

Language modelling is a complicated topic, so we will go through it gradually. In this notebook, we'll look at the basics.

Let's take two big books.

In [7]:
# NLTK includes a small selection of texts from the Project Gutenberg electronic text archive, 
# which contains some 25,000 free electronic books, hosted at http://www.gutenberg.org/. 
import nltk
nltk.download('gutenberg')
nltk.corpus.gutenberg.fileids()

['austen-emma.txt',
 'austen-persuasion.txt',
 'austen-sense.txt',
 'bible-kjv.txt',
 'blake-poems.txt',
 'bryant-stories.txt',
 'burgess-busterbrown.txt',
 'carroll-alice.txt',
 'chesterton-ball.txt',
 'chesterton-brown.txt',
 'chesterton-thursday.txt',
 'edgeworth-parents.txt',
 'melville-moby_dick.txt',
 'milton-paradise.txt',
 'shakespeare-caesar.txt',
 'shakespeare-hamlet.txt',
 'shakespeare-macbeth.txt',
 'whitman-leaves.txt']

In [18]:
emma = nltk.corpus.gutenberg.raw('austen-emma.txt')
moby = nltk.corpus.gutenberg.raw('melville-moby_dick.txt')

Anna Karenina is slightly longer.

In [19]:
print("Length of Emma -", len(emma))
print("Length of Moby Dick - ", len(moby))

Length of Emma - 887071
Length of Moby Dick -  1242990


A simple normalization pipeline. You should understand what it does by now.

In [21]:
from string import punctuation
import numpy as np

def normalize(text):
    normalized_text = [(word.strip(punctuation)) for word \
                                                            in text.lower().split()]
    normalized_text = [word for word in normalized_text if word]
    return normalized_text
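As a quick check of what `normalize` does, the sketch below repeats the function on a made-up sentence (the sentence is only an illustration). The text is lowercased and split on whitespace, punctuation is stripped from the edges of each token, and tokens that become empty are dropped. Punctuation inside a token, such as the apostrophe in "it's", is kept.

```python
from string import punctuation

def normalize(text):
    # lowercase, split on whitespace, strip punctuation from token edges
    tokens = [word.strip(punctuation) for word in text.lower().split()]
    # drop tokens that were pure punctuation (they are now empty strings)
    return [word for word in tokens if word]

print(normalize("Hello, World! -- it's me."))
# ['hello', 'world', "it's", 'me']
```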


We can compare two texts

In [22]:
norm_emma = normalize(emma)
norm_moby = normalize(moby)

In [25]:
print("Length of Emma in tokens -", len(norm_emma))
print("Length of Moby Dick in tokens - ", len(norm_moby))

Length of Emma in tokens - 158131
Length of Moby Dick in tokens -  212013


In [26]:
norm_emma[:10]

['emma', 'by', 'jane', 'austen', '1816', 'volume', 'i', 'chapter', 'i', 'emma']

In [27]:
print("Unique tokens in Emma -", len(set(norm_emma)))
print("Unique tokens in Moby Dick - ", len(set(norm_moby)))

Unique tokens in Emma - 9460
Unique tokens in Moby Dick -  20233


Let's compute word frequencies

In [28]:
from collections import Counter

In [31]:
vocab_emma = Counter(norm_emma)
vocab_moby = Counter(norm_moby)

In [32]:
vocab_emma.most_common(10)

[('to', 5149),
 ('the', 5146),
 ('and', 4613),
 ('of', 4274),
 ('a', 3073),
 ('i', 2968),
 ('her', 2417),
 ('it', 2400),
 ('was', 2376),
 ('she', 2278)]

In [33]:
vocab_moby.most_common(10)

[('the', 14320),
 ('of', 6578),
 ('and', 6362),
 ('a', 4628),
 ('to', 4577),
 ('in', 4143),
 ('that', 2940),
 ('his', 2520),
 ('it', 2368),
 ('i', 1943)]

If we divide each frequency by the size of the corpus (the total number of tokens), we get word probabilities!

In [34]:
probas_emma = Counter({word:c/len(norm_emma) for word, c in vocab_emma.items()})
probas_emma.most_common(20)

[('to', 0.03256161031043881),
 ('the', 0.03254263869829445),
 ('and', 0.029172015607312925),
 ('of', 0.027028223435000095),
 ('a', 0.019433254706540778),
 ('i', 0.018769248281488134),
 ('her', 0.015284795517640438),
 ('it', 0.015177289715489057),
 ('was', 0.015025516818334165),
 ('she', 0.01440577748828503),
 ('in', 0.013577350424647918),
 ('not', 0.013406605915348667),
 ('be', 0.012426405954556664),
 ('you', 0.01212918403096167),
 ('that', 0.011123688587310521),
 ('he', 0.011111040845880946),
 ('had', 0.010232022816525538),
 ('as', 0.00904945899286035),
 ('have', 0.008322213860659834),
 ('for', 0.008271622894941535)]

In [35]:
probas_moby = Counter({word:c/len(norm_moby) for word, c in vocab_moby.items()})
probas_moby.most_common(20)

[('the', 0.06754302802186658),
 ('of', 0.03102639932456972),
 ('and', 0.03000759387396056),
 ('a', 0.021828850117681462),
 ('to', 0.02158829883073208),
 ('in', 0.01954125454571182),
 ('that', 0.01386707418884691),
 ('his', 0.01188606359044021),
 ('it', 0.011169126421493022),
 ('i', 0.009164532363581479),
 ('but', 0.008376844816119767),
 ('he', 0.008240060750991684),
 ('with', 0.00811271006966554),
 ('as', 0.00810799337776457),
 ('is', 0.008046676383051983),
 ('was', 0.007711791258083231),
 ('for', 0.007537273657747402),
 ('all', 0.006957120553928297),
 ('this', 0.006424134369118875),
 ('at', 0.006178866390268521)]

These probabilities can be used to directly compare word usage by different authors, or to predict which author would be more likely to say a given phrase.

In [41]:
phrase = 'I hate that whale'

prob = Counter({'emma':0, 'moby':0})

for word in normalize(phrase):
    # logarithms are often applied to probabilities:
    # multiplying many small numbers (like probabilities) quickly underflows to zero,
    # while adding logarithms is equivalent to multiplying probabilities
    
    prob['emma'] += np.log(probas_emma.get(word, 0.00001)) # small fallback probability in case the word is missing
    prob['moby'] += np.log(probas_moby.get(word, 0.00001))



In [42]:
# the bigger the result (closer to 0) the more probable
prob.most_common()

[('moby', -24.559352967277015), ('emma', -30.57202400204973)]
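To see why logarithms are used, here is a small sketch with made-up probabilities (not taken from the corpus). Multiplying many small numbers underflows to zero in floating point, while the sum of their logarithms stays representable.

```python
import math

# hypothetical probabilities: 200 words, each with probability 0.01
probs = [0.01] * 200

product = 1.0
for p in probs:
    product *= p  # 0.01 ** 200 is too small for a float, so this ends at 0.0

log_sum = sum(math.log(p) for p in probs)

print(product)  # 0.0
print(log_sum)  # roughly -921.03
```

Once the product has collapsed to zero, two phrases can no longer be compared, whereas their log-probabilities still can.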

Here we just multiplied the probabilities of individual words. This is equivalent to assuming that every word in a text is selected independently, which is obviously not true.

Strictly speaking, we need the [chain rule of probability](https://en.wikipedia.org/wiki/Chain_rule_(probability)), which conditions each word on all the words before it. The problem is that we can't compute probabilities for sequences we haven't seen in the corpus, and the longer the sequence, the less likely it is that we have seen it before.

To overcome this, we can use the [Markov assumption](https://en.wikipedia.org/wiki/Markov_property) and approximate the probability of a sequence as a product of bigram probabilities. 
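Written out in standard notation (the symbols $w_1, \dots, w_n$ for the words of a sequence are introduced here and are not part of the notebook), the exact decomposition by the chain rule of probability and its bigram approximation are:

$$
P(w_1, \dots, w_n) = \prod_{i=1}^{n} P(w_i \mid w_1, \dots, w_{i-1}) \approx \prod_{i=1}^{n} P(w_i \mid w_{i-1})
$$

For $i = 1$ the conditioning context is empty in the exact product. In the approximation, a sentence-start marker can play the role of $w_0$.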

First, we need to get bigram frequencies.

In [44]:
from nltk.tokenize import sent_tokenize
def ngrammer(tokens, n=2):
    ngrams = []
    for i in range(0,len(tokens)-n+1):
        ngrams.append(' '.join(tokens[i:i+n]))
    return ngrams

In [45]:
norm_emma[:10]

['emma', 'by', 'jane', 'austen', '1816', 'volume', 'i', 'chapter', 'i', 'emma']

In [46]:
ngrammer(norm_emma[:10])

['emma by',
 'by jane',
 'jane austen',
 'austen 1816',
 '1816 volume',
 'volume i',
 'i chapter',
 'chapter i',
 'i emma']
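The same function produces longer n-grams when `n` is changed. Here is a quick sketch on the first tokens of Emma, using an equivalent list-comprehension version of `ngrammer`:

```python
def ngrammer(tokens, n=2):
    # slide a window of size n over the tokens and join each window with spaces
    return [' '.join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ['emma', 'by', 'jane', 'austen', '1816']

print(ngrammer(tokens, n=3))
# ['emma by jane', 'by jane austen', 'jane austen 1816']

print(ngrammer(tokens, n=6))
# [] because the sequence is shorter than n
```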

We need to add a \<start\> token to every sentence to get a proper probability distribution and to be able to start a sequence when we generate text.

To be able to stop, we also need to add an \<end\> token to every sentence.

In [47]:
sentences_emma = [['<start>'] + normalize(text) + ['<end>'] for text in sent_tokenize(emma)]
sentences_moby = [['<start>'] + normalize(text) + ['<end>'] for text in sent_tokenize(moby)]

In [49]:
unigrams_emma = Counter()
bigrams_emma = Counter()

for sentence in sentences_emma:
    unigrams_emma.update(sentence)
    bigrams_emma.update(ngrammer(sentence))


unigrams_moby = Counter()
bigrams_moby = Counter()

for sentence in sentences_moby:
    unigrams_moby.update(sentence)
    bigrams_moby.update(ngrammer(sentence))


In [50]:
len(unigrams_emma)

9436

In [51]:
bigrams_emma.most_common(10)

[('<start> i', 871),
 ('to be', 602),
 ('of the', 559),
 ('<start> she', 505),
 ('in the', 441),
 ('it was', 418),
 ('<start> he', 395),
 ('i am', 363),
 ('<start> it', 339),
 ('she had', 322)]

In [52]:
bigrams_moby.most_common(10)

[('of the', 1879),
 ('in the', 1176),
 ('to the', 726),
 ('<start> but', 704),
 ('<start> the', 545),
 ('from the', 440),
 ('<start> and', 406),
 ('<start> i', 396),
 ('of his', 371),
 ('and the', 368)]

More formally, the probability of a bigram is the probability of word2 given word1. We can compute it by dividing the number of times word1 and word2 occur together by the number of occurrences of word1:  
**p(word2 | word1)** = **count(word1 word2) / count(word1)**
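As a small illustration of this formula, the sketch below computes bigram probabilities on a toy corpus of three sentences (the sentences and the helper name `bigram_prob` are invented for this example).

```python
from collections import Counter

# toy corpus, invented for illustration
sentences = [
    ['<start>', 'i', 'am', 'here', '<end>'],
    ['<start>', 'i', 'am', 'late', '<end>'],
    ['<start>', 'i', 'was', 'here', '<end>'],
]

unigrams = Counter()
bigrams = Counter()
for sent in sentences:
    unigrams.update(sent)
    bigrams.update(' '.join(sent[i:i + 2]) for i in range(len(sent) - 1))

def bigram_prob(word1, word2):
    # P(word2 | word1) = count(word1 word2) / count(word1)
    return bigrams[word1 + ' ' + word2] / unigrams[word1]

print(bigram_prob('i', 'am'))     # 'i' occurs 3 times, 2 of them followed by 'am'
print(bigram_prob('am', 'here'))  # 0.5
```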

In [59]:
phrase = 'If I loved you less, I might be able to talk about it more' #emma quote
# phrase = 'For there are devils in the deep, but worst are the ones we make.' #moby dick quote

prob = Counter()
for ngram in ngrammer(['<start>'] + normalize(phrase) + ['<end>']):
    word1, word2 = ngram.split()
    if word1 in unigrams_emma and ngram in bigrams_emma:
        prob['emma'] += np.log(bigrams_emma[ngram]/unigrams_emma[word1])
    else:
        prob['emma'] += -10 # fixed log-probability penalty in case the bigram was never seen
    if word1 in unigrams_moby and ngram in bigrams_moby:
        prob['moby'] += np.log(bigrams_moby[ngram]/unigrams_moby[word1])
    else:
        prob['moby'] += -10



In [60]:
prob.most_common()

[('emma', -58.08141178760031), ('moby', -83.59821440802531)]

Using these probabilities, we can try to generate text in Austen's style.

In [62]:
# we build a matrix of probabilities
# element (i, j) of the matrix is the probability of word_j given word_i

matrix_emma = np.zeros((len(unigrams_emma), 
                   len(unigrams_emma)))

# we create word-to-index and index-to-word mappings
# because we use indices in the matrix and words in sentences
id2word_emma = list(unigrams_emma)
word2id_emma = {word:i for i, word in enumerate(id2word_emma)}


for ngram in bigrams_emma:
    word1, word2 = ngram.split()
    matrix_emma[word2id_emma[word1]][word2id_emma[word2]] =  (bigrams_emma[ngram]/
                                                                     unigrams_emma[word1])



In [63]:

matrix_moby = np.zeros((len(unigrams_moby), 
                   len(unigrams_moby)))

id2word_moby = list(unigrams_moby)
word2id_moby = {word:i for i, word in enumerate(id2word_moby)}



for ngram in bigrams_moby:
    word1, word2 = ngram.split()
    matrix_moby[word2id_moby[word1]][word2id_moby[word2]] =  (bigrams_moby[ngram]/
                                                                     unigrams_moby[word1])



**This matrix of probabilities is a bigram language model**. We can write a simple function that will generate N words using this language model.
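A useful sanity check is that each row of such a matrix is a probability distribution, which `np.random.choice` requires for its `p` argument. The sketch below rebuilds the matrix on an invented two-sentence corpus. To keep the example dependency-free it uses plain Python lists and tuple bigrams, both of which are assumptions of this sketch rather than the notebook's own representation. The row for `<end>` is all zeros because no bigram starts with `<end>`, so sampling should never continue from that row.

```python
from collections import Counter

# toy corpus, invented for illustration
sentences = [
    ['<start>', 'the', 'whale', 'swims', '<end>'],
    ['<start>', 'the', 'ship', 'sails', '<end>'],
]

unigrams, bigrams = Counter(), Counter()
for sent in sentences:
    unigrams.update(sent)
    bigrams.update(zip(sent, sent[1:]))  # bigrams kept as (word1, word2) tuples

id2word = list(unigrams)
word2id = {word: i for i, word in enumerate(id2word)}

# matrix[i][j] = P(word_j | word_i)
matrix = [[0.0] * len(id2word) for _ in id2word]
for (word1, word2), count in bigrams.items():
    matrix[word2id[word1]][word2id[word2]] = count / unigrams[word1]

# every row sums to 1, except the all-zero row of '<end>'
for word in id2word:
    print(word, sum(matrix[word2id[word]]))
```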

#### The generation process is like this:  
----
1) we start with the \<start\> token  
2) we use the index of \<start\> to get the probabilities of the next words  
3) we use np.random.choice to select the next word according to these probabilities  
4) we add the chosen word to the text and use it to generate the next word  
5) when we see the \<end\> token, we either stop or continue from \<start\>  

In [64]:

def generate(matrix, id2word, word2id, n=100, start='<start>'):
    text = []
    current_idx = word2id[start]
    
    for i in range(n):
        
        chosen = np.random.choice(matrix.shape[1], p=matrix[current_idx])
#       try uncommenting the line below to see why we need np.random.choice
#         chosen = matrix[current_idx].argmax() # it just selects most probable word

        text.append(id2word[chosen])
        
        if id2word[chosen] == '<end>':
            chosen = word2id['<start>']
        current_idx = chosen
    
    return ' '.join(text)

In [65]:
print(generate(matrix_emma, id2word_emma, word2id_emma).replace('<end>', '\n'))

but i love must be danced would speak of her husband and miss fairfax 
 it will turn out in poor comfort but a pen in my accents swell to enscombe 
 weston.--so it darted through 
 bought at home he was very different homes and was as to see them emma's little time since we never had so much benefit and expressions but certainly shews it in confessing exactly 
 the want only to sit down the liveliest objects she that could answer written enough broad neat and this being affected by her tongue 
 no,"--he gravely 
 i


In [66]:
print(generate(matrix_moby, id2word_moby, word2id_moby).replace('<end>', '\n'))

i don't you are part of uncommon bulk 
 it can't remember he can hit aright this part of that the driving on the middle-watch a world 
 turn to!--i make much the great sperm whales 
 and fish seen to on the boats are absolutely paints like old second was examined the streets of instantaneous violent gaspings and informed nantucketers born of the widely-separated ships made the binnacle slipped my dear domestic peculiarity on the waters near them daggoo 
 look so ahab stayed in the many of all hard 
 coming 
 i cherished no telling said excitedly


# Homework (Task 1)

##### Implement a trigram language model, generate some texts and compare them to the output of the bigram language model we wrote above. Which one gives better texts?

Use the code above as a starting point. You don't really need to change much, but it might be difficult to figure out. Read Jurafsky carefully to get a better understanding, and feel free to ask me any questions. 

You can use another corpus if you want.   


**Hints**:  
you'll need two start tags in a trigram language model,   
use bigrams as rows in the matrix and unigrams as columns,   
if the text you generated is just random words, something is wrong

---------------------------

## Collocation


Collocations are word combinations that occur regularly. To find good collocations we can't just use bigram probabilities, because they only show how likely word2 is after word1. If word2 occurs only after word1, the pair is a good collocation, but it may be rare, so its probability will be low.

There are many ways of scoring collocations. The most common one is [PMI](https://en.wikipedia.org/wiki/Pointwise_mutual_information).

The formula for PMI is **log(p(ab) / (p(a) * p(b)))**. The scorer below omits the logarithm, which does not change the ranking. Let's try it out on our texts.
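As a worked example with invented counts (the numbers are hypothetical and chosen only to show the arithmetic), the sketch below uses the usual definition of PMI as the logarithm of p(ab) / (p(a) * p(b)). A rare pair whose words almost always occur together gets a higher PMI than a frequent pair of common words.

```python
import math

def pmi(count_a, count_b, count_ab, total):
    # PMI = log( p(a, b) / (p(a) * p(b)) ), with probabilities estimated from counts
    p_a = count_a / total
    p_b = count_b / total
    p_ab = count_ab / total
    return math.log(p_ab / (p_a * p_b))

total = 100_000  # hypothetical corpus size in tokens

# rare words that nearly always appear together
rare = pmi(count_a=20, count_b=20, count_ab=18, total=total)
# frequent words that co-occur often, but not much more than chance predicts
common = pmi(count_a=5000, count_b=4000, count_ab=300, total=total)

print(rare > common)  # True
```

For simplicity the sketch divides all counts by the same `total`. A more careful estimate would normalize bigram counts by the number of bigrams.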

In [75]:
from collections import Counter, defaultdict
import numpy as np
from nltk.corpus import stopwords
from string import punctuation
stops = set(stopwords.words('english'))

def normalize(text):
    normalized_text = [word.strip(punctuation) for word \
                                                            in text.lower().split()]
    normalized_text = [word for word in normalized_text if word not in stops]
    return normalized_text


def ngrammer(tokens, n=2):
    ngrams = []
    for i in range(0,len(tokens)-n+1):
        ngrams.append(' '.join(tokens[i:i+n]))
    return ngrams
    

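Preprocessing is almost the same as before, except that stop words are removed and no \<start\> and \<end\> tokens are added.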

In [76]:
sentences_emma =  [normalize(text) for text in sent_tokenize(emma)]
sentences_moby =  [normalize(text) for text in sent_tokenize(moby)]

In [78]:
word_counter = Counter()

for text in sentences_emma:
    word_counter.update(ngrammer(text, n=2))


In [79]:
word_counter.most_common(15)

[('mr knightley', 242),
 ('mrs weston', 217),
 ('mr elton', 174),
 ('miss woodhouse', 150),
 ('mr weston', 136),
 ('frank churchill', 123),
 ('every thing', 119),
 ('mrs elton', 115),
 ('miss fairfax', 105),
 ('mr woodhouse', 104),
 ('miss bates', 97),
 ('jane fairfax', 90),
 ('every body', 86),
 ('young man', 75),
 ('great deal', 63)]

We can use raw counts instead of probabilities. The results will be roughly the same.

In [82]:
def scorer_pmi(word_count_a, word_count_b, bigram_count):
    try:
        score = bigram_count/((word_count_a*word_count_b))
    
    except ZeroDivisionError:
        return 0
    
    return score

We'll make a function to collect unigrams and bigrams

In [83]:
def collect_stats(texts):
    
    unigrams = Counter()
    bigrams = Counter()
    
    for text in texts:
        unigrams.update(text)
        bigrams.update(ngrammer(text, 2))
    
    return unigrams, bigrams

And also a function that will score every bigram we collect.

In [158]:
def score_bigrams(unigrams, bigrams, scorer, threshold=-100000):
    
    bigram2score = Counter()
    len_vocab = len(unigrams)
    for bigram in bigrams:
        score = scorer(unigrams[bigram[0]], unigrams[bigram[1]], 
                       bigrams[bigram])
        
        ## if PMI is bigger than the threshold we add to the result
        if score > threshold:
            bigram2score[bigram] = score
    
    return bigram2score

In [109]:
unigrams, bigrams = collect_stats(sentences_moby)

In [110]:
bigram2score = score_bigrams(unigrams, bigrams, scorer_pmi)

Here we rank bigrams by their PMI scores

In [111]:
bigram2score.most_common(15)

[('let us', 12.5),
 ('new bedford', 8.5),
 ('never mind', 8.5),
 ('closed eyes', 5.0),
 ('chief mate', 5.0),
 ('never heard', 4.0),
 ('next morning', 3.5),
 ('new england', 3.0),
 ('new zealand', 3.0),
 ('centuries ago', 3.0),
 ('let go', 3.0),
 ('next day', 3.0),
 ('never yet', 3.0),
 ('clear spirit', 3.0),
 ('new york', 2.5)]
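One thing to watch in `score_bigrams`: `ngrammer` returns each bigram as a single space-joined string, so `bigram[0]` and `bigram[1]` are its first two characters rather than its two words. With a `Counter`, a missing key counts as 0, which leads to the `ZeroDivisionError` branch of the scorer. Below is a minimal sketch of a variant that splits the string first. The name `score_bigrams_by_word` and the toy counts are invented here.

```python
from collections import Counter

def scorer_pmi(word_count_a, word_count_b, bigram_count):
    try:
        return bigram_count / (word_count_a * word_count_b)
    except ZeroDivisionError:
        return 0

def score_bigrams_by_word(unigrams, bigrams, scorer, threshold=-100000):
    bigram2score = Counter()
    for bigram, count in bigrams.items():
        # split the 'word1 word2' string into its two words
        word1, word2 = bigram.split()
        score = scorer(unigrams[word1], unigrams[word2], count)
        if score > threshold:
            bigram2score[bigram] = score
    return bigram2score

# toy counts, invented for illustration
unigrams = Counter({'new': 10, 'york': 2, 'whale': 50, 'old': 20})
bigrams = Counter({'new york': 2, 'old whale': 5})

scores = score_bigrams_by_word(unigrams, bigrams, scorer_pmi)
print(scores.most_common())
# [('new york', 0.1), ('old whale', 0.005)]
```

With word-level counts, "new york" scores higher than "old whale" even though it is less frequent, because "york" almost never occurs without "new".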

We can also add a minimum bigram count to the function, so that word combinations that occurred only once do not end up at the top.

In [112]:
def scorer_pmi(word_count_a, word_count_b, bigram_count, min_count=1):
    try:
        score = ((bigram_count - min_count) / ((word_count_a * word_count_b)))
    except ZeroDivisionError:
        return 0
    
    return score


def score_bigrams(unigrams, bigrams, scorer, threshold=-100000, min_count=1):
    
    bigram2score = Counter()
    len_vocab = len(unigrams)
    for bigram in bigrams:
        score = scorer(unigrams[bigram[0]], unigrams[bigram[1]], 
                       bigrams[bigram], min_count)
        
        
        if score > threshold:
            bigram2score[bigram] = score
    
    return bigram2score

In [113]:
bigram2score = score_bigrams(unigrams, bigrams, scorer_pmi, min_count=10)

In [114]:
bigram2score.most_common(15)

[('let us', 8.0),
 ('new bedford', 4.0),
 ('never mind', 4.0),
 ('every one', 1.2),
 ('chief mate', 0.5),
 ('ever since', 0.3),
 ('every way', 0.1),
 ('moby dick', 0),
 ('dick herman', 0),
 ('melville 1851', 0),
 ('supplied late', 0),
 ('late consumptive', 0),
 ('consumptive usher', 0),
 ('usher grammar', 0),
 ('grammar school', 0)]

You can read about other metrics for scoring collocations here:

http://www.scielo.org.mx/scielo.php?script=sci_arttext&pid=S1405-55462016000300327#t1

---------------------

There are existing tools for collecting collocations from a corpus. Gensim has one.

In [116]:
import gensim

In [122]:
sentences = []
for sent in sent_tokenize(moby):
    sentences.append(normalize(sent))

In [156]:
sentences[1]

['supplied',
 'late',
 'consumptive',
 'usher',
 'grammar',
 'school',
 'pale',
 'usher--threadbare',
 'coat',
 'heart',
 'body',
 'brain',
 'see']

In [141]:
# train
# here NPMI (normalized PMI) is used to score bigrams
# it is similar to PMI, but its values range from -1 to 1
ph = gensim.models.Phrases(sentences, min_count=1, threshold=-1, scoring='npmi')

In [142]:
# transforming text
ph[list(sentences[4])]

['take_hand',
 'school_others',
 'teach_name',
 'whale-fish_called',
 'tongue_leaving',
 'ignorance_letter',
 'h_almost',
 'alone_maketh',
 'signification_word',
 'deliver_true']

In [147]:
# we can train a second Phrases model on the bigrammed text
ph2 = gensim.models.Phrases(ph[sentences], min_count=1, threshold=-1, scoring='npmi')

In [149]:
# and then we can apply both Phrases models to the text sequentially and get 4-grams
ph2[ph[sentences[4]]]

['take_hand_school_others',
 'teach_name_whale-fish_called',
 'tongue_leaving_ignorance_letter',
 'h_almost_alone_maketh',
 'signification_word_deliver_true']

NLTK also has an n-gram scorer.

In [150]:
import nltk
from nltk.collocations import *

In [151]:
bigram_measures = nltk.collocations.BigramAssocMeasures()
trigram_measures = nltk.collocations.TrigramAssocMeasures()

In [152]:
finder2 = BigramCollocationFinder.from_documents(sentences)

In [153]:
finder3 = TrigramCollocationFinder.from_documents(sentences)

In [154]:
finder2.nbest(bigram_measures.likelihood_ratio, 20)

[('moby', 'dick'),
 ('sperm', 'whale'),
 ('white', 'whale'),
 ('captain', 'ahab'),
 ('old', 'man'),
 ('sperm', "whale's"),
 ('new', 'bedford'),
 ('captain', 'peleg'),
 ('mr', 'starbuck'),
 ('cape', 'horn'),
 ('aye', 'aye'),
 ('right', 'whale'),
 ('cried', 'ahab'),
 ('thou', 'art'),
 ('let', 'us'),
 ('years', 'ago'),
 ("d'ye", 'see'),
 ('lower', 'jaw'),
 ('father', 'mapple'),
 ('ivory', 'leg')]

In [155]:
finder3.nbest(trigram_measures.raw_freq, 20)

[('great', 'sperm', 'whale'),
 ('sperm', "whale's", 'head'),
 ('every', 'one', 'knows'),
 ('seen', 'white', 'whale'),
 ('cape', 'good', 'hope'),
 ('seven', 'hundred', 'seventy-seventh'),
 ('greenland', 'right', 'whale'),
 ('hast', 'seen', 'white'),
 ('right', "whale's", 'head'),
 ('sperm', 'whale', 'fishery'),
 ('would', 'almost', 'thought'),
 ('captain', 'ahab', 'said'),
 ('chase', 'moby', 'dick'),
 ('even', 'present', 'day'),
 ('god', 'bless', 'ye'),
 ('old', 'manx', 'sailor'),
 ('round', 'cape', 'horn'),
 ('sleep', 'two', 'bed'),
 ('stubb', 'second', 'mate'),
 ('thou', 'clear', 'spirit')]

### Homework (Task 2)



Implement a simple version of byte-pair encoding (see the first seminar) using gensim.models.Phrases.

Apply gensim.models.Phrases to character sequences instead of word sequences (sentences). Train at least 3 Phrases sequentially. As a result you should get whole words or long character ngrams. 

In [160]:
# when you apply your phrasers to the text
p3[p2[p[text]]] 

In [None]:
# you should get something like
['s_o_m', 'e', 'r_a', 'n_d_o_m', 't_e_x_t', 'h_e', 'r', 'e']