# Language Modeling

Language models fall into two broad types: statistical and neural. This document covers the following topics:

- Statistical Language Modelling
- Neural Language Modelling
- Maximum Likelihood Estimation
- Calculating Word Frequency

## Statistical Language Modelling

Statistical language modelling covers techniques such as calculating word frequencies, Maximum Likelihood Estimation (MLE), and other computational linguistics models.

In this section we will build an n-gram language model.

What is an n-gram?

An n-gram is a sequence of N consecutive tokens, typically words. An n-gram model uses these sequences to estimate the probability of a token occurring in a sentence, given the tokens that precede it. Depending on the value of N, we speak of unigrams (N = 1), bigrams (N = 2), and trigrams (N = 3).
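
To make the definition concrete, here is a minimal sketch in plain Python that slices a toy sentence into unigrams, bigrams, and trigrams. The helper `make_ngrams` and the example sentence are illustrative assumptions, not part of NLTK or of the Reuters corpus.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Toy sentence (an assumption for illustration only)
toy_tokens = "the price of oil rose".split()

toy_unigrams = make_ngrams(toy_tokens, 1)
toy_bigrams = make_ngrams(toy_tokens, 2)
toy_trigrams = make_ngrams(toy_tokens, 3)

print(toy_unigrams)
print(toy_bigrams)
print(toy_trigrams)
```

A sentence of five tokens yields five unigrams, four bigrams, and three trigrams, since each n-gram needs N consecutive positions.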

We will use the Reuters corpus, a collection of 10,788 news documents totaling 1.3 million words. With the NLTK package, we can build a language model in a few lines of code.

In [1]:
# code courtesy of https://nlpforhackers.io/language-models/

from nltk.corpus import reuters
from nltk import bigrams, trigrams
from collections import Counter, defaultdict
import nltk
nltk.download('reuters')
nltk.download('punkt')

# Create a placeholder for model
model = defaultdict(lambda: defaultdict(lambda: 0))

# Count frequency of co-occurrence
for sentence in reuters.sents():
    for w1, w2, w3 in trigrams(sentence, pad_right=True, pad_left=True):
        model[(w1, w2)][w3] += 1
 
# Let's transform the counts to probabilities
for w1_w2 in model:
    total_count = float(sum(model[w1_w2].values()))
    for w3 in model[w1_w2]:
        model[w1_w2][w3] /= total_count

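# Note: the model is keyed by two-word contexts (w1, w2). The three-word key
# below was never populated, so it prints {}. To see the next-word
# distribution, look up a pair instead, e.g. model['the', 'price'].
print(dict(model['the', 'price', 'of']))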

[nltk_data] Downloading package reuters to
[nltk_data]     /Users/johnmoses/nltk_data...
[nltk_data]   Package reuters is already up-to-date!
[nltk_data] Downloading package punkt to /Users/johnmoses/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


{}


In [2]:
# code courtesy of https://nlpforhackers.io/language-models/

import random

# Starting words: the two-word context that seeds generation
text = ["today", "the"]
sentence_finished = False

while not sentence_finished:
    # Draw a random threshold in [0, 1)
    r = random.random()
    accumulator = 0.0
    context = tuple(text[-2:])

    # Walk the next-word distribution and pick the first word whose
    # cumulative probability reaches the threshold
    for word, prob in model[context].items():
        accumulator += prob
        if accumulator >= r:
            text.append(word)
            break

    # Two consecutive None tokens are the right padding, i.e. end of sentence
    if text[-2:] == [None, None]:
        sentence_finished = True

print(' '.join([t for t in text if t]))

today the Bank said it acquired Norma Bork Associates Inc and New Zealand trade minister Mike Moore told his colleagues great progress had been validly tendered .
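
The generation loop above picks each next word by walking the cumulative probabilities until they reach a random threshold. The sketch below isolates that step on a toy distribution. The function `sample_next`, its optional `r` parameter, and the probabilities are assumptions made for illustration; passing `r` explicitly makes the behaviour reproducible.

```python
import random

def sample_next(distribution, r=None):
    """Pick a token from {token: probability} by walking the cumulative sum."""
    if r is None:
        r = random.random()
    accumulator = 0.0
    for token, prob in distribution.items():
        accumulator += prob
        if accumulator >= r:
            return token
    return None  # only reached if the probabilities sum to less than r

# Toy next-word distribution (assumed values, not taken from the Reuters model)
toy_distribution = {"of": 0.5, "for": 0.3, "rose": 0.2}

print(sample_next(toy_distribution, r=0.10))  # falls in the first 0.5 -> of
print(sample_next(toy_distribution, r=0.65))  # falls between 0.5 and 0.8 -> for
print(sample_next(toy_distribution, r=0.95))  # falls between 0.8 and 1.0 -> rose
```

Words with a larger probability cover a wider slice of the interval from 0 to 1, so a uniformly drawn threshold lands on them more often.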


## Neural Language Modelling

We can use neural networks to build two kinds of language models: character-based and word-based.
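
The two kinds differ mainly in the unit the model predicts. The sketch below contains no neural network; it only shows, under that assumption, how the same text could be turned into (context, next token) training pairs at the word level and at the character level. The helper `next_token_pairs` and the context sizes are hypothetical choices for illustration.

```python
def next_token_pairs(tokens, context_size):
    """Pair each window of context_size tokens with the token that follows it."""
    return [
        (tuple(tokens[i:i + context_size]), tokens[i + context_size])
        for i in range(len(tokens) - context_size)
    ]

sample = "the cat sat"

word_tokens = sample.split()  # word-based: the unit is a word
char_tokens = list(sample)    # character-based: the unit is a character

word_pairs = next_token_pairs(word_tokens, context_size=2)
char_pairs = next_token_pairs(char_tokens, context_size=3)

print(word_pairs)
print(char_pairs[:3])
```

A character-based model sees a much smaller vocabulary but needs longer contexts to cover the same span of text, while a word-based model works with shorter sequences over a larger vocabulary.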

## Maximum Likelihood Estimation
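
For the trigram model built earlier, turning counts into probabilities by dividing by the context total is the maximum likelihood estimate: P(w3 | w1, w2) = count(w1, w2, w3) / count(w1, w2). The sketch below applies this to a toy corpus. The sentences and the function name `mle_probability` are assumptions for illustration, and no smoothing is applied, so unseen trigrams get probability 0.

```python
from collections import Counter, defaultdict

# Toy corpus (an assumption; the earlier cell uses the Reuters corpus instead)
toy_sentences = [
    ["the", "price", "of", "oil", "rose"],
    ["the", "price", "of", "gold", "fell"],
    ["the", "price", "of", "oil", "fell"],
]

# Count how often each word follows each two-word context
trigram_counts = defaultdict(Counter)
for sent in toy_sentences:
    for w1, w2, w3 in zip(sent, sent[1:], sent[2:]):
        trigram_counts[(w1, w2)][w3] += 1

def mle_probability(w1, w2, w3):
    """Relative-frequency estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    context_total = sum(trigram_counts[(w1, w2)].values())
    if context_total == 0:
        return 0.0
    return trigram_counts[(w1, w2)][w3] / context_total

print(mle_probability("price", "of", "oil"))   # 2 of 3 continuations
print(mle_probability("price", "of", "gold"))  # 1 of 3 continuations
print(mle_probability("of", "oil", "rose"))    # 1 of 2 continuations
```

Because the estimate is a plain relative frequency, the probabilities for each context sum to 1 over the words seen after it.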

## Calculating Word Frequency
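
As a minimal sketch of frequency counting, the example below uses `collections.Counter` from the standard library to count single words and adjacent word pairs in a toy sentence. The sentence is an assumption for illustration; a real analysis would run over a full corpus.

```python
from collections import Counter

# Toy text (an assumption for illustration only)
toy_text = "the price of oil rose as the price of gold fell"
toy_words = toy_text.split()

# Frequency of single words
word_freq = Counter(toy_words)

# Frequency of adjacent word pairs
pair_freq = Counter(zip(toy_words, toy_words[1:]))

print(word_freq.most_common(3))
print(pair_freq.most_common(2))
```

Pairs that recur far more often than chance would suggest are natural candidates when looking for words that tend to occur together.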

A collection of two or more tokens that tend to occur together is called a collocation. Let's generate unigrams for the Alpino corpus.

In [3]:
import nltk
from nltk.util import ngrams
from nltk.corpus import alpino
nltk.download('alpino')

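# Preview the corpus tokens
print(alpino.words())
# ngrams() yields tuples lazily; with n=1 each tuple holds a single token
unigrams = ngrams(alpino.words(), 1)
# Note: this prints every unigram in the corpus, so the output is long
for i in unigrams:
    print(i)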

[nltk_data] Downloading package alpino to
[nltk_data]     /Users/johnmoses/nltk_data...
[nltk_data]   Package alpino is already up-to-date!


['De', 'verzekeringsmaatschappijen', 'verhelen', ...]
('De',)
('verzekeringsmaatschappijen',)
('verhelen',)
('niet',)
('dat',)
('ook',)
('de',)
('rentegrondslag',)
('van',)
('vier',)
('procent',)
('nog',)
('een',)
('ruime',)
('marge',)
('laat',)
('ten',)
('opzichte',)
('van',)
('de',)
('thans',)
('geldende',)
('rentestand',)
('.',)
('Gezien',)
('de',)
('lange',)
('duur',)
('van',)
('vele',)
('verzekeringscontracten',)
('is',)
('dit',)
('onvermijdelijk',)
(',',)
('vooral',)
('omdat',)
('de',)
('aard',)
('van',)
('deze',)
('contracten',)
('een',)
('tussentijdse',)
('premieverhoging',)
('niet',)
('toelaat',)
('.',)
('De',)
('premieverlaging',)
('geldt',)
(',',)
('zoals',)
('onlangs',)
('reeds',)
('werd',)
('aangekondigd',)
(',',)
('voor',)
('nieuwe',)
('verzekeringen',)
(',',)
('gesloten',)
('vanaf',)
('15',)
('september',)
('jl.',)
('Het',)
('loslaten',)
('van',)
('de',)
('vaste',)
('wisselkoers',)
('van',)
('de',)
('Duitse',)
('mark',)
('heeft',)
('geleid',)
('tot',)
('een',)
('ernstige

In [4]:
# With n=4, each tuple holds four consecutive tokens (a sliding window of size 4)
quadgrams = ngrams(alpino.words(), 4)
for i in quadgrams:
    print(i)

('De', 'verzekeringsmaatschappijen', 'verhelen', 'niet')
('verzekeringsmaatschappijen', 'verhelen', 'niet', 'dat')
('verhelen', 'niet', 'dat', 'ook')
('niet', 'dat', 'ook', 'de')
('dat', 'ook', 'de', 'rentegrondslag')
('ook', 'de', 'rentegrondslag', 'van')
('de', 'rentegrondslag', 'van', 'vier')
('rentegrondslag', 'van', 'vier', 'procent')
('van', 'vier', 'procent', 'nog')
('vier', 'procent', 'nog', 'een')
('procent', 'nog', 'een', 'ruime')
('nog', 'een', 'ruime', 'marge')
('een', 'ruime', 'marge', 'laat')
('ruime', 'marge', 'laat', 'ten')
('marge', 'laat', 'ten', 'opzichte')
('laat', 'ten', 'opzichte', 'van')
('ten', 'opzichte', 'van', 'de')
('opzichte', 'van', 'de', 'thans')
('van', 'de', 'thans', 'geldende')
('de', 'thans', 'geldende', 'rentestand')
('thans', 'geldende', 'rentestand', '.')
('geldende', 'rentestand', '.', 'Gezien')
('rentestand', '.', 'Gezien', 'de')
('.', 'Gezien', 'de', 'lange')
('Gezien', 'de', 'lange', 'duur')
('de', 'lange', 'duur', 'van')
('lange', 'duur', 'van