# Part-of-speech tagging

We are going to implement a hidden Markov model (HMM) tagger, again using the Brown corpus. You can get all the tagged words, as `(word, tag)` pairs, using

    brown.tagged_words()

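Using this, can you infer all the probabilities you need for the HMM, namely the emission probabilities (word given tag) and the transition probabilities (next tag given current tag)?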

_Use a simple smoothing strategy: return `1e-8` if the probability has no examples in the corpus._

In [3]:
import nltk
nltk.download("brown")
from nltk.corpus import brown
from nltk import bigrams, ConditionalFreqDist
from math import log

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!


In [4]:
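# emissions[tag][word]: how often `tag` is assigned to the lower-cased `word`
emissions = ConditionalFreqDist((w[1], w[0].lower()) for w in brown.tagged_words())
# transitions[t1][t2]: how often tag `t1` is immediately followed by tag `t2`
transitions = ConditionalFreqDist((b[0][1],b[1][1]) for b in bigrams(brown.tagged_words()))

def emit_prob(t, w):
  # P(w | t) as a relative frequency, floored at 1e-8; `w` must be lower-cased
  return max(emissions[t][w] / emissions[t].N(), 1e-8)

def trans_prob(t1, t2):
  # P(t2 | t1) as a relative frequency, floored at 1e-8 for unseen pairs
  return max(transitions[t1][t2] / transitions[t1].N(), 1e-8)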

assert emit_prob("VB","work") == 0.005312676223547918
assert trans_prob("NN", "VB") == 0.003961435036400603
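
To see what the two tables hold without downloading the corpus, here is a minimal sketch of the same counting on a made-up six-word tagged sequence. It uses only the standard library in place of `ConditionalFreqDist`. The names `toy_tagged`, `toy_emissions`, `toy_transitions`, and `floored_prob` are illustrative and are not part of the notebook or of NLTK. As an added assumption, the sketch also returns the floor for a tag that never occurs, a case the cell above does not handle.

```python
from collections import Counter, defaultdict

# Made-up tagged words standing in for brown.tagged_words() (illustrative only).
toy_tagged = [("The", "AT"), ("dog", "NN"), ("barks", "VBZ"), (".", "."),
              ("The", "AT"), ("cat", "NN")]

# Emission counts: for each tag, how often each lower-cased word carries it.
toy_emissions = defaultdict(Counter)
for word, tag in toy_tagged:
    toy_emissions[tag][word.lower()] += 1

# Transition counts: for each tag, how often each tag follows it.
toy_transitions = defaultdict(Counter)
for (_, t1), (_, t2) in zip(toy_tagged, toy_tagged[1:]):
    toy_transitions[t1][t2] += 1

def floored_prob(table, condition, outcome, floor=1e-8):
    """Relative frequency of `outcome` given `condition`, never below `floor`."""
    counts = table.get(condition)  # .get avoids creating an empty entry
    if not counts:
        return floor               # assumption: unseen condition gets the floor
    return max(counts[outcome] / sum(counts.values()), floor)

print(floored_prob(toy_emissions, "NN", "dog"))    # seen word for a seen tag
print(floored_prob(toy_transitions, "AT", "NN"))   # seen tag bigram
print(floored_prob(toy_transitions, "NN", "AT"))   # unseen tag bigram
```

In this toy data `NN` emits `dog` and `cat` once each, `AT` is always followed by `NN`, and `NN` is never followed by `AT`, so the three calls exercise a fractional estimate, a certain event, and the floor respectively.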

Using this, implement a hidden Markov model function that takes a sequence of words and a sequence of tags and returns the joint **log** probability of that tagged sequence.

_We will make a simplifying hack and treat the start token as equivalent to the symbol `.`, i.e., we treat every sentence as following on from another sentence that ended with a full stop._
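
Written out, and assuming the usual first-order HMM factorisation, the quantity to compute for words $w_1, \dots, w_n$ and tags $t_1, \dots, t_n$ is the following, where the symbols $w_i$, $t_i$, and $t_0$ are notation introduced here rather than taken from the notebook:

$$
\log P(w_{1:n}, t_{1:n}) = \sum_{i=1}^{n} \Big[ \log P(t_i \mid t_{i-1}) + \log P(w_i \mid t_i) \Big], \qquad t_0 = \texttt{.}
$$

Here $P(t_i \mid t_{i-1})$ corresponds to `trans_prob(tags[i-1], tags[i])` and $P(w_i \mid t_i)$ to `emit_prob(tags[i], words[i])`. Summing logs rather than multiplying probabilities avoids numerical underflow on long sequences.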

In [5]:
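def hmm(words,tags):
  # Sum log transition and log emission probabilities over the sequence
  p = 0.0
  for i in range(len(words)):
    if i == 0:
      # The start of the sentence is treated as following a "." tag
      p += log(trans_prob(".",tags[0]))
    else:
      p += log(trans_prob(tags[i-1], tags[i]))
    p += log(emit_prob(tags[i],words[i]))
  return p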
    
assert hmm(["this","works"],["DT","VBZ"]) == -13.400840641537089

Let's extend this further to a function that predicts the next word's part-of-speech tag, given the words and tags seen so far.

In [6]:
def hmm_predict(words, tags, word):
  # Try every known tag for `word` and keep the one with the highest log probability
  return max(transitions.conditions(),
             key=lambda t: hmm(words + [word], tags + [t]))
  
assert hmm_predict(["this", "works"],["DT","VBZ"],"well") == "RB"
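
Because the prefix of words and tags is the same for every candidate, its contribution to the log probability is a constant, so the choice of tag depends only on the final transition and emission terms. A minimal sketch of that shortcut follows. It uses hand-made probability tables (`toy_trans`, `toy_emit`) and a hypothetical helper `predict_last_tag`, none of which come from the notebook, and the numbers are invented rather than estimated from Brown.

```python
from math import log

FLOOR = 1e-8  # same floor the notebook uses for unseen events

# Hand-made probabilities, purely illustrative (not estimated from Brown).
toy_trans = {("VBZ", "RB"): 0.2, ("VBZ", "NN"): 0.1, ("VBZ", "VBZ"): 0.01}
toy_emit = {("RB", "well"): 0.02, ("NN", "well"): 0.001}

def predict_last_tag(prev_tag, word, candidate_tags):
    """Pick the tag maximising log P(tag | prev_tag) + log P(word | tag)."""
    def score(tag):
        # Only the last transition and the last emission vary between candidates.
        return (log(toy_trans.get((prev_tag, tag), FLOOR))
                + log(toy_emit.get((tag, word), FLOOR)))
    return max(candidate_tags, key=score)

print(predict_last_tag("VBZ", "well", ["RB", "NN", "VBZ"]))
```

With an empty prefix, the notebook's convention would correspond to passing `"."` as `prev_tag`. The same idea could be applied to `hmm_predict` by scoring only `trans_prob` and `emit_prob` for the new word instead of re-running `hmm` over the whole sequence for each candidate.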