# Spelling correction

In this exercise we are going to build a simple model for spelling correction using a bigram language model. We will start by using the Brown corpus.

In [1]:
import nltk
nltk.download("brown")
from nltk.corpus import brown

[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Unzipping corpora/brown.zip.


Now let's use NLTK to build a frequency distribution of all the (lowercased) words in Brown, and a conditional frequency distribution of all the bigrams.


In [2]:
from nltk import FreqDist, bigrams, ConditionalFreqDist

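# Unigram counts over the lowercased corpus.
words = FreqDist(w.lower() for w in brown.words())
# Bigram counts: words_2g[a][b] is how often b directly follows a.
words_2g = ConditionalFreqDist(bigrams(w.lower() for w in brown.words()))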

In [4]:
words_2g["in"]["the"]

6025
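
To make the lookup above concrete, here is a minimal sketch in plain Python of the kind of nested count table that a conditional frequency distribution over bigrams represents. The toy token list and the name `toy_2g` are made up for illustration, and this is not how NLTK implements it internally.

```python
from collections import Counter, defaultdict

# Toy token stream (an assumption for illustration, not taken from Brown).
tokens = "the cat sat in the hat in the sun".split()

# Map each word to a Counter of the words that immediately follow it.
toy_2g = defaultdict(Counter)
for prev_word, word in zip(tokens, tokens[1:]):
    toy_2g[prev_word][word] += 1

print(toy_2g["in"]["the"])          # how often "the" follows "in"
print(sum(toy_2g["the"].values()))  # number of bigrams that start with "the"
```

The second quantity plays the same role as `words_2g[prev_word].N()` in NLTK: the total number of bigrams observed for a given first word.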

How do you calculate the conditional probability $P(\text{book}|\text{the})$?
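
One common answer is the maximum-likelihood estimate, which divides the count of the bigram by the count of all bigrams that start with the same first word:

$$P(\text{book} \mid \text{the}) = \frac{C(\text{the}, \text{book})}{\sum_{w} C(\text{the}, w)}$$

Here $C(a, b)$ is notation introduced just for this note: the number of times $b$ directly follows $a$ in the corpus. No smoothing is applied, so any bigram that never occurs in Brown gets probability zero.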

In [None]:
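def cond_prob(prev_word, word):
  # Maximum-likelihood estimate: count(prev_word, word) / count(prev_word, *).
  total = words_2g[prev_word].N()
  # An unseen prev_word has no bigrams; return 0 rather than dividing by zero.
  return words_2g[prev_word][word] / total if total else 0.0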

assert cond_prob("the", "book") == 0.0007717482957225136

Now let's write a simple function that finds all the words at minimal edit distance for a misspelled word.

In [None]:
from nltk.metrics.distance import edit_distance

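def candidates(w):
  # Distance from w to every vocabulary word (slow but simple).
  dists = [(w2, edit_distance(w, w2)) for w2 in words]
  min_dist = min(d for _, d in dists)
  # Keep only the words tied for the smallest distance.
  return [w2 for w2, d in dists if d == min_dist]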

assert set(candidates("thjs")) == set(["this","thus"])
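
For intuition about what the distance measures, here is a minimal sketch of the standard Levenshtein dynamic programme, where each insertion, deletion, or substitution costs 1. The function name `levenshtein` is hypothetical and this is not NLTK's own code; with its default arguments, `edit_distance` should agree with it on these examples.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn string a into string b."""
    # Row 0: turning the empty prefix of a into b[:j] takes j insertions.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        # Column 0: turning a[:i] into the empty string takes i deletions.
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if they match)
            ))
        prev = curr
    return prev[-1]

print(levenshtein("thjs", "this"))
print(levenshtein("thjs", "thus"))
print(levenshtein("thjs", "the"))
```

Both "this" and "thus" are one substitution away from "thjs", which is why `candidates` returns both, while "the" needs two edits.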

Using a bigram language model, how do we choose the best word based on the words on either side?

Use this to write a function that returns the most likely spelling correction of `word`.

In [None]:
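def correct(prev_word, word, next_word):
  # Words already in the vocabulary are assumed to be spelled correctly.
  if word in words:
    return word
  # Otherwise pick the candidate that best fits both neighbours:
  # P(w | prev_word) * P(next_word | w).
  return max(candidates(word),
             key=lambda w: cond_prob(prev_word, w) * cond_prob(w, next_word))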

assert correct("is","thjs","correct") == "this"
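
The scoring rule in `correct` can be worked through by hand on a small example. The sketch below scores each candidate `w` by $P(w \mid \text{prev}) \cdot P(\text{next} \mid w)$; the probabilities in `toy_prob` are made-up numbers chosen for illustration, not values measured on Brown, and `toy_score` is a hypothetical helper.

```python
# Made-up bigram probabilities, purely for illustration (not measured on Brown).
toy_prob = {
    ("is", "this"): 0.010, ("this", "correct"): 0.002,
    ("is", "thus"): 0.004, ("thus", "correct"): 0.001,
}

def toy_score(prev_word, w, next_word):
    # Product of the two bigram probabilities that involve the candidate w.
    # Bigrams missing from the table count as probability zero.
    return toy_prob.get((prev_word, w), 0.0) * toy_prob.get((w, next_word), 0.0)

scores = {w: toy_score("is", w, "correct") for w in ["this", "thus"]}
best = max(scores, key=scores.get)
print(best)
```

Because the two factors are multiplied, a candidate that never follows `prev_word` or never precedes `next_word` scores zero. When every candidate scores zero, `max` simply returns the first one, which is one reason a smoothed model may be worth considering.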