---
Exercise: Spell Correction
====

<img src="https://s-media-cache-ak0.pinimg.com/236x/af/22/72/af2272d2f2f749c196407d724005f232.jpg" style="width: 600px;"/>

---

<img src="http://www.azquotes.com/picture-quotes/quote-simple-models-and-a-lot-of-data-trump-more-elaborate-models-based-on-less-data-peter-norvig-80-37-54.jpg" style="width: 600px;"/>

Inspired by Peter Norvig's [How to Do Things with Words in Python](http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb)

----
Spelling Correction Task
----

Given a word *w*, find the most likely correction *c* = `correct(`*w*`)`.

__Approach:__ Try all candidate words *c* that are known words that are near *w*.  Choose the most likely one.

How do we balance *near* and *likely*?

For now, in a trivial way: always prefer the nearer word, and when there is a tie on nearness, use the word with the highest count.

Measure nearness by *edit distance*: the minimum number of deletions, transpositions, insertions, or replacements of characters. By trial and error, we determine that going out to edit distance 2 will give us reasonable results.
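To make *nearness* concrete, here is a minimal sketch of an edit-distance function that covers the four operations named above. It assumes the restricted ("optimal string alignment") form of Damerau-Levenshtein distance, and the name `edit_distance` is hypothetical: the rest of the notebook never measures distance directly, it enumerates candidates instead.

```python
def edit_distance(a, b):
    "Minimum number of deletions, insertions, replacements, or adjacent transpositions."
    # d[i][j] holds the distance between the prefixes a[:i] and b[:j].
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i                      # delete all i characters
    for j in range(len(b) + 1):
        d[0][j] = j                      # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # replacement (or match)
            # Adjacent transposition, e.g. "wi" <-> "iw".
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[len(a)][len(b)]

for candidate in ['wird', 'weird', 'word', 'iwrd', 'world']:
    print(candidate, edit_distance('wird', candidate))
```

Under this sketch, `"weird"`, `"word"`, and `"iwrd"` are each one edit from `"wird"`, while `"world"` is two.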

Then we can define `correct(`*w*`)`:

In [1]:
reset -fs

In [2]:
def correct(word):
    "Find the best spelling correction for this word."
    # Prefer edit distance 0, then 1, then 2; otherwise default to word itself.
    candidates = (known(edits0(word)) or 
                  known(edits1(word)) or 
                  known(edits2(word)) or 
                  [word])
    return max(candidates, key=counts.get)
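The `or` chain works because an empty set is falsy, so Python falls through to the next, more distant candidate set. A small sketch with stand-in sets shows the behaviour; the toy `counts` below is hypothetical, since the real one is built from a corpus later in the notebook.

```python
from collections import Counter

# Hypothetical toy counts; the notebook builds the real `counts` later.
counts = Counter({'word': 5, 'weird': 2, 'bird': 3})

# `or` returns its first truthy operand, so the empty set is skipped
# and the later operands are never used.
candidates = set() or {'word', 'weird'} or {'bird'} or ['wird']
best = max(candidates, key=counts.get)

# If every set is empty, the one-item list is the fallback. max() makes no
# comparisons for a single item, so an unknown word (whose counts.get(...)
# is None) is returned unchanged.
fallback = set() or set() or set() or ['zzzz']
unchanged = max(fallback, key=counts.get)

print(best, unchanged)
```

Here `best` is the higher-count member of the first non-empty set, and `unchanged` is the original word.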

The functions `known` and `edits0` are easy; and `edits2` is easy if we assume we have `edits1`:

In [3]:
def known(words):
    "Return the subset of words that are actually in the dictionary."
    return {word for word in words 
                if word in counts}

def edits0(word): 
    "Return all strings that are zero edits away from word (i.e., just word itself)."
    return {word}

def edits2(word):
    "Return all strings that are two edits away from this word."
    return {e2 for e1 in edits1(word) 
                for e2 in edits1(e1)}

Now for `edits1(word)`: the set of candidate words that are one edit away. For example, given `"wird"`, this would include `"weird"` (inserting an `e`) and `"word"` (replacing an `i` with an `o`), and also `"iwrd"` (transposing `w` and `i`; `known` can then be used to filter this out of the set of final candidates). How could we get them? One way is to *split* the original word in all possible places, each split forming a *pair* of words, `(a, b)`, before and after the place, and at each place, either delete, transpose, replace, or insert a letter:

<table>
  <tr><td>pairs:</td><td><tt>Ø+wird</tt></td><td><tt>w+ird</tt></td><td><tt>wi+rd</tt></td><td><tt>wir+d</tt></td><td><tt>wird+Ø</tt></td><td><i>Notes:</i> <tt>(<i>a</i>, <i>b</i>)</tt> pair</td></tr>
  <tr><td>deletions:</td><td><tt>Ø+ird</tt></td><td><tt>w+rd</tt></td><td><tt>wi+d</tt></td><td><tt>wir+Ø</tt></td><td></td><td><i>Delete first char of b</i></td></tr>
  <tr><td>transpositions:</td><td><tt>Ø+iwrd</tt></td><td><tt>w+rid</tt></td><td><tt>wi+dr</tt></td><td></td><td></td><td><i>Swap first two chars of b</i></td></tr>
  <tr><td>replacements:</td><td><tt>Ø+?ird</tt></td><td><tt>w+?rd</tt></td><td><tt>wi+?d</tt></td><td><tt>wir+?</tt></td><td></td><td><i>Replace char at start of b</i></td></tr>
  <tr><td>insertions:</td><td><tt>Ø+?+wird</tt></td><td><tt>w+?+ird</tt></td><td><tt>wi+?+rd</tt></td><td><tt>wir+?+d</tt></td><td><tt>wird+?+Ø</tt></td><td><i>Insert char between a and b</i></td></tr>
</table>

In [4]:
def edits1(word):
    "Return all strings that are one edit away from this word."
    pairs      = splits(word)
    deletes    = [a+b[1:]           for (a, b) in pairs if b]
    transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]
    replaces   = [a+c+b[1:]         for (a, b) in pairs for c in alphabet if b]
    inserts    = [a+c+b             for (a, b) in pairs for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def splits(word):
    "Return a list of all possible (first, rest) pairs that comprise word."
    return [(word[:i], word[i:]) 
                for i in range(len(word)+1)]
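To see how the four kinds of edit contribute, the sketch below rebuilds each list for `"wird"` outside of `edits1` and counts the raw strings before duplicates are removed. It repeats `splits` and the `alphabet` import so that it runs on its own.

```python
from string import ascii_lowercase as alphabet

def splits(word):
    "Return a list of all possible (first, rest) pairs that comprise word."
    return [(word[:i], word[i:]) for i in range(len(word) + 1)]

pairs      = splits('wird')
deletes    = [a + b[1:]               for (a, b) in pairs if b]
transposes = [a + b[1] + b[0] + b[2:] for (a, b) in pairs if len(b) > 1]
replaces   = [a + c + b[1:]           for (a, b) in pairs for c in alphabet if b]
inserts    = [a + c + b               for (a, b) in pairs for c in alphabet]

# Raw counts for a 4-letter word: n deletes, n-1 transposes,
# 26n replaces, and 26(n+1) inserts.
for name, edits in [('deletes', deletes), ('transposes', transposes),
                    ('replaces', replaces), ('inserts', inserts)]:
    print(name, len(edits))
```

Note that `replaces` contains the original word itself (a letter replaced by the same letter). That is harmless here because `correct` tries `edits0` first.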

In [5]:
# TODO: Load the alphabet from Standard Library

In [6]:
from string import ascii_lowercase as alphabet

In [7]:
assert alphabet == 'abcdefghijklmnopqrstuvwxyz'

In [8]:
splits('wird')

[('', 'wird'), ('w', 'ird'), ('wi', 'rd'), ('wir', 'd'), ('wird', '')]

In [9]:
print(edits0('wird'))

{'wird'}


In [10]:
print(edits1('wird'))

{'wlrd', 'wirr', 'wircd', 'wxrd', 'wprd', 'wvrd', 'twird', 'wlird', 'wyird', 'wirvd', 'wirhd', 'tird', 'wirod', 'wirid', 'wipd', 'woird', 'wkrd', 'wnird', 'wirf', 'wrrd', 'zird', 'gird', 'wfrd', 'wirtd', 'wirkd', 'weird', 'wirds', 'owird', 'wikrd', 'whrd', 'wurd', 'wirdk', 'vird', 'wirg', 'wirt', 'wirdj', 'sird', 'wirud', 'wiard', 'wqrd', 'ywird', 'wirmd', 'wiwrd', 'wbrd', 'wirrd', 'wirld', 'rwird', 'widr', 'wnrd', 'wirnd', 'wiad', 'wirde', 'wivrd', 'wirpd', 'wmrd', 'wixd', 'nird', 'wijd', 'wirdf', 'wirdd', 'wirbd', 'swird', 'dwird', 'winrd', 'wir', 'dird', 'lird', 'wsird', 'wirwd', 'wirc', 'wzrd', 'wirb', 'hird', 'wmird', 'wirdb', 'bird', 'qird', 'oird', 'nwird', 'wirdg', 'wbird', 'wrid', 'witd', 'word', 'widd', 'wrird', 'wirdz', 'wirfd', 'wira', 'wirw', 'xird', 'wirx', 'wirv', 'jird', 'vwird', 'qwird', 'wifrd', 'wiryd', 'wimd', 'wixrd', 'xwird', 'wgrd', 'wirdx', 'wied', 'gwird', 'ird', 'wifd', 'kird', 'wirh', 'cwird', 'wzird', 'widrd', 'wirdm', 'wizrd', 'iwrd', 'wtird', 'wizd', 'iwir

In [11]:
print("{:,}".format(len(edits2('wird'))))

24,254


-----

Set up common functions and data structures

In [12]:
import re

In [13]:
def tokens(text):
    "List all the word tokens (consecutive letters) in a text. Normalize to lowercase."
    return re.findall('[a-z]+', text.lower())
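A quick check of what `tokens` keeps and drops may help. The sample sentence is made up for illustration; note that digits and punctuation are discarded and that apostrophes split a word in two.

```python
import re

def tokens(text):
    "List all the word tokens (consecutive letters) in a text. Normalize to lowercase."
    return re.findall('[a-z]+', text.lower())

sample = "Don't PANIC: it's 4 o'clock!"
sample_tokens = tokens(sample)
print(sample_tokens)
```

Because contractions are split this way, fragments such as `t` and `s` end up as entries in the word counts.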

In [14]:
with open('../../../corpora/shakespeare_all.txt') as f:
    text = f.read()

In [15]:
from collections import Counter

In [16]:
words = tokens(text)
counts = Counter(words)
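The way `Counter` treats missing words matters for `known` and for `max(..., key=counts.get)`. The sketch below uses a hypothetical six-word list in place of the Shakespeare corpus.

```python
from collections import Counter

toy_words = ['to', 'be', 'or', 'not', 'to', 'be']
toy_counts = Counter(toy_words)

print(toy_counts.most_common(2))

# Indexing a missing word gives 0 without adding it to the Counter ...
missing_by_index = toy_counts['zounds']
# ... so a membership test (as used by `known`) still reports it as unknown,
is_known = 'zounds' in toy_counts
# ... while .get (as used by `correct`) returns None rather than 0.
missing_by_get = toy_counts.get('zounds')
```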

-----
Back to spell correction

In [17]:
phrase_uncorrected = 'Speling errurs in somethink. Whutever; unusuel misteakes everyware?'

In [18]:
phrase_corrected = map(correct, tokens(phrase_uncorrected))

print("OG Token, ", "Correct Token", end="\n\n")
print(*zip(tokens(phrase_uncorrected), 
           phrase_corrected), sep="\n")

OG Token,  Correct Token

('speling', 'spelling')
('errurs', 'errors')
('in', 'in')
('somethink', 'something')
('whutever', 'whatever')
('unusuel', 'unusual')
('misteakes', 'mistakes')
('everyware', 'everywhere')


----
Let's return a better-formatted result

In [19]:
def correct_text(text):
    "Correct all the words within a text, returning the corrected text."
    return re.sub('[a-zA-Z]+', correct_match, text)

def correct_match(match):
    "Spell-correct word in match, and preserve proper upper/lower/title case."
    word = match.group()
    return case_of(word)(correct(word.lower()))

def case_of(text):
    "Return the case-function appropriate for text: upper, lower, title, or just str."
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)
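To illustrate how `case_of` picks a case function, here is a small standalone sketch. It repeats the definition so that it runs on its own; the sample words are made up.

```python
def case_of(text):
    "Return the case-function appropriate for text: upper, lower, title, or just str."
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)

# The case of the original token is re-applied to the lowercase correction.
shouted = case_of('SPELING')('spelling')
titled  = case_of('Speling')('spelling')
plain   = case_of('speling')('spelling')
# Mixed case matches none of the tests, so `str` leaves the correction as-is.
mixed   = case_of('sPeLiNg')('spelling')

print(shouted, titled, plain, mixed)
```

One consequence of the fallback is that irregular capitalization in the input is not preserved: the correction comes back in lowercase.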

Let's see our best guesses for the misspelled words

In [20]:
print(phrase_uncorrected)
print(correct_text(phrase_uncorrected))

Speling errurs in somethink. Whutever; unusuel misteakes everyware?
Spelling errors in something. Whatever; unusual mistakes everywhere?


In [21]:
phrase_uncorrected_2 = 'Audiance sayzs: spealling is difffucult...'
print(phrase_uncorrected_2)
print(correct_text(correct_text(phrase_uncorrected_2)))

Audiance sayzs: spealling is difffucult...
Audience says: spelling is difficult...


In [22]:
phase_harder = "the elegant lady entered the room"

In [23]:
print(correct_text(correct_text(phase_harder))) # Yikes!

the element lady entered the room


![](https://cdn.meme.am/cache/instances/folder5/500x/66510005.jpg)

In [24]:
#TODO: Improve spellchecker with MOAR DATA. Load and use unigram counts

In [25]:
def load_counts(filename, sep='\t'):
    """Return a Counter initialized from key-value pairs, 
    one on each line of filename."""
    C = Counter()
    # Use a context manager so the file is closed once it has been read.
    with open(filename) as f:
        for line in f:
            key, count = line.split(sep)
            C[key] = int(count)
    return C
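As a rough illustration of the input that `load_counts` expects, the sketch below writes a tiny tab-separated file and loads it. The three words and their counts are invented, and the assumption is that the real unigram file uses the same one `word<TAB>count` pair per line layout.

```python
import os
import tempfile
from collections import Counter

def load_counts(filename, sep='\t'):
    "Return a Counter initialized from key-value pairs, one on each line of filename."
    C = Counter()
    with open(filename) as f:
        for line in f:
            key, count = line.split(sep)
            C[key] = int(count)   # int() tolerates the trailing newline
    return C

# Hypothetical data: one "word<TAB>count" pair per line.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as f:
    f.write("the\t100\nelegant\t7\nelement\t3\n")
    path = f.name

toy_counts = load_counts(path)
os.remove(path)

# With these counts, a tie on nearness is resolved in favour of 'elegant'.
print(max(['elegant', 'element'], key=toy_counts.get))
```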

In [26]:
counts = load_counts('../../../corpora/unigram_word_counts.txt')

In [27]:
phase_harder = "the elegant lady entered the room"
print(correct_text(correct_text(phase_harder)))

the elegant lady entered the room


In [28]:
assert phase_harder == correct_text(correct_text(phase_harder))

In [29]:
phase_haderest = "you wrote elagent, elligit, but you meant elegant"
print(correct_text(correct_text(phase_haderest))) # Better but not great, yet

you wrote agent, elliott, but you meant elegant


In [30]:
#TODO: Load common spelling errors

In [31]:
# Map each correctly spelled word to its list of common misspellings,
# reading one "word: misspelling, misspelling" entry per line.
spell_errors = {}
with open("../../../corpora/spell_errors.txt") as f:
    for line in f:
        # Not named `correct`, which would shadow the correct() function.
        correct_word, errors = line.split(":")
        spell_errors[correct_word.replace("_", " ")] = errors.strip().split(", ")  # XXX: ignore str with "foo*2" format

In [32]:
assert spell_errors['raining'] == ['rainning', 'raning']
assert spell_errors['at least'] == ['atleast']
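The parsing step can be tried without the corpus file. The sketch below reads three sample lines from an in-memory buffer: the first two are reconstructed from the assertions above, and the third is a hypothetical entry with a `*2` suffix. Stripping that suffix with a regular expression is one possible way to handle the `"foo*2"` format, assuming the suffix is a repeat count; it is not what the notebook's loader does.

```python
import io
import re

# Sample lines in the assumed "word: misspelling, misspelling" layout.
sample = io.StringIO(
    "raining: rainning, raning\n"
    "at_least: atleast\n"
    "writing: writting*2, writeing\n"   # hypothetical entry with a "*2" suffix
)

toy_spell_errors = {}
for line in sample:
    right, errors = line.split(":")
    # Drop a trailing "*N" marker from each misspelling (assumed format).
    cleaned = [re.sub(r'\*\d+$', '', e) for e in errors.strip().split(", ")]
    toy_spell_errors[right.replace("_", " ")] = cleaned

print(toy_spell_errors)
```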

In [33]:
# TODO: Rewrite "correct" function to use common errors if applicable, otherwise default to the old method

In [34]:
def correct(word):
    "Find the best spelling correction for this word."
    
    # Look for current word as a value in common spelling error dictionary
    # If found, use that correction.
    # XXX: there are dictionary words in the values, which introduces new errors ☹
    common_errors = [key for key,values in spell_errors.items() 
                                  if word in values] 
    
    # Prefer common errors, then edit distance 0, then 1, then 2; otherwise default to word itself.
    candidates = (common_errors or
                  known(edits0(word)) or 
                  known(edits1(word)) or 
                  known(edits2(word)) or 
                  [word])
    
    print(candidates)
    return max(candidates, key=counts.get)

In [38]:
# Invert the dictionary once (misspelling -> correct word), so each lookup
# is a single dict access instead of a scan through every list of values.
mydict = {}
for correct_word, misspellings in spell_errors.items():
    for misspelling in misspellings:
        mydict[misspelling] = correct_word

def correct(word):
    "Find the best spelling correction for this word."
    # Prefer a listed common misspelling, if there is one.
    try:
        return mydict[word]
    except KeyError:
        pass

    # Otherwise prefer edit distance 0, then 1, then 2; default to word itself.
    candidates = (known(edits0(word)) or 
                 known(edits1(word)) or 
                 known(edits2(word)) or 
                 [word])
    return max(candidates, key=counts.get)

In [40]:
assert correct_text("elagent") == "elegant"
assert correct_text("elligit") == "elegant"

In [41]:
correct_text("bug") # XXX: there are dictionary words in the values, which introduces new errors ☹

'dug'
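One possible mitigation, sketched below under stated assumptions, is to consult the common-errors table only when the word is not already a known word. The toy `counts`, the toy `spell_errors` entries (including `bug` listed as a misspelling of `dug`), and the names `misspelling_to_word` and `lookup_correction` are all hypothetical. The edit-distance fallback is left out to keep the sketch short.

```python
from collections import Counter

# Hypothetical toy data standing in for the notebook's counts and spell_errors.
counts = Counter({'bug': 40, 'dug': 12, 'elegant': 9})
spell_errors = {'dug': ['bug', 'dgu'], 'elegant': ['elagent', 'elligit']}

# Inverted table: misspelling -> correct word.
misspelling_to_word = {m: right
                       for right, misspellings in spell_errors.items()
                       for m in misspellings}

def lookup_correction(word):
    "Keep known words as they are; otherwise try the common-errors table."
    if word in counts:
        return word
    return misspelling_to_word.get(word, word)

for w in ['bug', 'dgu', 'elagent', 'xyzzy']:
    print(w, '->', lookup_correction(w))
```

The trade-off is that a real word typed by mistake is never corrected, which matches how the original `correct` already treats known words through `edits0`.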

<br>
<br>

---