---
Exercise: Spell Correction
====

<img src="https://s-media-cache-ak0.pinimg.com/236x/af/22/72/af2272d2f2f749c196407d724005f232.jpg" style="width: 600px;"/>

---

<img src="http://www.azquotes.com/picture-quotes/quote-simple-models-and-a-lot-of-data-trump-more-elaborate-models-based-on-less-data-peter-norvig-80-37-54.jpg" style="width: 600px;"/>

Inspired by Peter Norvig's [How to Do Things with Words in Python](http://nbviewer.jupyter.org/url/norvig.com/ipython/How%20to%20Do%20Things%20with%20Words.ipynb)

----
Spelling Correction Task
----

Given a word *w*, find the most likely correction *c* = `correct(`*w*`)`.

__Approach:__ Try all candidate words *c* that are known words that are near *w*.  Choose the most likely one.

How do we balance *near* and *likely*?

For now, in a trivial way: always prefer the nearer word, and break ties on nearness by choosing the word with the highest count.

Measure nearness by *edit distance*: the minimum number of deletions, transpositions, insertions, or replacements of characters. By trial and error, we determine that going out to edit distance 2 will give us reasonable results.
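To make the definition concrete, here is a minimal sketch that computes this distance directly for a pair of words. It is not how the notebook proceeds (the notebook enumerates candidate strings instead), and the function name `edit_distance` is an assumption made for this illustration. It implements the restricted "optimal string alignment" variant, in which a transposition swaps two adjacent characters.

```python
def edit_distance(a, b):
    """Optimal-string-alignment distance: deletions, insertions,
    replacements, and adjacent transpositions each cost 1."""
    # dist[i][j] holds the distance between a[:i] and b[:j].
    dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dist[i][0] = i  # delete every character of a[:i]
    for j in range(len(b) + 1):
        dist[0][j] = j  # insert every character of b[:j]
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,          # deletion
                             dist[i][j - 1] + 1,          # insertion
                             dist[i - 1][j - 1] + cost)   # replacement or match
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                # transposition of two adjacent characters
                dist[i][j] = min(dist[i][j], dist[i - 2][j - 2] + 1)
    return dist[len(a)][len(b)]

print(edit_distance('wird', 'word'))        # 1 (replace i with o)
print(edit_distance('wird', 'weird'))       # 1 (insert e)
print(edit_distance('wird', 'iwrd'))        # 1 (transpose w and i)
print(edit_distance('elagent', 'elegant'))  # 2 (two replacements)
```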

Then we can define `correct(`*w*`)`:

In [1]:
reset -fs

In [2]:
def correct(word):
    "Find the best spelling correction for this word."
    # Prefer edit distance 0, then 1, then 2; otherwise default to word itself.
    candidates = (known(edits0(word)) or 
                  known(edits1(word)) or 
                  known(edits2(word)) or 
                  [word])
    return max(candidates, key=counts.get)

The functions `known` and `edits0` are easy; and `edits2` is easy if we assume we have `edits1`:

In [10]:
def known(words):  # returns a set
    "Return the subset of words that are actually in the dictionary."
    return {word for word in words 
                if word in counts}

def edits0(word): 
    "Return all strings that are zero edits away from word (i.e., just word itself)."
    return {word}

def edits2(word):
    "Return all strings that are two edits away from this word."
    return {e2 for e1 in edits1(word) 
                for e2 in edits1(e1)}

In [29]:
known('hello there how are you')

{'a', 'e', 'h', 'l', 'o', 'r', 't', 'u', 'w', 'y'}
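Note that the call above passes a single string, so `known` iterates over its characters and returns the single letters that happen to be in `counts`. The sketch below shows the difference between passing a string and passing a list of words. The toy counts and the helper name `known_in` are assumptions made so the example runs on its own; they are not part of the notebook.

```python
from collections import Counter

# Hypothetical toy counts, standing in for the corpus-derived `counts`.
toy_counts = Counter({'hello': 3, 'there': 2, 'how': 5, 'a': 9, 'e': 1})

def known_in(words, vocabulary):
    "Return the subset of words that appear in vocabulary."
    return {word for word in words if word in vocabulary}

phrase = 'hello there how are you'
by_char = known_in(phrase, toy_counts)          # iterates over characters
by_word = known_in(phrase.split(), toy_counts)  # iterates over words

print(sorted(by_char))  # ['a', 'e']
print(sorted(by_word))  # ['hello', 'how', 'there']
```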

Now for `edits1(word)`: the set of candidate words that are one edit away. For example, given `"wird"`, this would include `"weird"` (inserting an `e`) and `"word"` (replacing an `i` with an `o`), and also `"iwrd"` (transposing `w` and `i`; then `known` can be used to filter this out of the set of final candidates). How could we get them? One way is to *split* the original word in all possible places, each split forming a *pair* of words, `(a, b)`, before and after the place, and at each place, either delete, transpose, replace, or insert a letter:

<table>
  <tr><td> pairs: <td><tt> Ø+wird <td><tt> w+ird <td><tt> wi+rd <td><tt>wir+d<td><tt>wird+Ø<td><i>Notes:</i><tt> (<i>a</i>, <i>b</i>)</tt> pair
  <tr><td> deletions: <td><tt>Ø+ird<td><tt> w+rd<td><tt> wi+d<td><tt> wir+Ø<td><td><i>Delete first char of b</i>
  <tr><td> transpositions: <td><tt>Ø+iwrd<td><tt> w+rid<td><tt> wi+dr</tt><td><td><td><i>Swap first two chars of b
  <tr><td> replacements: <td><tt>Ø+?ird<td><tt> w+?rd<td><tt> wi+?d<td><tt> wir+?</tt><td><td><i>Replace char at start of b
  <tr><td> insertions: <td><tt>Ø+?+wird<td><tt> w+?+ird<td><tt> wi+?+rd<td><tt> wir+?+d<td><tt> wird+?+Ø</tt><td><i>Insert char between a and b
</table>

In [71]:
def edits1(word):
    "Return all strings that are one edit away from this word."
    #Creating all possible edits
    pairs      = splits(word)
    #print(pairs)
    deletes    = [a+b[1:]           for (a, b) in pairs if b]
    #print(deletes,'delete')
    transposes = [a+b[1]+b[0]+b[2:] for (a, b) in pairs if len(b) > 1]
    #print(transposes,'transpose')
    replaces   = [a+c+b[1:]         for (a, b) in pairs for c in alphabet if b]
    
    #print(replaces,'replaces')
    inserts    = [a+c+b             for (a, b) in pairs for c in alphabet]
    #print(inserts,'inserts')
    return set(deletes + transposes + replaces + inserts)

def splits(word):
    "Return a list of all possible (first, rest) pairs that comprise word."
    return [(word[:i], word[i:]) 
                for i in range(len(word)+1)]

In [57]:
# TODO: Load the alphabet from Standard Library

In [58]:
import string
alphabet = string.ascii_lowercase

In [59]:
assert alphabet == 'abcdefghijklmnopqrstuvwxyz'

In [60]:
splits('wird')

[('', 'wird'), ('w', 'ird'), ('wi', 'rd'), ('wir', 'd'), ('wird', '')]

In [66]:
l = 'hello'
''+'a'+l[1:]

'aello'

In [67]:
print(edits1('hello'))

['aello', 'bello', 'cello', 'dello', 'eello', 'fello', 'gello', 'hello', 'iello', 'jello', 'kello', 'lello', 'mello', 'nello', 'oello', 'pello', 'qello', 'rello', 'sello', 'tello', 'uello', 'vello', 'wello', 'xello', 'yello', 'zello', 'hallo', 'hbllo', 'hcllo', 'hdllo', 'hello', 'hfllo', 'hgllo', 'hhllo', 'hillo', 'hjllo', 'hkllo', 'hlllo', 'hmllo', 'hnllo', 'hollo', 'hpllo', 'hqllo', 'hrllo', 'hsllo', 'htllo', 'hullo', 'hvllo', 'hwllo', 'hxllo', 'hyllo', 'hzllo', 'healo', 'heblo', 'heclo', 'hedlo', 'heelo', 'heflo', 'heglo', 'hehlo', 'heilo', 'hejlo', 'heklo', 'hello', 'hemlo', 'henlo', 'heolo', 'heplo', 'heqlo', 'herlo', 'heslo', 'hetlo', 'heulo', 'hevlo', 'hewlo', 'hexlo', 'heylo', 'hezlo', 'helao', 'helbo', 'helco', 'heldo', 'heleo', 'helfo', 'helgo', 'helho', 'helio', 'heljo', 'helko', 'hello', 'helmo', 'helno', 'heloo', 'helpo', 'helqo', 'helro', 'helso', 'helto', 'heluo', 'helvo', 'helwo', 'helxo', 'helyo', 'helzo', 'hella', 'hellb', 'hellc', 'helld', 'helle', 'hellf', 'hellg', 

In [49]:
print("{:,}".format(len(edits2('wird'))))

24,254
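A rough way to see where these sizes come from: for a word of length *n*, `edits1` generates *n* deletions, *n* - 1 transpositions, 26*n* replacements, and 26(*n* + 1) insertions, which is 54*n* + 25 strings before the set removes duplicates. The sketch below counts them for `'wird'`; the helper name `edit_counts` is an assumption made for this illustration.

```python
import string

alphabet = string.ascii_lowercase

def edit_counts(word):
    "Count the strings produced by each edit type, before duplicates are removed."
    pairs = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes':    sum(1 for a, b in pairs if b),
        'transposes': sum(1 for a, b in pairs if len(b) > 1),
        'replaces':   sum(len(alphabet) for a, b in pairs if b),
        'inserts':    sum(len(alphabet) for a, b in pairs),
    }

counts_by_type = edit_counts('wird')
total = sum(counts_by_type.values())
print(counts_by_type)  # {'deletes': 4, 'transposes': 3, 'replaces': 104, 'inserts': 130}
print(total)           # 241, i.e. 54 * 4 + 25
```

`edits2` applies `edits1` again to each of those strings, which is why its result runs into the tens of thousands even for a four-letter word.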


-----

Set up common functions and data structures

In [50]:
import re

In [51]:
def tokens(text):
    "List all the word tokens (consecutive letters) in a text. Normalize to lowercase."
    return re.findall('[a-z]+', text.lower())

In [52]:
with open('../../corpora/shakespeare_all.txt') as f:
    text = f.read()

In [53]:
from collections import Counter

In [97]:
words = tokens(text)
counts = Counter(words)
#counts.items()

-----
Back to spell correction

In [55]:
phrase_uncorrected = 'Speling errurs in somethink. Whutever; unusuel misteakes everyware?'

In [31]:
phrase_corrected = map(correct, tokens(phrase_uncorrected))

print("OG Token, ", "Corrected Token", end="\n\n")
print(*zip(tokens(phrase_uncorrected), 
           phrase_corrected), sep="\n")

OG Token,  Corrected Token

('speling', 'seeling')
('errurs', 'errors')
('in', 'in')
('somethink', 'something')
('whutever', 'whatever')
('unusuel', 'unusual')
('misteakes', 'mistakes')
('everyware', 'everywhere')


----
Let's return a better formatted result

In [163]:
def correct_text(text):
    "Correct all the words within a text, returning the corrected text."
    return re.sub('[a-zA-Z]+', correct_match, text)

def correct_match(match):
    "Spell-correct word in match, and preserve proper upper/lower/title case."
    #print(match,'match')
    word = match.group()  # the full matched string (the pattern has no capture groups)
    #print(word,'word')
    #print(case_of(word)(correct(word.lower())))
    return case_of(word)(correct(word.lower()))

def case_of(text):
    "Return the case-function appropriate for text: upper, lower, title, or just str."
    # isupper/islower/istitle each test the cased characters of the whole string
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)
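A small illustration of how `case_of` restores the original casing. The function is repeated so the snippet runs on its own, and the token pairs are made up for the example.

```python
def case_of(text):
    "Return the case-function appropriate for text: upper, lower, title, or just str."
    return (str.upper if text.isupper() else
            str.lower if text.islower() else
            str.title if text.istitle() else
            str)

# (original token, lowercase correction) pairs, invented for this example.
examples = [('ERRURS', 'errors'), ('errurs', 'errors'),
            ('Errurs', 'errors'), ('eRRurs', 'errors')]

# The original token selects the function; it is applied to the correction.
restored = [case_of(original)(fixed) for original, fixed in examples]
print(restored)  # ['ERRORS', 'errors', 'Errors', 'errors']
```

Mixed-case input such as `'eRRurs'` matches none of the three tests, so `str` is returned and the lowercase correction passes through unchanged.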

Let's see our best guess for the misspelled words

In [164]:
print(phrase_uncorrected)
print(correct_text(phrase_uncorrected))

Speling errurs in somethink. Whutever; unusuel misteakes everyware?
Speling erreurs in somethin. Whatever; unusual mistakes everyway?


In [126]:
p = 'StrIng'
p.upper()

'STRING'

In [78]:
phrase_uncorrected_2 = 'Audiance sayzs: spealling is difffucult...'
print(phrase_uncorrected_2)
print(correct_text(phrase_uncorrected_2))

Audiance sayzs: spealling is difffucult...
Audience says: spelling is difficult...


In [74]:
phase_harder = "the elegant lady entered the room"

In [77]:
print(correct_text(correct_text(phase_harder))) # Yikes!

the element lady entered the room


![](https://cdn.meme.am/cache/instances/folder5/500x/66510005.jpg)

In [25]:
#TODO: Improve spellchecker with MOAR DATA. Load and use unigram counts

In [116]:
with open('../../corpora/unigram_word_counts.txt') as f:
    unigram = f.read()

In [129]:
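# Split on whitespace (assumed file layout: word, tab, count, one entry per line).
# Run this cell once. Re-running it passes the already-split list to re.split,
# which raises "TypeError: expected string or bytes-like object".
unigram = re.split(r'\s', unigram)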

In [135]:
len(unigram)

666667

In [136]:
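# Tokens alternate word, count, word, count, ... so pair each even position with the
# position that follows it. Counts are converted to int so they compare numerically.
unigram_dict = {unigram[i]: int(unigram[i+1]) for i in range(0, len(unigram)-1, 2)}

In [139]:
unigram_dict['the']

23135851162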

In [148]:
counts = unigram_dict #this will redefine your corpus
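Because `correct` ranks candidates with `max(candidates, key=counts.get)`, the values in `counts` need to be numbers. Counts read from a text file start out as strings, and strings compare character by character. A minimal illustration with made-up counts:

```python
# Hypothetical counts, chosen only to show the ordering difference.
as_strings = {'lament': '9', 'elegant': '10'}
as_ints = {word: int(count) for word, count in as_strings.items()}

best_by_string = max(as_strings, key=as_strings.get)  # '9' > '10' as strings
best_by_int = max(as_ints, key=as_ints.get)           # 10 > 9 as numbers

print(best_by_string)  # lament
print(best_by_int)     # elegant
```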

In [149]:
phase_harder = "the elegant lady entered the room"
print(correct_text(correct_text(phase_harder)))

the elegant lady entered the room


In [150]:
assert phase_harder == correct_text(correct_text(phase_harder))

In [191]:
phase_haderest = "you wrote elagent, elligit, but you meant elegant"
print(correct_text(correct_text(phase_haderest))) # Better but not great, yet

you wrote lament, elicit, but you meant elegant


In [192]:
#TODO: Load common spelling errors

In [370]:
with open('../../corpora/spell_errors.txt') as t:
    spelling = t.read()

In [371]:
spelling[:80]

'raining: rainning, raning\nwritings: writtings\ndisparagingly: disparingly\nyellow:'

In [372]:
spelling = re.split('\n|:',spelling)

In [377]:
spelling = [item.replace('_',' ') for item in spelling]
spelling = [item.strip(' ') for item in spelling]

In [378]:
for item in spelling:
    if item == 'at least':
        print(item)

at least


In [379]:
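# Entries alternate correct word, misspelling list (assuming every line reads
# "word: misspelling, misspelling"), so pair each even position with the next one.
spell_errors = {right: wrongs.split(', ') for right, wrongs in zip(spelling[::2], spelling[1::2])}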

In [380]:
spell_errors['at least']

['atleast']

In [382]:
assert spell_errors['raining'] == ['rainning', 'raning']
assert spell_errors['at least'] == ['atleast']

In [383]:
# TODO: Rewrite "correct" function to use common errors if applicable, otherwise default to old method

In [398]:
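def correct_text(word):
    "Correct one word: use the common-error list if it applies, else fall back to edit distance."
    # Note: this redefinition replaces the earlier text-level correct_text.
    # Common-error lookup: return the correct word whose misspelling list contains word.
    for k, v in spell_errors.items():
        if word in v:
            return k
    # Fallback: prefer edit distance 0, then 1, then 2; otherwise keep the word itself.
    candidates = (known(edits0(word)) or
                  known(edits1(word)) or
                  known(edits2(word)) or
                  [word])
    return max(candidates, key=counts.get)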
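The loop in the cell above scans every entry of `spell_errors` on each call. One possible alternative, sketched here under stated assumptions, is to invert the mapping once so that each misspelling points directly at its correction. The excerpt `sample_errors` and the names `error_to_word` and `lookup_common_error` are hypothetical and exist only for this illustration.

```python
# Hypothetical excerpt in the same shape as `spell_errors`: correct word -> misspellings.
sample_errors = {'raining': ['rainning', 'raning'],
                 'at least': ['atleast'],
                 'elegant': ['elagent', 'elligit']}

# Invert once: misspelling -> correct word.
error_to_word = {wrong: right
                 for right, wrongs in sample_errors.items()
                 for wrong in wrongs}

def lookup_common_error(word, default=None):
    "Return the recorded correction for word, or default if none is recorded."
    return error_to_word.get(word, default)

print(lookup_common_error('elagent'))  # elegant
print(lookup_common_error('atleast'))  # at least
print(lookup_common_error('wird'))     # None, so the caller would fall back to edit distance
```

If two correct words list the same misspelling, the later entry wins in this sketch; the linear scan returns the first match instead.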

In [399]:
assert correct_text("elagent") == "elegant"
assert correct_text("elligit") == "elegant"

In [400]:
correct_text("elligit")

'elegant'

In [401]:
['elagent' in spell_errors['elegant']]

[True]

In [402]:
fake = {'hello':['hi','bye']}


<br>
<br>

---