# Lab 0 - 4/9 
### Author Sepehr Tayari elt15sta
###### The following functions implement a spell-checking application

In [1]:
import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())): 
    "Probability of `word`."
    return WORDS[word] / N

def correction(word): 
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word): 
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words): 
    "The subset of `words` that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word): 
    "All edits that are two edits away from `word`."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# How is the correct word chosen?

A misspelled word has many possible variations within a small edit distance. To find the correct word, each variation must be checked against some sort of dictionary to determine whether it is an actual word or just gibberish. In this lab, the file big.txt serves as the dictionary. Calling the function correction() gives the following:

In [2]:
correction('seling')

'seeing'

In [3]:
correction('slling')

'selling'

correction() calls candidates(), which in turn generates these variations and checks them. Consider first the function edits1(), which generates every string that is one edit (a deletion, transposition, replacement or insertion) away from a given string.

In [4]:
len(edits1('seling'))

338
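To see where the 338 strings come from, the four lists built inside edits1() can be counted separately. The following is a minimal sketch; the helper name `edit_breakdown` is hypothetical and only repeats the list comprehensions from edits1() so that the block runs on its own.

```python
def edit_breakdown(word):
    """Return the four edit lists that edits1() builds, keyed by edit type."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        'deletes':    [L + R[1:] for L, R in splits if R],
        'transposes': [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1],
        'replaces':   [L + c + R[1:] for L, R in splits if R for c in letters],
        'inserts':    [L + c + R for L, R in splits for c in letters],
    }

groups = edit_breakdown('seling')
for name, items in groups.items():
    print(name, len(items))   # deletes 6, transposes 5, replaces 156, inserts 182

raw_total = sum(len(items) for items in groups.values())
unique_total = len(set().union(*groups.values()))
print(raw_total, unique_total)   # 349 338
```

The raw lists hold 349 strings, but some occur more than once (for example, replacing a letter with itself gives back 'seling', and inserting 's' before or after the first 's' gives the same string), so the set contains 338.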

Of these 338 suggestions, only 5 are actual words. To determine whether a string is an actual word, the function known() is used. known() keeps only those strings that appear in big.txt.

In [5]:
known(edits1('seling'))

{'sealing', 'seeing', 'selling', 'sewing', 'sling'}

Which word is chosen is determined statistically. In the two correction() examples above, the intended word could plausibly be 'selling' in both cases, yet the outputs differ. The choice is made by the function P(), which returns the relative frequency of a word in big.txt. The candidates() function shows which words are considered.

In [6]:
candidates('seling')

{'sealing', 'seeing', 'selling', 'sewing', 'sling'}

In [7]:
candidates('slling')

{'selling', 'sling'}

The function edits2() is also provided. It generates strings that are two edits away from the input. The set of generated strings becomes much larger in this case, and therefore contains more known words.

In [8]:
len(known(edits2('seling')))

76
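The growth from one edit to two can be checked without big.txt by comparing the sizes of the two generated sets. This is a rough sketch: edits1() and edits2() are repeated from the first cell so the block is self-contained, and no particular count is claimed for the two-edit set.

```python
def edits1(word):
    """All strings that are one edit away from `word` (same logic as the first cell)."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings that are two edits away from `word` (a generator, may repeat)."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

one = edits1('seling')
two = set(edits2('seling'))   # collapse the generator into distinct strings

print(len(one), len(two))     # the second number is far larger than the first
print('spelling' in one, 'spelling' in two)   # False True
```

'spelling' needs two insertions ('p' and a second 'l'), so it is only reachable through edits2().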

However, edits2() is only used if edits1() yields no known words, since misspelling one letter is more common than misspelling two.
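This ordering comes from the `or` chain in candidates(): an empty set is falsy in Python, so each tier is evaluated only when every earlier tier came back empty. A minimal sketch with a toy vocabulary follows; `VOCAB`, `known_in` and `pick` are hypothetical names, and the two tier arguments are hand-written stand-ins for edits1() and edits2().

```python
# Hypothetical toy vocabulary; the notebook uses the Counter built from big.txt.
VOCAB = {'selling', 'sling', 'spelling'}

def known_in(words, vocab=VOCAB):
    """Keep only the words that appear in the vocabulary."""
    return {w for w in words if w in vocab}

def pick(word, tier1, tier2):
    """Mirror the `or` chain in candidates(): the first non-empty tier wins."""
    return known_in([word]) or known_in(tier1) or known_in(tier2) or [word]

# The word itself is known, so the later tiers are never consulted.
print(pick('sling', ['selling'], ['spelling']))    # {'sling'}

# The word is unknown, but the one-edit tier contains a known word.
print(pick('slling', ['selling', 'sllng'], ['spelling']))   # {'selling'}

# Nothing is known at any tier, so the input comes back unchanged in a list.
print(pick('zzzz', ['zzz'], ['zz']))   # ['zzzz']
```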

Which candidate correction() returns is determined by the P() function, which is passed as the key argument to max().


In [9]:
for k in candidates('seling'):
    print(P(k), k)

4.481953414576209e-06 sewing
1.7927813658304835e-06 sling
7.171125463321934e-06 sealing
1.4342250926643868e-05 selling
0.00018555287136345504 seeing


Above we can see that the probability of 'seeing' is the highest, and therefore correction('seling') returns 'seeing'.
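The selection step can be tried in isolation with a small table of counts. In the sketch below, the counts in `TOY_COUNTS` are made up for illustration and are not taken from big.txt; `toy_p` is a hypothetical stand-in for P().

```python
from collections import Counter

# Hypothetical counts for illustration only; they are not read from big.txt.
TOY_COUNTS = Counter({'seeing': 200, 'selling': 16, 'sealing': 8,
                      'sewing': 5, 'sling': 2})
TOY_N = sum(TOY_COUNTS.values())

def toy_p(word):
    """Relative frequency of `word`; a Counter returns 0 for unseen words."""
    return TOY_COUNTS[word] / TOY_N

cands = {'sealing', 'seeing', 'selling', 'sewing', 'sling'}

best = max(cands, key=toy_p)                     # what correction() would return
ranked = sorted(cands, key=toy_p, reverse=True)  # all candidates, best first

print(best)     # seeing
print(ranked)   # ['seeing', 'selling', 'sealing', 'sewing', 'sling']
```

Because max() only compares the values returned by the key function, a rare word can never beat a common one in the same candidate set, however close it is to the input.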

We can use Python's built-in functions and the Counter methods to inspect the data:

In [10]:
len(WORDS)

32198

In [11]:
WORDS.most_common(10)

[('the', 79809),
 ('of', 40024),
 ('and', 38312),
 ('to', 28765),
 ('in', 22023),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681)]

The word 'the' is the most common, which is expected since it makes up roughly 7% of English text. We can verify this:

In [12]:
P('the')

0.07154004401278254
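The value returned by P() is simply the word's count divided by the total number of tokens. The same calculation can be followed by hand on a short text; the sentence in `sample` below is made up for illustration, and words() is repeated from the first cell so the block runs without big.txt.

```python
import re
from collections import Counter

def words(text):
    """Lowercase the text and split it into word tokens (same as the first cell)."""
    return re.findall(r'\w+', text.lower())

# Made-up sample text standing in for big.txt.
sample = "The cat and the dog saw the bird. The bird saw a cat."

counts = Counter(words(sample))
total = sum(counts.values())

print(len(counts))              # 7 distinct words
print(total)                    # 13 tokens in total
print(counts.most_common(1))    # [('the', 4)]
print(counts['the'] / total)    # 4/13, about 0.31
```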