* [Importing Libraries](#importing)
* [Tokenise Words](#tokenise)
* [Words Frequency](#frequency)
* [Edit Distance Functions](#edit-distance)

## <font color='#4a8bad'>Importing Libraries</font>
***
<a id="importing"></a>

In [1]:
import re
from collections import Counter

## <font color='#4a8bad'>Tokenise Words</font>
***
<a id="tokenise"></a>

Convert text to lower case and tokenise the document.

In [2]:
def words(document):
    return re.findall(r'\w+', document.lower())

#### <font color='#4a8bad'>Example</font>

In [3]:
words("The monkey was notorious.")

['the', 'monkey', 'was', 'notorious']
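
The `\w+` pattern treats punctuation in ways that may not be obvious: it splits on apostrophes and hyphens, and keeps digits and underscores. The following is a small sketch of that behaviour; the sample sentences are made up for illustration and do not come from the corpus.

```python
import re

def words(document):
    # Same tokeniser as above: lower-case, then grab runs of word characters.
    return re.findall(r'\w+', document.lower())

# Apostrophes and hyphens are not word characters, so they split tokens.
contraction = words("Don't stop")        # ['don', 't', 'stop']
hyphenated = words("well-known fact")    # ['well', 'known', 'fact']

# Digits and underscores are word characters, so they stay inside tokens.
mixed = words("Room_42 is open")         # ['room_42', 'is', 'open']
```

Fragments such as `'t'` end up in the frequency table as separate tokens, which is worth keeping in mind when reading the counts.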

## <font color='#4a8bad'>Words Frequency</font>
***
<a id="frequency"></a>

Create a frequency table of all the words in the document.

In [4]:
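# Read the corpus; the context manager closes the file once it has been read.
with open("../input/big-txt/big.txt") as corpus_file:
    read_document = corpus_file.read()
word_list = words(read_document)
all_words = Counter(word_list)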

#### <font color='#4a8bad'>Example</font>

In [5]:
all_words["chair"]

135
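
A `Counter` returns 0 for a word it has never seen instead of raising `KeyError`, which is what lets later steps look up arbitrary candidate strings safely. Here is a minimal sketch with a toy counter; `toy_counts` and its contents are illustrative and are not part of the notebook.

```python
from collections import Counter

# Toy data standing in for the real frequency table.
toy_counts = Counter(["the", "chair", "the"])

seen = toy_counts["the"]       # 2
unseen = toy_counts["sofa"]    # 0, and no KeyError is raised

# Looking up a missing key does not add it to the counter,
# so membership tests still report the word as unknown.
still_unknown = "sofa" in toy_counts   # False
```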

#### <font color='#4a8bad'>Top 10</font>

Look at the 10 most frequent words.

In [6]:
all_words.most_common(10)

[('the', 79809),
 ('of', 40024),
 ('and', 38312),
 ('to', 28765),
 ('in', 22023),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681)]

## <font color='#4a8bad'>Edit Distance Functions</font>
***
<a id="edit-distance"></a>

Create all edits that are one edit away from `word`.

In [7]:
def edits_one(word):
    alphabets  = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])                   for i in range(len(word) + 1)]
    deletes    = [left + right[1:]                       for left, right in splits if right]
    inserts    = [left + c + right                       for left, right in splits for c in alphabets]
    replaces   = [left + c + right[1:]                   for left, right in splits if right for c in alphabets]
    transposes = [left + right[1] + right[0] + right[2:] for left, right in splits if len(right)>1]
    return set(deletes + inserts + replaces + transposes)

In [8]:
print(len(edits_one("monney")))
print(edits_one("monney"))

336
{'mogney', 'munney', 'mowney', 'monnyey', 'tonney', 'monnezy', 'monneyu', 'monnfy', 'monniy', 'mobnney', 'monaey', 'mkonney', 'monjney', 'msnney', 'monneye', 'montey', 'mnoney', 'lmonney', 'mtnney', 'monneb', 'monqey', 'momney', 'ronney', 'vmonney', 'monxney', 'mznney', 'ymonney', 'mofney', 'mouney', 'nmonney', 'monneyj', 'monnehy', 'monnery', 'montney', 'mbonney', 'mohney', 'monneu', 'monneyy', 'monnec', 'monkey', 'moaney', 'mxonney', 'monnoey', 'mononey', 'monuey', 'dmonney', 'monsey', 'monnny', 'mhnney', 'mjnney', 'monnejy', 'mconney', 'monnhey', 'onney', 'monaney', 'monhey', 'monnkey', 'monneg', 'monpey', 'monnep', 'qonney', 'monmney', 'konney', 'monkney', 'monnefy', 'xmonney', 'moknney', 'lonney', 'mjonney', 'mqonney', 'emonney', 'mondney', 'mvnney', 'minney', 'mownney', 'moenney', 'mojnney', 'muonney', 'moznney', 'monhney', 'monncey', 'mopney', 'monnhy', 'moniey', 'monlney', 'monnvy', 'kmonney', 'mozney', 'monneyo', 'mnnney', 'moiney', 'moinney', 'monneyf', 'mynney', 'mongney
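
To see where the 336 comes from, it can help to count the candidates per edit type. The sketch below uses a hypothetical helper, `edit_counts`, which is not part of the notebook and simply repeats the list comprehensions from `edits_one`. For a word of length n, the raw lists hold n deletes, n - 1 transposes, 26n replaces and 26(n + 1) inserts, so 54n + 25 strings before duplicates are removed.

```python
def edit_counts(word, letters='abcdefghijklmnopqrstuvwxyz'):
    """Count candidates per edit type (hypothetical helper for illustration)."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in letters]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    everything = deletes + inserts + replaces + transposes
    return {
        'deletes': len(deletes),
        'inserts': len(inserts),
        'replaces': len(replaces),
        'transposes': len(transposes),
        'raw_total': len(everything),          # 54 * len(word) + 25
        'unique_total': len(set(everything)),  # what edits_one returns
        'contains_original': word in everything,
    }

counts = edit_counts("monney")
# raw_total is 349 and unique_total is 336 for "monney"
```

The set also contains the original word itself, because replacing a letter with the same letter (or swapping the two adjacent `n`s) reproduces `monney`.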

Create all edits that are two edits away from `word`.

In [9]:
def edits_two(word):
    return (e2 for e1 in edits_one(word) for e2 in edits_one(e1))

In [10]:
print(len(set(edits_two("monney"))))
print(edits_two("monney"))

51013
<generator object edits_two.<locals>.<genexpr> at 0x7fb448383c50>
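
Because `edits_two` returns a generator, printing it shows only the generator object, and the candidates can be consumed just once. The sketch below illustrates this on the one-letter word `"a"` (chosen only to keep the run short); `edits_one` is repeated in compact form so the block runs on its own.

```python
def edits_one(word, letters='abcdefghijklmnopqrstuvwxyz'):
    # Compact restatement of the one-edit function defined above.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in letters]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    return set(deletes + inserts + replaces + transposes)

def edits_two(word):
    # Lazily yields every string reachable by two successive edits.
    return (e2 for e1 in edits_one(word) for e2 in edits_one(e1))

candidates = edits_two("a")
first_pass = set(candidates)    # materialises the candidates
second_pass = set(candidates)   # empty: the generator is already exhausted

reachable = "abc" in first_pass  # "a" -> "ab" -> "abc"
```

Wrapping the call in `set(...)`, as the length check above does, is one way to inspect or reuse the candidates.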


Return the subset of `words` that appears in `all_words`.

In [11]:
def known(words):
    return set(word for word in words if word in all_words)

In [12]:
print(known(edits_one("monney")))
print(known(edits_two("monney")))

{'money', 'monkey'}
{'donkey', 'motley', 'money', 'donned', 'manned', 'bonnet', 'moaned', 'monkeys', 'honey', 'olney', 'convey', 'bonne', 'monday', 'donne', 'manner', 'bonny', 'tonne', 'moines', 'monger', 'morley', 'monkey'}


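Generate possible spelling corrections for `word`, preferring the word itself if it is known, then known words one edit away, then known words two edits away.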

In [13]:
def possible_corrections(word):
    return (known([word]) or known(edits_one(word)) or known(edits_two(word)) or [word])

In [14]:
print(possible_corrections("monney"))

{'money', 'monkey'}
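
The priority order comes from how Python's `or` works: an empty set is falsy, so the chain returns the first non-empty operand and never evaluates the rest. A small sketch with hand-written sets follows (the values are illustrative stand-ins for the results of `known(...)`):

```python
exact = set()                      # the word itself is not known
one_edit = {'money', 'monkey'}     # known words one edit away
two_edits = {'honey', 'donkey'}    # known words two edits away

# The first non-empty operand wins; two_edits is never consulted.
chosen = exact or one_edit or two_edits or ['monney']

# When every set is empty, the chain falls through to the original word.
fallback = set() or set() or set() or ['monney']
```

Note that the fallback is a list while the other results are sets, so callers should only rely on the result being iterable.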


Estimate the probability of `word`: the number of times `word` appears divided by the total number of tokens.

In [15]:
# N is a default argument, so the total is computed once, when prob is defined.
def prob(word, N=sum(all_words.values())):
    return all_words[word] / N

In [16]:
print(prob("money"))
print(prob("monkey"))

0.0002922233626303688
5.378344097491451e-06
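
One detail of this definition is that the default value of `N` is evaluated once, when the function is defined, not on every call. That keeps `prob` fast, but the total goes stale if the counter is updated afterwards. The sketch below shows the effect on a toy counter; `counts` and the sample sentence are illustrative only.

```python
from collections import Counter

counts = Counter("the cat saw the dog".split())   # 5 tokens, "the" twice

def prob(word, N=sum(counts.values())):
    # N was fixed at 5 when this function was defined.
    return counts[word] / N

before = prob("the")          # 2 / 5

counts.update(["the"] * 5)    # now 10 tokens, "the" appears 7 times
after = prob("the")           # 7 / 5, because N is still 5

# Passing the total explicitly gives the up-to-date estimate.
fresh = prob("the", N=sum(counts.values()))   # 7 / 10
```

In the notebook the corpus is read once and never modified, so the cached total is safe there.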


Return a message suggesting the most probable spelling correction for `word` out of all the `possible_corrections`.

In [17]:
def spell_check(word):
    correct_word = max(possible_corrections(word), key=prob)
    if correct_word != word:
        return "Did you mean " + correct_word + "?"
    else:
        return "Correct spelling."

In [18]:
print(spell_check("monney"))

Did you mean money?
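
One edge case: when a word is unknown and no candidate is found, `possible_corrections` falls back to `[word]`, so `spell_check` reports "Correct spelling." even though the word never appears in the corpus. The following is a minimal sketch of a variant that separates the two cases. It assumes a toy vocabulary (`toy_counts`), looks only one edit away to stay short, and uses a hypothetical name, `spell_check_strict`, that is not part of the notebook.

```python
from collections import Counter

# Toy vocabulary with made-up counts, standing in for all_words.
toy_counts = Counter({"money": 5, "monkey": 1})

def edits_one(word, letters='abcdefghijklmnopqrstuvwxyz'):
    # Compact restatement of the one-edit function defined above.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    inserts = [l + c + r for l, r in splits for c in letters]
    replaces = [l + c + r[1:] for l, r in splits if r for c in letters]
    transposes = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    return set(deletes + inserts + replaces + transposes)

def spell_check_strict(word):
    # Hypothetical variant: only known words count as correctly spelled.
    if word in toy_counts:
        return "Correct spelling."
    candidates = {w for w in edits_one(word) if w in toy_counts}
    if not candidates:
        return "No suggestion found."
    # Pick the candidate with the highest count in the toy vocabulary.
    best = max(candidates, key=toy_counts.get)
    return "Did you mean " + best + "?"

suggestion = spell_check_strict("monney")
known_word = spell_check_strict("money")
no_match = spell_check_strict("qqqqqq")
```

Whether an unknown word with no candidates should count as correct is a design choice; the original treats it as correct, which avoids flagging rare but valid words.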
