# 1. NLTK Using the Jaccard Distance Method

Jaccard distance is a measure of how dissimilar two sets are. It equals 1 minus the Jaccard coefficient, or equivalently the difference between the sizes of the union and the intersection of the two sets, divided by the size of the union. Here the sets are built from Q-grams, which are N-grams whose units are characters rather than tokens.

Jaccard Distance is given by the following formula: 
*Dj(A,B)= 1-J(A,B)= (|A ∪ B|-|A ∩ B|) / |A ∪ B|*
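
To make the formula concrete, here is a minimal sketch that computes it by hand on character bigrams, without NLTK. The helper names `char_bigrams` and `jaccard_dist` are illustrative and not part of any library, and returning 0.0 for two empty sets is an assumption of this sketch.

```python
def char_bigrams(word):
    """Return the set of adjacent character pairs (2-grams) in `word`."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard_dist(a, b):
    """Jaccard distance between two sets: (|A ∪ B| - |A ∩ B|) / |A ∪ B|."""
    union = a | b
    if not union:
        # Assumption of this sketch: two empty sets are treated as identical.
        return 0.0
    return (len(union) - len(a & b)) / len(union)


# 'milion' has 5 distinct bigrams, 'million' has 6, and they share 5.
print(jaccard_dist(char_bigrams("milion"), char_bigrams("million")))  # 0.16666666666666666

# 'plaaay' has 4 distinct bigrams ('aa' counts once), 'play' has 3, all shared.
print(jaccard_dist(char_bigrams("plaaay"), char_bigrams("play")))  # 0.25
```

Because the inputs are sets, repeated bigrams such as the two 'aa' pairs in 'plaaay' are counted only once.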

In [2]:
import nltk
nltk.download('words')
from nltk.corpus import words

[nltk_data] Downloading package words to /root/nltk_data...
[nltk_data]   Unzipping corpora/words.zip.


In [3]:
from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams

In [21]:
correct_words = words.words()
incorrect_words=['plaaay', 'Exapmle', 'azmaing', 'milion']

In [30]:
for word in incorrect_words:
    distances = [(jaccard_distance(set(ngrams(word, 2)), set(ngrams(w, 2))), w)
            for w in correct_words if w[0] == word[0]]
    print(sorted(distances, key=lambda val: val[0])[0])

(0.25, 'play')
(0.7777777777777778, 'Exaudi')
(0.5, 'amazing')
(0.16666666666666666, 'million')


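The method didn't get 'Example' right! The filter `w[0] == word[0]` is case-sensitive, so 'Exapmle' is only compared against words that start with a capital 'E', and the lowercase entry 'example' is never considered.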
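
To check how letter case affects the match, a case-insensitive variant of the loop above can be sketched as follows. The small `vocabulary` list is a hypothetical stand-in for `words.words()`, and `closest_word` is an illustrative helper, so results on the full corpus may differ.

```python
def bigrams(word):
    """Set of adjacent character pairs, mirroring set(ngrams(word, 2))."""
    return {word[i:i + 2] for i in range(len(word) - 1)}


def jaccard(a, b):
    """Jaccard distance; assumes at least one of the sets is non-empty."""
    union = a | b
    return (len(union) - len(a & b)) / len(union)


def closest_word(word, vocabulary):
    """Return the candidate with the smallest bigram Jaccard distance.

    Both the first-letter filter and the bigram comparison ignore case.
    """
    target = word.lower()
    candidates = [w for w in vocabulary if w and w[0].lower() == target[0]]
    return min(candidates,
               key=lambda w: jaccard(bigrams(target), bigrams(w.lower())))


# Hypothetical toy vocabulary, not the NLTK word list.
vocabulary = ["Exaudi", "exam", "example", "play"]
print(closest_word("Exapmle", vocabulary))  # example
```

The sketch returns the vocabulary entry as stored, so restoring the original capitalization would be a separate step.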

# 2. NLTK Using the Edit Distance Method


Edit distance measures the dissimilarity between two strings as the minimum number of single-character operations (insertions, deletions, or substitutions) needed to transform one string into the other.

Example:

Inserting a new character: `bat -> bats (insertion of 's')`
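
To show how such a minimum is found, here is a minimal sketch of the classic dynamic-programming (Levenshtein) computation with unit costs for insertion, deletion, and substitution. The function name `levenshtein` is illustrative; this is not NLTK's implementation, only a plain version of the same idea.

```python
def levenshtein(source, target):
    """Minimum number of insertions, deletions, and substitutions
    needed to turn `source` into `target` (each operation costs 1)."""
    # prev[j] holds the distance between the processed prefix of
    # `source` and the first j characters of `target`.
    prev = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        curr = [i]  # turning i characters into an empty string takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = prev[j - 1] + (0 if s_char == t_char else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


print(levenshtein("bat", "bats"))        # 1 (insert 's')
print(levenshtein("milion", "million"))  # 1 (insert 'l')
print(levenshtein("plaaay", "play"))     # 2 (delete two 'a's)
```

Only two rows of the table are kept at a time, so memory use grows with the length of `target` rather than with the product of both lengths.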

In [23]:
from nltk.metrics.distance import edit_distance

In [34]:
for word in incorrect_words:
    distance = [(edit_distance(word, w),w) for w in correct_words if w[0]==word[0]]
    print(sorted(distance, key=lambda val: val[0])[0])

(2, 'pacay')
(3, 'Earle')
(2, 'aiming')
(1, 'million')


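OMG, only 'million' is correct! Some of the misses are ties: 'play' is also 2 edits away from 'plaaay' (delete two 'a's), but `sorted` is stable and keeps whichever tied candidate appears first in the word list, here 'pacay'.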

# 3. spaCy & Contextual Spell Check

In [None]:
!pip install contextualSpellCheck

In [2]:
import spacy
import contextualSpellCheck



In [4]:
nlp = spacy.load('en_core_web_sm')
contextualSpellCheck.add_to_pipe(nlp)


In [5]:
doc = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
doc._.performed_spellCheck

True

In [6]:
doc._.outcome_spellCheck

'Income was $9.4 million compared to the prior year of $2.7 million.'

# 4. TextBlob

In [14]:
from textblob import TextBlob

In [16]:
txt="machne learnig"
b = TextBlob(txt)
print("after spell correction: "+str(b.correct()))

after spell correction: machine learning


In [18]:
for word in incorrect_words:
    tb = TextBlob(word)
    print(tb.correct())

play
Example
amazing
million


# 5. autocorrect 

In [None]:
!pip install autocorrect

In [9]:
from autocorrect import Speller

spell = Speller(lang='en')

In [12]:
for word in incorrect_words:
    print(spell(word))

play
Example
amazing
million
