# Auto Correct

[**1. Data Preprocessing**](#1.-Data-Preprocessing)

[**2. String Manipulations**](#2.-String-Manipulations)

[**3. Combining the Edits**](#3.-Combining-the-Edits)  

## 1. Data Preprocessing

### Importing packages

In [1]:
import re
from collections import Counter
import numpy as np
import pandas as pd

### Defining process_data function

**Inputs** :  
- *file_name*: path to the text file (the corpus) to read

**Outputs** :  
- *words*: a list containing all the words in the corpus (text file we read) in lower case

In [2]:
def process_data(file_name):
    
    # Read the whole corpus into memory
    with open(file_name, "r", encoding="utf8") as f:
        text = f.read()
    
    # Lower-case the text, then extract runs of word characters
    text_lower = text.lower()
    words = re.findall(r'\w+', text_lower)
    
    return words
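
Note that the pattern `\w+` splits on every non-word character, so contractions and hyphenated words are broken into pieces, while digits are kept. A small illustration on a made-up sentence (the sample text is an assumption, not taken from the corpus):

```python
import re

# Hypothetical sample sentence, used only to illustrate the tokenization.
sample = "Don't stop: it's 1880, and Alyosha's well-known."

# Same pattern as process_data: runs of letters, digits, and underscores.
tokens = re.findall(r'\w+', sample.lower())
print(tokens)
# ['don', 't', 'stop', 'it', 's', '1880', 'and', 'alyosha', 's', 'well', 'known']
```

Fragments such as 't' and 's' may therefore end up in the vocabulary as words in their own right.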

**Reading the file and building a vocabulary set using the words list**

In [3]:
word_l = process_data('./src/Karamazov.txt')
vocab = set(word_l)
print(f"The first ten words in the text are: \n{word_l[0:10]}")
print(f"There are {len(vocab)} unique words in the vocabulary.")

The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'brothers', 'karamazov', 'by', 'fyodor']
There are 12461 unique words in the vocabulary.


### Defining get_count function

**Inputs** :  
- *word_l*: a list of words representing the corpus (repetitions included, since they determine the counts)

**Outputs** :  
- *word_count_dict*: the wordcount dictionary where key is the word and value is its frequency

In [4]:
def get_count(word_l):

    word_count_dict = {}

    for w in word_l:
        word_count_dict[w] = word_count_dict.get(w,0)+1    

    return word_count_dict
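
Since `Counter` is already imported, the same counts could also be obtained from it directly. A minimal sketch on a toy word list (the list is made up for illustration) comparing the two approaches:

```python
from collections import Counter

# Toy word list, assumed for illustration only.
toy_words = ['the', 'love', 'the', 'of', 'love', 'the']

# Manual counting, following the same dict.get pattern as get_count.
manual = {}
for w in toy_words:
    manual[w] = manual.get(w, 0) + 1

# Counter builds the same mapping in one call.
counted = Counter(toy_words)

print(manual)                    # {'the': 3, 'love': 2, 'of': 1}
print(dict(counted) == manual)   # True
```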

**Building the words count dictionary**

In [5]:
word_count_dict = get_count(word_l)
print(f"There are {len(word_count_dict)} key-value pairs")
print(f"The count for the word 'love' is {word_count_dict.get('love',0)}")

There are 12461 key-value pairs
The count for the word 'love' is 467


### Defining get_probs function

**Inputs** :  
- *word_count_dict*: the wordcount dictionary where key is the word and value is its frequency

**Outputs** :  
- *probs*: a dictionary where each key is a word and each value is the probability of that word occurring in the corpus (its count divided by the total number of words)

In [6]:
def get_probs(word_count_dict):

    probs = {}
    
    # Total number of words in the corpus, computed once outside the loop
    total = sum(word_count_dict.values())
    
    for w, count in word_count_dict.items():
        probs[w] = count / total

    return probs
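
Each probability is the word's count divided by the total number of words, P(w) = C(w) / M, so the values should sum to 1. A quick check on made-up counts (the numbers are assumptions for illustration):

```python
import math

# Made-up counts, for illustration only.
toy_counts = {'the': 3, 'love': 2, 'of': 1}

total = sum(toy_counts.values())   # M: total number of words
toy_probs = {w: c / total for w, c in toy_counts.items()}

print(toy_probs['the'])                             # 0.5
print(math.isclose(sum(toy_probs.values()), 1.0))   # True
```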

**Building the words probability dictionary**

In [7]:
probs = get_probs(word_count_dict)
print(f"Length of probs is {len(probs)}")
print(f"P('love') is {probs['love']:.4f}")

Length of probs is 12461
P('love') is 0.0013


## 2. String Manipulations

### Defining delete_letter function

**Inputs** :  
- *word*: input string
- *verbose*: if true, prints the result

**Outputs** :  
- *delete_l*: a list of all possible strings obtained by deleting 1 character from word

In [8]:
def delete_letter(word, verbose=False):

    delete_l = []
    split_l = []

    for i in range(len(word)):
        split_l.append([word[:i],word[i:]])
    
    for L,R in split_l:
        delete_l.append(L + R[1:])
    
    if verbose: print(f"input word = {word}, \nsplit_l = {split_l}, \ndelete_l = {delete_l}")

    return delete_l

**Testing the function**

In [9]:
delete_word_l = delete_letter(word="love", verbose=True)

input word = love, 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've'], ['lov', 'e']], 
delete_l = ['ove', 'lve', 'loe', 'lov']


### Defining switch_letter function

**Inputs** :  
- *word*: input string
- *verbose*: if true, prints the result

**Outputs** :  
- *switch_l*: a list of all possible strings obtained by switching two adjacent characters

In [10]:
def switch_letter(word, verbose=False):
    
    switch_l = []
    split_l = []
    
    for i in range(len(word)-1):
        split_l.append([word[:i],word[i:]])
        
    for L,R in split_l:
        switch_l.append(L + R[1] + R[0] + R[2:])
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nswitch_l = {switch_l}") 

    return switch_l

**Testing the function**

In [11]:
switch_word_l = switch_letter(word="love", verbose=True)

Input word = love 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've']] 
switch_l = ['olve', 'lvoe', 'loev']
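
Two edge cases are worth keeping in mind: words shorter than two characters produce no switches, and swapping two identical adjacent letters gives back the original word. A compact sketch (the helper name `switch_sketch` is hypothetical; it restates the same logic as a comprehension):

```python
def switch_sketch(word):
    # Swap the characters at positions i and i + 1 for every valid i.
    return [word[:i] + word[i + 1] + word[i] + word[i + 2:]
            for i in range(len(word) - 1)]

print(switch_sketch('love'))  # ['olve', 'lvoe', 'loev']
print(switch_sketch('a'))     # []
print(switch_sketch('too'))   # ['oto', 'too']
```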


### Defining replace_letter function

**Inputs** :  
- *word*: input string
- *verbose*: if true, prints the result

**Outputs** :  
- *replace_l*: a sorted list of all possible strings obtained by replacing one letter of the original word with a different letter

In [12]:
def replace_letter(word, verbose=False):
    
    letters = 'abcdefghijklmnopqrstuvwxyz'
    replace_l = []
    split_l = []
    
    for i in range(len(word)):
        split_l.append([word[:i],word[i:]])
        
    for L,R in split_l:
        for letter in letters:
            replace_l.append(L + letter + R[1:])
    
    replace_set = set(replace_l)
    replace_set.discard(word)
    
    replace_l = sorted(list(replace_set))
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nreplace_l {replace_l}")   
    
    return replace_l

**Testing the function**

In [13]:
replace_l = replace_letter(word='love', verbose=True)

Input word = love 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've'], ['lov', 'e']] 
replace_l ['aove', 'bove', 'cove', 'dove', 'eove', 'fove', 'gove', 'hove', 'iove', 'jove', 'kove', 'lave', 'lbve', 'lcve', 'ldve', 'leve', 'lfve', 'lgve', 'lhve', 'live', 'ljve', 'lkve', 'llve', 'lmve', 'lnve', 'loae', 'lobe', 'loce', 'lode', 'loee', 'lofe', 'loge', 'lohe', 'loie', 'loje', 'loke', 'lole', 'lome', 'lone', 'looe', 'lope', 'loqe', 'lore', 'lose', 'lote', 'loue', 'lova', 'lovb', 'lovc', 'lovd', 'lovf', 'lovg', 'lovh', 'lovi', 'lovj', 'lovk', 'lovl', 'lovm', 'lovn', 'lovo', 'lovp', 'lovq', 'lovr', 'lovs', 'lovt', 'lovu', 'lovv', 'lovw', 'lovx', 'lovy', 'lovz', 'lowe', 'loxe', 'loye', 'loze', 'lpve', 'lqve', 'lrve', 'lsve', 'ltve', 'luve', 'lvve', 'lwve', 'lxve', 'lyve', 'lzve', 'move', 'nove', 'oove', 'pove', 'qove', 'rove', 'sove', 'tove', 'uove', 'vove', 'wove', 'xove', 'yove', 'zove']


### Defining insert_letter function

**Inputs** :  
- *word*: input string
- *verbose*: if true, prints the result

**Outputs** :  
- *insert_l*: a sorted list of all possible strings obtained by inserting one new letter at any offset

In [14]:
def insert_letter(word, verbose=False):

    letters = 'abcdefghijklmnopqrstuvwxyz'
    insert_l = []
    split_l = []
    
    for i in range(len(word)+1):
        split_l.append([word[:i],word[i:]])
        
    for L,R in split_l:
        for letter in letters:
            insert_l.append(L + letter + R)
            
    insert_set = set(insert_l)
    
    insert_l = sorted(list(insert_set))

    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \ninsert_l = {insert_l}")
    
    return insert_l

**Testing the function**

In [15]:
insert_l = insert_letter('love', True)
print(f"Number of strings output by insert_letter('love') is {len(insert_l)}")

Input word = love 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've'], ['lov', 'e'], ['love', '']] 
insert_l = ['alove', 'blove', 'clove', 'dlove', 'elove', 'flove', 'glove', 'hlove', 'ilove', 'jlove', 'klove', 'laove', 'lbove', 'lcove', 'ldove', 'leove', 'lfove', 'lgove', 'lhove', 'liove', 'ljove', 'lkove', 'llove', 'lmove', 'lnove', 'loave', 'lobve', 'locve', 'lodve', 'loeve', 'lofve', 'logve', 'lohve', 'loive', 'lojve', 'lokve', 'lolve', 'lomve', 'lonve', 'loove', 'lopve', 'loqve', 'lorve', 'losve', 'lotve', 'louve', 'lovae', 'lovbe', 'lovce', 'lovde', 'lovea', 'loveb', 'lovec', 'loved', 'lovee', 'lovef', 'loveg', 'loveh', 'lovei', 'lovej', 'lovek', 'lovel', 'lovem', 'loven', 'loveo', 'lovep', 'loveq', 'lover', 'loves', 'lovet', 'loveu', 'lovev', 'lovew', 'lovex', 'lovey', 'lovez', 'lovfe', 'lovge', 'lovhe', 'lovie', 'lovje', 'lovke', 'lovle', 'lovme', 'lovne', 'lovoe', 'lovpe', 'lovqe', 'lovre', 'lovse', 'lovte', 'lovue', 'lovve', 'lovwe', 'lovxe', 'lovye', 'lovze', 'lowve', 'loxv
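
As a rough sanity check on the four edit types, the number of candidates can be counted for 'love'. The sketch below re-expresses each edit as a comprehension over the same (left, right) splits; the variable names are assumptions for illustration, and the counts hold for this specific word:

```python
letters = 'abcdefghijklmnopqrstuvwxyz'
word = 'love'

# All (left, right) splits, including the one with an empty right part.
splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]

deletes = [L + R[1:] for L, R in splits if R]
switches = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
# Replacing a letter with itself reproduces the word, so it is removed.
replaces = {L + c + R[1:] for L, R in splits if R for c in letters} - {word}
# Inserting a letter next to an identical one yields the same string twice.
inserts = {L + c + R for L, R in splits for c in letters}

print(len(deletes), len(switches), len(replaces), len(inserts))
# 4 3 100 126
```

The insert count is 126 rather than 26 × 5 = 130 because four insertions ('llove', 'loove', 'lovve', 'lovee') can each be produced at two adjacent offsets.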

## 3. Combining the Edits

### Defining edit_one_letter function

**Inputs** :  
- *word*: the string/word for which we will generate all possible words that are one edit away
- *allow_switches*: if true, will add switch letter edits to the possible edits set

**Outputs** :  
- *edit_one_set*: a set of all strings that are one edit away from the input word

In [16]:
def edit_one_letter(word, allow_switches = True):

    edit_one_set = set()
    
    delete_l = delete_letter(word)
    replace_l = replace_letter(word)
    insert_l = insert_letter(word)
    
    if allow_switches:
        switch_l = switch_letter(word)
        complete = delete_l + replace_l + insert_l + switch_l
    
    else:
        complete = delete_l + replace_l + insert_l
    
    edit_one_set = set(complete)

    return edit_one_set

**Testing the function**

In [17]:
edit_one_set = edit_one_letter('love')
print(f"input word = love \nedit_one_set \n{sorted(list(edit_one_set))}\n")
print(f"Number of outputs from edit_one_letter('love') is {len(edit_one_set)}")

input word = love 
edit_one_set 
['alove', 'aove', 'blove', 'bove', 'clove', 'cove', 'dlove', 'dove', 'elove', 'eove', 'flove', 'fove', 'glove', 'gove', 'hlove', 'hove', 'ilove', 'iove', 'jlove', 'jove', 'klove', 'kove', 'laove', 'lave', 'lbove', 'lbve', 'lcove', 'lcve', 'ldove', 'ldve', 'leove', 'leve', 'lfove', 'lfve', 'lgove', 'lgve', 'lhove', 'lhve', 'liove', 'live', 'ljove', 'ljve', 'lkove', 'lkve', 'llove', 'llve', 'lmove', 'lmve', 'lnove', 'lnve', 'loae', 'loave', 'lobe', 'lobve', 'loce', 'locve', 'lode', 'lodve', 'loe', 'loee', 'loev', 'loeve', 'lofe', 'lofve', 'loge', 'logve', 'lohe', 'lohve', 'loie', 'loive', 'loje', 'lojve', 'loke', 'lokve', 'lole', 'lolve', 'lome', 'lomve', 'lone', 'lonve', 'looe', 'loove', 'lope', 'lopve', 'loqe', 'loqve', 'lore', 'lorve', 'lose', 'losve', 'lote', 'lotve', 'loue', 'louve', 'lov', 'lova', 'lovae', 'lovb', 'lovbe', 'lovc', 'lovce', 'lovd', 'lovde', 'lovea', 'loveb', 'lovec', 'loved', 'lovee', 'lovef', 'loveg', 'loveh', 'lovei', 'lovej', 'lov

### Defining edit_two_letters function

**Inputs** :  
- *word*: the string/word for which we will generate all possible words that are two edits away
- *allow_switches*: if true, will add switch letter edits to the possible edits set

**Outputs** :  
- *edit_two_set*: a set of strings with all possible two edits

In [18]:
def edit_two_letters(word, allow_switches = True):
    
    edit_two_set = set()
    
    edit_one_set = edit_one_letter(word, allow_switches)
    
    for w in edit_one_set:
        tmp_set = edit_one_letter(w, allow_switches)
        edit_two_set = edit_two_set.union(tmp_set)
    
    return edit_two_set

**Testing the function**

In [19]:
edit_two_set = edit_two_letters('love')
print(f"Input word = love \nNumber of strings with edit distance of two: {len(edit_two_set)}")
print(f"First 10 strings {sorted(list(edit_two_set))[:10]}")
print(f"Last 10 strings {sorted(list(edit_two_set))[-10:]}")

Input word = love 
Number of strings with edit distance of two: 24254
First 10 strings ['aalove', 'aaove', 'aave', 'ablove', 'above', 'abve', 'aclove', 'acove', 'acve', 'adlove']
Last 10 strings ['zwve', 'zxlove', 'zxove', 'zxve', 'zylove', 'zyove', 'zyve', 'zzlove', 'zzove', 'zzve']


### Defining get_suggestions function

The suggestion algorithm follows this logic:

- If the word is in the vocabulary, it suggests the word.
- Otherwise, if there are suggestions from edit_one_letter that are in the vocabulary, it uses those.
- Otherwise, if there are suggestions from edit_two_letters that are in the vocabulary, it uses those.
- Otherwise, it suggests the input word.

The idea is that candidates reached with fewer edits are more likely to be the intended word than candidates that need more edits. Because a second edit can undo or overlap the first, the two-edit set also contains strings that are zero or one edit away from the input. The algorithm accounts for this by checking the lower-distance candidates first.
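
To make this concrete, the sketch below uses compact, hypothetical `edits1` and `edits2` helpers (a set-based restatement of the edit logic, not the functions defined in this notebook) and checks that the two-edit set of 'love' contains both the word itself and every one-edit string:

```python
letters = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # Hypothetical compact helper: all strings one edit away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {L + R[1:] for L, R in splits if R}
    switches = {L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1}
    replaces = {L + c + R[1:] for L, R in splits if R for c in letters} - {word}
    inserts = {L + c + R for L, R in splits for c in letters}
    return deletes | switches | replaces | inserts

def edits2(word):
    # Apply one more edit to every one-edit string.
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}

one = edits1('love')
two = edits2('love')

print('love' in two)   # True: e.g. insert 'a' to get 'alove', then delete it
print(one <= two)      # True: every one-edit string is also reachable in two
```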

**Inputs** :  
- *word*: a user entered string to check for suggestions
- *probs*: a dictionary that maps each word to its probability in the corpus
- *vocab*: a set containing all the vocabulary
- *n*: maximum number of suggested corrections to return
- *verbose*: if true, prints the result

**Outputs** :  
- *n_best*: a list of tuples with the most probable n corrected words and their probabilities

In [20]:
def get_suggestions(word, probs, vocab, n=2, verbose = False):
    
    suggestions = []
    n_best = []
    
    list1 = []
    if word in vocab:
        list1.append(word)
    
    list2 = []
    for w in edit_one_letter(word):
        if w in vocab:
            list2.append(w)
    
    list3 = []
    for w in edit_two_letters(word):
        if w in vocab:
            list3.append(w)
    
    # Fall back to the input word, wrapped in a list so it is not iterated character by character
    suggestions = list1 or list2 or list3 or [word]
    
    best_words = {}
    for w in suggestions:
        best_words[w] = probs.get(w,0)
    
    best_words = Counter(best_words)

    n_best = best_words.most_common(n)
    
    if verbose: print("Entered word = ", word, "\nSuggestions = ", suggestions,"\n")

    return n_best
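
The selection step relies on two Python behaviours: an empty list is falsy, so `a or b or c` evaluates to the first non-empty candidate list, and `Counter.most_common(n)` returns the n highest-valued entries. A minimal sketch with made-up candidate lists and probabilities (all values here are assumptions for illustration):

```python
from collections import Counter

# Made-up inputs, for illustration only.
word = 'lvoe'
in_vocab = []                  # the word itself is not in the vocabulary
one_edit = ['love', 'live']    # hypothetical one-edit matches
two_edit = ['lose', 'move']    # hypothetical two-edit matches
toy_probs = {'love': 0.004, 'live': 0.001, 'lose': 0.002, 'move': 0.003}

# First non-empty list wins; the final fallback is wrapped in a list so that
# it is iterated as a single word rather than character by character.
suggestions = in_vocab or one_edit or two_edit or [word]

ranked = Counter({w: toy_probs.get(w, 0) for w in suggestions}).most_common(1)
print(suggestions)  # ['love', 'live']
print(ranked)       # [('love', 0.004)]
```

Note that the two-edit matches are ignored here even though 'move' has a higher probability than 'live', because the one-edit list is non-empty.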

**Testing the function**

In [21]:
my_word = 'lovste' 
corrections = get_suggestions(my_word, probs, vocab, 3, verbose=True)
for i, word_prob in enumerate(corrections):
    print(f"word {i}: {word_prob[0]}, probability: {word_prob[1]:.6f}")

Entered word =  lovste 
Suggestions =  ['lost', 'lose', 'love', 'loose', 'loves'] 

word 0: love, probability: 0.001285
word 1: lost, probability: 0.000237
word 2: loves, probability: 0.000102
