# Auto Correct

- Getting a word count given a corpus
- Getting a word probability in the corpus
- Manipulating strings
- Filtering strings
- Implementing minimum edit distance to compare strings and to find the optimal sequence of edits

[**1. Data Preprocessing**](#1.-Data-Preprocessing)  
[**2. String Manipulations**](#2.-String-Manipulations)

## 1. Data Preprocessing

### Importing packages

In [1]:
import re
from collections import Counter
import numpy as np
import pandas as pd

### Defining process_data function

**Inputs** :  
- *file_name*: path to the text file to read, relative to the current directory

**Outputs** :  
- *words*: a list containing all the words in the corpus (text file we read) in lower case

In [12]:
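def process_data(file_name):

    # Read the whole corpus into memory as a single string.
    with open(file_name, "r", encoding="utf8") as f:
        text = f.read()

    # Lowercase first so that 'The' and 'the' count as the same word.
    text_lower = text.lower()
    # \w+ matches runs of letters, digits, and underscores.
    words = re.findall(r'\w+', text_lower)

    return words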

**Reading the file and building a vocabulary set using the words list**

In [13]:
word_l = process_data('./src/Karamazov.txt')
vocab = set(word_l)
print(f"The first ten words in the text are: \n{word_l[0:10]}")
print(f"There are {len(vocab)} unique words in the vocabulary.")

The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'brothers', 'karamazov', 'by', 'fyodor']
There are 12461 unique words in the vocabulary.
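
To make the tokenization rule concrete, here is a minimal sketch of what `re.findall(r'\w+', ...)` does on a short string. The sample sentence is made up for illustration and is not taken from the corpus. Note that `\w+` also keeps digits and splits contractions at the apostrophe.

```python
import re

# Hypothetical sample sentence, used only to illustrate the regex behaviour.
sample = "The Brothers Karamazov, by Fyodor Dostoyevsky -- it's 1880!"

# Same two steps as process_data: lowercase, then extract runs of word characters.
tokens = re.findall(r"\w+", sample.lower())
print(tokens)
# ['the', 'brothers', 'karamazov', 'by', 'fyodor', 'dostoyevsky', 'it', 's', '1880']
```

If contractions or numbers should be handled differently, the pattern would need to change; the notebook keeps the simple `\w+` rule.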


### Defining get_count function

**Inputs** :  
- *word_l*: a list of words representing the corpus (duplicates included, so that frequencies can be counted)

**Outputs** :  
- *word_count_dict*: the wordcount dictionary where key is the word and value is its frequency

In [24]:
def get_count(word_l):

    word_count_dict = {}

    for w in word_l:
        word_count_dict[w] = word_count_dict.get(w,0)+1    

    return word_count_dict

**Building the words count dictionary**

In [22]:
word_count_dict = get_count(word_l)
print(f"There are {len(word_count_dict)} key-value pairs")
print(f"The count for the word 'love' is {word_count_dict.get('love',0)}")

There are 12461 key-value pairs
The count for the word 'love' is 467
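
As a cross-check, the same counts can be obtained with `collections.Counter`, which the notebook already imports. The sketch below uses a small hypothetical word list rather than the real corpus and compares the manual loop against `Counter`.

```python
from collections import Counter

# Hypothetical toy corpus, standing in for word_l.
toy_words = ["love", "the", "love", "of", "the", "love"]

# Manual counting, mirroring the loop in get_count.
manual = {}
for w in toy_words:
    manual[w] = manual.get(w, 0) + 1

# Counter builds the same word -> frequency mapping in one call.
counted = Counter(toy_words)

assert dict(counted) == manual
print(counted.most_common(2))  # [('love', 3), ('the', 2)]
```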


### Defining get_probs function

**Inputs** :  
- *word_count_dict*: the wordcount dictionary where key is the word and value is its frequency

**Outputs** :  
- *probs*: a dictionary mapping each word to its probability of occurring, i.e. its count divided by the total number of words in the corpus

In [25]:
def get_probs(word_count_dict):

    probs = {}

    # Total number of words in the corpus, computed once instead of on every iteration.
    total = sum(word_count_dict.values())

    for w, count in word_count_dict.items():
        probs[w] = count / total

    return probs

**Building the words probability dictionary**

In [26]:
probs = get_probs(word_count_dict)
print(f"Length of probs is {len(probs)}")
print(f"P('love') is {probs['love']:.4f}")

Length of probs is 12461
P('love') is 0.0013
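
A quick sanity check on any such probability dictionary is that its values sum to 1. The sketch below shows this on hypothetical toy counts (not the corpus counts), using the same count-over-total rule.

```python
import math

# Hypothetical toy counts, standing in for word_count_dict.
toy_counts = {"love": 3, "the": 2, "of": 1}

# Probability of a word = its count divided by the total number of words.
total = sum(toy_counts.values())
toy_probs = {w: c / total for w, c in toy_counts.items()}

# The probabilities of all words should add up to 1 (up to rounding).
assert math.isclose(sum(toy_probs.values()), 1.0)
print(toy_probs["love"])  # 0.5
```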


## 2. String Manipulations

### Defining delete_letter function

**Inputs** :  
- *word*: input string

**Outputs** :  
- *delete_l*: a list of all possible strings obtained by deleting 1 character from word

In [37]:
def delete_letter(word, verbose=False):

    delete_l = []
    split_l = []

    for i in range(len(word)):
        split_l.append([word[:i],word[i:]])
    
    for L,R in split_l:
        delete_l.append(L + R[1:])
    
    if verbose: print(f"input word = {word}, \nsplit_l = {split_l}, \ndelete_l = {delete_l}")

    return delete_l

**Testing the function**

In [38]:
delete_word_l = delete_letter(word="love", verbose=True)

input word = love, 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've'], ['lov', 'e']], 
delete_l = ['ove', 'lve', 'loe', 'lov']
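
The same result can be written as a single list comprehension. This is only an alternative sketch (the name `delete_letter_sketch` is hypothetical), and it also shows that words with repeated letters can yield duplicate candidates.

```python
def delete_letter_sketch(word):
    # Drop the character at index i by joining the text before and after it.
    return [word[:i] + word[i + 1:] for i in range(len(word))]

# Matches the output shown above for "love".
assert delete_letter_sketch("love") == ["ove", "lve", "loe", "lov"]

# Deleting either 'o' from "too" gives the same string, so "to" appears twice.
print(delete_letter_sketch("too"))  # ['oo', 'to', 'to']
```

Whether duplicates matter depends on how the candidates are used later; converting the list to a set removes them.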


### Defining switch_letter function

**Inputs** :  
- *word*: input string

**Outputs** :  
- *switch_l*: a list of all possible strings obtained by swapping one pair of adjacent characters

In [69]:
def switch_letter(word, verbose=False):
    
    switch_l = []
    split_l = []
    
    for i in range(len(word)-1):
        split_l.append([word[:i],word[i:]])
        
    for L,R in split_l:
        switch_l.append(L + R[1] + R[0] + R[2:])
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nswitch_l = {switch_l}") 

    return switch_l

**Testing the function**

In [74]:
switch_word_l = switch_letter(word="love", verbose=True)

Input word = love 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've']] 
switch_l = ['olve', 'lvoe', 'loev']
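
One edge case worth noting: when two adjacent letters are identical, swapping them returns the original word. The compact sketch below (the name `switch_letter_sketch` is hypothetical) reproduces the output above and illustrates that case.

```python
def switch_letter_sketch(word):
    # Swap the characters at positions i and i + 1, keeping the rest unchanged.
    return [
        word[:i] + word[i + 1] + word[i] + word[i + 2:]
        for i in range(len(word) - 1)
    ]

# Matches the output shown above for "love".
assert switch_letter_sketch("love") == ["olve", "lvoe", "loev"]

# Swapping the two 'o's in "too" gives back "too" itself.
print(switch_letter_sketch("too"))  # ['oto', 'too']
```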


### Defining replace_letter function

**Inputs** :  
- *word*: input string

**Outputs** :  
- *replace_l*: a sorted list of all distinct strings obtained by replacing one letter of the word with a different lowercase letter

In [75]:
def replace_letter(word, verbose=False):
    
    letters = 'abcdefghijklmnopqrstuvwxyz'
    replace_l = []
    split_l = []
    
    for i in range(len(word)):
        split_l.append([word[:i],word[i:]])
        
    for L,R in split_l:
        for letter in letters:
            replace_l.append(L + letter + R[1:])
    
    replace_set = set(replace_l)
    replace_set.discard(word)
    
    replace_l = sorted(replace_set)
    
    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \nreplace_l {replace_l}")   
    
    return replace_l

**Testing the function**

In [77]:
replace_l = replace_letter(word='love', verbose=True)

Input word = love 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've'], ['lov', 'e']] 
replace_l ['aove', 'bove', 'cove', 'dove', 'eove', 'fove', 'gove', 'hove', 'iove', 'jove', 'kove', 'lave', 'lbve', 'lcve', 'ldve', 'leve', 'lfve', 'lgve', 'lhve', 'live', 'ljve', 'lkve', 'llve', 'lmve', 'lnve', 'loae', 'lobe', 'loce', 'lode', 'loee', 'lofe', 'loge', 'lohe', 'loie', 'loje', 'loke', 'lole', 'lome', 'lone', 'looe', 'lope', 'loqe', 'lore', 'lose', 'lote', 'loue', 'lova', 'lovb', 'lovc', 'lovd', 'lovf', 'lovg', 'lovh', 'lovi', 'lovj', 'lovk', 'lovl', 'lovm', 'lovn', 'lovo', 'lovp', 'lovq', 'lovr', 'lovs', 'lovt', 'lovu', 'lovv', 'lovw', 'lovx', 'lovy', 'lovz', 'lowe', 'loxe', 'loye', 'loze', 'lpve', 'lqve', 'lrve', 'lsve', 'ltve', 'luve', 'lvve', 'lwve', 'lxve', 'lyve', 'lzve', 'move', 'nove', 'oove', 'pove', 'qove', 'rove', 'sove', 'tove', 'uove', 'vove', 'wove', 'xove', 'yove', 'zove']
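
The size of this list can be predicted: each position has 25 letters that differ from the original, so a word of length n made of lowercase letters yields 25 × n distinct candidates. The sketch below (the name `replace_letter_sketch` is hypothetical) checks this for "love".

```python
import string

def replace_letter_sketch(word):
    letters = string.ascii_lowercase
    # Try every lowercase letter at every position; a set removes duplicates.
    candidates = {
        word[:i] + c + word[i + 1:]
        for i in range(len(word))
        for c in letters
    }
    # Replacing a letter with itself reproduces the word, so drop it.
    candidates.discard(word)
    return sorted(candidates)

result = replace_letter_sketch("love")
print(len(result))  # 100, i.e. 25 alternatives for each of the 4 positions
```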


### Defining insert_letter function

**Inputs** :  
- *word*: input string

**Outputs** :  
- *insert_l*: a sorted list of all distinct strings obtained by inserting one lowercase letter at any position in the word, including the start and the end

In [93]:
def insert_letter(word, verbose=False):

    letters = 'abcdefghijklmnopqrstuvwxyz'
    insert_l = []
    split_l = []
    
    for i in range(len(word)+1):
        split_l.append([word[:i],word[i:]])
        
    for L,R in split_l:
        for letter in letters:
            insert_l.append(L + letter + R)
            
    insert_set = set(insert_l)
    
    insert_l = sorted(insert_set)

    if verbose: print(f"Input word = {word} \nsplit_l = {split_l} \ninsert_l = {insert_l}")
    
    return insert_l

**Testing the function**

In [94]:
insert_l = insert_letter('love', True)
print(f"Number of strings output by insert_letter('love') is {len(insert_l)}")

Input word = love 
split_l = [['', 'love'], ['l', 'ove'], ['lo', 've'], ['lov', 'e'], ['love', '']] 
insert_l = ['alove', 'blove', 'clove', 'dlove', 'elove', 'flove', 'glove', 'hlove', 'ilove', 'jlove', 'klove', 'laove', 'lbove', 'lcove', 'ldove', 'leove', 'lfove', 'lgove', 'lhove', 'liove', 'ljove', 'lkove', 'llove', 'lmove', 'lnove', 'loave', 'lobve', 'locve', 'lodve', 'loeve', 'lofve', 'logve', 'lohve', 'loive', 'lojve', 'lokve', 'lolve', 'lomve', 'lonve', 'loove', 'lopve', 'loqve', 'lorve', 'losve', 'lotve', 'louve', 'lovae', 'lovbe', 'lovce', 'lovde', 'lovea', 'loveb', 'lovec', 'loved', 'lovee', 'lovef', 'loveg', 'loveh', 'lovei', 'lovej', 'lovek', 'lovel', 'lovem', 'loven', 'loveo', 'lovep', 'loveq', 'lover', 'loves', 'lovet', 'loveu', 'lovev', 'lovew', 'lovex', 'lovey', 'lovez', 'lovfe', 'lovge', 'lovhe', 'lovie', 'lovje', 'lovke', 'lovle', 'lovme', 'lovne', 'lovoe', 'lovpe', 'lovqe', 'lovre', 'lovse', 'lovte', 'lovue', 'lovve', 'lovwe', 'lovxe', 'lovye', 'lovze', 'lowve', 'loxv