# Auto Correct

This notebook covers:

- Getting a word count from a corpus
- Getting the probability of each word in the corpus
- Manipulating strings
- Filtering strings
- Implementing minimum edit distance to compare strings and to find the optimal path of edits

[**1. Data Preprocessing**](#1.-Data-Preprocessing)  
[**2. String Manipulations**](#2.-String-Manipulations)

## 1. Data Preprocessing

### Importing packages

In [1]:
import re
from collections import Counter
import numpy as np
import pandas as pd

### Defining process_data function

**Inputs** :  
- *file_name*: path to the text file to read, relative to the current directory.

**Outputs** :  
- *words*: a list containing all the words in the corpus (text file we read) in lower case.

In [12]:
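def process_data(file_name):

    # Read the whole corpus into memory
    with open(file_name, "r", encoding="utf8") as f:
        text = f.read()

    # Lowercase the text, then extract every run of word characters
    # (letters, digits and underscores) as a word
    text_lower = text.lower()
    words = re.findall(r'\w+', text_lower)

    return words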
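
To see what the `\w+` pattern actually keeps, here is a minimal sketch that applies the same lowercase-then-`findall` steps to an in-memory string instead of a file. The helper name `tokenize` and the sample sentence are illustrative assumptions, not part of the notebook.

```python
import re

def tokenize(text):
    """Lowercase the text and return its runs of word characters."""
    return re.findall(r"\w+", text.lower())

# Apostrophes and punctuation are not word characters, so they split tokens.
print(tokenize("Don't stop! It's 2 o'clock."))
# ['don', 't', 'stop', 'it', 's', '2', 'o', 'clock']
```

Contractions are split at the apostrophe and digits are kept, so the vocabulary built from the corpus may contain fragments such as `t` or `s`.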

**Reading the file and building a vocabulary set using the words list**

In [13]:
word_l = process_data('./src/Karamazov.txt')
vocab = set(word_l)
print(f"The first ten words in the text are: \n{word_l[0:10]}")
print(f"There are {len(vocab)} unique words in the vocabulary.")

The first ten words in the text are: 
['the', 'project', 'gutenberg', 'ebook', 'of', 'the', 'brothers', 'karamazov', 'by', 'fyodor']
There are 12461 unique words in the vocabulary.


### Defining get_count function

**Inputs** :  
- *word_l*: a list of words representing the corpus (duplicates included, so that occurrences can be counted).

**Outputs** :  
- *word_count_dict*: a dictionary mapping each word to its frequency in the corpus.

In [24]:
def get_count(word_l):

    word_count_dict = {}

    for w in word_l:
        word_count_dict[w] = word_count_dict.get(w,0)+1    

    return word_count_dict

**Building the word count dictionary**

In [22]:
word_count_dict = get_count(word_l)
print(f"There are {len(word_count_dict)} key-value pairs")
print(f"The count for the word 'love' is {word_count_dict.get('love',0)}")

There are 12461 key-value pairs
The count for the word 'love' is 467
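
As an alternative sketch, the same counts can probably be obtained with `collections.Counter`, which the notebook already imports. The toy word list below is made up for illustration.

```python
from collections import Counter

# Toy word list (illustrative, not taken from the book).
words = ["the", "cat", "and", "the", "hat"]

# Counter is a dict subclass that tallies each distinct item.
counts = Counter(words)

print(counts["the"])          # 2
print(counts["dog"])          # 0 (missing keys count as zero)
print(counts.most_common(1))  # [('the', 2)]
```

Because `Counter` is a `dict` subclass, it should work anywhere a plain word count dictionary is expected.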


### Defining get_probs function

**Inputs** :  
- *word_count_dict*: a dictionary mapping each word to its frequency in the corpus.

**Outputs** :  
- *probs*: a dictionary mapping each word to the probability that it occurs in the corpus.
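
Written out, the probability of a word is its count divided by the total number of words in the corpus. The symbols $C(w_i)$ (count of word $w_i$), $V$ (the vocabulary) and $M$ (total number of words) are notation introduced here for illustration:

$$P(w_i) = \frac{C(w_i)}{M}, \qquad M = \sum_{w \in V} C(w)$$

For example, a word that appears 2 times in a 10-word corpus has $P(w_i) = 2/10 = 0.2$.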

In [25]:
def get_probs(word_count_dict):

    probs = {}

    # Total number of words in the corpus, computed once outside the loop
    total = sum(word_count_dict.values())

    for w, count in word_count_dict.items():
        probs[w] = count / total

    return probs

**Building the word probability dictionary**

In [26]:
probs = get_probs(word_count_dict)
print(f"Length of probs is {len(probs)}")
print(f"P('love') is {probs['love']:.4f}")

Length of probs is 12461
P('love') is 0.0013
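
A quick sanity check on a probability dictionary like this one is that its values add up to 1. The sketch below shows the idea on a small made-up corpus; the names `toy_words`, `toy_counts`, `toy_probs` and `ranked` are hypothetical and do not appear in the notebook.

```python
import math
from collections import Counter

# Toy corpus (illustrative, not taken from the book).
toy_words = ["red", "pink", "red", "blue", "red", "pink"]
toy_counts = Counter(toy_words)

# Total number of tokens, computed once.
total = sum(toy_counts.values())
toy_probs = {w: c / total for w, c in toy_counts.items()}

# The probabilities of all vocabulary words should add up to 1.
assert math.isclose(sum(toy_probs.values()), 1.0)

# Rank words from most to least probable.
ranked = sorted(toy_probs.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0])  # ('red', 0.5)
```

The comparison uses `math.isclose` rather than `==` because a sum of floating-point fractions may differ from 1 by a tiny rounding error.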


## 2. String Manipulations