<b> Name: </b> MANAY, Justin Gabrielle A.

# Programming Exercise # 03: Spell Correction

## PART I. Language Model

In [1]:
import re
import os
import numpy as np

In [2]:
re_pattern = r'\b[a-zA-Z]+\-?\'?[a-zA-Z]+\b'

# Load corpus
text, tokens, vocab = [], [], []
corpus = {"Raw Text": text, "Tokens": tokens, "Vocabulary": vocab}
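# Build the path with os.path.join so it works on any OS and avoids the invalid "\c" escape
corpus_directory = os.path.join("data", "corpus")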
files = os.listdir(corpus_directory)
for file_name in files:
    with open(os.path.join(corpus_directory, file_name), 'r', encoding='utf8') as f:
        for line in f:
            line = line.lower()
            text.append(line)
            tokens.extend(re.findall(re_pattern, line))
            tokens.append("START/END")
            vocab.extend(list(set(re.findall(re_pattern, line))))

vocab.append("START/END")
            
# Find vocabulary set
total_vocabulary = set()
total_vocabulary |= set(corpus["Vocabulary"])

total_vocabulary = list(total_vocabulary)
vocabulary_count = len(total_vocabulary)

We first load and tokenize the corpus. We used the 2016 News Typical corpus from the Leipzig Corpora Collection, which can be accessed through the following [link](http://wortschatz.uni-leipzig.de/en/download/). The corpus contains up to 1 million sentences "randomly selected from newspaper texts or texts randomly collected from the web"; we used only 300,000 of them to keep the program reasonably fast.

Because the sentences come from news text, their vocabulary is relatively simple and skews toward frequently used English words. However, they also contain proper nouns and acronyms, which are not meant to be spell-checked, and the corpus will still miss some frequently occurring English words.

The sentences, the tokens and the vocabulary are stored in the dictionary `corpus`. After every sentence, we append the token `START/END` to mark the boundary between one sentence and the next. This is there in case we later employ a bigram model, where it prevents the tokens of one sentence from seeping into another. We then generate the vocabulary set from the per-sentence vocabularies.
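
As a small illustration of the tokenization step, the sketch below applies the same regular expression to a single made-up sentence. The helper name `tokenize_line` and the sample sentence are assumptions for this example only. Note that the pattern needs at least two letters to match, so single-letter words such as "a" are not picked up.

```python
import re

# Same pattern as in the cell above: two runs of letters with an optional
# hyphen and/or apostrophe between them.
TOKEN_PATTERN = r"\b[a-zA-Z]+\-?\'?[a-zA-Z]+\b"
BOUNDARY = "START/END"

def tokenize_line(line):
    """Lowercase one sentence, extract its word tokens, and append the boundary marker."""
    return re.findall(TOKEN_PATTERN, line.lower()) + [BOUNDARY]

print(tokenize_line("It's a well-known fact"))
# ["it's", 'well-known', 'fact', 'START/END']
```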

In [3]:
# Create frequency list for unigram language model
freqDict = dict((word, 0) for word in corpus["Vocabulary"])
for word in corpus["Tokens"]:
    if word == "START/END":
        continue
    else:
        freqDict[word] += 1 / vocabulary_count

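We will only be using the unigram language model for this assignment, which entails computing the relative frequency of each token, as is done by the code above. Note that the code divides each count by the vocabulary size rather than by the total number of tokens. This rescales every value by the same constant, so it does not change the ordering of the raw products `unigram_prob * error_prob` computed in Part IV.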
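
For comparison, here is a minimal sketch of a unigram model that divides by the total number of word tokens, so that the values form a proper probability distribution. The function name `unigram_probabilities` and the toy token list are assumptions for illustration; this is not the computation used in the cell above.

```python
from collections import Counter

def unigram_probabilities(tokens, boundary="START/END"):
    """Relative frequency of each word, ignoring the sentence-boundary marker."""
    counts = Counter(t for t in tokens if t != boundary)
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}

# Toy input: "the" accounts for 2 of the 5 word tokens.
toy_tokens = ["the", "cat", "saw", "the", "dog", "START/END"]
probs = unigram_probabilities(toy_tokens)
print(probs["the"])
```

Because the two versions differ only by a constant factor, either one can be used to pick the candidate with the largest raw product.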

## PART II. Error Model

In [4]:
# Load confusion matrix data
confusion_directory = os.path.join("data", "confusion")
file_name = os.listdir(confusion_directory)[0]
confusion = {"Ins": {}, "Del": {}, "Sub": {}, "Trans": {}}
with open(os.path.join(confusion_directory, file_name), 'r') as f:
    for line in f:
        delim1 = line.find("|")
        delim2 = line.find("\t")
        wrong = line[: delim1]
        right = line[delim1 + 1 : delim2]
        
        # Use a regex pattern for the frequency
        freq_re_pattern = r'\b[0-9]+\b'
        freq = int(re.findall(freq_re_pattern, line)[0])
        
        # Classify errors
        if len(wrong) == len(right) and len(wrong) == 2:
            confusion["Trans"][(wrong, right)] = freq
        elif len(wrong) == len(right):
            confusion["Sub"][(wrong, right)] = freq
        elif len(wrong) > len(right):
            confusion["Ins"][(wrong, right)] = freq
        elif len(wrong) < len(right):
            confusion["Del"][(wrong, right)] = freq

To create the error model, we use data on the frequency of single-edit (edit distance 1) misspellings in English. This [data](https://norvig.com/ngrams/) was compiled by Peter Norvig from misspellings collected from Wikipedia and by Roger Mitton, and records how often each single-edit error occurs. The misspellings are given in the form `wrong|right`, where `wrong` represents the misspelling and `right` represents the correct spelling. For example, `r|re` has a frequency of 392, meaning that a word that should have been spelled with `re` was spelled with `r` 392 times in that data.

We extract `wrong`, `right` and the frequency from the text file `count_1edit.txt` which was downloaded from the above link. Due to some strange formatting, we opted to extract `wrong` and `right` without using regex by using "|" and "\t" as delimiters. We extracted the frequency via regex. 

We then classified the errors based on whether they are transposition, substitution, insertion or deletion errors, after which we load them into the dictionary `confusion`. `confusion` is a dictionary within a dictionary, containing all the one-distance edits sorted by type of error. We use the tuple `(wrong, right)` as our keys and the frequency as our values. 
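
The parsing and classification rules described above can be sketched on a single line of input as follows. The helper names `parse_edit_line` and `classify_edit` are hypothetical, and the sketch assumes each line has the shape `wrong|right<TAB>count`; the real file may need extra care around unusual characters.

```python
def parse_edit_line(line):
    """Split a 'wrong|right<TAB>count' line into (wrong, right, count)."""
    pair, _, count = line.rstrip("\n").rpartition("\t")
    wrong, _, right = pair.partition("|")
    return wrong, right, int(count)

def classify_edit(wrong, right):
    """Apply the same length-based rules as the cell above."""
    if len(wrong) == len(right) == 2:
        return "Trans"
    if len(wrong) == len(right):
        return "Sub"
    return "Ins" if len(wrong) > len(right) else "Del"

wrong, right, freq = parse_edit_line("r|re\t392\n")
print(wrong, right, freq, classify_edit(wrong, right))
```

Here `r|re` is classified as a deletion, since the misspelling is shorter than the correct form.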

In [5]:
# Create confusion matrices
# insertion confusion matrix
ins_wrong_vals = list(set([key[0] for key in confusion["Ins"].keys()]))
ins_right_vals = list(set([key[1] for key in confusion["Ins"].keys()]))
insMatrix = np.zeros((len(ins_wrong_vals), len(ins_right_vals)))

for i in range(len(ins_wrong_vals)):
    for j in range(len(ins_right_vals)):       
        try:
            insMatrix[i, j] = confusion["Ins"][(ins_wrong_vals[i], ins_right_vals[j])]
        except KeyError:
            continue

# Perform add-one smoothing            
for i in range(len(ins_wrong_vals)):
    insMatrix[i, ] = (insMatrix[i, ] + 1) / (np.sum(insMatrix[i,]) + len(ins_wrong_vals) * len(ins_right_vals))

We then create the confusion matrices for all four error types. The rows of each confusion matrix represent the wrong values while the columns represent the right values. We first make a list of all unique wrong and right values and use them to fill the confusion matrix `insMatrix` from the values in `confusion`, using exception handling to skip pairs that are not keys in the dictionary.
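
Since the same steps are repeated for each error type, they could be wrapped in one helper. The sketch below is one possible way to do this, under the assumption that the counts are stored as `{(wrong, right): count}`; the name `build_count_matrix` and the toy counts are made up for illustration. It sorts the labels so that row and column order is the same on every run (the iteration order of a set of strings can differ between runs), and it fills only the observed pairs instead of looping over every cell.

```python
import numpy as np

def build_count_matrix(counts):
    """Turn {(wrong, right): count} into (matrix, wrong_labels, right_labels)."""
    wrong_vals = sorted({w for w, _ in counts})
    right_vals = sorted({r for _, r in counts})
    row = {w: i for i, w in enumerate(wrong_vals)}
    col = {r: j for j, r in enumerate(right_vals)}
    matrix = np.zeros((len(wrong_vals), len(right_vals)))
    for (w, r), n in counts.items():
        matrix[row[w], col[r]] = n
    return matrix, wrong_vals, right_vals

# Toy counts, not taken from the real data.
toy = {("a", "e"): 5, ("e", "a"): 3, ("a", "o"): 1}
m, wrong_vals, right_vals = build_count_matrix(toy)
print(wrong_vals, right_vals)
print(m)
```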

Since we will ultimately be multiplying the error probability by the unigram probability, and the confusion matrix has a lot of zeroes, there is a chance that we would end up with a probability of zero. Thus, we employ add-one smoothing to remove all the zeroes: we add one to every cell and then divide each row by its original row sum plus the total number of cells in the matrix.
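
For reference, the textbook form of add-one (Laplace) smoothing divides each row by its row total plus the number of columns, which makes every row sum to 1. The sketch below shows that variant on toy counts. It uses a different denominator from the cell above, so it is offered for comparison only, and the function name `add_one_smooth_rows` is an assumption.

```python
import numpy as np

def add_one_smooth_rows(counts):
    """Row-wise add-one smoothing: (count + 1) / (row_total + number_of_columns)."""
    counts = np.asarray(counts, dtype=float)
    n_cols = counts.shape[1]
    return (counts + 1) / (counts.sum(axis=1, keepdims=True) + n_cols)

toy = np.array([[0, 5, 1], [3, 0, 0]])
smoothed = add_one_smooth_rows(toy)
print(smoothed)
print(smoothed.sum(axis=1))  # each row sums to 1
```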

We then repeat this procedure for the other three matrices.

In [6]:
# substitution confusion matrix
sub_wrong_vals = list(set([key[0] for key in confusion["Sub"].keys()]))
sub_right_vals = list(set([key[1] for key in confusion["Sub"].keys()]))
subMatrix = np.zeros((len(sub_wrong_vals), len(sub_right_vals)))

for i in range(len(sub_wrong_vals)):
    for j in range(len(sub_right_vals)):       
        try:
            subMatrix[i, j] = confusion["Sub"][(sub_wrong_vals[i], sub_right_vals[j])]
        except KeyError:
            continue
            
for i in range(len(sub_wrong_vals)):
    subMatrix[i, ] = (subMatrix[i, ] + 1) / (np.sum(subMatrix[i,]) + len(sub_wrong_vals) * len(sub_right_vals))
    
# deletion confusion matrix
del_wrong_vals = list(set([key[0] for key in confusion["Del"].keys()]))
del_right_vals = list(set([key[1] for key in confusion["Del"].keys()]))
delMatrix = np.zeros((len(del_wrong_vals), len(del_right_vals)))

for i in range(len(del_wrong_vals)):
    for j in range(len(del_right_vals)):       
        try:
            delMatrix[i, j] = confusion["Del"][(del_wrong_vals[i], del_right_vals[j])]
        except KeyError:
            continue
            
for i in range(len(del_wrong_vals)):
    delMatrix[i, ] = (delMatrix[i, ] + 1) / (np.sum(delMatrix[i,]) + len(del_wrong_vals) * len(del_right_vals))

# transposition confusion matrix
trans_wrong_vals = list(set([key[0] for key in confusion["Trans"].keys()]))
trans_right_vals = list(set([key[1] for key in confusion["Trans"].keys()]))
transMatrix = np.zeros((len(trans_wrong_vals), len(trans_right_vals)))

for i in range(len(trans_wrong_vals)):
    for j in range(len(trans_right_vals)):       
        try:
            transMatrix[i, j] = confusion["Trans"][(trans_wrong_vals[i], trans_right_vals[j])]
        except KeyError:
            continue
            
for i in range(len(trans_wrong_vals)):
    transMatrix[i, ] = (transMatrix[i, ] + 1) / (np.sum(transMatrix[i,]) + len(trans_wrong_vals) * len(trans_right_vals))

## PART III. Selecting Candidates through Edit Distance

Now that we have our language and error models set up, we can create a procedure to generate the candidates for spell correction. To do this, we use the Damerau-Levenshtein minimum edit distance algorithm and choose all words in `total_vocabulary` that are exactly one edit away from the word to be spell-corrected. The implementation of this algorithm is shown below.

In [7]:
# Compute for Damerau-Levenshtein edit distance
def minEditDistance(word1, word2):
    # Initialize matrices
    distMatrix = np.zeros((len(word1) + 1, len(word2) + 1))
    ptrMatrix = [[[] for j in range(len(word2) + 1)] for i in range(len(word1) + 1)]
    insCost = 1
    delCost = 1
    transCost = 1
     
    distMatrix[0, 0] = 0     
    for i in range(1, len(word1) + 1):
        distMatrix[i, 0] = i * delCost
        ptrMatrix[i][0].append("UP")
    for j in range(1, len(word2) + 1):
        distMatrix[0, j] = j * insCost
        ptrMatrix[0][j].append("LEFT")
                
    # Fill up matrices
    for i in range(1, len(word1) + 1):
        for j in range(1, len(word2) + 1):
            
            if word1[i - 1] != word2[j - 1]:
                subCost = 1
            else:
                subCost = 0
            
            # Moving up consumes a letter of word1 (deletion); moving left consumes a letter of word2 (insertion)
            distMatrix[i, j] = min(distMatrix[i - 1, j] + delCost, distMatrix[i, j - 1] + insCost, distMatrix[i - 1, j - 1] + subCost)
            # Check the bounds first so that the indices below never wrap around
            isTrans = i > 1 and j > 1 and word1[i - 1] == word2[j - 2] and word1[i - 2] == word2[j - 1]
            if isTrans:
                distMatrix[i, j] = min(distMatrix[i, j], distMatrix[i - 2, j - 2] + transCost)
            
            if distMatrix[i, j] == distMatrix[i - 1, j] + delCost:
                ptrMatrix[i][j].append("UP")
            if distMatrix[i, j] == distMatrix[i, j - 1] + insCost:
                ptrMatrix[i][j].append("LEFT")
            if distMatrix[i, j] == distMatrix[i - 1, j - 1] + subCost:
                ptrMatrix[i][j].append("DIAG")
            # Record a transposition only where one is actually possible
            if isTrans and distMatrix[i, j] == distMatrix[i - 2, j - 2] + transCost:
                ptrMatrix[i][j].append("FLIP")

    # Return min edit distance
    dist = int(distMatrix[len(word1), len(word2)])
    
    # Form backtrace path using ptrMatrix
    backtracePath = []
    rownum, colnum = len(word1), len(word2)
    
    # Prioritize diagonal movement, then transpositions, then leftward and upward movement,
    # which is the same order used when printing the alignment below.
    while (rownum != 0) or (colnum != 0):
        backtracePath.append((rownum, colnum))
        if "DIAG" in ptrMatrix[rownum][colnum]:
            rownum -= 1
            colnum -= 1
        elif "FLIP" in ptrMatrix[rownum][colnum]:
            rownum -= 2
            colnum -= 2
        elif "LEFT" in ptrMatrix[rownum][colnum]:
            colnum -= 1
        elif "UP" in ptrMatrix[rownum][colnum]:
            rownum -= 1
   
    # Reverse backtracePath
    backtracePath = backtracePath[::-1]
        
    # Print alignment.
    word1Print = ""
    alignPrint = ""
    word2Print = ""
    
    for rownum, colnum in backtracePath:
        if "DIAG" in ptrMatrix[rownum][colnum]:
            word1Print += word1[rownum - 1]
            word2Print += word2[colnum - 1]
            if word1[rownum - 1] == word2[colnum - 1]:
                alignPrint += "M"
            else:
                alignPrint += "S"
        elif "FLIP" in ptrMatrix[rownum][colnum]:
            word1Print += word1[rownum - 2]
            word1Print += word1[rownum - 1]
            alignPrint += "TT"
            word2Print += word2[colnum - 2]
            word2Print += word2[colnum - 1]
        elif "LEFT" in ptrMatrix[rownum][colnum]:
            word1Print += " "
            alignPrint += "I"
            word2Print += word2[colnum - 1]
        elif "UP" in ptrMatrix[rownum][colnum]:
            word1Print += word1[rownum - 1]
            alignPrint += "D"
            word2Print += " "
    return {"Distance": dist, "Misspelled": word2Print, "Correct": word1Print, "Alignment": alignPrint}

We create our own edit distance function since we want to also align the correct and the misspelled words. The function is roughly the same as the Levenshtein edit distance function we implemented earlier but it now takes transpositions into account. 
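
To isolate how the transposition case extends the Levenshtein recurrence, here is a distance-only sketch without the pointer matrix or alignment. It implements the restricted variant of Damerau-Levenshtein (often called optimal string alignment), which is the recurrence used above. The function name `osa_distance` is an assumption for this example.

```python
def osa_distance(a, b):
    """Edit distance with unit-cost insert, delete, substitute and adjacent transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i              # delete every letter of a[:i]
    for j in range(cols):
        d[0][j] = j              # insert every letter of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + sub)   # match or substitution
            # Bounds are checked first so the indices below never wrap around.
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

print(osa_distance("actress", "acress"))  # one deletion
print(osa_distance("ab", "ba"))           # one transposition
```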

The costs for substitution, insertion, deletion and transposition are all set to 1 so that the algorithm isn't biased towards one type of edit. When several operations are tied, the alignment prioritizes matches/substitutions, then transpositions, then insertions and then deletions.

The output of the function is a dictionary with the edit distance, the aligned correct and misspelled words, and the alignment (i.e., the series of operations needed to go from the correct word to the misspelled word).

## PART IV. Combining the Error and Language Models

We can now ask the user for input and combine everything discussed in Parts I through III to form the spellchecker.

In [8]:
def spellCorrect():
    # Get (and tokenize, if using a bigram/trigram model) an input value/hard coded value
    inputWord = input("Please enter a string: ")

    # Screen candidates from vocabulary list
    candidateList = []

    if inputWord in total_vocabulary:
        print("No error")
    else:
        for word in total_vocabulary:
            dist = minEditDistance(word, inputWord)
            if dist["Distance"] == 1:
                candidateList.append(word)
        
        if candidateList == []:
            print("No word in corpus close enough to input")
        
        probDict = {}
        for candidate in candidateList:
            dist = minEditDistance(candidate, inputWord)

            unigram_prob = freqDict[candidate]
            if "I" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("I")

                if misspell_idx != len(inputWord) - 1:
                    wrong = dist["Misspelled"][misspell_idx: misspell_idx + 2]
                else:
                    wrong = dist["Misspelled"][misspell_idx - 2: misspell_idx]
                    
                matrix_wrong_idx = ins_wrong_vals.index(wrong)

                if misspell_idx != 0:
                    right = dist["Correct"][misspell_idx - 1]
                else:
                    right = dist["Correct"][misspell_idx + 1]
                matrix_right_idx = ins_right_vals.index(right)  

                error_prob = insMatrix[matrix_wrong_idx, matrix_right_idx]
            elif "S" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("S")

                wrong = dist["Misspelled"][misspell_idx]
                matrix_wrong_idx = sub_wrong_vals.index(wrong)

                right = dist["Correct"][misspell_idx]
                matrix_right_idx = sub_right_vals.index(right)

                error_prob = subMatrix[matrix_wrong_idx, matrix_right_idx]
            elif "D" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("D")

                if misspell_idx != 0:
                    wrong = dist["Misspelled"][misspell_idx - 1]
                else:
                    wrong = dist["Misspelled"][misspell_idx + 1]
                matrix_wrong_idx = del_wrong_vals.index(wrong)

                if misspell_idx != len(candidate) - 1:    
                    right = dist["Correct"][misspell_idx - 1: misspell_idx + 1]
                else:
                    right = dist["Correct"][misspell_idx - 2: misspell_idx]
                matrix_right_idx = del_right_vals.index(right)

                error_prob = delMatrix[matrix_wrong_idx, matrix_right_idx]
            elif "TT" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("T")

                wrong = dist["Misspelled"][misspell_idx: misspell_idx + 2]
                matrix_wrong_idx = trans_wrong_vals.index(wrong)

                right = dist["Correct"][misspell_idx: misspell_idx + 2]
                matrix_right_idx = trans_right_vals.index(right)

                error_prob = transMatrix[matrix_wrong_idx, matrix_right_idx]

            prob = unigram_prob * error_prob
            probDict[candidate] = prob

        for key in list(probDict.keys()):
            sum_prob = np.sum(list(probDict.values()))
            probDict[key] = probDict[key] / sum_prob

        # Sort dictionary by values
        # from: https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
        sorted_probDict = sorted(probDict.items(), key = lambda kv: kv[1], reverse = True)

        for candidate, prob in sorted_probDict:
            print(candidate + " (" + str(prob) + ")")

After asking for input, we first check if the word is in `total_vocabulary` (i.e., we check if the word has been correctly spelled). If not, we proceed with the spell correction.

First, we generate all possible candidates in the spell correction procedure by checking which words in `total_vocabulary` are within one edit distance from `inputWord`. We then get the unigram probability `unigram_prob` from the frequency dictionary we generated in Part I.

The error probability `error_prob` is more difficult to compute. We first use the alignment between the correct word and the misspelled word to check which operation (insertion, substitution, deletion or transposition) had been performed. After this, we check where this operation occurs and get the letters concerned from `dist["Correct"]` and `dist["Misspelled"]`.

This is straightforward for substitutions and transpositions, but it becomes harder for insertions and deletions.

Consider the following example.

In [9]:
example1 = minEditDistance("actress", "acress")
print(example1["Correct"])
print(example1["Alignment"])
print(example1["Misspelled"])

actress
MMDMMMM
ac ress


We could easily consider the pairing ` ` and `t`, but checking the deletion confusion matrix,

In [10]:
" " in del_right_vals

False

` ` is not part of `del_right_vals`! Instead, we can consider the pairing `c` and `ct` (the letter to the left of the deletion, and that letter followed by the deleted one) or the pairing `r` and `tr` (the letter to the right of the deletion, and the deleted letter followed by it), and sure enough,

In [11]:
"ct" in del_right_vals

True

In [12]:
"tr" in del_right_vals

True

We also run into a problem if the error happens to be at the beginning or end of a word. Consider this case.

In [13]:
example2 = minEditDistance("up", "tup")
print(example2["Correct"])
print(example2["Alignment"])
print(example2["Misspelled"])

 up
IMM
tup


We cannot use the earlier rule and choose the letter to the left of the insertion, since the insertion occurs at the beginning of the word.

Based on Norvig's confusion matrix data, a lot of the misspellings involving insertions and deletions are given in the form of two letters. However, the choice of which letters to consider for the spell correction is arbitrary. Thus, we use the rule from earlier (prioritize letters to the left of the insertion/deletion) and modify it accordingly whenever the error is at the beginning/end of the word. For example, if the error is at the beginning, we simply move to the right of the error and if it is at the end, we move to the left of the error.
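
The rule can be sketched for the deletion case as follows. The helper `deletion_pair` is hypothetical and only illustrates the idea of preferring the left neighbour and falling back to the right neighbour at the start of the word; it assumes the correct word has at least two letters and is not the exact indexing used in `spellCorrect`.

```python
def deletion_pair(correct, idx):
    """Return (wrong, right) for deleting correct[idx], in two-letter 'wrong|right' form."""
    if idx > 0:
        # Usual case: pair the deleted letter with its left neighbour.
        return correct[idx - 1], correct[idx - 1: idx + 1]
    # Deletion at the start: there is no left neighbour, so use the right one.
    return correct[1], correct[0:2]

print(deletion_pair("actress", 2))  # the 't' was dropped after 'c'
print(deletion_pair("pup", 0))      # the first 'p' was dropped before 'u'
```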

After determining which letters to consider, we look them up in the list of right and wrong values to determine where in the confusion matrix to look. We get `error_prob` afterwards.

Based on the noisy channel model, the correctly spelled word is the one that maximizes the product `unigram_prob * error_prob`. We therefore compute this product for every candidate and store it in `probDict`. We then normalize the scores by dividing them by their sum so that they are easier to compare, sort the dictionary by value in descending order, and print all candidates with the most likely correction first.
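
A minimal sketch of the normalize-and-rank step is shown below, using made-up scores. It computes the total once, before dividing, so that the normalized scores sum to 1. In the `spellCorrect` cell the sum is recomputed inside the loop after earlier entries have already been divided, so the printed values there need not sum to 1. The function name `rank_candidates` and the toy values are assumptions.

```python
def rank_candidates(raw_scores):
    """Normalize {candidate: unigram_prob * error_prob} and sort in descending order."""
    total = sum(raw_scores.values())  # computed once, before any division
    normalized = {word: score / total for word, score in raw_scores.items()}
    return sorted(normalized.items(), key=lambda kv: kv[1], reverse=True)

# Toy scores for illustration only; these are not values produced by the program.
toy_scores = {"trip": 3e-9, "wrap": 6e-9, "rip": 1e-9}
for word, p in rank_candidates(toy_scores):
    print(f"{word} ({p:.3f})")
```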

To see how the program works, we use an example.

In [14]:
spellCorrect()

Please enter a string: wrip
wrap (0.006592001262366998)
trip (0.002536683010987814)
rip (0.00046466501117905774)
writ (0.0003337265006042013)
grip (4.203277283906145e-05)
whip (1.866035138671428e-05)
drip (1.1633235451019149e-05)


For the word `wrip`, the program suggests the corrections `wrap`, `trip`, `rip`, `writ`, `grip`, `whip` and `drip`. Based on the noisy channel model, the most likely correction is `wrap`.

## PART V. Limitations of the Program

The program has a number of limitations due to the choice of corpus and confusion matrix data.

For one, because we use the corpus as our source of words, there is the possibility of encountering proper nouns, acronyms and abbreviations, as shown in the example below. The word `twp`, for example, is an abbreviation/acronym and should not be offered as a correction for the word `trp`.

In [15]:
spellCorrect()

Please enter a string: trp
try (0.0912031670838464)
trip (0.00041494453398646707)
tip (5.606767653398902e-05)
top (3.532722008583494e-05)
trap (2.8105938048280677e-05)
tap (2.1950807310506906e-05)
tarp (2.138440727597923e-06)
twp (1.5307220066271836e-06)
tpp (1.2245471201697923e-06)
irp (7.876104834396554e-07)
ttp (7.112139996374704e-07)
trt (5.970346583874939e-07)
trc (4.859423781402116e-07)
tcp (3.571630238224614e-07)
thp (2.5406090963834277e-07)
tro (2.160949262322328e-07)
tri (1.0794661190039474e-07)
tre (5.427487479304777e-08)
tru (5.403922990354686e-08)
zrp (4.92251657194827e-08)
rp (1.963707131901836e-08)
tp (1.9567735778192652e-08)
tpr (1.500209490112472e-08)


We also encounter errors involving double letters. In cases where there should have been double letters (deletion error), the spell corrector works just fine. However, when we have mistakenly doubled a letter (insertion error), the program throws an error. This is because the confusion matrix data does not account for double letters like `ll` or `rr` (as in `parrent`, shown below). We can modify the program accordingly to account for double letters.

In [16]:
spellCorrect()

Please enter a string: parrent


ValueError: 'rr' is not in list

Editing the program,

In [17]:
def spellCorrectEdited():
    # Get (and tokenize, if using a bigram/trigram model) an input value/hard coded value
    inputWord = input("Please enter a string: ")

    # Screen candidates from vocabulary list
    candidateList = []

    if inputWord in total_vocabulary:
        print("No error")
    else:
        for word in total_vocabulary:
            dist = minEditDistance(word, inputWord)
            if dist["Distance"] == 1:
                candidateList.append(word)
        
        if candidateList == []:
            print("No word in corpus close enough to input")

        probDict = {}
        for candidate in candidateList:
            dist = minEditDistance(candidate, inputWord)

            unigram_prob = freqDict[candidate]
            if "I" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("I")

                if misspell_idx != len(inputWord) - 1:
                    wrong = dist["Misspelled"][misspell_idx: misspell_idx + 2]
                    # Account for mistaken double letters
                    if wrong[0] == wrong[1]:
                        wrong = dist["Misspelled"][misspell_idx - 1: misspell_idx + 1]
                else:
                    wrong = dist["Misspelled"][misspell_idx - 2: misspell_idx]
                matrix_wrong_idx = ins_wrong_vals.index(wrong)

                if misspell_idx != 0:
                    right = dist["Correct"][misspell_idx - 1]
                else:
                    right = dist["Correct"][misspell_idx + 1]
                matrix_right_idx = ins_right_vals.index(right)  

                error_prob = insMatrix[matrix_wrong_idx, matrix_right_idx]
            elif "S" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("S")

                wrong = dist["Misspelled"][misspell_idx]
                matrix_wrong_idx = sub_wrong_vals.index(wrong)

                right = dist["Correct"][misspell_idx]
                matrix_right_idx = sub_right_vals.index(right)

                error_prob = subMatrix[matrix_wrong_idx, matrix_right_idx]
            elif "D" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("D")

                if misspell_idx != 0:
                    wrong = dist["Misspelled"][misspell_idx - 1]
                else:
                    wrong = dist["Misspelled"][misspell_idx + 1]
                matrix_wrong_idx = del_wrong_vals.index(wrong)

                if misspell_idx != len(candidate) - 1:    
                    right = dist["Correct"][misspell_idx - 1: misspell_idx + 1]
                else:
                    right = dist["Correct"][misspell_idx - 2: misspell_idx]
                matrix_right_idx = del_right_vals.index(right)

                error_prob = delMatrix[matrix_wrong_idx, matrix_right_idx]
            elif "TT" in dist["Alignment"]:
                misspell_idx = dist["Alignment"].index("T")

                wrong = dist["Misspelled"][misspell_idx: misspell_idx + 2]
                matrix_wrong_idx = trans_wrong_vals.index(wrong)

                right = dist["Correct"][misspell_idx: misspell_idx + 2]
                matrix_right_idx = trans_right_vals.index(right)

                error_prob = transMatrix[matrix_wrong_idx, matrix_right_idx]

            prob = unigram_prob * error_prob
            probDict[candidate] = prob

        for key in list(probDict.keys()):
            sum_prob = np.sum(list(probDict.values()))
            probDict[key] = probDict[key] / sum_prob

        # Sort dictionary by values
        # from: https://stackoverflow.com/questions/613183/how-do-i-sort-a-dictionary-by-value
        sorted_probDict = sorted(probDict.items(), key = lambda kv: kv[1], reverse = True)

        for candidate, prob in sorted_probDict:
            print(candidate + " (" + str(prob) + ")")

In [18]:
spellCorrectEdited()

Please enter a string: parrent
parent (0.04757196007312527)
warrent (0.00031485278154048257)


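However, since we take the letter before the insertion together with the inserted letter, the program still raises an error when the doubled letter is at the beginning of the word. There is no letter before index 0, so the slice `[misspell_idx - 1: misspell_idx + 1]` comes out empty.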

In [19]:
spellCorrectEdited()

Please enter a string: pparent


ValueError: '' is not in list
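
The empty string in the error message comes from Python's slicing rules: with `misspell_idx` equal to 0, the slice runs from index -1 (the last character) to index 1, which selects nothing. The sketch below reproduces this and shows one possible guard that clamps the start of the slice at 0. The helper `two_letter_context` is hypothetical, and whether the resulting pair exists in the confusion data is a separate question.

```python
def two_letter_context(aligned, idx):
    """Two characters of `aligned`, starting one to the left of idx but never before 0."""
    start = max(idx - 1, 0)
    return aligned[start: start + 2]

print(repr("pparent"[-1:1]))             # the slice taken when idx is 0
print(two_letter_context("pparent", 0))  # clamped start
print(two_letter_context("parrent", 3))  # same as [idx - 1: idx + 1] away from the start
```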

These are just some of the limitations associated with our choice of corpus and confusion matrix data. Ideally, one would build a corpus and confusion matrix data suited to one's own needs.

For a simple spell checker, for example, a list of sentences with high-frequency English words would be ideal. As for the confusion matrix data, limiting `wrong_vals` and `right_vals` to single characters would make the program much simpler and would make dealing with cases like the one above (double letters) easier.
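
Independently of the data format, one way to avoid a `ValueError` for pairs that are missing from the data is to keep the probabilities in a dictionary keyed by `(wrong, right)` and fall back to a small default. This is only a sketch: the function name `lookup_error_prob`, the default `floor` value and the toy probability are all assumptions, and the original program uses `list.index` on the label lists instead.

```python
def lookup_error_prob(probs, wrong, right, floor=1e-6):
    """Return the stored probability for (wrong, right), or `floor` if the pair is unseen."""
    return probs.get((wrong, right), floor)

toy_probs = {("r", "re"): 0.02}  # made-up value for illustration
print(lookup_error_prob(toy_probs, "r", "re"))   # stored value
print(lookup_error_prob(toy_probs, "rr", "r"))   # unseen pair falls back to the floor
```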

## REFERENCES

[1] Goldhahn, D., Eckart, T., & Quasthoff, U. (2012, May). Building Large Monolingual Dictionaries at the Leipzig Corpora Collection: From 100 to 200 Languages. In <i>LREC</i> (Vol. 29, pp. 31-43).

[2] Norvig, P. (2009). Natural language corpus data. In <i>Beautiful Data</i> (pp. 219-242).