# Building a Spelling Corrector

Automatic spelling correction is something we encounter every day, but how it works isn't immediately obvious. With this notebook I aim to show how to build an understandable spelling corrector that, as a bonus, also works reasonably well.

## First, some probability theory
### Equation to maximize
Given a misspelled word `w`, we want to create a function `correct(w)` which returns the word the user was most likely trying to type. As there is unavoidable ambiguity in correcting a word ("ther" could be "their", "there", or "the", just to name a few) we will assign probabilities to each possible correction (candidate). Each candidate `c`'s probability represents the likelihood that given misspelled word `w`, `c` was the intended word: `P(c|w)`. However, this is a difficult probability to evaluate as there are too many factors involved. Ideally, we want to break down the problem into smaller equations and then combine them.

[Bayes' Theorem](https://en.wikipedia.org/wiki/Bayes'_theorem) tells us that `P(c|w)` is equivalent to `(P(c) * P(w|c)) / P(w)`. `P(w)` is simply the probability of the misspelled word appearing. It is the same for every candidate, so it doesn't affect the maximization and we can ignore it. This leaves us with the expression `P(c) * P(w|c)` to maximize over all candidates `c` for a misspelled word `w`.
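To see why dropping `P(w)` is safe, here is a small sketch with made-up numbers. The candidate words and every probability below are hypothetical, chosen only for illustration, and the sketch assumes these three words are the only possible intended words. Dividing every score by the same `P(w)` rescales the scores but does not change which candidate wins.

```python
# Hypothetical values for the misspelling "ther"; none of these numbers are measured.
prior = {"there": 0.004, "their": 0.003, "the": 0.05}      # P(c)
likelihood = {"there": 0.30, "their": 0.20, "the": 0.01}   # P(w|c)

# P(c) * P(w|c) for each candidate
unnormalized = {c: prior[c] * likelihood[c] for c in prior}

# P(w) is the same number for every candidate (here: the sum over our toy candidates)
p_w = sum(unnormalized.values())

# Full Bayes: P(c|w) = P(c) * P(w|c) / P(w)
posterior = {c: score / p_w for c, score in unnormalized.items()}

best_unnormalized = max(unnormalized, key=unnormalized.get)
best_posterior = max(posterior, key=posterior.get)
print(best_unnormalized, best_posterior)  # the same candidate both times
```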

That was the most complicated bit of this entire exercise; it's all downhill from here!

### Breaking down the equations
`P(c)`: The probability of the candidate word occurring in an English text. Words like "the", "and" etc. occur far more frequently than words like "monopoly" or "horticulture". This gives them a higher probability of occurring. In other words, you're more likely to be trying to write "when" than "whence", so P("when") should be higher than P("whence").

`P(w|c)`: The probability that the misspelled word `w` was typed when the author meant `c`. For example, `P("clasroom"|"classroom")` is very high, while `P("cladfalsasrosfsafam"|"classroom")` is very low.

Over all candidates `c`: We can't evaluate the equation for every word in the English language, so instead we'll evaluate it only on words that are a couple of simple edits away from the misspelled word `w`.

## Candidate selection
We define a simple edit to a word as any of the following:
- Deletion: Remove one letter
- Swap: Swap two adjacent letters
- Substitution: Change one letter to another
- Insertion: Add a new letter

This is implemented by the function `one_edit_from`:

In [1]:
LETTERS = "abcdefghijklmnopqrstuvwxyz"

In [2]:
def one_edit_from(word):
    # Generates splits of a word ex. "dog" -> [("", "dog"), ("d", "og"), ("do", "g"), ("dog", "")]
    splits = []
    for i in range(len(word) + 1):
        split = (word[:i], word[i:])
        splits.append(split)

    edits = []

    # Deletes ex. "dog" -> ["og", "dg", "do"]
    for (left, right) in splits:
        if right:
            deleted = left + right[1:]
            edits.append(deleted)

    # Swaps ex. "dog" -> ["odg", "dgo"]
    for (left, right) in splits:
        if len(right) > 1:
            swapped = left + right[1] + right[0] + right[2:]
            edits.append(swapped)

    # Substitutes ex. "dog" -> ["aog", "dag", "doa", ...]
    for (left, right) in splits:
        if right:  # skip the final split, which has no letter left to replace
            for sub in LETTERS:
                substituted = left + sub + right[1:]
                edits.append(substituted)

    # Inserts ex. "cat" -> ["acat", "caat", "caat", "cata", ...]
    for (left, right) in splits:
        for insert in LETTERS:
            inserted = left + insert + right
            edits.append(inserted)

    # Convert from list to set to remove duplicates
    return set(edits)

In [3]:
len(one_edit_from("dog"))

182
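The count of 182 can be broken down by edit type. The sketch below regroups the same four edits by kind (the helper name `edits_by_kind` is made up for this illustration and is not part of the notebook's code). For a word of length `n` it produces `n` deletions, `n - 1` swaps, `26n` substitutions and `26(n + 1)` insertions, so `54n + 25` strings before duplicates are removed.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits_by_kind(word):
    # Hypothetical helper: the same four simple edits, grouped by type
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        "delete": [l + r[1:] for l, r in splits if r],
        "swap": [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1],
        "substitute": [l + c + r[1:] for l, r in splits if r for c in LETTERS],
        "insert": [l + c + r for l, r in splits for c in LETTERS],
    }


kinds = edits_by_kind("dog")
raw_total = sum(len(edits) for edits in kinds.values())
unique_total = len(set().union(*kinds.values()))

print({kind: len(edits) for kind, edits in kinds.items()})
# {'delete': 3, 'swap': 2, 'substitute': 78, 'insert': 104}
print(raw_total, unique_total)
# 187 182
```

The five duplicates come from substituting a letter with itself (which yields "dog" three times) and from inserting a letter next to an identical one ("ddog", "doog" and "dogg" can each be built in two ways).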

In [4]:
list(one_edit_from("dog"))[:8]

['dogc', 'dyg', 'dogt', 'dopg', 'doi', 'dov', 'doq', 'dkg']

This list can get very long. We can handle this by filtering out words that don't exist in the English language. To do this, we create a constant `DICTIONARY` which stores the set of all the words in an English dictionary (read in from a text file). The `utils` functions are included at the end of the notebook.

In [5]:
from utils import get_dictionary
DICTIONARY = get_dictionary()

In [6]:
sorted(DICTIONARY)[:8]

['aah',
 'aardvark',
 'aardvarks',
 'abacus',
 'abacuses',
 'abalone',
 'abalones',
 'abandon']

We can now make the function `filter_unknown`, which removes all the words not in `DICTIONARY` from a collection of words.

In [7]:
def filter_unknown(words):
    known_words = []
    for word in words:
        if word in DICTIONARY:
            known_words.append(word)
    return set(known_words)

In [8]:
filter_unknown(one_edit_from("dog"))

{'bog',
 'cog',
 'dag',
 'dig',
 'do',
 'dob',
 'doc',
 'doe',
 'dog',
 'dogs',
 'doh',
 'don',
 'dos',
 'dot',
 'dug',
 'fog',
 'hog',
 'jog',
 'log',
 'tog',
 'wog'}

This is a far more manageable list, and is more useful as well because we don't want to correct misspelled words into words that don't exist in the English language.
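Since the dictionary is a set, the same filtering can also be expressed as a set intersection. This is a minimal sketch using a toy stand-in dictionary; `TOY_DICTIONARY` and `filter_unknown_with_intersection` are names invented for this example.

```python
# Toy stand-in for DICTIONARY; the real one is read from a file.
TOY_DICTIONARY = {"dog", "dig", "dug", "fog", "log"}


def filter_unknown_with_intersection(words, dictionary=TOY_DICTIONARY):
    # Keep only the words that appear in both collections
    return set(words) & dictionary


print(sorted(filter_unknown_with_intersection(["dog", "dkg", "dig", "doq"])))
# ['dig', 'dog']
```

The explicit loop is easier to follow step by step; the intersection is shorter and lets Python do the membership checks internally.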

To expand our search space, we'll also include candidates that are two simple edits from the original word. This function, `two_edits_from`, just stacks calls to `one_edit_from`.

In [9]:
def two_edits_from(word):
    # Find all words one edit away
    first_edits = one_edit_from(word)

    # Find all words one edit away from the first edits
    second_edits = set()
    for first_edit in first_edits:
        second_edits.update(one_edit_from(first_edit))

    all_edits = first_edits.union(second_edits)

    return all_edits

In [10]:
filter_unknown(two_edits_from("example"))

{'ample', 'examine', 'example', 'examples', 'sample', 'trample'}

Tying it all together, we write the function `get_candidates` which accepts a word `w` and returns `w` if `w` is known, else all known words one edit from `w`. If there are no known words one edit from `w` then it returns all known words two edits from `w`. Failing that, it just returns `w`.

This function is meant to represent `P(w|c)` as it returns the candidates most likely to be the intended word. However, it falsely assumes that any candidate one edit away from the original word is infinitely more likely to be intended than a word two edits away. This isn't always true. For example, if `w` is "rember" then our model thinks "member" is infinitely more likely than "remember" which is clearly not true.

In [11]:
def get_candidates(word):
    # If the word is in the dictionary, no need to correct it
    if word in DICTIONARY:
        return {word}

    # If we have words that are one edit away and in the dictionary, return those
    first_edits = filter_unknown(one_edit_from(word))
    if len(first_edits) != 0:
        return first_edits

    # If we have words that are two edits away and in the dictionary, return those
    second_edits = filter_unknown(two_edits_from(word))
    if len(second_edits) != 0:
        return second_edits

    # Otherwise, just return the original word as we couldn't find a candidate
    return {word}

In [12]:
get_candidates("compture")

{'capture',
 'compare',
 'compere',
 'composure',
 'compute',
 'computer',
 'couture'}
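The "rember" limitation can be checked directly. The sketch below uses a compact re-implementation of the four simple edits (the name `edits1` is made up for this example) to confirm that "member" is one edit away while "remember" only appears at two edits. Assuming "member" is in `DICTIONARY`, `get_candidates` returns at the one-edit tier, so "remember" is never considered.

```python
LETTERS = "abcdefghijklmnopqrstuvwxyz"


def edits1(word):
    # Compact version of the four simple edits: delete, swap, substitute, insert
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [l + r[1:] for l, r in splits if r]
    swaps = [l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1]
    substitutes = [l + c + r[1:] for l, r in splits if r for c in LETTERS]
    inserts = [l + c + r for l, r in splits for c in LETTERS]
    return set(deletes + swaps + substitutes + inserts)


one_away = edits1("rember")
two_away = {second for first in one_away for second in edits1(first)}

print("member" in one_away)    # True: substitute "r" -> "m"
print("remember" in one_away)  # False: it is two letters longer
print("remember" in two_away)  # True: two insertions
```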

## Evaluating P(c)
To calculate `P(c)` for any word `c`, we will use a large corpus of text: we count the occurrences of `c` in the corpus and divide by the total number of words in it. This gives us a good, if a bit crude, estimate of the probability of `c` showing up in our text. Our corpus will be a text file containing a large sample of ebooks from Project Gutenberg. We create a `Counter` that reads in all the words and accumulates a count for each one.

In [13]:
from utils import get_word_counts
WORD_COUNTS = get_word_counts()

In [14]:
WORD_COUNTS["the"]

2272721

In [15]:
WORD_COUNTS["monopoly"]

141

In [16]:
WORD_COUNTS["horticulture"]

39

In [17]:
WORD_COUNTS.most_common(8)

[('the', 2272721),
 ('of', 1241239),
 ('and', 1239810),
 ('to', 1038194),
 ('a', 820395),
 ('in', 667929),
 ('i', 588076),
 ('that', 528540)]

We can now create the function `probability_of`, which returns our estimate of `P(c)` given a candidate word `c`.

In [18]:
TOTAL_WORDS = sum(WORD_COUNTS.values())  # corpus size, computed once

def probability_of(word):
    return WORD_COUNTS[word] / TOTAL_WORDS

In [19]:
probability_of("the") # Large

0.05866879067049132

In [20]:
probability_of("monopoly") # Not large

3.639821819105502e-06
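The same estimate is easy to follow on a tiny scale. This sketch uses a made-up one-sentence corpus (the `toy_` names are invented for the example) and splits on whitespace instead of using the notebook's tokenizer.

```python
from collections import Counter

# A tiny made-up corpus; the real one is far larger.
toy_corpus = "the cat sat on the mat and the dog sat too"
toy_counts = Counter(toy_corpus.split())
toy_total = sum(toy_counts.values())


def toy_probability_of(word):
    # A Counter returns 0 for missing keys, so unseen words get probability 0
    return toy_counts[word] / toy_total


print(toy_probability_of("the"))    # 3 of the 11 words
print(toy_probability_of("sat"))    # 2 of the 11 words
print(toy_probability_of("horse"))  # 0.0, never seen in the corpus
```

Note that a word missing from the corpus gets a probability of exactly zero, however plausible it is as a correction.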

## Putting it together
We can now generate candidates and evaluate their probability of occurring. From these ingredients we can define a simple function to return the candidate with the maximum probability, completing our spelling corrector.

In [21]:
def correct(word):
    return max(get_candidates(word), key=probability_of)

In [22]:
misspellings = ["speling", "computinga", "clasrom", "ptyhon", "prgrammin", "jonahtan"]
for misspelling in misspellings:
    print(f"{misspelling} -> {correct(misspelling)}")

speling -> spelling
computinga -> computing
clasrom -> classroom
ptyhon -> python
prgrammin -> programming
jonahtan -> jonahtan


And it works! It apparently doesn't like names, though. Can you figure out why? (Hint: think about how we're filtering out unknown words.)

## Conclusion
I hope this notebook gave you some insight into how spelling correction works at a basic level. I've always found it interesting to see how the technologies we use every day might work, and I put together this notebook to share that interest.

Please reach out if you have any questions or suggestions!

## Attributions
- The dictionary file was sourced from SCOWL's 12Dicts package: http://wordlist.aspell.net/12dicts/
- The Gutenberg corpus was compiled from Shibamouli Lahiri's work: https://web.eecs.umich.edu/~lahiri/gutenberg_dataset.html

## Utility functions
These are the `utils` functions that we imported throughout the notebook. They generally just handle reading files.

In [23]:
import re
from collections import Counter
from os import listdir
from random import shuffle


def only_words(text: str):
    return re.findall(r"[^_\W]+", text.lower())


def get_dictionary():
    with open("./datasets/dictionary.txt") as reader:
        return set(only_words(reader.read()))


def get_word_counts():
    with open("./datasets/corpus.txt") as reader:
        return Counter(only_words(reader.read()))
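To make the tokenizer's behaviour concrete, here is `only_words` applied to a made-up sentence. The pattern `[^_\W]+` matches runs of letters and digits, so punctuation, apostrophes and underscores all act as separators.

```python
import re


def only_words(text: str):
    return re.findall(r"[^_\W]+", text.lower())


tokens = only_words("Don't panic: snake_case, chapter 42!")
print(tokens)
# ['don', 't', 'panic', 'snake', 'case', 'chapter', '42']
```

One consequence is that contractions are split into fragments ("don" and "t"), which then show up as separate entries in the word counts.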
