(dl/04-sequence-models/04z-spell-corrector)=
# Appendix: Spelling correction

The following is adapted from [Norvig's *How to Write a Spelling Corrector*](https://norvig.com/spell-correct.html). It turns out that a spelling corrector can be built using techniques from language modeling. For this, we need a list of possible words and their frequencies (for estimating probabilities). We obtain these from Project Gutenberg books using our `ProjectGutenberg` class. Hence, the words are normalized to contain only lowercase ASCII characters.

In [1]:
from chapter import *

In [2]:
import re

def clean(text: str):
    return re.sub('[^A-Za-z]+', ' ', text).lower().strip()

urls = get_top100_books()
texts = []
for url in tqdm(urls):
    try:
        book = ProjectGutenberg(url=url, data_dir="./data")
        corpus, vocab = book.build()
    except Exception:  # skip books that fail to download or parse
        continue

    if len(clean(book.title)) < 5:  # language: en
        continue
    else:
        texts.append(book.text)

  0%|          | 0/100 [00:00<?, ?it/s]

Downloading text from https://www.gutenberg.org/cache/epub/74958/pg74958.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/74957/pg74957.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/74953/pg74953.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/74956/pg74956.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/74955/pg74955.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/5740/pg5740.txt ... Downloading text from https://www.gutenberg.org/cache/epub/74954/pg74954.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/33283/pg33283.txt ... Downloading text from https://www.gutenberg.org/cache/epub/39287/pg39287.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/74947/pg74947.txt ... OK!
Downloading text from https://www.gutenberg.org/cache/epub/26471/pg26471.txt ... Downloading text from https://www.gutenberg.org/cache/epub/120/pg120.txt ... OK

Filter out possible typos by dropping words that occur only once:

In [3]:
from collections import Counter

WORDS = Counter("".join(texts).split())
WORDS = dict([(w, f) for w, f in WORDS.items() if f > 1])
len(WORDS)

169608

<br>

## General approach

Given a string that a user typed, the function $f$ returns the word that the user most likely intended:

$$f(\textsf{typed}) = {\text{arg max}}_{\textsf{word} \in \mathcal{V}}\; P(\textsf{word} \mid \textsf{typed}).$$

To estimate the probabilities, we use Bayes' rule:

$$P(\textsf{word} \mid \textsf{typed}) \propto P(\textsf{typed} \mid \textsf{word}) \times P(\textsf{word}).$$
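
To see where the proportionality comes from, Bayes' rule can be written out in full. The denominator $P(\textsf{typed})$ is the same for every candidate, so dropping it does not change which word attains the maximum:

$$
\begin{aligned}
f(\textsf{typed})
&= {\text{arg max}}_{\textsf{word} \in \mathcal{V}}\; \frac{P(\textsf{typed} \mid \textsf{word}) \, P(\textsf{word})}{P(\textsf{typed})} \\
&= {\text{arg max}}_{\textsf{word} \in \mathcal{V}}\; P(\textsf{typed} \mid \textsf{word}) \, P(\textsf{word}).
\end{aligned}
$$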

In particular, if the typed string is itself a known word, then we expect that word to be returned.
Decomposing $P(\textsf{word} \mid \textsf{typed})$ into two factors allows us to address two questions separately when determining the intended word: (1) the probability of producing the typed string from a candidate word, and (2) the probability of the candidate word itself.

For example, suppose a user typed "thew", and consider the two candidates "the" and "thaw". Each candidate is one edit away from "thew", so on the error side we have to determine which error is more probable, say, based on keyboard layout and common keystroke errors. On the other hand, "the" is far more probable than "thaw" as a word, which makes it the more likely intended word. Thus, the decomposition makes these separate considerations more transparent.
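
The "thew" example can be made concrete with a small sketch. The counts and error probabilities below are hypothetical values chosen for illustration, not estimates from the corpus. The point is only that the two factors are computed separately and then multiplied.

```python
# Hypothetical word counts (language model) and typo probabilities (error model).
toy_counts = {"the": 80_000, "thaw": 12}
toy_error = {"the": 0.0005, "thaw": 0.002}  # assumed values of P("thew" | word)

total = sum(toy_counts.values())


def score(word: str) -> float:
    """Unnormalized posterior: P(typed | word) * P(word)."""
    prior = toy_counts[word] / total
    return toy_error[word] * prior


best = max(toy_counts, key=score)
print(best)
```

With these made-up numbers, the candidate with the less likely typo still wins because its prior is far larger.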

**Remark.** There are four elements in this approach that can be played with:

| Element | Description | Implementation |
| :-: | :-:| :-: |
|Selection mechanism     |  Method for selecting the word out of candidates  |  $\text{argmax}$   | 
|Candidate model | Set of candidates to consider |   $\mathcal{V}$ |
| Language model | Probability of candidate word | $P(\textsf{word})$ |
| Error model | Probability of typographical error from a candidate | $P(\textsf{typed} \mid \textsf{word})$ |

For example, if $\mathcal{V}$ is very large, then it may be impractical to consider every word. Also, the language model can be extended to include previously typed words as context. In that case, we evaluate $P(\textsf{w}_1, \ldots, \textsf{w}_\tau, \textsf{w})$ where $\textsf{w}$ is the candidate word, $\textsf{w}_1, \ldots, \textsf{w}_\tau$ are the preceding words, and $\tau$ is the context size.
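
As a rough sketch of the context idea, a bigram model ($\tau = 1$) scores a candidate by how often it follows the previous word. The sketch uses the conditional probability, which ranks candidates the same way as the joint probability when the context is fixed. The tiny corpus and the helper name `bigram_score` are made up for illustration.

```python
from collections import Counter

# Toy corpus, assumed for illustration only.
tokens = "i want to thaw the ice and then thaw the fish in the sink".split()

unigrams = Counter(tokens)
bigrams = Counter(zip(tokens, tokens[1:]))


def bigram_score(prev: str, word: str) -> float:
    """Estimate P(word | prev) from counts; 0.0 if `prev` was never seen."""
    if unigrams[prev] == 0:
        return 0.0
    return bigrams[(prev, word)] / unigrams[prev]


# Compare two candidates that could follow "to" in this toy corpus.
print(bigram_score("to", "thaw"), bigram_score("to", "the"))
```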

<br>

## Norvig's spelling corrector

1. **Selection mechanism.** Simply use argmax. 

2. **Language model.** Based on the word's empirical frequency in the corpus:

In [4]:
def P(word, N=sum(WORDS.values())):
    """Empirical probability of `word`; N is computed once, when P is defined."""
    return WORDS.get(word, 0) / N


print(P("qwerty"))
print(P("dawg"))

0.0
1.2838775155172647e-07


3. **Candidate model.** Known words (i.e., words that occur in the corpus) that are 0, 1, or 2 edits away:

In [5]:
def known(words: list[str] | set[str]):
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from `word`. Transpose means swap adjacent."
    letters    = "abcdefghijklmnopqrstuvwxyz"
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    return set([e2 for e1 in edits1(word) for e2 in edits1(e1)])

**Remark.** Let $n$ be the length of the word, with an alphabet of $26$ letters.
Then, there are $n$ deletes, $n - 1$ transposes, $26n$ replaces, and $26(n + 1)$ inserts.
Hence, there are at most $54n + 25$ strings one edit away. The number of strings two edits away is therefore on the order of the square of this. Since we only consider known words and the results are deduplicated, the number of candidates goes down significantly.
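
The count can be checked by tallying each edit type before deduplication. The helper `edit_counts` below is not part of the original code; it is a sketch that mirrors the list comprehensions in `edits1`.

```python
def edit_counts(word: str) -> dict[str, int]:
    """Count the edits of each type generated for `word`, before deduplication."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return {
        "deletes": sum(1 for L, R in splits if R),
        "transposes": sum(1 for L, R in splits if len(R) > 1),
        "replaces": sum(1 for L, R in splits if R for c in letters),
        "inserts": sum(1 for L, R in splits for c in letters),
    }


counts = edit_counts("somthing")  # n = 8
print(counts, sum(counts.values()))
```

For $n = 8$ the total equals $54n + 25 = 457$. The set returned by `edits1` is smaller because different edits can produce the same string; for example, replacing a letter with itself returns the original word.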

In [6]:
len(edits1("somthing")), 54 * 8 + 25

(442, 457)

In [7]:
known(edits1("smething"))

{'seething', 'something'}

In [8]:
known(edits2("smething"))

{'scathing',
 'seething',
 'seethings',
 'setling',
 'setting',
 'sheathing',
 'sketching',
 'smashing',
 'smearing',
 'smelling',
 'smelting',
 'smiting',
 'smoothing',
 'somethin',
 'something',
 'somethings',
 'somethink',
 'soothing',
 'swathing',
 'teething'}

4. **Error model.** Norvig's error model does not depend on user data, which makes it an interesting baseline. All known words at edit distance 1 are treated as infinitely more probable than known words at edit distance 2, and infinitely less probable than a known word at edit distance 0. In practice, we implement this by restricting the candidates to the closest non-empty set of known words and assigning
$P(\textsf{typed} \mid \textsf{word}) = 1$ to this subset. Hence, the candidates with the highest $P(\textsf{word})$ are returned.

The following algorithm is used to produce the list of candidates in order of priority:

```text
1. The original word, if it is known; otherwise
2. The list of known words at edit distance one away, if there are any; otherwise
3. The list of known words at edit distance two away, if there are any; otherwise
4. Return empty set.
```

In [9]:
def candidates(word): 
    """Return 0-edit words if non-empty, else 1-edit, else 2-edit, else {}."""
    return known([word]) or known(edits1(word)) or known(edits2(word)) or set()

<br>

**Final model.** Finally, we have the function:

In [10]:
import numpy as np

def correction(word, n=1):
    c = list(candidates(word))
    n = min(n, len(c))
    p = [-P(w) for w in c]
    words = [c[i] for i in np.argpartition(p, n-1)]
    return words[:n]

Trying it out with a word that has no obvious correction:

In [11]:
correction("qwertyu")

[]

A more natural example:

In [12]:
correction("sumthing")  # top 1

['something']

Checking the top 10 suggestions (only seven candidates are available):

In [13]:
correction("sumthing", n=10)  # top 10

['soothing',
 'summing',
 'seething',
 'something',
 'scathing',
 'suiting',
 'swathing']
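
Note that `np.argpartition` only guarantees that the `n` best candidates come first. It does not order them among themselves, which is why "something" is not at the top of the list above. If a ranked list is wanted, one option is `heapq.nlargest` from the standard library. A minimal sketch, with made-up probabilities standing in for `P` and a hypothetical helper `top_n`:

```python
import heapq

# Hypothetical probabilities standing in for the language model P.
toy_P = {"something": 3e-4, "soothing": 2e-6, "summing": 1e-6, "seething": 5e-7}


def top_n(candidates, n=1, prob=toy_P.get):
    """Return up to n candidates, sorted from most to least probable."""
    return heapq.nlargest(n, candidates, key=prob)


print(top_n(toy_P.keys(), n=3))
```

Top-$k$ accuracy is unaffected by the ordering within the first $k$ results, so this only matters when the suggestions are shown to a user.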

<br> 

## Evaluation

The model has to be evaluated against actual user data, for example, data from typing tests. To do this, we follow Norvig and use Roger Mitton's [Birkbeck spelling error corpus](https://www.dcs.bbk.ac.uk/~ROGER/missp.dat) from the Oxford Text Archive.

In [14]:
!wget https://www.dcs.bbk.ac.uk/~ROGER/missp.dat -O data/birbeck.txt

--2024-12-22 23:20:34--  https://www.dcs.bbk.ac.uk/~ROGER/missp.dat
Resolving www.dcs.bbk.ac.uk (www.dcs.bbk.ac.uk)... 193.61.29.21
Connecting to www.dcs.bbk.ac.uk (www.dcs.bbk.ac.uk)|193.61.29.21|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 368933 (360K)
Saving to: ‘data/birbeck.txt’


2024-12-22 23:20:37 (473 KB/s) - ‘data/birbeck.txt’ saved [368933/368933]



In [15]:
from pprint import pprint

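def errors_dataset():
    """Map each string in the corpus to the list of words it may stand for."""
    with open("data/birbeck.txt") as f:
        lines = [l.lower() for l in f.read().strip().split("\n") if len(l) > 0]

    data = {}
    for line in lines:
        if line[0] == "$":       # "$word" starts a block of misspellings of `word`
            word = line[1:]
            data[word] = [word]  # a correct word maps to itself
        else:                    # a misspelling of the most recent `word`
            data[line] = data.get(line, []) + [word]
    return data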


data = errors_dataset()
print(len(data))
pprint(dict(list(data.items())[:10]))

38893
{'ab': ['albert', 'obvious'],
 'albert': ['albert'],
 'ameraca': ['america'],
 'ameracan': ['american'],
 'amercia': ['america'],
 'america': ['america'],
 'american': ['american'],
 'apirl': ['april'],
 'april': ['april'],
 'austrian': ['austrian']}


Checking top 1, 3, 5 accuracy:

In [16]:
import random
from tqdm.notebook import tqdm


def evaluate(f, sample, top_k=(1, 3, 5)):
    evals = {}
    for top in tqdm(top_k):
        hits = 0
        for w in tqdm(sample):
            hits += int(len(set(data[w]) & set(f(w, n=top))) >= 1)
        evals[top] = hits / len(sample)
    return evals


TEST_FRAC = 0.005
sample = random.sample(list(data.keys()), k=int(TEST_FRAC * len(data)))  # random.sample needs a sequence
evaluate(correction, sample)

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/194 [00:00<?, ?it/s]

  0%|          | 0/194 [00:00<?, ?it/s]

  0%|          | 0/194 [00:00<?, ?it/s]

{1: 0.4381443298969072, 3: 0.4536082474226804, 5: 0.4536082474226804}

<br>

## Improving the error model

Let's try to improve the error model. Instead of assigning the same probability to every candidate in the selected set, we suppose the probability of an error is proportional to a decay factor $G_m = 10^{-3m}$ for $m = 0, 1, 2$ edits, with $G_m = 0$ for $m > 2$. In other words, we only consider candidate words up to two edits away, similar to Norvig's model.

In [17]:
def correction2(word, n=1):
    e0 = list(known([word]))
    e1 = list(known(edits1(word)))
    e2 = list(known(edits2(word)))
    c = e0 + e1 + e2
    G = lambda m: 10 ** -(3 * m)
    p = [-G(0) * P(w) for w in e0] + \
        [-G(1) * P(w) for w in e1] + \
        [-G(2) * P(w) for w in e2]
    
    n = min(n, len(p))
    words = [c[i] for i in np.argpartition(p, n-1)]
    return words[:n]


evaluate(correction2, sample)  # same validation set

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/194 [00:00<?, ?it/s]

  0%|          | 0/194 [00:00<?, ?it/s]

  0%|          | 0/194 [00:00<?, ?it/s]

{1: 0.4536082474226804, 3: 0.520618556701031, 5: 0.5309278350515464}

One implication of this is that even if the typed word is known, if it's extremely rare, then our model assumes that it has been typed by mistake. A more probable edited word is then suggested:

In [18]:
print(bool(known(["debar"]))) 
print(correction2("debar"), P("debar"), P("dear") * 0.001)

True
['dear'] 3.2096937887931614e-07 3.8458550977319665e-07


It follows that the decay factors can be tuned based on data. For example, if we use $G_m = 10^{-m}$, then "segmnt" is corrected to "sent" (two edits away) rather than "segment" (one edit away):

In [19]:
P("segment") < P("sent") * 0.1 ** 2

True
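
One detail of `correction2` is worth noting. Because `edits1` includes replacing a letter with itself, every string in `edits1(word)` also appears in `edits2(word)`, so the same candidate can be listed more than once with different weights. A possible variant, sketched here with a hypothetical helper `best_weights` and a toy vocabulary with made-up counts, keeps only the weight from the smallest edit level per word:

```python
def edits1(word):
    """All strings one edit away from `word` (same construction as above)."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)


def edits2(word):
    return {e2 for e1 in edits1(word) for e2 in edits1(e1)}


# Toy vocabulary with made-up counts (an assumption, not the Gutenberg corpus).
toy_words = {"dear": 1200, "debar": 1, "bear": 300}
N = sum(toy_words.values())


def best_weights(word, G=lambda m: 10 ** -(3 * m)):
    """Map each known candidate to G(m) * P(candidate), m being its smallest edit level."""
    levels = [{word}, edits1(word), edits2(word)]
    scores = {}
    for m, level in enumerate(levels):
        for w in level:
            if w in toy_words and w not in scores:  # first hit has the smallest m
                scores[w] = G(m) * toy_words[w] / N
    return scores


scores = best_weights("debar")
print(sorted(scores, key=scores.get, reverse=True))
```

Each candidate now appears once, so a top-$k$ list cannot spend two slots on the same word.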

<br>

**Final evaluation.** Since the parameters above were manually tuned on the given sample, we evaluate on a separate test set:

In [20]:
comp = set(data.keys()) - set(sample)
test = random.sample(sorted(comp), int(TEST_FRAC * len(comp)) * 3)  # random.sample needs a sequence
print(evaluate(correction, test))

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/579 [00:00<?, ?it/s]

  0%|          | 0/579 [00:00<?, ?it/s]

  0%|          | 0/579 [00:00<?, ?it/s]

{1: 0.4214162348877375, 3: 0.45595854922279794, 5: 0.46632124352331605}


In [21]:
print(evaluate(correction2, test))

  0%|          | 0/3 [00:00<?, ?it/s]

  0%|          | 0/579 [00:00<?, ?it/s]

  0%|          | 0/579 [00:00<?, ?it/s]

  0%|          | 0/579 [00:00<?, ?it/s]

{1: 0.4214162348877375, 3: 0.49740932642487046, 5: 0.5215889464594128}
