# Exercise 4 - Spelling Correction | NLP

### Task 1
- Implement a language-model function for $P(w)$.
- Load the file "data.txt", keep only the alphabetic expressions (tip: use a regex), and count how often each word occurs.
- What are the ten most frequent words and what is their probability of occurrence?

In [1]:
import collections
import re
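# Tokenize on runs of word characters (note: \w also matches digits and underscores).
with open('data.txt') as file:
    corpus = [w.lower() for w in re.findall(r'\w+', file.read())]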
words = collections.Counter(corpus)

In [2]:
def languageModel(word, n=sum(words.values())):
    # Relative frequency of `word`; n (the total token count) is evaluated once, at definition time.
    return words[word] / n

In [3]:
print("{:<5}{:>8}\t{:<4}".format("w", "freq(w)", "p(w)"), "\n", "-"*18)
for word, freq in words.most_common(10):
    print("{:<5}{:8d}\t{:0.2f}".format(word, freq, languageModel(word)))

w     freq(w)	p(w) 
 ------------------
the     46220	0.08
of      25494	0.05
and     16778	0.03
in      13292	0.02
to      12640	0.02
a       10957	0.02
is       6592	0.01
it       5271	0.01
that     4527	0.01
by       4339	0.01
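
The relative-frequency estimate used above can be checked on a tiny example. The sketch below uses a made-up sentence (not `data.txt`) and also contrasts `\w+` with a strictly alphabetic pattern, since `\w` accepts digits as well; the names `tokens_w`, `tokens_alpha` and `p` are illustrative only.

```python
import collections
import re

# Hypothetical toy text, used only to illustrate the estimate P(w) = count(w) / N.
text = "The cat saw the other cat in room 42."

tokens_w = re.findall(r"\w+", text.lower())         # keeps the token "42"
tokens_alpha = re.findall(r"[a-z]+", text.lower())  # alphabetic tokens only

counts = collections.Counter(tokens_alpha)
total = sum(counts.values())

def p(word):
    """Maximum-likelihood unigram probability; unseen words get 0."""
    return counts[word] / total

for word, freq in counts.most_common(3):
    print(word, freq, p(word))
```

Because a `Counter` returns 0 for missing keys, unseen words receive probability 0 without any special handling.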


### Task 2
- Implement a function that creates a set of possible correction candidates $C_w$ for a given word $w$.
- You do not need to chain edit operations; candidates at edit distance one are sufficient.
- To make things even simpler:
    - You can ignore unknown words (words that do not occur in our corpus).
    - If none of the candidates is known, return the word itself.
- What is the set of Candidates for the word "frod"?

In [4]:
import string
# The following functions were written by Peter Norvig (https://norvig.com/spell-correct.html)
def edits1(word):
    letters    = string.ascii_lowercase
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def known(wordies): 
    return set(w for w in wordies if w in words)

def candidates(word): 
    return (known([word]) or known(edits1(word)) or {word})

In [5]:
candidates("nie")

{'die', 'lie', 'nee', 'nice', 'nile', 'nine', 'nip', 'pie', 'tie', 'vie'}

In [6]:
candidates("frod")

{'food', 'ford', 'fro', 'from', 'rod'}
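
To get a feeling for how many strings `edits1` generates before the `known` filter is applied, the following sketch counts the four kinds of single edits separately. It re-implements the list comprehensions from above in a helper called `edit_lists` (a name introduced here for illustration), so it runs without the corpus.

```python
import string

def edit_lists(word, letters=string.ascii_lowercase):
    """Return the four distance-one edit lists, before deduplication."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return deletes, transposes, replaces, inserts

deletes, transposes, replaces, inserts = edit_lists("frod")
total = len(deletes) + len(transposes) + len(replaces) + len(inserts)
unique = len(set(deletes + transposes + replaces + inserts))
print(len(deletes), len(transposes), len(replaces), len(inserts), total, unique)
```

For a word of length $n$ this gives $n$ deletions, $n-1$ transpositions, $26n$ replacements and $26(n+1)$ insertions, i.e. $54n + 25$ strings in total. The set is somewhat smaller because some edits coincide, and only a handful of them are known words.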

## Error Model

### Task 3
- Implement a function `correction(word)`
- If $w$ is a known word, assume $P(w) > P(c^{(k)})$ for every candidate and return $w$.
- If $w$ is unknown, compute $\arg\max_k P(c^{(k)})$ over its known candidates and return the corresponding $c^{(k)}$.
- If both $w$ and its corrections are unknown, return $w$.


In [7]:
def correction(word): 
    return max(candidates(word), key=languageModel)

In [9]:
correction("halo")

'half'

## Evaluation
The file `validate.csv` contains a list of evaluation data of the form:

    "right, misspelled"

### Task 4
- Use the validation data set to test your first version of the spelling correction function.
- What is your success rate?
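
Before running the evaluation, it can help to check how the rows are parsed. The sketch below uses a hypothetical in-memory stand-in for `validate.csv` (the real file's contents are not shown here) and assumes that a space may follow the comma, as the form "right, misspelled" suggests. `read_pairs` and `success_rate` are illustrative names, not part of the exercise.

```python
import csv
import io

# Hypothetical sample rows; the blank line mimics an empty row in the file.
sample = "access,acess\nforward, foward\n\n"

def read_pairs(handle):
    """Parse (right, misspelled) rows, skipping blank lines and stray spaces."""
    reader = csv.reader(handle, skipinitialspace=True)
    return [(right.strip(), wrong.strip())
            for right, wrong in (row for row in reader if row)]

def success_rate(pairs, correct):
    """Fraction of misspellings that `correct` maps back to the right word."""
    hits = sum(1 for right, wrong in pairs if correct(wrong) == right)
    return hits / len(pairs)

pairs = read_pairs(io.StringIO(sample))
print(pairs)
print(success_rate(pairs, lambda w: w))  # baseline: never change the word
```

The identity baseline gives a lower bound to compare any correction function against.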

In [10]:
import csv
def validate(correction, verbose=False):
    with open("validate.csv", "r") as file:
        r = csv.reader(file)
        testset = [row for row in r if row]
    failed = []
    if verbose:
        print("  i{:>20}{:>20}{:>20}".format("expected", "actual", "original"))
    for i, (right, wrong) in enumerate(testset):
        c = correction(wrong)
        if c != right: 
            failed.append((right, c, wrong))
            if verbose:
                print("{:3d}{:>20}{:>20}{:>20}".format(i, right, c, wrong))
    success_rate = 1 - len(failed) / len(testset)
    return failed, success_rate

In [11]:
failed0, sr0 = validate(correction)
print(sr0)

0.7486033519553073


## The Noisy Channel Model
### Task 5
- Implement the Noisy Channel Model as a new error model.
- You can use the confusion matrices given below.
- Revalidate your correction function and present your results.
- Again, you can restrict your functions to known words at edit distance one.
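
As a reminder of what the task asks for, the noisy channel decision rule picks the candidate that maximizes the product of the channel probability and the language-model prior:

$$
\hat{c} = \arg\max_{c^{(k)} \in C_w} P(w \mid c^{(k)}) \, P(c^{(k)})
$$

For a single edit at position $p$, the channel probability is estimated from the confusion matrices. The form below follows the formulation of Kernighan et al.; the index conventions ($c$ for the intended word, $w$ for the typed word) are stated here as an assumption and should be checked against the key layout of the data files:

$$
P(w \mid c) \approx
\begin{cases}
\dfrac{\mathrm{del}[c_{p-1}, c_p]}{\mathrm{chars}[c_{p-1} c_p]} & \text{deletion} \\
\dfrac{\mathrm{add}[c_{p-1}, w_p]}{\mathrm{chars}[c_{p-1}]} & \text{insertion} \\
\dfrac{\mathrm{sub}[w_p, c_p]}{\mathrm{chars}[c_p]} & \text{substitution} \\
\dfrac{\mathrm{rev}[c_p, c_{p+1}]}{\mathrm{chars}[c_p c_{p+1}]} & \text{transposition}
\end{cases}
$$

Here $\mathrm{chars}[\cdot]$ counts how often the letter or letter pair occurs in the training text.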

In [12]:
import ast
# from "A Spelling Correction Program Based on a Noisy Channel Model" by Kernighan et al.
with open('addconfusion.data', 'r') as file:
    addmatrix = ast.literal_eval(file.read())
with open('subconfusion.data', 'r') as file:
    submatrix = ast.literal_eval(file.read())
with open('revconfusion.data', 'r') as file:
    revmatrix = ast.literal_eval(file.read())
with open('delconfusion.data', 'r') as file:
    delmatrix = ast.literal_eval(file.read())

In [13]:
def edit_substitutions(L, R):
    edits = []
    if len(R) >= 1:
        for c in string.ascii_lowercase:
            word = L + c + R[1:]
            if word in words:
                edits.append(("sub", word, R[0] + c))
    return edits

def edit_transposes(L, R):
    edits = []
    if len(R) > 1:
        word = L + R[1] + R[0] + R[2:]
        if word in words:
            edits.append(("rev", word, R[0] + R[1]))
    return edits

def edit_deletes(L, R):
    # Candidate obtained by deleting R[0] from the typed word.
    edits = []
    if R:
        word = L + R[1:]
        # '#' marks the start of the word when there is no preceding letter.
        ed = (L[-1] if L else "#") + R[0]
        if word in words:
            edits.append(("del", word, ed))
    return edits

def edit_adds(L, R):
    # Candidates obtained by inserting a letter c between L and R.
    edits = []
    prev = L[-1] if L else "#"
    for c in string.ascii_lowercase:
        word = L + c + R
        if word in words:
            edits.append(("add", word, prev + c))
    return edits

In [14]:
def candidates(word):
    if known([word]):
        return None
    splits = [(word[:i], word[i:])  for i in range(len(word) + 1)]
    edits = [f(L, R) for L, R in splits for f in [edit_adds, edit_deletes, edit_transposes, edit_substitutions]]
    edits = [i for e in edits for i in e if i]
    return edits

In [15]:
# Space-separated vocabulary, built once and used for the character-count denominators.
vocab_text = " ".join(words)

def errorModel(edit, candidate, xy):
    x, y = xy
    try:
        if edit == 'add':
            if x == '#':
                return addmatrix[x+y] / vocab_text.count(' '+y)
            return addmatrix[x+y] / vocab_text.count(x)
        if edit == 'sub':
            return submatrix[x+y] / vocab_text.count(y)
        if edit == 'rev':
            return revmatrix[x+y] / vocab_text.count(x+y)
        if edit == 'del':
            if x == '#':
                return delmatrix[x+y] / vocab_text.count(' '+y)
            return delmatrix[x+y] / vocab_text.count(x+y)
    except Exception:
        # Show the offending edit, then re-raise the original error with its traceback.
        print(x, y, edit, candidate)
        raise

In [16]:
def model(edit, candidate , xy):
    return languageModel(candidate) * errorModel(edit, candidate, xy)

In [17]:
import numpy as np
def correction(word): 
    cs = candidates(word)
    if cs:
        idx = np.argmax([model(*c) for c in cs])
        return cs[idx][1]
    return word

In [18]:
failed, sr = validate(correction)
print(sr)

0.7877094972067039
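
The two success rates alone do not show which words changed. Since `validate` returns the failures as `(right, actual, wrong)` triples, the two failure lists can be compared directly. The sketch below does this on hypothetical failure lists (not the notebook's actual results); `diff_failures` is an illustrative helper name.

```python
def diff_failures(before, after):
    """Compare two failure lists, keyed by (right, wrong).

    Returns (fixed, newly_broken, still_failing) as sorted lists.
    """
    def key(failure):
        return (failure[0], failure[2])

    b = {key(f) for f in before}
    a = {key(f) for f in after}
    return sorted(b - a), sorted(a - b), sorted(a & b)

# Hypothetical (right, actual, wrong) triples for illustration only.
before = [("access", "acres", "acess"), ("forward", "toward", "foward")]
after = [("forward", "toward", "foward"), ("their", "there", "thier")]

fixed, newly_broken, still_failing = diff_failures(before, after)
print("fixed:", fixed)
print("newly broken:", newly_broken)
print("still failing:", still_failing)
```

In the notebook this would be called as `diff_failures(failed0, failed)`.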
