# 2. Simple Spellchecker

One simple application of our corpus is a basic spellchecker, like the one you might use in Microsoft Word.

There are two approaches to building a spellchecker:
1. Store a huge list of words in the language, and check that every typed word is also a word in that list.
2. Store just roots, and use morphological information to determine if a typed word is a valid form of the root.

While approach 2 seems more elegant, it takes a lot more work to implement effectively. We will use approach 1 for now, which is how standard tools such as Microsoft's spellchecker work.

First, we'd like to compile a list of all the words we have in our corpus. To do this, we'll need to process the corpus further.

In [1]:
import os

# First, let's combine all of our corpus entries into a single, huge string.
# We will save one corpus entry, 68, for testing
corpus_directory = "corpus-usp"

corpus = ""

for file_name in os.listdir(corpus_directory):
    # Skip the held-out test file and anything that is not a .txt file
    if file_name == "68.txt" or not file_name.endswith(".txt"):
        continue
        
    # Read the file as a string
    file_path = os.path.join(corpus_directory, file_name)
    with open(file_path, 'r') as file:
        file_contents = file.read()
        corpus += (file_contents + "\n")
        
print(corpus[:1000])

antonses chib'aanik tanb'ij iin.
Jinon li... tijb'ij taq qaqaaj,
qachuuch.
Pwes ti... toos qaqaaj,
qachuuch.
Tinb'ij iin qaqaaj,
qachuuch,
tinb'ij iin li qamaam ójor taq tziij.
Tijb'ij taq qamaam qet',
ójor,
ójor taq tziij li.
Ójor,
cha' kongan chee',
kongan sii'.
Ri' li tijb'ij taq,
kongan sii',
kongana chee' naqaaj.
Nimaq taq chee',
entons ri' li tijb'ij taq.
Toos kwand wi' chee',
cha'.
Tpeti jaab',
cha',
kwando wi' ta't.
Pores tijb'ij taq li ójor taq tziij.
Kwand ooj,
xojk'iyk ojb'enaa li sii'.
Atb'i'tqa' li sii'.
Jataq li sii',
per makataq maq ra chee'.
Porke maq ra chee' nimi' jq'iij.
I nosol ma'an taq re maq tra chee'.
Makach' taq jwich taq ra chee'.
Ri' li atyuter taq lajasok,
ta' t'el awanm atkamk,
cha' taq.
Toons ri' li,
toos ri' limaq taq chee'.
Toos maa b'ensaj k'ex re chee'.
Porke chee' re nimi' jq'iij,
nimi' jpetiik,
cha' taq.
Chee' marechtqe,
marechtqe ju...n,
marechtqe kib',
uxub' q'iij jwich.
Noke nimi' jq'iij,
cha' taq.
Ri rere tijya's,
tijya' teew,
cha' taq.
Tijya' te

In [2]:
# How many characters are in our corpus?
len(corpus)

255361

Accent marks are used to indicate tone in the transcriptions. However, a speaker might not write them, so we will strip them.

In [3]:
import unicodedata

def strip_accents(text):
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                  if unicodedata.category(c) != 'Mn')

strip_accents("ójor taq tziij kita' jaa,")

"ojor taq tziij kita' jaa,"
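
To see why this works, it may help to look at what NFD normalization does to a single character. The sketch below is only an illustration and uses nothing beyond the standard `unicodedata` module: the precomposed `ó` decomposes into a base letter plus a combining acute accent, and the combining mark has category `Mn` (nonspacing mark), which is exactly what `strip_accents` filters out.

```python
import unicodedata

# "ó" as a single precomposed code point (U+00F3)
composed = "\u00f3"

# NFD splits it into a base letter followed by a combining mark
decomposed = unicodedata.normalize("NFD", composed)

# Show each resulting character with its Unicode category and name
for char in decomposed:
    print(repr(char), unicodedata.category(char), unicodedata.name(char))

# Dropping the nonspacing marks leaves only the base letter
base_only = "".join(c for c in decomposed if unicodedata.category(c) != "Mn")
print(base_only)
```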

In [4]:
corpus = strip_accents(corpus)

In [5]:
# Let's also make everything lowercase
corpus = corpus.lower()

## Create a word list
Now, let's create a list of every word that occurs in our corpus, using word tokenization. We will ignore punctuation marks and assume that a word is surrounded by spaces or punctuation. Additionally, we'll keep a count of the frequency of each word for use later on.

In [6]:
# Let's see what characters appear in our corpus
set(corpus)

{'\n',
 ' ',
 '!',
 "'",
 '(',
 ')',
 ',',
 '.',
 '/',
 ':',
 '?',
 '[',
 ']',
 'a',
 'b',
 'c',
 'd',
 'e',
 'f',
 'g',
 'h',
 'i',
 'j',
 'k',
 'l',
 'm',
 'n',
 'o',
 'p',
 'q',
 'r',
 's',
 't',
 'u',
 'v',
 'w',
 'x',
 'y',
 'z',
 '¡',
 '¿',
 'ß',
 '≈'}

In [7]:
import re

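# Find just words: runs of word characters and apostrophes.
# (Inside a character class, "|" is a literal pipe, not "or", so we leave it out.)
word_regex = r"[\w']+"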

def tokenize(text):
    return re.findall(word_regex, text)

words = tokenize(corpus)
words[:15]

['antonses',
 "chib'aanik",
 "tanb'ij",
 'iin',
 'jinon',
 'li',
 "tijb'ij",
 'taq',
 'qaqaaj',
 'qachuuch',
 'pwes',
 'ti',
 'toos',
 'qaqaaj',
 'qachuuch']
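
The apostrophe appears inside many words in this corpus (for example `tijb'ij`), which is why it belongs in the tokenizer's character class. As a small illustration on a line taken from the corpus output above, the sketch below compares a plain `\w+` pattern with one that also accepts apostrophes.

```python
import re

# A lowercased, accent-stripped line from the corpus sample
sample = "tijb'ij taq qamaam qet', ojor"

# With \w alone, the apostrophe splits words apart
print(re.findall(r"\w+", sample))

# Adding the apostrophe to the character class keeps each word whole
print(re.findall(r"[\w']+", sample))
```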

In [8]:
# Now, let's get a set of words and their frequencies
lexicon = dict()
for word in words:
    if word in lexicon:
        lexicon[word] += 1
    else:
        lexicon[word] = 1

%store lexicon
len(lexicon)

Stored 'lexicon' (dict)


6771

In [9]:
# Let's see what the twenty most common words are
sorted(lexicon.items(), key=lambda x: x[1], reverse=True)[:20]

[('taq', 1337),
 ('re', 1267),
 ('li', 1203),
 ("cha'", 1010),
 ('i', 988),
 ('man', 809),
 ("ta'", 782),
 ('jun', 740),
 ("wi'", 581),
 ('ra', 575),
 ('ri', 419),
 ("ri'", 386),
 ('anm', 361),
 ('chaq', 360),
 ('chi', 350),
 ('ke', 328),
 ('ya', 322),
 ('chik', 316),
 ('iin', 283),
 ('qe', 265)]
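
As an aside, the standard library's `collections.Counter` can do the counting and the ranking in fewer lines. The sketch below runs on a tiny made-up word list (`sample_words` is a hypothetical stand-in for the `words` list above), so the counts it prints are not corpus figures.

```python
from collections import Counter

# A tiny stand-in for the `words` list built above (hypothetical sample)
sample_words = ["taq", "re", "taq", "li", "re", "taq"]

# Counter tallies each word in a single pass
sample_lexicon = Counter(sample_words)

# most_common(n) returns the n most frequent (word, count) pairs
print(sample_lexicon.most_common(2))
```

Since `Counter` is a subclass of `dict`, membership tests such as `word in sample_lexicon` behave the same way as with the plain dictionary used in this notebook.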

## Building a spellchecker
Now we're ready to build our spellchecker program. To do this, we will parse and tokenize the user's input, and then check each word against our lexicon. If a word doesn't appear in the lexicon, we will return it in the list of misspelled words, along with the position where it occurs.

In [10]:
def spellcheck(s):
    # Preprocess the input the same way we preprocessed the corpus
    s = strip_accents(s)
    s = s.lower()
    input_words = set(tokenize(s))
    
    misspelled = []
        
    for word in input_words:
        if word not in lexicon:
            # A spelling error!
            # Find every position of the word in the preprocessed text.
            # The lookarounds use the tokenizer's character class to mark word
            # edges without consuming them, so back-to-back repeats are found too.
            # re.escape is a safeguard in case the tokenizer pattern ever changes.
            word_regex = rf"(?<![\w']){re.escape(word)}(?![\w'])"
            for match in re.finditer(word_regex, s):
                misspelled.append((word, match.start()))
            
    return sorted(misspelled, key=lambda x: x[1])

print("Misspelled:")
for misspelled_word, location in spellcheck("Kwand xink'uli'k', re ójr taq tzijj in ák'el na."):
    print(f"{misspelled_word} at {location}")

Misspelled:
tzijj at 30


At this point, we are detecting spelling errors and reporting their positions. A list of tuples isn't a friendly interface, though, so let's make it nicer to enter text and see the results.

In [12]:
%pip install termcolor
%pip install gradio

Note: you may need to restart the kernel to use updated packages.

In [13]:
import termcolor
import ipywidgets as widgets
from IPython.display import clear_output, display

def display_spellchecked(text):
    misspellings = spellcheck(text)

    # Collect the index of every character that belongs to a misspelled word.
    # A set keeps the membership test below fast.
    misspelled_indices = set()
    
    for word, start_index in misspellings:
        misspelled_indices.update(range(start_index, start_index + len(word)))
        
    # Print the text one character at a time, highlighting misspelled characters
    for i, char in enumerate(text):
        if i in misspelled_indices:
            termcolor.cprint(char, "red", end="", attrs=["underline"])
        else:
            print(char, end="")
    
    return misspellings


# Prompts the user for input and spellchecks it
def spellchecker():
    text = widgets.Text(value='',
                        placeholder='Start typing some text...',
                        disabled=False)
    out = widgets.Output()
    display(text)
    display(out)
    
    def on_change(change):
        text = change['new']
        with out:
            clear_output()
            display_spellchecked(text)

    text.observe(on_change, names=["value"])
    
spellchecker()

Text(value='', placeholder='Start typing some text...')

Output()

In [15]:
import gradio 

demo = gradio.Interface(fn=spellcheck, inputs="text", outputs="text", live=True)
demo.launch()

Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.




## Allowing for new words
Let's see how this behaves against a real, unseen text from our corpus. 

In [38]:
test_text = ""

with open("corpus-usp/68.txt", 'r') as file:
    test_text = file.read()

_ = display_spellchecked(test_text[:1000])

in pwes in tinyol pwes loke nmoo oj anm **ojchapon** la jaa.
I kwando oj **b'itk** ri' **tqamaaj** jb'anik **qames**
i **tqach'aj** qlen qe
**tqach'aj** **qatzi**
i ri' **tqachaaq** chuch kaa'.
xaq jun kitz re qadesayun
i **tchaqmaa**

That's a lot of falsely flagged words! Because our system was built from only a small corpus, it does not contain every valid word in the language. Common word processors address this by letting the user add a word to the dictionary, so let's modify our tool to do the same.
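
One rough way to put a number on this coverage gap is to measure the fraction of tokens in a held-out text that are missing from the lexicon. The sketch below is a minimal illustration on made-up data: `unknown_word_rate`, `toy_lexicon`, and `toy_tokens` are hypothetical names, and the counts are not taken from the corpus. In the notebook, the inputs would be the tokenized, preprocessed test text and `lexicon`.

```python
def unknown_word_rate(tokens, known_words):
    """Return the fraction of tokens that are missing from known_words."""
    if not tokens:
        return 0.0
    unknown = [token for token in tokens if token not in known_words]
    return len(unknown) / len(tokens)

# Hypothetical toy data; in the notebook these would be
# tokenize(strip_accents(test_text).lower()) and lexicon
toy_lexicon = {"in": 10, "pwes": 4, "oj": 7, "anm": 3}
toy_tokens = ["in", "pwes", "oj", "anm", "ojchapon"]

# One of the five tokens is not in the toy lexicon
print(unknown_word_rate(toy_tokens, toy_lexicon))
```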

In [51]:
def add_to_lexicon(word):
    if word in lexicon:
        lexicon[word] += 1
    else:
        lexicon[word] = 1

# A better spellchecker that lets you handle misspellings
def spellchecker2():
    text = widgets.Text(value='',
                        placeholder='Start typing some text...',
                        disabled=False)
    out = widgets.Output()
    display(text)
    display(out)
    
    def on_change(change):
        text = change['new']
        with out:
            clear_output()
            mispellings = display_spellchecked(text)
            print()
            
            for word, start in mispellings:
                print("\nMisspelled: " + termcolor.colored(word, 'red'))
                # print("(a)dd to dictionary, (i)gnore, add a(l)l to dictionary")
                add_button = widgets.Button(description="Add to dictionary")
                display(add_button)
                
                # Bind the current word as a default argument; otherwise every
                # button would add whichever word the loop visited last.
                def add_button_clicked(b, word=word):
                    add_to_lexicon(word)
                    on_change(change)
                add_button.on_click(add_button_clicked)


    text.observe(on_change, names=["value"])
        
spellchecker2()

Text(value='', placeholder='Start typing some text...')

Output()

Now, we can easily add any words that are correctly spelled to our dictionary, and they will not be marked as errors in the future!
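
Added words live only in the in-memory `lexicon` dictionary, so they are lost when the kernel restarts unless the dictionary is saved again. One possible way to keep them, sketched here under the assumption that a JSON file is acceptable storage, is to write the dictionary to disk and load it back later. The helper names `save_lexicon` and `load_lexicon`, the file name, and the sample counts are all hypothetical.

```python
import json
import os
import tempfile

def save_lexicon(lexicon, path):
    """Write the word-frequency dictionary to a UTF-8 JSON file."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(lexicon, f, ensure_ascii=False)

def load_lexicon(path):
    """Read a word-frequency dictionary back from a JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Hypothetical sample lexicon; the counts are made up for illustration
toy_lexicon = {"taq": 5, "tziij": 2, "cha'": 3}

# Use a temporary directory so the example leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    lexicon_path = os.path.join(tmp_dir, "lexicon.json")
    save_lexicon(toy_lexicon, lexicon_path)
    restored = load_lexicon(lexicon_path)

print(restored == toy_lexicon)
```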

## Spell Correction
Lastly, it would be nice for our spellchecker to suggest corrections when it finds an error. To do this, we need to determine which words in our lexicon are closest to what was typed. We will use **edit distance**, a measure of how many single-character edits (insertions, deletions, substitutions) it takes to get from one string to another.
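
To make the idea concrete, here is a minimal sketch of one common way to compute this distance (the Levenshtein distance) with dynamic programming. The function name `levenshtein` is our own. This is an illustration of the concept rather than the library's implementation, though `nltk.edit_distance` with its default arguments should give the same values.

```python
def levenshtein(source, target):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn source into target."""
    # previous_row[j] is the distance between the source prefix processed
    # so far and target[:j]; for the empty prefix that is j insertions
    previous_row = list(range(len(target) + 1))

    for i, source_char in enumerate(source, start=1):
        # Turning source[:i] into the empty string takes i deletions
        current_row = [i]
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current_row.append(min(
                previous_row[j] + 1,                      # delete source_char
                current_row[j - 1] + 1,                   # insert target_char
                previous_row[j - 1] + substitution_cost,  # substitute or match
            ))
        previous_row = current_row

    return previous_row[-1]

print(levenshtein("tzijj", "tziij"))  # one substitution
print(levenshtein("tzijj", "tzij"))   # one deletion
```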

In [48]:
import nltk

def spelling_suggestions(word, n):
    # 1. Calculate the edit distance between the word and every word in the lexicon
    candidate_spellings = []
    for item in lexicon.items():
        edit_distance = nltk.edit_distance(item[0], word)
        candidate_spellings.append((item[0], item[1], edit_distance))
    
    # 2. Find the top n closest words, sorted first by edit distance x[2] and then by word frequency x[1]
    sorted_candidates = sorted(candidate_spellings, key=lambda x: (x[2], -x[1]))
    top_n_candidates = sorted_candidates[:n]
    top_n_words_only = [candidate[0] for candidate in top_n_candidates]
    return top_n_words_only

spelling_suggestions("tzijj", 3)

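['tzijj', 'tziij', 'tzij']

The typed word `tzijj` comes first because its edit distance to itself is 0, which means it was already in the lexicon when this cell ran, most likely because it had been added with the "Add to dictionary" button above.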

In [52]:
def spellchecker3():
    text = widgets.Text(value='',
                        placeholder='Start typing some text...',
                        disabled=False)
    out = widgets.Output()
    display(text)
    display(out)
    
    def on_change(change):
        text = change['new']
        with out:
            clear_output()
            mispellings = display_spellchecked(text)
            print()
            
            for word, start in mispellings:
                print("\nMisspelled: " + termcolor.colored(word, 'red'))
                # print("(a)dd to dictionary, (i)gnore, add a(l)l to dictionary")
                add_button = widgets.Button(description="Add to dictionary")
                display(add_button)
                
                # Bind the current word as a default argument; otherwise every
                # button would add whichever word the loop visited last.
                def add_button_clicked(b, word=word):
                    add_to_lexicon(word)
                    on_change(change)
                add_button.on_click(add_button_clicked)
                
                # join() also works if fewer than three suggestions come back
                suggestions = spelling_suggestions(word, 3)
                print(f"Suggestions: {', '.join(suggestions)}")


    text.observe(on_change, names=["value"])

spellchecker3()

Text(value='', placeholder='Start typing some text...')

Output()

## Summary
In this tutorial, we built a spellchecker tool for a low-resource language. This included:
- Building a lexicon from source texts
- Detecting misspelled words
- Predicting the correct spelling using similarity metrics

To see the spellchecker as a standalone app, go to **2a. Spellchecker**