# Spellchecker

One simple application we can use our corpus for is creating a basic spellchecker, like you might use in Microsoft Word.

There are two broad approaches to building a spellchecker:
1. Store a huge list of words in the language, and check that every typed word is also a word in that list.
2. Store just roots, and use morphological information to determine if a typed word is a valid form of the root.

While approach 2 seems preferable in principle, it takes much more work to implement effectively. In fact, tools like Word tend to use approach 1, so we'll do the same.
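
As a rough illustration of approach 1, the check boils down to a set membership test. The word list and the `is_known` helper below are made-up toy examples, not part of the corpus or of any real tool.

```python
# Toy word list (hypothetical; a real list would be built from a corpus)
known_words = {"the", "cat", "sat", "on", "mat"}

def is_known(word: str) -> bool:
    # Approach 1: a word is accepted only if it appears verbatim in the list
    return word.lower() in known_words

print(is_known("Cat"))   # True
print(is_known("cats"))  # False: inflected forms must be listed separately
```

The second call hints at the main cost of this approach: every inflected form has to appear in the list on its own.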

## Data Preparation
### Load the corpus
First, we'd like to compile a list of all the words we have in our corpus. To do this, we'll read in each file and concatenate them into one giant string.

In [22]:
import os
from typing import List, Dict, Tuple

# If you're using your own corpus, change this to the correct directory
corpus_directory = "../corpora/usp"

# First, let's combine all of our corpus entries into a single, huge string.
corpus = ""

# Loop over each file in the corpus so we can read it in
for file_name in os.listdir(corpus_directory):
    
    # We will save one corpus entry, 68, for testing
    if file_name == "68.txt" or not file_name.endswith(".txt"):
        continue
        
    # Read the current file as a string
    file_path = os.path.join(corpus_directory, file_name)
    # Assumes the corpus files are UTF-8 encoded
    with open(file_path, 'r', encoding='utf-8') as file:
        file_contents = file.read()
        corpus += (file_contents + "\n")
        
print(corpus[:300])

Byeen pwees e... chwaaj tanyool júnkitz,
neen jb'aniik xan k'ex loq'laj uleew chi qawch.
Pero ajki' maas ójor
raaj lajori juun kawunaq junaab' o se'a kwarenta años.
E... jb'aniik k'ex loq'laj muund,
xan porke nen b'i ri re e..., e.... ójor xqil na,
ki ta' tzaqsáj kaxlaan mees,
ta' tib'ansáj juun seb


In [2]:
# How many characters are in our corpus?
len(corpus)

255361

### Normalize characters
In Uspanteko, accent marks are used to indicate tone in the transcriptions. However, a speaker might not type them, so we strip them from the corpus and, later, from any text we check.

<div class="alert alert-block alert-info">
    If accent marks are an essential part of your language's writing system, feel free to skip this cell.
</div>

In [16]:
import unicodedata

def strip_accents(text: str) -> str:
    # Decompose each character into a base character plus combining marks (NFD),
    # then drop the combining marks (unicode category 'Mn')
    return ''.join(c for c in unicodedata.normalize('NFD', text)
                  if unicodedata.category(c) != 'Mn')

corpus = strip_accents(corpus)
strip_accents("ójor taq tziij kita' jaa,")

"ojor taq tziij kita' jaa,"
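
To see why this works, it may help to look at what NFD normalization does to a single accented character. The snippet below is a small illustration using only the standard library; the variable name `decomposed` is ours.

```python
import unicodedata

# "\u00f3" is the precomposed character "o with acute accent"
decomposed = unicodedata.normalize("NFD", "\u00f3")

# After NFD, the accent is a separate combining character
for ch in decomposed:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.category(ch))
# U+006F LATIN SMALL LETTER O Ll
# U+0301 COMBINING ACUTE ACCENT Mn
```

Dropping every character whose category is `Mn` therefore leaves just the base letter.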

In [4]:
# Finally, let's also make everything lowercase
corpus = corpus.lower()

## Create a word list
Now, let's create a list of every word that occurs in our corpus. We will ignore punctuation marks and assume that a word is surrounded by spaces or punctuation. Additionally, we'll keep a count of the frequency of each word for use later on.

In [8]:
# Let's see what characters appear in our corpus
# Converting the string to a set gives us its unique characters
print(set(corpus))

{'o', 'x', '≈', '¡', 'm', "'", 'n', 's', 'r', 'g', 'h', '(', 'c', 'j', 'z', '[', '/', 'l', '.', ']', 'u', 'y', 'b', '\n', 'ß', 'k', 'a', 'q', 'f', ' ', 'p', 'e', ',', '!', 'i', 't', 'd', '?', ')', 'v', '¿', ':', 'w'}


### Tokenize words using a regular expression
> **Tokenization** refers to the process of breaking a string up into tokens. Tokens might be words, characters, or morphemes. In this case, we are tokenizing into words.

We will use a [regular expression](./skills/regex.ipynb) that looks for clumps of letters and apostrophes. When we run the regex over our text, each clump it finds is a separate word.

For instance, in the following string:

```ójor taq tziij kita' jaa```

The regex will produce:

```["ojor", "taq", "tziij", "kita'", "jaa"]```

In Uspanteko, words are always divided by punctuation or whitespace. Therefore, we can assume each clump that contains only letters and apostrophes is a word.

<div class="alert alert-block alert-warning">
Your language might need a custom regex for detecting words. Please refer to the lesson on regular expressions for information, and you can use a regex testing tool such as <a href='https://regex101.com'>regex101</a> to make sure your regex works the way you expect.
</div>
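
As a quick sanity check of this idea, the sketch below compares two candidate patterns on a short sample string. The sample and the variable names are ours, chosen only for illustration.

```python
import re

sample = "ojor taq tziij kita' jaa, xink'uli'k'"

# \w+ alone stops at apostrophes, so words such as "kita'" get broken up
plain = re.findall(r"\w+", sample)
print(plain)
# ['ojor', 'taq', 'tziij', 'kita', 'jaa', 'xink', 'uli', 'k']

# Adding the apostrophe to the character class keeps those words whole
with_apostrophes = re.findall(r"[\w']+", sample)
print(with_apostrophes)
# ['ojor', 'taq', 'tziij', "kita'", 'jaa', "xink'uli'k'"]
```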

In [21]:
import re

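# Find just words
# If your language uses some other character within words (like hyphens) you may need to update this regex appropriately
# Inside a character class, | is a literal pipe rather than "or", so we leave it out
# Note that \w also matches digits and underscores
word_regex = r"[\w']+"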

# Takes a string and breaks it into a list of words
def tokenize(text: str) -> List[str]:
    return re.findall(word_regex, text)

words = tokenize(corpus)
words[:15]

['Byeen',
 'pwees',
 'e',
 'chwaaj',
 'tanyool',
 'júnkitz',
 'neen',
 "jb'aniik",
 'xan',
 "k'ex",
 "loq'laj",
 'uleew',
 'chi',
 'qawch',
 'Pero']

### Create a lexicon

> A **lexicon** refers to the entire vocabulary of words used in the corpus

To create a lexicon, we will iterate over every word in the entire corpus. We use a [dictionary](./skills/sets.ipynb) to keep track of each word and its frequency. The *keys* of the dictionary are each word in the vocabulary, and the *values* are the number of times the word appears in the corpus. For instance, we might see the following entry in an English corpus:

```{ 'the': 10000 }```

In [12]:
lexicon = dict()

for word in words:
    # Check if the word is in the lexicon already (we've seen it before)
    # If so, add one to the count
    if word in lexicon:
        lexicon[word] += 1
    else:
        lexicon[word] = 1

# Store the lexicon to permanent storage so we can retrieve it later if needed
%store lexicon

print(f"Created lexicon with {len(lexicon)} unique words")

Stored 'lexicon' (dict)
Created lexicon with 6771 unique words


In [13]:
# Let's see what the twenty most common words are
# This line sorts the lexicon by frequency and picks the first 20 items
sorted(lexicon.items(), key=lambda x: x[1], reverse=True)[:20]

[('taq', 1337),
 ('re', 1267),
 ('li', 1203),
 ("cha'", 1010),
 ('i', 988),
 ('man', 809),
 ("ta'", 782),
 ('jun', 740),
 ("wi'", 581),
 ('ra', 575),
 ('ri', 419),
 ("ri'", 386),
 ('anm', 361),
 ('chaq', 360),
 ('chi', 350),
 ('ke', 328),
 ('ya', 322),
 ('chik', 316),
 ('iin', 283),
 ('qe', 265)]
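
For comparison, Python's standard library offers `collections.Counter`, which does the same counting in one step. The sketch below uses a made-up toy token list rather than the real corpus, and the names `tokens` and `toy_lexicon` are ours.

```python
from collections import Counter

# Hypothetical toy token list standing in for the tokenized corpus
tokens = ["taq", "re", "taq", "li", "taq", "re"]

# Counter is a dictionary subclass that maps each item to its count
toy_lexicon = Counter(tokens)

print(toy_lexicon.most_common(2))  # [('taq', 3), ('re', 2)]
print(toy_lexicon["missing"])      # 0 (unseen words count as zero)
```

Because a `Counter` behaves like a dictionary, membership checks such as `word in toy_lexicon` work the same way as with the hand-built lexicon.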

## Building a spellchecker
At this point, we have a lexicon with all of our words and their frequencies. Now we're ready to build our spellchecker program. 

### Find misspelled words
Let's create a function that will take a sentence and find any misspelled words. To do this, we will do the following:
1. Preprocess the sentence (remove accents and make everything lowercase), then tokenize it.
2. For each word, check if it occurs in our lexicon. If not, it's a spelling error.
3. Use a regex to find where the word occurs in the text.
4. Return the misspelled words and their positions.

<div class="alert alert-block alert-info">
Right now, any time we see a word that isn't in our lexicon, we report it as a spelling error (even if it's a new, correctly spelled word). We'll improve this later.
</div>

In [20]:
def spellcheck(s: str) -> List[Tuple[str, int]]:
    """Finds misspelled words in a string.
    :return: A list of tuples. Each tuple is (word, index) where `word` is the misspelled word and `index` is the index where it occurs in the normalized text.
    """
    # 1. Preprocess and tokenize
    s = strip_accents(s)
    s = s.lower()
    
    # We can use a set, so we don't have to check duplicate words
    input_words = set(tokenize(s))
    
    # 2. Check each word in the input
    misspelled = []
    for word in input_words:
        # Does the word occur in our lexicon?
        if word not in lexicon:
            # 3. Find the indices of the word in the normalized text
            # The lookarounds make sure the match is not part of a longer word,
            # and re.escape protects against characters that are special in a regex
            position_regex = rf"(?<![\w'])({re.escape(word)})(?![\w'])"
            
            # There might be multiple matches if we misspelled a word multiple times
            for match in re.finditer(position_regex, s):
                misspelled.append((word, match.start(1)))
    
    # 4. Return the misspelled words, sorted by their position
    return sorted(misspelled, key=lambda x: x[1])


# This sentence has one misspelling ('tzijj')
# If using your own language, replace with a test sentence of your own
test_sentence = "Kwand xink'uli'k', re ójr taq tzijj in ák'el na."
print(test_sentence)

mispellings = spellcheck(test_sentence)
for mispelled_word, location in mispellings:
    print(f"{mispelled_word} at {location}")

Kwand xink'uli'k', re ójr taq tzijj in ák'el na.
tzijj at 30


### Improving the user experience

Our `spellcheck` function detects spelling errors, but a list of words and positions isn't very friendly for a user. Let's make it easier to type in text and see the errors highlighted.

In [23]:
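# Requires the third-party packages termcolor and ipywidgets
import termcolor
import ipywidgets as widgets
from IPython.display import clear_output, display

def display_spellchecked(text):
    mispellings = spellcheck(text)

    # Collect the index of every character that belongs to a misspelled word
    # (a set keeps the membership check below fast)
    mispelled_indices = set()
    for word, start_index in mispellings:
        mispelled_indices.update(range(start_index, start_index + len(word)))
        
    # Print the text one character at a time, highlighting misspelled characters
    for i, char in enumerate(text):
        if i in mispelled_indices:
            termcolor.cprint(char, "red", end="", attrs=["underline"])
        else:
            print(char, end="")
    
    return mispellings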


# Prompts the user for input and spellchecks it
def spellchecker():
    text = widgets.Text(value='',
                        placeholder='Start typing some text...',
                        disabled=False)
    out = widgets.Output()
    display(text)
    display(out)
    
    def on_change(change):
        text = change['new']
        with out:
            clear_output()
            display_spellchecked(text)

    text.observe(on_change, names=["value"])
    
spellchecker()


In [15]:
import gradio 

demo = gradio.Interface(fn=spellcheck, inputs="text", outputs="text", live=True)
demo.launch()

Running on local URL:  http://127.0.0.1:7861

To create a public link, set `share=True` in `launch()`.




## Allowing for new words
Let's see how this behaves on real, unseen text: the corpus entry (68) that we set aside earlier.

In [38]:
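# Read the held-out entry (68.txt) that we excluded when building the lexicon
test_path = os.path.join(corpus_directory, "68.txt")
with open(test_path, 'r', encoding='utf-8') as file:
    test_text = file.read()

_ = display_spellchecked(test_text[:1000])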

(Words flagged as misspelled, which the notebook prints in underlined red, are marked with `**` below.)

in pwes in tinyol pwes loke nmoo oj anm **ojchapon** la jaa.
I kwando oj **b'itk** ri' **tqamaaj** jb'anik **qames**
i **tqach'aj** qlen qe
**tqach'aj** **qatzi**
i ri' **tqachaaq** chuch kaa'.
xaq jun kitz re qadesayun
i **tchaqmaa**

That's a lot of false positives! Because our lexicon was built from a small corpus, it does not contain every valid word in the language. Common word processing tools address this by letting the user add a word to the dictionary, so let's modify our tool to do that.
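
One way to put a number on this problem is the share of tokens in a text that are missing from the lexicon, often called the out-of-vocabulary rate. The sketch below uses made-up toy data; `oov_rate`, `toy_lexicon`, and `toy_tokens` are hypothetical names introduced only for this illustration.

```python
from typing import Dict, List

def oov_rate(tokens: List[str], known: Dict[str, int]) -> float:
    # Fraction of tokens that do not appear in the lexicon `known`
    if not tokens:
        return 0.0
    unknown = sum(1 for token in tokens if token not in known)
    return unknown / len(tokens)

toy_lexicon = {"i": 5, "oj": 2, "ri'": 3}   # hypothetical word counts
toy_tokens = ["i", "oj", "tqamaaj", "ri'"]  # hypothetical held-out tokens
print(oov_rate(toy_tokens, toy_lexicon))    # 0.25
```

A high rate on held-out text suggests that many flagged words are simply unseen rather than misspelled.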

In [51]:
def add_to_lexicon(word):
    if word in lexicon:
        lexicon[word] += 1
    else:
        lexicon[word] = 1

# A better spellchecker that lets you add flagged words to the dictionary
def spellchecker2():
    text = widgets.Text(value='',
                        placeholder='Start typing some text...',
                        disabled=False)
    out = widgets.Output()
    display(text)
    display(out)
    
    def on_change(change):
        text = change['new']
        with out:
            clear_output()
            mispellings = display_spellchecked(text)
            print()
            
            for word, start in mispellings:
                print("\nMisspelled: " + termcolor.colored(word, 'red'))
                add_button = widgets.Button(description="Add to dictionary")
                display(add_button)
                
                # Bind the current word as a default argument; otherwise every
                # button would add whichever word the loop saw last
                def add_button_clicked(b, word=word):
                    add_to_lexicon(word)
                    on_change(change)
                add_button.on_click(add_button_clicked)


    text.observe(on_change, names=["value"])
        
spellchecker2()

Text(value='', placeholder='Start typing some text...')

Output()

Now, we can easily add any words that are correctly spelled to our dictionary, and they will not be marked as errors in the future!

## Spell Correction
Lastly, it would be nice to update our spellchecker so it suggests correct spellings when it finds an error. To do this, we need to determine which words in our lexicon are closest to what was typed. We will use **edit distance**, a measure of how many single-character edits (insertions, deletions, substitutions) it takes to get from one string to another.
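
To make the idea concrete, here is a plain-Python sketch of the standard dynamic-programming algorithm for this kind of edit distance (often called Levenshtein distance). The function name `levenshtein` is ours, and this is only an illustration; a library implementation may differ in its details and options.

```python
def levenshtein(a: str, b: str) -> int:
    # previous[j] holds the distance between the first i-1 characters of a
    # and the first j characters of b
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # turning i characters into an empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep, if equal)
            ))
        previous = current
    return previous[-1]

print(levenshtein("tzijj", "tziij"))     # 1 (one substitution)
print(levenshtein("tzijj", "tzij"))      # 1 (one deletion)
print(levenshtein("kitten", "sitting"))  # 3
```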

In [48]:
import nltk

def spelling_suggestions(word: str, n: int) -> List[str]:
    # 1. Calculate the edit distance between the word and every word in the lexicon
    candidate_spellings = []
    for candidate, frequency in lexicon.items():
        edit_distance = nltk.edit_distance(candidate, word)
        candidate_spellings.append((candidate, frequency, edit_distance))
    
    # 2. Sort by edit distance (closest first), breaking ties by word frequency (most frequent first)
    sorted_candidates = sorted(candidate_spellings, key=lambda x: (x[2], -x[1]))

    # 3. Keep only the words from the top n candidates
    return [candidate for candidate, _, _ in sorted_candidates[:n]]

spelling_suggestions("tzijj", 3)

['tzijj', 'tziij', 'tzij']

Note that `tzijj` itself comes first. It has an edit distance of 0, which means it was already in the lexicon when this cell ran (for example, after being added with the "Add to dictionary" button). The next two suggestions, `tziij` and `tzij`, are each one edit away.

In [52]:
def spellchecker3():
    text = widgets.Text(value='',
                        placeholder='Start typing some text...',
                        disabled=False)
    out = widgets.Output()
    display(text)
    display(out)
    
    def on_change(change):
        text = change['new']
        with out:
            clear_output()
            mispellings = display_spellchecked(text)
            print()
            
            for word, start in mispellings:
                print("\nMisspelled: " + termcolor.colored(word, 'red'))
                add_button = widgets.Button(description="Add to dictionary")
                display(add_button)
                
                # Bind the current word as a default argument; otherwise every
                # button would add whichever word the loop saw last
                def add_button_clicked(b, word=word):
                    add_to_lexicon(word)
                    on_change(change)
                add_button.on_click(add_button_clicked)
                
                # Join the suggestions so this also works if fewer than 3 are returned
                suggestions = spelling_suggestions(word, 3)
                print("Suggestions: " + ", ".join(suggestions))


    text.observe(on_change, names=["value"])

spellchecker3()

Text(value='', placeholder='Start typing some text...')

Output()

## Summary
In this tutorial, we built a spellchecker tool for a low-resource language. This included:
- Building a lexicon from source texts
- Detecting misspelled words
- Suggesting correct spellings using edit distance

To see the spellchecker as a standalone app, go to **2a. Spellchecker**