# **Spellchecker**

<div>
<img src="../assets/spelling.png" width="500" style=" display: block; margin-left: auto; margin-right: auto;"/>
</div>

Let's create an app to spellcheck text in our language. While Microsoft Word and other text editors support spellchecking in high-resource languages such as English or Spanish, most low-resource languages aren't supported.

#### How do we do it?
There are two possible approaches to creating a spellchecker system. 
1. Store a huge list of every word in the language. When the user types a word, check if the word is in that list.
2. Store the *stems* or *root words* for every word in your language. When a user types a word, check if the word is a valid form of one of the stems. This will require knowledge about the morphology of the language.

Although you might guess otherwise, most modern software uses **Approach 1**. It's easier to implement and runs faster.
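
The difference between the two approaches can be seen in a toy sketch. The word list, stems, and suffixes below are made-up English examples (not Uspanteko data), and `is_valid_by_list` and `is_valid_by_stem` are hypothetical helpers; the stem version only handles simple suffixing, which is far less than a real morphological analyzer would need.

```python
# Toy data for illustration only (assumed, not taken from the Uspanteko corpus)
word_list = {"walk", "walks", "walked", "talk", "talks", "talked"}
stems = {"walk", "talk"}
suffixes = ["", "s", "ed"]

def is_valid_by_list(word: str) -> bool:
    """Approach 1: look the word up in a set of every known word form."""
    return word in word_list

def is_valid_by_stem(word: str) -> bool:
    """Approach 2: check whether the word is a known stem plus a known suffix."""
    for suffix in suffixes:
        if word.endswith(suffix):
            stem = word[: len(word) - len(suffix)]
            if stem in stems:
                return True
    return False

print(is_valid_by_list("walked"), is_valid_by_stem("walked"))
print(is_valid_by_list("walkd"), is_valid_by_stem("walkd"))
```

Approach 1 needs no linguistic knowledge, only data. Approach 2 needs far fewer stored entries but depends on rules that someone has to write for the language.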

***

## **Data Preparation**

### Load the corpus

<div>
<img src="../assets/corpus.png" width="500" style=" display: block; margin-left: auto; margin-right: auto;"/>
</div>

The **corpus** we will use for this project and future projects is an [Uspanteko](https://www.ethnologue.com/language/usp) corpus. Uspanteko is a Mayan language with around 5,000 speakers, a Latin-based script, and concatenative morphology.

Our corpus has 23 plain text files of Uspanteko text. We will read all the files and concatenate them together.

<div class="alert alert-block alert-warning">
    If you want to use your own language and have a corpus already, feel free to use that instead.
</div>

In [None]:
from typing import List, Dict, Tuple
import os

corpus = ""


# If you're using your own corpus, change this to the correct directory
corpus_directory = "../../corpora/usp"

# Loop over each file in the corpus so we can read it in
for file_name in os.listdir(corpus_directory):
    
    # We will save one corpus entry, 50, for testing
    if file_name == "50.txt":
        continue
        
    
    # Make sure we only read text files
    if not file_name.endswith(".txt"):
        continue
        
        
    # Read the current file as a string
    file_path = os.path.join(corpus_directory, file_name)
    
    with open(file_path, 'r', encoding='utf-8') as file:
        file_contents = file.read()
        corpus += (file_contents + "\n")
        
print(corpus[:300])
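
The same loading step can also be written as a reusable function. This is only a sketch: `load_corpus` and its `held_out` parameter are hypothetical names, and it assumes the corpus files are UTF-8 encoded.

```python
from pathlib import Path

def load_corpus(directory: str, held_out: str = "50.txt") -> str:
    """Concatenate every .txt file in `directory`, skipping the held-out file."""
    parts = []
    # sorted() makes the reading order the same on every machine
    for path in sorted(Path(directory).glob("*.txt")):
        if path.name == held_out:
            continue
        parts.append(path.read_text(encoding="utf-8"))
    # Join with newlines so words from different files never run together
    return "\n".join(parts)

# Example call (the directory is the one used in this notebook):
# corpus = load_corpus("../../corpora/usp")
```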

### Preprocessing
Let's preprocess our text a little bit for spellchecking.

In [None]:
# First, strip accent marks (they aren't usually written in Uspanteko).

from util import strip_accents

corpus = strip_accents(corpus)
strip_accents("ójor taq tziij kita' jaa,")
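
The implementation of `util.strip_accents` isn't shown here. One common way to write such a function, offered as a sketch rather than as the actual code in `util`, is to decompose each character with Python's `unicodedata` module and drop the combining marks. The name `strip_accents_sketch` is hypothetical.

```python
import unicodedata

def strip_accents_sketch(text: str) -> str:
    """Remove combining accent marks while keeping the base letters."""
    # NFD splits a character like "ó" into "o" plus a separate combining accent
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (nonspacing mark) covers the combining accents
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents_sketch("ójor taq tziij kita' jaa,"))
```

Apostrophes and punctuation are left alone, since only combining marks are removed.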

#### **Exercise 1**
Finish the next cell to make the entire corpus lowercase.
<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">corpus = corpus.lower()</code></pre>
</details>

In [None]:
# TODO: Finish me!!


print(corpus[:99])

## **Creating a Word List**
Let's create a list of every word that occurs in our corpus. We will ignore punctuation marks and assume that a word is surrounded by spaces or punctuation.

### Tokenize words using a regular expression

> **Tokenization** refers to the process of breaking a string up into tokens. Tokens might be words, characters, or morphemes. In this case, we are tokenizing into words.

We will use a [regular expression](./skills/regex.ipynb) that looks for clumps of letters and apostrophes. When we run the regex over our text, each clump it finds is a separate word.

For instance, in the following string:

```ójor taq tziij kita' jaa```

Once accents have been stripped, the regex will produce:

```["ojor", "taq", "tziij", "kita'", "jaa"]```

In Uspanteko, words are always divided by punctuation or whitespace. Therefore, we can assume each clump that contains only letters and apostrophes must be a word.

<div class="alert alert-block alert-warning">
Your language might need a custom regex for detecting words. Please refer to the lesson on regular expressions for information, and you can use a regex testing tool such as <a href='https://regex101.com'>regex101</a> to make sure your regex works the way you expect.
</div>

In [None]:
import re

# If your language uses some other character within words (like hyphens) you may need to update this regex appropriately
# Matches runs of word characters (\w) and apostrophes
word_regex = r"[\w']+"

def tokenize(text: str) -> List[str]:
    return re.findall(word_regex, text)

words = tokenize(corpus)
print(words[:100])
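
To see the tokenizer on a small input, here is a self-contained sketch that runs the same kind of pattern on the example sentence. The names `sketch_word_regex` and `tokenize_sketch` are hypothetical, and the hyphen variant is only an illustration of how the character class could be extended for another language.

```python
import re
from typing import List

# Runs of word characters (\w) and apostrophes
sketch_word_regex = r"[\w']+"

def tokenize_sketch(text: str) -> List[str]:
    return re.findall(sketch_word_regex, text)

print(tokenize_sketch("ojor taq tziij kita' jaa,"))

# A hypothetical variant that also keeps hyphens inside words
print(re.findall(r"[\w'-]+", "well-known word"))
```

Note that `\w` also matches digits and underscores, so a corpus containing numbers would produce numeric tokens as well.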

### Create a lexicon

A **lexicon** refers to the entire vocabulary of words used in the corpus.

To create a lexicon, we will iterate over every word in the entire corpus. We use a [set](./skills/sets.ipynb) to collect all the unique words in the corpus.

#### **Exercise 2**
Create a set called `lexicon` that contains all the unique words in the corpus. Then, print the number of elements in the lexicon.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">lexicon = set()
for word in words:
    lexicon.add(word)</code></pre>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">print(len(lexicon))
</code></pre>
</details>

In [None]:
# TODO: Loop over the corpus and add each word to the lexicon


# TODO: Print the number of elements in the lexicon


# Store the lexicon to permanent storage so we can retrieve it later if needed
%store lexicon
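
A set records only whether a word occurs. If word frequencies are also wanted (for example, to rank spelling suggestions), one option is `collections.Counter` from the standard library. The sketch below uses a made-up token list in place of `words`; `toy_words`, `word_counts`, and `toy_lexicon` are hypothetical names.

```python
from collections import Counter

# Toy token list standing in for `words` (assumed example data)
toy_words = ["taq", "jaa", "taq", "tziij", "taq", "jaa"]

word_counts = Counter(toy_words)
toy_lexicon = set(word_counts)  # the Counter's keys are the unique words

print(word_counts.most_common(2))
print(len(toy_lexicon))
```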

## **Building a Spellchecker**
At this point, we have a lexicon containing every unique word in our corpus. Now we're ready to build our spellchecker program.

### Detect misspelled words
Let's create a function that takes a sentence and finds any misspelled words. For each word, check whether it occurs in our lexicon. If not, it's a spelling error.

Right now, any time we see a word that isn't in our lexicon, we report it as a spelling error (even if it's a new, correctly spelled word). We'll improve this later.

#### **Exercise 3**
Finish the following code to create a function that finds misspelled words.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">def spellcheck(s: str):
    # Preprocess and tokenize
    # TODO: Strip accents
    s = strip_accents(s)<br/>
    # TODO: Make s lowercase
    s = s.lower()<br/>
    # TODO: Tokenize s
    input_tokenized = tokenize(s)<br/>
    # TODO: For each word in the input, check if the word is in the lexicon.
    # If not, add the word to "mispelled_words".
    mispelled_words = []<br/>
    for word in input_tokenized:
        if word not in lexicon:
            mispelled_words.append(word)<br/>
    return mispelled_words
</code></pre>
</details>

<details>
  <summary>Show alternate answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">def spellcheck(s: str):
    # Preprocess and tokenize
    # TODO: Strip accents
    s = strip_accents(s)<br/>
    # TODO: Make s lowercase
    s = s.lower()<br/>
    # TODO: Tokenize s
    input_tokenized = tokenize(s)<br/>
    # TODO: For each word in the input, check if the word is in the lexicon.
    # If not, add the word to "mispelled_words".
    mispelled_words = [word for word in input_tokenized if word not in lexicon]
    return mispelled_words</code></pre>
</details>

In [None]:
def spellcheck(s: str):
    # Preprocess and tokenize
    # TODO: Strip accents
    
    
    # TODO: Make s lowercase
    
    
    # TODO: Tokenize s
    
    
    # TODO: For each word in the input, check if the word is in the lexicon.
    # If not, add the word to "mispelled_words".
    
    # TODO: Return mispelled_words

# This sentence has one misspelling ('tzijj')
test_sentence = "Kwand xink'uli'k', re ójr taq tzijj in ák'el na."
mispelled_words = spellcheck(test_sentence)
print("Mispelled words:", mispelled_words)
print("✅ Correct" if mispelled_words == ['tzijj'] else "❌ Incorrect")

### Find misspelled word positions
In the future, we might want to know *where* the misspelled words occur in the text. The following function finds the location of misspelled words using our `spellcheck` function.

In [None]:
def find_mispelled(text: str):
    """Finds misspelled words in a string.
    :return: A dict with the original `text` and a list of `entities`. Each entity is a dict holding the misspelled `word`, its `start` and `end` indices in `text`, and the label `MISPELLED`.
    """
    mispelled_words = spellcheck(text)
    
    mispelled_words_and_positions = []
    
    # Use a set so a word that is misspelled several times is only searched for once
    for word in set(mispelled_words):
        # This regex searches for the given word, surrounded by whitespace or punctuation.
        # re.escape stops any special characters in the word from being read as regex syntax.
        word_regex = rf"(^|\W)({re.escape(word)})($|\W)"
        
        # There might be multiple matches if we misspelled a word multiple times
        for match in re.finditer(word_regex, text):
            mispelled_words_and_positions.append({
                'word': word,
                'start': match.start(2),
                'end': match.start(2) + len(word),
                'entity': 'MISPELLED'
            })
        
    # Return the misspelled words, sorted by their position
    return {
        'text': text,
        'entities': sorted(mispelled_words_and_positions, key = lambda x: x['start'])
    }

print(find_mispelled(test_sentence))
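
The position lookup relies on `match.start(2)`, which gives the index where the second capture group (the word itself) begins. A small standalone sketch with a made-up sentence shows the idea; `text`, `word`, and `spans` here are illustration-only names.

```python
import re

text = "re ojr taq tzijj in, tzijj na."
word = "tzijj"

# Group 2 is the word itself; groups 1 and 3 are the surrounding boundaries
pattern = rf"(^|\W)({re.escape(word)})($|\W)"

spans = [(m.start(2), m.end(2)) for m in re.finditer(pattern, text)]
print(spans)
print([text[start:end] for start, end in spans])
```

Slicing the original text with each `(start, end)` pair recovers the word, which is a quick way to check that the offsets are right.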

## **Make the Spellchecker a Standalone App**

Our function `find_mispelled` detects spelling errors, but a bare Python function isn't convenient for most users. Let's create a standalone app instead.

We will use [Gradio](https://gradio.app), a free framework that lets you turn Python code into shareable web apps. Using Gradio is as easy as three lines of code.

In [None]:
import gradio as gr

gr.close_all()

spellchecker = gr.Interface(fn=find_mispelled, inputs="text", outputs="text", live=True)
spellchecker.launch(share=True)

### Improving the UI

This works great, and we can even share our app using the web link above. Now, let's make the UI a little nicer.

In [None]:
gr.close_all()

with gr.Blocks(theme=gr.themes.Soft(), title="Uspanteko Spellchecker") as spellchecker:
    gr.Markdown("# Uspanteko Spellchecker")
    
    with gr.Row():
        input_textbox = gr.Textbox(label="Text", info="Text to spellcheck", lines=3)
        output_textbox = gr.HighlightedText(label="Spellchecked", combine_adjacent=True)
    
    gr.Examples(
        examples=["Kwand xink'uli'k', re ójr taq tzijj in ák'el na."],
        inputs=input_textbox,
        outputs=output_textbox,
        fn=find_mispelled,
    )
    
    # Connect the input to the output using our function
    input_textbox.change(find_mispelled, input_textbox, output_textbox)
    
    
spellchecker.launch(share=True)
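
The highlighted output is driven by the `text` and `entities` returned from `find_mispelled`. To check that structure without launching the app, the spans can be rendered as plain text. This is a sketch: `mark_entities` is a hypothetical helper, and the example dict is hand-written in the same shape.

```python
from typing import Dict

def mark_entities(result: Dict) -> str:
    """Wrap each entity span in [[...]] so the highlighting can be checked as plain text."""
    text = result["text"]
    pieces = []
    cursor = 0
    for entity in sorted(result["entities"], key=lambda e: e["start"]):
        pieces.append(text[cursor:entity["start"]])
        pieces.append("[[" + text[entity["start"]:entity["end"]] + "]]")
        cursor = entity["end"]
    pieces.append(text[cursor:])
    return "".join(pieces)

# Hand-written example in the same shape that find_mispelled returns
example = {
    "text": "taq tzijj in",
    "entities": [{"word": "tzijj", "start": 4, "end": 9, "entity": "MISPELLED"}],
}
print(mark_entities(example))
```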

## **Allowing for New Words**
There are a few problems with our app. First, if the user types a word that is spelled correctly but doesn't appear in our corpus, it will be marked as incorrect.

Remember when we saved one file from our corpus for testing? Let's see how many unseen words occur in that file.

In [None]:
test_text = ""

with open("../../corpora/usp/50.txt", 'r', encoding='utf-8') as file:
    test_text = file.read()

len(find_mispelled(test_text)['entities'])
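
A raw count is hard to interpret without knowing how long the test file is. One way to put it in context is the fraction of tokens that are missing from the lexicon. The sketch below uses toy data; `unseen_word_rate`, `toy_lexicon`, and `toy_tokens` are hypothetical names.

```python
from typing import Iterable, Set

def unseen_word_rate(tokens: Iterable[str], known_words: Set[str]) -> float:
    """Fraction of tokens that do not appear in the known word set."""
    tokens = list(tokens)
    if not tokens:
        return 0.0
    unseen = sum(1 for token in tokens if token not in known_words)
    return unseen / len(tokens)

toy_lexicon = {"taq", "jaa", "tziij"}
toy_tokens = ["taq", "tzijj", "jaa", "kita'"]
print(unseen_word_rate(toy_tokens, toy_lexicon))
```

In the notebook, the same function could be called with the tokens of the held-out file and the real `lexicon`.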

### Implementing an "Add word" button in the UI

There are a lot of false spelling errors detected! Because our system was built using only a small corpus, it will not contain every valid word in the language. Common word processing tools fix this problem by letting the user easily add a word to the lexicon, so let's modify our tool to do that.

To do this, we will introduce the idea of [Gradio State](https://gradio.app/docs/#state). A `State` component lets us keep a value between events in our Gradio interface, much like a global variable. We will use a `State` variable to keep track of the misspelled words.

#### **Exercise 4**
Finish the function `add_word_to_lexicon` so that it adds the first misspelled word to the `lexicon`. The argument `mispelled` is the result of calling `find_mispelled`.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">def add_word_to_lexicon(mispelled):
    first_mispelled_word = mispelled['entities'][0]['word']<br/>
    lexicon.add(first_mispelled_word)</code></pre>
</details>

In [None]:
gr.close_all()

def add_word_to_lexicon(mispelled):
    # TODO: Add the first misspelled word to the lexicon
    

# NEW: We'll use this function when the input changes instead of `find_mispelled`. 
# We need to update 1) the output text, 2) the button, and 3) the global state
def input_text_changed(text):
    mispelled = find_mispelled(text)
    
    should_show_button = len(mispelled['entities']) > 0
    first_mispelled_word = mispelled['entities'][0]['word'] if should_show_button else ''

    # Here we use an "update" function to change the button properties. Learn more: https://gradio.app/docs/#update
    add_to_lexicon_button_updater = gr.Button.update(visible=should_show_button, 
                                                     value=f"Add '{first_mispelled_word}' to lexicon")

    return mispelled, add_to_lexicon_button_updater, mispelled

    
# Interface
with gr.Blocks(theme=gr.themes.Soft(), title="Uspanteko Spellchecker") as spellchecker:
    # NEW: A 'State' variable that keeps track of the misspelled words
    mispelled_words_state = gr.State(None)
    
    gr.Markdown("# Uspanteko Spellchecker")
    
    with gr.Row():
        input_textbox = gr.Textbox(label="Text", info="Text to spellcheck", lines=3)
        
        with gr.Column():
            output_textbox = gr.HighlightedText(label="Spellchecked", combine_adjacent=True)
            
            # NEW: A button that adds the first misspelled word to the lexicon
            add_word_button = gr.Button(value="Add word to lexicon", visible=False)
            
    gr.Examples(
        examples=["Kwand xink'uli'k', re ójr taq tzijj in ák'el na."],
        inputs=input_textbox,
        outputs=output_textbox,
        fn=find_mispelled,
    )
    
    input_textbox.change(input_text_changed, input_textbox, [output_textbox, add_word_button, mispelled_words_state])
    
    # NEW: Run a function when the button is clicked, then update the state
    add_word_button \
        .click(add_word_to_lexicon, [mispelled_words_state]) \
        .then(input_text_changed, input_textbox, [output_textbox, add_word_button, mispelled_words_state])
    
        
spellchecker.launch(share=True)

Now we can easily add any correctly spelled words to our lexicon, and they will not be marked as errors in the future!
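
The effect of the button can be checked without the UI. In this sketch, `flag_unknown` is a hypothetical stand-in for the lexicon lookup and the data is made up: once the flagged word is added to the set, the same input no longer produces an error.

```python
from typing import List, Set

def flag_unknown(tokens: List[str], known_words: Set[str]) -> List[str]:
    """Return the tokens that are missing from the known word set."""
    return [token for token in tokens if token not in known_words]

toy_lexicon = {"taq", "jaa"}
tokens = ["taq", "tziij", "jaa"]

before = flag_unknown(tokens, toy_lexicon)
toy_lexicon.add(before[0])  # the step the "Add word" button performs
after = flag_unknown(tokens, toy_lexicon)

print(before, after)
```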

## **Spell Correction**
Lastly, it would be nice to update our spellchecker so it suggests correct spellings when there is an error. To do this, we need to determine which words in our lexicon are closest to what was typed. We will use **edit distance**, a measure of how many edits (insertions, deletions, substitutions) it takes to get from one string to another.

In [None]:
import nltk

def spelling_suggestions(mispelled_word, n):
    # 1. Calculate the edit distance between the word and every word in the lexicon
    candidate_spellings = []
    
    for word in lexicon:
        edit_distance = nltk.edit_distance(word, mispelled_word)
        candidate_spellings.append((word, edit_distance))
    
    # 2. Find the top n closest words
    sorted_candidates = sorted(candidate_spellings, key=lambda x: (x[1]))
    top_n_candidates = sorted_candidates[:n]
    top_n_words_only = [candidate[0] for candidate in top_n_candidates]
    return top_n_words_only

spelling_suggestions("tzijj", 3)
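
To make edit distance less of a black box, here is a minimal sketch of the classic dynamic-programming calculation (Levenshtein distance), where each insertion, deletion, or substitution costs 1. It is written for illustration and is not NLTK's source code; the function name `levenshtein` is our own.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions as 1 each."""
    # previous[j] is the distance between the characters of `a` processed so far
    # (before the current one) and the first j characters of `b`
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,                      # delete char_a
                current[j - 1] + 1,                   # insert char_b
                previous[j - 1] + substitution_cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

print(levenshtein("tzijj", "tziij"))
print(levenshtein("kitten", "sitting"))
```

Only two rows of the table are kept at a time, so memory use grows with the length of one word rather than with the product of both lengths.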

### Implementing suggestions in the UI
Let's implement this functionality in the UI!

In [None]:
gr.close_all()

def input_text_changed2(text):
    mispelled = find_mispelled(text)
    
    should_show_button = len(mispelled['entities']) > 0
    first_mispelled_word = mispelled['entities'][0]['word'] if should_show_button else ''

    add_to_lexicon_button_updater = gr.Button.update(visible=should_show_button, 
                                                     value=f"Add '{first_mispelled_word}' to lexicon")
    
    # NEW: Generate suggestions for the closest spelling (only when something is misspelled)
    suggestions = spelling_suggestions(first_mispelled_word, 3) if should_show_button else []
    suggestions_text = " | ".join(suggestions)

    return mispelled, add_to_lexicon_button_updater, mispelled, suggestions_text

    
# Interface
with gr.Blocks(theme=gr.themes.Soft(), title="Uspanteko Spellchecker") as spellchecker:
    mispelled_words_state = gr.State(None)
    
    gr.Markdown("# Uspanteko Spellchecker")
    
    with gr.Row():
        input_textbox = gr.Textbox(label="Text", info="Text to spellcheck", lines=3)
        
        with gr.Column():
            output_textbox = gr.HighlightedText(label="Spellchecked", combine_adjacent=True)
            add_word_button = gr.Button(value="Add word to lexicon", visible=False)
            
            # NEW: Show suggested spellings
            suggestions_textbox = gr.Textbox(label="Suggestions", interactive=False)
            
    gr.Examples(
        examples=["Kwand xink'uli'k', re ójr taq tzijj in ák'el na."],
        inputs=input_textbox,
        outputs=output_textbox,
        fn=find_mispelled,
    )
    
    input_textbox.change(input_text_changed2, input_textbox, [output_textbox, add_word_button, mispelled_words_state, suggestions_textbox])
    
    add_word_button \
        .click(add_word_to_lexicon, [mispelled_words_state]) \
        .then(input_text_changed2, input_textbox, [output_textbox, add_word_button, mispelled_words_state, suggestions_textbox])
    
        
spellchecker.launch(share=True)

## **Summary**
In this tutorial, we built a spellchecker tool for a low-resource language. This included:
- Building a lexicon from source texts
- Detecting misspelled words
- Suggesting correct spellings using edit distance

### **Challenges**
1. Right now, our spellchecker only lets you add the first misspelled word to the lexicon. Enhance this functionality by creating two new buttons. One button will let you **Add all** misspelled words to the lexicon. The other button will let you **Ignore** a misspelled word and move to the next one.
2. Our spellchecker displays suggestions, but it doesn't let you do anything with them. Replace the textbox with buttons for each suggestion. When you click on a suggestion, it should replace the misspelled word in the input textbox.