# **2. Predictive Text, Part I: N-Grams**

<div>
<img src="assets/autocomplete.png" width="500" style=" display: block; margin-left: auto; margin-right: auto;"/>
</div>

In this section, we will build **predictive text**, a system that suggests the next word while a user is typing. Predictive text is used in mobile phone keyboards, search applications, AI-powered email composition, and more.

The primary goal of a predictive text system is this: *given some sequence of words, predict the most likely next word*.

#### How do we do it?
In Natural Language Processing (NLP), this can be solved with a **language model**. A language model learns the distribution of words in text, that is, how likely each word is to appear in a given context. We will build a simple language model that learns common sequences of two or three words.
***

## **Data Preparation**
### Load the corpus
For this application, we are using English text from the COCA corpus. 

<div class="alert alert-block alert-warning">
    As before, feel free to use your own language instead.
</div>

In [None]:
# Load our data
import util

# REPLACE WITH YOUR CORPUS DIRECTORY
corpus = util.load_raw_text(corpus_directory="../corpora/eng")
corpus[:1000]

### Preprocessing and Tokenization
As with the spellchecker, the next step is to tokenize the text into individual words. The only difference is that this time we keep punctuation marks as tokens.

#### **Exercise 1**
Use the provided regex to tokenize the text, and return the result.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">return re.findall(word_or_punctuation_regex, text)</code></pre>
</details>

In [None]:
import re

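# Match a run of word characters/apostrophes, or a single punctuation mark.
# (Inside [...], "|" is a literal character, so it is not used as a separator.)
word_or_punctuation_regex = r"[\w']+|[.,?!]"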

def preprocess(text):
    text = util.strip_accents(text)
    text = text.lower()

    # TODO: Use the provided regex to tokenize the text, and return the result

tokens_filtered = preprocess(corpus)

print(len(tokens_filtered), "total tokens")
print(tokens_filtered[:200])
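
To see what this kind of tokenization produces, here is a minimal sketch on a single sentence. The pattern `demo_regex` and the helper `demo_tokenize` are hypothetical names introduced only for illustration; the sketch skips `util.strip_accents`, so it is not a drop-in replacement for `preprocess`.

```python
import re

# Illustrative pattern: a run of word characters/apostrophes,
# or a single punctuation mark.
demo_regex = r"[\w']+|[.,?!]"

def demo_tokenize(text):
    """Lowercase the text and split it into word and punctuation tokens."""
    return re.findall(demo_regex, text.lower())

demo_tokens = demo_tokenize("What's your name? I don't know.")
print(demo_tokens)
# ["what's", 'your', 'name', '?', 'i', "don't", 'know', '.']
```

Note that contractions stay in one piece, while each punctuation mark becomes its own token.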

## **N-Gram Modeling**
Now, let's build our first model. For this we will use *n-grams*.

An n-gram is simply a sequence of *n* consecutive tokens (words or punctuation marks) that occurs in our text. For instance, consider the sentence:

> What is your name?

If we got the list of all of the *bigrams* (2 words each) in the sentence, we would have:

- *What is*
- *is your*
- *your name*
- *name ?*

If we got the list of all of the *trigrams* (3 words each) we would get:

- *What is your*
- *is your name*
- *your name ?*

And so on, and so forth.
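
As a rough illustration of the definition, the sketch below builds n-grams from a token list in plain Python. The helper name `make_ngrams` is hypothetical and is not part of this tutorial's code, which relies on NLTK for this step.

```python
def make_ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

sentence = ["What", "is", "your", "name", "?"]

example_bigrams = make_ngrams(sentence, 2)
example_trigrams = make_ngrams(sentence, 3)

print(example_bigrams)
print(example_trigrams)
```

A sentence with *k* tokens yields *k - n + 1* n-grams, so the five tokens here give four bigrams and three trigrams.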

### Using N-grams as a language model

We can easily use the idea of n-grams to build a simple language model. For instance, if we are trying to predict the next word, using trigrams, given the input:

> What is your ...

We can look at all of the trigrams that start with *is your*, and choose the most common one. Let's use our tokenized text to get a list of all of the n-grams that occur in the text. 
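
Before turning to the real corpus, here is a small sketch of that idea on a made-up set of trigram counts. The counts and the name `toy_counts` are invented purely for illustration and do not come from the corpus.

```python
# Invented counts, for illustration only
toy_counts = {
    ("is", "your", "name"): 5,
    ("is", "your", "favorite"): 3,
    ("is", "your", "dog"): 1,
    ("was", "your", "name"): 2,
}

context = ("is", "your")

# Keep only the trigrams whose first two words match the context
candidates = {gram: count for gram, count in toy_counts.items() if gram[:2] == context}

# Dividing by the total turns raw counts into relative frequencies
total = sum(candidates.values())
for gram, count in sorted(candidates.items(), key=lambda item: item[1], reverse=True):
    print(gram[2], round(count / total, 2))

# The prediction is the last word of the most frequent matching trigram
best_word = max(candidates, key=candidates.get)[2]
print("prediction:", best_word)
```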

### Padding sentences
Right now, we would have n-grams that cross sentence boundaries, such as "headlines . you". These tell us little about which words follow each other within a sentence, so a common technique is to **pad** each sentence with special tokens (`<s>`) that mark the start of the sentence.

In [None]:
def pad(text: list, num_padding: int):
    
    padded_text = []
    
    # Add initial padding to the first sentence
    for _ in range(num_padding):
        padded_text.append("<s>")
    
    for word in text:
        padded_text.append(word)

        # Every time we see an end punctuation mark, add <s> tokens after it
        # REPLACE IF YOUR LANGUAGE USES DIFFERENT END PUNCTUATION
        if word in [".", "?", "!"]:
            for _ in range(num_padding):
                padded_text.append("<s>")
        
        
    return padded_text

print(pad(tokens_filtered, 2)[:30])
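
To make the effect of padding concrete, the following sketch applies the same idea to a tiny hand-written token list and prints the resulting trigrams. `pad_tokens` and `toy_tokens` are hypothetical names used only here; the logic is meant to mirror `pad` above.

```python
def pad_tokens(tokens, num_padding, end_marks=(".", "?", "!")):
    """Insert num_padding "<s>" tokens at the start and after each end mark."""
    padded = ["<s>"] * num_padding
    for token in tokens:
        padded.append(token)
        if token in end_marks:
            padded.extend(["<s>"] * num_padding)
    return padded

toy_tokens = ["hi", ".", "you", "ok", "?"]
toy_padded = pad_tokens(toy_tokens, 2)
print(toy_padded)

# Slide a window of three tokens over the padded list
toy_trigrams = [tuple(toy_padded[i:i + 3]) for i in range(len(toy_padded) - 2)]
for gram in toy_trigrams:
    print(gram)
```

No trigram contains words from both sentences, and `("<s>", "<s>", "you")` records that *you* started a sentence.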

### Create a list of n-grams
The following code uses the **NLTK** library to create a list of trigrams.

#### **Exercise 2**
What would we change to use bigrams instead?

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">padded_tokens = pad(tokens_filtered, 1)
bigrams = list(ngrams(sequence=padded_tokens, n=2))</code></pre>
</details>

In [None]:
from nltk.util import ngrams

# Now, we can actually create the list of n-grams using the NLTK library
padded_tokens = pad(tokens_filtered, 2)
trigrams = list(ngrams(sequence=padded_tokens, n=3))
trigrams[:30]

Now that we have a list of trigrams, we can count up the frequency of each different trigram.

#### **Exercise 3**
Using the list of trigrams, fill the dictionary `all_trigrams` such that each key is a unique trigram and each value is the number of times that trigram occurs in the text.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">for gram in trigrams:
    if gram in all_trigrams:
        all_trigrams[gram] += 1
    else:
        all_trigrams[gram] = 1</code></pre>
</details>

In [None]:
# A dict of all trigrams and their frequency
all_trigrams = dict()

# TODO: Add each unique trigram to the dictionary and set the value to how many times that trigram occurs in the text
        
len(all_trigrams)

In [None]:
# Let's see what the 30 most common trigrams are
sorted(all_trigrams.items(), key=lambda x: x[1], reverse=True)[:30]
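
As a side note, Python's standard library offers `collections.Counter`, which handles this kind of counting and ranking. The sketch below uses a tiny invented list (`sample_trigrams`) rather than the corpus, so it can be run on its own.

```python
from collections import Counter

# A tiny invented list of trigrams, for illustration only
sample_trigrams = [
    ("<s>", "<s>", "what"),
    ("<s>", "what", "is"),
    ("<s>", "<s>", "what"),
    ("<s>", "what", "time"),
    ("<s>", "<s>", "how"),
    ("<s>", "<s>", "what"),
]

# Counter maps each distinct trigram to the number of times it appears
sample_counts = Counter(sample_trigrams)

# most_common(k) returns the k most frequent items as (item, count) pairs
print(sample_counts.most_common(1))
# [(('<s>', '<s>', 'what'), 3)]
```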

## Making predictions using n-grams

Now that we have a count of all trigrams, we can make predictions by looking for the most common trigram that matches our input. 

In [None]:
def predict_trigram_model(text, number_results = 3):
    input_tokens = pad(preprocess(text), 2)
    
    # Find the last 2 tokens in the input
    last_two_tokens = input_tokens[-2:]
    
    # Search our list of all trigrams to find matching trigrams
    matching_trigrams = []
    for item in all_trigrams.items():
        gram = item[0]
        
        # Check if the first and second item in the trigram are a match
        if gram[0] == last_two_tokens[0] and gram[1] == last_two_tokens[1]:
            matching_trigrams.append(item)
    
    # Now, sort the matching trigrams by popularity and return the first `number_results` results
    sorted_matching_trigrams = sorted(matching_trigrams, key=lambda x: x[1], reverse=True)
    top_matching_trigrams = sorted_matching_trigrams[:number_results]
    
    # Last, let's just get the predicted word (the last word of the trigram)
    predictions = [trigram[0][2] for trigram in top_matching_trigrams]
    return predictions

predict_trigram_model("how is")
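
The function above scans every trigram on each call, which can become slow for a large corpus. One possible alternative, sketched here under the assumption that counts are stored as a dict from trigram tuples to integers (as in `all_trigrams`), is to group the counts by their two-word context once, so that each prediction becomes a dictionary lookup. The names `build_context_index` and `predict_from_index`, along with the toy counts, are hypothetical.

```python
from collections import Counter, defaultdict

def build_context_index(trigram_counts):
    """Group trigram counts by their first two words."""
    index = defaultdict(Counter)
    for (first, second, third), count in trigram_counts.items():
        index[(first, second)][third] += count
    return index

def predict_from_index(index, last_two_tokens, number_results=3):
    """Return the most frequent next words for a two-word context."""
    # .get avoids creating an empty entry for contexts we have never seen
    followers = index.get(tuple(last_two_tokens), Counter())
    return [word for word, _ in followers.most_common(number_results)]

# Invented counts, for illustration only
toy_trigram_counts = {
    ("how", "is", "it"): 4,
    ("how", "is", "the"): 7,
    ("how", "is", "your"): 2,
    ("who", "is", "the"): 5,
}

toy_index = build_context_index(toy_trigram_counts)
print(predict_from_index(toy_index, ["how", "is"], number_results=2))  # ['the', 'it']
print(predict_from_index(toy_index, ["never", "seen"]))                # []
```

Building the index costs one pass over the counts; after that, the cost of a prediction depends only on how many words follow the given context.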

## Make the predictive text a standalone app

#### **Exercise 4**
Create a UI using the `predict_trigram_model` function. Refer to the spellchecker for help.

<details>
  <summary>Show answer</summary>
      <pre style="background-color: honeydew; padding: 10px; border-radius: 5px;"><code style="background: none;">autocompleter = gr.Interface(fn=predict_trigram_model, inputs="text", outputs="text", live=True)
</code></pre>
</details>

In [None]:
import gradio as gr

autocompleter = None

# TODO: Create a UI using Gradio for predictive text

autocompleter.launch()

## **Improving N-Grams with Backoff**
Our app has one significant issue. If the phrase we enter ends with two words that don't match *any* trigram, then our model has no predictions to make.

One common solution for this is to use a process called **backoff**. With backoff, if our trigram model doesn't find any results, we try to find matching *bigrams* instead. If that fails, we go to *unigrams* (which means we just use the most frequent words). This means that we always get a result, even if it isn't quite as good.

Below is a model that uses backoff with n-grams of arbitrary size to make predictions, putting together everything so far.

In [None]:
n_gram_models = dict()

def create_ngram_model(n = 3):
    padded_tokens = pad(tokens_filtered, n - 1)
    grams = list(ngrams(sequence=padded_tokens, n=n))
    
    all_ngrams = dict()
    for gram in grams:
        if gram in all_ngrams:
            all_ngrams[gram] += 1
        else:
            all_ngrams[gram] = 1
    return all_ngrams


def predict_ngram_model(text, n = 3, number_results = 3):
    input_tokens = pad(preprocess(text), n - 1)
    
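    while n > 0:
        # Find the last n - 1 tokens in the input.
        # For unigrams (n == 1) there is no context; note that input_tokens[-0:]
        # would return the whole list rather than an empty one.
        last_tokens = tuple(input_tokens[-(n - 1):]) if n > 1 else ()
    
        if n not in n_gram_models:
            n_gram_models[n] = create_ngram_model(n)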
        
        matching_ngrams = []
        
        for item in n_gram_models[n].items():
            gram = item[0]
            if gram[:-1] == last_tokens:
                matching_ngrams.append(item)
    
        # Now, sort the matching n-grams by popularity and return the first `number_results` results
        sorted_matching_ngrams = sorted(matching_ngrams, key=lambda x: x[1], reverse=True)
        top_matching_ngrams = sorted_matching_ngrams[:number_results]
        
        # BACKOFF: If there are no results, drop n and try again
        if len(top_matching_ngrams) == 0:
            print("backing off to:", n-1)
            n = n - 1
            continue
    
        # Last, let's just get the predicted word (the last word of the n-gram)
        predictions = [gram[0][-1] for gram in top_matching_ngrams]
        return predictions
    return []

predict_ngram_model("why is", 4)
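
To watch backoff happen without loading the corpus, here is a compact sketch on a made-up token list. It is a simplified variant of the code above, not a replacement: `toy_corpus`, `count_ngrams`, and `predict_with_backoff` are hypothetical names, and padding is left out to keep it short.

```python
from collections import Counter

toy_corpus = "the cat sat on the mat . the dog sat on the rug .".split()

def count_ngrams(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def predict_with_backoff(context, max_n=3, number_results=1):
    """Try the longest context first, then shorten it until something matches.

    Returns the n-gram size that produced a match and the predicted words.
    """
    for n in range(max_n, 0, -1):
        # The last n - 1 tokens; for unigrams the context is empty
        last_tokens = tuple(context[-(n - 1):]) if n > 1 else ()
        matches = Counter()
        for gram, count in count_ngrams(toy_corpus, n).items():
            if gram[:-1] == last_tokens:
                matches[gram[-1]] += count
        if matches:
            return n, [word for word, _ in matches.most_common(number_results)]
    return 0, []

print(predict_with_backoff(["sat", "on"]))        # (3, ['the'])
print(predict_with_backoff(["purple", "dog"]))    # (2, ['sat'])
print(predict_with_backoff(["purple", "zebra"]))  # (1, ['the'])
```

The first call finds a matching trigram, the second only matches on the last word, and the third falls all the way back to the most frequent single token.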

## **Summary**
In this tutorial, we built a predictive text tool for a low-resource language. This included:
- Creating a list of n-grams from the corpus
- Using n-grams to predict the next word in a string
- Using backoff to utilize different size n-grams as needed

### **Challenges**
1. Use the prediction function repeatedly to predict multiple sequential words given some string. For example, the string "Why is" might be followed by "he doing" or "she doing".
1. Improve the predictive text app with buttons that let you add a suggestion to the text you are typing.
1. It's a little boring to always show the top predictions. Add an element of randomness using the Python `random` library, so that the prediction function doesn't show the same results every time.