# Auto Complete

[**1. Data preprocessing**](#1.-Data-preprocessing)

[**2. Developing n-gram based language model**](#2.-Developing-n\-gram-based-language-model)

[**3. Perplexity**](#3.-Perplexity)

[**4. Building the auto-complete system**](#4.-Building-the-auto\-complete-system)

## 1. Data preprocessing

### 1.1. Importing packages

In [1]:
import nltk
import random
import numpy as np
import pandas as pd

### 1.2. Loading the data

In [2]:
with open("./src/Karamazov.txt", "r", encoding="utf8") as f:
    data = f.read()
print("Data type:", type(data))
print("Number of letters:", len(data))
print("First 300 letters of the data")
print("-------")
display(data[0:300])
print("-------")

print("Last 300 letters of the data")
print("-------")
display(data[-300:])
print("-------")

Data type: <class 'str'>
Number of letters: 1962720
First 300 letters of the data
-------


'The Project Gutenberg EBook of The Brothers Karamazov by Fyodor\nDostoyevsky\n\n\n\nThis ebook is for the use of anyone anywhere in the United States and most\nother parts of the world at no cost and with almost no restrictions\nwhatsoever. You may copy it, give it away or re‐use it under the terms of\nthe '

-------
Last 300 letters of the data
-------


'  http://www.gutenberg.org\n\n\nThis Web site includes information about Project Gutenberg™, including how\nto make donations to the Project Gutenberg Literary Archive Foundation,\nhow to help produce our new ebooks, and how to subscribe to our email\nnewsletter to hear about new ebooks.\n\n\n\n\n\n\n***FINIS***'

-------


### 1.3. Pre-processing the data

#### Defining get_tokenized_data function

Making a list of tokenized sentences

**Inputs** :  
- *data*: a string corresponding to the string we are starting with

**Outputs** :  
- *tokenized_sentences*: a list of lists of tokens

In [3]:
def get_tokenized_data(data):
    
    # Splitting data into sentences using "." as the delimiter.
    sentences = data.split(".")
    # Removing leading and trailing spaces from each sentence
    sentences = [s.strip() for s in sentences]
    # Dropping sentences if they are empty strings.
    sentences = [s for s in sentences if len(s) > 0]
    # Splitting each sentence into tokens
    tokenized_sentences = []
    for sentence in sentences:
        sentence = sentence.lower()
        tokenized = nltk.word_tokenize(sentence)
        tokenized_sentences.append(tokenized)
    
    return tokenized_sentences

**Testing the function**

In [4]:
x = "Sky is high.\nGrass is green\nRoses are red.\n\nI have a pen.\n  I have an apple. \nAh\nApple pen.   \n"
print(x)
get_tokenized_data(x)

Sky is high.
Grass is green
Roses are red.

I have a pen.
  I have an apple. 
Ah
Apple pen.   



[['sky', 'is', 'high'],
 ['grass', 'is', 'green', 'roses', 'are', 'red'],
 ['i', 'have', 'a', 'pen'],
 ['i', 'have', 'an', 'apple'],
 ['ah', 'apple', 'pen']]
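The function above relies on `nltk.word_tokenize`, which needs the NLTK tokenizer data to be available. As a rough, dependency-free sketch of the same steps (split on ".", strip, drop empty strings, lowercase, tokenize), the hypothetical `simple_tokenized_data` below swaps in a regular expression for the NLTK tokenizer. It is only a stand-in: it will not match NLTK on contractions and other edge cases.

```python
import re

def simple_tokenized_data(data):
    # Hypothetical stand-in for get_tokenized_data: a regex replaces nltk.word_tokenize.
    sentences = [s.strip() for s in data.split(".")]
    sentences = [s for s in sentences if s]
    # \w+ matches runs of word characters; [^\w\s] matches single punctuation marks.
    return [re.findall(r"\w+|[^\w\s]", s.lower()) for s in sentences]

x = "Sky is high.\nGrass is green\nRoses are red.\n\nI have a pen.\n  I have an apple. \nAh\nApple pen.   \n"
tokens = simple_tokenized_data(x)
print(tokens)
```

As in the output above, "Grass is green" and "Roses are red" end up in one sentence, because only "." (not a newline) is treated as a sentence delimiter.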

#### Splitting the data into train and test sets

In [5]:
tokenized_data = get_tokenized_data(data)
random.seed(1)
random.shuffle(tokenized_data)

train_size = int(len(tokenized_data) * 0.8)
train_data = tokenized_data[0:train_size]
test_data = tokenized_data[train_size:]

print("{} sentences are split into {} train and {} test sets".format(len(tokenized_data), len(train_data), len(test_data)))
print("First training sample:")
print(train_data[0])
print("First test sample")
print(test_data[0])

19856 sentences are split into 15884 train and 3972 test sets
First training sample:
['when', 'he', 'entered', 'the', 'household', 'of', 'his', 'patron', 'and', 'benefactor', ',', 'yefim', 'petrovitch', 'polenov', ',', 'he', 'gained', 'the', 'hearts', 'of', 'all', 'the', 'family', ',', 'so', 'that', 'they', 'looked', 'on', 'him', 'quite', 'as', 'their', 'own', 'child']
First test sample
['who', 'knows', ',', 'he', 'may', 'be', 'of', 'use', 'and', 'make', 'his', 'own', 'career', ',', 'too']


### 1.4. Handling 'Out of Vocabulary' words

#### Defining count_words function

Counting the number of appearances of each word in the tokenized sentences.

**Inputs** :  
- *tokenized_sentences*: a list of lists of tokens as strings

**Outputs** :  
- *word_counts*: a dictionary that maps word (str) to the frequency (int)

In [6]:
def count_words(tokenized_sentences):
        
    word_counts = {}

    for sentence in tokenized_sentences:
        for token in sentence:
            if token not in word_counts:
                word_counts[token] = 1
            else:
                word_counts[token] += 1
    
    return word_counts

**Testing the function**

In [7]:
tokenized_sentences = [['sky', 'is', 'high', '.'],
                       ['grass', 'is', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
count_words(tokenized_sentences)

{'sky': 1,
 'is': 2,
 'high': 1,
 '.': 3,
 'grass': 1,
 'green': 1,
 'roses': 1,
 'are': 1,
 'red': 1}
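The same counts can probably be obtained more compactly with the standard library. The sketch below is one possible alternative, not the notebook's own method; the name `count_words_counter` is made up for illustration.

```python
from collections import Counter
from itertools import chain

def count_words_counter(tokenized_sentences):
    # Flatten the list of sentences into one stream of tokens and count them.
    return Counter(chain.from_iterable(tokenized_sentences))

tokenized_sentences = [['sky', 'is', 'high', '.'],
                       ['grass', 'is', 'green', '.'],
                       ['roses', 'are', 'red', '.']]
word_counts = count_words_counter(tokenized_sentences)
print(dict(word_counts))
```

`Counter` is a `dict` subclass, so the result can be used wherever the plain dictionary was expected.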

#### Defining get_words_with_nplus_frequency function

Finding the words that appear N times or more.

**Inputs** :  
- *tokenized_sentences*: a list of lists of tokens as strings  
- *count_threshold*: minimum number of occurrences for a word to be in the closed vocabulary

**Outputs** :  
- *closed_vocab*: list of words that appear N times or more

In [8]:
def get_words_with_nplus_frequency(tokenized_sentences, count_threshold):

    closed_vocab = []
    
    word_counts = count_words(tokenized_sentences)

    for word, cnt in word_counts.items():
        if cnt >= count_threshold:
            closed_vocab.append(word)
    
    return closed_vocab

**Testing the function**

In [9]:
tokenized_sentences = [['sky', 'is', 'high', '.'], ['grass', 'is', 'green', '.'], ['roses', 'are', 'red', '.']]
tmp_closed_vocab = get_words_with_nplus_frequency(tokenized_sentences, count_threshold=2)
print(f"Closed vocabulary:")
print(tmp_closed_vocab)

Closed vocabulary:
['is', '.']


#### Defining replace_oov_words_by_unk function

Replacing words not in the given vocabulary with "\<unk>" token.

**Inputs** :  
- *tokenized_sentences*: a list of lists of tokens as strings  
- *closed_vocab*: list of strings that we will use  
- *unknown_token*: a string representing unknown (out-of-vocabulary) words

**Outputs** :  
- *replaced_tokenized_sentences*: a list of lists of strings, with words not in the vocabulary replaced

In [10]:
def replace_oov_words_by_unk(tokenized_sentences, closed_vocab, unknown_token="<unk>"):
    
    # Placing vocabulary into a set for faster search
    closed_vocab = set(closed_vocab)
    
    replaced_tokenized_sentences = []
    
    for sentence in tokenized_sentences:
        replaced_sentence = []
        for token in sentence:
            if token in closed_vocab:
                replaced_sentence.append(token)
            else:
                replaced_sentence.append(unknown_token)
        
        replaced_tokenized_sentences.append(replaced_sentence)
        
    return replaced_tokenized_sentences

**Testing the function**

In [11]:
tokenized_sentences = [["dogs", "run"], ["cats", "sleep"]]
closed_vocab = ["dogs", "sleep"]
tmp_replaced_tokenized_sentences = replace_oov_words_by_unk(tokenized_sentences, closed_vocab)
print(f"Original sentence:")
print(tokenized_sentences)
print(f"tokenized_sentences with less frequent words converted to '<unk>':")
print(tmp_replaced_tokenized_sentences)

Original sentence:
[['dogs', 'run'], ['cats', 'sleep']]
tokenized_sentences with less frequent words converted to '<unk>':
[['dogs', '<unk>'], ['<unk>', 'sleep']]
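For reference, the nested loop can also be written as a nested list comprehension. This is only a sketch with the hypothetical name `replace_oov_sketch`; it is meant to behave the same way on the example above.

```python
def replace_oov_sketch(tokenized_sentences, closed_vocab, unknown_token="<unk>"):
    # A set gives constant-time membership tests, as in the original function.
    vocab = set(closed_vocab)
    return [[token if token in vocab else unknown_token for token in sentence]
            for sentence in tokenized_sentences]

replaced = replace_oov_sketch([["dogs", "run"], ["cats", "sleep"]], ["dogs", "sleep"])
print(replaced)
```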


#### Defining process_data function

Finding tokens that appear at least N times in the training data.  
Replacing tokens that appear fewer than N times with "\<unk>" in both the training and test data.

**Inputs** :  
- *train_data*: a list of lists of tokens as strings  
- *test_data*: a list of lists of tokens as strings  
- *count_threshold*: minimum number of occurrences for a word to be in the closed vocabulary

**Outputs** :  
- *train_data_replaced*: training data with low-frequency words replaced by "\<unk>"  
- *test_data_replaced*: test data with low-frequency words replaced by "\<unk>"  
- *closed_vocab*: vocabulary of words that appear N times or more in the training data

In [12]:
def process_data(train_data, test_data, count_threshold):

    closed_vocab = get_words_with_nplus_frequency(train_data, count_threshold)
    
    train_data_replaced = replace_oov_words_by_unk(train_data, closed_vocab, unknown_token="<unk>")
    
    test_data_replaced = replace_oov_words_by_unk(test_data, closed_vocab, unknown_token="<unk>")
    
    return train_data_replaced, test_data_replaced, closed_vocab

**Testing the function**

In [13]:
tmp_train = [['sky', 'is', 'high', '.'],['grass', 'is', 'green']]
tmp_test = [['roses', 'are', 'red', '.']]
tmp_train_repl, tmp_test_repl, tmp_vocab = process_data(tmp_train, tmp_test, count_threshold = 1)

print("tmp_train_repl")
print(tmp_train_repl)
print()
print("tmp_test_repl")
print(tmp_test_repl)
print()
print("tmp_vocab")
print(tmp_vocab)

tmp_train_repl
[['sky', 'is', 'high', '.'], ['grass', 'is', 'green']]

tmp_test_repl
[['<unk>', '<unk>', '<unk>', '.']]

tmp_vocab
['sky', 'is', 'high', '.', 'grass', 'green']


#### Processing the train and test data sets

In [14]:
minimum_freq = 2
train_data_processed, test_data_processed, closed_vocab = process_data(train_data, test_data, minimum_freq)
print("First processed training sample:")
print(train_data_processed[0])
print()
print("First processed test sample:")
print(test_data_processed[0])
print()
print("First 10 vocabulary:")
print(closed_vocab[0:10])
print()
print("Size of vocabulary:", len(closed_vocab))

First processed training sample:
['when', 'he', 'entered', 'the', 'household', 'of', 'his', '<unk>', 'and', 'benefactor', ',', 'yefim', 'petrovitch', 'polenov', ',', 'he', 'gained', 'the', 'hearts', 'of', 'all', 'the', 'family', ',', 'so', 'that', 'they', 'looked', 'on', 'him', 'quite', 'as', 'their', 'own', 'child']

First processed test sample:
['who', 'knows', ',', 'he', 'may', 'be', 'of', 'use', 'and', 'make', 'his', 'own', 'career', ',', 'too']

First 10 vocabulary:
['when', 'he', 'entered', 'the', 'household', 'of', 'his', 'and', 'benefactor', ',']

Size of vocabulary: 7298


## 2. Developing n-gram based language model

#### Defining count_n_grams function

Counting all n-grams in the data.

**Inputs** :  
- *data*: a list of lists of tokens as strings  
- *n*: number of words in a sequence  
- *start_token*: a token to indicate the beginning of the sentence  
- *end_token*: a token to indicate the end of the sentence  

**Outputs** :  
- *n_grams*: a dictionary that maps a tuple of n-words to its frequency

In [15]:
def count_n_grams(data, n, start_token='<s>', end_token = '<e>'):

    n_grams = {}

    for sentence in data:
        sentence = [start_token] * n + sentence + [end_token]
        sentence = tuple(sentence)
        for i in range(len(sentence)-n+1):
            n_gram = sentence[i:i+n]
            if n_gram in n_grams.keys():
                n_grams[n_gram] += 1
            else:
                n_grams[n_gram] = 1

    return n_grams

**Testing the function**

In [16]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
print("Uni-gram:")
print(count_n_grams(sentences, 1))
print("Bi-gram:")
print(count_n_grams(sentences, 2))

Uni-gram:
{('<s>',): 2, ('i',): 1, ('like',): 2, ('a',): 2, ('cat',): 2, ('<e>',): 2, ('this',): 1, ('dog',): 1, ('is',): 1}
Bi-gram:
{('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2, ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1, ('dog', 'is'): 1, ('is', 'like'): 1}
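Note the padding choice visible in the output: every sentence is prefixed with n start tokens, so `('<s>',)` is counted for unigrams and `('<s>', '<s>')` for bigrams. A compact sketch of the same counting with `collections.Counter` follows; the name `count_n_grams_sketch` is an assumption used only for illustration.

```python
from collections import Counter

def count_n_grams_sketch(data, n, start_token="<s>", end_token="<e>"):
    counts = Counter()
    for sentence in data:
        # Pad with n start tokens and one end token, as count_n_grams does.
        padded = tuple([start_token] * n + list(sentence) + [end_token])
        # Slide a window of width n over the padded sentence.
        counts.update(padded[i:i + n] for i in range(len(padded) - n + 1))
    return counts

sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unigrams = count_n_grams_sketch(sentences, 1)
bigrams = count_n_grams_sketch(sentences, 2)
print(bigrams[('like', 'a')], bigrams[('<s>', '<s>')])
```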


#### Defining estimate_probability function

Estimating the probability of the next word from the n-gram counts, using add-k smoothing (k-smoothing).

**Inputs** :  
- *word*: next word  
- *previous_n_gram*: a sequence of words of length n  
- *n_gram_counts*: a dictionary of counts of n-grams  
- *n_plus1_gram_counts*: a dictionary of counts of (n+1)-grams  
- *vocabulary_size*: number of words in the vocabulary  
- *k*: positive constant, smoothing parameter

**Outputs** :  
- *probability*: probability of word given the prior 'n' words using the n-gram counts
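Written as a formula, the estimate computed by the function below appears to be the usual add-k form. The notation is an assumption made here for illustration: $C(\cdot)$ is a count taken from the dictionaries, $w_{t-n}^{t-1}$ is the previous n-gram, and $|V|$ is *vocabulary_size*.

$$
\hat{P}(w_t \mid w_{t-n}^{t-1}) = \frac{C(w_{t-n}^{t-1},\, w_t) + k}{C(w_{t-n}^{t-1}) + k\,|V|}
$$

With $k > 0$, an (n+1)-gram that never occurs in the training data still receives a small non-zero probability.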

In [17]:
def estimate_probability(word, previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):

    previous_n_gram = tuple(previous_n_gram)
    previous_n_gram_count = n_gram_counts.get(previous_n_gram,0)
    denominator = previous_n_gram_count + (vocabulary_size*k)

    n_plus1_gram = previous_n_gram + (word,)
    n_plus1_gram_count = n_plus1_gram_counts.get(n_plus1_gram,0)
    numerator = n_plus1_gram_count + k

    probability = numerator/denominator

    return probability

**Testing the function**

In [18]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
tmp_prob = estimate_probability("cat", ["a"], unigram_counts, bigram_counts, len(unique_words), k=1)

print(f"The estimated probability of word 'cat' given the previous n-gram 'a' is: {tmp_prob:.4f}")

The estimated probability of word 'cat' given the previous n-gram 'a' is: 0.3333


#### Defining estimate_probabilities function

Looping over all words in vocabulary to calculate probabilities for all possible words.

**Inputs** :  
- *previous_n_gram*: a sequence of words of length n  
- *n_gram_counts*: a dictionary of counts of n-grams  
- *n_plus1_gram_counts*: a dictionary of counts of (n+1)-grams  
- *vocabulary*: list of words  
- *k*: positive constant, smoothing parameter

**Outputs** :  
- *probabilities*: a dictionary mapping from next words to the probability

In [19]:
def estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0):

    previous_n_gram = tuple(previous_n_gram)
    
    vocabulary = vocabulary + ["<e>", "<unk>"]
    vocabulary_size = len(vocabulary)
    
    probabilities = {}
    for word in vocabulary:
        probability = estimate_probability(word, previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=k)
        probabilities[word] = probability

    return probabilities

**Testing the function**

In [20]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
estimate_probabilities(["a"], unigram_counts, bigram_counts, unique_words, k=1)

{'is': 0.09090909090909091,
 'dog': 0.09090909090909091,
 'cat': 0.2727272727272727,
 'i': 0.09090909090909091,
 'a': 0.09090909090909091,
 'like': 0.09090909090909091,
 'this': 0.09090909090909091,
 '<e>': 0.09090909090909091,
 '<unk>': 0.09090909090909091}
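The probability of 'cat' is 0.2727 here but 0.3333 in the `estimate_probability` test. The difference seems to come only from the vocabulary size: `estimate_probabilities` appends "<e>" and "<unk>", so the denominator uses 9 words instead of 7. A small arithmetic check, with the counts copied by hand from the test outputs above:

```python
from fractions import Fraction

k = 1
count_a = 2        # ('a',) occurs twice in the unigram counts
count_a_cat = 2    # ('a', 'cat') occurs twice in the bigram counts

# estimate_probability was called with the 7 unique words only.
p_direct = Fraction(count_a_cat + k, count_a + k * 7)
# estimate_probabilities adds "<e>" and "<unk>", giving a vocabulary of 9.
p_looped = Fraction(count_a_cat + k, count_a + k * 9)
# Any of the other 8 words, none of which was seen after 'a'.
p_other = Fraction(0 + k, count_a + k * 9)

total = p_looped + 8 * p_other
print(float(p_direct), float(p_looped), total)
```

The nine smoothed probabilities sum to one, which is what makes the dictionary a proper distribution over next words.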

#### Defining make_count_matrix function

Presenting n-gram counts as count matrix.

**Inputs** :  
- *n_plus1_gram_counts*: a dictionary of counts of (n+1)-grams  
- *vocabulary*: list of words  

**Outputs** :  
- *count_matrix*: count matrix

In [21]:
def make_count_matrix(n_plus1_gram_counts, vocabulary):

    vocabulary = vocabulary + ["<e>", "<unk>"]
    
    # obtaining unique n-grams
    n_grams = []
    for n_plus1_gram in n_plus1_gram_counts.keys():
        n_gram = n_plus1_gram[0:-1]
        n_grams.append(n_gram)
    n_grams = list(set(n_grams))
    
    # mapping from n-gram to row
    row_index = {n_gram:i for i, n_gram in enumerate(n_grams)}
    # mapping from next word to column
    col_index = {word:j for j, word in enumerate(vocabulary)}
    
    nrow = len(n_grams)
    ncol = len(vocabulary)
    count_matrix = np.zeros((nrow, ncol))
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        n_gram = n_plus1_gram[0:-1]
        word = n_plus1_gram[-1]
        if word not in vocabulary:
            continue
        i = row_index[n_gram]
        j = col_index[word]
        count_matrix[i, j] = count
    
    count_matrix = pd.DataFrame(count_matrix, index=n_grams, columns=vocabulary)
    
    return count_matrix
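The matrix built above is dense: one row per n-gram and one column per vocabulary word. With a closed vocabulary of several thousand words (7298 in the processing step earlier), most entries would be zero. As a hedged alternative, the sketch below stores only the non-zero counts; `make_sparse_counts` is a hypothetical helper, not part of the notebook.

```python
from collections import Counter, defaultdict

def make_sparse_counts(n_plus1_gram_counts, vocabulary):
    # Hypothetical helper: maps each context (n-gram) to a Counter of next words.
    allowed = set(vocabulary) | {"<e>", "<unk>"}
    rows = defaultdict(Counter)
    for n_plus1_gram, count in n_plus1_gram_counts.items():
        context, word = n_plus1_gram[:-1], n_plus1_gram[-1]
        if word in allowed:  # same filter as make_count_matrix
            rows[context][word] += count
    return rows

# Bigram counts copied from the count_n_grams test output.
bigram_counts = {('<s>', '<s>'): 2, ('<s>', 'i'): 1, ('i', 'like'): 1, ('like', 'a'): 2,
                 ('a', 'cat'): 2, ('cat', '<e>'): 2, ('<s>', 'this'): 1, ('this', 'dog'): 1,
                 ('dog', 'is'): 1, ('is', 'like'): 1}
vocabulary = ['i', 'like', 'a', 'cat', 'this', 'dog', 'is']
sparse = make_sparse_counts(bigram_counts, vocabulary)
print(dict(sparse[('<s>',)]))
```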

**Testing the function**

In [22]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print('bigram counts')
display(make_count_matrix(bigram_counts, unique_words))
print('\ntrigram counts')
trigram_counts = count_n_grams(sentences, 3)
display(make_count_matrix(trigram_counts, unique_words))

bigram counts


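| | is | dog | cat | i | a | like | this | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(like,)` | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(dog,)` | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(a,)` | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(<s>,)` | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| `(cat,)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| `(is,)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| `(this,)` | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(i,)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |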



trigram counts


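| | is | dog | cat | i | a | like | this | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(dog, is)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |
| `(i, like)` | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(this, dog)` | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(<s>, <s>)` | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| `(is, like)` | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(like, a)` | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(a, cat)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 |
| `(<s>, this)` | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| `(<s>, i)` | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 |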


#### Defining make_probability_matrix function

Calculating the probabilities of each word given the previous n-gram, and storing this in matrix form.

**Inputs** :  
- *n_plus1_gram_counts*: a dictionary of counts of (n+1)-grams  
- *vocabulary*: list of words  
- *k*: positive constant, smoothing parameter

**Outputs** :  
- *prob_matrix*: probability matrix

In [23]:
def make_probability_matrix(n_plus1_gram_counts, vocabulary, k):
    
    # Use the vocabulary argument rather than the global unique_words.
    count_matrix = make_count_matrix(n_plus1_gram_counts, vocabulary)
    count_matrix += k
    prob_matrix = count_matrix.div(count_matrix.sum(axis=1), axis=0)
    
    return prob_matrix

**Testing the function**

In [24]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))
bigram_counts = count_n_grams(sentences, 2)
print("bigram probabilities")
display(make_probability_matrix(bigram_counts, unique_words, k=1))
print("trigram probabilities")
trigram_counts = count_n_grams(sentences, 3)
display(make_probability_matrix(trigram_counts, unique_words, k=1))

bigram probabilities


| | is | dog | cat | i | a | like | this | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(like,)` | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.272727 | 0.090909 | 0.090909 | 0.090909 | 0.090909 |
| `(dog,)` | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(a,)` | 0.090909 | 0.090909 | 0.272727 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 |
| `(<s>,)` | 0.090909 | 0.090909 | 0.090909 | 0.181818 | 0.090909 | 0.090909 | 0.181818 | 0.090909 | 0.090909 |
| `(cat,)` | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.272727 | 0.090909 |
| `(is,)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 |
| `(this,)` | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(i,)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 |


trigram probabilities


| | is | dog | cat | i | a | like | this | `<e>` | `<unk>` |
|---|---|---|---|---|---|---|---|---|---|
| `(dog, is)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 |
| `(i, like)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(this, dog)` | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(<s>, <s>)` | 0.090909 | 0.090909 | 0.090909 | 0.181818 | 0.090909 | 0.090909 | 0.181818 | 0.090909 | 0.090909 |
| `(is, like)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(like, a)` | 0.090909 | 0.090909 | 0.272727 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 |
| `(a, cat)` | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.090909 | 0.272727 | 0.090909 |
| `(<s>, this)` | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 |
| `(<s>, i)` | 0.1 | 0.1 | 0.1 | 0.1 | 0.1 | 0.2 | 0.1 | 0.1 | 0.1 |


## 3. Perplexity

#### Defining calculate_perplexity function

Calculating perplexity for a sentence.

**Inputs** :  
- *sentence*: a list of strings  
- *n_gram_counts*: a dictionary of counts of n-grams  
- *n_plus1_gram_counts*: a dictionary of counts of (n+1)-grams  
- *vocabulary_size*: number of unique words in the vocabulary  
- *k*: positive constant, smoothing parameter

**Outputs** :  
- *perplexity*: perplexity score
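As a formula, the quantity computed by the function below can be written as follows. The notation is an assumption for illustration: the padded sentence $w_0, \dots, w_{N-1}$ consists of $n$ start tokens, the words, and one end token, and $\hat{P}$ is the smoothed estimate from `estimate_probability`.

$$
PP(W) = \left( \prod_{t=n}^{N-1} \frac{1}{\hat{P}(w_t \mid w_{t-n}^{t-1})} \right)^{1/N}
$$

In this implementation $N$ is the length of the padded sentence, so the exponent counts the start tokens even though the product runs over only $N - n$ terms. A lower value means the model finds the sentence less surprising.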

In [25]:
def calculate_perplexity(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):

    n = len(list(n_gram_counts.keys())[0]) 

    sentence = ["<s>"] * n + sentence + ["<e>"]
    sentence = tuple(sentence)
    
    N = len(sentence)

    product_pi = 1.0

    for t in range(n, N):
        n_gram = sentence[t-n:t]
        word = sentence[t]
        probability = estimate_probability(word, n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k)
        product_pi *= (1/probability)

    perplexity = product_pi ** (1/N)

    return perplexity

**Testing the function**

In [26]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

perplexity_train1 = calculate_perplexity(sentences[0], unigram_counts, bigram_counts, len(unique_words), k=1.0)
print(f"Perplexity for first train sample: {perplexity_train1:.4f}")

test_sentence = ['i', 'like', 'a', 'dog']
perplexity_test = calculate_perplexity(test_sentence, unigram_counts, bigram_counts, len(unique_words), k=1.0)
print(f"Perplexity for test sample: {perplexity_test:.4f}")

Perplexity for first train sample: 2.8040
Perplexity for test sample: 3.9654
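Multiplying many inverse probabilities can overflow for long sentences, so a common variant sums log-probabilities instead. The sketch below is a self-contained version under that assumption; `count_n_grams_sketch` and `log_perplexity_sketch` are hypothetical names, and the padding and the value of N are chosen to match `calculate_perplexity`.

```python
import math

def count_n_grams_sketch(data, n, start="<s>", end="<e>"):
    # Same padding convention as count_n_grams: n start tokens, one end token.
    counts = {}
    for sentence in data:
        padded = tuple([start] * n + sentence + [end])
        for i in range(len(padded) - n + 1):
            gram = padded[i:i + n]
            counts[gram] = counts.get(gram, 0) + 1
    return counts

def log_perplexity_sketch(sentence, n_gram_counts, n_plus1_gram_counts, vocabulary_size, k=1.0):
    n = len(next(iter(n_gram_counts)))
    padded = tuple(["<s>"] * n + sentence + ["<e>"])
    N = len(padded)
    log_sum = 0.0
    for t in range(n, N):
        context = padded[t - n:t]
        # Add-k smoothed probability of padded[t] given the context.
        numerator = n_plus1_gram_counts.get(context + (padded[t],), 0) + k
        denominator = n_gram_counts.get(context, 0) + k * vocabulary_size
        log_sum += math.log(numerator / denominator)
    # exp(-(1/N) * sum of log-probabilities) equals the N-th root of the product of inverses.
    return math.exp(-log_sum / N)

sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unigram_counts = count_n_grams_sketch(sentences, 1)
bigram_counts = count_n_grams_sketch(sentences, 2)
pp_train = log_perplexity_sketch(sentences[0], unigram_counts, bigram_counts, 7)
pp_test = log_perplexity_sketch(['i', 'like', 'a', 'dog'], unigram_counts, bigram_counts, 7)
print(f"{pp_train:.4f} {pp_test:.4f}")
```

On the toy data this should agree with the two perplexities printed above, up to floating-point rounding.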


## 4. Building the auto-complete system

#### Defining suggest_a_word function

Computing probabilities for all possible next words and suggesting the most likely one.

**Inputs** :  
- *previous_tokens*: the input sentence as a list of tokens (should contain at least n tokens)  
- *n_gram_counts*: a dictionary of counts of n-grams  
- *n_plus1_gram_counts*: a dictionary of counts of (n+1)-grams  
- *vocabulary*: list of words  
- *k*: positive constant, smoothing parameter  
- *start_with*: if not None, specifies the first few letters of the next word  

**Outputs** :  
- *suggestion*: a string of the most likely next word  
- *max_prob*: corresponding probability

In [27]:
def suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=1.0, start_with=None):
    
    n = len(list(n_gram_counts.keys())[0]) 

    previous_n_gram = previous_tokens[-n:]

    probabilities = estimate_probabilities(previous_n_gram, n_gram_counts, n_plus1_gram_counts, vocabulary, k=k)
    
    suggestion = None
    max_prob = 0
    
    for word, prob in probabilities.items(): 
        if start_with: 
            if not word.startswith(start_with): 
                continue 
        if prob > max_prob:
            suggestion = word
            max_prob = prob
    
    return suggestion, max_prob

**Testing the function**

In [28]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)

previous_tokens = ["i", "like"]
tmp_suggest1 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0)
print(f"The previous words are 'i like',\n\tand the suggested word is `{tmp_suggest1[0]}` with a probability of {tmp_suggest1[1]:.4f}")

print()
tmp_starts_with = 'c'
tmp_suggest2 = suggest_a_word(previous_tokens, unigram_counts, bigram_counts, unique_words, k=1.0, start_with=tmp_starts_with)
print(f"The previous words are 'i like', the suggestion must start with `{tmp_starts_with}`\n\tand the suggested word is `{tmp_suggest2[0]}` with a probability of {tmp_suggest2[1]:.4f}")

The previous words are 'i like',
	and the suggested word is `a` with a probability of 0.2727

The previous words are 'i like', the suggestion must start with `c`
	and the suggested word is `cat` with a probability of 0.0909
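In the second call, 'cat' is the only vocabulary word starting with `c`, so it is returned even though it only has the smoothed minimum probability of 1/11 ≈ 0.0909 after 'like'. The selection step itself can be sketched with `max`; `pick_suggestion` is a hypothetical helper that takes an already computed probability dictionary.

```python
def pick_suggestion(probabilities, start_with=None):
    # Keep only the words that match the optional prefix.
    candidates = {word: prob for word, prob in probabilities.items()
                  if not start_with or word.startswith(start_with)}
    if not candidates:
        return None, 0
    # max returns the first word with the highest probability.
    best = max(candidates, key=candidates.get)
    return best, candidates[best]

# Smoothed probabilities after the context ('like',), as in the bigram table above.
probabilities = {word: 1 / 11 for word in ['is', 'dog', 'cat', 'i', 'like', 'this', '<e>', '<unk>']}
probabilities['a'] = 3 / 11

first = pick_suggestion(probabilities)
second = pick_suggestion(probabilities, start_with='c')
print(first, second)
```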


#### Defining get_suggestions function

Looping over various n-gram models to get multiple suggestions.

**Inputs** :  
- *previous_tokens*: the input sentence as a list of tokens (should contain at least as many tokens as the longest context used)  
- *n_gram_counts_list*: list of n-gram counts each containing a dictionary of counts of n-grams  
- *vocabulary*: list of words  
- *k*: positive constant, smoothing parameter  
- *start_with*: if not None, specifies the first few letters of the next word  

**Outputs** :  
- *suggestions*: a list of the most likely next words  

In [29]:
def get_suggestions(previous_tokens, n_gram_counts_list, vocabulary, k=1.0, start_with=None):
    
    model_counts = len(n_gram_counts_list)
    suggestions = []
    
    for i in range(model_counts-1):
        n_gram_counts = n_gram_counts_list[i]
        n_plus1_gram_counts = n_gram_counts_list[i+1]
        suggestion = suggest_a_word(previous_tokens, n_gram_counts, n_plus1_gram_counts, vocabulary, k=k, start_with=start_with)
        suggestions.append(suggestion)
        
    return suggestions

**Testing the function**

In [30]:
sentences = [['i', 'like', 'a', 'cat'], ['this', 'dog', 'is', 'like', 'a', 'cat']]
unique_words = list(set(sentences[0] + sentences[1]))

unigram_counts = count_n_grams(sentences, 1)
bigram_counts = count_n_grams(sentences, 2)
trigram_counts = count_n_grams(sentences, 3)
quadgram_counts = count_n_grams(sentences, 4)
quintgram_counts = count_n_grams(sentences, 5)

n_gram_counts_list = [unigram_counts, bigram_counts, trigram_counts, quadgram_counts, quintgram_counts]
previous_tokens = ["i", "like"]
tmp_suggest3 = get_suggestions(previous_tokens, n_gram_counts_list, unique_words, k=1.0)

print(f"The previous words are 'i like', the suggestions are:")
display(tmp_suggest3)

The previous words are 'i like', the suggestions are:


[('a', 0.2727272727272727),
 ('a', 0.2),
 ('is', 0.1111111111111111),
 ('is', 0.1111111111111111)]
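The last two suggestions are worth a second look. The input has only two tokens, so for the models that need three or four previous words the context `('i', 'like')` is shorter than n and is never found in the counts. Every word then gets the same smoothed probability 1/9 ≈ 0.1111, and which word is returned depends only on the iteration order of the vocabulary. One possible remedy, sketched here as an assumption and not as part of the notebook, is to left-pad short inputs with start tokens before slicing the context; `pad_context` is a hypothetical helper.

```python
def pad_context(previous_tokens, n, start_token="<s>"):
    # Prepend n start tokens so that the last n tokens always form a full n-gram.
    padded = [start_token] * n + list(previous_tokens)
    return tuple(padded[-n:])

short_context = pad_context(["i", "like"], 3)
exact_context = pad_context(["i", "like"], 2)
long_context = pad_context(["here", "i", "am"], 1)
print(short_context, exact_context, long_context)
```

Because `count_n_grams` pads each training sentence with n start tokens, a context such as `('<s>', 'i', 'like')` can be matched against the trigram counts.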

#### Building the list of n-gram counts

In [31]:
n_gram_counts_list = []
for n in range(1, 6):
    print("Computing n-gram counts with n =", n, "...")
    n_model_counts = count_n_grams(train_data_processed, n)
    n_gram_counts_list.append(n_model_counts)

Computing n-gram counts with n = 1 ...
Computing n-gram counts with n = 2 ...
Computing n-gram counts with n = 3 ...
Computing n-gram counts with n = 4 ...
Computing n-gram counts with n = 5 ...


#### Suggesting multiple words using n-grams of varying length

In [32]:
previous_tokens = ["here", "i", "am"]
tmp_suggest = get_suggestions(previous_tokens, n_gram_counts_list, closed_vocab, k=1.0)

print(f"The previous words are {previous_tokens}, the suggestions are:")
display(tmp_suggest)

The previous words are ['here', 'i', 'am'], the suggestions are:


[('not', 0.010126582278481013),
 ('not', 0.010094556606184513),
 ('again', 0.00027393507738665936),
 ('when', 0.000136986301369863)]