In [58]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs236299-2023-spring/lab2-1.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [59]:
# Initialize Otter
import otter
grader = otter.Notebook()

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$

# Course 236299
## Lab 2-1 – Language modeling with $n$-grams

We turn from tasks that _classify_ texts – mapping texts into a finite set of classes – to tasks that _model_ texts by providing a full probability distribution over texts (or providing a similar scoring metric). Such _language models_ attempt to answer the question "How likely is a token sequence to be generated as an instance of the language?".

We'll start with $n$-gram language models. Given a token sequence $\vect{w} = w_1, w_2, \ldots, w_N$, its probability $\Prob(w_1, w_2, \ldots, w_N)$ can be calculated using the chain rule of probability:

$$\Prob(A, B \given \theta)= \Prob(A \given \theta) \cdot \Prob(B \given A, \theta) $$

Thus, 

\begin{align}
\Prob(w_1, w_2, \ldots, w_N) & = \Prob(w_1) \cdot \Prob(w_2, \ldots, w_N \given w_1) \\
& = \Prob(w_1) \cdot \Prob(w_2 \given w_1) \cdot \Prob(w_3, \ldots, w_N \given w_1, w_2) \\
& \cdots \\
& = \prod_{i=1}^N \Prob (w_i \given w_1, \cdots, w_{i-1}) \\
& \approx \prod_{i=1}^N \Prob (w_i \given w_{i-n+1}, \cdots, w_{i-1})\tag{1}
\end{align}

In the last step, we replace the probability $\Prob (w_i \given w_1, \cdots, w_{i-1})$, which conditions $w_i$ on _all_ of the preceding tokens, with $\Prob (w_i \given w_{i-n+1}, \cdots, w_{i-1})$, which conditions $w_i$ only on the $n-1$ preceding tokens. We call the $n-1$ preceding tokens ($w_{i-n+1}, \cdots, w_{i-1}$) the _context_ and $w_i$ the _target_. Taken together, these $n$ tokens form an $n$-gram, hence the term _$n$-gram model_.
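
To make the context/target split concrete, here is a minimal sketch (not part of the lab's solution code) that lists the (context, target) pairs a trigram model would score for a short token sequence. The helper name `context_target_pairs` is hypothetical, introduced only for this illustration.

```python
def context_target_pairs(tokens, n):
    """Returns the (context, target) pairs scored by an n-gram model.

    The first n-1 tokens lack a full context and are skipped.
    """
    return [(tuple(tokens[i - n + 1 : i]), tokens[i])
            for i in range(n - 1, len(tokens))]

pairs = context_target_pairs(["<s>", "SAM:", "i", "am", "sam", "."], 3)
for context, target in pairs:
    print(context, "->", target)
```

Each printed line corresponds to one factor in the product of Equation (1); with `n = 1` every context is the empty tuple.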

In this lab you'll work with $n$-gram models: generating them, sampling from them, and scoring held-out texts according to them. We'll find some problems with $n$-gram models as language models:

1. They are profligate with memory.
2. They are sensitive to very limited context.
3. They don't generalize well across similar words.

In the next lab, we'll explore neural models to address these failings.

New bits of Python used for the first time in the _solution set_ for this lab, and which you may therefore find useful:

* [`itertools.product`](https://docs.python.org/3.8/library/itertools.html#itertools.product)
* [`list`](https://docs.python.org/3/library/functions.html#func-list)
* [`tuple`](https://docs.python.org/3/library/functions.html#func-tuple)

# Preparation – Loading packages and data

In [60]:
import itertools
import math
import random
import re
import wget

from collections import defaultdict, Counter
from sys import getsizeof
import nltk

nltk.download('punkt', quiet=True) # this module is used to tokenize the text

# Set random seeds
SEED = 1234
random.seed(SEED)

In [61]:
# Some utilities to manipulate the corpus

def preprocess(text):
    """Strips #comments and empty lines from a string
    """
    result = []
    for line in text.split("\n"):
        line = line.strip()              # trim whitespace
        line = re.sub('#.*$', '', line)  # trim comments
        if line != '':                   # drop blank lines
            result.append(line)
    return result

def nltk_normpunc_tokenize(str):
    return nltk.tokenize.word_tokenize(str.lower())


def geah_tokenize(lines):
    """Specialized tokenizer for GEaH corpus handling speaker IDs"""
    result = []
    for line in lines:
        # tokenize
        tokens = nltk_normpunc_tokenize(line)
        # revert the speaker ID token
        if tokens[0] == "sam":
            tokens[0] = "SAM:"
        elif tokens[0] == "guy":
            tokens[0] = "GUY:"
        else:
            raise ValueError("format problem - bad speaker ID")
        # add a start of sentence token
        result += ["<s>"] + tokens
    return result
                    
def postprocess(tokens):
    """Converts `tokens` to a string with one sentence per line"""
    return ' '.join(tokens)\
              .replace("<s> ", "\n")

# Read the GEaH data and preprocess into training and test streams of tokens
geah_filename = ("https://github.com/nlp-236299/data/raw/master/Seuss/"
                 "seuss - 1960 - green eggs and ham.txt")
os.makedirs('data', exist_ok=True)
wget.download(geah_filename, out="data/")

def split(list, portions, offset):
    """Splits `list` into a "large" and a "small" part, returning them as a pair.
    
    The two parts are formed by partitioning `list` into `portions` disjoint pieces.
    The small part is the piece at index `offset`; the large part is the remainder.
    """
    return ([list[i] for i in range(0, len(list)) if i % portions != offset],
            [list[i] for i in range(0, len(list)) if i % portions == offset])

with open("data/seuss - 1960 - green eggs and ham.txt", 'r') as fin:
    lines = preprocess(fin.read())
    train_lines, test_lines = split(lines, 12, 0)
    train_tokens = geah_tokenize(train_lines)
    test_tokens = geah_tokenize(test_lines)

We've already loaded in the text of _Green Eggs and Ham_ for you and split it (about 90%/10%) into two token sequences, `train_tokens` and `test_tokens`. Here's a preview:

In [62]:
print(train_tokens[:50])
print(postprocess(train_tokens[:50]))

['<s>', 'SAM:', 'i', 'am', 'sam', '.', '<s>', 'SAM:', 'sam', 'i', 'am', '.', '<s>', 'GUY:', 'that', 'sam-i-am', '!', '<s>', 'GUY:', 'that', 'sam-i-am', '!', '<s>', 'GUY:', 'i', 'do', 'not', 'like', 'that', 'sam-i-am', '!', '<s>', 'SAM:', 'do', 'you', 'like', 'green', 'eggs', 'and', 'ham', '?', '<s>', 'GUY:', 'i', 'do', 'not', 'like', 'them', ',', 'sam-i-am']

SAM: i am sam . 
SAM: sam i am . 
GUY: that sam-i-am ! 
GUY: that sam-i-am ! 
GUY: i do not like that sam-i-am ! 
SAM: do you like green eggs and ham ? 
GUY: i do not like them , sam-i-am


In [63]:
print(test_tokens[:50])
print(postprocess(test_tokens[:50]))

['<s>', 'SAM:', 'i', 'am', 'sam', '.', '<s>', 'GUY:', 'i', 'do', 'not', 'like', 'green', 'eggs', 'and', 'ham', '.', '<s>', 'GUY:', 'not', 'in', 'a', 'box', '.', '<s>', 'SAM:', 'eat', 'them', '!', '<s>', 'GUY:', 'i', 'do', 'not', 'like', 'them', 'with', 'a', 'mouse', '.', '<s>', 'GUY:', 'not', 'in', 'a', 'car', '!', '<s>', 'SAM:', 'in']

SAM: i am sam . 
GUY: i do not like green eggs and ham . 
GUY: not in a box . 
SAM: eat them ! 
GUY: i do not like them with a mouse . 
GUY: not in a car ! 
SAM: in


We extract the vocabulary from the training text.

In [64]:
# Extract vocabulary from dataset
vocabulary = list(set(train_tokens))
print(vocabulary)

['am', '?', '!', 'ham', ',', 'there', 'goat', 'may', 'mouse', 'could', 'in', 'eat', 'a', 'try', 'they', 'see', 'tree', 'fox', 'if', 'eggs', 'with', 'sam-i-am', 'good', '<s>', 'you', '.', 'me', 'rain', 'would', 'do', 'thank', 'anywhere', 'so', 'green', 'dark', 'or', 'on', '...', 'i', 'like', 'car', 'train', 'and', 'let', 'GUY:', 'boat', 'that', 'say', 'sam', 'house', 'them', 'box', 'the', 'here', 'will', 'be', 'not', 'SAM:', 'are']


# Generating $n$-grams

The $n$-grams in a text are the contiguous subsequences of $n$ tokens. (We'll implement them as Python tuples.) In theory, any sequence of $n$ tokens is a potential $n$-gram type. Let's generate a list of all the possible $n$-gram types over a vocabulary. (Notice how the type/token distinction is useful for talking about $n$-grams, just as it is for words.)
<!--
BEGIN QUESTION
name: all_ngrams
-->

In [65]:
#TODO
def all_ngrams(vocabulary, n):
    """Returns a list of all `n`-long *tuples* of elements of the `vocabulary`.
    
    For instance,  
        >>> all_ngrams(["one", "two"], 3)
        [('one', 'one', 'one'),
         ('one', 'one', 'two'),
         ('one', 'two', 'one'),
         ('one', 'two', 'two'),
         ('two', 'one', 'one'),
         ('two', 'one', 'two'),
         ('two', 'two', 'one'),
         ('two', 'two', 'two')]
         
    Order of returned list is not specified or guaranteed.
    When `n` is 0, returns `[()]`.
    """
    return list(itertools.product(vocabulary, repeat=n))

In [66]:
grader.check("all_ngrams")

We can generate a list of all of the $n$-grams (tokens, not types) in a text.

In [67]:
def ngrams(tokens, n):
    """Returns a list of all `n`-gram instances in a list of `tokens`, in order.
    
    For instance, 
    
    >>> ngrams(nltk_normpunc_tokenize('I am Sam! Sam I am.'), 3)
    [('i', 'am', 'sam'),
     ('am', 'sam', '!'),
     ('sam', '!', 'sam'),
     ('!', 'sam', 'i'),
     ('sam', 'i', 'am'),
     ('i', 'am', '.')]
    """
    return [tuple(tokens[i : i + n])
            for i in range(0, len(tokens) - n + 1)]

In [68]:
print (train_tokens[:6])
print (ngrams(train_tokens[:6], 3))

['<s>', 'SAM:', 'i', 'am', 'sam', '.']
[('<s>', 'SAM:', 'i'), ('SAM:', 'i', 'am'), ('i', 'am', 'sam'), ('am', 'sam', '.')]


# Counting $n$-grams

We conceptualize an $n$-gram as having two parts:

* The _context_ is the first $n-1$ tokens in the $n$-gram.
* The _target_ is the final token in the $n$-gram.

An $n$-gram language model specifies a probability for each $n$-gram type. We'll implement a model as a 2-D dictionary, indexed first by context and then by target, providing the probability for the $n$-gram.
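
As a small illustration of that layout, here is a hand-built toy model stored as a nested dictionary. The tokens and probabilities are made up for the example and are not taken from the lab's data.

```python
# A hand-built toy bigram model; the probabilities are made up for illustration.
toy_bigram_model = {
    ("i",): {"am": 0.5, "do": 0.5, "sam": 0.0},
    ("am",): {"am": 0.0, "do": 0.0, "sam": 1.0},
}

# Index first by context (a tuple of n-1 tokens), then by target token.
p = toy_bigram_model[("i",)]["am"]
print(p)

# For a unigram model the only context is the empty tuple.
toy_unigram_model = {(): {"i": 0.4, "am": 0.3, "sam": 0.3}}
print(toy_unigram_model[()]["sam"])
```

Using tuples as contexts keeps the same indexing scheme for every $n$, including $n = 1$.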

We start by generating a similar data structure for counting up the $n$-grams in a token sequence.

In [69]:
def ngram_counts(vocabulary, tokens, n):
    """Returns a dictionary of counts of the `n`-grams in `tokens`.
    
    The dictionary is structured with first index by (n-1)-gram context
    and second index by the final target token.
    """
    context_dict = defaultdict(lambda: defaultdict(int))
    # zero all ngrams
    for context in all_ngrams(vocabulary, n - 1):
        for target in vocabulary:
            context_dict[context][target] = 0
    # add counts for attested ngrams
    for ngram, count in Counter(ngrams(tokens, n)).items():
        context_dict[ngram[:-1]][ngram[-1]] = count
    return context_dict

Use the `ngram_counts` function to generate count data structures for unigrams, bigrams, and trigrams for the _Green Eggs and Ham_ training text.
<!--
BEGIN QUESTION
name: ngram_counts
-->

In [70]:
#TODO
unigram_counts = ngram_counts(vocabulary, train_tokens, 1)
bigram_counts = ngram_counts(vocabulary, train_tokens, 2)
trigram_counts = ngram_counts(vocabulary, train_tokens, 3)

In [71]:
grader.check("ngram_counts")

Check your work by examining the number of unigram, bigram, and trigram types stored in each structure. Do the numbers make sense?

In [72]:
# Calculate the token count and the number of stored unigram, bigram, and trigram types
token_count = len(train_tokens)
unigram_count = sum(len(unigram_counts[cntxt]) for cntxt in unigram_counts)
bigram_count = sum(len(bigram_counts[cntxt]) for cntxt in bigram_counts)
trigram_count = sum(len(trigram_counts[cntxt]) for cntxt in trigram_counts)               

# Report on the totals
print(f"Tokens:   {token_count:6}\n"
      f"Unigrams: {unigram_count:6}\n"
      f"Bigrams:  {bigram_count:6}\n"
      f"Trigrams: {trigram_count:6}")

Tokens:     1145
Unigrams:     59
Bigrams:    3481
Trigrams: 205379
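
A quick check of these figures, as a sketch that takes the vocabulary size of 59 from the unigram line above: because `ngram_counts` zero-fills every possible $n$-gram, the numbers reported are counts of stored $n$-gram _types_, which grow as $|V|^n$, rather than counts of $n$-gram tokens in the text.

```python
vocabulary_size = 59  # size of the vocabulary, as reported for unigrams above

# Number of possible n-gram types for n = 1, 2, 3
type_counts = {n: vocabulary_size ** n for n in (1, 2, 3)}
print(type_counts)  # {1: 59, 2: 3481, 3: 205379}

# By contrast, a text of T tokens contains only T - n + 1 n-gram tokens.
token_count = 1145  # number of training tokens, as reported above
ngram_token_counts = {n: token_count - n + 1 for n in (1, 2, 3)}
print(ngram_token_counts)  # {1: 1145, 2: 1144, 3: 1143}
```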


## Calculating $n$-gram probabilities

We can convert the counts into a probability model by _normalizing_ the counts. Given an $n$-gram type $w_1, w_2, \ldots, w_n$, instead of storing the count $\cnt{w_1, w_2, \ldots, w_n}$, we store an estimate of the probability 

\begin{align*}
  \Pr(w_n \given w_1, w_2, \ldots, w_{n-1})
  & \approx \frac{\cnt{w_1, w_2, \ldots, w_n}}{\cnt{w_1, w_2, \ldots, w_{n-1}}} \\
  & = \frac{\cnt{w_1, w_2, \ldots, w_n}}{\sum_{w'} \cnt{w_1, w_2, \ldots, w_{n-1}, w'}}
\end{align*}

that is, the ratio of the count of the $n$-gram and the sum of the counts of all $n$-grams with the same context. Fortunately, all of those counts are already stored in the count data structures we've already built. 
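
Here is the normalization worked through for a single context, as a minimal sketch with made-up counts (the numbers are not from the training text).

```python
# Toy counts for the targets following one context; the numbers are made up.
counts_for_context = {"am": 3, "do": 1, "sam": 0}

# The denominator is the sum of the counts over all targets for this context.
total = sum(counts_for_context.values())

# Relative-frequency estimate of Pr(target | context)
probs_for_context = {target: count / total
                     for target, count in counts_for_context.items()}
print(probs_for_context)  # {'am': 0.75, 'do': 0.25, 'sam': 0.0}
```

Note that a context that never occurs has a total of zero, so a full implementation has to decide what to do in that case instead of dividing.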

Write a function that takes an $n$-gram count data structure and returns an $n$-gram probability data structure. As with the counts, the probabilities should be stored indexed first by context and then by target.
<!--
BEGIN QUESTION
name: ngram_model
-->

In [73]:
#TODO
# Build a structure like the `ngram_counts` dictionary, but storing
# probabilities instead of counts.
def ngram_model(ngram_counts):
    """Returns an n-gram probability model calculated by normalizing the 
       provided `ngram_counts` dictionary
    """
    prob_dict = defaultdict(lambda: defaultdict(int))
    for cntxt in ngram_counts:
        # total count of the context, summed over all of its targets
        sum_ngram = sum(ngram_counts[cntxt].values())
        for target, count in ngram_counts[cntxt].items():
            # contexts never seen in training keep probability 0 for every target
            prob_dict[cntxt][target] = count / sum_ngram if sum_ngram != 0 else 0
    return prob_dict

In [74]:
grader.check("ngram_model")

We can now build some $n$-gram models – unigram, bigram, and trigram – based on the counts.

In [75]:
unigram_model = ngram_model(unigram_counts)
bigram_model = ngram_model(bigram_counts)
trigram_model = ngram_model(trigram_counts)

# Space considerations

For the most part, we aren't too concerned about matters of time or space efficiency, though these are crucial issues in the engineering of NLP systems. But the size of $n$-gram models merits consideration, looking especially at their size as $n$ grows. We can use Python's [`sys.getsizeof`](https://docs.python.org/3/library/sys.html#sys.getsizeof) function to get a rough sense of the size of the models we've been working with.

In [76]:
print(f"Tokens:   {getsizeof(train_tokens):6}\n"
      f"Unigrams: {getsizeof(unigram_model):6}\n"
      f"Bigrams:  {getsizeof(bigram_model):6}\n"
      f"Trigrams: {getsizeof(trigram_model):6}")

Tokens:    10328
Unigrams:    240
Bigrams:    2280
Trigrams: 147560
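
Keep in mind that `sys.getsizeof` reports only the shallow size of the outer object, not the nested dictionaries, tuples, and floats it refers to, so the figures above understate the true footprint. The following is a rough sketch of a recursive estimate; the helper name `deep_getsizeof` is hypothetical, and the result is still only an approximation.

```python
import sys

def deep_getsizeof(obj, seen=None):
    """Roughly estimates the total size in bytes of `obj` and everything it holds.

    Follows dicts, lists, tuples, and sets; each object is counted once.
    """
    if seen is None:
        seen = set()
    if id(obj) in seen:
        return 0
    seen.add(id(obj))
    size = sys.getsizeof(obj)
    if isinstance(obj, dict):
        size += sum(deep_getsizeof(k, seen) + deep_getsizeof(v, seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(deep_getsizeof(item, seen) for item in obj)
    return size

nested = {("a", "b"): {"c": 0.5, "d": 0.5}}
print(sys.getsizeof(nested), deep_getsizeof(nested))
```

Since `defaultdict` is a subclass of `dict`, the same function could be applied to the models built above.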


<!-- BEGIN QUESTION -->

**Question:** What do these sizes tell you about the memory usage of $n$-gram models? With a larger vocabulary of, say, 10,000 words, would it be practical to run, say, 5-gram models on your laptop?
<!--
BEGIN QUESTION
name: open_response_sizes
manual: true
-->

---
The memory required is a function of the vocabulary size raised to the power of $n$, since any word in the vocabulary can appear in each position of an $n$-gram. 
Thus we expect the memory usage of an $n$-gram model to increase exponentially as $n$ increases.

With a larger vocabulary of 10,000 words, it **wouldn't** be practical to run a 5-gram model on a laptop, since the memory usage would be far too high.
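
As a rough back-of-the-envelope check, here is the arithmetic for a densely stored model like the ones built in this lab. The figure of 8 bytes per stored probability is an assumption, and all dictionary overhead is ignored, so the real requirement would be larger still.

```python
vocabulary_size = 10_000
n = 5

# One stored probability per possible 5-gram type
entries = vocabulary_size ** n
print(f"{entries:.0e} entries")

# Assumption: 8 bytes per probability, ignoring all dictionary overhead
bytes_needed = entries * 8
print(f"{bytes_needed / 10**18:.0f} exabytes")
```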

---



<!-- END QUESTION -->



# Sampling from an $n$-gram model

We have cleverly constructed the models to index by context. This allows us to sample a word given its context. For instance, in the trigram context `("<s>", "SAM:")`, the following probability distribution captures which words can come next and with what probability:

In [77]:
trigram_model[("<s>", "SAM:")]

defaultdict(int,
            {'am': 0.0,
             '?': 0.0,
             '!': 0.0,
             'ham': 0.0,
             ',': 0.0,
             'there': 0.0,
             'goat': 0.0,
             'may': 0.0,
             'mouse': 0.0,
             'could': 0.08823529411764706,
             'in': 0.029411764705882353,
             'eat': 0.029411764705882353,
             'a': 0.11764705882352941,
             'try': 0.08823529411764706,
             'they': 0.0,
             'see': 0.0,
             'tree': 0.0,
             'fox': 0.0,
             'if': 0.0,
             'eggs': 0.0,
             'with': 0.0,
             'sam-i-am': 0.0,
             'good': 0.0,
             '<s>': 0.0,
             'you': 0.14705882352941177,
             '.': 0.0,
             'me': 0.0,
             'rain': 0.0,
             'would': 0.2647058823529412,
             'do': 0.029411764705882353,
             'thank': 0.0,
             'anywhere': 0.0,
             'so': 0.029411764705882353,


We can sample a single token according to this probability distribution. Here's one way to do so.

In [78]:
def sample(model, context):
    """Returns a token sampled from the `model` assuming the `context`"""
    distribution = model[context]
    prob_remaining = random.random()
    for token, prob in sorted(distribution.items()):
        if prob_remaining < prob:
            return token
        else:
            prob_remaining -= prob
    raise ValueError
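
For comparison, the standard library can do the weighted draw directly. The sketch below uses `random.choices`; the function name `sample_with_choices` and the toy distribution are hypothetical. It consumes the random stream differently from `sample`, so it would not reproduce the sampled outputs shown in this lab, and it does not handle a distribution whose probabilities are all zero.

```python
import random

def sample_with_choices(distribution):
    """Returns a token drawn from `distribution`, a dict from tokens to probabilities."""
    tokens = list(distribution.keys())
    weights = list(distribution.values())
    # random.choices returns a list of k draws; take the single element
    return random.choices(tokens, weights=weights, k=1)[0]

toy_distribution = {"would": 0.5, "you": 0.3, "could": 0.2}
random.seed(1234)
draws = [sample_with_choices(toy_distribution) for _ in range(5)]
print(draws)
```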

We can extend the sampling to a sequence of words by updating the context as we sample each word.

Define a function `sample_sequence` that performs this sampling of a sequence. It's given a model and a starting context and begins by sampling the first token based on the starting context, then updates the starting context to reflect the word just sampled, repeating the process until a specified number of tokens have been sampled.

> Hint: You might find function [`list`](https://docs.python.org/3/library/functions.html#func-list) helpful for converting immutable tuples to lists, and conversely [`tuple`](https://docs.python.org/3/library/functions.html#func-tuple) helpful for converting lists to tuples.
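
The context update itself is a sliding window over the most recent tokens. As one possible way to picture it (a sketch, not the approach the lab requires), a `collections.deque` with a `maxlen` drops the oldest token automatically. The helper name `sliding_contexts` is hypothetical.

```python
from collections import deque

def sliding_contexts(tokens, start_context):
    """Yields the context in effect before each token in `tokens` is appended."""
    # maxlen keeps only the most recent len(start_context) tokens
    context = deque(start_context, maxlen=len(start_context))
    for token in tokens:
        yield tuple(context)
        context.append(token)

for ctx in sliding_contexts(["i", "am", "sam"], ("<s>", "SAM:")):
    print(ctx)
```

With an empty starting context, as for a unigram model, every yielded context is the empty tuple.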

<!--
BEGIN QUESTION
name: sample_sequence
-->

In [79]:
#TODO
def sample_sequence(model, start_context, count=100):
    """Returns a sequence of `count` tokens sampled successively
       from the `model` *following the `start_context`*.
       The length of the returned list should be `count+len(start_context)`.
    """
    random.seed(SEED) # for reproducibility, do not change
    sampled_sequence = list(start_context)
    context_length = len(start_context)
    for _ in range(count):
        # the context is the last `context_length` tokens (empty for unigrams)
        context = tuple(sampled_sequence[-context_length:]) if context_length else ()
        sampled_sequence.append(sample(model, context))
    return sampled_sequence

In [80]:
grader.check("sample_sequence")

Let's try it.

In [81]:
print(postprocess(sample_sequence(unigram_model, ())))

would anywhere ! tree with i like . not 
, on SAM: i i 

. ! do would , fox could i . i GUY: ham in do SAM: ? with box eggs ! do ! 
could a , i 
ham 
with not would 
GUY: , GUY: sam-i-am 

would 
the , a SAM: SAM: say dark not could say them anywhere not sam-i-am GUY: . 
and . eggs thank do say in in SAM: like sam-i-am 
tree GUY: 
them not or are . a , GUY: ,


In [82]:
print(postprocess(sample_sequence(bigram_model, ("<s>",))))


SAM: sam ! 
SAM: try them , so good , will let me be ! 
GUY: and i would eat them here or there . 
GUY: i do not eat them anywhere . 
GUY: and ham . 
GUY: i do not , would you will eat them ! 
SAM: could not with a train ! 
GUY: i would not like them in the dark . 
GUY: i do not , sam-i-am . 
SAM: would you , sam-i-am . 
SAM: eat them on a train ! 
GUY: and ham !


In [83]:
print(postprocess(sample_sequence(trigram_model, ("<s>", "SAM:"))))


SAM: you may , i will not eat green eggs and ham ? 
GUY: i do not like green eggs and ham . 
GUY: i do not like them anywhere ! 
SAM: say ! 
GUY: and i will eat them in a house . 
GUY: that sam-i-am ! 
GUY: not in a tree ! 
GUY: i do not like them in a tree . 
GUY: not in a box . 
GUY: not in the dark . 
GUY: not in the dark ! 
SAM: would you , in a car !


# Evaluating text according to an $n$-gram model

## The probability metric

The main point of a language model is to assign probabilities (or similar scores) to texts. For $n$-gram models, that's done according to Equation (1) at the start of the lab. Let's implement that. We define a function `probability` that takes a token sequence and an $n$-gram model (and the $n$ of the model as well) and returns the probability of the token sequence  according to the model. It merely multiplies all of the $n$-gram probabilities for all of the $n$-grams in the token sequence.

> Throughout this lab, we ignore the scores of the first $n-1$ tokens as our $n$-gram model cannot score them due to the lack of context. In the next lab you will see how to solve this issue in practice.

In [84]:
def probability(tokens, model, n):
    """Returns the probability of a sequence of `tokens` according to an
       `n`-gram `model`
    """
    score = 1.0
    context = tokens[0:n-1]
    # Ignores the scores of the first n-1 tokens
    for token in tokens[n-1:]:
        prob = model[tuple(context)][token]
        score *= prob
        context = (context + [token])[1:]
    return score

We test it on the test text that we held out from the training text.

In [85]:
print(f"Test probability - unigram: {probability(test_tokens, unigram_model, 1):6e}\n"
      f"Test probability -  bigram: {probability(test_tokens, bigram_model, 2):6e}\n"
      f"Test probability - trigram: {probability(test_tokens, trigram_model, 3):6e}")

Test probability - unigram: 6.404571e-154
Test probability -  bigram: 9.147262e-44
Test probability - trigram: 0.000000e+00


## The negative log probability metric

Yikes, those probabilities are _really small_. Multiplying all those small numbers is likely to lead to underflow. 

To solve the underflow problem, we'll do our usual trick of using negative log probabilities 

$$ - \log_2 \left(\prod_{i=1}^N \Prob (w_i \given w_{i-n+1}, \cdots, w_{i-1})\right)$$

instead of probabilities.
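
The following sketch shows the underflow and the remedy side by side, using 400 hypothetical per-token probabilities of 0.1 each. It relies on the identity that the log of a product is the sum of the logs.

```python
import math

probs = [0.1] * 400  # 400 hypothetical per-token probabilities

# Direct product: underflows to exactly 0.0 in double precision
product = 1.0
for p in probs:
    product *= p
print(product)

# Sum of negative log probabilities: stays comfortably in range
neglog = sum(-math.log2(p) for p in probs)
print(neglog)
```

The true product, $10^{-400}$, is smaller than any positive double-precision number, while its negative log is a perfectly ordinary value of about 1329 bits.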

Define a function `neglogprob` that takes a token sequence and an $n$-gram model (and the $n$ of the model as well) and returns the negative log probability of the token sequence according to the model, calculating it in such a way as to avoid underflow. (You'll want to simplify the formula above before implementing it.)

> Be careful when confronting zero probabilities. Taking `-math.log2(0)` raises a "Math domain error". Instead, you should use `math.inf` (Python's representation of infinity) as the value for the negative log of zero. This accords with our understanding that an impossible event would require infinite bits to specify.

<!--
BEGIN QUESTION
name: neglogprob
-->

In [86]:
#TODO
def neglogprob(tokens, model, n):
    """Returns the negative log probability of a sequence of `tokens`
       according to an `n`-gram `model`
    """
    score = 0
    context = tokens[0:n-1]
    # Ignores the scores of the first n-1 tokens
    for token in tokens[n-1:]:
        prob = model[tuple(context)][token]
        if prob > 0:
            score -= math.log2(prob)
        else:
            # an impossible event costs infinitely many bits
            score += math.inf
        context = (context + [token])[1:]
    return score

In [87]:
grader.check("neglogprob")

We compute the negative log probabilities of the test text using the different models and report on them.

In [88]:
unigram_test_nlp = neglogprob(test_tokens, unigram_model, 1)
bigram_test_nlp = neglogprob(test_tokens, bigram_model, 2)
trigram_test_nlp = neglogprob(test_tokens, trigram_model, 3)

print(f"Test neglogprob - unigram: {unigram_test_nlp:6f}\n"
      f"Test neglogprob -  bigram: {bigram_test_nlp:6f}\n"
      f"Test neglogprob - trigram: {trigram_test_nlp:6f}")

Test neglogprob - unigram: 508.897825
Test neglogprob -  bigram: 142.971496
Test neglogprob - trigram:    inf


There, those numbers seem more manageable. We can even convert the neglogprobs back into probabilities as a sanity check.

In [89]:
print(f"Test probability - unigram: {2 ** (-unigram_test_nlp):6e}\n"
      f"Test probability -  bigram: {2 ** (-bigram_test_nlp):6e}\n"
      f"Test probability - trigram: {2 ** (-trigram_test_nlp):6e}")

Test probability - unigram: 6.404571e-154
Test probability -  bigram: 9.147262e-44
Test probability - trigram: 0.000000e+00


<!-- BEGIN QUESTION -->

**Question:** Why does the bigram model assign a lower neglogprob (that is, a higher probability) to the test text than the unigram model? Why does the trigram model assign a higher neglogprob (lower probability) to the test text than the other models?
<!--
BEGIN QUESTION
name: open_response_ordering
manual: true
-->

---
The unigram model doesn't use context, so each word receives a relatively low probability under the unigram model.

The bigram model, on the other hand, assigns a higher probability because it uses context, which increases the probability of each token compared with the unigram model.

As for the trigram model, we get zero probability (an infinite neglogprob). That is because at least one trigram in the test set never appears in the training set.

---

<!-- END QUESTION -->

## The perplexity metric

Another metric that is commonly used is _perplexity_. Jurafsky and Martin give a definition for perplexity as the "inverse probability of the test set normalized by the number of words":

$$ PP(x_1, x_2, \ldots, x_N) = 
     \sqrt[N]{\frac{1}{\prod_{i=1}^N \Prob (x_i \given x_{i-n+1}, \cdots, x_{i-1})}}
$$

Define a function `perplexity` that takes a token sequence and an $n$-gram model (and the $n$ of the model as well) and returns the perplexity of the token sequence according to the model, calculating it in such a way as to avoid underflow. (By now you're smart enough to realize that you'll want to carry out most of that calculation inside a $\log$.)
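
One way to see how the calculation moves inside the $\log$ is to rewrite the $N$-th root as an exponent and then express everything as a power of 2:

\begin{align*}
PP(x_1, x_2, \ldots, x_N)
  & = \left(\prod_{i=1}^N \Prob (x_i \given x_{i-n+1}, \cdots, x_{i-1})\right)^{-1/N} \\
  & = 2^{-\frac{1}{N} \log_2 \prod_{i=1}^N \Prob (x_i \given x_{i-n+1}, \cdots, x_{i-1})} \\
  & = 2^{\frac{1}{N} \sum_{i=1}^N - \log_2 \Prob (x_i \given x_{i-n+1}, \cdots, x_{i-1})}
\end{align*}

The exponent in the last line is the negative log probability of the sequence divided by $N$, that is, the average number of bits per scored token.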

> Remember that we ignored the scores of the first $n-1$ tokens. What should the number of words `N` be?

> Hint: Use the `neglogprob` function you defined above.

<!--
BEGIN QUESTION
name: perplexity
-->

In [90]:
#TODO
def perplexity(tokens, model, n):
    """Returns the perplexity of a sequence of `tokens` according to an
       `n`-gram `model`
    """
    N = len(tokens) - (n - 1)  # we ignore the first n-1 tokens
    # take the N-th root inside the exponent to avoid overflow on long texts
    return 2 ** (neglogprob(tokens, model, n) / N)

In [91]:
grader.check("perplexity")

We can look at the perplexity of the test sample according to each of the models.

In [92]:
print(f"Test perplexity - unigram: {perplexity(test_tokens, unigram_model, 1):.3f}\n"
      f"Test perplexity -  bigram: {perplexity(test_tokens, bigram_model, 2):.3f}\n"
      f"Test perplexity - trigram: {perplexity(test_tokens, trigram_model, 3):.3f}")

Test perplexity - unigram: 31.761
Test perplexity -  bigram: 2.668
Test perplexity - trigram: inf


A perplexity value of $P$ can be interpreted as a measure of a model's average uncertainty in selecting each word equivalent to selecting among $P$ equiprobable words on average. The bigram model gives a perplexity of less than 3, indicating that at each word in the sentence, the model is acting as if selecting among (slightly less than) three equiprobable words.

For comparison, state-of-the-art $n$-gram language models for more representative English text achieve perplexities of about 250.
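
The equiprobable-words interpretation can be checked directly: a model that assigns uniform probability over $k$ words should have perplexity exactly $k$ on any text. Here is a minimal sketch; the function name `uniform_perplexity` is hypothetical.

```python
import math

def uniform_perplexity(vocabulary_size, num_tokens):
    """Perplexity of `num_tokens` tokens under a uniform model over `vocabulary_size` words."""
    neglogprob = num_tokens * -math.log2(1 / vocabulary_size)
    return 2 ** (neglogprob / num_tokens)

print(uniform_perplexity(4, 10))
```

The length of the text cancels out, which is the point of normalizing by the number of words.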

<!-- BEGIN QUESTION -->

# Smoothing $n$-gram language models

> **This section is more open-ended in nature.**

The models we've been using have lots of zero-probability $n$-grams. Essentially any $n$-gram that doesn't appear in the training text is imputed a probability of zero, which means that any sentence that contains that $n$-gram will also be given a zero probability. Clearly this is not an accurate estimate.

There are many ways to _smooth_ $n$-gram models, just as you smoothed classification models in earlier labs. The simplest is probably add-$\delta$ smoothing. 

$$ \Prob(w_n \given w_1, w_2, \ldots, w_{n-1})
  \approx \frac{\cnt{w_1, w_2, \ldots, w_n} + \delta}{\cnt{w_1, w_2, \ldots, w_{n-1}} + \delta \cdot |V|}
$$
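
Here is the add-$\delta$ arithmetic worked through for one context, as a sketch. The counts are assumptions inferred from the probabilities displayed earlier for the context `("<s>", "SAM:")` (for example, 9/34 ≈ 0.2647 for `'would'`), and the vocabulary size of 59 is the one reported for the training text.

```python
delta = 1
vocabulary_size = 59   # |V|, as reported earlier in the lab
context_count = 34     # inferred count of the context ("<s>", "SAM:")
seen_count = 9         # inferred count of "would" after that context
unseen_count = 0       # any target never observed after that context

def add_delta(count, context_count, delta, vocabulary_size):
    """Add-delta estimate of Pr(target | context) from raw counts."""
    return (count + delta) / (context_count + delta * vocabulary_size)

p_seen = add_delta(seen_count, context_count, delta, vocabulary_size)
p_unseen = add_delta(unseen_count, context_count, delta, vocabulary_size)
print(p_seen, p_unseen)  # 10/93 and 1/93
```

The seen target drops from 9/34 to 10/93, and the probability mass it gives up is spread over the targets that were never observed.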

Another useful method is to interpolate multiple $n$-gram models, for instance, estimating probabilities as an interpolation of trigram, bigram, and unigram models.

$$ \Prob(w_i \given w_1 \ldots w_{i-1}) \approx
     \lambda_2 \Prob(w_i \given w_{i-2}, w_{i-1}) 
     + \lambda_1 \Prob(w_i \given w_{i-1}) 
     + (1 - \lambda_1 - \lambda_2) \Prob(w_i)
$$
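
A minimal sketch of this interpolation over toy models follows. The function name `interpolated_prob`, the toy probabilities, and the chosen weights are all hypothetical; the models are plain nested dictionaries indexed by context and then target, as elsewhere in this lab.

```python
def interpolated_prob(target, context, trigram, bigram, unigram, lambda2, lambda1):
    """Linear interpolation of trigram, bigram, and unigram estimates.

    `context` is the pair (w_{i-2}, w_{i-1}); the unigram model receives
    the remaining weight 1 - lambda1 - lambda2.
    """
    p3 = trigram.get(context, {}).get(target, 0.0)
    p2 = bigram.get(context[1:], {}).get(target, 0.0)
    p1 = unigram.get((), {}).get(target, 0.0)
    return lambda2 * p3 + lambda1 * p2 + (1 - lambda1 - lambda2) * p1

# Toy models with made-up probabilities
toy_trigram = {("i", "am"): {"sam": 1.0, "ham": 0.0}}
toy_bigram = {("am",): {"sam": 0.5, "ham": 0.5}}
toy_unigram = {(): {"sam": 0.25, "ham": 0.75}}

p_ham = interpolated_prob("ham", ("i", "am"), toy_trigram, toy_bigram,
                          toy_unigram, lambda2=0.5, lambda1=0.25)
p_sam = interpolated_prob("sam", ("i", "am"), toy_trigram, toy_bigram,
                          toy_unigram, lambda2=0.5, lambda1=0.25)
print(p_ham, p_sam)  # 0.3125 0.6875
```

Because the weights sum to one and each component is a distribution, the interpolated probabilities still sum to one over the targets.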

Finally, a method called _backoff_ uses higher-order $n$-gram probabilities where available, "backing off" to lower order where necessary.

$$
\Prob(w_i \given w_1 \ldots w_{i-1}) \approx \begin{cases}
    \Prob(w_i \given w_{i-2}, w_{i-1}) & \mbox{if $\Prob(w_i \given w_{i-2}, w_{i-1}) > 0$}\\
    \Prob(w_i \given w_{i-1})          & \mbox{if $\Prob(w_i \given w_{i-2}, w_{i-1}) = 0$ and $\Prob(w_i \given w_{i-1}) > 0$}\\
    \Prob(w_i)                         & \mbox{otherwise}
  \end{cases}
$$
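
The case analysis above can be sketched as follows, again over toy models with made-up probabilities. The function name `backoff_prob` is hypothetical.

```python
def backoff_prob(target, context, trigram, bigram, unigram):
    """Backs off from trigram to bigram to unigram estimates.

    `context` is the pair (w_{i-2}, w_{i-1}).
    """
    p3 = trigram.get(context, {}).get(target, 0.0)
    if p3 > 0:
        return p3
    p2 = bigram.get(context[1:], {}).get(target, 0.0)
    if p2 > 0:
        return p2
    return unigram.get((), {}).get(target, 0.0)

# Toy models with made-up probabilities
toy_trigram = {("i", "am"): {"sam": 1.0, "ham": 0.0}}
toy_bigram = {("am",): {"sam": 0.5, "ham": 0.5}}
toy_unigram = {(): {"sam": 0.25, "ham": 0.75}}

p_sam = backoff_prob("sam", ("i", "am"), toy_trigram, toy_bigram, toy_unigram)
p_ham = backoff_prob("ham", ("i", "am"), toy_trigram, toy_bigram, toy_unigram)
print(p_sam, p_ham)  # 1.0 0.5
```

In this toy case the two scores sum to 1.5, which shows that the simple scheme as written need not yield a normalized distribution; practical backoff methods discount the higher-order estimates to compensate.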

Define a function `ngram_model_smoothed`, like the `ngram_model` function from above, but implementing one of these smoothing methods. Compare its perplexity on some sample text to the unsmoothed model. 

<!--
BEGIN QUESTION
name: open_response_smoothed_model
manual: true
-->

In [93]:
"""
#TODO
Place your definition of `ngram_model_smoothed` and whatever other testing 
of it you'd like to do in this and subsequent cells.
"""
# We chose the first method, add-delta smoothing.
def ngram_model_smoothed(ngram_counts, delta):
    """Returns an n-gram probability model calculated by add-`delta`
       smoothing of the provided `ngram_counts` dictionary
    """
    prob_dict = defaultdict(lambda: defaultdict(int))
    for cntxt in ngram_counts:
        sum_ngram = sum(ngram_counts[cntxt].values())
        # smoothed denominator; nonzero whenever delta > 0, even for unseen contexts
        denominator = sum_ngram + delta * len(vocabulary)
        for target, count in ngram_counts[cntxt].items():
            prob_dict[cntxt][target] = 0
            if denominator != 0:
                prob_dict[cntxt][target] = (count + delta) / denominator
    return prob_dict

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# Lab debrief

**Question:** We're interested in any thoughts your group has about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# End of Lab 2-1

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [94]:
grader.check_all()