## Tips
- To avoid unpleasant surprises, I suggest you _run all cells in their order of appearance_ (__Cell__ $\rightarrow$ __Run All__).


- If the changes you've made to your solution don't seem to be showing up, try running __Kernel__ $\rightarrow$ __Restart & Run All__ from the menu.


- Before submitting your assignment, make sure everything runs as expected. First, restart the kernel (from the menu, select __Kernel__ $\rightarrow$ __Restart__) and then **run all cells** (from the menu, select __Cell__ $\rightarrow$ __Run All__).

## Reminder

- Make sure you fill in any place that says `YOUR CODE HERE` or "YOUR ANSWER HERE", as well as your name, UA email, and collaborators below:



Several of the cells in this notebook are **read only** to ensure instructions aren't unintentionally altered.  

If you can't edit the cell, it is probably intentional.

In [1]:
NAME = "Kathleen Costa"
# University of Arizona email address
EMAIL = "kathleencosta@arizona.edu"
# Names of any collaborators.  Write N/A if none.
COLLABORATORS = "N/A"

## Scratchpad

You are welcome to create new cells (see the __Cell__ menu) to experiment and debug your solution.

In [2]:
%load_ext autoreload
%autoreload 2

# Mini Python tutorial

This course uses Python 3.11.

Below is a very basic (and incomplete) overview of the Python language... 

For those completely new to Python, [this section of the official documentation may be useful](https://docs.python.org/3.11/library/stdtypes.html#common-sequence-operations).

In [3]:
# This is a comment.  
# Any line starting with # will be interpreted as a comment

# this is a string assigned to a variable
greeting = "hello"

# If enclosed in triple quotes, strings can also be multiline:

"""
I'm a multiline
string.
"""

# let's use a for loop to print it letter by letter
for letter in greeting:
    print(letter)
    
# Did you notice the indentation there?  Whitespace matters in Python!

# here's a list of integers

numbers = [1, 2, 3, 4]

# let's add one to each number using a list comprehension
# and assign the result to a variable called res
# list comprehensions are used widely in Python (they're very Pythonic!)

res = [num + 1 for num in numbers]

# let's confirm that it worked
print(res)

# now let's try spicing things up using a conditional to filter out all values greater than or equal to 3...
print([num for num in res if not num >= 3])

# Python 3.6 introduced "f-strings" as a convenient way of formatting strings using templates
# For example ...
name = "Josuke"

print(f"{greeting}, {name}!")

# f-strings are f-ing convenient!


# let's look at defining functions in Python..

def greet(name):
    print(f"Howdy, {name}!")

# here's how we call it...

greet("partner")

# let's add a description of the function...

def greet(name):
    """
    Prints a greeting given some name.
    
    :param name: the name to be addressed in the greeting
    :type name: str
    
    """
    print(f"Howdy, {name}!")
    
# I encourage you to use docstrings!

# Python introduced support for optional type hints in v3.5.
# You can read more about this feature here: https://docs.python.org/3.7/library/typing.html
# let's give it a try...
def add_six(num: int) -> int:
    return num + 6

# this should print 13
print(add_six(7))

# Python also has "anonymous functions" (also known as "lambda" functions)
# take a look at the following code:

greet_alt = lambda name: print(f"Hi, {name}!")

greet_alt("Fred")

# lambda functions are often passed to other functions
# For example, they can be used to specify how a sequence should be sorted
# let's sort a list of pairs by their second element
pairs = [("bounce", 32), ("bighorn", 12), ("radical", 4), ("analysis", 7)]
# -1 is last thing in some sequence, -2 is the second to last thing in some seq, etc.
print(sorted(pairs, key=lambda pair: pair[-1]))

# we can sort it by the first element instead
# NOTE: python indexing is zero-based
print(sorted(pairs, key=lambda pair: pair[0]))

# You can learn more about other core data types and their methods here: 
# https://docs.python.org/3.7/library/stdtypes.html

# Because of its extensive standard library, Python is often described as coming with "batteries included".  
# Take a look at these "batteries": https://docs.python.org/3.7/library/

# You now know enough to complete this homework assignment (or at least where to look)

h
e
l
l
o
[2, 3, 4, 5]
[2]
hello, Josuke!
Howdy, partner!
13
Hi, Fred!
[('radical', 4), ('analysis', 7), ('bighorn', 12), ('bounce', 32)]
[('analysis', 7), ('bighorn', 12), ('bounce', 32), ('radical', 4)]


In [4]:
import numpy as np
from typing import Any, Dict, Iterable, List, Text, Tuple, Union
from collections import Counter
from numpy.typing import NDArray

from math import isclose

# Overview

In this assignment, you will ... 

- implement a function to calculate prior probabilities
- implement a function to calculate conditional probabilities
- estimate the probability of sequences

# NumPy

We'll be using [NumPy (**num**erical **Py**thon)](https://numpy.org/) to efficiently tally counts and generate probabilities.   While not part of the standard library, numpy is widely popular among Python users working in data science and machine learning.  If you're new to NumPy, be sure to watch the videos from this unit for an introduction to the library and some of its relevant features.  [You may also want to check out the official tutorial.](https://numpy.org/devdocs/user/absolute_beginners.html)

NumPy is a very fast and efficient library for [manipulating vectors and matrices of numbers](https://en.wikipedia.org/wiki/Array_programming).  In this assignment, we'll just be scratching the surface of its capabilities.
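
As a small illustration of this array-programming style, the sketch below turns a vector of counts into relative frequencies without writing an explicit loop. The counts are made up for the example.

```python
import numpy as np

# hypothetical counts for three terms
counts = np.array([2.0, 1.0, 1.0])

# adding a scalar applies the addition to every element
bumped = counts + 1

# dividing by a scalar is also applied element by element
probs = counts / counts.sum()

print(bumped)
print(probs)
```

Operations between an array and a scalar like these are all the warm-up functions below need.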

As a warm-up, complete the following functions using `NumPy`.

## `array_of_zeros(num_zeros: int)`

Implement a method that creates a 1D vector of zeros based on the provided number of values.

In [5]:
def array_of_zeros(num_zeros: int) -> NDArray[float]:
    """
    Creates a numpy array of zeros.
    """
    # YOUR CODE HERE
    return np.zeros(num_zeros)

In [6]:
# result should be a NumPy ndarray

res = array_of_zeros(3)
assert type(res) == np.ndarray

In [7]:
# result should be a 1D array

res = array_of_zeros(3)
assert res.ndim == 1

In [8]:
# ensure returned array is composed of a sequence of length 3

res = array_of_zeros(3)
assert res.shape[0] == 3

In [9]:
# ensure all values are zeros

res = array_of_zeros(3)
assert all(x == 0 for x in res)

## `add_scalar(vector: NDArray[float], scalar: int)`

Implement a method that adds a scalar value to each element in a numpy array.

In [10]:
def add_scalar(vector: NDArray[float], scalar: int) -> NDArray[float]:
    """
    Takes a NumPy Array and adds a scalar value to each element in the array
    """
    # YOUR CODE HERE
    return vector + scalar

In [11]:
# result should be a NumPy ndarray

res = add_scalar(array_of_zeros(7), 34)
assert type(res) == np.ndarray

In [12]:
# ensure all elements of resulting array are equal to 2

res = add_scalar(array_of_zeros(4), 2)
assert all(x == 2 for x in res)

In [13]:
# ensure result is a 1D array

res = add_scalar(array_of_zeros(4), 2)
assert res.ndim == 1

In [14]:
# ensure length of result is 4

res = add_scalar(array_of_zeros(4), 2)
assert res.shape[0] == 4

## `divide_by_scalar(vector: NDArray[float], scalar: int)`

Implement a method that divides each element in a numpy array by a scalar value.

In [15]:
def divide_by_scalar(vector: NDArray[float], scalar: int) -> NDArray[float]:
    """
    Takes a NumPy Array and divides each element in the array by a scalar value
    """
    # YOUR CODE HERE
    return vector / scalar

In [16]:
res = divide_by_scalar(add_scalar(array_of_zeros(14), 2), 17)
assert type(res) == np.ndarray

In [17]:
# all values should be 2
res = divide_by_scalar(add_scalar(array_of_zeros(3), 4), 2)
assert all(x == 2 for x in res)

In [18]:
# ensure result is a 1D array
res = divide_by_scalar(add_scalar(array_of_zeros(3), 4), 2)
assert res.ndim == 1

In [19]:
res = divide_by_scalar(add_scalar(array_of_zeros(300), 17), 14)
assert res.shape[0] == 300

# Constructing a Vocabulary

Before generating conditional probabilities of higher order _n_-grams, we need to track and order the terms in our documents using a **Vocabulary** class.  The vocabulary class will help us determine which _n_-grams we want to consider and how to handle unseen terms.

## `Vocabulary` class

Though attribute and method names have changed, this is very similar to the class you implemented in Unit 4.  You may reuse your solution where possible.  Unlike the `Vocabulary` class you previously implemented, this version ...

- is constructed from a sequence of tokens and optionally applies a count threshold when considering whether or not a term should be included in the vocabulary.
- `id_for(self, term: str)` returns the ID for `Vocabulary.UNKNOWN` if `term` is out of vocabulary (OOV).
- includes a new method `empty_vector` for you to implement
- includes an alternative constructor `from_sentences`.  This class-level method is already implemented.
- requires you to filter the terms passed to `create_t2i` by frequency prior to constructing a dictionary
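
To make the count threshold concrete, here is a minimal sketch of how a term-to-ID mapping might be built from raw tokens. The helper name `build_t2i`, the module-level `UNKNOWN` constant, and the sample tokens are assumptions for illustration only; this is one possible approach, not the required implementation.

```python
from collections import Counter
from typing import Dict, Iterable

UNKNOWN = "<UNK>"

def build_t2i(terms: Iterable[str], min_count: int = 1) -> Dict[str, int]:
    """Hypothetical helper: maps each sufficiently frequent term to an ID."""
    counts = Counter(terms)
    # keep only terms seen at least min_count times, in alphabetical order
    kept = sorted(term for term, count in counts.items() if count >= min_count)
    # ID 0 is reserved for the unknown symbol
    t2i = {UNKNOWN: 0}
    for i, term in enumerate(kept, start=1):
        t2i[term] = i
    return t2i

tokens = ["we", "are", "travelers", "we", "are", "gamblers"]
t2i = build_t2i(tokens, min_count=2)
print(t2i)
# terms below the threshold fall back to the ID of UNKNOWN
print(t2i.get("gamblers", t2i[UNKNOWN]))
```

With `min_count=2`, only "are" and "we" survive the filter, so "travelers" and "gamblers" are treated as out of vocabulary.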



In [20]:
class Vocabulary:
    """
    Stateful vocabulary.
    Provides a mapping from term to ID and a reverse mapping of ID to term.
    """
    # symbol for unknown terms    
    UNKNOWN = "<UNK>"
    
    def __init__(self, terms: Iterable[Text]=[], min_count: int = 1):
        self.t2i: Dict[Text, int] = Vocabulary.create_t2i(terms, min_count=min_count)
        self.i2t: Dict[int, Text] = Vocabulary.create_i2t(self.t2i)
    
    # see https://www.python.org/dev/peps/pep-0484/#forward-references
    @staticmethod
    def from_sentences(sentences: Iterable[Iterable[Text]], min_count: int = 1) -> "Vocabulary":
        """
        Convenience method 
        for converting a sequence of tokenized sentences to a Vocabulary instance
        """
        return Vocabulary(terms=[term for sentence in sentences for term in sentence], min_count=min_count)
    
    def id_for(self, term: Text) -> int:
        """
        Looks up ID for term using self.t2i.  
        If the feature is unknown, returns ID of Vocabulary.UNKNOWN.
        """
        # YOUR CODE HERE
        return self.t2i.get(term, self.t2i[Vocabulary.UNKNOWN])
        
    def term_for(self, term_id: int) -> Union[Text, None]:
        """
        Looks up term corresponding to term_id.  
        If term_id is unknown, returns None.
        """
        # YOUR CODE HERE
        return self.i2t.get(term_id, None)
    
    @property
    def terms(self) -> List[Text]:
        return [self.i2t[i] for i in range(len(self.i2t))]
        
    @staticmethod
    def create_t2i(terms: Iterable[Text], min_count: int = 1) -> Dict[Text, int]:
        """
        Takes a flat iterable of terms (i.e., unigrams) and returns a dictionary of term -> int.
        Assumes terms have already been normalized.
        
        If the frequency of a term is less than min_count, 
        do not include the term in the vocabulary
        
        Requirements:
        - First term in vocabulary (ID 0) is reserved for Vocabulary.UNKNOWN.
        - Sort the features alphabetically
        - Only include terms occurring >= min_count
        """
        # terms must be strings
        if not all(isinstance(term, Text) for term in terms):
            raise Exception("terms must be strings")
        # YOUR CODE HERE
        term_counts = Counter(terms)
        filtered_terms = sorted([term for term, count in term_counts.items() if count >= min_count])
        t2i = {Vocabulary.UNKNOWN: 0}
        
        for index, term in enumerate(filtered_terms, start=1):
            t2i[term] = index
        return t2i
    
    @staticmethod
    def create_i2t(t2i: Dict[Text, int]) -> Dict[int, Text]:
        """
        Takes a dict of str -> int and returns a reverse mapping of int -> str.
        """
        return {i:t for (t, i) in t2i.items()}

    def empty_vector(self) -> NDArray[float]:
        """
        Creates an empty numpy array based on the vocabulary of terms
        """
        # YOUR CODE HERE
        return np.zeros(len(self.t2i), dtype=float)
    
    def __len__(self):
        """
        Defines what should happen when `len` is called on an instance of this class.
        """
        return len(self.t2i)
    
    def __contains__(self, other):
        """
        Example:
        
        v = Vocabulary(["I", "am"])
        assert "am" in v
        """
        return other in self.t2i

In [21]:
sentences = [
    ["we", "are", "travelers"],
    ["we", "are", "gamblers"]
]
vocab = Vocabulary.from_sentences(sentences)
res = vocab.empty_vector()

# result should be a numpy array
assert isinstance(res, np.ndarray)
# everything should be a float
assert all(isinstance(elem, float) for elem in res)

In [22]:
sentences = [
    ["we", "are", "travelers"],
    ["we", "are", "gamblers"]
]
vocab = Vocabulary.from_sentences(sentences, min_count=1)
res = vocab.empty_vector()

# vector should be a 1D array
assert len(res.shape) == 1

In [23]:
sentences = [
    ["we", "are", "travelers"],
    ["we", "are", "gamblers"]
]
vocab = Vocabulary.from_sentences(sentences, min_count=2)
res = vocab.empty_vector()

assert len(res) == 3

In [24]:
sentences = [
    ["We", "are", "travelers"],
    ["we", "are", "gamblers"]
]
vocab = Vocabulary.from_sentences(sentences, min_count=2)
res = vocab.empty_vector()

assert len(res) == 2

In [25]:
sentences = [
    ["we", "are", "travelers"],
    ["we", "are", "gamblers"]
]
vocab = Vocabulary.from_sentences(sentences, min_count=4)
res = vocab.empty_vector()

assert len(res) == 1

In [26]:
sentences = [
    ["we", "are", "travelers"],
    ["we", "are", "gamblers"]
]
vocab = Vocabulary.from_sentences(sentences, min_count=1)
res = vocab.empty_vector()

# dimensions of array should be equal to |V|
assert res.shape == (len(vocab),)

# |V| should be 5
assert len(vocab) == 5

# Calculate _n_-grams

In order to calculate conditional probabilities, we'll first need to generate _n_-grams.

## `ngrams_for(n: int, tokens: List[Text])`

Generates a sequence of _n_-grams for the token sequence.  Each _n_-gram is represented as a tuple.

**HINTS**:
- You implemented this in Unit 4 and may reuse your solution here.
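
As a reminder of the idea, the sketch below slides a window of size _n_ over a token list. The helper name `sliding_ngrams` is hypothetical, and the padding shown assumes `<S>` and `</S>` as start and end symbols.

```python
from typing import List, Tuple

def sliding_ngrams(n: int, tokens: List[str]) -> List[Tuple[str, ...]]:
    """Hypothetical helper: every window of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["I", "study", "turtles"]
# pad with n - 1 start and end symbols before windowing (here n = 2)
padded = ["<S>"] + tokens + ["</S>"]

print(sliding_ngrams(2, tokens))
print(sliding_ngrams(2, padded))
```

A sequence shorter than _n_ produces an empty list, because the range of window positions is empty.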

In [27]:
def ngrams_for(
    # the size of the n-gram
    n: int, 
    # a list of tokens
    tokens: List[Text], 
    # whether or not to use the start and end symbols
    use_start_end: bool = True,
    # the symbol to use for the start of the sequence (assuming use_start_end is true)
    start_symbol: str = "<S>",
    # the symbol to use for the end of the sequence (assuming use_start_end is true)
    end_symbol: str = "</S>"
) -> List[Tuple[str, ...]]:
    """
    Generates a list of n-gram tuples for the provided sequence of tokens.
    """
    # YOUR CODE HERE
    if use_start_end:
        tokens = [start_symbol] * (n - 1) + tokens + [end_symbol] * (n - 1)

    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    
    return ngrams

# Calculating prior probabilities

Armed with a bit of NumPy, a `Vocabulary` class, and a means of generating _n_-grams, we can move on to calculating prior probabilities efficiently.

## `prior_probs(tokens: Iterable[Text], vocab: Vocabulary)`

`prior_probs()` takes a sequence of terms (i.e., unigrams) and a `Vocabulary` instance as parameters.  It returns a probability distribution (1D numpy array of floats) containing all terms in the vocabulary.

**HINTS**:
- The length of the 1D array should equal the number of terms in the `Vocabulary` instance
- Each value in the 1D array is a float ranging between 0 and 1
- The sum of the 1D array should be 1
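
As a worked example, assuming the priors are estimated as relative frequencies, the prior for a term $w$ is its count divided by the total number of tokens $N$:

$$P(w) = \frac{\text{count}(w)}{N}$$

For the tokens `[":)", ":)", ":("]`, $N = 3$, so $P(\texttt{:)}) = \frac{2}{3}$, $P(\texttt{:(}) = \frac{1}{3}$, and $P(\texttt{<UNK>}) = \frac{0}{3} = 0$. The three values sum to 1. A token that is not in the vocabulary adds its count to `<UNK>` instead.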

In [28]:
def prior_probs(tokens: Iterable[str], vocab: Vocabulary) -> NDArray[float]:
    """
    Calculates the prior probability for each token,
    given some vocabulary
    """
    # YOUR CODE HERE
    token_counts = Counter(tokens)
    probabilities = np.zeros(len(vocab), dtype=float)
    total_count = sum(token_counts.values())  # Total number of tokens

    unknown_count = 0

    for token, count in token_counts.items():
        term_id = vocab.id_for(token)
        if term_id == vocab.id_for(Vocabulary.UNKNOWN):
            unknown_count += count  # Count unknown terms
        else:
            probabilities[term_id] = count / total_count

    probabilities[vocab.id_for(Vocabulary.UNKNOWN)] = unknown_count / total_count

    return probabilities

In [29]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens, vocab)

# Our Vocabulary should have just two distinct terms plus Vocabulary.UNKNOWN
assert len(res) == 3

In [30]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens, vocab)

assert res[vocab.id_for(Vocabulary.UNKNOWN)] == 0

In [31]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens, vocab)

assert res[0] == 0

In [32]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens, vocab)

# It should not matter that our terms here use just punctuation,
# if we have two of them out of three tokens, the prob is 2/3
assert isclose(res[vocab.id_for(":)")], 0.666, abs_tol=1e-3)

In [33]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens, vocab)

# result should form a probability distribution
assert res.sum() == 1

In [34]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens + ["snorkel"], vocab)

# Our probability result vector should have places for just the two known terms
# plus Vocabulary.UNKNOWN, just like our Vocabulary
assert len(res) == 3

In [35]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens + ["snorkel", "bike", "bike"], vocab)

# We have three unknown tokens out of six in our data
# Be sure you're counting and adding them properly!
assert res[0] == 0.5

In [36]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens + ["snorkel"], vocab)

assert res[vocab.id_for("snorkel")] == 0.25

In [37]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens + ["snorkel"], vocab)

assert res[vocab.id_for(":(")] == 0.25

In [38]:
tokens = [":)", ":)", ":("]
vocab  = Vocabulary(tokens)
res    = prior_probs(tokens + ["snorkel"], vocab)

# result should form a probability distribution
assert res.sum() == 1

# Calculating conditional probabilities

Now that you've practiced generating prior probabilities, it's time to explore conditional probabilities.

## `make_conditional_probs(ngrams: Iterable[NgramType], vocab: Vocabulary)`

`make_conditional_probs()` takes a sequence of _n_-grams and a `Vocabulary` instance as parameters.  It returns a dictionary that maps each _n_-gram's history to a conditional probability distribution over the vocabulary (e.g., $(\text{the}, \text{black}, \text{dog}) \rightarrow P(x \vert \text{the black}) = [\ldots, \ldots, \ldots]$).

**HINTS**:
- Each **key** in the dictionary is a **tuple** of observations
- Each **value** in the dictionary is a **1D numpy array of floats**
  - each array represents a probability distribution of outcomes for $x$ (i.e., the array should sum to 1)
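
The counting step behind these distributions can be sketched as follows. To stay short, this example uses plain dictionaries instead of vocabulary-indexed NumPy arrays, and the bigrams are made up for illustration, so treat it as an outline of the idea rather than the expected solution.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

# bigrams from two toy sentences (no start/end symbols)
bigrams: List[Tuple[str, str]] = [
    ("I", "study"), ("study", "turtles"),
    ("I", "can"), ("can", "swim"),
]

# count how often each outcome follows each history
counts: Dict[Tuple[str, ...], Counter] = defaultdict(Counter)
for gram in bigrams:
    history, outcome = gram[:-1], gram[-1]
    counts[history][outcome] += 1

# normalize each history's counts into relative frequencies
cond: Dict[Tuple[str, ...], Dict[str, float]] = {
    history: {w: c / sum(outcomes.values()) for w, c in outcomes.items()}
    for history, outcomes in counts.items()
}

print(cond[("I",)])
```

Each history is a tuple (here of length 1), which is why the dictionary is indexed with `("I",)` and not `"I"`.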

In [39]:
# ex. [("I", "am"), ("she", "is")]
NgramType = Tuple[str, ...]

def make_conditional_probs(ngrams: Iterable[NgramType], vocab: Vocabulary) -> Dict[Tuple[Text, ...], NDArray[float]]:
    """
    Takes a sequence of n-grams and a vocabulary
    Returns a dictionary of conditional probability distributions for each n-gram's history
    """
    # YOUR CODE HERE
    history_counts = {}

    for ngram in ngrams:
        history = ngram[:-1] 
        outcome = ngram[-1]   

        if history not in history_counts:
            history_counts[history] = Counter()

        history_counts[history][outcome] += 1

    conditional_probs = {}

    for history, outcome_counter in history_counts.items():
        total_count = sum(outcome_counter.values())  
        probabilities = np.zeros(len(vocab), dtype=float)  

        for outcome, count in outcome_counter.items():
            outcome_id = vocab.id_for(outcome)
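            # accumulate rather than assign: several out-of-vocabulary outcomes may share the ID of Vocabulary.UNKNOWN
            probabilities[outcome_id] += count / total_count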

        conditional_probs[history] = probabilities

    return conditional_probs

In [40]:
tokens  = ["Sea", "slugs", "don", "'t", "sneeze"]
ngrams  = ngrams_for(n=2, tokens=tokens)
vocab   = Vocabulary(tokens)
res     = make_conditional_probs(ngrams=ngrams, vocab=vocab)

# check return type
assert isinstance(res, dict)

In [41]:
tokens  = ["Sea", "slugs", "don", "'t", "sneeze"]
ngrams  = ngrams_for(n=2, tokens=tokens)
vocab   = Vocabulary(tokens)
res     = make_conditional_probs(ngrams=ngrams, vocab=vocab)

# check return type
assert all(isinstance(k, tuple) for k in res.keys())

In [42]:
tokens  = ["Sea", "slugs", "don", "'t", "sneeze"]
ngrams  = ngrams_for(n=2, tokens=tokens)
vocab   = Vocabulary(tokens)
res     = make_conditional_probs(ngrams=ngrams, vocab=vocab)

# check return type
for v in res.values():
    assert isinstance(v, np.ndarray), f"{v} is not an ndarray"
    # everything should be a float
    assert all(isinstance(elem, float) for elem in v)

In [43]:
sentences = [
    ["I", "study", "turtles"],
    ["I", "study", "cats", "and", "turtles"],
    ["I", "can", "speak", "the", "language", "of", "turtles"]
]
ngrams  = [gram for sent in sentences for gram in ngrams_for(n=2, tokens=sent)]
vocab   = Vocabulary.from_sentences(sentences)
res     = make_conditional_probs(ngrams=ngrams, vocab=vocab)

# $p(can|I) \approx 0.3333$
assert isclose(res[("I",)][vocab.id_for("can")], 0.3333, abs_tol=1e-4)

In [44]:
sentences = [
    ["I", "study", "turtles"],
    ["I", "study", "cats", "and", "turtles"],
    ["I", "can", "speak", "the", "language", "of", "turtles"]
]
ngrams  = [gram for sent in sentences for gram in ngrams_for(n=3, tokens=sent)]
vocab   = Vocabulary.from_sentences(sentences)
res     = make_conditional_probs(ngrams=ngrams, vocab=vocab)

# $p(cats|I study) \approx 0.5$
assert isclose(res[("I", "study")][vocab.id_for("cats")], 0.5, abs_tol=1e-3)

In [45]:
sentences = [
    ["I", "study", "turtles"],
    ["I", "study", "cats", "and", "turtles"],
    ["I", "can", "speak", "the", "language", "of", "turtles"]
]
ngrams  = [gram for sent in sentences for gram in ngrams_for(n=2, tokens=sent)]
vocab   = Vocabulary.from_sentences(sentences)
res     = make_conditional_probs(ngrams=ngrams, vocab=vocab)

# $p(love|I) = 0$
assert res[("I",)][vocab.id_for("love")] == 0

In [46]:
sentences = [
    ["I", "study", "turtles"],
    ["I", "study", "cats", "and", "turtles"],
    ["I", "can", "speak", "the", "language", "of", "turtles"]
]
ngrams  = [gram for sent in sentences for gram in ngrams_for(n=2, tokens=sent)]
vocab   = Vocabulary.from_sentences(sentences)
res     = make_conditional_probs(ngrams=ngrams, vocab=vocab)

# each value in the dictionary should form a probability distribution
assert all(v.sum() == 1 for k,v in res.items())

# _n_-gram language models

Now that you know how to generate the probability of higher order _n_-grams, let's estimate the probability of sequences of tokens using an _n_-gram language model.  For more information on _n_-gram language models, [see Sections 3.0 - 3.1 of the textbook](https://parsertongue.org/readings/slp3/3.pdf#page=1).

## `LanguageModel` class

Implement a class that takes a corpus (`corpus`), determines the model's vocabulary, and computes the probability of text sequences using an $(n-1)$ order Markov assumption (i.e., estimate the probability of a sequence as the product of the probability of each of its _n_-grams for some value _n_). [Check the tutorial](https://parsertongue.org/tutorials/language-models-beginner/#markov-assumption) to refresh your memory on how this works.
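
Written out, and assuming no start or end symbols are added, the $(n-1)$ order Markov assumption approximates the probability of a sequence $w_1, \ldots, w_m$ as:

$$P(w_1, \ldots, w_m) \approx \prod_{i=n}^{m} P(w_i \vert w_{i-n+1}, \ldots, w_{i-1})$$

For a bigram model ($n = 2$), for example, $P(\text{I like turtles}) \approx P(\text{like} \vert \text{I}) \cdot P(\text{turtles} \vert \text{like})$.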

In [47]:
class LanguageModel():
    """
    An _n_-gram language model using an _n_ - 1 order Markov assumption.
    """
    
    def __init__(self, 
                 corpus: Iterable[Iterable[str]], 
                 n=2,
                 min_count=1,
                 use_start_end: bool = False
    ):
        assert n >= 2
        self.n = n
        self.use_start_end = use_start_end
        # though not stored as an instance attribute, we need this temporarily to calculate other attributes
        ngrams: Iterable[Tuple[Text, ...]] = [gram for sentence in corpus
                                              for gram in ngrams_for(n=self.n, tokens=sentence, use_start_end=self.use_start_end)]
        self.vocab: Vocabulary                                   = Vocabulary.from_sentences(corpus, min_count)
        self.pdist: Dict[Tuple[Text, ...], NDArray[float]]       = make_conditional_probs(ngrams, self.vocab)
    
    def cond_prob(self, term: Text, given: Tuple[str, ...]) -> float:
        """
        Calculates the conditional probability for the provided term and the term's context.
        
        P(am|I) = cond_prob(term = "am", given = ("I",))
        """
        # YOUR CODE HERE
        history = given
        if history in self.pdist:
            probabilities = self.pdist[history]
            term_id = self.vocab.id_for(term)
            return probabilities[term_id] if term_id < len(probabilities) else 0.0
        return 0.0
        
    def prob_of(self, tokens: Iterable[Text]) -> float:
        """
        Calculates the probability of a token sequence using an _n_ - 1 order Markov assumption.
        """
        # YOUR CODE HERE
        p = 1.0
        for gram in ngrams_for(n=self.n, tokens=tokens, use_start_end=self.use_start_end):
            next_tok = gram[-1]
            history = gram[:-1]
            # multiply in the probability of each n-gram's final token given its history
            p *= self.cond_prob(next_tok, history)
        return p

In [48]:
lm = LanguageModel(
    corpus=[
        ["I", "like", "turtles"],
        ["I", "like", "horses", "and", "turtles"]
    ],
    n=2
)

# In this naïve language model,
# unseen terms/n-grams will result in 0 probabilities
assert lm.prob_of(["I", "like", "clowns"]) == 0

In [49]:
lm = LanguageModel(
    corpus=[
        ["I", "like", "noodles"],
        ["I", "like", "dumplings", "and", "noodles"]
    ],
    n=2
)

# In our tiny corpus, "I" is **always** followed by "like"
assert lm.prob_of(["I", "like"]) == 1

In [50]:
lm = LanguageModel(
    corpus=[
        ["I", "like", "noodles"],
        ["I", "like", "dumplings", "and", "noodles"]
    ],
    n=3
)

# In our tiny corpus, "I like" is followed by "noodles" half of the time
assert  lm.prob_of(["I", "like", "noodles"]) == 0.5

In [51]:
lm = LanguageModel(
    corpus=[
        ["I", "like", "noodles"],
        ["I", "like", "dumplings", "and", "noodles"]
    ],
    n=3
)

# In our tiny corpus, "I like" is followed by "dumplings" half of the time
assert  lm.prob_of(["I", "like", "dumplings"]) == 0.5

Congratulations!  You've implemented an _n_-gram language model!   The examples we've looked at so far have involved toy datasets.  In order to get better estimates of probabilities, we need larger corpora.  Try training a bigram language model using a book from Project Gutenberg.  See below for an example to get started.

In [52]:
# from requests import get

# url = "http://www.gutenberg.org/cache/epub/35688/pg35688.txt"

# res = get(url)
# content = res.text

# # tokenize content, clean it up, and use it to train a bigram language model
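
One possible way to prepare such a text is sketched below. The helper name `tokenize`, the regular expressions, the lowercasing, and the sentence-splitting rule are all simplifying assumptions rather than a recommended tokenizer, and the sample string stands in for the downloaded `content`.

```python
import re
from typing import List

def tokenize(text: str) -> List[List[str]]:
    """Hypothetical helper: splits text into lowercased, tokenized sentences."""
    # naive sentence split on ., !, or ? followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    # keep runs of letters (optionally with an internal apostrophe) as tokens
    tokenized = [re.findall(r"[a-z]+(?:'[a-z]+)?", sent.lower()) for sent in sentences]
    # drop sentences that produced no tokens
    return [toks for toks in tokenized if toks]

sample = "I like turtles. I don't like clowns!"
print(tokenize(sample))
```

The resulting list of token lists has the same shape as the toy corpora used above, so it could be passed as the `corpus` argument of `LanguageModel`.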

# Bonus:  Better estimates for probabilities (max 5 points)
Our naïve language model assigns a probability of 0 to any sequence that contains an unseen term or _n_-gram.  As a consequence, perfectly grammatical strings can end up with a probability of zero.  

Common solutions to this problem involve **smoothing** or a form of **backoff**.  
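
As one example of smoothing, the sketch below applies add-one (Laplace) smoothing to raw counts for a single history. The counts and the vocabulary are made up for illustration, and the function name `add_one_prob` is hypothetical.

```python
from collections import Counter

# hypothetical counts of words observed after the history ("the",)
outcome_counts = Counter({"cat": 1, "dog": 1})
vocab = ["<UNK>", "cat", "dog", "farted", "purred", "the"]

def add_one_prob(term: str) -> float:
    """Add-one (Laplace) estimate: (count + 1) / (total + |V|)."""
    total = sum(outcome_counts.values())
    return (outcome_counts[term] + 1) / (total + len(vocab))

print(add_one_prob("cat"))     # seen after "the"
print(add_one_prob("purred"))  # never seen after "the", but no longer 0
```

Because 1 is added to every count and $|V|$ to the total, the smoothed values for a given history still sum to 1 across the vocabulary.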

### Task
- Redefine the `LanguageModel` class and add a new method called `def smoothed_prob_of(self, tokens: Iterable[str]) -> float` and/or `def prob_of(self, tokens: Iterable[str], backoff: bool) -> float`.  
- Implement a smoothing or backoff algorithm of your choice.  [See Sections 3.3-3.6 of the textbook for examples of smoothing and backoff algorithms](https://parsertongue.org/readings/slp3/3.pdf#page=9).  Alternatively, you may invent your own.
- Add at least two tests

In [53]:
# YOUR CODE HERE
class LanguageModel():
    """
    An n-gram language model with optional smoothing and backoff.
    """
    def __init__(self, 
                 corpus: Iterable[Iterable[str]], 
                 n=2,
                 min_count=1,
                 use_start_end: bool = False
    ):
        assert n >= 2
        self.n = n
        self.use_start_end = use_start_end
        self.vocab: Vocabulary = Vocabulary.from_sentences(corpus, min_count)
        # Store conditional distributions for every order from 1 to n,
        # so that backoff has shorter histories to fall back on.
        # Unigram probabilities are stored under the empty history ().
        self.pdist: Dict[Tuple[Text, ...], NDArray[float]] = {}
        for order in range(1, self.n + 1):
            ngrams: List[Tuple[Text, ...]] = [gram for sentence in corpus
                                              for gram in ngrams_for(n=order, tokens=sentence, use_start_end=self.use_start_end)]
            self.pdist.update(make_conditional_probs(ngrams, self.vocab))
    
    def cond_prob(self, term: Text, given: Tuple[str, ...], smoothing: bool = False) -> float:
        term_id = self.vocab.id_for(term)
        vocab_size = len(self.vocab)
        if given not in self.pdist:
            # unseen history: uniform under smoothing, otherwise 0
            return 1.0 / vocab_size if smoothing else 0.0
        probabilities = self.pdist[given]
        if smoothing:
            # add-one style smoothing, applied to the relative frequencies
            # (which sum to 1) because raw counts are not stored
            return (probabilities[term_id] + 1) / (probabilities.sum() + vocab_size)
        return probabilities[term_id]

    def smoothed_prob_of(self, tokens: Iterable[str]) -> float:
        p = 1.0  
        for gram in ngrams_for(n=self.n, tokens=tokens, use_start_end=self.use_start_end):
            next_tok = gram[-1]
            history = gram[:-1]
            p *= self.cond_prob(next_tok, history, smoothing=True)  # Use smoothing
        return p

    def prob_of(self, tokens: Iterable[str], backoff: bool = False) -> float:
        p = 1.0  
        for gram in ngrams_for(n=self.n, tokens=tokens, use_start_end=self.use_start_end):
            next_tok = gram[-1]
            history = gram[:-1]
            prob = self.cond_prob(next_tok, history)
            # Back off to progressively shorter histories (down to the unigram).
            # NOTE: no discounting is applied, so the result is a score, not a normalized probability.
            while backoff and prob == 0 and history:
                history = history[1:]
                prob = self.cond_prob(next_tok, history)
            p *= prob  
        return p
    
    

corpus = [["the", "cat", "purred"], ["the", "dog", "farted"]]
lm = LanguageModel(corpus, n=2)

smoothed_prob = lm.smoothed_prob_of(["the", "cat"])
print(f"Smoothed probability of ['the', 'cat']: {smoothed_prob}")

backoff_prob = lm.prob_of(["the", "cat"], backoff=True)
print(f"Probability of ['the', 'cat'] with backoff: {backoff_prob}")



Smoothed probability of ['the', 'cat']: 0.21428571428571427
Probability of ['the', 'cat'] with backoff: 0.5


Explain your approach here.

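`smoothed_prob_of` applies add-one style smoothing, so an unseen _n_-gram receives a small nonzero probability instead of 0. When `backoff=True`, `prob_of` falls back to a shorter history whenever an _n_-gram has a probability of 0 (e.g., from a bigram to a unigram). Both options keep a single unseen _n_-gram from forcing the probability of the whole sequence to 0.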