In [1]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone https://github.com/cs236299-2020/lab2-3.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [2]:
# Initialize Otter
import otter
grader = otter.Notebook()

# Lab 2-3 – Language modeling with neural networks

In lab 2-1, you built and tested $n$-gram language models. Recall that some problems with $n$-gram language models are:

1. They are profligate with memory.
2. They are sensitive to very limited context.
3. They don't generalize well across similar words.

As promised, in this lab, you'll explore neural models to address these failings. You will:

1. Build and test a neural $n$-gram language model.
2. Build and test a neural RNN language model.
3. Use language models for classification.

## Setup

%%latex
\newcommand{\vect}[1]{\mathbf{#1}}
\newcommand{\cnt}[1]{\sharp(#1)}
\newcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\newcommand{\softmax}{\operatorname{softmax}}
\newcommand{\Prob}{\Pr}
\newcommand{\given}{\,|\,}

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$

In [3]:
import json
import math
import random

import torch
import torchtext

In [4]:
# Set random seeds
SEED = 1234
torch.manual_seed(SEED)
random.seed(SEED)

# GPU check, sets runtime type to "GPU" where available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print (device)

cuda


The corpus used throughout this lab is the Federalist papers. We've trained and provided neural language models on papers authored by Hamilton and Madison, respectively, which we download here.

In [5]:
# Download data
shell("wget -nv -N -P data https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/federalist_data_raw2.json")
dataset = json.load(open('data/federalist_data_raw2.json'))

# Download vocabulary
shell("wget -nv -N -P data https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/text_field.pt")
TEXT = torch.load('data/text_field.pt')

# Download pretrained language models (LM)
# Feedforward LM, Hamilton
shell("wget -nv -N -P data https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/ffnn_lm_h.pt")
# Feedforward LM, Madison
shell("wget -nv -N -P data https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/ffnn_lm_m.pt")
# RNN LM, Hamilton
shell("wget -nv -N -P data https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/rnn_lm_h.pt")
# RNN LM, Madison
shell("wget -nv -N -P data https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/rnn_lm_m.pt")









First, let's split the dataset into training, validation, and test sets. Since we have provided pretrained models, we are not using the training set in this lab. In the homework assignments, you will have opportunities to train models yourself.

For this lab, we use a test set `testing`, which is the same as we used in lab 1-2. But for the validation set, we have separate ones for papers authored by Hamilton (`validation_hamilton`) and papers authored by Madison (`validation_madison`).

In [6]:
# Split training, validation, and test sets
TRAIN_RATIO = 0.9
# Extract the papers of unknown authorship
testing = list(filter(lambda ex: ex['authors'] == 'Hamilton or Madison',
                      dataset))
# Change gold labels in-place
for ex in testing:
  ex['authors'] = 'Madison'

# Extract the papers by Madison
dataset_madison = list(filter(lambda ex: ex['authors']=='Madison', dataset))
random.seed(SEED)
random.shuffle(dataset_madison)
training_size_madison = int(math.floor(TRAIN_RATIO * len(dataset_madison)))
validation_madison = dataset_madison[training_size_madison:]

# Extract the papers by Hamilton
dataset_hamilton = list(filter(lambda ex: ex['authors']=='Hamilton', dataset))
random.seed(SEED)
random.shuffle(dataset_hamilton)
training_size_hamilton = int(math.floor(TRAIN_RATIO * len(dataset_hamilton)))
validation_hamilton = dataset_hamilton[training_size_hamilton:]

# We only consider the first 200 tokens of each document for speed
def truncate(s, k=200):
  for document in s:
    document['tokens'] = document['tokens'][:k]
truncate(validation_madison)
truncate(validation_hamilton)
truncate(testing)

print (f"Madison Validation Size: {len(validation_madison)} documents\n"
       f"Hamilton Validation Size: {len(validation_hamilton)} documents")

Madison Validation Size: 3 documents
Hamilton Validation Size: 6 documents


Note that, unlike in labs 1-2 and 1-3, here we consider _all_ word types in the data. Let's look at an example:

In [7]:
print (f"Example (Madison): {validation_madison[0]['tokens']}\n\n"
       f"Example (Hamilton): {validation_hamilton[0]['tokens']}")

Example (Madison): ['it', 'is', 'not', 'a', 'little', 'remarkable', 'that', 'in', 'every', 'case', 'reported', 'by', 'ancient', 'history', ',', 'in', 'which', 'government', 'has', 'been', 'established', 'with', 'deliberation', 'and', 'consent', ',', 'the', 'task', 'of', 'framing', 'it', 'has', 'not', 'been', 'committed', 'to', 'an', 'assembly', 'of', 'men', ',', 'but', 'has', 'been', 'performed', 'by', 'some', 'individual', 'citizen', 'of', 'preeminent', 'wisdom', 'and', 'approved', 'integrity', '.', 'minos', ',', 'we', 'learn', ',', 'was', 'the', 'primitive', 'founder', 'of', 'the', 'government', 'of', 'crete', ',', 'as', 'zaleucus', 'was', 'of', 'that', 'of', 'the', 'locrians', '.', 'theseus', 'first', ',', 'and', 'after', 'him', 'draco', 'and', 'solon', ',', 'instituted', 'the', 'government', 'of', 'athens', '.', 'lycurgus', 'was', 'the', 'lawgiver', 'of', 'sparta', '.', 'the', 'foundation', 'of', 'the', 'original', 'government', 'of', 'rome', 'was', 'laid', 'by', 'romulus', ',', 'a

## The $n$-gram feedforward network

In lab 2-1, you built an $n$-gram language model using a lookup table. However, that model assigns zero probability to any $n$-gram that doesn't appear in the training text (without smoothing). In this lab, we consider a neural-network-based approach, which can address this issue.

Recall that in $n$-gram language modeling, we made the assumption that the probability of a word only depends on its previous $n-1$ words:

\begin{align*}
\Prob(x_1, x_2, \ldots, x_M) & = \Prob(x_1) \cdot \Prob(x_2, \ldots, x_M\given x_1) \\
& = \Prob(x_1) \cdot \Prob(x_2 \given x_1) \cdot \Prob(x_3, \ldots, x_M \given x_1, x_2) \\
& \cdots \\
& = \prod_{i=1}^M \Prob (x_i \given x_1, \cdots, x_{i-1}) \\
& \approx \prod_{i=1}^M \Prob (x_i \given x_{i-n+1}, \cdots, x_{i-1}),
\end{align*}

and we used the empirical frequencies to estimate these conditional probabilities:

$$
\Pr (x_i \given x_{i-n+1}, \cdots, x_{i-1})= \frac{\cnt{x_{i-n+1}, \cdots, x_{i-1}, x_i}}{\sum_{x'} \cnt{x_{i-n+1}, \cdots, x_{i-1}, x'}}
$$

We can immediately see the problem with using a large $n$: the numerator would be 0 for any $n$-grams unseen in the training data.
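
To make the zero-count problem concrete, here is a minimal sketch of the count-based estimate on a made-up toy corpus. The helper `ngram_mle` and the corpus are illustrative assumptions, not the implementation from lab 2-1.

```python
from collections import Counter

def ngram_mle(tokens, n):
    """Returns a function computing the count-based estimate Pr(word | context)."""
    ngram_counts = Counter(tuple(tokens[i:i + n])
                           for i in range(len(tokens) - n + 1))
    # The denominator: how often each (n-1)-word context was followed by any word
    context_counts = Counter()
    for ngram, count in ngram_counts.items():
        context_counts[ngram[:-1]] += count

    def prob(word, context):
        context = tuple(context)
        if context_counts[context] == 0:
            return 0.0  # unseen context: the ratio is undefined, so we return 0 here
        return ngram_counts[context + (word,)] / context_counts[context]

    return prob

# Toy corpus, made up for illustration
toy_corpus = "the people of the union and the people of the states".split()
prob = ngram_mle(toy_corpus, n=3)

seen = prob("union", ("of", "the"))     # "of the union" occurs in the corpus
unseen = prob("senate", ("of", "the"))  # "of the senate" never occurs
print(seen, unseen)
```

Any trigram absent from the corpus receives probability zero, no matter how plausible it is, and a single zero factor makes the probability of a whole document zero.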

One way of solving this issue is to use a "smoother" function: we parameterize the conditional probabilities using a neural network:

$$
\Pr (x_i \given x_{i-n+1}, \cdots, x_{i-1})= f(x_{i-n+1}, \cdots, x_{i-1}),
$$

where $f$ is a function returning a vector of size $V$ ($V$ being the vocabulary size). The $j$-th element of the returned vector stores the probability of generating the $j$-th word in the vocabulary.

To parameterize $f$, we can use a feedforward neural network. To convert word ids to numeric values, we map each word type in the vocabulary to a learnable vector called an _embedding_ of size `embedding_size`. 

Why do we represent words with such embeddings? To answer this question, let's consider two alternative representations: (1) word indices and (2) set-of-words (which we used in lab 1-1). (We cannot directly use the strings themselves because they are of varying lengths.) A desirable word representation system should be such that the similarity of words can be reflected in the closeness of word representations (ideally, if two words have similar meaning and syntactic function, they should have similar representations, in order to alleviate the burden of learning such similarities by the rest of the model).  For option (1),  closeness in terms of word indices is meaningless: the 365-th word in the vocabulary is probably not more similar to the 366-th word than it is to other words, since the assignment of index in the vocabulary is arbitrary. For option (2), two different word types always have orthogonal vector representations, but we hope that similar words can be placed near each other (at least we don't want to eliminate that possibility from the beginning).
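
The contrast between the orthogonal representation and dense embeddings can be sketched with cosine similarity. The three-word vocabulary and the two-dimensional vectors below are hand-picked assumptions for illustration only; they are not learned embeddings.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm

def one_hot(index, size):
    """Vector with a single 1 at position `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]

vocab = ["senate", "congress", "banana"]
one_hot_vectors = {w: one_hot(i, len(vocab)) for i, w in enumerate(vocab)}

# Hand-picked 2-d vectors, purely illustrative (not learned)
dense_vectors = {
    "senate": [0.9, 0.1],
    "congress": [0.8, 0.2],
    "banana": [-0.1, 0.9],
}

# Distinct one-hot vectors are always orthogonal
one_hot_sim = cosine(one_hot_vectors["senate"], one_hot_vectors["congress"])
# Dense vectors can place related words close together
dense_sim_close = cosine(dense_vectors["senate"], dense_vectors["congress"])
dense_sim_far = cosine(dense_vectors["senate"], dense_vectors["banana"])
print(one_hot_sim, dense_sim_close, dense_sim_far)
```

With one-hot vectors every pair of distinct words has similarity zero, so the rest of the model gets no hint that two words behave alike; dense vectors leave that possibility open.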

Therefore, we use an embedding, a vector representation for each word type in the vocabulary, which has been separately learned in a manner that has been shown to cluster similar words together. There are many such embeddings; the particular embedding we'll use is _word2vec_, a mapping from words to vectors of embedding size 128 trained with the skip-gram objective, in which each word is used to predict the words in its surrounding context. If you are interested in more details, you should read the original [word2vec paper](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf). For our purposes, we can treat the embedding as just given to us.

Now let's get back to the parameterization of $f(x_{i-n+1}, \cdots, x_{i-1})$. We first map each word in $\langle x_{i-n+1}, \cdots, x_{i-1}\rangle$ to its embedding $\langle v_{i-n+1}, \cdots, v_{i-1}\rangle$ ($n-1$ vectors of size `embedding_size`), then we concatenate these embeddings to a vector (of size `(n-1)*embedding_size`). Afterwards, we apply a linear projection to project it down to size `hidden_size`, then we apply a nonlinear function, and another linear projection to project to size $V$, followed by a softmax to normalize to probabilities. In this case, the nonlinear function we use is not a sigmoid. Instead, we use a Rectified Linear Unit (ReLU), which is simply a componentwise function that clips negative numbers at zero: 

$$\operatorname{ReLU}(x) = \max(0, x)$$
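
The pipeline just described (embed, concatenate, project, ReLU, project, softmax) can be traced in plain Python. This is only a sketch under assumed toy sizes and random weights; every name in it (`toy_n`, `embedding_table`, `w1`, and so on) is hypothetical, and the lab itself uses PyTorch modules instead.

```python
import math
import random

def linear(x, weights, bias):
    """Computes W x + b, with W stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    """Clips negative components at zero."""
    return [max(0.0, xi) for xi in x]

def softmax(x):
    """Normalizes scores to probabilities (shifted by the max for stability)."""
    m = max(x)
    exps = [math.exp(xi - m) for xi in x]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
# Toy sizes, assumed for illustration (the lab uses n=5 and size 128)
toy_n, toy_vocab_size, toy_embedding_size, toy_hidden_size = 3, 6, 4, 5

embedding_table = random_matrix(toy_vocab_size, toy_embedding_size, rng)
w1 = random_matrix(toy_hidden_size, (toy_n - 1) * toy_embedding_size, rng)
b1 = [0.0] * toy_hidden_size
w2 = random_matrix(toy_vocab_size, toy_hidden_size, rng)
b2 = [0.0] * toy_vocab_size

context_ids = [2, 5]  # ids of the previous n-1 = 2 words
# Look up each embedding and concatenate: size (n-1) * embedding_size
concatenated = [v for word_id in context_ids for v in embedding_table[word_id]]
hidden = relu(linear(concatenated, w1, b1))  # size hidden_size
toy_probs = softmax(linear(hidden, w2, b2))  # size vocab_size, sums to 1
print(len(concatenated), len(hidden), len(toy_probs))
```

The sizes at each stage mirror the description: `(n-1)*embedding_size`, then `hidden_size`, then $V$.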

We use $n=5$ in this lab.

In [8]:
n = 5



<!--
BEGIN QUESTION
name: ffnn_forward_step
-->

Implement the missing part of the `forward_step` function below. This function takes the previous words (the entire previous history, not just the $n$-gram context) as input, and returns the probabilities of generating the next word (the target). The returned value should be a dictionary, with word types as keys and their respective probabilities as values.

In [9]:
class FFNNLM(torch.nn.Module):
  def __init__(self, n, text_field, embedding_size, hidden_size):
    super().__init__()
    self.n = n
    self.text_field = text_field
    self.pad_index = self.text_field.vocab.stoi[self.text_field.pad_token]
    vocab_size = len(self.text_field.vocab)

    # Create modules
    self.embed = torch.nn.Embedding(vocab_size, embedding_size)         # Embedding
    self.sublayer1 = torch.nn.Linear((n-1)*embedding_size, hidden_size) # First layer
    self.sublayer2 = torch.nn.ReLU()                                    # Second layer
    self.hidden2output = torch.nn.Linear(hidden_size, vocab_size)       # Last layer

  def forward_step(self, history_words):
    # Switch to "evaluation" mode
    self.eval()
    # Convert strings to word ids
    context = self.text_field.process([history_words]).to(device) # context_len, 1
    context_len = context.size(0)
    if context_len < self.n-1:
      # Pad to the left if we don't have enough context words
      padding = context.new(self.n-1-context_len, 1).fill_(self.pad_index)
      context = torch.cat([padding, context], 0)
    else:
      # TODO: prepare proper context (the previous n-1 words) from the full history
      context = context[context_len-(self.n-1):]  # keep the last n-1 word ids (uses self.n, not the global n)
    context = context.view(1, -1)              # first dim batch=1, second dim length=n-1
    embeddings = self.embed(context)           # 1, n-1, embedding_size
    embeddings = embeddings.view(1, -1)        # 1, (n-1)*embedding_size
    # TODO: finish feedforward and set logits
    # Logits should be a tensor of size (1, vocab_size)
    # The structure of the network is
    #   embeddings -> sublayer1 -> sublayer2 -> hidden2output -> softmax
    logits = self.hidden2output(self.sublayer2(self.sublayer1(embeddings))) 
    
    # Normalize to get probabilities
    probs = torch.softmax(logits, -1).view(-1) # vocab_size

    # Match probabilities with actual word types
    distribution = {}
    for i, prob in enumerate(probs):
      word = self.text_field.vocab.itos[i]
      distribution[word] = prob.item()
    return distribution
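
The context preparation in `forward_step` (pad on the left when the history is too short, otherwise keep only the last $n-1$ words) can be sketched on plain token lists. The helper `prepare_context` and the literal `"<pad>"` token are assumptions for illustration; the class above works on tensors of word ids and takes the padding token from `text_field`.

```python
def prepare_context(history, n, pad="<pad>"):
    """Returns exactly n-1 tokens: the tail of `history`, left-padded if needed."""
    width = n - 1
    history = list(history)
    if len(history) >= width:
        tail = history[len(history) - width:]  # drop everything but the last n-1
    else:
        tail = history                         # too short: keep all of it
    return [pad] * (width - len(tail)) + tail

print(prepare_context([], 5))
print(prepare_context(["it", "is"], 5))
print(prepare_context(["it", "is", "not", "a", "little", "remarkable"], 5))
```

Slicing from `len(history) - width` rather than with a negative index keeps the degenerate case `n = 1` (an empty context) well behaved.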

Now, let's load the pretrained feedforward language models for Hamilton and Madison. The model `ffnn_lm_madison` was trained on documents authored by Madison, whereas `ffnn_lm_hamilton` was trained on documents authored by Hamilton.

In [10]:
# Create and load feedforward LM for Madison
ffnn_lm_madison = FFNNLM(n, TEXT,
               embedding_size=128, 
               hidden_size=128, 
               ).to(device)
ffnn_lm_madison.load_state_dict(torch.load('data/ffnn_lm_m.pt', map_location=device))

# Create and load feedforward LM for Hamilton
ffnn_lm_hamilton = FFNNLM(n, TEXT,
               embedding_size=128, 
               hidden_size=128, 
               ).to(device)
ffnn_lm_hamilton.load_state_dict(torch.load('data/ffnn_lm_h.pt', map_location=device))

<All keys matched successfully>

### Sampling from an $n$-gram feedforward network

Recall from lab 2-1 that we can sample a sequence of text from a model using the functions below. One important change here is that we provide the full context to the model instead of only the past $n-1$ words, to make the functions compatible with the RNN language model that will be introduced later.

In [11]:
def sample(model, context):
    """Returns a token sampled from the `model` assuming the `context`"""
    distribution = model.forward_step(context)
    prob_remaining = random.random()
    for token, prob in sorted(distribution.items()):
        if prob_remaining < prob:
            return token
        else:
            prob_remaining -= prob
    raise ValueError

def sample_sequence(model, start_context, count=100):
    """Returns a sequence of tokens of length `count` sampled successively
       from the `model` starting with the `start_context`
    """
    random.seed(SEED) # for reproducibility
    context = list(start_context)
    result = list(start_context)
    for _ in range(count):
        next_token = sample(model, tuple(context))  # avoid shadowing the builtin `next`
        result.append(next_token)
        context = context + [next_token]
    return result
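
To see that walking through the distribution and subtracting probabilities really samples each token with the right frequency, here is a small self-contained check on a made-up three-word distribution. The function `sample_from` is a hypothetical variant of `sample`: it takes the distribution directly and, unlike the version above, falls back to the last token instead of raising an error if floating-point rounding leaves a tiny remainder.

```python
import random
from collections import Counter

def sample_from(distribution, rng):
    """Samples a token by walking the cumulative distribution."""
    remaining = rng.random()
    for token, prob in sorted(distribution.items()):
        if remaining < prob:
            return token
        remaining -= prob
    # Rounding can leave a tiny remainder; fall back to the last token in sort order
    return max(distribution)

# Made-up distribution for illustration
toy_distribution = {"the": 0.5, "of": 0.3, "union": 0.2}
rng = random.Random(1234)
num_draws = 10000
draws = Counter(sample_from(toy_distribution, rng) for _ in range(num_draws))
frequencies = {token: draws[token] / num_draws for token in toy_distribution}
print(frequencies)
```

The empirical frequencies should land close to the specified probabilities, which is what makes repeated calls a valid way to generate text from a model.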

Let's try to sample from our models. The samples might be bad since the dataset is small.

In [12]:
print(' '.join(sample_sequence(ffnn_lm_madison, ('constitution', 'proposed', 'by', 'the'))), "\n")
print(' '.join(sample_sequence(ffnn_lm_hamilton, ('constitution', 'proposed', 'by', 'the'))))

constitution proposed by the united states , which will have laid against the elevation but the federal innovations is <unk> , and , is very <unk> invested in the authors of that reason . little far as we have have <unk> like , and in mankind , or as more and to the union , at a constitution passions , and very <unk> taxes , if gives give the legislative powers falling that they is the rights of any commission , and mean them in the people of its portion of civil suffrages , and the principles of powerful acts . a enjoyed and 

constitution proposed by the usual has , very time never it , the convention , the general models of course , and , in treat . the legislature may are merchants . the person of nations . we have far <unk> less , and instance , adopt properly be less <unk> to the union , has , except to be allowed to acquire the <unk> hands of one situation ? these is sometimes take it to serve by a <unk> in a measures than members of the political difference have prospect for th

In [13]:
grader.check("ffnn_sample")

### Evaluating text according to an $n$-gram feedforward network

Now let's use our language model to score text. Note that the $n$-gram feedforward network is able to score with zero context (internally, it's padded to the left), so `ffnn_lm_hamilton.forward_step([])` will return the probability distribution $\Pr(x_1)$ for the first word in a document.

In [14]:
Pr_x1 = ffnn_lm_hamilton.forward_step([])
topk = 9

# Sort by probabilities
for i, word in enumerate(sorted(Pr_x1, key=lambda word: Pr_x1[word], reverse=True)[:topk]):
    print (f"top {i+1} word: {word:<8} Pr(x1): {Pr_x1[word]:.3f}")

top 1 word: ,        Pr(x1): 0.048
top 2 word: the      Pr(x1): 0.046
top 3 word: of       Pr(x1): 0.031
top 4 word: in       Pr(x1): 0.028
top 5 word: <unk>    Pr(x1): 0.027
top 6 word: that     Pr(x1): 0.020
top 7 word: which    Pr(x1): 0.019
top 8 word: to       Pr(x1): 0.018
top 9 word: and      Pr(x1): 0.016


Define a function `neglogprob` that takes a token sequence and a language model and returns the negative log probability of the _entire_ token sequence according to the model (using log base 2). Note that the unknown word type is `"<unk>"`.

<!--
BEGIN QUESTION
name: ffnn_neglogprob
-->

In [15]:
def neglogprob(tokens, model):
    """Returns the negative log probability of a sequence of `tokens`
       according to a `model`
    """
    total = 0.0
    for i, token in enumerate(tokens):
        # Distribution over the next word given the full preceding history
        words_probs = model.forward_step(tokens[:i])
        # Tokens outside the vocabulary are scored as the unknown word type
        prob = words_probs[token] if token in words_probs else words_probs['<unk>']
        total += math.log2(prob)
    return -total

In [16]:
grader.check("ffnn_neglogprob")

In [17]:
round(neglogprob(["constitution",], ffnn_lm_madison), 2)

13.35

Define a function `perplexity` that takes a token sequence and a language model and returns the perplexity of the _entire_ token sequence according to the model.

<!--
BEGIN QUESTION
name: ffnn_perplexity
-->


In [18]:
# TODO
def perplexity(tokens, model):
    """Returns the perplexity of a sequence of `tokens` according to a `model`
    """
    neglogp = neglogprob(tokens, model)
    return 2 ** (neglogp / len(tokens))
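
A useful sanity check on these two definitions: a model that spreads probability uniformly over $V$ word types should cost $\log_2 V$ bits per token and have perplexity exactly $V$. The sketch below uses a hypothetical `UniformLM` class with the same `forward_step` interface, and re-implements the two functions under the names `toy_neglogprob` and `toy_perplexity` so that it runs on its own.

```python
import math

class UniformLM:
    """Hypothetical stand-in for a language model: all word types equally likely."""
    def __init__(self, vocab):
        self.vocab = list(vocab)

    def forward_step(self, history_words):
        # Ignores the history entirely
        return {word: 1.0 / len(self.vocab) for word in self.vocab}

def toy_neglogprob(tokens, model):
    """Negative log2 probability of the whole token sequence."""
    total = 0.0
    for i, token in enumerate(tokens):
        distribution = model.forward_step(tokens[:i])
        total -= math.log2(distribution.get(token, distribution["<unk>"]))
    return total

def toy_perplexity(tokens, model):
    """Two to the power of the average number of bits per token."""
    return 2 ** (toy_neglogprob(tokens, model) / len(tokens))

# Eight word types, so each token costs log2(8) = 3 bits
uniform = UniformLM(["<unk>", "the", "people", "of", "union", "states", "and", "senate"])
tokens = ["the", "people", "of", "the", "union"]
bits = toy_neglogprob(tokens, uniform)
ppl = toy_perplexity(tokens, uniform)
print(bits, ppl)
```

Perplexity can thus be read as an effective vocabulary size: a perplexity of 100 means the model is, on average, as uncertain as if it were choosing uniformly among 100 words.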


What's the perplexity of each document in the validation set under the language model trained on papers authored by Madison? What about Hamilton? Let's start with one document from each author.

In [19]:
document_madison = validation_madison[0]['tokens']
document_hamilton = validation_hamilton[0]['tokens']

Calculate the perplexity of each model on `document_madison` and  `document_hamilton`.

<!--
BEGIN QUESTION
name: ffnn_ppl
-->

In [20]:
# TODO
ppl_madison_model_madison_document = perplexity(document_madison, ffnn_lm_madison)
ppl_hamilton_model_madison_document = perplexity(document_madison, ffnn_lm_hamilton)
ppl_madison_model_hamilton_document = perplexity(document_hamilton, ffnn_lm_madison)
ppl_hamilton_model_hamilton_document = perplexity(document_hamilton, ffnn_lm_hamilton)

In [21]:
grader.check("ffnn_ppl")

Now, let's compare those perplexity values.

In [22]:
print (f"Author    Madison Model    Hamilton Model\n"
       f"Madison      {ppl_madison_model_madison_document:5.1f}            {ppl_hamilton_model_madison_document:5.1f}\n"
       f"Hamilton     {ppl_madison_model_hamilton_document:5.1f}            {ppl_hamilton_model_hamilton_document:5.1f}")

Author    Madison Model    Hamilton Model
Madison      108.3            134.8
Hamilton     129.4             96.0


<!-- BEGIN QUESTION -->

**Question:** What do you find? Why?

<!--
BEGIN QUESTION
name: open_response_ppl
manual: true
-->

Each model achieves lower perplexity on the document written by its own author (108.3 vs. 134.8 on the Madison document, 96.0 vs. 129.4 on the Hamilton document). This is expected: each model's parameters were optimized on word sequences from a single author, using only the previous $n-1$ words as context, so each model is in some sense overfitted to that author's word sequences and never saw the combinations of words characteristic of the other author. Each model therefore predicts best the kinds of sequences it was trained on, especially where words or phrases are distinctive of one author.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

Now, let's revisit our motivation for parameterizing conditional probabilities using a feedforward neural network instead of through counting.

**Question:** Compare the pros and cons of feedforward neural language model and the original $n$-gram language model (possibly with smoothing). Which is better?

<!--
BEGIN QUESTION
name: open_response_nn_v_ngram
manual: true
-->

Different models can be preferable in different situations. The feedforward neural language model is 'smoother' than the count-based $n$-gram model and needs no explicit smoothing: it assigns nonzero probability both to contexts it has seen and to contexts it hasn't. It also doesn't need to store every context, only the parameters optimized over all the training examples, so it requires much less memory than the count-based model, especially for large $n$, where the number of possible contexts of $n-1$ words is enormous and far exceeds the parameter count of even a deep neural model.
The neural model probably also generalizes better and hence attains better accuracy. Two linear layers with a nonlinear activation between them can in theory approximate a very broad class of functions, and the learned parameters empirically perform well on examples similar to those seen in training, or on examples that don't look similar at first glance but share hidden characteristics that the model was able to learn and represent in its hidden layer. In this sense the neural model has advantages over discrete count-based probabilistic models such as the $n$-gram model.


<!-- END QUESTION -->



## Recurrent neural networks

One limitation of $n$-gram language models (both the original one and the neural one) is that they only model context up to a fixed number of words. However, natural language exhibits long-term dependencies, well beyond $n=5$. In this part of the lab, we consider an approach based on recurrent neural networks (RNN), which can consider variable amounts of context.

Different from $n$-gram language modeling, RNN-based language models do not make the approximation that the probability of a word only depends on its previous $n-1$ words. That is, we use the unapproximated chain rule:

$$
\Prob(x_1, x_2, \ldots, x_M) = \prod_{i=1}^M \Prob (x_i \given x_1, \cdots, x_{i-1})
$$

and we again specify the conditional probabilities using a neural network:

$$
\Pr (x_i \given x_{\color{red}1}, \cdots, x_{i-1})= f({ x_{\color{red}1}}, \cdots, x_{i-1}),
$$

where we use an RNN to parameterize $f$. (Notice the change in the first index of the context, highlighted in red; we're using the whole history as context now, not just the last $n-1$ words.) 

The inputs to RNNs, like in the feedforward case, are embeddings of words, and we project the _final_ output state of the RNN to a vector of size $V$, followed by a softmax to normalize the probabilities.
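
To illustrate how a recurrent network folds the whole history into one state, here is a plain-Python sketch of a simple (Elman) recurrence, $h_t = \tanh(W_{in} x_t + W_{rec} h_{t-1} + b)$. The weights are random, the inputs are made-up 2-d vectors standing in for embeddings, and the helpers `rnn_step` and `run_rnn` are hypothetical; `torch.nn.RNN` also uses a tanh nonlinearity by default but keeps two separate bias vectors.

```python
import math
import random

def rnn_step(x, h, w_in, w_rec, bias):
    """One recurrence step: h' = tanh(W_in x + W_rec h + b)."""
    return [
        math.tanh(sum(w * xi for w, xi in zip(w_in[j], x))
                  + sum(u * hi for u, hi in zip(w_rec[j], h))
                  + bias[j])
        for j in range(len(bias))
    ]

def run_rnn(inputs, hidden_size, w_in, w_rec, bias):
    """Feeds the inputs in order and returns the final hidden state."""
    h = [0.0] * hidden_size
    for x in inputs:
        h = rnn_step(x, h, w_in, w_rec, bias)
    return h

rng = random.Random(0)
input_size, hidden_size = 2, 3
w_in = [[rng.uniform(-1, 1) for _ in range(input_size)] for _ in range(hidden_size)]
w_rec = [[rng.uniform(-1, 1) for _ in range(hidden_size)] for _ in range(hidden_size)]
bias = [0.0] * hidden_size

# Two histories that differ only in their first input
shared_suffix = [[0.5, 0.5], [1.0, -1.0]]
final_a = run_rnn([[1.0, 0.0]] + shared_suffix, hidden_size, w_in, w_rec, bias)
final_b = run_rnn([[0.0, 1.0]] + shared_suffix, hidden_size, w_in, w_rec, bias)
print(final_a)
print(final_b)
```

A fixed window covering only the last two inputs would treat these two histories identically, whereas the final hidden states differ because the first input still influences them through the recurrence.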

Implement the missing part of the `forward_step` function of an RNN language model below. This function takes the previous words as input, and returns the probabilities of generating the next word. The returned value should be a dictionary, with word types as keys and their respective probabilities as values.

> Hint: You might find [torch.nn.RNN documentation](https://pytorch.org/docs/stable/generated/torch.nn.RNN.html) helpful.

<!--
BEGIN QUESTION
name: rnn_forward_step
-->

In [23]:
class RNNLM(torch.nn.Module):
  def __init__(self, text_field, embedding_size, hidden_size):
    super().__init__()
    self.text_field = text_field
    vocab_size = len(self.text_field.vocab)
    self.pad_index = self.text_field.vocab.stoi[self.text_field.pad_token]
    
    # Create modules
    self.embed = torch.nn.Embedding(vocab_size, embedding_size)
    self.rnn = torch.nn.RNN(input_size=embedding_size, hidden_size=hidden_size, num_layers=1)
    self.hidden2output = torch.nn.Linear(hidden_size, vocab_size)
  
  def forward_step(self, context_words):
    self.eval()
    context = self.text_field.process([context_words]).to(device) # seq len, 1
    context_len = context.size(0)
    if context_len == 0: # generate the first word
      context = context.new(1, 1).fill_(self.pad_index)
      context_len = context.size(0)
    hidden = None
    # TODO: finish feedforward and set logits
    # Logits shall be a tensor of size (1, vocab_size)
    # Note that you should project the `output` from rnn, not the `hidden`
    # using self.hidden2output
    rnn_outputs = self.rnn(self.embed(context))[0]
    logits = self.hidden2output(rnn_outputs)[len(context)-1]

    # Normalize to get probabilities
    probs = torch.softmax(logits, -1).view(-1) # vocab_size

    # Match probabilities with actual word types
    distribution = {}
    for i, prob in enumerate(probs):
      word = self.text_field.vocab.itos[i]
      distribution[word] = prob.item()
    return distribution

Now, let's load the pretrained RNN language models for Hamilton and Madison. The model `rnn_lm_madison` was trained on documents authored by Madison, whereas `rnn_lm_hamilton` was trained on documents authored by Hamilton.

In [24]:
# Create and load RNN LM for Madison
rnn_lm_madison = RNNLM(TEXT,
               embedding_size=128, 
               hidden_size=128, 
               ).to(device)
rnn_lm_madison.load_state_dict(torch.load('data/rnn_lm_m.pt', map_location=device))

# Create and load RNN LM for Hamilton
rnn_lm_hamilton = RNNLM(TEXT,
               embedding_size=128, 
               hidden_size=128, 
               ).to(device)
rnn_lm_hamilton.load_state_dict(torch.load('data/rnn_lm_h.pt', map_location=device))

<All keys matched successfully>

### Sampling from an RNN model

Let's try to sample from our models. The samples might be bad since the dataset is small.

In [25]:
print(' '.join(sample_sequence(rnn_lm_madison, ('constitution', 'proposed', 'by', 'the'))))
print(' '.join(sample_sequence(rnn_lm_hamilton, ('constitution', 'proposed', 'by', 'the'))))

constitution proposed by the united states , which will holding that a representative class , therefore , than the business could be , more weight , not more much <unk> in his liberty and interests , are too little known , on <unk> . it is <unk> on by than any thing of this body , a discretion of course , will be standard , if it clearly sufficiently less reflection from the state governments of the former , and it be held purposes and that the people in the same congress themselves , and there might its human affairs , according to <unk>
constitution proposed by the united states , which will have no beneficial prove as <unk> the exigencies of the confederacy , and , in various act increase in the bodies of independent jealousy in the forms of this kind has a long , and it is , that as often as well it would be certainty , can proposition , are under authorize their <unk> . if it were marked retained in the state governments than to encroach . for relations <unk> innovations the less 

In [26]:
grader.check("rnn_sample")

### Evaluating text according to an RNN model


Again, let's evaluate the models on a document by Hamilton and a document by Madison.

In [27]:
document_madison = validation_madison[0]['tokens']
document_hamilton = validation_hamilton[0]['tokens']

Calculate the perplexity of each RNN model on each document.

<!--
BEGIN QUESTION
name: rnn_ppl
-->

In [28]:
# TODO
rnn_ppl_madison_model_madison_document = perplexity(document_madison, rnn_lm_madison)
rnn_ppl_hamilton_model_madison_document = perplexity(document_madison, rnn_lm_hamilton)
rnn_ppl_madison_model_hamilton_document = perplexity(document_hamilton, rnn_lm_madison)
rnn_ppl_hamilton_model_hamilton_document = perplexity(document_hamilton, rnn_lm_hamilton)

In [29]:
grader.check("rnn_ppl")

Now, let's compare those perplexity values.

In [30]:
print (f"Author      Madison Model        Hamilton Model\n"
       f"Madison        {rnn_ppl_madison_model_madison_document:5.1f}                {rnn_ppl_hamilton_model_madison_document:5.1f}\n"
       f"Hamilton       {rnn_ppl_madison_model_hamilton_document:5.1f}                {rnn_ppl_hamilton_model_hamilton_document:5.1f}")

Author      Madison Model        Hamilton Model
Madison         86.4                 99.0
Hamilton        93.9                 77.9


<!-- BEGIN QUESTION -->

**Question:** Which type of model is better? The RNN language models or the feedforward language models? What are the possible reasons?

<!--
BEGIN QUESTION
name: open_response_ffnn_vs_rnn
manual: true
-->

The RNN models are better: their perplexities are lower than those of the feedforward models on both documents (for example, 86.4 vs. 108.3 for the Madison model on the Madison document). This is expected, because the RNN conditions on the entire preceding context when computing the probability of each word, whereas the feedforward model sees only the previous $n-1$ words. More information about the past makes the model more certain about which word comes next, i.e., the probability it assigns to the observed word given the wider context tends to be higher.

<!-- END QUESTION -->



## Authorship attribution using language models

In lab 1-3, you saw how to use a Naive Bayes model to determine authorship:

\begin{align*}
\argmax{i} \Prob(c_i \given \vect{x}) 
&= \argmax{i} \frac{\Prob(\vect{x} \given c_i) \cdot \Prob(c_i)}{\Prob(\vect{x})} \\
&= \argmax{i} \Prob(\vect{x} \given c_i) \cdot \Prob(c_i)
\end{align*}

In this lab, the language models trained on Madison documents can be used to calculate $\Pr(\vect{x} \given \text{Madison})$, and the language models trained on Hamilton documents can be used to calculate $\Pr(\vect{x} \given \text{Hamilton})$. Therefore, they can also be used for authorship attribution.

Recall that for numerical stability, we operate in log space (with base 2). With a little abuse of notation, let's denote the _log posterior_ as

$$
\log \Prob(\vect{x} \given c_i) + \log \Prob(c_i),
$$
where the priors $\Prob(c_i)$ from lab 1-3 are given below.

In [31]:
prior_madison = 15 / (15+51)
prior_hamilton = 51 / (15+51)
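
The following sketch illustrates two points about the log posterior: why we work in log space, and how much the prior can matter. The per-token probability of 0.01 and the two log-likelihood values are made-up numbers for illustration, not outputs of the lab's models.

```python
import math

prior_madison = 15 / (15 + 51)
prior_hamilton = 51 / (15 + 51)

# 1. Multiplying 200 small per-token probabilities underflows to zero,
#    while the sum of their logs stays finite.
per_token_prob = 0.01   # made-up value
num_tokens = 200
direct_product = 1.0
for _ in range(num_tokens):
    direct_product *= per_token_prob
log2_likelihood = num_tokens * math.log2(per_token_prob)

# 2. The decision rule adds the log prior to the log likelihood.
#    Hypothetical log-likelihoods in bits; Madison is ahead by one bit.
log2_lik = {"Madison": -1290.0, "Hamilton": -1291.0}
log2_prior = {"Madison": math.log2(prior_madison),
              "Hamilton": math.log2(prior_hamilton)}
log_posterior = {author: log2_lik[author] + log2_prior[author]
                 for author in log2_lik}
predicted = max(log_posterior, key=log_posterior.get)
print(direct_product, log2_likelihood)
print(log_posterior, predicted)
```

With these priors, Hamilton starts with an advantage of $\log_2(51/15) \approx 1.77$ bits, so a likelihood lead of one bit for Madison is not enough to win the comparison in this made-up case.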

Let's consider a document from the test set.

In [32]:
document = testing[0]['tokens']

Use the feedforward neural language models to calculate the log posteriors for `document`.

<!--
BEGIN QUESTION
name: ffnn_author
-->

In [33]:
#TODO - calculate the log posteriors for Madison and Hamilton using feedforward LMs
posterior_madison = -neglogprob(document, ffnn_lm_madison)
posterior_hamilton =  -neglogprob(document, ffnn_lm_hamilton)

log_posterior_madison_ffnn = posterior_madison+math.log2(prior_madison)
log_posterior_hamilton_ffnn = posterior_hamilton+math.log2(prior_hamilton)

#TODO - determine authorship
author_ffnn = "Madison" if log_posterior_madison_ffnn > log_posterior_hamilton_ffnn else 'Hamilton'

In [34]:
grader.check("ffnn_author")

In [35]:
print (author_ffnn)

Hamilton


Use the RNN neural language models to calculate the log posteriors for `document`.

<!--
BEGIN QUESTION
name: rnn_author
-->

In [36]:
#TODO - calculate the log posteriors for Madison and Hamilton using RNN LMs
posterior_madison = -neglogprob(document, rnn_lm_madison)
posterior_hamilton =  -neglogprob(document, rnn_lm_hamilton)

log_posterior_madison_rnn = posterior_madison+math.log2(prior_madison)
log_posterior_hamilton_rnn = posterior_hamilton+math.log2(prior_hamilton)
#TODO - determine authorship
author_rnn = "Madison" if log_posterior_madison_rnn > log_posterior_hamilton_rnn else 'Hamilton'

In [37]:
grader.check("rnn_author")

Now, we can use these models to determine authorship on the entire test set. Define the `ffnn_classify` and `rnn_classify` functions, which take a sequence of `tokens` and return either `'Hamilton'` or `'Madison'` depending on which of the two has a higher probability of authoring the text.

<!--
BEGIN QUESTION
name: authorship
-->

In [38]:
def ffnn_classify(tokens):
    posterior_madison = -neglogprob(tokens, ffnn_lm_madison)
    posterior_hamilton =  -neglogprob(tokens, ffnn_lm_hamilton)
    log_posterior_madison_ffnn = posterior_madison+math.log2(prior_madison)
    log_posterior_hamilton_ffnn = posterior_hamilton+math.log2(prior_hamilton)
    return "Madison" if log_posterior_madison_ffnn > log_posterior_hamilton_ffnn else 'Hamilton'
    

def rnn_classify(tokens):
    posterior_madison = -neglogprob(tokens, rnn_lm_madison)
    posterior_hamilton =  -neglogprob(tokens, rnn_lm_hamilton)
    log_posterior_madison_rnn = posterior_madison+math.log2(prior_madison)
    log_posterior_hamilton_rnn = posterior_hamilton+math.log2(prior_hamilton)
    return "Madison" if log_posterior_madison_rnn > log_posterior_hamilton_rnn else 'Hamilton'

for ex in testing:
    print(f"{ex['number']:2} {ffnn_classify(ex['tokens']):8} {rnn_classify(ex['tokens']):8}")

49 Hamilton Madison 
50 Madison  Madison 
51 Madison  Madison 
52 Madison  Madison 
53 Madison  Madison 
54 Madison  Madison 
55 Madison  Madison 
56 Madison  Madison 
57 Madison  Madison 
62 Madison  Madison 
63 Madison  Madison 
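
Since the setup cell relabeled every disputed paper as `'Madison'`, the table above can be summarized as an accuracy for each classifier. The sketch below simply transcribes the printed predictions; the `accuracy` helper is a hypothetical name, and the numbers reflect only the first 200 tokens of each paper.

```python
# Predictions transcribed from the printed table above
paper_numbers = [49, 50, 51, 52, 53, 54, 55, 56, 57, 62, 63]
ffnn_predictions = ["Hamilton"] + ["Madison"] * 10
rnn_predictions = ["Madison"] * 11
# Gold labels as set in the setup cell
gold = ["Madison"] * len(paper_numbers)

def accuracy(predictions, gold_labels):
    """Fraction of predictions that match the gold labels."""
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

ffnn_accuracy = accuracy(ffnn_predictions, gold)
rnn_accuracy = accuracy(rnn_predictions, gold)
print(f"FFNN accuracy: {ffnn_accuracy:.3f}  RNN accuracy: {rnn_accuracy:.3f}")
```

The two classifiers disagree only on paper 49, so on a test set this small the difference between them rests on a single document.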


In [39]:
grader.check("authorship")

<!-- BEGIN QUESTION -->

**Question:** What would happen if the dataset is imbalanced, i.e., if we have much more data for one author compared to another?

<!--
BEGIN QUESTION
name: open_response_imbalanced
manual: true
-->

The model for the author with less data would be less effective at recognizing that author's documents, so the model for the other author can dominate by recognizing similarities with its own author. The overall classifier could then tend to assign the label of the author with more data to every document it sees.

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

## Lab debrief – for consensus submission only

**Question:** We're interested in any thoughts your group has about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# End of Lab 2-3

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [40]:
grader.check_all()