<a href="https://colab.research.google.com/github/cs236299-2023-spring/lab2-5-joannj35/blob/master/lab2_5.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [82]:
# Please do not change this cell because some hidden tests might depend on it.
import os

# Otter grader does not handle ! commands well, so we define and use our
# own function to execute shell commands.
def shell(commands, warn=True):
    """Executes the string `commands` as a sequence of shell commands.
     
       Prints the result to stdout and returns the exit status. 
       Provides a printed warning on non-zero exit status unless `warn` 
       flag is unset.
    """
    file = os.popen(commands)
    print (file.read().rstrip('\n'))
    exit_status = file.close()
    if warn and exit_status != None:
        print(f"Completed with errors. Exit status: {exit_status}\n")
    return exit_status

shell("""
ls requirements.txt >/dev/null 2>&1
if [ ! $? = 0 ]; then
 rm -rf .tmp
 git clone git@github.com:cs236299-2023-spring/lab2-5.git .tmp
 mv .tmp/tests ./
 mv .tmp/requirements.txt ./
 rm -rf .tmp
fi
pip install -q -r requirements.txt
""")




In [83]:
# Initialize Otter
import otter
grader = otter.Notebook()

$$
\renewcommand{\vect}[1]{\mathbf{#1}}
\renewcommand{\cnt}[1]{\sharp(#1)}
\renewcommand{\argmax}[1]{\underset{#1}{\operatorname{argmax}}}
\renewcommand{\softmax}{\operatorname{softmax}}
\renewcommand{\Prob}{\Pr}
\renewcommand{\given}{\,|\,}
$$

# Course 236299
## Lab 2-5 - Sequence labeling with recurrent neural networks

In the last lab, you saw how to use hidden Markov models (HMMs) for sequence labeling. In this lab, you will use recurrent neural networks (RNNs) for sequence labeling. 

In this lab, we consider the task of automatic punctuation restoration from unpunctuated text, which is useful for post-processing transcribed speech from speech recognition systems (since we don't want users to have to utter all punctuation marks). We can formulate this task as a sequence labeling task, predicting for each word the punctuation that should follow. If there's no punctuation following the word, we use a special tag `O` for "other".

The dataset we use is the Federalist papers, but this time we use text without punctuation as our input, and predict the punctuation following each word. An example constructed from the dataset looks like below, which correponds to the punctuated sentence `the powers to make treaties and to send and receive ambassadors , speak their own propriety .`

| Token       | Label  |
| ----------- | ------ |
| &lt;bos&gt; | O |
| the         | O |
| powers      | O |
| to          | O |
| make        | O |
| treaties    | O |
| and         | O |
| to          | O |
| send        | O |
| and         | O |
| receive     | O |
| ambassadors | , |
| speak       | O |
| their       | O |
| own         | O |
| propriety   | . |

# Preparation and setup

In [84]:
import copy

import wget
import torch
import torch.nn as nn

import csv
import random

from datasets import load_dataset
from tokenizers import Tokenizer
from tokenizers.pre_tokenizers import WhitespaceSplit
from tokenizers import normalizers
from tokenizers.models import WordLevel
from tokenizers.trainers import WordLevelTrainer
from transformers import PreTrainedTokenizerFast

from collections import Counter
from tqdm.auto import tqdm

# Fix random seed for replicability
SEED=1234
random.seed(SEED)
torch.manual_seed(SEED)

<torch._C.Generator at 0x7fc46a220c30>

In [85]:
## GPU check
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

cuda


In [86]:
# Prepare to download needed data
def download_if_needed(source, dest, filename):
    os.makedirs(data_path, exist_ok=True) # ensure destination
    os.path.exists(f"./{dest}{filename}") or wget.download(source + filename, out=dest)

source_path = "https://raw.githubusercontent.com/nlp-236299/data/master/Federalist/"
data_path = "data/"

# Download files
for filename in ["federalist_tag.train.txt",
                 "federalist_tag.dev.txt",
                 "federalist_tag.test.txt"
                ]:
    download_if_needed(source_path, data_path, filename)

Next, we process the dataset by extracting the sequences and their corresponding labels and save it in CSV format. 

In [87]:
for split in ['train', 'dev', 'test']:
    in_file = f'data/federalist_tag.{split}.txt'
    out_file = f'data/federalist_tag.{split}.csv'
    
    with open(in_file, 'r') as f_in:
        with open(out_file, 'w') as f_out:
            text, tag = [], []
            writer = csv.writer(f_out)
            writer.writerow(('text','tag'))
            for line in f_in:
                if line.strip() == '':
                    writer.writerow((' '.join(text), ' '.join(tag)))
                    text, tag = [], []
                else:
                    token, label = line.split('\t')
                    text.append(token)
                    tag.append(label.strip())

Let's take a look at what each data file looks like.

In [88]:
shell('head "data/federalist_tag.train.csv"')

text,tag
<bos> to the people of the state of new york,O O O O O O O O O :
<bos> the last paper having concluded the observations which were meant to introduce a candid survey of the plan of government reported by the convention we now proceed to the execution of that part of our undertaking,"O O O O O O O O O O O O O O O O O O O O O O O O , O O O O O O O O O O O ."
<bos> the first question that offers itself is whether the general form and aspect of the government be strictly republican it is evident that no other form would be reconcilable with the genius of the people of america with the fundamental principles of the revolution or with that honorable determination which animates every votary of freedom to rest all our political experiments on the capacity of mankind for self-government if the plan of the convention therefore be found to depart from the republican character its advocates must abandon it as no longer defensible,"O O O O O O O , O O O O O O O O O O O . O O O O O O O O O

We use `datasets` to prepare the data. More information on datasets can be found at https://huggingface.co/docs/datasets/loading.

In [89]:
federa = load_dataset('csv', data_files={'train':'data/federalist_tag.train.csv', \
                                       'val': 'data/federalist_tag.dev.csv', \
                                       'test': 'data/federalist_tag.test.csv'})
federa



Downloading and preparing dataset csv/default to /root/.cache/huggingface/datasets/csv/default-d9f09870171c0d07/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317...


Downloading data files:   0%|          | 0/3 [00:00<?, ?it/s]

Extracting data files:   0%|          | 0/3 [00:00<?, ?it/s]

Generating train split: 0 examples [00:00, ? examples/s]

  return pd.read_csv(xopen(filepath_or_buffer, "rb", use_auth_token=use_auth_token), **kwargs)


Generating val split: 0 examples [00:00, ? examples/s]

  return pd.read_csv(xopen(filepath_or_buffer, "rb", use_auth_token=use_auth_token), **kwargs)


Generating test split: 0 examples [00:00, ? examples/s]

Dataset csv downloaded and prepared to /root/.cache/huggingface/datasets/csv/default-d9f09870171c0d07/0.0.0/6b34fb8fcf56f7c8ba51dc895bfa2bfbe43546f190a60fcf74bb5e8afdcc2317. Subsequent calls will reuse this data.


  return pd.read_csv(xopen(filepath_or_buffer, "rb", use_auth_token=use_auth_token), **kwargs)


  0%|          | 0/3 [00:00<?, ?it/s]

DatasetDict({
    train: Dataset({
        features: ['text', 'tag'],
        num_rows: 809
    })
    val: Dataset({
        features: ['text', 'tag'],
        num_rows: 106
    })
    test: Dataset({
        features: ['text', 'tag'],
        num_rows: 120
    })
})

In [90]:
train_data = federa['train']
val_data = federa['val']
test_data = federa['test']

We build a tokenizer from the training data to tokenize text and convert tokens into word ids.

In [91]:

# We place a limit on the size of the vocabulary, including only the 
# `MAX_VOCAB_SIZE` most frequent words. All others will become `[UNK]`.
MAX_VOCAB_SIZE = 5000
unk_token = '[UNK]'
pad_token = '[PAD]'

text_tokenizer = Tokenizer(WordLevel(unk_token=unk_token))
text_tokenizer.pre_tokenizer = WhitespaceSplit()

trainer = WordLevelTrainer(vocab_size=MAX_VOCAB_SIZE, special_tokens=[pad_token, unk_token])
text_tokenizer.train_from_iterator(train_data['text'], trainer=trainer)

We use `datasets.Dataset.map` to convert text into word ids. As shown in lab 1-5, first we need to wrap `tokenizer` with the `transformers.PreTrainedTokenizerFast` class to be compatible with the `datasets` library.

In [92]:
hf_text_tokenizer = PreTrainedTokenizerFast(tokenizer_object=text_tokenizer, pad_token=pad_token, unk_token=unk_token)

In [93]:
def encode(example):
    return hf_text_tokenizer(example['text'])

train_data = train_data.map(encode)
val_data = val_data.map(encode)
test_data = test_data.map(encode)

  0%|          | 0/809 [00:00<?, ?ex/s]

  0%|          | 0/106 [00:00<?, ?ex/s]

  0%|          | 0/120 [00:00<?, ?ex/s]

In [94]:
train_data[0]

{'text': '<bos> to the people of the state of new york',
 'tag': 'O O O O O O O O O :',
 'input_ids': [22, 4, 2, 45, 3, 2, 34, 3, 73, 132],
 'token_type_ids': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}

We also need to convert tag strings into tag ids.

In [95]:
tag_tokenizer = Tokenizer(WordLevel())
tag_tokenizer.pre_tokenizer = WhitespaceSplit()

tag_trainer = WordLevelTrainer(special_tokens=[pad_token])
tag_tokenizer.train_from_iterator(train_data['tag'], trainer=tag_trainer)

hf_tag_tokenizer = PreTrainedTokenizerFast(tokenizer_object=tag_tokenizer, pad_token=pad_token)

def encode_tag(example):
    example['tag_ids'] = hf_tag_tokenizer(example['tag']).input_ids
    return example

train_data = train_data.map(encode_tag)
val_data = val_data.map(encode_tag)
test_data = test_data.map(encode_tag)


  0%|          | 0/809 [00:00<?, ?ex/s]

  0%|          | 0/106 [00:00<?, ?ex/s]

  0%|          | 0/120 [00:00<?, ?ex/s]

In [96]:
# Print out some stats
text_vocab = hf_text_tokenizer.get_vocab()
tag_vocab = hf_tag_tokenizer.get_vocab()
vocab_size = len(text_vocab)
num_tags = len(tag_vocab)

print(f"Size of vocab: {vocab_size}")
print (f"Most common English words: {Counter(token for sentence in train_data['text'] for token in sentence.split()).most_common(10)}\n")
print(f"Number of tags: {num_tags}")
print (f"Most common tags: {Counter(tag for sentence_tags in train_data['tag'] for tag in sentence_tags.split()).most_common(10)}")

Size of vocab: 5000
Most common English words: [('the', 12401), ('of', 8251), ('to', 5107), ('and', 3325), ('in', 3159), ('a', 2878), ('be', 2666), ('that', 1954), ('it', 1805), ('is', 1552)]

Number of tags: 11
Most common tags: [('O', 118169), (',', 8962), ('.', 3493), (';', 1036), ('?', 305), (':', 153), (')', 26), ('(', 25), ('!', 10), ("'", 2)]


We mapped words that are not among the most frequent words (specified by `MAX_VOCAB_SIZE`) to a special unknown token:

In [97]:
unk_index = text_vocab[unk_token]

print (f"Unknown word: {unk_token}\n"
       f"Unknown index: {unk_index}")

Unknown word: [UNK]
Unknown index: 1


To facilitate batching sentences of different lengths into the same tensor we also reserved a special padding symbol `[PAD]` for both `text_vocab` and `tag_vocab`

In [98]:
print (f"Padding token: {pad_token}")
text_pad_index = text_vocab[pad_token]
print (f"Padding text_vocab token id: {text_pad_index}")
tag_pad_index = tag_vocab[pad_token]
print (f"Padding tag_vocab token id: {tag_pad_index}")

Padding token: [PAD]
Padding text_vocab token id: 0
Padding tag_vocab token id: 0


To load data in batched tensors, we use `torch.utils.data.DataLoader` for data splits, which enables us to iterate over the dataset under a given `BATCH_SIZE`, which is set to be `1` throughout this lab. We still batch the data because other torch functions expect data to be batched.

In [99]:
 # we use batch size 1 for simplicity
BATCH_SIZE = 1

# Defines how to batch a list of examples together
def collate_fn(examples):
    batch = {}
    bsz = len(examples)
    input_ids, tag_ids = [], []
    for example in examples:
        input_ids.append(example['input_ids'])
        tag_ids.append(example['tag_ids'])
        
    max_length = max([len(word_ids) for word_ids in input_ids])

    tag_batch = torch.zeros(bsz, max_length).long().fill_(tag_vocab[pad_token]).to(device)
    text_batch = torch.zeros(bsz, max_length).long().fill_(text_vocab[pad_token]).to(device)
    for b in range(bsz):
        text_batch[b][:len(input_ids[b])] = torch.LongTensor(input_ids[b]).to(device)
        tag_batch[b][:len(tag_ids[b])] = torch.LongTensor(tag_ids[b]).to(device)
    
    batch['tag_ids'] = tag_batch
    batch['input_ids'] = text_batch
    return batch

train_iter = torch.utils.data.DataLoader(train_data, 
                                         batch_size=BATCH_SIZE, 
                                         shuffle=True,
                                         collate_fn=collate_fn)
val_iter = torch.utils.data.DataLoader(val_data, 
                                       batch_size=BATCH_SIZE, 
                                       shuffle=False,
                                       collate_fn=collate_fn)
test_iter = torch.utils.data.DataLoader(test_data, 
                                        batch_size=BATCH_SIZE, 
                                        shuffle=False, 
                                        collate_fn=collate_fn)

Let's take a look at the dataset. Recall from homework 1 that there are two different ways of iterating over the dataset, one by iterating over individual examples, the other by iterating over batches of examples.

In [100]:
# Iterating over individual examples:
# Note that the words are the original words, so you'd need to manually 
# replace them with [UNK] if not in the vocabulary.
example = train_data[1]
text = example['text'].split() # a sequence of unpunctuated words
tags = example['tag'].split()  # a sequence of tags indicating the proper punctuation
print (f'{"TYPE":15}: {"TAG"}')
for word, tag in zip(text, tags):
  print (f'{word:15}: {tag}')

TYPE           : TAG
<bos>          : O
the            : O
last           : O
paper          : O
having         : O
concluded      : O
the            : O
observations   : O
which          : O
were           : O
meant          : O
to             : O
introduce      : O
a              : O
candid         : O
survey         : O
of             : O
the            : O
plan           : O
of             : O
government     : O
reported       : O
by             : O
the            : O
convention     : ,
we             : O
now            : O
proceed        : O
to             : O
the            : O
execution      : O
of             : O
that           : O
part           : O
of             : O
our            : O
undertaking    : .


Alternatively, we can produce the data a batch at a time, as in the example below. Note the "shape" of a batch, it's a two-dimensional tensor of size `batch_size x max_length`. (In this case, `batch_size` is 1.) Thus, to extract a sentence from a batch, we need to index by the _first_ dimension.

In [101]:
# Iterating over batches of examples:
# Note that the collat_fn returns input_ids and tag_ids only, so you need to manually convert them back to strings.
# Unknown words have been mapped to unknown word ids
batch = next(iter(train_iter))
text_ids = batch['input_ids']
example_text = text_ids[0]
print (f"Size of text batch: {text_ids.size()}")
print (f"First sentence in batch: {example_text}")
print (f"Mapped back to string: {hf_text_tokenizer.decode(example_text)}")

print ('-'*20)

tag_ids = batch['tag_ids']
example_tags = tag_ids[0]
print (f"Size of tag batch: {tag_ids.size()}")
print (f"First tag in batch: {example_tags}")
print (f"Mapped back to string: {hf_tag_tokenizer.decode(example_tags, clean_up_tokenization_spaces=False)}")

Size of text batch: torch.Size([1, 70])
First sentence in batch: tensor([  22,   10,   33,    8,  619,  304,  230,    7,  421,    3,  579,  389,
          16,  231,    2,  355,  139,    5,   42,   10,   16,   18,  230,    7,
         100,  364,   12,   16,   43,  219,    8, 1810,   14,  154,  172,   70,
        1011, 2225,   16,   18,   19,    9,  206,    8, 1795,    4,    7,  734,
         364,   12,   74,   43,    2,   72,   90,  118,  563,   19,    2,  302,
           3,    1,    2, 1411,  960,    5,  703,    3,    2,  273],
       device='cuda:0')
Mapped back to string: <bos> it may be asked also whether a duration of four years would answer the end proposed and if it would not whether a less period which would at least be recommended by greater security against ambitious designs would not for that reason be preferable to a longer period which was at the same time too short for the purpose of [UNK] the desired firmness and independence of the magistrate
--------------------
Size of

Given the tokenized tags of an unpunctuated sequence of words, we can easily restore the punctuation:

In [102]:
def restore_punctuation(word_ids, tag_ids):
  words = hf_text_tokenizer.convert_ids_to_tokens(word_ids)
  tags = hf_tag_tokenizer.convert_ids_to_tokens(tag_ids)
  words_with_punc = []
  for word, tag in zip(words, tags):
    words_with_punc.append(word)
    if tag != 'O':
      words_with_punc.append(tag)
  return ' '.join(words_with_punc)

In [103]:
print(restore_punctuation(example['input_ids'], example['tag_ids']))

<bos> the last paper having concluded the observations which were meant to introduce a candid survey of the plan of government reported by the convention , we now proceed to the execution of that part of our undertaking .


# Majority Labeling

Recall from our previous lab that a naive baseline is choosing the majority label for each word in the sequence, where the majority label depends on the word. We've provided an implementation of this baseline for you, to give you a sense of how difficult the punctuation restoration task is.

In [104]:
class MajorityTagger():
  def __init__(self):
    """Initializer.
    """
    self.most_common_label_given_word = {}

  def train_all(self, train_iter):
    """Finds the majority label for each word in the training set.
    """
    train_counts_given_word = {}
    for batch in train_iter:
      for example_input_ids, example_tag_ids in zip(batch['input_ids'], batch['tag_ids']):
        for word_id, tag_id in zip(example_input_ids, example_tag_ids):
          if word_id not in train_counts_given_word:
            train_counts_given_word[word_id.item()] = Counter([])
          train_counts_given_word[word_id.item()].update([tag_id.item()])
    
    for word_id in train_counts_given_word:
      # Find the most common
      most_common_label = train_counts_given_word[word_id].most_common(1)[0][0]
      self.most_common_label_given_word[word_id] = most_common_label

  def predict_all(self, test_iter):
    """Predicts labels for each example in test_iter.
       Returns a list of list of strings. The order should be the same as
       in `test_iter.dataset` (or equivalently `test_iter`).
    """
    predictions = []
    for batch in test_iter:
      batch_predictions = []
      for example_input_ids in batch['input_ids']:
        example_tag_ids_pred = []
        for word_id in example_input_ids:
          tag_id_pred = self.most_common_label_given_word[word_id.item()]
          example_tag_ids_pred.append(tag_id_pred)
        batch_predictions.append(example_tag_ids_pred)
      predictions.append(batch_predictions)
    return predictions # batch list -> example list -> tag list

  def evaluate(self, test_iter):
    """Evaluates the overall accuracy, and the precision and recall of comma.
    """
    correct = 0
    total = 0
    true_positive_comma = 0
    predicted_positive_comma = 0
    total_positive_comma = 0
    comma_id = tag_vocab[',']

    # Get predictions
    predictions = self.predict_all(test_iter)
    assert len(predictions) == len(test_iter)
    
    for batch_tag_pred, batch in zip(predictions, test_iter):
      for tag_ids_pred, example_tag_ids in zip(batch_tag_pred, batch['tag_ids']):
        assert len(tag_ids_pred) == len(example_tag_ids)
        for tag_id_pred, tag_id in zip(tag_ids_pred, example_tag_ids):
          tag_id = tag_id.item()
          total += 1
          if tag_id_pred == tag_id:
            correct += 1
          if tag_id_pred == comma_id:
            predicted_positive_comma += 1 # predicted positive
          if tag_id == comma_id:
            total_positive_comma += 1     # gold label positive
          if tag_id_pred == comma_id and tag_id == comma_id:
            true_positive_comma += 1      # true positive
    precision_comma = true_positive_comma / predicted_positive_comma
    recall_comma = true_positive_comma / total_positive_comma
    F1_comma = 2. / (1./precision_comma + 1./recall_comma)
    return correct/total, precision_comma, recall_comma, F1_comma

Now, we can train our baseline on training data. 

In [105]:
maj_tagger = MajorityTagger()
maj_tagger.train_all(train_iter)

Let's take a look at an example prediction using this simple baseline.

In [106]:
# Get all predictions
predictions = maj_tagger.predict_all(test_iter)

# Pick one example
example_id = 2 # the third example
example = test_data[example_id]
prediction = predictions[example_id][0]

print('Ground truth punctuation:')
print(restore_punctuation(example['input_ids'], example['tag_ids']), '\n')
print('Predicted punctuation:')
print(restore_punctuation(example['input_ids'], prediction))

Ground truth punctuation:
<bos> the several departments being perfectly co-ordinate by the terms of their common commission , none of them , it is evident , can pretend to an exclusive or superior right of [UNK] the boundaries between their respective powers ; and how are the encroachments of the stronger to be prevented , or the [UNK] of the weaker to be redressed , without an appeal to the people themselves , who , as the [UNK] of the commissions , can alone declare its true meaning , and enforce its observance ? there is certainly great force in this reasoning , and it must be allowed to prove that a constitutional road to the decision of the people ought to be marked out and kept open , for certain great and extraordinary occasions . but there appear to be insuperable objections against the proposed [UNK] to the people , as a provision in all cases for keeping the several departments of power within their constitutional limits . in the first place , the provision does not reach the

This baseline model clearly grossly underpunctuates. It predicts the tag to be `O` almost all of the time.

We can quantitatively evaluate the performance of the majority labeling tagger, which establishes a baseline that any reasonable model should outperform.

In [107]:
accuracy, precision_comma, recall_comma, F1_comma = maj_tagger.evaluate(test_iter)
print (f"Overall Accuracy: {accuracy:.4f}. \n"
       f"Comma: Precision: {precision_comma:.4f}. Recall: {recall_comma:.4f}. F1: {F1_comma:.4f}")

Overall Accuracy: 0.8511. 
Comma: Precision: 0.2420. Recall: 0.2093. F1: 0.2245


<!-- BEGIN QUESTION -->

**Question:** You can see that even though the overall accuracy is pretty high, the F-1 score of comma is very low. Why?

<!--
BEGIN QUESTION
name: open_response_F1
manual: true
-->

The overall accuracy is high because the majority of punctuation is " "(no punctuation), as in the amount of time a word is followed by "O" is much higher than the amount of times its followed by "," , and since we are using the majority model that is what we expect the model to predict.
Additionally, there are very little apperances of a comma in the sentence, meaning, even with a false prediction, it will not vastly affect the accuracy of the model.

The F-1 score of the comma is very low because it precicly predicts the accuracy for the comma itself without regarding the other predictions of other puncutions in the sentence, meaning it better describes the "accuracy" of the models predicition of the comma which is low. 

<!-- END QUESTION -->



# RNN Sequence Tagging

Now we get to the real point, using an RNN model for sequence tagging. We provide a base class `RNNBaseTagger` below, which implements training and evaluation. Throughout the rest of this lab, you will implement three subclasses of this class, using PyTorch functions at different abstraction levels.

In [108]:
class RNNBaseTagger(nn.Module):
  def __init__(self):
    super().__init__()

  def init_parameters(self, init_low=-0.15, init_high=0.15):
    """Initialize parameters. We usually use larger initial values for smaller models.
    See http://proceedings.mlr.press/v9/glorot10a/glorot10a.pdf for a more
    in-depth discussion.
    """
    for p in self.parameters():
      p.data.uniform_(init_low, init_high)

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (1 ,seq_len) 
    Returns:
      logits: a tensor of size (1, seq_len, self.N)
    """
    raise NotImplementedError

  def compute_loss(self, logits, tags):
    return self.loss_function(logits.view(-1, self.N), tags.view(-1))

  def train_all(self, train_iter, val_iter, epochs=5, learning_rate=1e-3):
    # Switch the module to training mode
    self.train()
    # Use Adam to optimize the parameters
    optim = torch.optim.Adam(self.parameters(), lr=learning_rate)
    best_validation_accuracy = -float('inf')
    best_model = None
    # Run the optimization for multiple epochs
    for epoch in range(epochs): 
      total = 0
      running_loss = 0.0
      for batch in tqdm(train_iter):
        # Zero the parameter gradients
        self.zero_grad()

        # Input and target
        words = batch['input_ids'] # 1, seq_len
        tags = batch['tag_ids']    # 1, seq_len
        
        # Run forward pass and compute loss along the way.
        logits = self.forward(words)
        loss = self.compute_loss(logits, tags)
        
        # Perform backpropagation
        (loss/words.size(1)).backward()

        # Update parameters
        optim.step()

        # Training stats
        total += 1
        running_loss += loss.item()
        
      # Evaluate and track improvements on the validation dataset
      validation_accuracy, _, _, _ = self.evaluate(val_iter)
      if validation_accuracy > best_validation_accuracy:
        best_validation_accuracy = validation_accuracy
        self.best_model = copy.deepcopy(self.state_dict())
      epoch_loss = running_loss / total
      print (f'Epoch: {epoch} Loss: {epoch_loss:.4f} '
             f'Validation accuracy: {validation_accuracy:.4f}')

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (1, seq_len) 
    Returns:
      tag_batch: a tensor containing tag ids of size (1, seq_len)
    """
    raise NotImplementedError

  def evaluate(self, iterator):
    """Returns the model's performance on a given dataset `iterator`.

    Arguments: 
      iterator
    Returns:
      overall accuracy, and precision, recall, and F1 for comma
    """
    correct = 0
    total = 0
    true_positive_comma = 0
    predicted_positive_comma = 0
    total_positive_comma = 0
    comma_id = tag_vocab[',']
    pad_id = tag_vocab[pad_token]

    for batch in tqdm(iterator):
      words = batch['input_ids']       # 1, seq_len
      tags = batch['tag_ids']          # 1, seq_len
      tags_pred = self.predict(words)  # 1, seq_len
      mask = tags.ne(pad_id)
      cor = (tags == tags_pred)[mask]
      correct += cor.float().sum().item()
      total += mask.float().sum().item()
      predicted_positive_comma += (mask * tags_pred.eq(comma_id)).float().sum().item()
      true_positive_comma += (mask * tags.eq(comma_id) * tags_pred.eq(comma_id)).float().sum().item()
      total_positive_comma += (mask * tags.eq(comma_id)).float().sum().item()

    precision_comma = true_positive_comma / predicted_positive_comma
    recall_comma = true_positive_comma / total_positive_comma
    F1_comma = 2. / (1./precision_comma + 1./recall_comma)
    return correct/total, precision_comma, recall_comma, F1_comma

## RNN from scratch

In this part of the lab, you will implement the forward pass of an RNN from scratch. Recall that 

\begin{align}
h_0 &= 0\\
h_t &= \sigma(\vect{U}x_t + \vect{V}h_{t - 1} + b_h) \\
o_t &= \vect{W}h_t + b_o
\end{align}

where we embed each word and use its embedding as $x_t$, and we use $o_t$ as the output logits. (Again, the final softmax has been absorbed into the loss function so you don't need to implement that.) Note that we added bias vectors $b_h$ and $b_o$ in this lab since we are training very small models. (In large models, having a bias vector matters a lot less.) In this part, you should implement the RNN from scratch and *not* use `nn.RNN`. We'll make use of this convenient PyTorch module in the next part. 

You will need to implement both the `forward` function and the `predict` function.

> Hint: You might find [`torch.stack`](https://pytorch.org/docs/stable/generated/torch.stack.html) useful for stacking a list of tensors to form a single tensor. You can also use `torch.mv` or `@` for matrix-vector multiplication, `torch.mm` or `@` for matrix-matrix multiplication.

**Warning: Training this and later models takes a little while, likely around three minutes for the full set of epochs. You might want to set the number of epochs to a small number (1?) until your code is running well. You should also feel free to move ahead to the next parts while earlier parts are running.**

In [109]:
class RNNTagger1(RNNBaseTagger):
  def __init__(self, hf_text_tokenizer, hf_tag_tokenizer, embedding_size, hidden_size):
    super().__init__()
    self.hf_text_tokenizer = hf_text_tokenizer
    self.hf_tag_tokenizer = hf_tag_tokenizer
    
    self.N = len(self.hf_tag_tokenizer)   # tag vocab size
    self.V = len(self.hf_text_tokenizer)  # text vocab size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size

    # Create essential modules
    self.word_embeddings = nn.Embedding(self.V, embedding_size) # Lookup layer
    self.U = nn.Parameter(torch.Tensor(hidden_size, embedding_size))
    self.V = nn.Parameter(torch.Tensor(hidden_size, hidden_size))
    self.b_h = nn.Parameter(torch.Tensor(hidden_size))
    self.sigma = nn.Tanh() # Nonlinear Layer
    self.W = nn.Parameter(torch.Tensor(self.N, hidden_size))
    self.b_o = nn.Parameter(torch.Tensor(self.N))

    # Create loss function
    pad_id = self.hf_tag_tokenizer.pad_token_id
    self.loss_function = nn.CrossEntropyLoss(reduction='sum', ignore_index=pad_id)

    # Initialize parameters
    self.init_parameters()

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (1, seq_len) 
    Returns:
      logits: a tensor of size (1, seq_len, self.N)
    """
    h0 = torch.zeros(self.hidden_size, device=device)
    word_embeddings = self.word_embeddings(text_batch) # 1, seq_len, embedding_size
    seq_len = word_embeddings.size(1)

    #TODO: your code below
    o_vect = []
    h_t = h0

    for idx in range(0,seq_len):
      h_t = self.sigma(self.U @ word_embeddings[0][idx] + self.V @ h_t + self.b_h)
      o_t = self.W @ h_t + self.b_o
      o_vect.append(o_t)
    logits = (torch.stack(o_vect, dim=0))

    return logits.view(1,seq_len,self.N)

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (1, seq_len) 
    Returns:
      tag_batch: a tensor containing tag ids of size (1, seq_len)
    """
    #TODO: your code below
    tag_batch = torch.argmax(self.forward(text_batch),dim=2)
    return tag_batch

In [110]:
# Instantiate and train classifier
rnn_tagger1 = RNNTagger1(hf_text_tokenizer, hf_tag_tokenizer, embedding_size=32, hidden_size=32).to(device)
rnn_tagger1.train_all(train_iter, val_iter, epochs=5, learning_rate=1e-3)
rnn_tagger1.load_state_dict(rnn_tagger1.best_model)

# Evaluate model performance
train_accuracy1, train_p1, train_r1, train_f1 = rnn_tagger1.evaluate(train_iter)
test_accuracy1, test_p1, test_r1, test_f1 = rnn_tagger1.evaluate(test_iter)
print(f'\nTraining accuracy: {train_accuracy1:.3f}, precision: {train_p1:.3f}, recall: {train_r1:.3f}, F-1: {train_f1:.3f}\n'
      f'Test accuracy: {test_accuracy1:.3f}, precision: {test_p1:.3f}, recall: {test_r1:.3f}, F-1: {test_f1:.3f}')

  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 0 Loss: 77.0154 Validation accuracy: 0.8930


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 1 Loss: 55.0089 Validation accuracy: 0.8923


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 2 Loss: 53.1839 Validation accuracy: 0.8909


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 3 Loss: 52.1290 Validation accuracy: 0.8919


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 4 Loss: 51.1200 Validation accuracy: 0.8920


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/120 [00:00<?, ?it/s]


Training accuracy: 0.895, precision: 0.583, recall: 0.009, F-1: 0.017
Test accuracy: 0.893, precision: 0.600, recall: 0.020, F-1: 0.038


In [111]:
grader.check("rnn1")

Did your model outperform the baseline? Don't be surprised if it doesn't: the model is very small and the dataset is small as well.

## RNN forward using `nn.RNN` and explicit loop through time steps

In this part, you will use `nn.RNN` and `nn.Linear` to implement the forward pass:

\begin{align}
h_0 &= 0\\
h_t &= \text{nn.RNN}(x_t,h_{t - 1}) \\
o_t &= \text{nn.Linear}(h_t)
\end{align}

You will need to implement both the `forward` function and the `predict` function. You'll use the `nn.RNN` function to implement each time step of the RNN, with an explicit `for` loop to step through the time steps. (In the next part, you'll use a single call to `nn.RNN` to handle the entire process!) For the linear projection from RNN outputs to logits, use `self.hidden2output`.

> Hint: you can reuse your `predict` implementation from before if you wrote it in a general way.

In [112]:
class RNNTagger2(RNNBaseTagger):
  def __init__(self, hf_text_tokenizer, hf_tag_tokenizer, embedding_size, hidden_size):
    super().__init__()
    self.hf_text_tokenizer = hf_text_tokenizer
    self.hf_tag_tokenizer = hf_tag_tokenizer
    
    self.N = len(self.hf_tag_tokenizer)   # tag vocab size
    self.V = len(self.hf_text_tokenizer)  # text vocab size

    self.embedding_size = embedding_size
    self.hidden_size = hidden_size

    # Create essential modules
    self.word_embeddings = nn.Embedding(self.V, embedding_size) # Lookup layer
    self.rnn = nn.RNN(input_size=embedding_size, hidden_size=hidden_size, batch_first=True)
    self.hidden2output = nn.Linear(hidden_size, self.N)

    # Create loss function
    pad_id = self.hf_tag_tokenizer.pad_token_id
    self.loss_function = nn.CrossEntropyLoss(reduction='sum', ignore_index=pad_id)

    # Initialize parameters
    self.init_parameters()

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (1, seq_len) 
    Returns:
      logits: a tensor of size (1, seq_len, self.N)
    """
    # h0 is of shape (num_layers * num_directions, batch, hidden_size),
    # which is (1, 1, hidden_size)
    h0 = torch.zeros(1, 1, self.hidden_size, device=device)
    
    #TODO: your code below, using an *explicit for-loop*
    word_embeddings = self.word_embeddings(text_batch)
    seq_len = word_embeddings.size(1)
    h_t = h0
    o_vect = []
    
    for idx in range(0,seq_len):
      word  = word_embeddings[0][idx].view(1,1,self.hidden_size)
      o_t, h_t = self.rnn( word, h_t)
      o_vect.append(self.hidden2output(h_t)) #linear

    logits = (torch.stack(o_vect, dim=0))
    return logits.view(1,seq_len,self.N)

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (1, seq_len) 
    Returns:
      tag_batch: a tensor containing tag ids of size (1, seq_len)
    """
    #TODO: your code below
    tag_batch = torch.argmax(self.forward(text_batch),dim=2)
    return tag_batch

In [113]:
# Instantiate and train classifier
rnn_tagger2 = RNNTagger2(hf_text_tokenizer, hf_tag_tokenizer, embedding_size=32, hidden_size=32).to(device)
rnn_tagger2.train_all(train_iter, val_iter, epochs=5, learning_rate=1e-3)
rnn_tagger2.load_state_dict(rnn_tagger2.best_model)

# Evaluate model performance
train_accuracy2, train_p2, train_r2, train_f2 = rnn_tagger2.evaluate(train_iter)
test_accuracy2, test_p2, test_r2, test_f2 = rnn_tagger2.evaluate(test_iter)
print(f'\nTraining accuracy: {train_accuracy2:.3f}, precision: {train_p2:.3f}, recall: {train_r2:.3f}, F-1: {train_f2:.3f}\n'
      f'Test accuracy: {test_accuracy2:.3f}, precision: {test_p2:.3f}, recall: {test_r2:.3f}, F-1: {test_f2:.3f}')

  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 0 Loss: 75.6014 Validation accuracy: 0.8928


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 1 Loss: 54.3757 Validation accuracy: 0.8926


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 2 Loss: 52.3063 Validation accuracy: 0.8930


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 3 Loss: 50.8981 Validation accuracy: 0.8929


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 4 Loss: 49.8527 Validation accuracy: 0.8929


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/120 [00:00<?, ?it/s]


Training accuracy: 0.897, precision: 0.448, recall: 0.096, F-1: 0.159
Test accuracy: 0.893, precision: 0.374, recall: 0.079, F-1: 0.130


In [114]:
grader.check("rnn2")

## RNN forward using bidirectional `nn.RNN`

Instead of using a for loop, we can directly feed the entire sequence to `nn.RNN`:

\begin{align}
h_0 &= 0\\
H &= \text{nn.RNN}(X,h_0) \\
O &= \text{nn.Linear}(H)
\end{align}

where $X$ is the concatenation of $x_1, \cdots, x_T$, $H$ is the concatenation of $h_1, \cdots, h_T$, and $O$ is the concatenation of $o_1, \cdots, o_T$. 

By using this formulation, our code becomes more efficient, since `nn.RNN` is highly optimized. Besides, we can use bi-directional RNNs by simply passing `bidirectional=True` to the RNN constructor.

The difference between a bidirectional RNN and a unidirectional RNN is that bidirectional RNNs have an additional RNN cell running in the reverse direction:

\begin{align}
&h_{T+1}' = 0\\
&h_t' = \sigma(\vect{U}'x_{t}' + \vect{V}'h_{t + 1}' + b_h') \\
\end{align}

To get the output at step $t$, a bidirectional RNN simply concatenates $h_t$ and $h_t'$ and projects to produce outputs. The benefit of a bidirectional RNN is that the output at step $t$ takes into account not only words $x_1,\cdots, x_t$, but also $x_{t+1}, \cdots, x_T$.

Implement `forward` and `predict` functions below, using a bidirectional RNN.

In [115]:
class RNNTagger3(RNNBaseTagger):
  def __init__(self, hf_text_tokenizer, hf_tag_tokenizer, embedding_size, hidden_size):
    super().__init__()
    self.hf_text_tokenizer = hf_text_tokenizer
    self.hf_tag_tokenizer = hf_tag_tokenizer
    
    self.N = len(self.hf_tag_tokenizer)   # tag vocab size
    self.V = len(self.hf_text_tokenizer)  # text vocab size
    self.embedding_size = embedding_size
    self.hidden_size = hidden_size

    # Create essential modules
    self.word_embeddings = nn.Embedding(self.V, embedding_size) # Lookup layer
    self.rnn = nn.RNN(input_size=embedding_size, 
                      hidden_size=hidden_size,
                      batch_first=True,
                      bidirectional=True)
    self.hidden2output = nn.Linear(hidden_size*2, self.N) # *2 due to using bi-rnn

    # Create loss function
    pad_id = self.hf_tag_tokenizer.pad_token_id
    self.loss_function = nn.CrossEntropyLoss(reduction='sum', ignore_index=pad_id)

    # Initialize parameters
    self.init_parameters()

  def forward(self, text_batch):
    """Performs forward, returns logits.
    
    Arguments: 
      text_batch: a tensor containing word ids of size (1, seq_len) 
    Returns:
      logits: a tensor of size (1, seq_len, self.N)
    """
    hidden = None # equivalent to setting hidden to a zero vector

    #TODO: your code below, without using any for-loops
    word_embeddings = self.word_embeddings(text_batch)
    rnn_out = self.rnn(word_embeddings,hidden)[0]
    logits = self.hidden2output(rnn_out)
    return logits

  def predict(self, text_batch):
    """Returns the most likely sequence of tags for a sequence of words in `text_batch`.

    Arguments: 
      text_batch: a tensor containing word ids of size (1, seq_len) 
    Returns:
      tag_batch: a tensor containing tag ids of size (1, seq_len)
    """
    #TODO: your code below
    tag_batch = torch.argmax(self.forward(text_batch),dim=2)
    return tag_batch

In [116]:
# Instantiate and train classifier
rnn_tagger3 = RNNTagger3(hf_text_tokenizer, hf_tag_tokenizer, embedding_size=32, hidden_size=32).to(device)
rnn_tagger3.train_all(train_iter, val_iter, epochs=5, learning_rate=1e-3)
rnn_tagger3.load_state_dict(rnn_tagger3.best_model)

# Evaluate model performance
train_accuracy3, train_p3, train_r3, train_f3 = rnn_tagger3.evaluate(train_iter)
test_accuracy3, test_p3, test_r3, test_f3 = rnn_tagger3.evaluate(test_iter)
print(f'\nTraining accuracy: {train_accuracy3:.3f}, precision: {train_p3:.3f}, recall: {train_r3:.3f}, F-1: {train_f3:.3f}\n'
      f'Test accuracy: {test_accuracy3:.3f}, precision: {test_p3:.3f}, recall: {test_r3:.3f}, F-1: {test_f3:.3f}')

  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 0 Loss: 69.6651 Validation accuracy: 0.8961


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 1 Loss: 47.0417 Validation accuracy: 0.9048


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 2 Loss: 41.5202 Validation accuracy: 0.9045


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 3 Loss: 38.2586 Validation accuracy: 0.9087


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/106 [00:00<?, ?it/s]

Epoch: 4 Loss: 35.4437 Validation accuracy: 0.9097


  0%|          | 0/809 [00:00<?, ?it/s]

  0%|          | 0/120 [00:00<?, ?it/s]


Training accuracy: 0.933, precision: 0.635, recall: 0.448, F-1: 0.526
Test accuracy: 0.909, precision: 0.479, recall: 0.328, F-1: 0.390


In [117]:
grader.check("birnn")

Let's see what our model predicts for the example we used before.

In [118]:
# Pick one example
example_id = 2 # the third example
example = test_data[example_id]

# Process strings to word ids
text_tensor  = torch.LongTensor([example['input_ids']]).to(device)

# Predict
prediction_tensor = rnn_tagger3.predict(text_tensor)[0]

print ('Ground truth punctuation:')
print(restore_punctuation(example['input_ids'], example['tag_ids']))
print ('Predicted punctuation:')
print(restore_punctuation(example['input_ids'], prediction_tensor))

Ground truth punctuation:
<bos> the several departments being perfectly co-ordinate by the terms of their common commission , none of them , it is evident , can pretend to an exclusive or superior right of [UNK] the boundaries between their respective powers ; and how are the encroachments of the stronger to be prevented , or the [UNK] of the weaker to be redressed , without an appeal to the people themselves , who , as the [UNK] of the commissions , can alone declare its true meaning , and enforce its observance ? there is certainly great force in this reasoning , and it must be allowed to prove that a constitutional road to the decision of the people ought to be marked out and kept open , for certain great and extraordinary occasions . but there appear to be insuperable objections against the proposed [UNK] to the people , as a provision in all cases for keeping the several departments of power within their constitutional limits . in the first place , the provision does not reach the

<!-- BEGIN QUESTION -->

**Question:** Did your bidirectional RNN reach a higher F-1 score than unidirectional RNNs? Why?

<!--
BEGIN QUESTION
name: open_response_birnn
manual: true
-->

Since bidirectional RNNs have an additional RNN cell running in the reverse direction, it allows analyzing the future events (sentences in our case) by not limiting the model's learning to past and present.

This is especially useful for tasks like predicting punctuation.
For example, knowing how many words are left in a paragraph could influence whether to use a period or a comma.
Additionally, if two sentences are linked with a comma, a bidirectional RNN could potentially recognize this pattern better because it can see both sentences at once (the future event), whereas a unidirectional RNN couldnt. 

This results in a better accuracy and overall better predictions for bidirectional RNNs.
We can see that the result of bidirectinoal RNN Tagger holds the best accuracy percentage and the best F-1 score. 😀

<!-- END QUESTION -->

<!-- BEGIN QUESTION -->

# Lab debrief

**Question:** We're interested in any thoughts your group has about this lab so that we can improve this lab for later years, and to inform later labs for this year. Please list any issues that arose or comments you have to improve the lab. Useful things to comment on include the following: 

* Was the lab too long or too short?
* Were the readings appropriate for the lab? 
* Was it clear (at least after you completed the lab) what the points of the exercises were? 
* Are there additions or changes you think would make the lab better?

<!--
BEGIN QUESTION
name: open_response_debrief
manual: true
-->

_Type your answer here, replacing this text._

<!-- END QUESTION -->



# End of lab 2-5

---

To double-check your work, the cell below will rerun all of the autograder tests.

In [119]:
grader.check_all()