<a href="https://colab.research.google.com/github/mayurs619/CS-572---Natural-Language-Processing/blob/main/assignment0.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

## CS572 Assignment 0

*This tutorial has been adapted from the one created by David Gaddy, Daniel Fried, Nikita Kitaev, Mitchell Stern, Rodolfo Corona, John DeNero, and Dan Klein.*

The purpose of this first assignment is to make sure that you are familiar with all the tools you need to complete the programming assignments for the course.  We will walk you through the process of building a model with PyTorch in Colab.  Most of it will be structured as a tutorial, but we will ask you to fill in code and submit at the end.

Please note that this assignment is not representative of the assignments for this course - other assignments will be considerably more involved and take longer to complete.

**Grading rubric**
- 60% Pooling network (meets target)
- 40% Improved network (improvement over target)

**Deadline**
Monday 01/30


### Important Note:
During TA office hours next week, the TAs will use this assignment as a PyTorch tutorial and cover solutions to most parts. Please attend the session. More details will be posted on Ed.

### TA contacts for this assignment:
Gaurav Parikh (gaurav.rajesh.parikh@duke.edu) \\
Aayush Sheth (aas146@duke.edu)

### Colab

Our assignments will be given to you as Jupyter notebooks, and we intend for you to run them using Google Colab.
Colab is an online editor that also provides free access to a GPU.
To get started, make a copy of the assignment by clicking `File->Save a copy in drive...`.  You will need to be logged into a Google account.

To access a GPU, go to `Edit->Notebook settings` and in the `Hardware accelerator` dropdown choose `GPU`.
As soon as you run a code cell, you will be connected to a cloud instance with a GPU.
Try running the code cell below to check that a GPU is connected (select the cell then either click the play button at the top left or press `Ctrl+Enter` or `Shift+Enter`).

In [None]:
import torch

if torch.cuda.is_available():
    print('Found GPU')
else:
    print('Did not find GPU')

Found GPU


When you run a code cell, Colab executes it on a temporary cloud instance.  Every time you open the notebook, you will be assigned a different machine.  All compute state and files saved on the previous machine will be lost.  Therefore, you may need to re-download datasets or rerun code after a reset. If you save output files that you don't want to lose, you should download them to your personal computer before moving on to something else.  You can download files by clicking the folder icon at the top left of the page under the menus to expand the sidebar and right clicking the file you want, and clicking `Download`.  Alternatively, you can mount your Google drive to the temporary cloud instance's local filesystem using the following code snippet and save files under the specified directory (note that you will have to provide permission every time you run this).


In [None]:
# mount Google drive
from google.colab import drive
drive.mount('/content/drive')

# now you can see files
!echo -e "\nNumber of Google drive files in /content/drive/My Drive/:"
!ls -l "/content/drive/My Drive/" | wc -l
# by the way, you can run any linux command by putting a ! at the start of the line

# by default everything gets executed and saved in /content/
!echo -e "\nCurrent directory:"
!pwd

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Number of Google drive files in /content/drive/My Drive/:
117

Current directory:
/content


Many of the assignments will require training a model for some period of time, often on the order of 20-30 minutes.  There are some important limitations to Colab that you should be aware of when running code for this amount of time.  If you close the window or put your computer to sleep, Colab will disconnect you from the compute machine and your code will stop running.   There are also timeouts for inactivity (somewhere on the order of 30 minutes), so if you want to leave code running, be sure to check back periodically.  After a timeout, your compute machine will be disconnected and the files on it will be lost.

A few other notes about using Colab:
* The `Runtime` menu has many different run options, such as `Run all` or `Run after` so you don't have to run each code block individually.
* Some people have run into CUDA device assert errors that did not originate from their code.  Restarting the runtime should fix this (unless there actually is a problem with your code).

**Note on GPU usage**: Colab restricts GPU usage, so you may be locked out after using a GPU continuously for a long stretch (~8 hours). To avoid this, use the GPU only when you need it. You can enable or disable the GPU by changing the runtime type under the `Runtime` menu. If you do get locked out, a potential workaround is to sign in with a different account.

### Part-of-Speech Tagging

You'll be trying to predict the most common [part of speech](https://web.stanford.edu/~jurafsky/slp3/8.pdf) for a word from its characters.  This project will focus on word [types rather than tokens](https://en.wikipedia.org/wiki/Type%E2%80%93token_distinction) and will not use any context. This task is different from (and simpler than) a standard part-of-speech tagging task, which predicts part-of-speech tags for tokens in their sentential context.

Many words can have multiple different parts of speech, but in this project we will associate each word only with its most common part of speech in the [Brown Corpus](https://en.wikipedia.org/wiki/Brown_Corpus), which has been manually labeled with part-of-speech tags.  

Words are lowercased and filtered for length and frequency. Punctuation and numbers are removed. Any real NLP application would have to deal with the actual contents of text instead of filtering in this way, but we're just warming up.
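
To make the type/token distinction concrete, here is a small sketch on a hand-made token list. The toy data and the names `toy_tokens` and `toy_types` are invented for illustration and are not drawn from the Brown Corpus. It mirrors the idea described above: group token-level tags by lowercased word, then keep the most common tag for each word type.

```python
from collections import Counter, defaultdict

# Hypothetical tagged tokens; "run" occurs as a VERB twice and as a NOUN once.
toy_tokens = [("Run", "VERB"), ("the", "DET"), ("run", "NOUN"),
              ("run", "VERB"), ("The", "DET")]

# Collect every tag seen for each lowercased word.
tags_by_type = defaultdict(list)
for word, tag in toy_tokens:
    tags_by_type[word.lower()].append(tag)  # lowercasing merges "Run" and "run"

# Keep only the most common tag for each word type.
toy_types = {word: Counter(tags).most_common(1)[0][0]
             for word, tags in tags_by_type.items()}

print(len(toy_tokens), "tokens,", len(toy_types), "types")
print(toy_types)
```

Five tokens collapse into two types here, and the single NOUN use of "run" is discarded because VERB is more frequent.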

Below, we provide you with code to load the dataset. Please don't change the cell below, or you may confuse our autograder.

In [None]:
import nltk
import random

from nltk.corpus import brown
from collections import defaultdict, Counter
import pickle

nltk.download('brown')
nltk.download('universal_tagset')

brown_tokens = brown.tagged_words(tagset='universal')
print('Tagged tokens example: ', brown_tokens[:5])
print('Total # of word tokens:', len(brown_tokens))

max_word_len = 20

def most_common(s):
    "Return the most common element in a sequence."
    return Counter(s).most_common(1)[0][0]

def most_common_tags(tagged_words, min_count=3, max_len=max_word_len):
    "Return a dictionary of the most common tag for each word, filtering a bit."
    counts = defaultdict(list)
    for w, t in tagged_words:
        counts[w.lower()].append(t)
    return {w: most_common(tags) for w, tags in counts.items() if
            w.isalpha() and len(w) <= max_len and len(tags) >= min_count}

brown_types = most_common_tags(brown_tokens)
print('Tagged types example: ', sorted(brown_types.items())[:5])
print('Total # of word types:', len(brown_types))

def split(items, test_size):
    "Randomly split into train, validation, and test sets with a fixed seed."
    random.Random(288).shuffle(items)
    once, twice = test_size, 2 * test_size
    return items[:-twice], items[-twice:-once], items[-once:]

val_test_size = 1000
all_data_raw = split(sorted(brown_types.items()), val_test_size)
train_data_raw, validation_data_raw, test_data_raw = all_data_raw
all_tags = sorted(set(brown_types.values()))
print('Tag options:', all_tags)

Tagged tokens example:  [('The', 'DET'), ('Fulton', 'NOUN'), ('County', 'NOUN'), ('Grand', 'ADJ'), ('Jury', 'NOUN')]


[nltk_data] Downloading package brown to /root/nltk_data...
[nltk_data]   Package brown is already up-to-date!
[nltk_data] Downloading package universal_tagset to /root/nltk_data...
[nltk_data]   Package universal_tagset is already up-to-date!


Total # of word tokens: 1161192
Tagged types example:  [('a', 'DET'), ('aaron', 'NOUN'), ('ab', 'NOUN'), ('abandon', 'VERB'), ('abandoned', 'VERB')]
Total # of word types: 18954
Tag options: ['ADJ', 'ADP', 'ADV', 'CONJ', 'DET', 'NOUN', 'NUM', 'PRON', 'PRT', 'VERB', 'X']


You're welcome to insert additional cells and explore the data. Our autograders don't rely on any particular structure of the notebook.

In [None]:
# Explore the data here as you see fit.

First, let's run a baseline that predicts `NOUN` for every word. A predictor function takes a list of tagged words and returns a list of predicted tags. We've also provided some helper functions here to evaluate model outputs.  You don't need to fill in any code in this cell.



In [None]:
def noun_predictor(raw_data):
    "A predictor that always predicts NOUN."
    predictions = []
    for word, _ in raw_data:
        predictions.append('NOUN')
    return predictions

def accuracy(predictions, targets):
    """Return the accuracy percentage of a list of predictions.

    predictions has only the predicted tags
    targets has tuples of (word, tag)
    """
    assert len(predictions) == len(targets)
    n_correct = 0
    for predicted_tag, (word, gold_tag) in zip(predictions, targets):
        if predicted_tag == gold_tag:
            n_correct += 1

    return n_correct / len(targets) * 100.0

def evaluate(predictor, raw_data):
    return accuracy(predictor(raw_data), raw_data)

def print_sample_predictions(predictor, raw_data, k=10):
    "Print the first k predictions."
    d = raw_data[:k]
    print('Sample predictions:',
          [(word, guess) for (word, _), guess in zip(d, predictor(d))])

print('noun baseline validation accuracy:',
      evaluate(noun_predictor, validation_data_raw))
print_sample_predictions(noun_predictor, validation_data_raw)

noun baseline validation accuracy: 55.1
Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'NOUN'), ('downwind', 'NOUN'), ('exodus', 'NOUN'), ('avoiding', 'NOUN'), ('informal', 'NOUN'), ('padded', 'NOUN'), ('tantalizing', 'NOUN'), ('farce', 'NOUN'), ('berger', 'NOUN')]
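
Any function with the same shape as `noun_predictor` (a list of `(word, tag)` pairs in, a list of tag strings out) can be scored the same way. As a sketch, the hypothetical helper below builds a majority-class predictor from training data instead of hard-coding `'NOUN'`. The toy data is invented for illustration.

```python
from collections import Counter

def make_majority_predictor(train_pairs):
    """Build a predictor that always returns the most frequent training tag."""
    majority_tag = Counter(tag for _, tag in train_pairs).most_common(1)[0][0]

    def predictor(raw_data):
        # Same interface as noun_predictor: one tag per (word, tag) pair.
        return [majority_tag for _ in raw_data]

    return predictor

# Hypothetical toy data, for illustration only.
toy_train = [("dog", "NOUN"), ("run", "VERB"), ("cat", "NOUN"), ("red", "ADJ")]
toy_val = [("tree", "NOUN"), ("jump", "VERB")]

majority = make_majority_predictor(toy_train)
preds = majority(toy_val)
toy_accuracy = 100.0 * sum(p == t for p, (_, t) in zip(preds, toy_val)) / len(toy_val)
print(preds, toy_accuracy)
```

In the notebook, a predictor built this way from `train_data_raw` could be passed to `evaluate` exactly like `noun_predictor`.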


### Building a PyTorch Classifier

We will be using the deep learning framework PyTorch for all our assignments.
If you haven't used PyTorch at all before, we recommend you check out the tutorials on the PyTorch website: https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html.  Throughout this assignment and the others in this course, you will need to reference the documentation at https://pytorch.org/docs/stable/index.html.  We'll be using PyTorch version 2.1.0, which comes pre-installed with Colab.  In this assignment, we'll walk you through the process of defining and training your neural network model, but future assignments will have less guidance.

Below, we've provided a baseline network as a PyTorch Module that will learn a single parameter per part-of-speech tag. This model has the capacity to learn that `'NOUN'` is the most common tag and predict that. It can't do better. Use this network as you are developing your training and prediction code, then replace it with your actual network later.

In [None]:
import torch
from torch import nn
import torch.nn.functional as F

class BaselineNetwork(nn.Module):
    def __init__(self, n_outputs):
        super().__init__()

        # learn a vector of size n_outputs, initialized with all zeros
        self.param = nn.Parameter(torch.zeros(n_outputs))

    def forward(self, chars, mask):
        """This function defines the computation graph.

        Args:
          chars: a matrix of size batch_size x max_len
          mask: a matrix of size batch_size x max_len

        Returns:
          a matrix of scores of size batch_size x n_outputs
        """
        # return the same outputs (self.param) for each example in a batch
        # here we use 'expand' to return the same outputs for each example in the batch
        return self.param.expand(chars.size(0), -1)
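
Why can this network do no better than always predicting one tag? Its output ignores the input, so the best it can do under a cross-entropy loss is match the overall tag frequencies. The pure-Python sketch below checks this on invented tag counts (not the Brown Corpus counts): setting each parameter to the log of its tag's count makes the softmax reproduce the empirical frequencies, and the argmax is then the majority tag.

```python
import math

# Hypothetical tag counts, for illustration only.
tag_counts = {"ADJ": 20, "NOUN": 55, "VERB": 25}
total = sum(tag_counts.values())

# One parameter per tag, set to the log of the tag's count.
params = {tag: math.log(count) for tag, count in tag_counts.items()}

# Softmax over the parameters.
normalizer = sum(math.exp(p) for p in params.values())
probs = {tag: math.exp(p) / normalizer for tag, p in params.items()}

# The softmax output matches the empirical frequency of each tag.
for tag in tag_counts:
    print(tag, round(probs[tag], 3), tag_counts[tag] / total)

best_tag = max(probs, key=probs.get)
print("argmax:", best_tag)
```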

To train or evaluate a neural model, we'll need to transform the raw data from strings into tensors.  We've provided the following function to perform the transformation for you. Each word is prepended with the `^` character and appended with `$` so that these boundary characters are available to the network.

In [None]:
def make_matrices(data_raw):
    """Convert a list of (word, tag) pairs into tensors with appropriate padding.

    character_matrix holds character codes for each word,
      indexed as [word_index, character_index]
    character_mask masks valid characters (1 for valid, 0 invalid),
      indexed similarly so that all inputs can have a constant length
    pos_labels holds part-of-speech class indices (positions in all_tags),
      indexed as [word_index]
    """
    max_len = max_word_len + 2  # leave room for word start/end symbols
    character_matrix = torch.zeros(len(data_raw), max_len, dtype=torch.int64)
    character_mask = torch.zeros(len(data_raw), max_len, dtype=torch.float32)
    pos_labels = torch.zeros(len(data_raw), dtype=torch.int64)
    for word_i, (word, pos) in enumerate(data_raw):
        for char_i, c in enumerate('^' + word + '$'):
            character_matrix[word_i, char_i] = ord(c)
            character_mask[word_i, char_i] = 1
        pos_labels[word_i] = all_tags.index(pos)
    return torch.utils.data.TensorDataset(character_matrix, character_mask, pos_labels)

validation_data = make_matrices(validation_data_raw)

print('Sample datapoint after preprocessing:', validation_data[0])
print('Raw datapoint:', validation_data_raw[0])

Sample datapoint after preprocessing: (tensor([ 94, 115,  97, 108, 101, 109,  36,   0,   0,   0,   0,   0,   0,   0,
          0,   0,   0,   0,   0,   0,   0,   0]), tensor([1., 1., 1., 1., 1., 1., 1., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
        0., 0., 0., 0.]), tensor(5))
Raw datapoint: ('salem', 'NOUN')
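
The character codes in the sample above are plain Unicode code points, so they can be reproduced and inverted without PyTorch. The following sketch re-implements the encoding for a single word using Python lists. The helper names `encode_word` and `decode_codes` are hypothetical and not part of the assignment code.

```python
MAX_WORD_LEN = 20               # same value as max_word_len in the notebook
PADDED_LEN = MAX_WORD_LEN + 2   # room for the ^ and $ boundary symbols

def encode_word(word):
    """Return (codes, mask) lists of length PADDED_LEN for one word."""
    marked = "^" + word + "$"
    padding = PADDED_LEN - len(marked)
    codes = [ord(c) for c in marked] + [0] * padding
    mask = [1.0] * len(marked) + [0.0] * padding
    return codes, mask

def decode_codes(codes, mask):
    """Invert encode_word by keeping only the unmasked positions."""
    return "".join(chr(c) for c, m in zip(codes, mask) if m)

codes, mask = encode_word("salem")
print(codes[:8])                  # [94, 115, 97, 108, 101, 109, 36, 0]
print(decode_codes(codes, mask))  # ^salem$
```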


The output of a `BaselineNetwork` is a matrix of dimension (batch_size, num_pos_labels) containing logits, or unnormalized log probabilities. To get probabilities from this matrix, you would run `F.softmax(x, dim=1)`, which exponentiates the logits and then normalizes each row to sum to 1.  The cell below generates an output distribution for the first example of the validation set, which is uniform because the network param was initialized to zero.

In PyTorch, it is common to return pre-activation values from modules (e.g. the values before running the final softmax or sigmoid operation).  PyTorch has loss functions that combine the softmax/sigmoid operation into the loss operation for more numerical stability.  Be sure you know what type of values a network returns, as this will affect your training and prediction code.

In [None]:
# Create a network and copy its parameters to the GPU.
untrained_baseline = BaselineNetwork(len(all_tags)).cuda()
untrained_baseline.eval()

# Select the first validation example.
example = validation_data[0]
chars, mask, _ = example

# Networks only process batches. Create a batch of size one.
chars_batch, mask_batch = chars.unsqueeze(0), mask.unsqueeze(0)

# Copy batch to the GPU.
chars_batch, mask_batch = chars_batch.cuda(), mask_batch.cuda()

# Run the untrained network.
logits = untrained_baseline(chars_batch, mask_batch)

# Convert to a distribution.
output_distribution = F.softmax(logits, dim=1).squeeze().tolist()

# Inspect the distribution, which should be uniform.
list(zip(all_tags, output_distribution))

[('ADJ', 0.09090909361839294),
 ('ADP', 0.09090909361839294),
 ('ADV', 0.09090909361839294),
 ('CONJ', 0.09090909361839294),
 ('DET', 0.09090909361839294),
 ('NOUN', 0.09090909361839294),
 ('NUM', 0.09090909361839294),
 ('PRON', 0.09090909361839294),
 ('PRT', 0.09090909361839294),
 ('VERB', 0.09090909361839294),
 ('X', 0.09090909361839294)]
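
The uniform distribution above, and the reason losses are usually computed from logits directly, can both be seen in a few lines of plain Python. This is an illustrative sketch of the log-sum-exp trick under our own function names, not PyTorch's implementation: it shifts the logits by their maximum before exponentiating, so large values do not overflow.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax using the log-sum-exp trick."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def cross_entropy(logits, target_index):
    """Negative log-probability of the target class, computed from logits."""
    return -log_softmax(logits)[target_index]

# All-zero logits give a uniform distribution over the 11 tags.
uniform = [math.exp(lp) for lp in log_softmax([0.0] * 11)]
print(uniform[0])                     # 1/11
print(cross_entropy([0.0] * 11, 5))   # log(11)

# A naive exp overflows on large logits, but the shifted version is fine.
big_logits = [1000.0, 999.0, 998.0]
try:
    math.exp(big_logits[0])
    naive_ok = True
except OverflowError:
    naive_ok = False
stable_loss = cross_entropy(big_logits, 0)
print(naive_ok, stable_loss)
```

An untrained network with all-zero logits therefore starts with a loss of log(11) per example, which is a useful sanity check when you write your training loop.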

Finally, time to write some code!

In the cell below, define a predictor for a network by following the instructions in the comments. The predictor takes a list of (word, tag) pairs, in the same format as the raw datasets above, and returns a list of part-of-speech tags (strings).

For this assignment, we've provided more fine-grained instructions as comments in the code template.  You are free to explore methods and architectures other than the ones we specified in the comments, but we highly recommend starting with them, as they will help you reach the required accuracies and give lots of best practices to use in later projects.

In [None]:
import torch
from torch.utils.data import DataLoader

def predict_using(network):
    def predictor(raw_data):
        """Return a list of part-of-speech tags as strings, one for each word."""

        device = next(network.parameters()).device  # Get the device of the model

        with torch.no_grad():
            network.eval()

            input_data = make_matrices(raw_data)
            data_loader = torch.utils.data.DataLoader(input_data, batch_size=32, shuffle=False)

            predictions = []

            for batch in data_loader:
                chars_batch, mask_batch, _ = batch

                chars_batch = chars_batch.to(device)
                mask_batch = mask_batch.to(device)

                outputs = network(chars_batch, mask_batch)

                predicted_labels = outputs.argmax(dim=-1)

                predicted_tags = [all_tags[label] for label in predicted_labels.cpu().numpy()]
                predictions.extend(predicted_tags)

            return predictions

    return predictor



print_sample_predictions(predict_using(untrained_baseline), validation_data_raw)

Sample predictions: [('salem', 'ADJ'), ('unsympathetic', 'ADJ'), ('downwind', 'ADJ'), ('exodus', 'ADJ'), ('avoiding', 'ADJ'), ('informal', 'ADJ'), ('padded', 'ADJ'), ('tantalizing', 'ADJ'), ('farce', 'ADJ'), ('berger', 'ADJ')]



Fill in the training function for the neural network below. This function should train any network.  

Then, you'll have all the parts needed to train and evaluate the baseline network.  You should get the same accuracy as the all-noun baseline.  Make sure your train function prints validation scores so that you see score outputs here.

In [None]:
import torch
import torch.optim as optim
import torch.nn.functional as F
import tqdm

def train(network, n_epochs=25):
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    network = network.to(device)

    train_data = make_matrices(train_data_raw)
    data_loader = torch.utils.data.DataLoader(train_data, batch_size=32, shuffle=True)

    optimizer = optim.Adam(network.parameters(), lr=1e-3)

    predictor = predict_using(network)

    best_validation_score = float('-inf')

    for epoch in range(n_epochs):
        print('Epoch', epoch)

        network.train()

        for batch in tqdm.tqdm(data_loader, leave=False):
            chars_batch, mask_batch, pos_batch = batch

            chars_batch = chars_batch.to(device)
            mask_batch = mask_batch.to(device)
            pos_batch = pos_batch.to(device)

            optimizer.zero_grad()

            outputs = network(chars_batch, mask_batch)

            loss = F.cross_entropy(outputs.view(-1, outputs.size(-1)), pos_batch.view(-1))

            loss.backward()
            optimizer.step()

        validation_score = evaluate(predictor, validation_data_raw)
        print('Validation score:', validation_score)

        if validation_score > best_validation_score:
            best_validation_score = validation_score
            torch.save(network.state_dict(), 'best_model.pth')

    network.load_state_dict(torch.load('best_model.pth'))
    return network

trained_baseline_network = train(BaselineNetwork(len(all_tags)), 2)
print_sample_predictions(predict_using(trained_baseline_network), validation_data_raw)

Epoch 0
Validation score: 55.1
Epoch 1
Validation score: 55.1
Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'NOUN'), ('downwind', 'NOUN'), ('exodus', 'NOUN'), ('avoiding', 'NOUN'), ('informal', 'NOUN'), ('padded', 'NOUN'), ('tantalizing', 'NOUN'), ('farce', 'NOUN'), ('berger', 'NOUN')]


It's time to actually define a non-trivial neural network.  We'll start with a pretty simple model that takes embeddings of the characters of a word, pools them, and runs a feedforward network.  Fill in your code for `PoolingNetwork` below.  A correct implementation should get a validation score over 66%.
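
If the pooling step is unclear, the sketch below shows one way to average embeddings over only the real characters of each word, so that padding positions do not dilute the mean. The tensor sizes and random values are made up for illustration, and this is only one possible formulation of masked mean pooling.

```python
import torch

torch.manual_seed(0)

# Hypothetical sizes: 2 words, 4 character positions, embedding size 3.
embeddings = torch.randn(2, 4, 3)
mask = torch.tensor([[1., 1., 1., 0.],
                     [1., 1., 0., 0.]])

# Zero out padding positions, then divide by the number of valid characters.
masked = embeddings * mask.unsqueeze(-1)   # shape (2, 4, 3)
lengths = mask.sum(dim=1, keepdim=True)    # shape (2, 1)
pooled = masked.sum(dim=1) / lengths       # shape (2, 3)

# Reference: average only the valid prefix of each word.
expected = torch.stack([embeddings[0, :3].mean(dim=0),
                        embeddings[1, :2].mean(dim=0)])
print(torch.allclose(pooled, expected))
```

Because every word is wrapped in `^` and `$`, each row of the mask has at least two ones, so the division never hits zero.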

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class PoolingNetwork(nn.Module):
    def __init__(self, n_outputs, embedding_dim=100, vocab_size=1000):
        super().__init__()

        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.linear1 = nn.Linear(embedding_dim, 128)
        self.linear2 = nn.Linear(128, n_outputs)
        self.dropout = nn.Dropout(p=0.5)

    def forward(self, chars, mask):
        # chars and mask arrive on the model's device; train() and predict_using() move them there

        embeddings = self.embedding(chars)
        embeddings = embeddings * mask.unsqueeze(-1)

        valid_chars = mask.sum(dim=1).unsqueeze(-1)
        pooled_embeddings = embeddings.sum(dim=1) / valid_chars

        x = F.relu(self.linear1(pooled_embeddings))
        x = self.dropout(x)
        output = self.linear2(x)

        return output


trained_pooling_network = train(PoolingNetwork(len(all_tags)))
pooling_predictor = predict_using(trained_pooling_network)

Epoch 0
Validation score: 60.8
Epoch 1
Validation score: 63.9
Epoch 2
Validation score: 65.10000000000001
Epoch 3
Validation score: 64.60000000000001
Epoch 4
Validation score: 65.60000000000001
Epoch 5
Validation score: 66.7
Epoch 6
Validation score: 66.10000000000001
Epoch 7
Validation score: 64.8
Epoch 8
Validation score: 65.8
Epoch 9
Validation score: 66.8
Epoch 10
Validation score: 65.7
Epoch 11
Validation score: 65.60000000000001
Epoch 12
Validation score: 65.8
Epoch 13
Validation score: 66.5
Epoch 14
Validation score: 66.4
Epoch 15
Validation score: 66.2
Epoch 16
Validation score: 66.8
Epoch 17
Validation score: 66.8
Epoch 18
Validation score: 65.9
Epoch 19
Validation score: 65.3
Epoch 20
Validation score: 66.10000000000001
Epoch 21
Validation score: 66.7
Epoch 22
Validation score: 66.0
Epoch 23
Validation score: 66.8
Epoch 24
Validation score: 66.9


And look at some outputs.

In [None]:
print_sample_predictions(pooling_predictor, validation_data_raw)

Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'NOUN'), ('downwind', 'NOUN'), ('exodus', 'VERB'), ('avoiding', 'VERB'), ('informal', 'NOUN'), ('padded', 'VERB'), ('tantalizing', 'VERB'), ('farce', 'NOUN'), ('berger', 'NOUN')]


For this next part, we'll give you a little more freedom to experiment.  Think about what types of information could be useful for predicting parts of speech.  Think about what the pooling model is missing. Try to improve the accuracy found in the previous run. Implement an improved model that reaches a validation score above 75%.

One way is to operate over character n-grams before pooling.
There are several ways to implement this, but if you need help, you can use the following steps between the creation of embeddings and the mask/pool operations to process bigrams:
1. create two slices of the embedding tensor, one with the first character cut off and one with the last cut off
2. concatenate the two sliced tensors along the embedding dimension with `torch.cat`
3. run a linear layer with activation on the concatenated embeddings
4. cut off the first character of the mask tensor
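
The four steps above can be sketched on a small random tensor as follows. The sizes, the layer name `bigram_layer`, and the choice of ReLU are assumptions made for illustration. This is one possible reading of the steps, not a required implementation.

```python
import torch
from torch import nn
import torch.nn.functional as F

torch.manual_seed(0)

# Hypothetical sizes, for illustration only.
batch_size, max_len, emb_dim, hidden_dim = 2, 5, 4, 6
embeddings = torch.randn(batch_size, max_len, emb_dim)
mask = torch.tensor([[1., 1., 1., 1., 0.],
                     [1., 1., 1., 0., 0.]])
bigram_layer = nn.Linear(2 * emb_dim, hidden_dim)

# Step 1: two slices, each one position shorter than the input.
left = embeddings[:, :-1, :]    # last character cut off
right = embeddings[:, 1:, :]    # first character cut off

# Step 2: concatenate along the embedding dimension.
bigrams = torch.cat([left, right], dim=-1)   # shape (2, 4, 8)

# Step 3: linear layer with an activation.
hidden = F.relu(bigram_layer(bigrams))       # shape (2, 4, 6)

# Step 4: drop the first mask column, so bigram position i is valid
# only when character i + 1 is a real character.
bigram_mask = mask[:, 1:]                    # shape (2, 4)

# Masked mean over the bigram positions.
summed = (hidden * bigram_mask.unsqueeze(-1)).sum(dim=1)
pooled = summed / bigram_mask.sum(dim=1, keepdim=True)
print(bigrams.shape, pooled.shape)
```

A word with n valid characters (including `^` and `$`) yields n - 1 valid bigrams, which is why the shifted mask has one fewer nonzero entry per row than the original.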

You can also try playing around with the hyperparameters of the network and the optimizer to see if that improves accuracy.

In [None]:
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImprovedNetwork(nn.Module):
    def __init__(self, n_outputs, embedding_dim=200, vocab_size=1000, max_chars=100):
        super().__init__()

        self.embedding_dim = embedding_dim
        self.max_chars = max_chars

        self.embedding = nn.Embedding(vocab_size, embedding_dim)

        self.linear1 = nn.Linear(embedding_dim * 3, 512)
        self.linear2 = nn.Linear(512, 256)
        self.linear3 = nn.Linear(256, n_outputs)
        self.dropout = nn.Dropout(p=0.3)

        self.batch_norm1 = nn.BatchNorm1d(512)
        self.batch_norm2 = nn.BatchNorm1d(256)

        nn.init.xavier_uniform_(self.linear1.weight)
        nn.init.xavier_uniform_(self.linear2.weight)
        nn.init.xavier_uniform_(self.linear3.weight)

    def forward(self, chars, mask):
        # chars and mask arrive on the model's device; train() and predict_using() move them there

        embeddings = self.embedding(chars)

        embeddings_first = embeddings[:, :-2, :]
        embeddings_second = embeddings[:, 1:-1, :]
        embeddings_third = embeddings[:, 2:, :]

        trigrams = torch.cat([embeddings_first, embeddings_second, embeddings_third], dim=-1)

        trigram_mask = mask[:, 2:]
        trigram_mask = trigram_mask.unsqueeze(-1)

        pooled = (trigrams * trigram_mask).sum(dim=1) / trigram_mask.sum(dim=1)

        x = self.batch_norm1(self.linear1(pooled))
        x = F.relu(x)

        x = self.batch_norm2(self.linear2(x))
        x = F.relu(x)

        x = self.dropout(x)

        output = self.linear3(x)

        return output

trained_improved_network = train(ImprovedNetwork(len(all_tags)))
improved_predictor = predict_using(trained_improved_network)

Epoch 0
Validation score: 72.5
Epoch 1
Validation score: 74.2
Epoch 2
Validation score: 75.5
Epoch 3
Validation score: 75.8
Epoch 4
Validation score: 77.2
Epoch 5
Validation score: 76.3
Epoch 6
Validation score: 76.1
Epoch 7
Validation score: 76.0
Epoch 8
Validation score: 76.7
Epoch 9
Validation score: 76.8
Epoch 10
Validation score: 76.1
Epoch 11
Validation score: 76.4
Epoch 12
Validation score: 76.6
Epoch 13
Validation score: 76.1
Epoch 14
Validation score: 77.3
Epoch 15
Validation score: 76.6
Epoch 16
Validation score: 77.5
Epoch 17
Validation score: 77.7
Epoch 18
Validation score: 76.9
Epoch 19
Validation score: 76.6
Epoch 20
Validation score: 77.0
Epoch 21
Validation score: 76.5
Epoch 22
Validation score: 77.2
Epoch 23
Validation score: 76.5
Epoch 24
Validation score: 77.10000000000001


We can also get a feel for what our model learned by providing some of our own inputs that aren't real words (yet).

In [None]:
print_sample_predictions(improved_predictor, validation_data_raw)

print_sample_predictions(improved_predictor, [['kleining','X'], ['deneroful','X']])

Sample predictions: [('salem', 'NOUN'), ('unsympathetic', 'ADJ'), ('downwind', 'ADJ'), ('exodus', 'ADJ'), ('avoiding', 'VERB'), ('informal', 'ADJ'), ('padded', 'VERB'), ('tantalizing', 'VERB'), ('farce', 'NOUN'), ('berger', 'NOUN')]
Sample predictions: [('kleining', 'VERB'), ('deneroful', 'ADJ')]


Next, you need to run your model on the test set and save the outputs.  You'll turn in your predictions for us to grade.

In [None]:
def save_predictions(predictions, filename):
    """Save predictions to a file.

    predictions is a list of strings.
    """
    with open(filename, 'w') as f:
        for pred in predictions:
            f.write(pred)
            f.write('\n')

print('test score pooling:', evaluate(pooling_predictor, test_data_raw))
print('test score improved:', evaluate(improved_predictor, test_data_raw))

test_predictions = pooling_predictor(test_data_raw)
save_predictions(test_predictions, 'predicted_test_outputs_pooling.txt')
test_predictions = improved_predictor(test_data_raw)
save_predictions(test_predictions, 'predicted_test_outputs_improved.txt')

# Check that your test set looks like we expect it to
import hashlib
m = hashlib.md5()
m.update(str(test_data_raw).encode('utf-8'))
assert m.digest() == b'*N\xf6\xbe\xed\xde\xe8q)\xb9GG\xa6\x15UI'

test score pooling: 69.8
test score improved: 78.2


Finally, we will run our models on a hidden test set:
- Download the hidden test set.
- Run the trained models on it.
- Submit the output files to Gradescope.

In [None]:
# Downloading hidden test set.
!pip install gdown==v4.6.3
!gdown 1iDjTU-NpXBB-OFEow5s4jiD-URF6Wbmw
# Use a context manager so the file handle is closed after loading.
with open("hidden_words.p", "rb") as f:
    hidden_test_words = pickle.load(f)
assert hidden_test_words[0][0]=='reversed' and hidden_test_words[-1][0]=='under'

Downloading...
From: https://drive.google.com/uc?id=1iDjTU-NpXBB-OFEow5s4jiD-URF6Wbmw
To: /content/hidden_words.p
100% 47.3k/47.3k [00:00<00:00, 89.8MB/s]


`hidden_test_words` is the hidden dataset. Its true labels are withheld: every word carries the placeholder label `'X'`, the 'other' tag from the universal tagset.

In [None]:
print(hidden_test_words[:3])

[('reversed', 'X'), ('baum', 'X'), ('coffers', 'X')]


Your job now is to predict the correct labels for this test set. We do that by running the following code.

In [None]:
test_predictions = pooling_predictor(hidden_test_words)
save_predictions(test_predictions, 'predicted_hidden_test_outputs_pooling.txt')
test_predictions = improved_predictor(hidden_test_words)
save_predictions(test_predictions, 'predicted_hidden_test_outputs_improved.txt')

Please submit the predicted labels to Gradescope. Your final score will be based on these predictions.

### Gradescope

We will use Gradescope for assignment submission.  Please let us know if you are unable to access Gradescope for any reason.

You will submit this notebook and the required output files that we specify.  For this project, your submission will contain:
* assignment0.ipynb (rename this notebook to match this)
* predicted_test_outputs_pooling.txt
* predicted_test_outputs_improved.txt
* predicted_hidden_test_outputs_pooling.txt
* predicted_hidden_test_outputs_improved.txt

You can upload files individually or as part of a zip file, but if using a zip file be sure you are zipping the files directly and not a folder that contains them.
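
If you prefer to build the zip file from code, the sketch below shows one way to keep the files at the top level of the archive using Python's standard `zipfile` module. The helper name `zip_flat` and the archive name `submission.zip` are hypothetical, and the demonstration writes placeholder files to a temporary directory rather than touching your real outputs.

```python
import os
import tempfile
import zipfile

def zip_flat(paths, zip_path):
    """Write the given files at the top level of a zip archive."""
    with zipfile.ZipFile(zip_path, "w") as archive:
        for path in paths:
            # arcname drops any directory prefix, so no folder ends up in the zip.
            archive.write(path, arcname=os.path.basename(path))

# Demonstration with placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    names = ["predicted_test_outputs_pooling.txt",
             "predicted_test_outputs_improved.txt"]
    paths = []
    for name in names:
        path = os.path.join(tmp, name)
        with open(path, "w") as f:
            f.write("NOUN\n")
        paths.append(path)

    zip_path = os.path.join(tmp, "submission.zip")  # hypothetical archive name
    zip_flat(paths, zip_path)
    with zipfile.ZipFile(zip_path) as archive:
        archived_names = archive.namelist()

print(archived_names)
```

Checking `namelist()` before uploading is a quick way to confirm that no entry contains a folder prefix.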

To download this notebook, go to `File->Download .ipynb`.  Please rename the file to match the name in our file list.  You can download other outputs, like `predicted_test_outputs_improved.txt`, by clicking the > arrow near the top left and finding the file under `Files`.

When submitting your Jupyter notebooks, make sure everything runs correctly if the cells are executed in order starting from a fresh session.  A cell that runs in your current session may still rely on code that you have since changed or deleted.  If the code doesn't take too long to run, we recommend re-running everything with `Runtime->Restart and run all...`.

When you upload your submission to the Gradescope assignment, you should get immediate feedback that confirms your submission was processed correctly.  Be sure to check this, as an incorrectly formatted submission could cause the autograder to fail.  For this project, you should be able to see your test set accuracies and a confirmation that all required files were found.  Most assignments will be graded primarily on your test set accuracies, but we may also use other factors to grade.

Note that Gradescope will allow you to submit multiple times before the deadline, and we will use the latest submission for grading.