[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/dbamman/anlp24/blob/main/5.classification/HW6_TransformerClassification_TODO.ipynb)

**N.B.** Once it's open on Colab, remember to save a copy (by e.g. clicking `Copy to Drive` above).

---

Thie notebook explores using transformers for document classification.  Before starting, change the runtime to GPU: Runtime > Change runtime type > Hardware accelerator: GPU (any GPU is fine).

For an intro to models in PyTorch, see [this tutorial](https://pytorch.org/tutorials/beginner/introyt/modelsyt_tutorial.html).




Download classification data for training/evaluation.

In [1]:
!wget https://raw.githubusercontent.com/dbamman/anlp24/main/data/convote/train.tsv
!wget https://raw.githubusercontent.com/dbamman/anlp24/main/data/convote/dev.tsv

--2024-10-13 03:51:19--  https://raw.githubusercontent.com/dbamman/anlp24/main/data/convote/train.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 4660140 (4.4M) [text/plain]
Saving to: ‘train.tsv.1’


2024-10-13 03:51:19 (78.0 MB/s) - ‘train.tsv.1’ saved [4660140/4660140]

--2024-10-13 03:51:19--  https://raw.githubusercontent.com/dbamman/anlp24/main/data/convote/dev.tsv
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.111.133, 185.199.110.133, 185.199.109.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.111.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 351382 (343K) [text/plain]
Saving to: ‘dev.tsv.1’


2024-10-13 03:51:20 (10.4 MB/s) - ‘dev.tsv.1’ saved [35138

In [2]:
import math
import sys
import torch
from torch import nn
from collections import Counter
from nltk import word_tokenize

In [3]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [4]:
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


In [5]:
# max sequence length
max_length=256

# limit vocabulary to top N words in training data
max_vocab=10000

# batch size
batch_size=128

# size of token representations (which dictates the size of the overall model).
d_model=16


# number of epochs
num_epochs=50

print('')
print("********************************************")
print("Running on: {}".format(device))
print("********************************************")
print('')


********************************************
Running on: cuda
********************************************



In [6]:
# PositionalEncoding class copied from:
# https://github.com/pytorch/examples/blob/main/word_language_model/model.py

class PositionalEncoding(nn.Module):

    def __init__(self, d_model, dropout=0.1):
        super(PositionalEncoding, self).__init__()
        self.dropout = nn.Dropout(p=dropout)

        pe = torch.zeros(max_length, d_model)
        position = torch.arange(0, max_length, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)#.transpose(0, 1)
        self.register_buffer('pe', pe)

    def forward(self, x):

        x = x + self.pe[:x.size(0), :]
        return self.dropout(x)


In [7]:
class TransformerClassifier(torch.nn.Module):

    def __init__(self, num_labels, d_model, nhead=2, num_encoder_layers=1, dim_feedforward=256):

        super(TransformerClassifier, self).__init__()

        self.num_labels=num_labels
        self.embedding = nn.Embedding(num_embeddings=max_vocab+2, embedding_dim=d_model)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead, num_encoder_layers=num_encoder_layers, dim_feedforward=dim_feedforward, batch_first=True)
        self.classifier = nn.Linear(d_model, self.num_labels)
        self.pos_encoder = PositionalEncoding(d_model)

    def forward(self, x, m):
        # Convert lists to tensors if necessary
        if isinstance(x, list):
            x = torch.tensor(x, dtype=torch.long)
        if isinstance(m, list):
            m = torch.tensor(m, dtype=torch.bool)

        # put data on device (e.g., gpu)
        x=x.to(device)
        m=m.to(device)

        # convert input token IDs to word embeddings
        embed=self.embedding(x)

        # add position encodings to include information about word position within the document
        embed = self.pos_encoder(embed)

        # get transformer output
        h=self.transformer.encoder(embed, src_key_padding_mask=m)

        # Represent document as average embedding of transformer output
        h=torch.mean(h, dim=1)

        # Convert document representation into output label space
        logits=self.classifier(h)

        return logits


In [8]:
def create_vocab_and_labels(filename, max_vocab):
    # This function creates the word vocabulary (and label ids) from the training data
    # The vocab is a mapping between word types and unique word IDs

    counts=Counter()
    labels={}
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols=line.rstrip().split("\t")
            lab=cols[0]
            text=word_tokenize(cols[1].lower())
            for tok in text:
                counts[tok]+=1

            if lab not in labels:
                labels[lab]=len(labels)

    vocab={"[MASK]":0, "[UNK]":1}

    for k,v in counts.most_common(max_vocab):
        vocab[k]=len(vocab)

    return vocab, labels

In [9]:
def read_data(filename, vocab, labels, max_length, max_docs=5000):
    # Read in data from file, up to the first max_docs documents. For each document
    # read up to max_length tokens.

    x=[]
    y=[]
    m=[]

    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            if idx >= max_docs:
                break
            cols=line.rstrip().split("\t")
            lab=cols[0]
            text=word_tokenize(cols[1])
            text_ids=[]
            for tok in text:
                if tok in vocab:
                    text_ids.append(vocab[tok])
                else:
                    text_ids.append(vocab["[UNK]"])

            text_ids=text_ids[:max_length]

            # PyTorch (and most libraries that deal with matrix operations) expects all inputs to be the same length
            # So pad each document with 0s up to max_length
            # But keep track of the true number of tokens in the document with the "mask" list.

            # True tokens have a mask value of 0
            mask=[0]*len(text_ids)

            for i in range(len(text_ids), max_length):
                text_ids.append(vocab["[MASK]"])
                # Padded tokens have a mask value of 1
                mask.append(1)

            x.append(text_ids)
            m.append(mask)
            y.append(labels[lab])

    return x, y, m

In [10]:
def get_batches(x, y, m, batch_size):

    # Create minibatches from the full dataset

    batches_x=[]
    batches_y=[]
    batches_m=[]
    for i in range(0, len(x), batch_size):
        xbatch=x[i:i+batch_size]
        ybatch=y[i:i+batch_size]
        mbatch=m[i:i+batch_size]

        batches_x.append(torch.LongTensor(xbatch))
        batches_y.append(torch.LongTensor(ybatch))
        batches_m.append(torch.BoolTensor(mbatch))

    return batches_x, batches_y, batches_m

In [11]:
def evaluate(model, all_x, all_y, all_m):

    # Calculate accuracy

    model.eval()
    corr = 0.
    total = 0.
    with torch.no_grad():
        for x, y, m in zip(all_x, all_y, all_m):
            y_preds=model.forward(x, m)
            for idx, y_pred in enumerate(y_preds):
                prediction=torch.argmax(y_pred)
                if prediction == y[idx]:
                    corr += 1.
                total+=1
    return corr/total

In [12]:
def train(model, model_filename, train_batches_x, train_batches_y, train_batches_m, dev_batches_x, dev_batches_y, dev_batches_m):

    optimizer = torch.optim.Adam(model.parameters(), lr=0.001)
    cross_entropy=nn.CrossEntropyLoss()

    # Keep track of the epoch that has the best dev accuracy
    best_dev_acc=0.
    best_dev_epoch=None

    # How many epochs with no changes before we quit
    patience=10

    for epoch in range(num_epochs):

        model.train()

        for x, y, m in zip(train_batches_x, train_batches_y, train_batches_m):
            # Get predictions for batch x (with mask values m)
            y_pred=model.forward(x, m)
            y=y.to(device)

            # Calculate loss as cross-entropy with true labels
            loss = cross_entropy(y_pred.view(-1, model.num_labels), y.view(-1))

            # Set all gradients to zero
            optimizer.zero_grad()

            # Calculate gradients from current loss
            loss.backward()

            # Update parameters
            optimizer.step()

        dev_accuracy=evaluate(model, dev_batches_x, dev_batches_y, dev_batches_m)

        # we're going to save the model that performs the best on *dev* data
        if dev_accuracy > best_dev_acc:
            torch.save(model.state_dict(), model_filename)
            print("%.3f is better than %.3f, saving model ..." % (dev_accuracy, best_dev_acc))
            best_dev_acc = dev_accuracy
            best_dev_epoch=epoch

        if epoch % 1 == 0:
            print("Epoch %s, dev accuracy: %.3f" % (epoch, dev_accuracy))

        if epoch-best_dev_epoch > patience:
          print("%s > patience (%s), stopping..." % (epoch-best_dev_epoch, patience))
          break

    model.load_state_dict(torch.load(model_filename))
    print("\nBest Performing Model achieves dev accuracy of : %.3f" % (best_dev_acc))

In [13]:
vocab, labels=create_vocab_and_labels("train.tsv", max_vocab)
train_x, train_y, train_m=read_data("train.tsv", vocab, labels, max_length=max_length)
dev_x, dev_y, dev_m=read_data("dev.tsv", vocab, labels, max_length=max_length)

In [14]:
classifier=TransformerClassifier(num_labels=len(labels), d_model=100, dim_feedforward=1024)
classifier=classifier.to(device)

train_x_batch, train_y_match, train_m_match=get_batches(train_x, train_y, train_m, batch_size=batch_size)
dev_x_batch, dev_y_match, dev_m_match=get_batches(dev_x, dev_y, dev_m, batch_size=batch_size)

train(classifier, "test.model", train_x_batch, train_y_match, train_m_match, dev_x_batch, dev_y_match, dev_m_match)

  output = torch._nested_tensor_from_mask(output, src_key_padding_mask.logical_not(), mask_check=False)


0.510 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.510
Epoch 1, dev accuracy: 0.498
0.518 is better than 0.510, saving model ...
Epoch 2, dev accuracy: 0.518
0.521 is better than 0.518, saving model ...
Epoch 3, dev accuracy: 0.521
0.580 is better than 0.521, saving model ...
Epoch 4, dev accuracy: 0.580
0.607 is better than 0.580, saving model ...
Epoch 5, dev accuracy: 0.607
Epoch 6, dev accuracy: 0.553
Epoch 7, dev accuracy: 0.588
0.642 is better than 0.607, saving model ...
Epoch 8, dev accuracy: 0.642
Epoch 9, dev accuracy: 0.630
Epoch 10, dev accuracy: 0.626
0.658 is better than 0.642, saving model ...
Epoch 11, dev accuracy: 0.658
Epoch 12, dev accuracy: 0.642
Epoch 13, dev accuracy: 0.638
Epoch 14, dev accuracy: 0.626
Epoch 15, dev accuracy: 0.630
Epoch 16, dev accuracy: 0.603
Epoch 17, dev accuracy: 0.591
Epoch 18, dev accuracy: 0.580
Epoch 19, dev accuracy: 0.630
Epoch 20, dev accuracy: 0.603
Epoch 21, dev accuracy: 0.623
Epoch 22, dev accuracy: 0.611
11 > 

  model.load_state_dict(torch.load(model_filename))


**Q1**. Play around with this transformer as implemented and experiment with how performance on the dev data changes as a function of `d_model`, `num_encoder_layers`, `nhead`, etc.).  Describe your experiments and report dev accuracy on them below.

####Result

1. **d_model (Dimensionality of Token Embeddings)**: Dimensionality of token embeddings with higher value capture more complex structure but also increases computation cost and risk of overfitting.
  - Increasing d_model from 16 to 64 did not consistently improve accuracy in this case. For example, the model with **d_model=16, num_encoder_layers=1, and nhead=2** achieved the highest accuracy (0.700), while models with d_model=64 had lower accuracy.
2. **num_encoder_layers (Number of Transformer Encoder Layers)**: The depth of the transformer model. Again a larger number increase model complexity but could lead to overfitting.
  - increasing the number of encoder layers from 1 to 2 did not consistently improve accuracy
3. **nhead (Number of Attention Heads)**: Number of attention heads allows the model to focus on different parts of the input sequence in parallel. A higher number increase complexity at the risk of overfitting.
  - Increasing the number of attention heads lead to worse performance, particularly in more complex models

The highest-performing model (**d_model=16, num_encoder_layers=1, nhead=2, accuracy=0.700**) was one of the simpler configurations. This suggests that the more complex models might have suffered from under-optimization due to:

1. Early stopping after 10 epochs lead to oscillation.
2. The learning rate or other hyperparameters may not have been tuned well, leading to slower convergence.

In [15]:
import itertools
import torch.optim as optim
import matplotlib.pyplot as plt

In [16]:
# Define hyperparameters grid search
d_model_values = [16, 64]
num_encoder_layers_values = [1, 2]
nhead_values = [2, 4]

In [17]:
results = []
train_x_batch, train_y_match, train_m_match = get_batches(train_x, train_y, train_m, batch_size=batch_size)
dev_x_batch, dev_y_match, dev_m_match = get_batches(dev_x, dev_y, dev_m, batch_size=batch_size)

for d_model, num_encoder_layers, nhead in itertools.product(d_model_values, num_encoder_layers_values, nhead_values):
    print(f"\nTraining model with d_model={d_model}, num_encoder_layers={num_encoder_layers}, nhead={nhead}")
    classifier = TransformerClassifier(
        num_labels=len(labels),
        d_model=d_model,
        nhead=nhead,
        num_encoder_layers=num_encoder_layers,
        dim_feedforward=1024
    )
    classifier = classifier.to(device)

    model_filename = f"model_d{d_model}_l{num_encoder_layers}_h{nhead}.pt"

    train(
        classifier,
        model_filename,
        train_x_batch, train_y_match, train_m_match,
        dev_x_batch, dev_y_match, dev_m_match
    )

    # Load the trained model and evaluate it
    loaded_model = TransformerClassifier(
        num_labels=len(labels),
        d_model=d_model,
        nhead=nhead,
        num_encoder_layers=num_encoder_layers,
        dim_feedforward=1024
    )
    loaded_model.load_state_dict(torch.load(model_filename))
    loaded_model = loaded_model.to(device)

    dev_accuracy = evaluate(loaded_model, dev_x_batch, dev_y_match, dev_m_match)
    print(f"Loaded model dev accuracy: {dev_accuracy:.3f}")

    results.append((d_model, num_encoder_layers, nhead, dev_accuracy))


Training model with d_model=16, num_encoder_layers=1, nhead=2
0.494 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.494
Epoch 1, dev accuracy: 0.494
0.529 is better than 0.494, saving model ...
Epoch 2, dev accuracy: 0.529
0.576 is better than 0.529, saving model ...
Epoch 3, dev accuracy: 0.576
Epoch 4, dev accuracy: 0.549
Epoch 5, dev accuracy: 0.518
Epoch 6, dev accuracy: 0.518
Epoch 7, dev accuracy: 0.510
Epoch 8, dev accuracy: 0.533
Epoch 9, dev accuracy: 0.556
Epoch 10, dev accuracy: 0.556
Epoch 11, dev accuracy: 0.576
0.580 is better than 0.576, saving model ...
Epoch 12, dev accuracy: 0.580
0.595 is better than 0.580, saving model ...
Epoch 13, dev accuracy: 0.595
0.607 is better than 0.595, saving model ...
Epoch 14, dev accuracy: 0.607
0.615 is better than 0.607, saving model ...
Epoch 15, dev accuracy: 0.615
Epoch 16, dev accuracy: 0.611
0.630 is better than 0.615, saving model ...
Epoch 17, dev accuracy: 0.630
0.646 is better than 0.630, saving model ...
Ep

  model.load_state_dict(torch.load(model_filename))
  loaded_model.load_state_dict(torch.load(model_filename))


0.494 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.494
Epoch 1, dev accuracy: 0.490
Epoch 2, dev accuracy: 0.490
0.514 is better than 0.494, saving model ...
Epoch 3, dev accuracy: 0.514
0.521 is better than 0.514, saving model ...
Epoch 4, dev accuracy: 0.521
Epoch 5, dev accuracy: 0.510
Epoch 6, dev accuracy: 0.494
Epoch 7, dev accuracy: 0.510
Epoch 8, dev accuracy: 0.514
Epoch 9, dev accuracy: 0.521
0.525 is better than 0.521, saving model ...
Epoch 10, dev accuracy: 0.525
0.529 is better than 0.525, saving model ...
Epoch 11, dev accuracy: 0.529
0.533 is better than 0.529, saving model ...
Epoch 12, dev accuracy: 0.533
Epoch 13, dev accuracy: 0.525
Epoch 14, dev accuracy: 0.529
Epoch 15, dev accuracy: 0.521
0.545 is better than 0.533, saving model ...
Epoch 16, dev accuracy: 0.545
0.553 is better than 0.545, saving model ...
Epoch 17, dev accuracy: 0.553
Epoch 18, dev accuracy: 0.549
0.560 is better than 0.553, saving model ...
Epoch 19, dev accuracy: 0.560
Epoch

In [18]:
# Convert results into a plot-friendly format
d_models, num_layers, nheads, accuracies = zip(*results)

for d_model, num_encoder_layers, nhead, accuracy in results:
    print(f"d_model={d_model}, num_encoder_layers={num_encoder_layers}, nhead={nhead}, accuracy={accuracy:.3f}")

d_model=16, num_encoder_layers=1, nhead=2, accuracy=0.700
d_model=16, num_encoder_layers=1, nhead=4, accuracy=0.584
d_model=16, num_encoder_layers=2, nhead=2, accuracy=0.661
d_model=16, num_encoder_layers=2, nhead=4, accuracy=0.619
d_model=64, num_encoder_layers=1, nhead=2, accuracy=0.654
d_model=64, num_encoder_layers=1, nhead=4, accuracy=0.658
d_model=64, num_encoder_layers=2, nhead=2, accuracy=0.650
d_model=64, num_encoder_layers=2, nhead=4, accuracy=0.607


**Q2**.  This transformer is forced to learn everything about the structure of language from the labeled dataset.  Word embeddings, however, already capture some of this structure, and can be incorporated into this model in an `nn.Embedding` layer.  Change the `TransformerClassifier` class above so that the `Embedding` layer uses pre-trained weights (do so with the `Embedding.from_pretrained` function described on the PyTorch [API](https://pytorch.org/docs/stable/generated/torch.nn.Embedding.html).  You can use any pre-trained embeddings you like, including the [GloVe vectors](https://raw.githubusercontent.com/dbamman/anlp24/main/data/glove.6B.50d.50K.txt) from class.  (Hint: doing so will require changes to `read_data` and `create_vocab_and_labels` since the word embeddings will give you your vocabulary.)

- The implemenation below use the GloVe vectors from class
- Overwrite `TransformerClassifier` class and `read_data` and `create_vocab_and_labels` methods
- Best Performing Model achieves dev accuracy of : `0.673`

In [19]:
import numpy as np
from gensim.models import Word2Vec, KeyedVectors

In [20]:
!wget https://raw.githubusercontent.com/dbamman/anlp24/main/data/glove.6B.50d.50K.txt

--2024-10-13 03:57:10--  https://raw.githubusercontent.com/dbamman/anlp24/main/data/glove.6B.50d.50K.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.110.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 21357798 (20M) [text/plain]
Saving to: ‘glove.6B.50d.50K.txt.1’


2024-10-13 03:57:11 (224 MB/s) - ‘glove.6B.50d.50K.txt.1’ saved [21357798/21357798]



In [21]:
glove_file = 'glove.6B.50d.50K.txt'

glove_embeddings = KeyedVectors.load_word2vec_format(glove_file, binary=False)

glove_embeddings

<gensim.models.keyedvectors.KeyedVectors at 0x7ef6712c3760>

In [22]:
class TransformerClassifier(nn.Module):
    def __init__(self, num_labels, d_model, pretrained_embeddings, nhead=2, num_encoder_layers=1, dim_feedforward=256):

        super(TransformerClassifier, self).__init__()

        self.num_labels = num_labels
        self.embedding = nn.Embedding.from_pretrained(pretrained_embeddings, freeze=False)
        self.transformer = nn.Transformer(d_model=d_model, nhead=nhead, num_encoder_layers=num_encoder_layers, dim_feedforward=dim_feedforward, batch_first=True)
        self.classifier = nn.Linear(d_model, self.num_labels)
        self.pos_encoder = PositionalEncoding(d_model)

    def forward(self, x, m):
        if isinstance(x, list):
            x = torch.tensor(x, dtype=torch.long)
        if isinstance(m, list):
            m = torch.tensor(m, dtype=torch.bool)

        x = x.to(device)
        m = m.to(device)

        embed = self.embedding(x)
        embed = self.pos_encoder(embed)
        h = self.transformer.encoder(embed, src_key_padding_mask=m)
        h = torch.mean(h, dim=1)
        logits = self.classifier(h)

        return logits

def create_vocab_and_labels(filename, glove_embeddings):
    glove_dim = glove_embeddings.vector_size

    counts = Counter()
    labels = {}
    with open(filename, encoding="utf-8") as file:
        for line in file:
            cols = line.rstrip().split("\t")
            lab = cols[0]
            text = word_tokenize(cols[1].lower())
            for tok in text:
                counts[tok] += 1

            if lab not in labels:
                labels[lab] = len(labels)

    vocab = {"[PAD]": 0, "[UNK]": 1}
    embeddings = [torch.zeros(glove_dim), torch.randn(glove_dim)]  # For [PAD] and [UNK]

    for word in counts.keys():
        if word in glove_embeddings:
            vocab[word] = len(vocab)
            embeddings.append(torch.tensor(glove_embeddings[word]))

    embeddings = torch.stack(embeddings)

    return vocab, labels, embeddings

def read_data(filename, vocab, labels, max_length, max_docs=5000):
    x = []
    y = []
    m = []

    with open(filename, encoding="utf-8") as file:
        for idx, line in enumerate(file):
            if idx >= max_docs:
                break
            cols = line.rstrip().split("\t")
            lab = cols[0]
            text = word_tokenize(cols[1].lower())
            text_ids = []
            for tok in text:
                if tok in vocab:
                    text_ids.append(vocab[tok])
                else:
                    text_ids.append(vocab["[UNK]"])

            text_ids = text_ids[:max_length]
            mask = [0] * len(text_ids)

            for i in range(len(text_ids), max_length):
                text_ids.append(vocab["[PAD]"])
                mask.append(1)

            x.append(text_ids)
            m.append(mask)
            y.append(labels[lab])

    return x, y, m

In [23]:
vocab, labels, embeddings = create_vocab_and_labels('train.tsv', glove_embeddings)
train_x, train_y, train_m=read_data("train.tsv", vocab, labels, max_length=max_length)
dev_x, dev_y, dev_m=read_data("dev.tsv", vocab, labels, max_length=max_length)

In [24]:
classifier=TransformerClassifier(num_labels=len(labels), d_model=embeddings.size(1),
                                 pretrained_embeddings=embeddings, dim_feedforward=1024)
classifier=classifier.to(device)

train_x_batch, train_y_match, train_m_match=get_batches(train_x, train_y, train_m, batch_size=batch_size)
dev_x_batch, dev_y_match, dev_m_match=get_batches(dev_x, dev_y, dev_m, batch_size=batch_size)

train(classifier, "test_embedded.model", train_x_batch, train_y_match, train_m_match, dev_x_batch, dev_y_match, dev_m_match)

0.494 is better than 0.000, saving model ...
Epoch 0, dev accuracy: 0.494
Epoch 1, dev accuracy: 0.494
Epoch 2, dev accuracy: 0.494
Epoch 3, dev accuracy: 0.494
0.498 is better than 0.494, saving model ...
Epoch 4, dev accuracy: 0.498
0.572 is better than 0.498, saving model ...
Epoch 5, dev accuracy: 0.572
0.580 is better than 0.572, saving model ...
Epoch 6, dev accuracy: 0.580
0.607 is better than 0.580, saving model ...
Epoch 7, dev accuracy: 0.607
0.626 is better than 0.607, saving model ...
Epoch 8, dev accuracy: 0.626
Epoch 9, dev accuracy: 0.619
0.638 is better than 0.626, saving model ...
Epoch 10, dev accuracy: 0.638
Epoch 11, dev accuracy: 0.619
Epoch 12, dev accuracy: 0.603
0.661 is better than 0.638, saving model ...
Epoch 13, dev accuracy: 0.661
Epoch 14, dev accuracy: 0.607
Epoch 15, dev accuracy: 0.615
0.673 is better than 0.661, saving model ...
Epoch 16, dev accuracy: 0.673
Epoch 17, dev accuracy: 0.669
Epoch 18, dev accuracy: 0.642
Epoch 19, dev accuracy: 0.658
Epoch

  model.load_state_dict(torch.load(model_filename))


---

## To submit

Congratulations on finishing this homework!
Please follow the instructions below to download the notebook file (`.ipynb`) and its printed version (`.pdf`) for submission on bCourses -- remember **all cells must be executed**.

1.  Download a copy of the notebook file: `File > Download > Download .ipynb`.

2.  Print the notebook as PDF (via your browser, or tools like [nbconvert](https://nbconvert.readthedocs.io/en/latest/)).