<center> <h1> Transfer Learning: Sentence Similarity Task </h1> </center>

In this assignment we will compare the performance of a LSTM based classfier to that of a Classifier built "on top" of the [BERT](https://arxiv.org/abs/1810.04805) sentence representation model.


We will use a subset of the [Quora duplicate question dataset](https://www.kaggle.com/c/quora-question-pairs/data). The input is a pair of questions such as:

Q1: `Which one is more harmful to eyes CRT, TFT, LED, AMOLED, or LCD`   
Q2: `How do I notice whether the front panel of the TFT is LCD or LED?`   
Label: `0` (not similar)

Another example:   
Q1:`When is the best time to take apple cider vinegar?`   
Q2: `How do I take Apple cider vinegar and when is the best time?`   
Label: `1` (similar)


As a baseline we will first construct a LSTM classifier that accepts two sequences and predicts the similarity label using the last hidden state from each sentence representation (as encoded by the LSTM). 

Next, we will use a variation of BERT known as [DistillBERT](https://arxiv.org/abs/1910.01108) and supply the two questions as one long sequence separated by a special `[SEP]` symbol. We are going to "fine-tune" the BERT model to perform the sentence similarty classification and predict a similar/not similar label.


## Google colaboratory

Before getting started, get familiar with google colaboratory:
https://colab.research.google.com/notebooks/welcome.ipynb

This is a neat python environment that works in the cloud and does not require you to
set up anything on your personal machine
(it also has some built-in IDE features that make writing code easier).
Moreover, it allows you to copy any existing collaboratory file, alter it and share
with other people. In this homework, we will ask you to copy current colaboraty,
complete all the tasks and share your colaboratory notebook with us so
that we can grade it.

## Submission

Before you start working on this homework do the following steps:

1. Press __File > Save a copy in Drive...__ tab. This will allow you to have your own copy and change it.
2. Follow all the steps in this collaboratory file and write/change/uncomment code as necessary.
3. Do not forget to occasionally press __File > Save__ tab to save your progress.
4. After all the changes are done and progress is saved press __Share__ button (top right corner of the page), press __get shareable link__ and make sure you have the option __Anyone with the link can view__ selected.
5. Paste the link into your submission pdf file so that we can view it and grade.

# Dataset
We have preselected a subset of the Quora duplicate question dataset and split the subset into training, validation (dev) and test sets.

In [1]:
!wget https://raw.githubusercontent.com/jhu-intro-hlt/jhu-intro-hlt.github.io/master/data-transfer-learning-hw/{dev,test,train}.tsv -q -nc
!head train.tsv # display some training examples, format: sentence1,tab,sentence2,tab,label

Can you suggest a best budget phone below 15k?	What is the best phone I can buy under the price of 15000?	1
How can I make a disabled or accident-prone spouse feel useful and respected?	How does it feel to suddenly realize that your children are the product of a broken home because of you and/or your spouse?	0
I had an overdraft in Wells Fargo US! What will happen if I don't pay it?	How do you stop payment on a Wells Fargo check?	0
What are the best places to visit this December in India?	What will be the best place to visit in December in India?	1
What should I do if I want to renew an expired driver's license, but had an accident while the license was expired? (India)	In what US state is it easy to get a driver's license?	0
How/why did Stanford develop such a strong entrepreneurial culture? Why doesn't UC Berkeley have such a strong entrepreneurial culture in comparison?	How strong is the startup culture in Berkeley?	0
How much weight can a honey bee lift?	How much force in Newton re

# Pretrained Models
We will use the [🤗 (huggingface)](https://github.com/huggingface/transformers) release of pretrained sentence representation models (Yes, it's the name of an actual [company](https://https://huggingface.co/) and they do some cool work in NLP). These can be used by first `pip install`ing the `pytorch-transformers` library. .

In [2]:
!pip install pytorch-transformers -q # install python library for pretrained BERT (and other similar) models

[K     |████████████████████████████████| 184kB 3.4MB/s 
[K     |████████████████████████████████| 870kB 49.7MB/s 
[K     |████████████████████████████████| 1.0MB 46.3MB/s 
[?25h  Building wheel for sacremoses (setup.py) ... [?25l[?25hdone


In [0]:
import torch
import random
import time
import math
from pytorch_transformers import DistilBertModel as BertModel
from pytorch_transformers import DistilBertTokenizer as BertTokenizer
random.seed(1234)
torch.manual_seed(1234)
torch.cuda.set_device(0)

# Data Reader
The `STSCorpus` (SenTence Similarity) class handles the data loading, processing and itertating (during training and testing). It accepts a flag `bert_format` that preprocessing the data either using a standard format (reading words and converting them to integers) or a bert format which uses `DistilBertTokenizer` provided in the pytorch-transformer library. 

In [0]:
SPL_SYMS = ['<PAD>','<BOS>', '<EOS>', '<UNK>']


class STSCorpus(object):
  def __init__(self,
              file,
              vocab=None,
              cuda=False,
              batch_size=1, bert_format=0):
    self.bert_format = bert_format
    if self.bert_format == 0:
      self.bert_tokenizer = None
      self.max_vocab = 64000
    else:
      self.bert_tokenizer = BertTokenizer.from_pretrained('distilbert-base-uncased')
      self.max_vocab = self.bert_tokenizer.vocab_size
    self.max_size = 0
    self.batch_size = batch_size
    self.vocab = self.make_vocab(file, vocab)
    self.idx2vocab = self.make_idx2vocab(self.vocab)
    self.data = self.numberize(file, self.vocab, cuda)
    self.batch_data = self.batchify()
    self.data_size = len(self.batch_data)

  def batchify(self,):
    self.batch_data = []
    curr_batch = []
    max_x1, max_x2 = 0, 0
    for x1, x2, y in self.data:
      if len(curr_batch) < self.batch_size:
        curr_batch.append((x1, x2, y))
        max_x1 = max(max_x1, x1.shape[1])
        if self.bert_format == 0:
          max_x2 = max(max_x2, x2.shape[1]) 
      else:
        
        _x1, _x2, _y = zip(*curr_batch)
        
        
        if self.bert_format == 0:
          _x1 = [torch.cat((torch.zeros(1, max_x1 - i.shape[1]).type_as(i), i), dim=1) for i in _x1]
          batch_x1 = torch.cat(_x1, dim=0)
          _x2 = [torch.cat((torch.zeros(1, max_x2 - i.shape[1]).type_as(i), i), dim=1) for i in _x2]
          batch_x2 = torch.cat(_x2, dim=0) if _x2[0] is not None else None
        else:
          _x1 = [torch.cat((i, torch.zeros(1, max_x1 - i.shape[1]).type_as(i)), dim=1) for i in _x1]
          batch_x1 = torch.cat(_x1, dim=0)
          batch_x2 = None
        batch_y = torch.cat(_y, dim=0)
        self.batch_data.append((batch_x1, batch_x2, batch_y))
        curr_batch = []
        max_x1, max_x2 = 0, 0
    # remaining items in curr_batch
    if len(curr_batch) > 0:
      print(len(self.batch_data),  max_x1, max_x2)
      _x1, _x2, _y = zip(*curr_batch)
      
      
      if self.bert_format == 0:
        _x1 = [torch.cat((torch.zeros(1, max_x1 - i.shape[1]).type_as(i), i), dim=1) for i in _x1]
        batch_x1 = torch.cat(_x1, dim=0)
        _x2 = [torch.cat((torch.zeros(1, max_x2 - i.shape[1]).type_as(i), i), dim=1) for i in _x2]
        batch_x2 = torch.cat(_x2, dim=0) if _x2[0] is not None else None
      else:
        _x1 = [torch.cat((i, torch.zeros(1, max_x1 - i.shape[1]).type_as(i)), dim=1) for i in _x1]
        batch_x1 = torch.cat(_x1, dim=0)
        batch_x2 = None
      batch_y = torch.cat(_y, dim=0)
      self.batch_data.append((batch_x1, batch_x2, batch_y))
    return self.batch_data

  def numberize(self, txt, vocab, cuda=False):
    data = []
    max_size = 0
    with open(txt, 'r', encoding='utf8') as corpus:
      for l in corpus:
        l1, l2, y = l.split('\t')
        y = torch.Tensor([[float(y)]]).float()
        if self.bert_format == 0:
          d1 = [vocab['<BOS>']] + [vocab.get(t, vocab['<UNK>']) for t in l1.strip().split()] + [vocab['<EOS>']]
          d1 = torch.Tensor(d1).long()
          d1 = d1.unsqueeze(0) # shape = (1, N)
          d2 = [vocab['<BOS>']] + [vocab.get(t, vocab['<UNK>']) for t in l2.strip().split()] + [vocab['<EOS>']]
          d2 = torch.Tensor(d2).long()
          d2 = d2.unsqueeze(0) # shape = (1, N)
          max_size = max(d1.shape[1], d2.shape[1], max_size)
          if cuda:
            d1 = d1.cuda()
            d2 = d2.cuda()
            y = y.cuda()
        elif self.bert_format == 1:
          _d1 = torch.Tensor(self.bert_tokenizer.encode("[CLS] " + l1 + " [SEP]")).long()
          _d2 = torch.Tensor(self.bert_tokenizer.encode(" " + l2 + " [SEP]")).long()
          d = torch.cat([_d1, _d2], dim=0).unsqueeze(0)
          max_size = max(d.shape[1], max_size)
          if cuda:
            d1 = d.cuda()
            d2 = None
            y = y.cuda()
        else:
          pass
        data.append((d1, d2, y))
    self.max_size = max_size
    return data

  def make_idx2vocab(self, vocab):
    if vocab is not None:
      idx2vocab = {v: k for k, v in vocab.items()}
      return idx2vocab
    else:
      return None

  def make_vocab(self, txt, vocab):
    if vocab is None and txt is not None:
      vc = {}
      for line in open(txt, 'r', encoding='utf-8').readlines():
        x1, x2, y = line.strip().split('\t')
        for w in x1.split() + x2.split():
          vc[w] = vc.get(w, 0) + 1
      cv = sorted([(c, w) for w, c in vc.items()], reverse=True)
      cv = cv[:self.max_vocab]
      _, v = zip(*cv)
      v = SPL_SYMS + list(v)
      vocab = {w: idx for idx, w in enumerate(v)}
      return vocab
    else:
      return vocab

  def get(self, idx):
    return self.batch_data[idx]

Creating train, dev and test data objects. (with `bert_format=0`) and places the data on the GPU.

In [5]:
train_corpus = STSCorpus(file='train.tsv',
                         cuda=True,
                         batch_size=32, 
                         bert_format=0)
dev_corpus = STSCorpus(file='dev.tsv', vocab=train_corpus.vocab,
                       cuda=True,
                       batch_size=32, 
                       bert_format=0)
test_corpus = STSCorpus(file='test.tsv', vocab=train_corpus.vocab,
                        cuda=True,
                        batch_size=1,
                        bert_format=0)
print(train_corpus.data_size, dev_corpus.data_size, test_corpus.data_size)

1212 19 15
151 30 23
1213 152 2500


the training input batch looks like this. The input is formated as a tuple the first item is a batch of Q1s (once they are converted to integers) and the second item is a batch of Q2s.

In [6]:
print(train_corpus.batch_data[0][:2])

(tensor([[    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     0,     0,     1,    48,    17,  1022,     7,    23,
          1199,   166,  1932,  6382,     2],
        [    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             1,    13,    18,     9,    76,     7,  4012,    26, 45538,  2312,
           144,  1052,    12, 33066,     2],
        [    0,     0,     0,     0,     0,     0,     0,     1,     9,   171,
            32, 13296,     8, 14337, 18610, 46694,     5,    39,   149,    35,
             9,    96,   468,   111,     2],
        [    0,     0,     0,     0,     0,     0,     0,     0,     0,     0,
             0,     0,     1,     5,    14,     4,    23,   241,     6,   249,
            91,  2087,     8,    64,     2],
        [    1,     5,    38,     9,    15,    35,     9,   105,     6, 11283,
            32,  5460,  6182, 21845,   118,   171,    32, 12094,   178,     4,
          1515,    75, 40327,

the training output (i.e. desired predictions) batch looks like this:

In [7]:
print( train_corpus.batch_data[0][2])

tensor([[1.],
        [0.],
        [0.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [1.],
        [1.],
        [0.],
        [1.],
        [0.],
        [0.],
        [1.],
        [1.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [1.],
        [0.],
        [0.],
        [1.],
        [0.]], device='cuda:0')


# Training Routine

In [0]:
def train(model, train_cropus, dev_corpus, max_epochs):
  sum_loss, sum_acc = 0., 0.
  train_instances_idxs = list(range(train_corpus.data_size))
  st = time.time()
  for epoch_i in range(max_epochs):
    sum_loss, sum_acc = 0., 0.
    random.shuffle(train_instances_idxs)
    model.train()
    for i in train_instances_idxs:
      x1, x2, y = train_corpus.get(i)
      l, a = model.train_step(x1, x2, y)
      sum_loss += l
      sum_acc += a
    print(f"epoch: {epoch_i} time elapsed: {time.time() - st:.2f}")
    print(f"train loss: {sum_loss/train_corpus.data_size:.4f} train acc: {sum_acc/train_corpus.data_size:.4f}")
    sum_loss, sum_acc = 0., 0.
    model.eval()
    for dev_i in range(dev_corpus.data_size):
      x1, x2, y = dev_corpus.get(dev_i)
      with torch.no_grad():
        l, a = model(x1, x2, y)
        sum_loss += l
        sum_acc += a
    print(f"  dev loss: {sum_loss/dev_corpus.data_size:.4f}   dev acc: {sum_acc/dev_corpus.data_size:.4f}")
  return model


# Evaluation Routine

In [0]:
def evaluate(model, test_corpus):
  print('Predictions:')
  sum_acc = 0.0
  model.eval()
  for test_i in range(test_corpus.data_size):
    x1, x2, y = test_corpus.get(test_i)
    _, pred = model.predict(x1, x2)
    sum_acc += (1 if pred.item() == y.item() else 0)
  print(f"Avg acc: {sum_acc/test_corpus.data_size:.4f}")

#Part 1: Baseline Classifier
In the first part of the assignment you will complete the code for a baseline classifier which uses simple LSTM-RNNs to encoder a pair of sentences. The last time-step hidden state is then used to predict if the two sentences are similar or not. If you are unfamiliar with LSTMs [this](https://colah.github.io/posts/2015-08-Understanding-LSTMs/) is an excellent resource. Note: you don't have to memorize the internals of an LSTM, for this assignment just knowing that LSTMs expect three inputs 1. a representation of a word 2. previous hidden state and 3. the previous cell state is sufficient. In pytorch LSTMs "wrap" the hidden state (lets call it h) and the cell state (lets call it c) into a tuple (h,c).

In [0]:
class Classifier(torch.nn.Module):
    def __init__(self,
                 vocab_size,
                 embedding_size,
                 hidden_size,
                 num_layers=1,
                 dropout=0.1,
                 max_grad_norm=5.0):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.embedding_size = embedding_size
        self.max_grad_norm = max_grad_norm
        #TODO: create a drouput layer. Use the `dropout` value from the `__init__` arguments.
        self.dropout_layer = torch.nn.Dropout(p = dropout)
        
        if max(vocab_size,embedding_size ,hidden_size,num_layers) > 0:
          #TODO: create an embedding layer here
          #TODO: the embedding layer takes a sequence of ints and converts them into a sequence of real-valued vectors
          #TODO: see https://pytorch.org/docs/stable/nn.html?highlight=embedding#torch.nn.Embedding
          self.embedding_layer = torch.nn.Embedding(num_embeddings = vocab_size, embedding_dim = self.embedding_size)
          
          #TODO: create a unidirectional RNN-LSTM here, Note: Set `batch_first=True`
          #TODO: see https://pytorch.org/docs/stable/nn.html?highlight=lstm#torch.nn.LSTM
          self.uni_RNN_LSTM_layer = torch.nn.LSTM(input_size = self.embedding_size, hidden_size = self.hidden_size, num_layers=self.num_layers,  dropout = dropout, batch_first= True)
          #TODO: create a Linear layer that takes 2 * hidden_size and outputs a single output (binary label)
          #TODO: name this layer as self.output
          #TODO: https://pytorch.org/docs/stable/nn.html?highlight=linear#torch.nn.Linear
          self.output = torch.nn.Linear(in_features=self.hidden_size * 2, out_features = 1)
          

          #we will package the optimier inside the model class for convenience.
          self.optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, self.parameters()))
        else:
          pass
          
        #TODO: create a Binary Cross Entropy loss object here, set reduction='mean'
        #TODO: name it `self.loss`
        self.loss = torch.nn.BCELoss(reduction='mean')
          

    def predict(self, x1, x2):
        """ Generates a prediction and probability for each input instance
        Args:
            x1: sequence of input tokens for the first sentence
            x2: sequence of input tokens for the second sentence
        Returns:
            out: sequence of output predictions (probabilities) for each instance
            pred: the discrete prediction from the output probabilities
        """
        batch_size, seq_len = x1.shape
        batch_size2, seq_len2 = x2.shape
        assert batch_size == batch_size2
        
        #TODO: embed the x1 sequence into a sequence of embeddings, then apply dropout
        #TODO: name the result `emb_x1`
        emb_x1 = self.dropout_layer(self.embedding_layer(x1))
        
        #TODO: embed the x2 sequence into a sequence of embeddings, then apply dropout
        #TODO: name the result `emb_x2`
        emb_x2 = self.dropout_layer(self.embedding_layer(x2))
        
        #TODO: create an initial state (hidden and cell states) of zeros for the LSTM, this should support batching and num_layers>1
        h, c = (torch.zeros(self.num_layers, batch_size, self.hidden_size).cuda(),
                torch.zeros(self.num_layers, batch_size, self.hidden_size).cuda())
        #TODO: use the LSTM to get the hidden states of the last time-step of the x1 sequence
        x1_out, (x1_hidden, x1_cell) = self.uni_RNN_LSTM_layer(emb_x1, (h, c))
        #TODO: use the LSTM to get the hidden states of the last time-step of the x2 sequence
        x2_out, (x2_hidden, x2_cell) = self.uni_RNN_LSTM_layer(emb_x2, (h, c))
        #TODO: concat the last time-step hidden states from the two sequences
        #TODO: name the concated result `final_hidden`
        #TODO: `final_hidden` should have shape (batch_size, 2 * hidden_size)
        final_hidden = torch.cat((x1_out[:,-1,:].squeeze(1), x2_out[:,-1,:].squeeze(1)), -1)
        #TODO: apply dropout to the `final_hidden` tensor
        final_hidden = self.dropout_layer(final_hidden)
        #TODO: pass `final_hidden` throught the `self.output` linear layer and then
        #TODO: apply a sigmoid transformation to the output of self.output
        #TODO: name the transformed output as `out`
        #TODO: `out` should have the shape (batch_size, 1)
        out = torch.sigmoid(self.output(final_hidden))

        pred = out.clone().detach()
        pred[pred >= 0.5] = 1
        pred[pred < 0.5] = 0
        return out, pred

    def forward(self, x1, x2, y):
        """Generates the loss and accuracy given a batch of sequences x1 and x2 and their associated classification label y
        Args:
            x1: sequence of indexes representing the first sentence
            x2: sequence of indexes representing the second sentence
            y: binary valued tensor representing the label for each x1,x2 sentence pair
        Returns:
            loss: the Binary cross entropy loss from the current batch of x1, x2, y
            acc: the accuracy of the current-batch predictions
        """
        #TODO: use the `self.predict` function to get the `out` and `pred`
        out, pred = self.predict(x1,x2)
        #TODO: compute the loss using the output (from the previous line and the labels `y`
        loss = self.loss(out, y)

        assert pred.shape == y.shape
        acc = (pred == y).sum().item() / y.numel()
        return loss, acc

    def train_step(self, x1, x2, y):
        """ Performs one step of SGD
        Args:
            x1: the input sequence, its size should be: (1, x1_length)
            x2: the input sequence, its size should be: (1, x2_length)
            y: the output label, its size should be (1, 1)
        Returns:
            loss: the loss for this example (note this is just for logging it is not a pytorch tensor)
            accuracy: the accuracy for this example
        """
        self.optimizer.zero_grad()
        _loss, acc = self(x1, x2, y) # calls self.forward(x, y)
        _loss.backward()
        grad_norm = torch.nn.utils.clip_grad_norm_(filter(lambda p: p.requires_grad, self.parameters()),
                                                   self.max_grad_norm)

        if math.isnan(grad_norm):
            print('skipping update grad_norm is nan!')
        else:
            self.optimizer.step()
        loss = _loss.item()
        return loss, acc

In [11]:
base_model = Classifier(vocab_size=len(train_corpus.vocab),
                        embedding_size=1024,
                        hidden_size=1024,
                        num_layers=2)
print(base_model, '\ncontains', sum([p.numel() for p in base_model.parameters() if p.requires_grad]), 'parameters')
base_model = base_model.cuda()

Classifier(
  (dropout_layer): Dropout(p=0.1, inplace=False)
  (embedding_layer): Embedding(61585, 1024)
  (uni_RNN_LSTM_layer): LSTM(1024, 1024, num_layers=2, batch_first=True, dropout=0.1)
  (output): Linear(in_features=2048, out_features=1, bias=True)
  (loss): BCELoss()
) 
contains 79858689 parameters


In [12]:
base_model = train(base_model, train_corpus, dev_corpus, 5)

epoch: 0 time elapsed: 106.63
train loss: 0.5645 train acc: 0.6994
  dev loss: 0.5138   dev acc: 0.7454
epoch: 1 time elapsed: 219.17
train loss: 0.4056 train acc: 0.8183
  dev loss: 0.5149   dev acc: 0.7667
epoch: 2 time elapsed: 331.78
train loss: 0.2191 train acc: 0.9097
  dev loss: 0.6921   dev acc: 0.7673
epoch: 3 time elapsed: 444.53
train loss: 0.0916 train acc: 0.9650
  dev loss: 0.9610   dev acc: 0.7561
epoch: 4 time elapsed: 557.07
train loss: 0.0557 train acc: 0.9800
  dev loss: 1.3052   dev acc: 0.7552


In [13]:
evaluate(base_model, test_corpus)

Predictions:
Avg acc: 0.7760


Creating train, dev and test data objects. (with `bert_format=1`).

In [14]:
train_corpus = STSCorpus(file='train.tsv',
                          cuda=True,
                          batch_size=32, bert_format=1)
dev_corpus = STSCorpus(file='dev.tsv', vocab=train_corpus.vocab,
                        cuda=True,
                        batch_size=32,bert_format=1)
test_corpus = STSCorpus(file='test.tsv', vocab=train_corpus.vocab,
                        cuda=True,
                        batch_size=1,bert_format=1)
print(train_corpus.data_size, dev_corpus.data_size, test_corpus.data_size)

100%|██████████| 231508/231508 [00:00<00:00, 2652872.30B/s]


1212 33 0
151 51 0
1213 152 2500


# Part 2: BERT based Classifier
Next we will implement a sentence similarity predictor using DistilBERT. A nice property/design of pytorch-transformer library is that we can obtain a pretrained BERT model using the following simple line of code:

In [0]:
class BERTClassifier(Classifier):
    def __init__(self,
                 dropout=0.1,
                 max_grad_norm=5.0):
        super().__init__(0, 0, 0, 0, dropout, max_grad_norm)
        self.output = torch.nn.Linear(768, 1)
        #TODO: we have created a linear layer `self.output` for you.
        #TODO: for Bert fine-tuning to work, the weights of this layer should be initialized to a small random values
        #TODO: initialize the `weight` variable in self.output 
        #TODO: with a 0 mean 0.05 var Normal distribution
        #TODO: this link may be useful https://pytorch.org/cppdocs/api/function_namespacetorch_1_1nn_1_1init_1a105c2a8ef81c6faa82a01cf35ce9f3b1.html
        weight = torch.nn.init.normal_(torch.zeros(1,768), mean = 0, std = 0.05)
        self.output.weight = torch.nn.Parameter(weight)
        #using a pretrained bert model using pytorch-transformers is as easy as adding the line below!
        self.bert_model = BertModel.from_pretrained('distilbert-base-uncased')
        #note that the learning rate for fine-tuning should be small, we will use 1e-5 for our learning rate.
        self.optimizer = torch.optim.Adam(filter(lambda p: p.requires_grad, self.parameters()), lr=1e-5)

    def predict(self, x1, x2=None):
        assert x2 is None
        #TODO: x1 is a batch of sequence pairs. 
        #TODO: BERT (and DistillBERT) have been trained such that
        #TODO: the ouput at the first time-step can be used for sentence similarity classification.
        #TODO: Pass the x1 tensor to the `self.bert_model`
        #TODO: Note the Bert model will return a tuple, you only need the first item (which are the hidden states from the last layer of Bert) in the tuple for this task.
        #TODO: documentation for the Bert model can be found here: https://huggingface.co/transformers/model_doc/bert.html
        #TODO: the result should have shape (batch_size, seq_size, 768)
        x2 = self.bert_model(x1)
        #TODO: Extract the first time step hidden state from all the hidden states.
        #TODO: Pass the first time step hidden state through the `self.output` linear layer
        #TODO: Pass the output of the linear layer through a sigmoid function and name the result `out`.
        #TODO: `out` should have the shape (batch_size, 1)
        out = torch.sigmoid(self.output(x2[0][:,-1,:].squeeze(1)))

        
        
        pred = out.clone().detach()
        pred[pred >= 0.5] = 1
        pred[pred < 0.5] = 0
        return out, pred

In [16]:
bert_model = BERTClassifier()
bert_model = bert_model.cuda()
print(bert_model, '\ncontains', sum([p.numel() for p in bert_model.parameters() if p.requires_grad]), 'parameters')

100%|██████████| 492/492 [00:00<00:00, 473736.82B/s]
100%|██████████| 267967963/267967963 [00:03<00:00, 72310751.52B/s]


BERTClassifier(
  (dropout_layer): Dropout(p=0.1, inplace=False)
  (loss): BCELoss()
  (output): Linear(in_features=768, out_features=1, bias=True)
  (bert_model): DistilBertModel(
    (embeddings): Embeddings(
      (word_embeddings): Embedding(30522, 768, padding_idx=0)
      (position_embeddings): Embedding(512, 768)
      (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
      (dropout): Dropout(p=0.1, inplace=False)
    )
    (transformer): Transformer(
      (layer): ModuleList(
        (0): TransformerBlock(
          (dropout): Dropout(p=0.1, inplace=False)
          (attention): MultiHeadSelfAttention(
            (dropout): Dropout(p=0.1, inplace=False)
            (q_lin): Linear(in_features=768, out_features=768, bias=True)
            (k_lin): Linear(in_features=768, out_features=768, bias=True)
            (v_lin): Linear(in_features=768, out_features=768, bias=True)
            (out_lin): Linear(in_features=768, out_features=768, bias=True)
          )
 

In [17]:
bert_model = train(bert_model, train_corpus, dev_corpus, 3) # takes ~1 hour

epoch: 0 time elapsed: 220.21
train loss: 0.4207 train acc: 0.8011
  dev loss: 0.3455   dev acc: 0.8415
epoch: 1 time elapsed: 449.09
train loss: 0.2932 train acc: 0.8771
  dev loss: 0.2984   dev acc: 0.8765
epoch: 2 time elapsed: 678.29
train loss: 0.2231 train acc: 0.9122
  dev loss: 0.3167   dev acc: 0.8734


In [18]:
evaluate(bert_model, test_corpus)

Predictions:
Avg acc: 0.8684
