# Problem 4

In this problem, we first fine-tune a BERT model without pretrained weights on the RTE dataset, and then fine-tune a pretrained BERT model on the same dataset.

**IMPORTANT NOTES**:
- Please make sure that you have already read the part of hw5 pdf that corresponds to this problem. This is very important.
- At the end of hw5, you will need to submit a zip folder containing three things. These instructions are also included in the first paragraph of the hw5 pdf.
  - (1) The writeup pdf containing your solutions to Problems 1, 2, 3, 4, 5. Yes, there are questions you need to answer in your writeup (see hw5 pdf).
  - (2) The downloaded colab corresponding to Problem 4.
  - (3) The downloaded colab corresponding to Problem 5.

Some imports and data downloading

In [1]:
!git clone https://github.com/huggingface/transformers
!python transformers/utils/download_glue_data.py --tasks RTE
!pip install transformers=='3.5.1'

Cloning into 'transformers'...
remote: Enumerating objects: 80, done.
remote: Counting objects: 100% (80/80), done.
remote: Compressing objects: 100% (76/76), done.
remote: Total 54622 (delta 43), reused 35 (delta 3), pack-reused 54542
Receiving objects: 100% (54622/54622), 40.80 MiB | 30.43 MiB/s, done.
Resolving deltas: 100% (38173/38173), done.
Downloading and extracting RTE...
	Completed!
Collecting transformers==3.5.1
  Downloading https://files.pythonhosted.org/packages/3a/83/e74092e7f24a08d751aa59b37a9fc572b2e4af3918cb66f7766c3affb1b4/transformers-3.5.1-py3-none-any.whl (1.3MB)
Collecting sentencepiece==0.1.91
  Downloading https://files.pythonhosted.org/packages/d4/a4/d0a884c4300004a78cca907a6ff9a5e9fe4f090f5d95ab341c53d28cbc58/sentencepiece-0.1.91-cp36-cp36m-manylinux1_x86_64.whl (1.1MB)
Collecting tokenizers==0.9.3
  Downloading h

In [2]:
import dataclasses
import logging
import os
import sys
from dataclasses import dataclass, field
from typing import Dict, Optional

import numpy as np

import torch
import torch.nn as nn 
from transformers import (
    AutoConfig,
    AutoModel,
    AutoTokenizer,
    BertForSequenceClassification,
    BertModel,
    BertPreTrainedModel,
    EvalPrediction,
    GlueDataset,
    GlueDataTrainingArguments,
    Trainer,
    TrainingArguments,
    glue_compute_metrics,
    glue_tasks_num_labels,
    set_seed,
)

In [3]:
!ls glue_data/RTE/

dev.tsv  test.tsv  train.tsv


In [4]:
model_name = "bert-base-uncased"

data_args = GlueDataTrainingArguments(task_name="rte", data_dir="./glue_data/RTE")
training_args = TrainingArguments(
    logging_steps=50, 
    per_device_train_batch_size=32, 
    per_device_eval_batch_size=64, 
    save_steps=1000,
    evaluate_during_training=True,
    output_dir="./models/rte",
    overwrite_output_dir=True,
    do_train=True,
    do_eval=True,
    do_predict=True,
    learning_rate=0.00001,
    num_train_epochs=15,
)
set_seed(42)
num_labels = glue_tasks_num_labels[data_args.task_name]
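
As a rough sanity check on these settings, the sketch below estimates how many optimizer steps a full run takes. It assumes the RTE training split has 2,490 examples, a single device, and no gradient accumulation; none of these is stated in this notebook, and `train_size` is a name introduced here for illustration.

```python
import math

# Assumption: the GLUE RTE training split has 2,490 examples.
train_size = 2490
per_device_train_batch_size = 32  # from training_args above
num_train_epochs = 15             # from training_args above
logging_steps = 50                # from training_args above

# The last, smaller batch of each epoch still counts as a step.
steps_per_epoch = math.ceil(train_size / per_device_train_batch_size)
total_steps = steps_per_epoch * num_train_epochs

# Number of logged rows, assuming one row every `logging_steps` steps.
logged_rows = total_steps // logging_steps

print(steps_per_epoch, total_steps, logged_rows)
```

Under these assumptions a full run is 1,170 steps, so a log that stops at step 500 covers less than half of training.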



### From non-pretrained BERT

TODO:
- Complete the following three lines so that ```tokenizer```, ```config```, and ```bert_model``` correspond to the ```model_name``` we defined in the cells above.
- IMPORTANT: make sure that the BERT model does not load pretrained weights!
- Hint: see https://huggingface.co/transformers/model_doc/auto.html and other relevant Hugging Face documentation. Consider using the tools we imported in the first cell. Another hint: it's okay to use ```from_pretrained``` in the first two lines, depending on which class you use.

In [5]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
bert_model = AutoModel.from_config(config)


TODO:
- Complete the forward function of the following class so that the model can be fine-tuned on the RTE dataset.
- For more instructions, please refer to the hw5 pdf.

In [6]:
class SequenceClassificationBERT(nn.Module):
      
    def __init__(self, config, bert_model):
        super().__init__()
        self.config = config
        self.num_labels = config.num_labels
        self.dropout = nn.Dropout(config.hidden_dropout_prob)
        self.classifier = nn.Linear(config.hidden_size, config.num_labels)
        self.bert = bert_model

    def forward(
        self,
        input_ids=None,
        attention_mask=None,
        token_type_ids=None,
        position_ids=None,
        head_mask=None,
        inputs_embeds=None,
        labels=None,
        output_attentions=None,
        output_hidden_states=None,
        return_dict=None,
        self_pool=False
    ):
        return_dict = return_dict if return_dict is not None else self.config.use_return_dict
        # BertModel takes no labels; the loss is computed below from the logits.
        outputs = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            token_type_ids=token_type_ids,
            position_ids=position_ids,
            head_mask=head_mask,
            inputs_embeds=inputs_embeds,
            output_attentions=output_attentions,
            output_hidden_states=output_hidden_states,
            return_dict=return_dict,
        )
        if self_pool:
            # Use the final-layer hidden state of the first ([CLS]) token.
            last_hidden_state = outputs[0]
            pooled_out = last_hidden_state[:, 0, :]
        else:
            # Use BERT's pooler output.
            pooled_out = outputs[1]
        pooled_out = self.dropout(pooled_out)
        # Note: tanh bounds each logit to [-1, 1]; the stock
        # BertForSequenceClassification head passes raw classifier outputs to the loss.
        logits = torch.tanh(self.classifier(pooled_out))

        # Compute the loss only when labels are given, so that the
        # `loss is not None` check below also works at prediction time.
        loss = None
        if labels is not None:
            loss_fct = nn.CrossEntropyLoss()
            loss = loss_fct(logits.view(-1, self.num_labels), labels.view(-1))

        # make sure that all the arguments in the forward() function are used
        # somewhere in the code

        # do not change the lines below, so make sure your code works for the
        # lines below
        output = (logits,) + outputs[2:]
        return ((loss,) + output) if loss is not None else output
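
Because `forward` passes the classifier output through `torch.tanh`, each of the two logits lies in [-1, 1]. The short calculation below, a sketch in plain Python rather than part of the assignment, shows the resulting ceiling on the softmax probability and the corresponding floor on the per-example cross-entropy loss.

```python
import math

# With two tanh-bounded logits, the most confident prediction possible is
# logit_correct = +1 and logit_wrong = -1.
best_logits = (1.0, -1.0)

# Softmax probability of the correct class in that best case.
exp_correct, exp_wrong = (math.exp(z) for z in best_logits)
max_prob = exp_correct / (exp_correct + exp_wrong)

# Cross-entropy for one example is -log(probability of the correct class).
min_loss = -math.log(max_prob)

print(f"max probability: {max_prob:.4f}")  # 0.8808
print(f"min loss:        {min_loss:.4f}")  # 0.1269
```

So with this head the loss cannot fall below roughly 0.127, even if every example is classified correctly.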


In [7]:
model = SequenceClassificationBERT(config=config, bert_model=bert_model)

TODO:
- Print out the number of trainable parameters in the BERT model. This can be done in one line. Please feel free to look up resources online. We also briefly touched upon relevant materials in Lab 3, but here, make sure you only count the number of trainable parameters.

In [8]:
n_parameters = sum(p.numel() for p in model.parameters() if p.requires_grad)
print(n_parameters)

109483778
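
The printed total can be reproduced by hand from the layer shapes. The sketch below assumes the standard `bert-base-uncased` dimensions (12 layers, hidden size 768, feed-forward size 3072, vocabulary 30522, 512 positions, 2 token types) and a 2-label classifier; the helper `linear` and the variable names are made up for this illustration.

```python
hidden, layers, ffn = 768, 12, 3072
vocab, positions, token_types, num_labels = 30522, 512, 2, 2

def linear(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out

layer_norm = 2 * hidden  # scale and shift vectors

# Word, position, and token-type embedding tables, then one LayerNorm.
embeddings = (vocab + positions + token_types) * hidden + layer_norm

encoder_layer = (
    4 * linear(hidden, hidden)   # query, key, value, attention output
    + layer_norm
    + linear(hidden, ffn)        # feed-forward expansion
    + linear(ffn, hidden)        # feed-forward projection
    + layer_norm
)
pooler = linear(hidden, hidden)
classifier = linear(hidden, num_labels)

total = embeddings + layers * encoder_layer + pooler + classifier
print(total)  # 109483778
```

Only 1,538 of these parameters belong to the classification layer; everything else is the BERT encoder and its pooler.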


In [9]:
train_dataset = GlueDataset(data_args, tokenizer=tokenizer)
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev")



Now we train the model. Please make sure to read the pdf instructions. When you report results in the pdf writeup, report the mean and std of >=3 runs with different random seeds. Consider calling ```set_seed(some number)``` before the cell below, once before each run.

Make sure that in each run you pick the best validation accuracy. We're using Trainer instead of the usual training loop that we have seen many times earlier in the semester. The Trainer needs ```num_train_epochs``` (in ```training_args```), which we defined above. Please feel free to modify ```training_args``` such that:
- The learning rate is small (around 0.00001).
- Your model no longer shows large improvements in validation accuracy at the end of training. The expected behavior is that the final validation accuracy won't be much better than chance.

We provided part of an example log below, but you may be able to get better accuracy. Again, make sure this run corresponds to using a non-pretrained BERT.
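
To make "not much better than chance" concrete for a two-class task: a model that assigns probability 0.5 to each class has a cross-entropy of ln 2, whatever the labels are. The sketch below computes that value and also converts two accuracy values from the example log into counts of correct predictions, assuming the RTE dev split has 277 examples (an assumption, not stated in this notebook).

```python
import math

# Cross-entropy of a uniform prediction over two classes.
chance_loss = -math.log(0.5)
print(f"{chance_loss:.4f}")  # 0.6931

# Assumption: the RTE dev split has 277 examples.
dev_size = 277
for acc in (0.472924, 0.527076):
    # Convert a reported accuracy into a number of correct predictions.
    print(acc, round(acc * dev_size))
```

A validation loss that stays near 0.693, with accuracy moving between two complementary values such as these, would be consistent with a model that predicts a single class for almost every example.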

In [11]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

In [None]:

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.705232,0.695744,0.472924
100,0.693975,0.693097,0.552347
150,0.693033,0.693116,0.541516
200,0.693055,0.693194,0.472924
250,0.692858,0.693156,0.534296
300,0.696241,0.691756,0.527076
350,0.692825,0.692202,0.527076
400,0.693322,0.692722,0.527076
450,0.692958,0.692597,0.530686
500,0.693113,0.692802,0.519856




{'epoch': 15.0, 'eval_acc': 0.51985559566787, 'eval_loss': 0.7062535285949707}
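
One way to pick the best validation accuracy from a run is to scan the logged rows. The sketch below does this for the rows shown above, copied into a string; `log_text` and `best_row` are names introduced for this illustration, and a complete run would have more rows than this partial log.

```python
import csv
import io

log_text = """Step,Training Loss,Validation Loss,Acc
50,0.705232,0.695744,0.472924
100,0.693975,0.693097,0.552347
150,0.693033,0.693116,0.541516
200,0.693055,0.693194,0.472924
250,0.692858,0.693156,0.534296
300,0.696241,0.691756,0.527076
350,0.692825,0.692202,0.527076
400,0.693322,0.692722,0.527076
450,0.692958,0.692597,0.530686
500,0.693113,0.692802,0.519856
"""

rows = list(csv.DictReader(io.StringIO(log_text)))

# Keep the row with the highest validation accuracy (the first one wins ties).
best_row = max(rows, key=lambda r: float(r["Acc"]))
print(best_row["Step"], best_row["Acc"])  # 100 0.552347
```

Note that the best row here (step 100) differs from the final `eval_acc`, which is why the two should not be confused when reporting results.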

In [12]:
# Repeat the run with different seeds (compute_metrics is defined above).
seed_list = [42, 77, 100]
for seed in seed_list:
    set_seed(seed)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)
    # from_config builds the architecture without loading pretrained weights.
    bert_model = AutoModel.from_config(config)
    model = SequenceClassificationBERT(config=config, bert_model=bert_model)
    print('[SEED]:', seed)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    trainer.evaluate()


[SEED]: 42


Step,Training Loss,Validation Loss,Acc
50,0.709714,0.693547,0.472924
100,0.693744,0.692987,0.527076
150,0.693121,0.692943,0.527076
200,0.693087,0.692979,0.516245
250,0.693112,0.692882,0.519856
300,0.692907,0.692884,0.537906
350,0.69123,0.690436,0.519856
400,0.694451,0.69307,0.530686
450,0.693112,0.693062,0.527076
500,0.693142,0.693064,0.527076




[SEED]: 77


Step,Training Loss,Validation Loss,Acc
50,0.707981,0.692989,0.523466
100,0.69317,0.692961,0.516245
150,0.693113,0.692937,0.537906
200,0.693207,0.692943,0.534296
250,0.692887,0.693302,0.523466
300,0.692337,0.69177,0.527076
350,0.693178,0.692883,0.516245
400,0.693013,0.692893,0.527076
450,0.700681,0.692045,0.519856
500,0.693162,0.692866,0.519856




[SEED]: 100


Step,Training Loss,Validation Loss,Acc
50,0.698629,0.693952,0.472924
100,0.693533,0.692957,0.527076
150,0.693158,0.692886,0.527076
200,0.693131,0.692959,0.527076
250,0.69317,0.692979,0.519856
300,0.693083,0.693065,0.527076
350,0.693167,0.692744,0.519856
400,0.693246,0.692981,0.516245
450,0.693032,0.692971,0.530686
500,0.692429,0.701809,0.472924




In [34]:
train_loss = [0.685635, 0.659564, 0.683424]
val_loss = [0.698736, 0.706977, 0.701897]
val_acc = [0.537906, 0.516245, 0.537906]

print("train loss mean:{:.4f}, train loss std:{:.4f}".format(
     np.mean(train_loss), np.std(train_loss)))
print("   val loss mean:{:.4f}, val loss std:{:.4f}".format(
    np.mean(val_loss), np.std(val_loss)
))
print("   acc mean:{:.4f}, acc std:{:.4f}".format(
    np.mean(val_acc), np.std(val_acc)
))


train loss mean:0.6762, train loss std:0.0118
   val loss mean:0.7025, val loss std:0.0034
   acc mean:0.5307, acc std:0.0102
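
Note that `np.std` defaults to the population standard deviation (dividing by n). With only three runs, the sample standard deviation (dividing by n - 1) is noticeably larger. The sketch below uses Python's `statistics` module to show both for the accuracies above; which one to report is a choice the writeup should state.

```python
import statistics

val_acc = [0.537906, 0.516245, 0.537906]

mean_acc = statistics.mean(val_acc)
pop_std = statistics.pstdev(val_acc)    # divides by n, like np.std(x)
sample_std = statistics.stdev(val_acc)  # divides by n - 1, like np.std(x, ddof=1)

print(f"mean: {mean_acc:.4f}")
print(f"population std: {pop_std:.4f}")
print(f"sample std: {sample_std:.4f}")
```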


### From pretrained BERT

Now, let's do the above experiments using a pretrained BERT!

TODO:
- Complete the following three lines so that ```tokenizer```, ```config```, and ```bert_model``` correspond to the ```model_name``` we defined in the cells above.
- IMPORTANT (different from the TODO a few cells above): make sure that the BERT model below loads the pretrained weights!

In [25]:
tokenizer = AutoTokenizer.from_pretrained(model_name)
config = AutoConfig.from_pretrained(model_name)
bert_model = BertModel.from_pretrained(model_name)


In [26]:
bert_model 

BertModel(
  (embeddings): BertEmbeddings(
    (word_embeddings): Embedding(30522, 768, padding_idx=0)
    (position_embeddings): Embedding(512, 768)
    (token_type_embeddings): Embedding(2, 768)
    (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
    (dropout): Dropout(p=0.1, inplace=False)
  )
  (encoder): BertEncoder(
    (layer): ModuleList(
      (0): BertLayer(
        (attention): BertAttention(
          (self): BertSelfAttention(
            (query): Linear(in_features=768, out_features=768, bias=True)
            (key): Linear(in_features=768, out_features=768, bias=True)
            (value): Linear(in_features=768, out_features=768, bias=True)
            (dropout): Dropout(p=0.1, inplace=False)
          )
          (output): BertSelfOutput(
            (dense): Linear(in_features=768, out_features=768, bias=True)
            (LayerNorm): LayerNorm((768,), eps=1e-12, elementwise_affine=True)
            (dropout): Dropout(p=0.1, inplace=False)
          

In [27]:
train_dataset = GlueDataset(data_args, tokenizer=tokenizer)
eval_dataset = GlueDataset(data_args, tokenizer=tokenizer, mode="dev")



In [28]:
model = SequenceClassificationBERT(config=config, bert_model=bert_model)

TODO:
- Similarly, we train the model. For more instructions, please see the TODO cells above (i.e., the TODO corresponding to training the model, when we're not loading weights into BERT), as well as the hw5 pdf.

In [29]:
def compute_metrics(p: EvalPrediction):
    preds = np.argmax(p.predictions, axis=1)
    return glue_compute_metrics(data_args.task_name, preds, p.label_ids)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
    eval_dataset=eval_dataset,
    compute_metrics=compute_metrics,
)

trainer.train()
trainer.evaluate()

Step,Training Loss,Validation Loss,Acc
50,0.706412,0.673483,0.592058
100,0.67326,0.663302,0.602888
150,0.603644,0.651969,0.599278
200,0.52244,0.663544,0.592058
250,0.466264,0.682637,0.602888
300,0.39241,0.68492,0.638989
350,0.339178,0.7007,0.638989
400,0.299557,0.699984,0.65704
450,0.267922,0.725254,0.66065
500,0.22957,0.719482,0.638989




{'epoch': 15.0, 'eval_acc': 0.6714801444043321, 'eval_loss': 0.757057785987854}

In [30]:
seed_list = [42, 77, 100]
for seed in seed_list:
    set_seed(seed)
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    config = AutoConfig.from_pretrained(model_name)
    # from_pretrained loads the pretrained weights, so each run starts from them.
    bert_model = BertModel.from_pretrained(model_name)
    model = SequenceClassificationBERT(config=config, bert_model=bert_model)
    print('[SEED]:', seed)
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
    )

    trainer.train()
    trainer.evaluate()

[SEED]: 42


Step,Training Loss,Validation Loss,Acc
50,0.697165,0.681431,0.548736
100,0.676436,0.6589,0.610108
150,0.623303,0.640046,0.613718
200,0.548072,0.641502,0.646209
250,0.495864,0.629686,0.67148
300,0.422592,0.633591,0.66426
350,0.36263,0.660979,0.649819
400,0.33126,0.659077,0.65704
450,0.288745,0.68278,0.66426
500,0.253338,0.658682,0.696751




[SEED]: 77


Step,Training Loss,Validation Loss,Acc
50,0.695169,0.672763,0.617329
100,0.664624,0.659102,0.592058
150,0.607601,0.652916,0.602888
200,0.525232,0.656211,0.620939
250,0.468655,0.674218,0.617329
300,0.394102,0.690726,0.617329
350,0.334919,0.703073,0.631769
400,0.302861,0.715687,0.628159
450,0.26162,0.731467,0.638989
500,0.240568,0.753978,0.624549




[SEED]: 100


Step,Training Loss,Validation Loss,Acc
50,0.703993,0.666133,0.606498
100,0.655821,0.653204,0.602888
150,0.59992,0.640589,0.635379
200,0.505415,0.641108,0.642599
250,0.446843,0.653086,0.65343
300,0.36323,0.667209,0.65704
350,0.311898,0.685567,0.65343
400,0.278575,0.709224,0.66065
450,0.244235,0.725993,0.65343
500,0.225847,0.744602,0.65704




In [33]:
train_loss = [0.164747, 0.156541, 0.154866]
val_loss = [0.707724, 0.773094, 0.748200]
val_acc = [0.685921, 0.657040, 0.678700]

print("train loss mean:{:.4f}, train loss std:{:.4f}".format(
     np.mean(train_loss), np.std(train_loss)))
print("   val loss mean:{:.4f}, val loss std:{:.4f}".format(
    np.mean(val_loss), np.std(val_loss)
))
print("   acc mean:{:.4f}, acc std:{:.4f}".format(
    np.mean(val_acc), np.std(val_acc)
))

train loss mean:0.1587, train loss std:0.0043
   val loss mean:0.7430, val loss std:0.0269
   acc mean:0.6739, acc std:0.0123
