---   
# HW3 - Transfer learning

#### Due October 30, 2019

In this assignment you will learn about transfer learning. This technique is perhaps one of the most important in industry. When the problem you want to solve does not have enough data, you can use a different (larger) dataset to learn representations that help you solve your task on the smaller dataset.

The general steps to transfer learning are as follows:

1. Find a huge dataset with similar characteristics to the problem you are interested in.
2. Choose a model powerful enough to extract meaningful representations from the huge dataset.
3. Train this model on the huge dataset.
4. Use the trained model as the starting point (or as a frozen feature extractor) when training on the smaller dataset.
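
The four steps above can be sketched end to end on synthetic data. This is a minimal sketch under stated assumptions, not the assignment's solution: the datasets, the tiny `encoder`, and the `train` helper are all hypothetical stand-ins chosen so the example runs in a few seconds on CPU.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Steps 1-2: a "huge" synthetic source dataset and a small encoder (both hypothetical).
big_x = torch.randn(2000, 10)
big_y = (big_x[:, 0] + big_x[:, 1] > 0).long()
encoder = nn.Sequential(nn.Linear(10, 16), nn.ReLU())
pretrain_head = nn.Linear(16, 2)


def train(model, x, y, steps=100, lr=1e-2):
    """Full-batch Adam training of the parameters that still require gradients."""
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        loss = nn.functional.cross_entropy(model(x), y)
        loss.backward()
        optimizer.step()
    return loss.item()


# Step 3: train the encoder (plus a throwaway head) on the big dataset.
train(nn.Sequential(encoder, pretrain_head), big_x, big_y)

# Step 4: freeze the encoder and train only a fresh head on the small dataset.
for p in encoder.parameters():
    p.requires_grad = False
frozen_copy = [p.clone() for p in encoder.parameters()]

small_x = torch.randn(40, 10)
small_y = (small_x[:, 0] - small_x[:, 1] > 0).long()
new_head = nn.Linear(16, 2)
train(nn.Sequential(encoder, new_head), small_x, small_y)

# The frozen encoder must come out of step 4 bit-for-bit identical.
encoder_unchanged = all(
    torch.equal(a, b) for a, b in zip(frozen_copy, encoder.parameters())
)
print("encoder unchanged during step 4:", encoder_unchanged)
```

Freezing is only one option for step 4. Leaving the encoder trainable (usually with a small learning rate) is the other common choice, and which works better depends on how much target data is available.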


### This homework has the following sections:
1. Question 1: MNIST fine-tuning (Parts A, B, C, D)
2. Question 2: Pretrain on Wikitext2 (Parts A, B, C, D)
3. Question 3: Fine-tune on MNLI (Parts A, B, C, D)
4. Question 4: Fine-tune using pretrained BERT (Parts A, B, C)

In [1]:
!pip install jsonlines
import json
import os
import pickle
from collections import defaultdict

import jsonlines
import numpy
import numpy as np
import torch
import torch.nn.functional as F
import torchvision.models as models
from torch import nn
from torch.utils.data import DataLoader, Dataset, random_split
from torchvision import transforms
from torchvision.datasets import MNIST
from tqdm import tqdm

Collecting jsonlines
  Downloading https://files.pythonhosted.org/packages/4f/9a/ab96291470e305504aa4b7a2e0ec132e930da89eb3ca7a82fbe03167c131/jsonlines-1.2.0-py2.py3-none-any.whl
Installing collected packages: jsonlines
Successfully installed jsonlines-1.2.0


In [2]:
!pip install transformers
from transformers.data.processors.glue import MnliProcessor
import torch
import pandas as pd
import os
import sys
import shutil
import argparse
import tempfile
import urllib.request
import zipfile
from transformers import glue_convert_examples_to_features as convert_examples_to_features
from transformers import BertModel, BertTokenizer
from torch.utils.data import TensorDataset, RandomSampler, DataLoader

tokenizer = BertTokenizer.from_pretrained('bert-base-cased')
bert = BertModel.from_pretrained('bert-base-cased', output_attentions=True)

Collecting transformers
  Downloading https://files.pythonhosted.org/packages/fd/f9/51824e40f0a23a49eab4fcaa45c1c797cbf9761adedd0b558dab7c958b34/transformers-2.1.1-py3-none-any.whl (311kB)
Collecting regex (from transformers)
  Downloading https://files.pythonhosted.org/packages/ff/60/d9782c56ceefa76033a00e1f84cd8c586c75e6e7fea2cd45ee8b46a386c5/regex-2019.08.19-cp36-cp36m-manylinux1_x86_64.whl (643kB)
Collecting sacremoses (from transformers)
  Downloading https://files.pythonhosted.org/packages/1f/8e/ed5364a06a9ba720fddd9820155cc57300d28f5f43a6fd7b7e817177e642/sacremoses-0.0.35.tar.gz (859kB)
Collecting sentencepiece (from transformers)
  Downloading https://files.pythonhosted.org/packages/14/3d/efb655a670b98f62ec32d66954e1109f403db4d937c50d779a75b9763a29/sentencepiece-0.1.83-cp36-c

100%|██████████| 213450/213450 [00:00<00:00, 5422322.16B/s]
100%|██████████| 313/313 [00:00<00:00, 153545.87B/s]
100%|██████████| 435779157/435779157 [00:08<00:00, 49019016.69B/s]


In [3]:
TASKS = ["CoLA", "SST", "MRPC", "QQP", "STS", "MNLI", "SNLI", "QNLI", "RTE", "WNLI", "diagnostic"]
TASK2PATH = {
    "CoLA": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FCoLA.zip?alt=media&token=46d5e637-3411-4188-bc44-5809b5bfb5f4",  # noqa
    "SST": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSST-2.zip?alt=media&token=aabc5f6b-e466-44a2-b9b4-cf6337f84ac8",  # noqa
    "MRPC": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2Fmrpc_dev_ids.tsv?alt=media&token=ec5c0836-31d5-48f4-b431-7480817f1adc",  # noqa
    "QQP": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQQP-clean.zip?alt=media&token=11a647cb-ecd3-49c9-9d31-79f8ca8fe277",  # noqa
    "STS": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSTS-B.zip?alt=media&token=bddb94a7-8706-4e0d-a694-1109e12273b5",  # noqa
    "MNLI": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FMNLI.zip?alt=media&token=50329ea1-e339-40e2-809c-10c40afff3ce",  # noqa
    "SNLI": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FSNLI.zip?alt=media&token=4afcfbb2-ff0c-4b2d-a09a-dbf07926f4df",  # noqa
    "QNLI": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FQNLIv2.zip?alt=media&token=6fdcf570-0fc5-4631-8456-9505272d1601",  # noqa
    "RTE": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FRTE.zip?alt=media&token=5efa7e85-a0bb-4f19-8ea2-9e1840f077fb",  # noqa
    "WNLI": "https://firebasestorage.googleapis.com/v0/b/mtl-sentence-representations.appspot.com/o/data%2FWNLI.zip?alt=media&token=068ad0a0-ded7-4bd7-99a5-5e00222e0faf",  # noqa
    "diagnostic": [
        "https://storage.googleapis.com/mtl-sentence-representations.appspot.com/tsvsWithoutLabels%2FAX.tsv?GoogleAccessId=firebase-adminsdk-0khhl@mtl-sentence-representations.iam.gserviceaccount.com&Expires=2498860800&Signature=DuQ2CSPt2Yfre0C%2BiISrVYrIFaZH1Lc7hBVZDD4ZyR7fZYOMNOUGpi8QxBmTNOrNPjR3z1cggo7WXFfrgECP6FBJSsURv8Ybrue8Ypt%2FTPxbuJ0Xc2FhDi%2BarnecCBFO77RSbfuz%2Bs95hRrYhTnByqu3U%2FYZPaj3tZt5QdfpH2IUROY8LiBXoXS46LE%2FgOQc%2FKN%2BA9SoscRDYsnxHfG0IjXGwHN%2Bf88q6hOmAxeNPx6moDulUF6XMUAaXCSFU%2BnRO2RDL9CapWxj%2BDl7syNyHhB7987hZ80B%2FwFkQ3MEs8auvt5XW1%2Bd4aCU7ytgM69r8JDCwibfhZxpaa4gd50QXQ%3D%3D",  # noqa
        "https://www.dropbox.com/s/ju7d95ifb072q9f/diagnostic-full.tsv?dl=1",
    ],
}

MRPC_TRAIN = "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_train.txt"
MRPC_TEST = "https://dl.fbaipublicfiles.com/senteval/senteval_data/msr_paraphrase_test.txt"


def download_and_extract(task, data_dir):
    print("Downloading and extracting %s..." % task)
    data_file = "%s.zip" % task
    urllib.request.urlretrieve(TASK2PATH[task], data_file)
    with zipfile.ZipFile(data_file) as zip_ref:
        zip_ref.extractall(data_dir)
    os.remove(data_file)
    print("\tCompleted!")
download_and_extract('MNLI', '.')

Downloading and extracting MNLI...
	Completed!


---  
### Question 4 (BERT)

BERT, a model released last year, opened up a major direction in research.

In this question you'll use BERT as your feature_extractor instead of the model you
designed yourself.

To get BERT, head over to https://github.com/huggingface/transformers and load your BERT model here.

### Part A (init BERT)
In this section you need to create an instance of BERT and return it from the function.

In [0]:
def init_mnli_dataset():
  # ----------------------
  # TRAIN/VAL DATALOADERS
  # ----------------------
  train = processor.get_train_examples('MNLI')
  features = convert_examples_to_features(train,
                                          tokenizer,
                                          label_list=['contradiction','neutral','entailment'],
                                          max_length=128,
                                          output_mode='classification',
                                          pad_on_left=False,
                                          pad_token=tokenizer.pad_token_id,
                                          pad_token_segment_id=0)
  train_dataset = TensorDataset(torch.tensor([f.input_ids for f in features], dtype=torch.long), 
                                torch.tensor([f.attention_mask for f in features], dtype=torch.long), 
                                torch.tensor([f.token_type_ids for f in features], dtype=torch.long), 
                                torch.tensor([f.label for f in features], dtype=torch.long))

  nb_train_samples = int(0.95 * len(train_dataset))
  nb_val_samples = len(train_dataset) - nb_train_samples

  bert_mnli_train_dataset, bert_mnli_val_dataset = random_split(train_dataset, [nb_train_samples, nb_val_samples])

  # train loader
  train_sampler = RandomSampler(bert_mnli_train_dataset)
  bert_mnli_train_dataloader = DataLoader(bert_mnli_train_dataset, sampler=train_sampler, batch_size=32)

  # val loader
  val_sampler = RandomSampler(bert_mnli_val_dataset)
  bert_mnli_val_dataloader = DataLoader(bert_mnli_val_dataset, sampler=val_sampler, batch_size=32)


  # ----------------------
  # TEST DATALOADERS
  # ----------------------
  dev = processor.get_dev_examples('MNLI')
  features = convert_examples_to_features(dev,
                                          tokenizer,
                                          label_list=['contradiction','neutral','entailment'],
                                          max_length=128,
                                          output_mode='classification',
                                          pad_on_left=False,
                                          pad_token=tokenizer.pad_token_id,
                                          pad_token_segment_id=0)

  bert_mnli_test_dataset = TensorDataset(torch.tensor([f.input_ids for f in features], dtype=torch.long), 
                                torch.tensor([f.attention_mask for f in features], dtype=torch.long), 
                                torch.tensor([f.token_type_ids for f in features], dtype=torch.long), 
                                torch.tensor([f.label for f in features], dtype=torch.long))

  # test dataset
  test_sampler = RandomSampler(bert_mnli_test_dataset)
  bert_mnli_test_dataloader = DataLoader(bert_mnli_test_dataset, sampler=test_sampler, batch_size=32)
  
  return bert_mnli_train_dataloader, bert_mnli_val_dataloader, bert_mnli_test_dataloader

In [0]:
processor = MnliProcessor()

In [0]:
train_loader, val_loader, test_loader = init_mnli_dataset()

In [7]:
from transformers import BertTokenizer, BertModel, BertForMaskedLM

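def init_bert():
    # NOTE: the tokenizer in the setup cell is loaded from 'bert-base-cased', while this
    # checkpoint is uncased. The two vocabularies differ, so the token ids produced by that
    # tokenizer do not line up with this model's embedding table.
    pretrained_weights = "bert-base-uncased"
    bert = BertModel.from_pretrained(pretrained_weights, output_attentions=True)

    def freeze_model(model):
        # Freeze every parameter so BERT acts as a fixed feature extractor
        for param in model.parameters():
            param.requires_grad = False

    freeze_model(bert)

    return bert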
  
bert = init_bert()

100%|██████████| 313/313 [00:00<00:00, 174298.61B/s]
100%|██████████| 440473133/440473133 [00:07<00:00, 58461152.18B/s]


### Part B (fine-tune with BERT)

Use BERT as your feature extractor to fine-tune on MNLI. Use a new fine-tune model (with reset weights).
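
"Reset weights" can mean either re-initialising an existing head in place or building a new one. A minimal sketch of both, assuming the head is a plain `nn.Linear`; the sizes (8 features, 3 classes) are illustrative and much smaller than BERT's hidden size:

```python
import torch
from torch import nn

torch.manual_seed(0)

head = nn.Linear(8, 3)  # illustrative sizes: 8 features, 3 classes

# Stand-in for a previous fine-tuning run: overwrite the weights with known values.
with torch.no_grad():
    head.weight.fill_(1.0)
    head.bias.fill_(1.0)

# Option 1: re-initialise the existing layer in place.
head.reset_parameters()

# Option 2: build a new layer, which is initialised on construction.
fresh_head = nn.Linear(8, 3)

still_all_ones = bool((head.weight == 1.0).all())
print("weights still equal to the old values:", still_all_ones)
```

Only the classification head is reset here. The pretrained encoder keeps its weights, since those are the representations being transferred.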

In [0]:
class BERTSequenceClassifier(nn.Module):
    def __init__(self, bert, num_classes):
        super().__init__()
        self.bert = bert
        self.W = nn.Linear(bert.config.hidden_size, num_classes)
        self.num_classes = num_classes
        
    def forward(self, input_ids, attention_mask, token_type_ids):
        h, _, attn = self.bert(input_ids=input_ids, 
                               attention_mask=attention_mask, 
                               token_type_ids=token_type_ids)
        h_cls = h[:, 0]
        logits = self.W(h_cls)
        return logits, attn

In [0]:
def init_finetune_model(bert):
  model = BERTSequenceClassifier(bert, 3)
  return model

fine_tune_model = init_finetune_model(bert)
  

In [0]:
import torch.optim as optim
from tqdm import trange

In [0]:
plot_cache = []
num_gpus = torch.cuda.device_count()
if num_gpus > 0:
    current_device = 'cuda'
else:
    current_device = 'cpu'

In [0]:
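fine_tune_model.to(current_device)
# NOTE: ignore_index is the [PAD] token id (0), which is also the class index of 'contradiction'
# in label_list, so those examples add nothing to the loss. nn.CrossEntropyLoss() counts all classes.
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.convert_tokens_to_ids("[PAD]")).to(current_device)
# Only the parameters that still require gradients (the linear head) are optimized
optimizer = optim.Adam([p for p in fine_tune_model.parameters() if p.requires_grad], lr=2e-05, eps=1e-08)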
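
The criterion above sets `ignore_index` to the id of `[PAD]`, which is 0 in the BERT vocabularies. With `label_list=['contradiction','neutral','entailment']`, class index 0 is also `contradiction`. The sketch below uses random logits and hypothetical targets to show what `ignore_index=0` does to a sequence-level classification loss:

```python
import torch
from torch import nn

torch.manual_seed(0)

logits = torch.randn(6, 3, requires_grad=True)  # 6 examples, 3 classes
targets = torch.tensor([0, 1, 2, 0, 1, 2])      # label 0 plays the role of 'contradiction'

plain = nn.CrossEntropyLoss()
ignoring_zero = nn.CrossEntropyLoss(ignore_index=0)

loss_ignoring = ignoring_zero(logits, targets)

# Same value as the plain loss averaged over the examples whose label is not 0.
keep = targets != 0
loss_on_kept = plain(logits[keep], targets[keep])

# Rows with label 0 receive no gradient at all.
loss_ignoring.backward()
grad_on_label_zero = logits.grad[~keep]

print(loss_ignoring.item(), loss_on_kept.item())
print(grad_on_label_zero.abs().sum().item())
```

`ignore_index` is meant for token-level targets that contain padding positions. For one label per example there is nothing to ignore, so the default `nn.CrossEntropyLoss()` is the usual choice.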

In [13]:
[p for p in fine_tune_model.parameters() if p.requires_grad]

[Parameter containing:
 tensor([[-0.0237, -0.0244,  0.0062,  ..., -0.0012, -0.0132, -0.0165],
         [-0.0315, -0.0306,  0.0277,  ...,  0.0099,  0.0013, -0.0290],
         [-0.0132,  0.0326,  0.0145,  ...,  0.0224, -0.0180,  0.0262]],
        device='cuda:0', requires_grad=True), Parameter containing:
 tensor([-0.0111, -0.0088, -0.0087], device='cuda:0', requires_grad=True)]

In [0]:
LOAD_PRETRAINED = False
import torch.nn.functional as F

In [0]:
def fine_tune_mnli_BERT(model, train_loader, val_loader):
  plot_cache = []
  for epoch_number in range(2):
      avg_loss=0
      model.train()
      train_log_cache = []
      for i, (inp, attention_masks , token_type_ids ,target) in enumerate(train_loader):
          optimizer.zero_grad()
          inp = inp.to(current_device)
          target = target.to(current_device)
          token_type_ids = token_type_ids.to(current_device)
          attention_masks = attention_masks.to(current_device)
          logits, _ = model(inp,attention_masks,token_type_ids)

          loss = criterion(logits.view(-1, logits.size(-1)), target.view(-1))

          loss.backward()
          nn.utils.clip_grad_norm_(model.parameters(), 1.0)
          optimizer.step()

          train_log_cache.append(loss.item())

          if i % 1000 == 0:
              avg_loss = sum(train_log_cache)/len(train_log_cache)
              print('Step {} avg train loss = {:.{prec}f}'.format(i, avg_loss, prec=4))
              train_log_cache = []

      # Validation pass: no gradients, dropout disabled
      model.eval()
      total = 0
      correct = 0
      with torch.no_grad():
          for inp, attention_masks, token_type_ids, target in val_loader:
              inp = inp.to(current_device)
              target = target.to(current_device)
              token_type_ids = token_type_ids.to(current_device)
              attention_masks = attention_masks.to(current_device)
              logits, _ = model(inp, attention_masks, token_type_ids)

              # argmax over logits equals argmax over softmax probabilities
              predicted = logits.argmax(dim=1)
              total += target.size(0)
              correct += (predicted == target).sum().item()
      print("val acc", 100 * correct / total)

In [16]:
fine_tune_mnli_BERT(fine_tune_model, train_loader, val_loader)

Step 0 avg train loss = 1.1723
Step 1000 avg train loss = 0.7608
Step 2000 avg train loss = 0.6769
Step 3000 avg train loss = 0.6647
Step 4000 avg train loss = 0.6653
Step 5000 avg train loss = 0.6591
Step 6000 avg train loss = 0.6563
Step 7000 avg train loss = 0.6538
Step 8000 avg train loss = 0.6515
Step 9000 avg train loss = 0.6536
Step 10000 avg train loss = 0.6503
Step 11000 avg train loss = 0.6491
val acc 42.32022815237319
Step 0 avg train loss = 0.5556
Step 1000 avg train loss = 0.6466
Step 2000 avg train loss = 0.6476
Step 3000 avg train loss = 0.6466
Step 4000 avg train loss = 0.6451
Step 5000 avg train loss = 0.6442
Step 6000 avg train loss = 0.6457
Step 7000 avg train loss = 0.6434
Step 8000 avg train loss = 0.6431
Step 9000 avg train loss = 0.6440
Step 10000 avg train loss = 0.6449
Step 11000 avg train loss = 0.6435
val acc 43.18598492564677
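
As a reference point for reading loss values like those in the log above: a classifier that assigns equal probability to each of k classes has a cross-entropy of ln k. The helper name below is hypothetical, and the sketch only computes that reference value:

```python
import math


def uniform_cross_entropy(num_classes):
    """Cross-entropy (in nats) of predicting 1/num_classes for every class."""
    return -math.log(1.0 / num_classes)


for k in (2, 3):
    print(k, round(uniform_cross_entropy(k), 4))
# 2 0.6931
# 3 1.0986
```

The first logged loss (1.1723) is close to the three-class value, while the later values around 0.65 sit near the two-class one.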


### Part C
Evaluate how well we did.
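
An accuracy number is easier to judge next to a trivial baseline. The sketch below uses hypothetical toy labels, not MNLI data, to compute accuracy alongside the score of always predicting the most frequent class:

```python
from collections import Counter


def accuracy(predictions, labels):
    """Percentage of positions where prediction and label agree."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return 100.0 * correct / len(labels)


def majority_baseline(labels):
    """Accuracy of always predicting the most frequent label."""
    most_common_count = Counter(labels).most_common(1)[0][1]
    return 100.0 * most_common_count / len(labels)


# Hypothetical toy data: 0, 1, 2 stand for the three classes.
labels = [0, 1, 2, 0, 1, 2, 0, 1, 2]
predictions = [1, 1, 2, 2, 1, 2, 1, 1, 2]  # this "model" never predicts class 0

print(round(accuracy(predictions, labels), 2))    # 66.67
print(round(majority_baseline(labels), 2))        # 33.33
```

For a three-way task with roughly balanced labels the majority baseline is about 33%, so a test accuracy only a few points above that suggests the classifier has learned little.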

In [17]:
def calculate_mnli_test_accuracy_BERT(model, test_loader):
  # Returns the accuracy on test_loader as a percentage in [0, 100]
  model.eval()
  total = 0
  correct = 0
  with torch.no_grad():
    for inp, attention_masks, token_type_ids, target in test_loader:
      inp = inp.to(current_device)
      target = target.to(current_device)
      token_type_ids = token_type_ids.to(current_device)
      attention_masks = attention_masks.to(current_device)
      logits, _ = model(inp, attention_masks, token_type_ids)

      # argmax over logits equals argmax over softmax probabilities
      predicted = logits.argmax(dim=1)
      total += target.size(0)
      correct += (predicted == target).sum().item()
  accuracy = 100 * correct / total
  print("test acc", accuracy)
  return accuracy

calculate_mnli_test_accuracy_BERT(fine_tune_model, test_loader)

test acc 42.83239938869078


42.83239938869078

## Let's grade your BERT results!

In [0]:
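def grade_mnli_BERT():
    global optimizer

    BERT_feature_extractor = init_bert()

    # load data: reuse the loaders built earlier instead of re-tokenizing MNLI
    mnli_train, mnli_val, mnli_test = train_loader, val_loader, test_loader

    # init the fine_tune model on top of the BERT feature extractor
    fine_tune_model = init_finetune_model(BERT_feature_extractor)
    fine_tune_model.to(current_device)

    # fine_tune_mnli_BERT reads the module-level optimizer, so rebind it to the new model
    optimizer = optim.Adam([p for p in fine_tune_model.parameters() if p.requires_grad], lr=2e-05, eps=1e-08)

    # finetune
    fine_tune_mnli_BERT(fine_tune_model, mnli_train, mnli_val)

    # check test accuracy
    test_accuracy = calculate_mnli_test_accuracy_BERT(fine_tune_model, mnli_test)

    # the real threshold will be released by Oct 11 
    assert test_accuracy > 0.0, 'ummm... your accuracy is too low...'

grade_mnli_BERT()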