Batch Norm 
(1) reduce the problem the problem of input values changes <br>
cause them to be stable so later neural network has ground to stay on <br>
even the earlier layer keeps learning, the later layer that is forced to learn is reduced<br>
each layer can be learn more independently <br>
(2) Regularization effect <br>


In [1]:
def check_grad(model_ft):
    print("Parameters to learn:")
    params_to_update = []
    total_params = 0
    total_trainable_params = 0
    for name,param in model_ft.named_parameters():
        total_params+=param.numel()
        if param.requires_grad == True:
            params_to_update.append(param)
            total_trainable_params += param.numel()
            print("\t",name)
    print(f'{total_params:,} total parameters.')
    print(f'{total_trainable_params:,} training parameters.')
    return params_to_update

In [2]:
from torchvision.models.resnet import ResNet, BasicBlock
from torchvision.datasets import MNIST
from tqdm.autonotebook import tqdm
from sklearn.metrics import precision_score, recall_score, f1_score, accuracy_score
import inspect
import time
from torch import nn, optim
import torch
from torchvision.transforms import Compose, ToTensor, Normalize, Resize
from torch.utils.data import DataLoader
import time
import copy
from torch.nn import functional as F



---   
# HW3 - Transfer learning

#### Due October 30, 2019

In this assignment you will learn about transfer learning. This technique is perhaps one of the most important techniques for industry. When a problem you want to solve does not have enough data, we use a different (larger) dataset to learn representations which can help us solve our task using the smaller task.

The general steps to transfer learning are as follows:

1. Find a huge dataset with similar characteristics to the problem you are interested in.
2. Choose a model powerful enough to extract meaningful representations from the huge dataset.
3. Train this model on the huge dataset.
4. Use this model to train on the smaller dataset.


### This homework has the following sections:
1. Question 1: MNIST fine-tuning (Parts A, B, C, D).
2. Question 2: Pretrain on Wikitext2 (Part A, B, C, D)
3. Question 3: Finetune on MNLI (Part A, B, C, D)
4. Question 4: Finetune using pretrained BERT (Part A, B, C)

---   
## Question 1 (MNIST transfer learning)
To grasp the high-level approach to transfer learning, let's first do a simple example using computer vision. 

The torchvision library has pretrained models (resnets, vggnets, etc) on the Imagenet dataset. Imagenet is a dataset
with 1.3 million images covering over 1000 classes of objects. When you use one of these models, the weights of the model initialize
with the weights saved from training on imagenet.

In this task we will:
1. Choose a pretrained model.
2. Freeze the model so that the weights don't change.
3. Fine-tune on a few labels of MNIST.   

#### Choose a model
Here we pick any of the models from torchvision

In [3]:
import torch
import torchvision.models as models
from torch import nn as nn
class Identity(torch.nn.Module):
    def __init__(self):
        super(Identity, self).__init__()
        
    def forward(self, x):
        return x

# init the pretrained feature extractor
pretrained_resnet18 = models.resnet18(pretrained=True)

# we don't want the built in last layer, we're going to modify it ourselves
pretrained_resnet18.fc = Identity()

In [4]:
pretrained_resnet18

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

#### Freeze the model
Here we freeze the weights of the model. Freezing means the gradients will not backpropagate
into these weights.

By doing this you can think about the model as a feature extractor. This feature extractor outputs
a **representation** of an input. This representation is a matrix that encodes information about the input.

In [5]:
def freeze_model(model):
    for param in model.parameters():
        param.requires_grad = False
        
def unfreeze_model(model):
    for param in model.parameters():
        param.requires_grad = True
        
freeze_model(pretrained_resnet18)

In [6]:
check_grad(pretrained_resnet18)

Parameters to learn:
11,176,512 total parameters.
0 training parameters.


[]

#### Init target dataset
Here we define the dataset we are actually interested in.

In [7]:
import os
from torchvision import transforms
from torchvision.datasets import  MNIST
from torch.utils.data import DataLoader, random_split
import torch.nn.functional as F


#  train/val  split
mnist_dataset = MNIST(os.getcwd(), train=True, download=True, 
                      transform = transforms.Compose([
                          transforms.Resize((256,256)),
                          transforms.Grayscale(3),
                          transforms.ToTensor(),
                         transforms.Normalize((0.485, 0.456, 0.406), 
                                              (0.229, 0.224, 0.225))]))
mnist_train, mnist_val = random_split(mnist_dataset, [55000, 5000])

mnist_train = DataLoader(mnist_train, batch_size=32,shuffle=True)
mnist_val = DataLoader(mnist_val, batch_size=32,shuffle=True)

# test split
mnist_test = DataLoader(MNIST(os.getcwd(), train=False, download=True, 
                              transform=transforms.Compose([
                                transforms.Resize((256,256)),
                                transforms.Grayscale(3),
                                transforms.ToTensor(),
                                transforms.Normalize((0.485, 0.456, 0.406), 
                                              (0.229, 0.224, 0.22))])
                             ), 
                        batch_size=32,
                        shuffle=False)

dataloaders = dict()
dataloaders['train'] = mnist_train
dataloaders['val'] = mnist_val


In [8]:
# check dataset
for images, targets in mnist_train:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', targets.shape)
    break

Image batch dimensions: torch.Size([32, 3, 256, 256])
Image label dimensions: torch.Size([32])


In [9]:
for images, targets in mnist_test:
    print('Image batch dimensions:', images.shape)
    print('Image label dimensions:', targets.shape)
    break

Image batch dimensions: torch.Size([32, 3, 256, 256])
Image label dimensions: torch.Size([32])


### Part A (init fine-tune model)
decide what model to use for fine-tuning

In [9]:
pretrained_resnet18

ResNet(
  (conv1): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (relu): ReLU(inplace=True)
  (maxpool): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (layer1): Sequential(
    (0): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    )
    (1): BasicBlock(
      (conv1): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=True)
  

In [39]:
for i,p in enumerate(model.parameters()):
    print(p)
    print(p.numel())
    break

Parameter containing:
tensor([[[[-1.0419e-02, -6.1356e-03, -1.8098e-03,  ...,  5.6615e-02,
            1.7083e-02, -1.2694e-02],
          [ 1.1083e-02,  9.5276e-03, -1.0993e-01,  ..., -2.7124e-01,
           -1.2907e-01,  3.7424e-03],
          [-6.9434e-03,  5.9089e-02,  2.9548e-01,  ...,  5.1972e-01,
            2.5632e-01,  6.3573e-02],
          ...,
          [-2.7535e-02,  1.6045e-02,  7.2595e-02,  ..., -3.3285e-01,
           -4.2058e-01, -2.5781e-01],
          [ 3.0613e-02,  4.0960e-02,  6.2850e-02,  ...,  4.1384e-01,
            3.9359e-01,  1.6606e-01],
          [-1.3736e-02, -3.6746e-03, -2.4084e-02,  ..., -1.5070e-01,
           -8.2230e-02, -5.7828e-03]],

         [[-1.1397e-02, -2.6619e-02, -3.4641e-02,  ...,  3.2521e-02,
            6.6221e-04, -2.5743e-02],
          [ 4.5687e-02,  3.3603e-02, -1.0453e-01,  ..., -3.1253e-01,
           -1.6051e-01, -1.2826e-03],
          [-8.3730e-04,  9.8420e-02,  4.0210e-01,  ...,  7.0789e-01,
            3.6887e-01,  1.2455e-01]

In [76]:
# adding dropout: decrease accuracy 
class finetune(nn.Module):
    def __init__(self, input_size, n_classes, d =0.35):
        super(finetune, self).__init__()
        self.fc1 = nn.Linear(input_size, 256)
        self.bn = nn.BatchNorm1d(256*2,eps=1e-05, momentum=0.1, affine=True)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(256*2, n_classes)
        self.sm = nn.LogSoftmax(dim=1)
        self.dropout = nn.Dropout(d)
    
    def forward(self, x):
        x = self.fc1(x)
        x = self.bn(x)
        x = self.dropout(x)
        x = self.fc2(x)
        x = self.sm(x) 
        return x 

def init_fine_tune_model():
    return finetune(input_size=512, n_classes=10)

In [94]:
init_fine_tune_model()
test_accuracy, test_loss = calculate_mnist_test_accuracy(pretrained_resnet18, init_fine_tune_model(), mnist_test)

Test Loss: 2.3055 Acc: 0.1002


## (2) Second: run using 2 linear with relu

In [135]:
# Simple MLP
def init_fine_tune_model(n_inputs=512, n_classes=10 ):
    return nn.Sequential(
                      nn.Linear(n_inputs, 256), 
                      nn.ReLU(), 
                      nn.Linear(256, n_classes),                   
                      nn.LogSoftmax(dim=1))

In [118]:
test_accuracy, test_loss = calculate_mnist_test_accuracy(pretrained_resnet18, init_fine_tune_model(), mnist_test)

Test Loss: 2.3067 Acc: 0.0922


## (1) First run using MLP

In [124]:
### Simple MLP
# Runned This 
def init_fine_tune_model(n_inputs=512, n_classes=10 ):
    return nn.Sequential(
                      nn.Linear(n_inputs, n_classes),                   
                      nn.LogSoftmax(dim=1))
#test_accuracy, test_loss = calculate_mnist_test_accuracy(pretrained_resnet18, init_fine_tune_model(), mnist_test)

In [122]:
def init_fine_tune_model(n_inputs=512, n_classes=10 ):
    return nn.Sequential(
    nn.BatchNorm1d(n_inputs),
    nn.Dropout(p=0.25),
    nn.Linear(in_features=n_inputs, out_features=2048),
    nn.ReLU(),
    nn.BatchNorm1d(2048, eps=1e-05, momentum=0.1),
    nn.Dropout(p=0.5),
    nn.Linear(2048, n_classes))
test_accuracy, test_loss = calculate_mnist_test_accuracy(pretrained_resnet18, init_fine_tune_model(), mnist_test)

Test Loss: 2.3042 Acc: 0.0934


In [115]:
# ResNet
def conv3x3(in_planes, out_planes, stride=1, groups=1, dilation=1):
    """3x3 convolution with padding"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=3, stride=stride,
                     padding=dilation, groups=groups, bias=False, dilation=dilation)

def conv1x1(in_planes, out_planes, stride=1):
    """1x1 convolution"""
    return nn.Conv2d(in_planes, out_planes, kernel_size=1, stride=stride, bias=False)

class BasicBlock(nn.Module):
    expansion = 1

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()
        self.conv1 = conv3x3(inplanes, planes, stride)
        self.bn1 = nn.BatchNorm2d(planes)
        self.relu = nn.ReLU(inplace=True)
        
        self.conv2 = conv3x3(planes, planes)
        self.bn2 = nn.BatchNorm2d(planes)
        self.downsample = downsample
        self.stride = stride

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = self.bn1(out)
        out = self.relu(out)

        out = self.conv2(out)
        out = self.bn2(out)

        if self.downsample is not None:
            residual = self.downsample(x)

        out += residual
        out = self.relu(out)

        return out

class finetune_resnet(nn.Module):
    def __init__(self, input_size, n_classes, block, inplanes=512, planes = 1024, blocks = 2, stride=1, norm_layer=None):
        super(finetune_resnet, self).__init__()
        if norm_layer is None:
            self._norm_layer = nn.BatchNorm2d

        #self.conv1 = nn.Conv1d(512, 512, kernel_size = 1, stride=1, padding =3, bias=False)
#         self.bn1 = self._norm_layer(inplanes)
#         self.relu = nn.ReLU(inplace = True)
            
        self.layer5 = self._make_layer(block, inplanes, planes, blocks , stride=2) 
        self.avgpool = nn.AdaptiveAvgPool2d((1, 1)) ## new 
        self.fc = nn.Linear(planes * block.expansion, n_classes)
        self.softmax = nn.LogSoftmax(dim=1)

    def _make_layer(self, block, inplanes, planes, blocks, stride=1):
        norm_layer = self._norm_layer
        downsample = None
        if stride != 1 or inplanes != planes * block.expansion:
            downsample = nn.Sequential(
                conv1x1(inplanes, planes * block.expansion, stride),
                norm_layer(planes * block.expansion),
            )
           # print(downsample)
        layers = []
        layers.append(block(inplanes, planes, stride,downsample))
        inplanes = planes * block.expansion
        for _ in range(1, blocks):
            layers.append(block(inplanes, planes, stride=1,downsample=None))

        return nn.Sequential(*layers)
    
    
    def forward(self, x):
#         #x = self.conv1(x)
#         x = self.bn1(x)
#         x = self.relu(x)
        
        x = self.layer5(x)
        x = self.avgpool(x)
        x = torch.flatten(x, 1)
        x = self.fc(x)
        x = self.softmax(x)
        return x 

def init_fine_tune_model():
    return finetune_resnet(input_size= 512, n_classes=10, block=BasicBlock)

In [110]:
pretrained_resnet18_s = nn.Sequential(*list(pretrained_resnet18.children())[:-2])

In [116]:
test_accuracy, test_loss = calculate_mnist_test_accuracy(pretrained_resnet18_s, init_fine_tune_model(), mnist_test)

Test Loss: 2.3043 Acc: 0.1010


### Part B (Fine-tune (Frozen))

The actual problem we care about solving likely has a different number of classes or is a different task altogether. Fine-tuning is the process of using the extracted representations (features) to solve this downstream task  (the task you're interested in).

To illustrate this, we'll use our pretrained model (on Imagenet), to solve the MNIST classification task.

There are two types of finetuning. 

#### 1. Frozen feature_extractor
In the first type we pretrain with the FROZEN feature_extractor and NEVER unfreeze it during finetuning.


#### 2. Unfrozen feature_extractor
In the second, we finetune with a FROZEN feature_extractor for a few epochs, then unfreeze the feature extractor and finish training.


In this part we will use the first version

In [125]:
def train_val_multiple(feature_extractor, fine_tune_model, optimizer, criterion, dataloaders, num_epochs, frozen = True): 
    '''
    This model takes input of multiple models and return trianing and validation losses 
    '''
    train_acc_history, val_acc_history = [], []
    best_acc = 0.0
    since = time.time()
    
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 40)

        # Each epoch has a training and validation phase
        for phase, dataloader in dataloaders.items():
            if phase == 'train':
                fine_tune_model.train()
                feature_extractor.train()
#                 if frozen == False:
#                     feature_extractor.train() 
#                 else:
#                     feature_extractor.eval()  
            else:
                feature_extractor.eval()   
                fine_tune_model.eval()
                
            running_loss = 0.0
            running_corrects = 0

            # Iterate over data
            for inputs, labels in dataloader:
                inputs = inputs.to(device)
                labels = labels.to(device)
                
                # why is it necessary?? 
                inputs.requires_grad = True
                
                # zero the parameter gradients
                optimizer.zero_grad()

                with torch.set_grad_enabled(phase == 'train'):
                     
                    # Get model outputs and calculate loss
                    intermediates = feature_extractor(inputs)
                    # only if ResNet
                    # intermediates= intermediates.view(intermediates.size()[0],intermediates.size()[1],1,1)
                    outputs = fine_tune_model(intermediates)
                    loss = criterion(outputs, labels)
                    preds = torch.max(outputs, 1)[1]

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                    # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                
            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects / len(dataloaders[phase].dataset)
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_ft_dict = copy.deepcopy(fine_tune_model.state_dict())
                best_model_ft = fine_tune_model
            if phase == 'val':
                val_acc_history.append(epoch_acc.item())
            if phase == 'train':
                train_acc_history.append(epoch_acc.item())

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))
    
    print('Saving best model...')
    torch.save({
    'train_loss': train_acc_history,
    'val_loss': val_acc_history,
    'model_dict': best_model_ft_dict
            }, './unfreeze_model_ft.pt')
    
    return best_model_ft, train_acc_history, val_acc_history

In [126]:
## abbrev version
def FROZEN_fine_tune_mnist(feature_extractor, fine_tune_model, dataloaders, num_epochs, frozen = True):
    """
    model is a feature extractor (resnet).
    Create a new model which uses those features to finetune on MNIST
    
    return the fine_tune model
    """         
    n_inputs = models.resnet18(pretrained=True).fc.in_features
    n_classes = 10

    # cpu/gpu
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
    # put model on device
    fine_tune_model = fine_tune_model()
    feature_extractor, fine_tune_model = feature_extractor.to(device), fine_tune_model.to(device)       
    print('Training on', device)  
    # params_to_update= check_grad(model_ft)  

    optimizer = optim.SGD(fine_tune_model.parameters(), lr=0.001, momentum=0.9)
    #optimizer = optim.SGD(fine_tune_model.parameters(), lr=0.001, momentum=0.9)

    criterion = nn.CrossEntropyLoss()

    print('Frozen feature extractor and train fine tune for 10 epochs on ', device) 
    
    print('\n Feature extractor model:')
    check_grad(feature_extractor)
    print('\n Fine tune model:')
    check_grad(fine_tune_model)
    
    best_model_ft, train_acc_history, val_acc_history = train_val_multiple(feature_extractor, 
                                                                     fine_tune_model, optimizer, 
                                                                     criterion, dataloaders,
                                                                     num_epochs=10,frozen = True)    
    return best_model_ft, train_acc_history, val_acc_history

In [130]:
import torch.optim as optim

def FROZEN_fine_tune_mnist(feature_extractor, fine_tune_model, dataloaders, num_epochs, frozen = True):
    """
    model is a feature extractor (resnet).
    Create a new model which uses those features to finetune on MNIST
    
    return the fine_tune model
    """         
    n_inputs = models.resnet18(pretrained=True).fc.in_features
    n_classes = 10

    # cpu/gpu
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
    # put model on device
    fine_tune_model = fine_tune_model()
    feature_extractor, fine_tune_model = feature_extractor.to(device), fine_tune_model.to(device)       
    print('Training on', device)  
    # params_to_update= check_grad(model_ft)
    
    ## starts to train, source: https://pytorch.org/tutorials/beginner/finetuning_torchvision_models_tutorial.html
    
    # record time 
    since = time.time()
    # record val and train loss accuracy 
    train_acc_history, val_acc_history = [], []
    best_acc = 0.0
    #optimizer = optim.Adam(params_to_update, lr=0.001)
    optimizer = optim.SGD(fine_tune_model.parameters(), lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 40)

        # Each epoch has a training and validation phase
        for phase in ['train', 'val']:
            if phase == 'train':
                feature_extractor.train()  
                fine_tune_model.train()
            else:
                feature_extractor.eval()   
                fine_tune_model.eval()
                
            running_loss = 0.0
            running_corrects = 0

            # Iterate over data
            for inputs, labels in dataloaders[phase]:
                inputs = inputs.to(device)
                labels = labels.to(device)
                
                # why is it necessary?? 
                inputs.requires_grad = True
                
                # zero the parameter gradients
                optimizer.zero_grad()

                with torch.set_grad_enabled(phase == 'train'):
                     
                    # Get model outputs and calculate loss
                    intermediates = feature_extractor(inputs)
                    # only if ResNet
                    #intermediates= intermediates.view(intermediates.size()[0],intermediates.size()[1],1,1)
                    outputs = fine_tune_model(intermediates)
                    loss = criterion(outputs, labels)
                    preds = torch.max(outputs, 1)[1]

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                    # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                
            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_ft_dict = copy.deepcopy(fine_tune_model.state_dict())
                best_model_ft = fine_tune_model
            if phase == 'val':
                val_acc_history.append(epoch_acc.item())
            if phase == 'train':
                train_acc_history.append(epoch_acc.item())

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))
    
    print('Saving best model...')
    torch.save({
    'train_loss': train_acc_history,
    'val_loss': val_acc_history,
    'model_dict': best_model_ft_dict
            }, './frozen_model_ft.pt')
    
    return best_model_ft, train_acc_history, val_acc_history

In [131]:
# (1) A simple linear layer with softmax
model_ft, train_acc, val_acc = FROZEN_fine_tune_mnist(
    pretrained_resnet18, init_fine_tune_model, dataloaders, 10,frozen = True) 

Training on cuda:0
Epoch 0/9
----------------------------------------
train Loss: 0.5477 Acc: 0.8709
val Loss: 0.2761 Acc: 0.9342
Epoch 1/9
----------------------------------------
train Loss: 0.2741 Acc: 0.9290
val Loss: 0.2117 Acc: 0.9456
Epoch 2/9
----------------------------------------
train Loss: 0.2306 Acc: 0.9371
val Loss: 0.1849 Acc: 0.9514
Epoch 3/9
----------------------------------------
train Loss: 0.2066 Acc: 0.9420
val Loss: 0.1703 Acc: 0.9536
Epoch 4/9
----------------------------------------
train Loss: 0.1937 Acc: 0.9444
val Loss: 0.1606 Acc: 0.9550
Epoch 5/9
----------------------------------------
train Loss: 0.1818 Acc: 0.9476
val Loss: 0.1512 Acc: 0.9566
Epoch 6/9
----------------------------------------
train Loss: 0.1747 Acc: 0.9496
val Loss: 0.1444 Acc: 0.9576
Epoch 7/9
----------------------------------------
train Loss: 0.1699 Acc: 0.9518
val Loss: 0.1398 Acc: 0.9582
Epoch 8/9
----------------------------------------
train Loss: 0.1622 Acc: 0.9526
val Loss: 0

In [136]:
# (2) Trying to reproduce 
model_ft, train_acc, val_acc = FROZEN_fine_tune_mnist(
    pretrained_resnet18, init_fine_tune_model, dataloaders, 10,frozen = True) 

Training on cuda:0
Epoch 0/9
----------------------------------------
train Loss: 0.6308 Acc: 0.8471
val Loss: 0.2591 Acc: 0.9232
Epoch 1/9
----------------------------------------
train Loss: 0.2376 Acc: 0.9290
val Loss: 0.1742 Acc: 0.9478
Epoch 2/9
----------------------------------------
train Loss: 0.1925 Acc: 0.9409
val Loss: 0.1501 Acc: 0.9546
Epoch 3/9
----------------------------------------
train Loss: 0.1706 Acc: 0.9471
val Loss: 0.1368 Acc: 0.9580
Epoch 4/9
----------------------------------------
train Loss: 0.1580 Acc: 0.9507
val Loss: 0.1335 Acc: 0.9586
Epoch 5/9
----------------------------------------
train Loss: 0.1485 Acc: 0.9534
val Loss: 0.1233 Acc: 0.9610
Epoch 6/9
----------------------------------------
train Loss: 0.1418 Acc: 0.9555
val Loss: 0.1146 Acc: 0.9624
Epoch 7/9
----------------------------------------
train Loss: 0.1343 Acc: 0.9580
val Loss: 0.1134 Acc: 0.9644
Epoch 8/9
----------------------------------------
train Loss: 0.1304 Acc: 0.9588
val Loss: 0

In [None]:
# try abbrev version
model_ft, train_acc, val_acc = FROZEN_fine_tune_mnist(
    pretrained_resnet18, init_fine_tune_model, dataloaders, 2) 

In [21]:
# try some different results 
# forgot wtf this actually was......
model_ft, train_acc, val_acc = FROZEN_fine_tune_mnist(
    pretrained_resnet18, init_fine_tune_model, dataloaders, 10) 

Training on cuda:0
Epoch 0/9
----------------------------------------
train Loss: 0.3161 Acc: 0.9074
val Loss: 0.1537 Acc: 0.9544
Epoch 1/9
----------------------------------------
train Loss: 0.1949 Acc: 0.9394
val Loss: 0.1290 Acc: 0.9616
Epoch 2/9
----------------------------------------
train Loss: 0.1769 Acc: 0.9438
val Loss: 0.1201 Acc: 0.9620
Epoch 3/9
----------------------------------------
train Loss: 0.1705 Acc: 0.9455
val Loss: 0.1137 Acc: 0.9640
Epoch 4/9
----------------------------------------
train Loss: 0.1637 Acc: 0.9483
val Loss: 0.1095 Acc: 0.9668
Epoch 5/9
----------------------------------------
train Loss: 0.1601 Acc: 0.9489
val Loss: 0.1056 Acc: 0.9672
Epoch 6/9
----------------------------------------
train Loss: 0.1585 Acc: 0.9484
val Loss: 0.1072 Acc: 0.9650
Epoch 7/9
----------------------------------------
train Loss: 0.1561 Acc: 0.9505
val Loss: 0.1027 Acc: 0.9670
Epoch 8/9
----------------------------------------
train Loss: 0.1532 Acc: 0.9509
val Loss: 0

### Part C (compute test accuracy)
Compute the test accuracy of fine-tuned model on MNIST

In [133]:
def calculate_mnist_test_accuracy(feature_extractor, fine_tune_model, mnist_test):
    
    test_correct = 0 # YOUR CODE HERE...
    test_loss = 0
    
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    feature_extractor, fine_tune_model = feature_extractor.to(device), fine_tune_model.to(device)       

    feature_extractor.eval()
    fine_tune_model.eval()

    criterion = nn.CrossEntropyLoss()
    
    for inputs, labels in mnist_test:
        inputs = inputs.to(device)
        labels = labels.to(device)

        intermediates = feature_extractor(inputs)
#         intermediates= intermediates.view(intermediates.size()[0],intermediates.size()[1],1,1)
        logits = fine_tune_model(intermediates)
        outputs = F.softmax(logits, dim =1)
        #preds = outputs.max(1, keepdim = True)[1]
        preds = torch.max(outputs, 1)[1]
        loss = criterion(outputs, labels)

        # statistics
        test_loss += loss.item() * labels.size(0)
        test_correct += preds.eq(labels.view_as(preds).to(device)).sum().item()

    epoch_loss = test_loss / len(mnist_test.dataset)
    epoch_acc = test_correct / len(mnist_test.dataset)
    print('Test Loss: {:.4f} Acc: {:.4f}'.format(epoch_loss, epoch_acc))

    return epoch_acc, epoch_loss

In [134]:
# Single Liear layer + softmax with val > 0.96
calculate_mnist_test_accuracy(pretrained_resnet18, model_ft, mnist_test)

Test Loss: 1.8469 Acc: 0.6338


(0.6338, 1.8469313505172729)

In [137]:
# Reproduce
calculate_mnist_test_accuracy(pretrained_resnet18, model_ft, mnist_test)

Test Loss: 1.8881 Acc: 0.5809


(0.5809, 1.8880692113876343)

### Grade!
Let's see how you did

In [75]:
def grade_mnist_frozen():
    
    # init a ft model
    fine_tune_model = init_fine_tune_model()
    
    # run the transfer learning routine
    model_ft, train_acc, val_acc = FROZEN_fine_tune_mnist(pretrained_resnet18, init_fine_tune_model, dataloaders, 10)
    
    # calculate test accuracy
    test_accuracy = calculate_mnist_test_accuracy(pretrained_resnet18, model_ft, mnist_test, frozen = True)
    
    # the real threshold will be released by Oct 11 
    assert test_accuracy > 0.0, 'your accuracy is too low...'
    
    return test_accuracy
    
frozen_test_accuracy = grade_mnist_frozen()

TypeError: forward() missing 1 required positional argument: 'input'

### Part D (Fine-tune Unfrozen)
Now we'll learn how to train using the "unfrozen" approach.

In this approach we'll:
1. keep the feature_extract frozen for a few epochs (10)
2. Unfreeze it.
3. Finish training

In [None]:
def train_val(model_ft, optimizer, criterion, dataloaders, num_epochs, save_location): 
    '''
    This model takes input of multiple models and return trianing and validation losses 
    '''
    train_acc_history, val_acc_history = [], []
    best_acc = 0.0
    since = time.time()
    
    for epoch in range(num_epochs):
        print('Epoch {}/{}'.format(epoch, num_epochs - 1))
        print('-' * 40)

        # Each epoch has a training and validation phase
        for phase, dataloader in dataloaders.items():
            if phase == 'train':
                model_ft.train()
            else:
                model_ft.eval()
                
            running_loss = 0.0
            running_corrects = 0

            # Iterate over data
            for inputs, labels in dataloader:
                inputs = inputs.to(device)
                labels = labels.to(device)
                
                # why is it necessary?? 
                inputs.requires_grad = True
                
                # zero the parameter gradients
                optimizer.zero_grad()

                with torch.set_grad_enabled(phase == 'train'):
                
                    outputs = model_ft(inputs)
                    loss = criterion(outputs, labels)
                    preds = torch.max(outputs, 1)[1]

                    # backward + optimize only if in training phase
                    if phase == 'train':
                        loss.backward()
                        optimizer.step()

                    # statistics
                running_loss += loss.item() * inputs.size(0)
                running_corrects += torch.sum(preds == labels.data)
                
            epoch_loss = running_loss / len(dataloaders[phase].dataset)
            epoch_acc = running_corrects.double() / len(dataloaders[phase].dataset)
            print('{} Loss: {:.4f} Acc: {:.4f}'.format(phase, epoch_loss, epoch_acc))

            # deep copy the model
            if phase == 'val' and epoch_acc > best_acc:
                best_acc = epoch_acc
                best_model_ft_dict = copy.deepcopy(model_ft.state_dict())
                best_model_ft = model_ft
            if phase == 'val':
                val_acc_history.append(epoch_acc.item())
            if phase == 'train':
                train_acc_history.append(epoch_acc.item())

    time_elapsed = time.time() - since
    print('Training complete in {:.0f}m {:.0f}s'.format(time_elapsed // 60, time_elapsed % 60))
    print('Best val Acc: {:4f}'.format(best_acc))
    
    print('Saving best model...')
    torch.save({
    'train_loss': train_acc_history,
    'val_loss': val_acc_history,
    'model_dict': best_model_ft_dict
            }, './{}.pt'.format(save_location))
    
    return best_model_ft, train_acc_history, val_acc_history

In [57]:
def UNFROZEN_fine_tune_mnist(feature_extractor, fine_tune_model, dataloaders, epochs = 10):
    """
    model is a feature extractor (resnet).
    Create a new model which uses those features to finetune on MNIST
    
    return the fine_tune model
    """     
    # INSERT YOUR CODE:
    n_inputs = models.resnet18(pretrained=True).fc.in_features
    n_classes = 10
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
    fine_tune_model = init_fine_tune_model()
    feature_extractor, fine_tune_model = feature_extractor.to(device), fine_tune_model.to(device)         
    
    criterion = nn.CrossEntropyLoss()
    # Frozen First 
    print('Frist freeze feature extractor and only train finetune model for 1 epochs on', device) 
    print('\nOn feature extractor model')
    check_grad(feature_extractor)
    print('\nOn fine tune model')
    check_grad(fine_tune_model)
    optimizer = optim.SGD(fine_tune_model.parameters(), lr=0.001, momentum=0.9)
    best_model_ft, train_acc_history, val_acc_history = train_val_multiple(feature_extractor, 
                                                                     fine_tune_model, optimizer, 
                                                                     criterion, dataloaders,
                                                                     num_epochs=epochs,frozen = True)

    # Train on the both models
    print('Next unfreeze feature extractor and train both models for 1 epochs on', device) 
    unfreeze_model(feature_extractor)
    feature_extractor.ft = best_model_ft # add finetune to last layer
    check_grad(feature_extractor) # make sure all grads are now require_grad
    optimizer = optim.SGD(feature_extractor.parameters(), lr=0.001, momentum=0.9) # update the trainable parameters
    best_model_ft, train_acc_history, val_acc_history = train_val_single(feature_extractor, 
                                                                     optimizer, 
                                                                     criterion, 
                                                                     dataloaders,
                                                                     num_epochs= epochs)

    return best_model_ft, train_acc_history, val_acc_history

In [138]:
# sanity check
test_loader = dict()
test_loader['test'] = mnist_test

UNFROZEN_fine_tune_mnist(pretrained_resnet18, init_fine_tune_model, test_loader, 2)

TypeError: UNFROZEN_fine_tune_mnist() takes 3 positional arguments but 4 were given

### Grade UNFROZEN
Let's see if there's a difference in accuracy!

In [58]:
def grade_mnist_unfrozen():
    
    # init a ft model
    fine_tune_model = init_fine_tune_model()
    
    # run the transfer learning routine
    best_model_ft, train_acc_history, val_acc_history = UNFROZEN_fine_tune_mnist(pretrained_resnet18, 
                                                                               fine_tune_model, dataloaders)
    
    fine_tune_model.load_dict(unfreeze_model_ft['model_dict'])
    # calculate test accuracy
    test_accuracy, test_loss = calculate_mnist_test_accuracy(best_model_ft, _, mnist_test, frozen = False)
    
    # the real threshold will be released by Oct 11 
    assert test_accuracy > 0.0, 'your accuracy is too low...'
    
    return test_accuracy
    
unfrozen_test_accuracy = grade_mnist_unfrozen()

Frist freeze feature extractor and only train finetune model for 1 epochs on cuda:0

 On feature extractor model
Parameters to learn:
11,176,512 total parameters.
0 training parameters.

 On fine tune model
Parameters to learn:
	 bn.weight
	 bn.bias
	 fc.weight
	 fc.bias
12,298 total parameters.
12,298 training parameters.
Epoch 0/9
----------------------------------------


KeyboardInterrupt: 

In [75]:
assert unfrozen_test_accuracy > frozen_test_accuracy, 'the unfrozen model should be better'

In [None]:
def fine_tune_mnist(feature_extractor, fine_tune_model, dataloaders, num_epochs, frozen = True, save_point):
    """
    model is a feature extractor (resnet).
    Create a new model which uses those features to finetune on MNIST
    use frozen to train feature_extractor or not train the feature extractor 
    return the fine_tune model
    """         
    n_inputs = models.resnet18(pretrained=True).fc.in_features
    n_classes = 10

    # cpu/gpu
    device = torch.device('cuda:0' if torch.cuda.is_available() else 'cpu')
    
    # put model on device
    feature_extractor.train()
    fine_tune_model = fine_tune_model()
    if frozen: # keep feature extractor frozen
        freeze_model(feature_extractor)
    else: # train entire model
        unfreeze_model(feature_extractor)
    
    feature_extractor.fc = fine_tune_model
    feature_extractor = feature_extractor.to(device)
    parameters_to_train = checkgrad(feature_extractor)
    print('Training on', device)  
  
    optimizer = optim.SGD(parameters_to_train, lr=0.001, momentum=0.9)
    criterion = nn.CrossEntropyLoss()
    
    best_model_ft, train_acc_history, val_acc_history = train_val(feature_extractor, 
                                                                  optimizer, criterion, 
                                                                  dataloaders, num_epochs,
                                                                  save_point)
    
    return best_model_ft, train_acc_history, val_acc_history

In [None]:
# sanity check
_, _, _, = fine_tune_mnist(pretrained_resnet18, init_fine_tune_model, dataloaders, 2, frozen = True, 'test_true')

In [None]:
_, _, _, = fine_tune_mnist(pretrained_resnet18, init_fine_tune_model, dataloaders, 2, frozen = False, 'test_false')

--- 
# Question 2 (train a model on Wikitext-2)

Here we'll apply what we just learned to NLP. In this section we'll make our own feature extractor and pretrain it on Wikitext-2.

The WikiText language modeling dataset is a collection of over 100 million tokens extracted from the set of verified Good and Featured articles on Wikipedia. The dataset is available under the Creative Commons Attribution-ShareAlike License.

#### Part A
In this section you need to generate the training, validation and test split. Feel free to use code from your previous lectures.

In [76]:
from torchtext.datasets import WikiText2


def init_wikitext_dataset():
    """
    Fill in the details
    """
    wikitext_val = None # YOUR CODE HERE
    wikitext_train = None # YOUR CODE HERE
    wikitext_test = None # YOUR CODE HERE
    
    return wikitext_train, wikitext_val, wikitext_test

#### Part B   
Here we design our own feature extractor. In MNIST that was a resnet because we were dealing with images. Now we need to pick a model that can model sequences better. Design an RNN-based model here.

In [77]:
def init_feature_extractor():
    
    feature_extractor = None #  YOUR CODE
    
    return feature_extractor

#### Part C
Pretrain the feature extractor

In [79]:
def fit_feature_extractor(feature_extractor, wikitext_train, wikitext_val):
    # FILL IN THE DETAILS
    pass

#### Part D
Calculate the test perplexity on wikitext2. Feel free to recycle code from previous assignments from this class. 

In [78]:
def calculate_wiki2_test_perplexity(feature_extractor, wikitext_test):
    
    # FILL IN DETAILS
    
    return test_ppl

#### Let's grade your results!
(don't touch this part)

In [28]:
def grade_wikitext2():
    # load data
    wikitext_train, wikitext_val, wikitext_test = init_wikitext_dataset()

    # load feature extractor
    feature_extractor = init_feature_extractor()

    # pretrain using the feature extractor
    fit_feature_extractor(feature_extractor, wikitext_train, wikitext_val)

    # check test accuracy
    test_ppl = calculate_wiki2_test_perplexity(feature_extractor, wikitext_test)

    # the real threshold will be released by Oct 11 
    assert test_ppl < 10000, 'ummm... your perplexity is too high...'
    
grade_wikitext2()

---   
## Question 3 (fine-tune on MNLI)
In this question you will use your feature_extractor from question 2
to fine-tune on MNLI.

(From the website):
The Multi-Genre Natural Language Inference (MultiNLI) corpus is a crowd-sourced collection of 433k sentence pairs annotated with textual entailment information. The corpus is modeled on the SNLI corpus, but differs in that covers a range of genres of spoken and written text, and supports a distinctive cross-genre generalization evaluation. The corpus served as the basis for the shared task of the RepEval 2017 Workshop at EMNLP in Copenhagen.

MNLI has 3 genres (3 classes).
The goal of this question is to maximize the test accuracy in MNLI. 

### Part A
In this section you need to generate the training, validation and test split. Feel free to use code from your previous lectures.

In [80]:
from torchtext.datasets import MultiNLI

def init_mnli_dataset():
    """
    Fill in the details
    """
    mnli_val = None # TODO
    mnli_train = None # TODO
    mnli_test = None # TODO
    
    return mnli_train, mnli_val, mnli_test

### Part B
Here we again design a model for finetuning. Use the output of your feature-extractor as the input to this model. This should be a powerful classifier (up to you).

In [81]:
def init_finetune_model():
    
    # TODO FILL IN THE DETAILS
    fine_tune_model = ...
    
    return fine_tune_model

### Part C
Use the feature_extractor and your fine_tune_model to fine_tune MNLI

In [82]:
def fine_tune_mnli(feature_extractor, fine_tune_model, mnli_train, mnli_val):
    # YOUR CODE HERE
    pass

### Part D
Evaluate the test accuracy

In [83]:
def calculate_mnli_test_accuracy(feature_extractor, fine_tune_model, mnli_test):
    
    # YOUR CODE HERE...
    
    return test_ppl

### Let's grade your results

In [55]:
def grade_mnli():
    # load data
    mnli_train, mnli_val, mnli_test = init_mnli_dataset()

    # no need to load feature extractor because it is fine-tuned
    feature_extractor = feature_extractor

    # init the fine_tune model
    fine_tune_model = init_finetune_model()
    
    # finetune
    fine_tune_mnli(feature_extractor, fine_tune_model, mnli_train, mnli_val)

    # check test accuracy
    test_accuracy = calculate_mnli_test_accuracy(feature_extractor, wikitext_test)

    # the real threshold will be released by Oct 11 
    assert test_ppl > 0.00, 'ummm... your accuracy is too low...'
    
grade_mnli()

---  
### Question 4 (BERT)

A major direction in research came from a model called BERT, released last year.  

In this question you'll use BERT as your feature_extractor instead of the model you
designed yourself.

To get BERT, head on over to (https://github.com/huggingface/transformers) and load your BERT model here

In [None]:
!pip install transformers

### Part A (init BERT)
In this section you need to create an instance of BERT and return if from the function

In [87]:
from transformers import BertTokenizer, BertModel, BertForMaskedLM

def init_bert():
    
    BERT = None # ... YOUR CODE HERE
    
    return BERT

## Part B (fine-tune with BERT)

Use BERT as your feature extractor to finetune MNLI. Use a new finetune model (reset weights).

In [88]:
def fine_tune_mnli_BERT(BERT_feature_extractor, fine_tune_model, mnli_train, mnli_val):
    # YOUR CODE HERE
    pass

## Part C
Evaluate how well we did

In [90]:
def calculate_mnli_test_accuracy_BERT(feature_extractor, fine_tune_model, mnli_test):
    
    # YOUR CODE HERE...
    
    return test_ppl

## Let's grade your BERT results!

In [91]:
def grade_mnli_BERT():
    BERT_feature_extractor = init_bert()
    
    # load data
    mnli_train, mnli_val, mnli_test = init_mnli_dataset()

    # init the fine_tune model
    fine_tune_model = init_finetune_model()
    
    # finetune
    fine_tune_mnli(BERT_feature_extractor, fine_tune_model, mnli_train, mnli_val)

    # check test accuracy
    test_accuracy = calculate_mnli_test_accuracy(feature_extractor, wikitext_test)
    
    # the real threshold will be released by Oct 11 
    assert test_ppl > 0.0, 'ummm... your accuracy is too low...'
    
grade_mnli_BERT()