# 01: Transfer Learning

Welcome to the course on Deep Learning with Multiple Objectives!

<img width=400 src="https://www.dropbox.com/s/unl81onqf0kz02w/fig2.png?dl=1">

## Plan for today

1. Go through course outline, logistics, etc.
2. Introduction 
  * Walk through PyTorch basics to make sure we are on the same page
3. Transfer Learning
  * Excite you about transfer learning
4. Homework

Next time: implementing a Transformer from scratch and discussing mini-project ideas!

## Questions?

Do you have any questions about the lecture? (Probably not yet, but we will try to always start with a discussion of the questions you had about the papers and the lecture.)

# 1. Course logistics


(See README.md on GitHub.)

# 2. Introduction to "Static World"

The objective is to get familiar with PyTorch and the concept (and the importance) of transfer learning.

The traditional approach to solving classification tasks is to obtain large quantities of supervised data (ImageNet has ~14M images) and train a machine learning model on it.

<center><img src="https://i0.wp.com/semiengineering.com/wp-content/uploads/2019/10/Synopsys_computer-vision-processors-EV7-Fig2-ImageNet.jpeg?ssl=1"></center>

What if we want to quickly learn based on a small sample of data?

<center><img src="https://www.dropbox.com/s/7gvah7w7h1jvkcm/fig1.png?dl=1"></center>

How did we manage to complete this task? We leverage prior experience. We know we are in the *Static World*. 

Transfer learning is usually defined as applying knowledge gained in one task to another task. As a research agenda, it is at the essence of why deep learning is important: deep learning enables us to acquire models that generalize broadly.
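To make the definition concrete, here is a minimal sketch on a toy model. Everything in it is hypothetical and chosen for illustration only: the layer sizes, the random stand-in data, and the names `source_body` and `target_model`. It shows one common way of "applying knowledge": copying the weights of a feature extractor trained on a source task into a model for a target task with a different number of classes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical "source" model: a feature extractor (body) plus a 5-class head.
source_body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
source_model = nn.Sequential(source_body, nn.Linear(16, 5))
# ... imagine source_model was trained on a large source task here ...

# Target task with 3 classes: reuse the body weights, attach a fresh head.
target_body = nn.Sequential(nn.Linear(8, 16), nn.ReLU())
target_body.load_state_dict(source_body.state_dict())  # transfer the knowledge
target_model = nn.Sequential(target_body, nn.Linear(16, 3))

# Fine-tune on a small sample of target data (random stand-in data here).
x = torch.randn(12, 8)
y = torch.randint(0, 3, (12,))
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(target_model.parameters(), lr=0.1)

initial_loss = criterion(target_model(x), y).item()
for _ in range(50):
    optimizer.zero_grad()
    loss = criterion(target_model(x), y)
    loss.backward()
    optimizer.step()
final_loss = criterion(target_model(x), y).item()

print(f"loss before fine-tuning: {initial_loss:.3f}, after: {final_loss:.3f}")
```

With random data the "knowledge" is of course meaningless; the point is only the mechanics of reusing a body and replacing the head.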

<center><img src="https://www.dropbox.com/s/v1trzeu5dw2834x/fig3.png?dl=1"></center>

Partially adapted from https://cs330.stanford.edu/slides/cs330_multitask_transfer_2020.pdf.

# 3. Setup

Please walk through these steps.

In [None]:
# 0. Clone, install & configure some software

! git clone https://github.com/asyml/vision-transformer-pytorch
! pip install dotmap

import sys
sys.path.append("vision-transformer-pytorch")
sys.path.append("vision-transformer-pytorch/src")

In [None]:
# 1. Generic imports

import torch
import torchvision
import torchvision.transforms as transforms
import matplotlib.pyplot as plt
import numpy as np
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

# PyTorch requires specifying where computation happens
device = "cuda" if torch.cuda.is_available() else "cpu"

import gc
import os

from dotmap import DotMap

In [None]:
# 2. Imports from Vision Transformer repository

from utils import setup_device, accuracy, MetricTracker, TensorboardWriter
from src.model import VisionTransformer
from src.config import get_b16_config, get_train_config
from src.checkpoint import load_checkpoint
from src.data_loaders import *

In [None]:
# 3. Finally, make sure you have ViT pretrained weights

"""
Please grab imagenet21k+imagenet2012_ViT-B_16.pth from https://drive.google.com/drive/folders/1azgrD1P413pXLJME0PjRRU-Ez-4GWN-S 
Two ways:

  * (recommended) create copy of imagenet21k+imagenet2012_ViT-B_16.pth from https://drive.google.com/drive/folders/1azgrD1P413pXLJME0PjRRU-Ez-4GWN-S and link your google drive
  * (discouraged) you can download using "!wget https://www.dropbox.com/s/ihi643hkcer2cu5/imagenet21k%2Bimagenet2012_ViT-B_16.pth?dl=1"
"""

# 4. PyTorch Introduction

Quick question: How many of you have trained a deep convolutional neural network using PyTorch or a similar framework?

In [None]:
# 1. Visualize dataset

transform = transforms.Compose(
    [transforms.ToTensor(),
     transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

trainset = torchvision.datasets.CIFAR10(root='./data', train=True,
                                        download=True, transform=transform)
trainloader = torch.utils.data.DataLoader(trainset, batch_size=4,
                                          shuffle=True, num_workers=2)

testset = torchvision.datasets.CIFAR10(root='./data', train=False,
                                       download=True, transform=transform)
testloader = torch.utils.data.DataLoader(testset, batch_size=4,
                                         shuffle=False, num_workers=2)

classes = ('plane', 'car', 'bird', 'cat',
           'deer', 'dog', 'frog', 'horse', 'ship', 'truck')

# get some random training images
dataiter = iter(trainloader)
images, labels = next(dataiter)  # iterator.next() is not available in recent PyTorch versions

## WRITEME Show first 4 images and their labels. ##

In [None]:
# 2. Define the model

class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.pool = nn.MaxPool2d(2, 2)
        self.conv2 = nn.Conv2d(6, 16, 5)
        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.fc2 = nn.Linear(120, 84)
        self.fc3 = ## WRITEME ##

    def forward(self, x):
        x = self.pool(F.relu(self.conv1(x)))
        x = self.pool(F.relu(self.conv2(x)))
        x = x.view(-1, 16 * 5 * 5)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        x = self.fc3(x)
        return x

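net = Net()
# Sanity check on a dummy CIFAR-10-sized batch: the network must output 10 class scores
output = net(torch.zeros(1, 3, 32, 32))
assert output.shape[-1] == 10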
print("Congratulations!")

In [None]:
# 3. Train the network

criterion = nn.CrossEntropyLoss()
optimizer = optim.SGD(net.parameters(), lr=0.001, momentum=0.9)

for epoch in range(5):  # loop over the dataset multiple times
    running_loss = 0.0
    for i, data in enumerate(trainloader, 0):
        # get the inputs; data is a list of [inputs, labels]
        inputs, labels = data

        # zero the parameter gradients
        optimizer.zero_grad()

        ## WRITEME ##

        # print statistics
        running_loss += loss.item()
        if i % 6000 == 5999:
            print('[%d, %5d] loss: %.3f' %
                  (epoch + 1, i + 1, running_loss / 6000))
            running_loss = 0.0

print('Finished Training')

In [None]:
# 4. Visualize predictions

dataiter = iter(testloader)
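images, labels = next(dataiter)

# run the trained network on the batch; the highest score gives the predicted class
outputs = net(images)
_, predicted = torch.max(outputs, 1)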

print('Predicted: ', ' '.join('%5s' % classes[predicted[j]]
                              for j in range(4)))

In [None]:
# 5. Calculate accuracy 

correct = 0
total = 0
with torch.no_grad():
    for data in testloader:
        images, labels = data
        outputs = net(images)
        _, predicted = torch.max(outputs.data, 1)
        total += labels.size(0)
        correct += (predicted == labels).sum().item()

print('Accuracy of the network on the 10000 test images: %d %%' % (
    100 * correct / total))

assert correct / total > 0.5
print("Congratulations!")

(The above was based on https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html, but do not look up the answer as it defeats the purpose of the exercise to get you on the same page.)
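The `16 * 5 * 5` input size of `fc1` in the model above can look arbitrary. The following sketch shows the arithmetic behind it, using the standard output-size formula for convolution and pooling layers without dilation. The helper name `conv_out` is made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial output size of a convolution or pooling layer (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1

size = 32                                  # CIFAR-10 images are 32x32
size = conv_out(size, kernel=5)            # conv1: 32 -> 28
size = conv_out(size, kernel=2, stride=2)  # pool:  28 -> 14
size = conv_out(size, kernel=5)            # conv2: 14 -> 10
size = conv_out(size, kernel=2, stride=2)  # pool:  10 -> 5

flat_features = 16 * size * size           # conv2 produces 16 channels
print(size, flat_features)                 # prints: 5 400
```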

# 5. Transfer Learning: Fine-tune a SOTA model

The Vision Transformer (ViT) is a model recently proposed by Google Brain [quick "whiteboard" presentation].

Please complete the code below, and then complete the exercises.

(Based on https://github.com/asyml/vision-transformer-pytorch)
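As a companion to the whiteboard presentation, here is a rough sketch of how a ViT turns an image into a token sequence. It assumes the usual ViT-B/16 values (16x16 patches, embedding dimension 768) and the 384x384 input size used in the config below. It is not the code from the repository, just an illustration of the shapes involved.

```python
import torch
import torch.nn as nn

image_size, patch_size, emb_dim = 384, 16, 768  # assumed ViT-B/16 values

patches_per_side = image_size // patch_size      # 24
num_patches = patches_per_side ** 2              # 576
seq_len = num_patches + 1                        # +1 for the class token

# Cut a dummy batch into non-overlapping patches and flatten each patch.
x = torch.zeros(2, 3, image_size, image_size)
patches = x.unfold(2, patch_size, patch_size).unfold(3, patch_size, patch_size)
# patches: (batch, channels, 24, 24, 16, 16)
patches = patches.permute(0, 2, 3, 1, 4, 5).reshape(2, num_patches, -1)

# Linearly project each flattened patch and prepend a class token.
projection = nn.Linear(3 * patch_size * patch_size, emb_dim)
cls_token = torch.zeros(2, 1, emb_dim)
tokens = torch.cat([cls_token, projection(patches)], dim=1)

print(tokens.shape)  # torch.Size([2, 577, 768])
```

The sequence length grows quadratically with image size, which is one reason a 384x384 input with a small batch size can already fill up GPU memory.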

In [None]:
# 1. Define Config

# Some defaults taken from README.md in https://github.com/asyml/vision-transformer-pytorch
config = get_b16_config(DotMap())
config.image_size = 384
config.num_classes = 10
config.lr = 0.005
config.warmup_steps = 10 
config.wd = 0.0
config.train_steps_per_epoch = 100
config.epochs = 3
config.batch_size = 8
config.num_workers = 1
config.data_dir = "data"
# Make sure this path exists!
config.checkpoint_path = "drive/imagenet21k+imagenet2012_ViT-B_16.pth" 
assert len(config.checkpoint_path) == 0 or os.path.exists(config.checkpoint_path)
config.dataset = "CIFAR10"

In [None]:
# 2. Define model

# Warning: do not create more than a single instance. Otherwise the GPU might run out of memory.
model = VisionTransformer(
          image_size=(config.image_size, config.image_size),
          patch_size=(config.patch_size, config.patch_size),
          emb_dim=config.emb_dim,
          mlp_dim=config.mlp_dim,
          num_heads=config.num_heads,
          num_layers=config.num_layers,
          num_classes=config.num_classes,
          attn_dropout_rate=config.attn_dropout_rate,
          dropout_rate=config.dropout_rate)
_ = model.to(device)

In [None]:
# 3. Get dataloaders

train_dataloader = eval("{}DataLoader".format(config.dataset))(
                data_dir=os.path.join(config.data_dir, config.dataset),
                image_size=config.image_size,
                batch_size=config.batch_size,
                num_workers=config.num_workers,
                split='train')
valid_dataloader = eval("{}DataLoader".format(config.dataset))(
                data_dir=os.path.join(config.data_dir, config.dataset),
                image_size=config.image_size,
                batch_size=config.batch_size,
                num_workers=config.num_workers,
                split='val')

In [None]:
# 4. Load in checkpoint
if config.checkpoint_path:
    state_dict = load_checkpoint(config.checkpoint_path)
    print(state_dict['classifier.weight'].size(0))
    if config.num_classes != state_dict['classifier.weight'].size(0):
        del state_dict['classifier.weight']
        del state_dict['classifier.bias']
        print("re-initialize fc layer")
        model.load_state_dict(state_dict, strict=False)
    else:
        model.load_state_dict(state_dict)
    print("Load pretrained weights from {}".format(config.checkpoint_path))

In [None]:
# 5. Is everything fine?
x, y = next(iter(train_dataloader))
x = x.to(device)
y_pred = model(x[0:4])
assert y_pred is not None
print("Worked!")
del y_pred

In [None]:
# 6. Train and evaluate the model 

# this is important in colab - it keeps a single session that can fill up GPU memory
gc.collect(); torch.cuda.empty_cache()

# training criterion
criterion = nn.CrossEntropyLoss()

# create optimizers and learning rate scheduler
optimizer = torch.optim.SGD(
    params=model.parameters(),
    lr=config.lr,
    weight_decay=config.wd,
    momentum=0.9)
lr_scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer=optimizer,
    max_lr=config.lr,
    pct_start=config.warmup_steps / (config.train_steps_per_epoch * (config.epochs + 1)),
    total_steps=config.train_steps_per_epoch * (config.epochs + 1))

# some boilerplate
metric_names = ['loss', 'acc1']
writer = TensorboardWriter(".", config.tensorboard)
train_metrics = MetricTracker(*metric_names, writer=writer)
valid_metrics = MetricTracker(*metric_names, writer=writer)

# start training
print("start training")
best_acc = 0.0

def train_epoch(epoch, model, data_loader, criterion, optimizer, lr_scheduler, metrics, device=torch.device('cpu'), steps=np.inf):
    metrics.reset()

    # training loop
    for batch_idx, (batch_data, batch_target) in enumerate(data_loader):
        batch_data = batch_data.to(device)
        batch_target = batch_target.to(device)

        optimizer.zero_grad()
        batch_pred = model(batch_data)
        loss = criterion(batch_pred, batch_target)
        loss.backward()
        optimizer.step()
        lr_scheduler.step()

        acc1, = accuracy(batch_pred, batch_target, topk=(1,))

        metrics.writer.set_step((epoch - 1) * len(data_loader) + batch_idx)
        metrics.update('loss', loss.item())
        metrics.update('acc1', acc1.item())

        if batch_idx % 10 == 0:
            print("Train Epoch: {:03d} Batch: {:05d}/{:05d} Loss: {:.4f} Acc@1: {:.2f}"
                    .format(epoch, batch_idx, min(len(data_loader), steps), loss.item(), acc1.item()))
            

        if batch_idx > steps:
          break
    return metrics.result()


def valid_epoch(epoch, model, data_loader, criterion, metrics, device=torch.device('cpu'), steps=np.inf):
    metrics.reset()
    losses = []
    acc1s = []
    # validation loop
    with torch.no_grad():
        for batch_idx, (batch_data, batch_target) in enumerate(data_loader):
            batch_data = batch_data.to(device)
            batch_target = batch_target.to(device)

            batch_pred = model(batch_data)
            loss = criterion(batch_pred, batch_target)
            acc1,  = accuracy(batch_pred, batch_target, topk=(1,))

            losses.append(loss.item())
            acc1s.append(acc1.item())

            if batch_idx > steps:
              break

    loss = np.mean(losses)
    acc1 = np.mean(acc1s)
    metrics.writer.set_step(epoch, 'valid')
    metrics.update('loss', loss)
    metrics.update('acc1', acc1)
    return metrics.result()

for epoch in range(1, config.epochs + 1):
    log = {'epoch': epoch}

    # train the model
    model.train()
    result = train_epoch(epoch, model, train_dataloader, criterion, optimizer, lr_scheduler, train_metrics, device, steps=config.train_steps_per_epoch)
    log.update(result)
    print("Finished epoch")

    # validate the model
    model.eval()
    result = valid_epoch(epoch, model, valid_dataloader, criterion, valid_metrics, device, steps=50)
    log.update(**{'val_' + k: v for k, v in result.items()})

    # print logged information to the screen
    for key, value in log.items():
        print('    {:15s}: {}'.format(str(key), value))


Fine-tuning poses slightly different challenges than training a model from scratch. The following exercises are designed to give you an opportunity to experiment with fine-tuning.

When you are done, please raise your hand, and you can go.
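One fine-tuning-specific detail is the checkpoint-loading step used above: the pretrained classifier has a different number of outputs than we need, so its weights are dropped and the rest is loaded with `strict=False`. The sketch below reproduces this on a toy model. `TinyNet` and its layer sizes are hypothetical stand-ins, not the real ViT.

```python
import torch
import torch.nn as nn

class TinyNet(nn.Module):
    """Hypothetical stand-in for a pretrained model with a classifier head."""
    def __init__(self, num_classes):
        super().__init__()
        self.body = nn.Linear(8, 16)
        self.classifier = nn.Linear(16, num_classes)

pretrained = TinyNet(num_classes=1000)   # pretend this came from a checkpoint
state_dict = pretrained.state_dict()

model = TinyNet(num_classes=10)
# The head shapes differ (1000 vs. 10 classes), so drop the pretrained head ...
del state_dict["classifier.weight"]
del state_dict["classifier.bias"]
# ... and let strict=False tolerate the keys that are now missing.
result = model.load_state_dict(state_dict, strict=False)

print(result.missing_keys)      # ['classifier.weight', 'classifier.bias']
print(result.unexpected_keys)   # []
```

Inspecting `missing_keys` is a cheap way to confirm that only the classifier was re-initialized and that nothing else silently failed to load.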

## Exercise 0: Train without pretraining

Try training the model without using the checkpoint. Did it work?

## Exercise 1: Improve the result

Tune different hyperparameters (learning rate? number of epochs? ...) to achieve better performance than 93%. Can you beat state-of-the-art methods on CIFAR-10 that train only on CIFAR-10?
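Before changing the learning rate or the number of epochs, it may help to look at what the schedule actually does. The sketch below rebuilds the `OneCycleLR` schedule with the same values as the config above, but on a dummy parameter instead of the real model, and records the learning rate at every step.

```python
import torch

warmup_steps, steps_per_epoch, epochs, max_lr = 10, 100, 3, 0.005
total_steps = steps_per_epoch * (epochs + 1)

dummy = torch.nn.Parameter(torch.zeros(1))  # stand-in for model.parameters()
optimizer = torch.optim.SGD([dummy], lr=max_lr, momentum=0.9)
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer,
    max_lr=max_lr,
    pct_start=warmup_steps / total_steps,   # fraction of steps spent warming up
    total_steps=total_steps)

# Record the learning rate used at each training step.
lrs = [optimizer.param_groups[0]["lr"]]
for _ in range(total_steps - 1):
    optimizer.step()     # no gradients here, so the parameter is not updated
    scheduler.step()
    lrs.append(optimizer.param_groups[0]["lr"])

peak_step = lrs.index(max(lrs))
print(f"start lr: {lrs[0]:.6f}, peak lr: {max(lrs):.6f} at step {peak_step}, "
      f"final lr: {lrs[-1]:.2e}")
```

With the default `OneCycleLR` settings, the learning rate starts well below `max_lr`, rises during the short warmup, and then decays for the rest of training, so changing `config.lr` or `config.warmup_steps` reshapes the whole curve. Plotting `lrs` with matplotlib makes this easy to see.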

Please write your results on the "whiteboard". This is not a competition, of course; I am just curious which tricks are important. Fine-tuning large pretrained models is tricky (spoiler alert: this is one idea for a project).

## Exercise 2: Training only the final classifier

A common strategy in fine-tuning is to only unfreeze a subset of the weights. Train a variant of the model where you fine-tune only the final classifier. What performance did it reach?


# 6. Homework

* (1p) Read https://machinelearningmastery.com/transfer-learning-for-deep-learning/

* (2p) Read https://arxiv.org/pdf/2005.14165.pdf (you can skip Sections 4 and 6; feel free to skim some of the results) and https://arxiv.org/abs/2010.11929.

  Then, for each paper, write a concise summary that answers the following questions (note: you should read the whole paper, not only answer these questions!):

    * What problem did the authors propose to tackle? (Pay close attention to the Introduction.)

    * What was the previously popular approach to solving this problem?

    * What is the method, and what is the motivation for using it?

    * What are the topics for future work?

    * What question would you ask the authors? (We will discuss some during the next class.)

  Please send the summaries to my email.

  Fair warning: You might be asked to present your answers.

* Find a group of 2 to 3. This is the group you will work with throughout the semester on the mini-project and the final project. It is fine to change groups after the mini-project.

* If you are unfamiliar with PyTorch, please walk through https://pytorch.org/tutorials/beginner/deep_learning_60min_blitz.html in detail.