# An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

[Vision Transformer (ViT)](https://arxiv.org/abs/2010.11929), proposed in 2020, bridges the gap in model architecture between computer vision (CV) and natural language processing (NLP). We implemented ViT ourselves and ran a few experiments and explorations here.

This notebook is organized as follows:
- train a tiny Vision Transformer image classifier from scratch on CIFAR-10 / Tiny ImageNet-200
- visualize the attention maps
- show the attention distance (receptive field) of each head across layers

In [1]:
import torch
import einops
import numpy as np
import matplotlib.pyplot as plt

from paperlab.zoo import vit
from paperlab.core import Config
from torch.utils.data import Subset, DataLoader

import os
os.environ['KMP_DUPLICATE_LIB_OK'] = 'True'

In [2]:
params = {
    'image_size': (64, 64),
    'patch_size': (4, 4),
    'num_channel': 3,
    'pool': 'cls',
    'num_class': 10,
#     'use_dataset': 'tiny-imagenet-200',
    'use_dataset': 'cifar10',

    'transformer.depth': 4,
    'transformer.dim': 128,
    'transformer.dropout': 0.1,
    'transformer.emb_dropout': 0.,
    'transformer.num_head': 4,
    'transformer.dim_head': 64,
    'transformer.dim_mlp': 128,

    'learning.batch_size': 16,
    'learning.lr': 5e-4,
    'learning.num_epoch': 100,
    'learning.early_stop_patience': 10,
    
    'validate_freq': 10000,
}

config = Config(**params)

In [3]:
model = vit.train(config)

number of model parameter: 699530
Files already downloaded and verified
Files already downloaded and verified
torch.Size([16, 3, 32, 32]) torch.Size([16]) tensor([[[0.9176, 0.9451, 0.9765,  ..., 0.4235, 0.5569, 0.6392],
         ...
[remainder of the debug tensor printout omitted]

RuntimeError: CUDA error: CUBLAS_STATUS_NOT_INITIALIZED when calling `cublasCreate(handle)`

In [None]:
def scale(attn):
    # Min-max normalize the attention map to [0, 1]; assumes attn is not constant (max > min).
    return (attn - attn.min()) / (attn.max() - attn.min())

We visualize the internal attention map using the [attention rollout](https://arxiv.org/abs/2005.00928) algorithm.

The computation proceeds as follows:
- take the head-averaged attention weights of each layer in the model
- following *attention rollout*, recursively multiply the averaged attention matrices of successive layers to obtain the attention flow from each output unit at the last layer to each input token at the first layer
- take the attention flow of the `[CLS]` token, i.e. the aggregate attention score with which the `[CLS]` token at the final layer queries each image patch token at the input layer (the input `[CLS]` token is dropped)
- renormalize the attention flow, since the `[CLS]`-`[CLS]` query-key pair has been removed

The highlighted area in the image is where the `[CLS]` token places the most attention weight, i.e. the region that contributes the most information to the classification decision.
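The rollout steps listed above can be sketched in a few lines of NumPy. This is a minimal illustration, not the `paperlab` implementation: the function name `attention_rollout` and the `add_residual` flag are hypothetical, token 0 is assumed to be `[CLS]`, and the optional mixing with the identity matrix follows the way the attention rollout paper accounts for residual connections, which the notebook's own code may or may not do.

```python
import numpy as np

def attention_rollout(attn_per_layer, add_residual=True):
    """Sketch of attention rollout.

    attn_per_layer: list of arrays shaped (num_head, num_token, num_token),
    ordered from the first layer to the last; token 0 is assumed to be [CLS].
    Returns the normalized [CLS]-to-patch attention flow, shape (num_token - 1,).
    """
    num_token = attn_per_layer[0].shape[-1]
    rollout = np.eye(num_token)
    for attn in attn_per_layer:
        a = attn.mean(axis=0)                        # average over heads
        if add_residual:                             # assumption: model residual connections
            a = 0.5 * a + 0.5 * np.eye(num_token)
        a = a / a.sum(axis=-1, keepdims=True)        # keep rows summing to 1
        rollout = a @ rollout                        # accumulate layer by layer
    cls_to_patches = rollout[0, 1:]                  # [CLS] row, input [CLS] column dropped
    return cls_to_patches / cls_to_patches.sum()     # renormalize over patch tokens

# Toy usage: 4 layers, 4 heads, 64 patch tokens + [CLS] (a 32x32 image with 4x4 patches).
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 4, 65, 65))
attn = np.exp(logits) / np.exp(logits).sum(axis=-1, keepdims=True)  # row-wise softmax
flow = attention_rollout(list(attn))
attn_grid = flow.reshape(8, 8)  # one score per patch, laid out on the patch grid
```

To overlay `attn_grid` on an image, it still has to be upsampled to the image resolution, which this sketch leaves out.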

In [None]:
num_row, num_col = 4, 4

val_dataset = vit.get_cifar10('val')
# randomly sample `num_row * num_col` images from the validation set
sample_indices = np.random.permutation(np.arange(len(val_dataset)))[: num_row * num_col]
sampled_subset = Subset(val_dataset, sample_indices)

# feed the sampled images through the model and get their attention maps
attn_maps, images = vit.get_attention_maps(model, 
                                           DataLoader(sampled_subset, batch_size=config.learning.batch_size))

# each sample takes two columns (masked image, original); figsize is (width, height)
f, ax = plt.subplots(num_row, num_col * 2, 
                     figsize=(num_col * 2, num_row),
                     gridspec_kw={'wspace': 0, 'hspace': 0}, 
                     squeeze=True)

for i in range(num_row):
    for j in range(num_col):
        idx = i * num_col + j  # the attention map and the image must share the same sample index
        attn_map, image = attn_maps[idx], images[idx]
        attend_image = scale(torch.unsqueeze(attn_map, 0)) * image
        
        ax[i, 2 * j].imshow(einops.rearrange(attend_image, 'c h w -> h w c').cpu(), aspect=1)  # show image masked with attention
        ax[i, 2 * j + 1].imshow(einops.rearrange(image, 'c h w -> h w c').cpu(), aspect=1)  # show original image

We also show the attention distance of each head across layers. Heads in deeper layers tend to have a greater attention distance, i.e. a larger receptive field, which is similar to the behaviour of CNN architectures. The mean attention distance is computed as follows:
- get the attention map of each head at each layer, drop the `[CLS]` token and renormalize the attention scores over the image patch tokens
- compute pixel-level attention scores, i.e. the attention value of each patch divided by its number of pixels
- compute the mean Euclidean distance between any two pixels, weighted by the pixel attention scores
- average the attention distance over the sampled images
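The steps above can be sketched for a single head and a single image as follows. This is an illustrative NumPy version, not the code behind `vit.get_attention_distance`: the function name `mean_attention_distance` is hypothetical, and it assumes square images and patches, token 0 being `[CLS]`, and patch tokens ordered row by row over the patch grid.

```python
import numpy as np

def mean_attention_distance(attn, image_size, patch_size):
    """Sketch of the mean attention distance (in pixels) for one head.

    attn: (num_token, num_token) attention matrix; token 0 is assumed to be [CLS]
    and the remaining tokens are assumed to be patches in row-major order.
    """
    grid = image_size // patch_size
    # Drop [CLS] and renormalize each query row over the patch tokens.
    patch_attn = attn[1:, 1:]
    patch_attn = patch_attn / patch_attn.sum(axis=-1, keepdims=True)

    # Pixel coordinates and the index of the patch each pixel belongs to.
    ys, xs = np.meshgrid(np.arange(image_size), np.arange(image_size), indexing='ij')
    coords = np.stack([ys.ravel(), xs.ravel()], axis=-1).astype(float)
    patch_idx = (ys.ravel() // patch_size) * grid + (xs.ravel() // patch_size)

    # Spread each patch's attention evenly over its pixels.
    pixel_attn = patch_attn[patch_idx[:, None], patch_idx[None, :]] / patch_size ** 2
    # Pairwise Euclidean distance between pixels.
    dist = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    # Attention-weighted distance per query pixel, averaged over all query pixels.
    return (pixel_attn * dist).sum(axis=-1).mean()

# Toy usage on an 8x8 image with 2x2 patches (16 patch tokens + [CLS]).
num_token = 1 + (8 // 2) ** 2
local_attn = np.eye(num_token)                                   # each patch attends only to itself
uniform_attn = np.full((num_token, num_token), 1.0 / num_token)  # each patch attends everywhere
local = mean_attention_distance(local_attn, 8, 2)
spread = mean_attention_distance(uniform_attn, 8, 2)
```

A head that attends only locally yields a small distance, while uniform attention yields a distance on the order of the image size. Averaging the result over sampled images, per head and per layer, gives the kind of quantity plotted below.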

In [None]:
N = 128
sample_indices = np.random.permutation(np.arange(len(val_dataset)))[: N]
sampled_subset = Subset(val_dataset, sample_indices)

attn_distance = vit.get_attention_distance(model, 
                                           DataLoader(sampled_subset, batch_size=config.learning.batch_size))

for head in range(config.transformer.num_head):
    dist_by_layer = attn_distance[:, head].cpu()
    plt.scatter(range(config.transformer.depth), dist_by_layer, label=f"head {head}")

plt.legend()
plt.xlabel('layer')
plt.ylabel('mean attention distance (pixels)')