# Tutorial 4.2: Compact Convolutional Transformer

Author: [René Larisch](mailto:rene.larisch@informatik.tu-chemnitz.de)

[Hassani et al. (2021)](https://arxiv.org/abs/2104.05704) published the Compact Convolutional Transformer (CCT), which attempts to address two of the Vision Transformer's (ViT) main drawbacks: the large number of parameters required for processing patches and the substantial amount of data needed for training.

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f'PyTorch version: {torch.__version__} running on {device}')

import os, sys
notebook_dir = os.getcwd()
root_path = os.path.abspath(os.path.join(notebook_dir, ".."))
if root_path not in sys.path:
    sys.path.append(root_path)
    print(f"Added {root_path} to sys.path")

from Utils.dataloaders import prepare_imagenette
from Utils.little_helpers import timer, set_seed, get_parameters
from Utils.functions import train_model, evaluate_model, test_model
from Utils.plotting import visualize_training_results, visualize_test_results

set_seed(42)
    

<div align="center">
    <img src="figures/CCT_0.png" width="750"/>
    <p><i>Figure 1: CCT feeds the output of convolutional layers into the Transformer blocks, instead of patches.</i></p>
</div>

To achieve this, the CCT employs a block of convolutional layers and max-pooling operations. The output of these layers is fed into the Transformer encoder (see Fig. 1).
These representations can be understood as **convolutional tokenization**, which replaces the patch embedding from the ViT. Additionally, the convolutional block encapsulates local information and reduces the dimensionality of the input to the encoder. This results in fewer parameters and reduced computation time with similar performance.

CCT configurations are named with a notation similar to the one used for the ViT, extended by the kernel size and the number of convolutional layers. For example, CCT-7/3x2 has seven Transformer encoder layers and two convolutional layers with a 3x3 kernel size.
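
As a small illustration of this naming scheme, the sketch below splits a variant name into its three numbers. The helper `parse_cct_name` and its dictionary keys are hypothetical and not part of the notebook or of the original CCT code.

```python
import re

def parse_cct_name(name):
    """Split a name such as 'CCT-7/3x2' into its three numbers (hypothetical helper)."""
    match = re.fullmatch(r"CCT-(\d+)/(\d+)x(\d+)", name)
    if match is None:
        raise ValueError(f"not a CCT variant name: {name!r}")
    layers, kernel, convs = (int(g) for g in match.groups())
    return {"encoder_layers": layers, "kernel_size": kernel, "conv_layers": convs}

print(parse_cct_name("CCT-7/3x2"))
```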

<div align="center">
    <img src="figures/comparision.png" style="width:90%"/>
    <p><i>Figure 2: Left: CIFAR-10 accuracy vs. model size (models with fewer than 12 million parameters). CCT* was trained longer. Right: ImageNet Top-1 validation accuracy comparison (no extra data or pretraining). MACs are the number of <a href="https://en.wikipedia.org/wiki/Multiply%E2%80%93accumulate_operation">Multiply-Accumulate Operations</a>, where each MAC consists of one multiplication and one addition.</i></p>
</div>



## Convolutional Tokenizer

Unlike ViT, which uses patch embedding, CCT uses a convolutional block and employs its output as tokens for the Transformer encoder.
We implemented the convolutional tokenizer as a class that inherits from [nn.Module](https://pytorch.org/docs/stable/generated/torch.nn.Module.html). 
The block consists of one or more convolutional layers, with a max-pooling layer following each one.
The constructor therefore takes the convolution parameters (kernel size, stride, and padding), the parameters of the optional max-pooling layers, the number of convolutional layers, and the number of feature maps.

In [None]:
class Tokenizer(nn.Module):
    def __init__(self, kernel_size, stride, padding, pooling_kernel_size=3,
                 pooling_stride = 2, pooling_padding =1, n_conv_layers = 1,
                 n_input_channels = 3, n_output_channels = 64, in_planes = 64,
                 activation=None, max_pool=True, conv_biase = False):
        super(Tokenizer, self).__init__()

        n_filter_list = [n_input_channels] + \
                        [in_planes for _ in range(n_conv_layers -1)] + \
                        [n_output_channels]
        
        ## use nn.Sequential to create the convolutional blocks
        ## each block consists of a 2D-convolutional layer, an activation function and a 2D-max-pooling operation
        self.conv_layers = nn.Sequential(
            *[nn.Sequential(
                nn.Conv2d(n_filter_list[i], n_filter_list[i+1],
                          kernel_size=(kernel_size,kernel_size),
                          stride = (stride,stride),
                          padding= (padding,padding), 
                           bias = conv_biase ),
                nn.Identity() if activation is None else activation(),
                nn.MaxPool2d(kernel_size = pooling_kernel_size,
                             stride = pooling_stride,
                             padding = pooling_padding) if max_pool else nn.Identity()
            )
                 for i in range(n_conv_layers)
             ])
        self.flattener = nn.Flatten(2,3)
        self.apply(self.init_weight)

    ## function to calculate the final sequence length of the convolution tokens
    def sequence_length(self, n_channels = 3, height=128, width=128):
        return self.forward(torch.zeros((1, n_channels, height, width))).shape[1]

    def forward(self, x):
        return self.flattener(self.conv_layers(x)).transpose(-2,-1)

    @staticmethod
    def init_weight(m):
        if isinstance(m, nn.Conv2d):
            nn.init.kaiming_normal_(m.weight)
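
The number of tokens the tokenizer produces can also be estimated by hand, without a forward pass. The following is a minimal sketch, assuming square inputs, a dilation of 1, and the default floor rounding of `nn.Conv2d` and `nn.MaxPool2d`. The helper names `conv_out` and `token_count` are hypothetical, and the default arguments mirror the values that the `CCT` class later in this notebook passes to the tokenizer.

```python
def conv_out(size, kernel, stride, padding):
    """Output size of a convolution or pooling layer along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1

def token_count(img_size, n_conv_layers, kernel=3, stride=1, padding=3,
                pool_kernel=2, pool_stride=2, pool_padding=1):
    """Number of tokens after n_conv_layers blocks of convolution + max-pooling."""
    size = img_size
    for _ in range(n_conv_layers):
        size = conv_out(size, kernel, stride, padding)                 # convolution
        size = conv_out(size, pool_kernel, pool_stride, pool_padding)  # max-pooling
    return size * size  # the flattened feature map has height * width tokens

print(token_count(224, n_conv_layers=2))  # 60 * 60 = 3600
print(token_count(224, n_conv_layers=3))  # 33 * 33 = 1089
```

Under these assumptions, a third convolutional block shortens the sequence from 3600 to 1089 tokens, which matters because the cost of self-attention grows quadratically with the sequence length.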

## Transformer (Encoder-only)

For the Transformer encoder, we first define the multi-head attention layer as done for the ViT.

In [None]:
class Attention(nn.Module):
    def __init__(self, dim, num_heads=8, attention_dropout=0.1, projection_dropout=0.1):
        super().__init__()
        self.num_heads = num_heads
        head_dim = dim // self.num_heads
        self.scale = head_dim ** -0.5

        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        self.attn_drop = nn.Dropout(attention_dropout)
        self.proj = nn.Linear(dim, dim)
        self.proj_drop = nn.Dropout(projection_dropout)

    def forward(self, x):
        B, N, C = x.shape
        #print(x.shape, self.num_heads, C//self.num_heads)        
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, C // self.num_heads).permute(2, 0, 3, 1, 4)
        q, k, v = qkv[0], qkv[1], qkv[2]

        attn = (q @ k.transpose(-2, -1)) * self.scale
        attn = attn.softmax(dim=-1)
        attn = self.attn_drop(attn)

        x = (attn @ v).transpose(1, 2).reshape(B, N, C)
        x = self.proj(x)
        x = self.proj_drop(x)
        return x
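
To make the core computation of the `Attention` module easier to follow, here is a framework-free sketch of scaled dot-product attention for a single head. It works on plain Python lists instead of tensors and omits the linear projections and the dropout, so it illustrates the arithmetic rather than replacing the module above. The function names are hypothetical.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v):
    """Single-head scaled dot-product attention on lists of token vectors."""
    scale = len(q[0]) ** -0.5  # same scaling as head_dim ** -0.5
    out = []
    for qi in q:
        # similarity of this query with every key
        scores = [scale * sum(a * b for a, b in zip(qi, kj)) for kj in k]
        weights = softmax(scores)
        # weighted sum of the value vectors
        out.append([sum(w * vj[c] for w, vj in zip(weights, v))
                    for c in range(len(v[0]))])
    return out

q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 1.0], [1.0, 1.0]]   # identical keys -> uniform attention weights
v = [[2.0, 0.0], [0.0, 4.0]]
print(attention(q, k, v))
```

Because both keys are identical here, every query attends equally to both tokens and each output row is the mean of the value vectors.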

### Stochastic Depth

The CCT uses [Stochastic Depth](https://arxiv.org/abs/1603.09382) as a regularization technique. During training, it randomly skips entire layers for individual samples, whereas Dropout drops single neurons.
Stochastic depth was originally developed for very deep neural networks, such as ResNet. See the ResNet notebook (Tutorial 2.4) for more information.

In [None]:
# Thanks to rwightman's timm package
# github.com:rwightman/pytorch-image-models

def drop_path(x, drop_prob: float = 0., training: bool = False):
    if drop_prob == 0. or not training:
        return x
    keep_prob = 1 - drop_prob
    shape = (x.shape[0],) + (1,) * (x.ndim - 1)  # work with diff dim tensors, not just 2D ConvNets
    random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
    random_tensor.floor_()  # binarize
    output = x.div(keep_prob) * random_tensor
    return output

#Stochastic Depth
class DropPath(nn.Module):

    def __init__(self, drop_prob=None):
        super(DropPath, self).__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        return drop_path(x, self.drop_prob, self.training)
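
The `floor` trick in `drop_path` may not be obvious at first sight. The sketch below reproduces it for a single scalar with Python's `random` module. The function name `drop_path_scalar` is hypothetical, and the simulation only illustrates the idea under these simplifications.

```python
import math
import random

def drop_path_scalar(x, drop_prob, rng):
    """Scalar version of drop_path for one sample (hypothetical helper)."""
    keep_prob = 1.0 - drop_prob
    # keep_prob + U with U in [0, 1) lies in [keep_prob, 1 + keep_prob),
    # so the floor is 1 with probability keep_prob and 0 otherwise
    mask = math.floor(keep_prob + rng.random())
    # dividing by keep_prob keeps the expected output equal to x
    return x / keep_prob * mask

rng = random.Random(0)
samples = [drop_path_scalar(1.0, 0.1, rng) for _ in range(100_000)]
mean = sum(samples) / len(samples)
kept = sum(s != 0 for s in samples) / len(samples)
print(f"kept fraction: {kept:.3f}, mean output: {mean:.3f}")
```

The kept fraction should be close to `keep_prob`, and the mean output close to the input, which is why the layer needs no rescaling at test time.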

With that, we can implement the class for a Transformer block, consisting of a multi-head attention layer, an MLP, and some regularization operations.

In [None]:
import torch.nn.functional as F

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, nhead, dim_feedforward=2048, dropout=0.1,
                 attention_dropout=0.1, drop_path_rate=0.1):
        super(TransformerEncoderLayer, self).__init__()
        self.pre_norm = nn.LayerNorm(d_model)
        self.self_attn = Attention(dim=d_model, num_heads=nhead,
                                   attention_dropout=attention_dropout, projection_dropout=dropout)

        self.linear1 = nn.Linear(d_model, dim_feedforward)
        self.dropout1 = nn.Dropout(dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.linear2 = nn.Linear(dim_feedforward, d_model)
        self.dropout2 = nn.Dropout(dropout)

        self.drop_path = DropPath(drop_path_rate) if drop_path_rate > 0 else nn.Identity()

        self.activation = F.gelu

    def forward(self, src: torch.Tensor, *args, **kwargs) -> torch.Tensor:
        src = src + self.drop_path(self.self_attn(self.pre_norm(src))) # residual connection 1
        src = self.norm1(src)
        src2 = self.linear2(self.dropout1(self.activation(self.linear1(src))))
        src = src + self.drop_path(self.dropout2(src2)) # residual connection 2
        return src
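
To get a feeling for where the parameters of such a block sit, the following sketch counts them by hand. It assumes the layer layout of the `TransformerEncoderLayer` above (a bias-free `qkv` projection, an output projection, two linear layers, and two LayerNorms). The function name is hypothetical, and the values `d = 224`, an MLP ratio of 4, and 7 layers are the ones used when the network is initialized later in this notebook.

```python
def encoder_layer_params(d_model, dim_feedforward):
    """Trainable parameters of one encoder layer (hypothetical helper)."""
    qkv = d_model * 3 * d_model                            # nn.Linear(dim, dim * 3, bias=False)
    proj = d_model * d_model + d_model                     # attention output projection
    linear1 = d_model * dim_feedforward + dim_feedforward  # first MLP layer
    linear2 = dim_feedforward * d_model + d_model          # second MLP layer
    norms = 2 * (2 * d_model)                              # two LayerNorms, weight + bias each
    return qkv + proj + linear1 + linear2 + norms

d = 224
per_layer = encoder_layer_params(d, int(d * 4.0))
print(per_layer, 7 * per_layer)  # 604352 4230464
```

Dropout, DropPath, and the GELU activation contribute no trainable parameters, so roughly two thirds of each block sit in the MLP.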

## Classifier

To perform classification, we build a classification block consisting of multiple Transformer encoder layers and a classification head. This block receives the convolutional tokens as input.

### Positional embedding
For the positional embedding, the authors of the CCT evaluated three different approaches.
1. Sinusoidal embedding: Introduced by [Vaswani et al. (2017)](https://arxiv.org/abs/1706.03762) for Transformers in NLP tasks.
2. Learnable embedding: As used by [Dosovitskiy et al. (2020)](https://arxiv.org/abs/2010.11929) in the ViT model (see the ViT notebook (Tutorial 4.1) for more information).
3. No embedding: As the convolutional layers already encode some local spatial information, the authors demonstrate that removing the positional embedding has only a small effect on accuracy (see Table 7 in [Hassani et al. (2021)](https://arxiv.org/abs/2104.05704)).
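
The sinusoidal variant can be written down in a few lines. The sketch below is a plain-Python version of the formula used by the `sinusoidal_embedding` method of the classifier in this notebook, with nested lists instead of tensors. The argument name `n_positions` is chosen here for readability.

```python
import math

def sinusoidal_embedding(n_positions, dim):
    """Return an n_positions x dim table of sinusoidal position codes."""
    table = []
    for p in range(n_positions):
        row = []
        for i in range(dim):
            # each pair of neighbouring dimensions (2j, 2j + 1) shares one frequency
            angle = p / (10000 ** (2 * (i // 2) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

pe = sinusoidal_embedding(n_positions=4, dim=6)
print(pe[0])  # position 0: sin(0) = 0 on even indices, cos(0) = 1 on odd indices
```

Since the table depends only on the position and the embedding dimension, it is computed once and kept fixed during training.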

### Sequence Pooling
Unlike the ViT, which has an additional class token used to aggregate class-specific information, the CCT introduces sequence pooling (**SeqPool**).
The main idea is to weight each token of the output sequence by a learned importance score and to sum the weighted tokens into a single vector.


Assume the output of an $L$ layer Transformer encoder 
$$
x_{L} \in \mathbb{R}^{B\times n \times d}
$$
where $B$ is the batch size, $n$ the sequence length, and $d$ is the embedding dimension. 
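The output is fed into a linear layer $g$ that maps each token from $\mathbb{R}^{d}$ to a single score, so that $g(x_L) \in \mathbb{R}^{B \times n \times 1}$. A softmax over the $n$ tokens is then applied to the transposed scores. So the new output is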
$$
x'_L = \text{softmax}\left( g(x_L)^T \right) \in \mathbb{R}^{B \times 1 \times n}
$$

This output ($x'_L$) can be understood as an importance weighting for each token in our original output, giving the output of the sequence pooling:
$$
 z = x'_L x_L = \text{softmax}\left( g(x_L)^T \right) \times x_{L} \in \mathbb{R}^{B \times 1 \times d}
$$

By squeezing the singleton dimension, the final output is $z \in \mathbb{R}^{B \times d}$ and can be sent to the classification head.

Instead of sequence pooling, the CCT can be used with the cls token. This will be considered in the implementation.
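
The equations above can be traced with a small worked example for a single sample ($B = 1$). This is a minimal sketch on plain Python lists. The function `seq_pool` and its arguments `weight` and `bias` are hypothetical stand-ins for the learned parameters of the linear layer $g$.

```python
import math

def seq_pool(tokens, weight, bias=0.0):
    """Pool n token vectors of size d into one vector of size d."""
    # g(x_L): one scalar score per token
    scores = [sum(w * t for w, t in zip(weight, tok)) + bias for tok in tokens]
    # softmax over the n tokens gives the importance weights x'_L
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    importance = [e / total for e in exps]
    # z = x'_L x_L: importance-weighted sum of the tokens
    d = len(tokens[0])
    pooled = [sum(a * tok[c] for a, tok in zip(importance, tokens)) for c in range(d)]
    return pooled, importance

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # n = 3 tokens, d = 2
z, importance = seq_pool(tokens, weight=[0.0, 0.0])
print(importance, z)
```

With all-zero weights every token receives the same importance, so SeqPool reduces to average pooling. Training the weights lets the network move away from this uniform average.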

In [None]:
class TransformerClassifier(nn.Module):
    def __init__(self,
                 seq_pool=True,
                 embedding_dim = 256,
                 num_layers=12,
                 num_heads = 6,
                 mlp_ratio = 4.,
                 num_classes = 10,
                 dropout =.1,
                 attention_dropout=.1,
                 stochastic_depth = .1,
                 positional_embedding = 'learnable',
                 sequence_length = None):
        super().__init__()
        
        ## positional embedding can be realised in three ways:
        ## 1. sinusoidal embedding
        ## 2. learnable embedding
        ## 3. no additional positional embedding
        positional_embedding = positional_embedding if \
            positional_embedding in ['sine', 'learnable', 'none'] else 'sine'
        dim_feedforward = int(embedding_dim * mlp_ratio)
        self.embedding_dim = embedding_dim
        self.sequence_length = sequence_length
        self.seq_pool = seq_pool
        self.num_tokens = 0

        assert sequence_length is not None or positional_embedding == 'none', \
            f"Positional embedding is set to {positional_embedding} and" \
            f" the sequence length was not specified."

        ## if no SeqPool is used, use a cls token like the ViT
        if not seq_pool:
            sequence_length +=1
            self.class_emb = nn.Parameter(torch.zeros(1,1, self.embedding_dim),
                                      requires_grad=True)
            self.num_tokens = 1
        else:
            ## if SeqPool is used, initialize the linear layer g() here
            self.attention_pool = nn.Linear(self.embedding_dim, 1)

        ## initialize the positional embedding
        if positional_embedding != 'none':
            if positional_embedding == 'learnable':
                self.positional_emb = nn.Parameter(torch.zeros(1, sequence_length, embedding_dim),
                                                requires_grad=True)
                nn.init.trunc_normal_(self.positional_emb, std=0.2)
            else:
                self.positional_emb = nn.Parameter(self.sinusoidal_embedding(sequence_length, embedding_dim),
                                                requires_grad=False)
        else:
            self.positional_emb = None

        self.dropout = nn.Dropout(p=dropout)
        dpr = [x.item() for x in torch.linspace(0, stochastic_depth, num_layers)]

        ## create a list of transformer encoder layer
        self.blocks = nn.ModuleList([
            TransformerEncoderLayer(d_model=embedding_dim, nhead=num_heads,
                                    dim_feedforward=dim_feedforward, dropout=dropout,
                                    attention_dropout=attention_dropout, drop_path_rate=dpr[i])
            for i in range(num_layers)])
        self.norm = nn.LayerNorm(embedding_dim)

        self.fc = nn.Linear(embedding_dim, num_classes)
        self.apply(self.init_weight)

    def forward(self, x):
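        ## without a positional embedding, shorter sequences are zero-padded to the expected length
        if self.positional_emb is None and self.sequence_length is not None \
                and x.size(1) < self.sequence_length:
            x = F.pad(x, (0, 0, 0, self.sequence_length - x.size(1)), mode='constant', value=0)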

        ## if no SeqPool is used, use the cls token
        if not self.seq_pool:
            cls_token = self.class_emb.expand(x.shape[0], -1, -1)
            x = torch.cat((cls_token, x), dim=1)

        if self.positional_emb is not None:
            x += self.positional_emb

        x = self.dropout(x)

        for blk in self.blocks:
            x = blk(x)
        x = self.norm(x)

        ## apply SeqPool
        if self.seq_pool:
            x = torch.matmul(F.softmax(self.attention_pool(x), dim=1).transpose(-1, -2), x).squeeze(-2)
        else:
            x = x[:, 0]

        x = self.fc(x)
        return x

    @staticmethod
    def init_weight(m):
        if isinstance(m, nn.Linear):
            nn.init.trunc_normal_(m.weight, std=.02)
            if m.bias is not None:
                nn.init.constant_(m.bias, 0)
        elif isinstance(m, nn.LayerNorm):
            nn.init.constant_(m.bias, 0)
            nn.init.constant_(m.weight, 1.0)

    @staticmethod
    def sinusoidal_embedding(n_channels, dim):
        pe = torch.FloatTensor([[p / (10000 ** (2 * (i // 2) / dim)) for i in range(dim)]
                                for p in range(n_channels)])
        pe[:, 0::2] = torch.sin(pe[:, 0::2])
        pe[:, 1::2] = torch.cos(pe[:, 1::2])
        return pe.unsqueeze(0) 

## CCT Model
Now we bring everything together to build the CCT network.

In [None]:
class CCT(nn.Module):
    def __init__(self,
                 img_size=224,
                 embed_dim = 256,
                 n_input_channels = 3,
                 n_conv_layers = 2,
                 kernel_size=3,
                 stride = 1,
                 padding = 3,
                 pooling_kernel_size = 2,
                 pooling_stride = 2,
                 pooling_padding = 1,
                 dropout = 0.2,
                 attention_dropout = .1,
                 stochastic_depth =.1,
                 num_layers=7,
                 num_heads = 6,
                 mlp_ratio=4.,
                 num_classes = 10,
                 pos_embeding = 'learnable'):
        super(CCT, self).__init__()

        ## Convolutional-Tokenizer
        self.tokenizer = Tokenizer(kernel_size = kernel_size,
                                   n_input_channels=n_input_channels,
                                   n_output_channels = embed_dim,
                                   stride = stride,
                                   padding = padding,
                                   pooling_kernel_size = pooling_kernel_size,
                                   pooling_stride = pooling_stride,
                                   pooling_padding = pooling_padding,
                                   max_pool = True,
                                   activation = nn.ReLU,
                                   n_conv_layers = n_conv_layers,
                                   conv_biase = False)

        ## Transformer-based classifier
        self.classifier = TransformerClassifier(
            sequence_length = self.tokenizer.sequence_length(
                n_channels=n_input_channels,
                height = img_size,
                width = img_size),
                embedding_dim = embed_dim,
                seq_pool = True,
                dropout = dropout,
                attention_dropout = attention_dropout,
                stochastic_depth = stochastic_depth,
                num_layers = num_layers,
                num_heads = num_heads,
                mlp_ratio = mlp_ratio,
                num_classes = num_classes,
                positional_embedding = pos_embeding
                )
        
    def forward(self,x):
        x = self.tokenizer(x)
        return self.classifier(x)

## Dataset preparation

Although the CCT has a relatively low number of parameters, we still apply several data augmentation methods, such as random crop and resize, horizontal flipping, rotation, and color jitter.

In [None]:
from torchvision.transforms import v2, Compose
from torchvision import transforms

transform_augm = transforms.Compose([
    v2.ToImage(),
    # Core transformations
    v2.RandomResizedCrop(size=224, scale=(0.75, 1.0), ratio=(0.9, 1.05)),
    v2.RandomHorizontalFlip(p=0.5),  # People can face either direction
    v2.RandomRotation(degrees=(-10, 10)),  # Small rotations
    
    # Lighting and appearance variations
    v2.ColorJitter(brightness=0.15, contrast=0.15, saturation=0.1, hue=0.05),
    v2.RandomAutocontrast(p=0.2),
    
    # Occasional realistic variations - with proper probability handling
    v2.RandomApply([v2.GaussianBlur(kernel_size=3, sigma=(0.1, 1.0))], p=0.3),
    v2.RandomAdjustSharpness(sharpness_factor=1.5, p=0.3),
    v2.RandomPerspective(distortion_scale=0.15, p=0.3),
    v2.RandomErasing(p=0.1, scale=(0.02, 0.08), ratio=(0.3, 3.3)),
    
    # Normalization
    v2.ToDtype(torch.float32, scale=True),
    v2.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

transform_norm = Compose(
[    v2.ToImage(),
     v2.ToDtype(torch.float32, scale=True),
     v2.Resize(size=(224,224)),
     v2.Normalize(mean = [0.485, 0.456, 0.406], std=[0.229,0.224,0.225]) , 
])


## Dataset

To train the network we use the [Imagenette](https://github.com/fastai/imagenette) dataset, which is already provided by [torchvision](https://pytorch.org/vision/main/generated/torchvision.datasets.Imagenette.html).

It is a 10-class subset of ImageNet (tench, English springer, cassette player, chain saw, church, French horn, garbage truck, gas pump, golf ball, parachute) that is comparatively easy to classify.

In [None]:
batch_size=32
train_loader, test_loader, classes = prepare_imagenette(transform_augm, transform_norm, 
                                                        save_path='../Dataset/', batch_size = batch_size, num_workers=4)

Show some samples with the augmentations.

In [None]:
import matplotlib.pyplot as plt
import numpy as np

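def imshow(imges):
    # undo the ImageNet normalization applied in the transforms above
    mean = np.array([0.485, 0.456, 0.406])
    std = np.array([0.229, 0.224, 0.225])
    plt.figure()
    for i in range(4):
        npimg = np.transpose(imges[i].numpy(), (1, 2, 0))  # CHW -> HWC
        npimg = np.clip(npimg * std + mean, 0, 1)  # unnormalize and clip to the valid range
        plt.subplot(1, 4, i + 1)
        plt.imshow(npimg)
    plt.show()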

dataiter = iter(train_loader)
images, labels = next(dataiter)
imshow(images)

## Initialize the network

In [None]:
embed_dim =32*7
num_heads =embed_dim//32
depth = 7
n_convlayers = 3
kernel_size = 3

net_cct = CCT(num_classes = len(classes), 
              embed_dim = embed_dim, 
              num_heads=num_heads, 
              num_layers=depth,
              n_conv_layers=n_convlayers,
              kernel_size = kernel_size)

print('Trainable Parameters in CCT: %.3fM' % get_parameters(net_cct))

## Loss and optimizer

We use the AdamW optimizer, provided by [PyTorch](https://docs.pytorch.org/docs/stable/generated/torch.optim.AdamW.html). Without going into too much detail, the standard Adam optimizer with L2 regularization adds the weight decay term to the gradient of a batch. Because Adam rescales the gradient by its adaptive, per-parameter step size, the weight decay term is rescaled as well. In contrast, AdamW decouples the weight decay from the gradient and applies it directly in the parameter update. The strength of the decay then no longer depends on the gradient statistics (see [Loshchilov and Hutter, 2019](https://arxiv.org/abs/1711.05101) for a deeper explanation).
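
The difference becomes visible in a single update step for one scalar weight. The following is a simplified sketch, not PyTorch's actual implementation: it assumes zero-initialized moment estimates, a loss gradient of zero, and the learning rate and weight decay used in this notebook. The function `first_step` is hypothetical.

```python
import math

def first_step(theta, grad, lr=1e-4, wd=5e-4, decoupled=False,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One optimizer step for a scalar weight, starting from zero moment estimates."""
    if decoupled:
        theta = theta * (1 - lr * wd)   # AdamW: decay applied to the weight itself
    else:
        grad = grad + wd * theta        # Adam + L2: decay folded into the gradient
    m_hat = ((1 - beta1) * grad) / (1 - beta1)          # bias-corrected first moment
    v_hat = ((1 - beta2) * grad * grad) / (1 - beta2)   # bias-corrected second moment
    return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

theta = 1.0
print(1.0 - first_step(theta, grad=0.0))                   # shrinkage with Adam + L2
print(1.0 - first_step(theta, grad=0.0, decoupled=True))   # shrinkage with AdamW
```

In this toy setting, Adam with L2 regularization shrinks the weight by roughly the learning rate, almost independently of the chosen weight decay, because the decay term is normalized together with the gradient. The decoupled variant shrinks it by `lr * wd * theta`, as intended.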

Similar to the ViT implementation, we will use a cosine annealing learning rate scheduler for training. Please note that no warmup scheduler is used here.
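
For orientation, the cosine annealing schedule can be written in closed form. The sketch below assumes that the scheduler is stepped once per epoch and that nothing else modifies the learning rate. The function `cosine_lr` is hypothetical, and its default values are the ones used in the optimizer cell of this notebook.

```python
import math

def cosine_lr(epoch, lr_max=1e-4, lr_min=1e-6, t_max=10):
    """Closed-form cosine annealing from lr_max (epoch 0) to lr_min (epoch t_max)."""
    return lr_min + 0.5 * (lr_max - lr_min) * (1 + math.cos(math.pi * epoch / t_max))

for epoch in range(0, 11, 5):
    print(epoch, f"{cosine_lr(epoch):.2e}")
```

The learning rate starts at `1e-4`, passes the midpoint of about `5.05e-5` after five epochs, and reaches `1e-6` at the end of training.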

In [None]:
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

num_epochs = 10

# define optimizer
init_lr = 1e-4
optimizer = optim.AdamW(net_cct.parameters(), lr=init_lr, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)

In [None]:
# train model
history = train_model(model=net_cct, 
                        train_loader=train_loader, 
                        val_loader=test_loader, 
                        criterion=nn.CrossEntropyLoss(),
                        optimizer=optimizer,
                        scheduler=scheduler,
                        device=device,
                        num_epochs=num_epochs)


# save model
results_folder = 'cct_model/'
os.makedirs(results_folder, exist_ok=True)

torch.save(net_cct.state_dict(), f'{results_folder}cct_net_state_dict.pt')

# save training history
np.save(f'{results_folder}history.npy', history)

In [None]:
# evaluate model
with timer("Evaluating process"):
    aggregate_df, per_image_df, overall_accuracy = test_model(model=net_cct,
                                                              test_loader=test_loader,
                                                              device=device,
                                                              class_names=classes)


In [None]:
from Utils.plotting import visualize_training_results, visualize_test_results

visualize_training_results(train_losses=history['train_loss'],
                           train_accs=history['train_acc'],
                           test_losses=history['val_loss'],
                           test_accs=history['val_acc'],
                           output_dir=None)

visualize_test_results(aggregate_df=aggregate_df, per_image_df=per_image_df, overall_accuracy=overall_accuracy)

## Pre-trained CCT
It is also possible to use a pre-trained CCT and fine-tune it on the new dataset.
We use the pre-trained model from [SHI-Labs](https://github.com/SHI-Labs/Compact-Transformers).

In [None]:
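import os
import subprocess

# clone the repository once; subprocess.run blocks until the clone has finished,
# so the import in the next cell can rely on the files being present
if not os.path.exists('Compact_Transformers'):
    subprocess.run(['git', 'clone', 'https://github.com/SHI-Labs/Compact-Transformers.git',
                    'Compact_Transformers'], check=True)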

## Fine tune on Imagenette

We load a CCT network pre-trained on the ImageNet dataset. It consists of 7 Transformer encoder layers and a convolutional block with 2 convolutional layers and a kernel size of 7.

In [None]:
from Compact_Transformers.src import cct_7_7x2_224_sine
cct_net = cct_7_7x2_224_sine(pretrained=True, progress=False, img_size=224, positional_embedding='sine')

print('Trainable Parameters in CCT: %.3fM' % get_parameters(cct_net))

In [None]:
num_classes = len(classes)
## get the input dimensionality of the old classifier
in_features = cct_net.classifier.fc.in_features
## put a new classifier on top
cct_net.classifier.fc = nn.Sequential(nn.Linear(in_features, 256),
                                      nn.ReLU(),
                                      nn.Dropout(0.3),
                                      nn.Linear(256, num_classes)) 

We use the same optimizer and scheduler as above.

In [None]:
import torch.optim as optim
from torch.optim.lr_scheduler import CosineAnnealingLR

num_epochs = 10

# define optimizer
init_lr = 1e-4
optimizer = optim.AdamW(cct_net.parameters(), lr=init_lr, weight_decay=5e-4)
scheduler = CosineAnnealingLR(optimizer, T_max=num_epochs, eta_min=1e-6)


In [None]:
# train model
history = train_model(model=cct_net, 
                        train_loader=train_loader, 
                        val_loader=test_loader, 
                        criterion=nn.CrossEntropyLoss(),
                        optimizer=optimizer,
                        scheduler=scheduler,
                        device=device,
                        num_epochs=num_epochs)


# save model
results_folder = 'cct_model/'
os.makedirs(results_folder, exist_ok=True)

torch.save(cct_net.state_dict(), f'{results_folder}model_state_dict.pt')

# save training history under its own name so the history of the first run is not overwritten
np.save(f'{results_folder}history_finetuned.npy', history)


In [None]:
# evaluate model
with timer("Evaluating process"):
    aggregate_df, per_image_df, overall_accuracy = test_model(model=cct_net,
                                                              test_loader=test_loader,
                                                              device=device,
                                                              class_names=classes)



In [None]:
# plot results
from Utils.plotting import visualize_training_results, visualize_test_results

visualize_training_results(train_losses=history['train_loss'],
                           train_accs=history['train_acc'],
                           test_losses=history['val_loss'],
                           test_accs=history['val_acc'],
                           output_dir=None)


visualize_test_results(aggregate_df=aggregate_df, per_image_df=per_image_df, overall_accuracy=overall_accuracy)

# Exercises

## 1. Change the parameters

The CCT is a good example of how different techniques and approaches (such as Transformers, convolutional networks, sequence pooling, and stochastic depth) can be combined. Experiment with the various components and adjust their parameters to observe their impact on performance.


In [None]:
# Your code here

## 2. Plot the attention maps

In the ViT notebook, you visualized the attention maps for the ViT model. You can use the same algorithm for the CCT with some adaptations, such as adapting the code for the max pooling layers in the convolution tokenizer.
How does the algorithm change depending on whether **SeqPool** or the cls token is used?

In [None]:
# Your code here