# T6 - Swin-Transformer - Pawpularity

This is an ongoing CV-focused Kaggle contest (about 3 months to go as of October 2021), and you have the chance to win a cash prize!

In this contest, you will help the website [PetFinder.my](https://petfinder.my/) assign "Pawpularity" scores to pet photos, which will help the pets find homes faster.

The "Pawpularity" scores in the training set are derived from each pet profile's page view statistics at the listing pages, using an algorithm that normalizes the traffic data across different pages, platforms and various metrics.

`Metadata`
* For each image, you are provided optional metadata, manually labeling each photo for key visual quality and composition parameters.

* These labels are not used for deriving our Pawpularity score, but it may be beneficial for better understanding the content and co-relating them to a photo's attractiveness. Our end goal is to deploy AI solutions that can generate intelligent recommendations (i.e. show a closer frontal pet face, add accessories, increase subject focus, etc) and automatic enhancements (i.e. brightness, contrast) on the photos, so we are hoping to have predictions that are more easily interpretable.

* You may use these labels as you see fit, and optionally build an intermediate / supplementary model to predict the labels from the photos. If your supplementary model is good, we may integrate it into our AI tools as well.

* In our production system, new photos that are dynamically scored will not contain any photo labels. If the Pawpularity prediction model requires photo label scores, we will use an intermediary model to derive such parameters, before feeding them to the final model.

`Evaluation Metrics`

$$RMSE = \sqrt{\frac{1}{n}\sum_{i=1}^n (y_i - \hat{y}_i)^2}$$
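As a quick illustration, the metric above can be computed with NumPy as follows. This is only a minimal sketch: the function name `rmse` and the sample numbers are made up for the example.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error between two equally sized arrays."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# hypothetical Pawpularity labels and predictions
print(rmse([30, 50, 70], [33, 46, 70]))
```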

[PetFinder.my - Pawpularity Contest](https://www.kaggle.com/c/petfinder-pawpularity-score/overview/description)

[Reference](https://www.kaggle.com/phalanx/train-swin-t-pytorch-lightning/notebook)

In this tutorial, I will build the pipeline and use Swin-Transformer as the model. When you submit the code to Kaggle, you may encounter an error even though you can successfully run and save the notebook. Please refer to Discussion for more information.

## key updates:

### 1. use new model: swin-transformer

2021 ICCV Best Paper: https://arxiv.org/abs/2103.14030

An explanation for Swin-Transformer: https://medium.com/codex/swin-transformers-the-most-powerful-tool-in-computer-vision-659f78744871

For loading the pretrained model, please include the dataset: https://www.kaggle.com/liucong12601/timmswin

The initial ViT (Vision Transformer) showed promising performance on vision problems, but adapting Transformers to fully replace convolutions was still considered a challenge. Swin Transformers, on the other hand, handle the differences between the language and vision domains, such as large variations in the scale of objects and the high resolution of pixels in images, more efficiently and can serve as a general-purpose backbone for vision.

The paper describes Swin Transformers as a hierarchical Transformer whose representation is computed with Shifted WINdows.

![Swin-Transformer](https://miro.medium.com/max/1400/1*KYN2Xg7IUE_YP0ieE6nysw.png)

##### Hierarchical representation
1. The input RGB image is split into patches by the patch partition layer. Each patch is 4 x 4 x 3 (3 for the RGB channels) and is considered a “token”. Each patch is passed through a linear embedding layer, which projects it to a C-dimensional token as in ViT.

2. The main architecture is composed of multiple stages (4 stages for Swin-T). Each stage after the first is built by connecting a patch merging layer and multiple Swin transformer blocks (the first stage uses the linear embedding layer in place of patch merging). The Swin transformer block is based on a modified self-attention.

3. A hierarchical representation is implemented through the patch merging layers. This layer concatenates the features of each group of 2 × 2 neighboring patches, which reduces the number of tokens by a factor of 4 and yields 4C-dimensional features, and then applies a linear transformation that sets the output dimension to 2C (a factor of 2 relative to the input).

![hierarchical representation](https://miro.medium.com/max/682/1*KSNFDRw_C-PVt-Hg8ugn8g.png)
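The patch merging step can be sketched in a few lines of NumPy. This is a simplified illustration rather than the official implementation: the function name `patch_merging` and the random weights are made up for the example, the layer normalization used in the official code is omitted, and C = 96 is the embedding dimension of Swin-T.

```python
import numpy as np

def patch_merging(x, weight):
    """Merge each 2 x 2 group of neighboring tokens.

    x:      feature map of shape (H, W, C) with even H and W
    weight: linear projection of shape (4C, 2C)
    """
    # gather the four tokens of every 2 x 2 neighborhood
    x0 = x[0::2, 0::2, :]
    x1 = x[1::2, 0::2, :]
    x2 = x[0::2, 1::2, :]
    x3 = x[1::2, 1::2, :]
    merged = np.concatenate([x0, x1, x2, x3], axis=-1)  # (H/2, W/2, 4C)
    return merged @ weight                              # (H/2, W/2, 2C)

rng = np.random.default_rng(0)
H = W = 224 // 4   # 56 x 56 tokens after the 4 x 4 patch partition
C = 96             # embedding dimension of Swin-T
x = rng.normal(size=(H, W, C))
for stage in range(2, 5):  # stages 2 to 4 each start with patch merging
    weight = rng.normal(size=(4 * x.shape[-1], 2 * x.shape[-1]))
    x = patch_merging(x, weight)
    print('stage', stage, x.shape)
```

Each merging step halves the spatial resolution and doubles the channel dimension, which is what produces the feature pyramid shown in the figure.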

##### Shifted Windows

1. The shifted windows approach is motivated by the observation that standard vision Transformers conduct self-attention on a global receptive field. Therefore, their computational complexity is quadratic in the number of tokens.

2. The shifted window approach instead computes self-attention within local windows. The feature map is divided into non-overlapping windows, each containing M × M patches (M=7), and self-attention is calculated inside each window.

3. As illustrated in the figure below, the first module uses the regular window configuration to compute self-attention locally in evenly partitioned windows starting from the top-left pixel. The next Swin transformer block adopts a window configuration that is shifted by (⌊M/2⌋, ⌊M/2⌋) patches from that of the preceding layer. Throughout the Swin transformer blocks, the network alternates between the regular window configuration (W-MSA) and the shifted window configuration (SW-MSA). This approach introduces connections between neighboring non-overlapping windows, similar to how stacked convolutions gradually enlarge the receptive field.

![shifted window approach](https://miro.medium.com/max/828/1*tmXrQOcPcpSjOqx-iQZAXw.png)
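To make the cost argument concrete, here is a small NumPy sketch that partitions a token grid into windows and counts the query-key pairs attention has to score. The helper name `window_partition` is chosen for this example, and the 56 x 56 grid assumes a 224 x 224 input with 4 x 4 patches.

```python
import numpy as np

def window_partition(x, M):
    """Split a (H, W, C) feature map into non-overlapping M x M windows.

    Returns an array of shape (num_windows, M * M, C).
    """
    H, W, C = x.shape
    x = x.reshape(H // M, M, W // M, M, C)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/M, W/M, M, M, C)
    return x.reshape(-1, M * M, C)

H = W = 56   # token grid of the first stage for a 224 x 224 input
M = 7
tokens = np.zeros((H, W, 1))
windows = window_partition(tokens, M)
print(windows.shape)

# number of query-key pairs that attention has to score
global_pairs = (H * W) ** 2                      # every token attends to every token
window_pairs = windows.shape[0] * (M * M) ** 2   # attention stays inside each window
print(global_pairs, window_pairs, global_pairs // window_pairs)
```

The windowed cost grows linearly with the number of tokens for a fixed M, whereas the global cost grows quadratically.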


##### Efficient Shifting

1. For efficient processing of edge windows smaller than M × M, the paper applies cyclic-shifting before computing self-attention as illustrated in the figure below. A masking mechanism is applied to the partitions so that computation is limited within each original window. (This part is hard to understand. Please refer to the official code for better understanding)

![efficient shifting](https://miro.medium.com/max/844/1*7_GOjYTuuJPJPOby6zr1OA.png)
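The cyclic shift itself is a simple roll of the feature map. The sketch below uses `np.roll` on a toy 14 x 14 grid of token ids (the grid size is an arbitrary choice for illustration) and leaves out the attention mask, so it only shows how tokens move and how the reverse shift restores them.

```python
import numpy as np

M = 7
shift = M // 2   # 3 patches

# label every token of a 14 x 14 grid with a unique id
x = np.arange(14 * 14).reshape(14, 14)

# cyclic shift toward the top-left: rows and columns that fall off wrap around
shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))

# ... windowed self-attention with masking would run on `shifted` here ...

# the reverse shift restores the original layout
restored = np.roll(shifted, shift=(shift, shift), axis=(0, 1))
print(shifted[0, 0], np.array_equal(restored, x))
```

After the shift, a regular window partition can be reused, and the mask keeps tokens that were not neighbors in the original image from attending to each other.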

### 2. new optimizer: AdamW

Pytorch doc for AdamW: https://pytorch.org/docs/stable/generated/torch.optim.AdamW.html#torch.optim.AdamW

A very good comparison for Adam and AdamW: https://towardsdatascience.com/why-adamw-matters-736223f31b5d

Try to imagine minimizing a cost function f of a neural network like walking down a hillside in the mountains: You initialize the weights of your network randomly, which translates to starting at a random point on the mountain. Your goal is to reach a good minimum of the cost function (the valley) as quickly as possible. Before each step you calculate the gradient ∇ f (determine in which direction the hillside inclines the most) and take a step in the opposite direction.

* Adam: Take bigger, more daring steps when walking down a meadow where the gradient does not change much — or smaller steps when climbing down rocks where the gradient constantly changes.

    Adam keeps track of (exponential moving) averages of the gradient (called the first moment, from now on denoted as $m$) and of the square of the gradient (called the raw second moment, from now on denoted as $v$). When the gradients do not change much ($|m|$ is close to $\sqrt{v}$) and “we do not have to be careful walking down the hill”, the step size is of the order of the learning rate $\alpha$. When they do ($\sqrt{v} \gg |m|$) and “we need to be careful not to walk in the wrong direction”, the step size is much smaller.
    
* Problem of Adam: The violet term in line 6 of the figure below shows L2 regularization in Adam (not AdamW) as it is usually implemented in deep learning libraries. The regularization term is added to the cost function, which is then differentiated to calculate the gradients $g$. However, if one adds the weight decay term at this point, the moving averages of the gradient and its square ($m$ and $v$) keep track not only of the gradients of the loss function but also of the regularization term! This means that L2 regularization does not work as intended and is not as effective as with SGD, which is why SGD has often yielded models that generalize better and has been used for many state-of-the-art results.

* AdamW: the weight decay is performed only after controlling the parameter-wise step size (see the green term in line 12). The weight decay or regularization term does not end up in the moving averages and is thus only proportional to the weight itself. The authors show experimentally that AdamW yields better training loss and that the models generalize much better than models trained with Adam, allowing the new version to compete with stochastic gradient descent with momentum.

![adam-adamW](https://miro.medium.com/max/1296/1*BOPnuP6VP0JVnJsoCdTo-g.png)
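The difference between the two update rules can be sketched for a single step in NumPy. This is a simplified illustration and not the PyTorch implementation: the function names `adam_l2_step` and `adamw_step` and the hyperparameter values are chosen for this example only.

```python
import numpy as np

def adam_l2_step(w, grad, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam step with L2 regularization added to the gradient."""
    g = grad + wd * w                  # the decay term is mixed into the gradient ...
    m = b1 * m + (1 - b1) * g          # ... and therefore into both moving averages
    v = b2 * v + (1 - b2) * g ** 2
    m_hat = m / (1 - b1 ** t)          # bias correction
    v_hat = v / (1 - b2 ** t)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamw_step(w, grad, m, v, t, lr=1e-3, wd=1e-2, b1=0.9, b2=0.999, eps=1e-8):
    """One AdamW step: the weight decay bypasses the moving averages."""
    m = b1 * m + (1 - b1) * grad
    v = b2 * v + (1 - b2) * grad ** 2
    m_hat = m / (1 - b1 ** t)
    v_hat = v / (1 - b2 ** t)
    return w - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * w), m, v

# two weights of very different size, and a loss gradient of zero
w = np.array([10.0, 0.1])
grad = np.zeros(2)
zeros = np.zeros(2)

w_l2, _, _ = adam_l2_step(w, grad, zeros, zeros, t=1)
w_adamw, _, _ = adamw_step(w, grad, zeros, zeros, t=1)
print('Adam + L2 shrinkage:', w - w_l2)
print('AdamW shrinkage:    ', w - w_adamw)
```

In this toy setting, Adam with L2 shrinks both weights by roughly the learning rate regardless of their size, because the decay term is normalized by its own magnitude. With AdamW the shrinkage stays proportional to each weight.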

### 3. use new learning rate scheduler: CosineAnnealingWarmRestarts

* T_0 (int) – Number of iterations for the first restart.
* T_mult (int, optional) – Factor by which $T_i$ increases after a restart. Default: 1.
* eta_min (float, optional) – Minimum learning rate. Default: 0.
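The schedule follows a closed-form cosine curve, which can be written in plain Python. The sketch below is a re-implementation for illustration only (the function name `cosine_warm_restart_lr` is made up); in practice the PyTorch scheduler should be used.

```python
import math

def cosine_warm_restart_lr(epoch, base_lr, eta_min, T_0, T_mult=1):
    """Learning rate at a given epoch under cosine annealing with warm restarts."""
    t_cur, t_i = epoch, T_0
    # find the position inside the current cycle and the length of that cycle
    while t_cur >= t_i:
        t_cur -= t_i
        t_i *= T_mult
    return eta_min + 0.5 * (base_lr - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))

for epoch in [0, 10, 19, 20]:
    print(epoch, cosine_warm_restart_lr(epoch, base_lr=1e-5, eta_min=1e-6, T_0=20))
```

The rate starts at the base learning rate, decays toward `eta_min` over `T_0` epochs, and jumps back to the base learning rate at each restart.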

In [None]:
import torch
import torch.optim as optim
import matplotlib.pyplot as plt

eta_min = 1e-6
T_0 = 20
T_mult = 1
demo_model = [torch.tensor([0., 1.])] # I randomly create some parameter for demo
optimizer = optim.SGD(demo_model, lr=1e-5)
scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, eta_min=eta_min, T_0=T_0, T_mult=T_mult)
lrs = []
x = range(20)
for epoch in x:
    lrs.append(scheduler.get_last_lr()[0]) # learning rate of the first (and only) parameter group
    ## or we can use:
    #lrs.append(optimizer.param_groups[0]['lr'])
    scheduler.step()
plt.plot(x, lrs)

### 4. use new loss:
* Instead of using MSELoss as we did in the last tutorial, here we use BCEWithLogitsLoss. This loss was originally designed for `binary classification`. Here we use it for our `regression` problem. During training, we first transform our targets from [0,100] to [0,1], which makes the task `like` a binary classification problem: we would like to predict the `probability that our score is 1`, so we can use BCEWithLogitsLoss as in a binary classification problem.
    
* This loss combines a Sigmoid layer and the BCELoss in one single class. This version is more numerically stable than using a plain Sigmoid followed by a BCELoss as, by combining the operations into one layer, we take advantage of the log-sum-exp trick for numerical stability.

* During training, our model returns a numerical value for a given image, and the Sigmoid function transforms this value to [0,1]. The loss then calculates the BCELoss between the output of the Sigmoid function and the targets. For validation and testing, we no longer need the BCELoss part. We only need to transform the output of the model with the Sigmoid function (to a value in [0,1]), representing "the probability that this image is of score 1". We then multiply it by 100 to get the Pawpularity in [0,100].
    
* The advantage of this loss function over MSELoss is that, if we directly use the output of the model as our predicted Pawpularity, this value can be negative (which is not reasonable). However, if we use BCEWithLogitsLoss, its embedded Sigmoid function transforms the model output to [0,1], which is a reasonable range for the Pawpularity score (normalized by 100).
    
* On the validation set, we still measure performance with RMSE, the metric used by the contest.
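The points above can be sketched with NumPy. This is a hand-written illustration of the numerically stable formula, not the PyTorch source; the function names and the sample labels and logits are made up for the example.

```python
import numpy as np

def bce_with_logits(logits, targets):
    """Numerically stable binary cross-entropy on raw logits, averaged over the batch."""
    logits = np.asarray(logits, dtype=float)
    targets = np.asarray(targets, dtype=float)
    # equals -[y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]
    loss = np.maximum(logits, 0) - logits * targets + np.log1p(np.exp(-np.abs(logits)))
    return float(loss.mean())

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-np.asarray(x, dtype=float)))

pawpularity = np.array([20.0, 50.0, 90.0])   # hypothetical labels in [0, 100]
targets = pawpularity / 100.0                # soft targets in [0, 1]
logits = np.array([-1.5, 0.0, 2.0])          # hypothetical raw model outputs

print(bce_with_logits(logits, targets))      # training loss
print(sigmoid(logits) * 100)                 # predicted Pawpularity at inference time
```

Note that the targets are soft values in [0,1] rather than hard 0/1 labels, which the binary cross-entropy formula handles without modification.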
    
    


### 5. mixup:

During training, we mix up the training data within a batch to create additional "diversity" and improve the generalization ability of the model.

Given a batch of data $(x,y)$, where $x$ has size (batchsize, channel, image_size, image_size), we apply mixup with probability $p$. During mixup, we first sample a ratio $\lambda \sim \text{Beta}(\alpha, \alpha)$. Then we get the mixup results:

$$x_{mix} = \lambda x + (1-\lambda) x[\text{random permutation}, :, :, :]$$

$$y_a = y$$

$$y_b = y[\text{random permutation}]$$

Then we set the loss to be (where $M$ denotes the model):

$$\text{loss}_{mix}(x_{mix}, y_a, y_b) = \lambda \, \text{loss}(M(x_{mix}), y_a) + (1-\lambda) \, \text{loss}(M(x_{mix}), y_b)$$

### 6. GradCam

grad_cam: https://github.com/jacobgil/pytorch-grad-cam

explanation: 

https://research.fb.com/wp-content/uploads/2017/09/iccv-grad-cam.pdf

https://towardsdatascience.com/demystifying-convolutional-neural-networks-using-gradcam-554a85dd4e48



While CNNs enable superior performance, their lack of decomposability into intuitive and understandable components makes them hard to interpret. Interpretability of deep learning models matters for building trust and moving towards their successful integration into our daily lives. To achieve this goal, model transparency is useful for explaining why models predict what they predict.

Grad-CAM uses the gradient information flowing into the last convolutional layer of the CNN to understand the importance of each neuron for a decision of interest.

Step 1. First compute the gradient of the score for the class $c$, $y^c$ (before the softmax), with respect to the feature maps $A_k$ of a convolutional layer. These gradients flowing back are global-average-pooled to obtain the neuron importance weights $a_k$ for the target class.

![calculating_weights_ak](https://miro.medium.com/max/796/1*RE3V1anNLUuYd18NbBuQjg.png)

Step 2. After calculating $a_k$ for the target class $c$, we perform a weighted combination of activation maps and follow it by ReLU.

![Linear Combination](https://miro.medium.com/max/397/1*FqE04KDQukS5h6doLszqNQ.png)

This results in a coarse heatmap of the same size as the convolutional feature maps. We apply ReLU to the linear combination because we are only interested in the features that have a positive influence on the class of interest. Without ReLU, the class activation map highlights more than is required and hence achieves lower localization performance.

($L^c_{Grad-CAM}$ is first up-sampled to the input image resolution using bi-linear interpolation.)


![grad_cam](https://www.statworx.com/wp-content/uploads/CAM_orig_paper-2048x919.png)
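The two steps can be sketched in NumPy, assuming the activations and gradients of the chosen layer are already available. The function name `grad_cam` and the random inputs are placeholders for illustration; note that this sketch is plain Grad-CAM, whereas the code later in this notebook uses the GradCAMPlusPlus class of the pytorch_grad_cam package.

```python
import numpy as np

def grad_cam(activations, gradients):
    """Coarse Grad-CAM heatmap.

    activations: feature maps A_k, shape (K, H, W)
    gradients:   gradients of the score with respect to A_k, shape (K, H, W)
    """
    weights = gradients.mean(axis=(1, 2))                      # Step 1: a_k by global average pooling
    cam = np.tensordot(weights, activations, axes=([0], [0]))  # Step 2: sum_k a_k * A_k, shape (H, W)
    return np.maximum(cam, 0)                                  # ReLU keeps positive influence only

rng = np.random.default_rng(0)
A = rng.normal(size=(8, 7, 7))       # placeholder activations
dY_dA = rng.normal(size=(8, 7, 7))   # placeholder gradients
heatmap = grad_cam(A, dY_dA)
print(heatmap.shape, heatmap.min() >= 0)
```

The resulting heatmap has the spatial size of the feature maps and still needs to be up-sampled to the input resolution before it is overlaid on the image.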

In [None]:
# install package by adding dataset
# you can find the torchsummary package in dataset: https://www.kaggle.com/truthr/torchsummary
# pytorch_grad_cam dataset in https://www.kaggle.com/zhicongliang/gradcam131
# ttach 0.0.3: https://www.kaggle.com/dmitrykonovalov/ttach003
# and I also upload the python-box package in dataset: https://www.kaggle.com/zhicongliang/pythonbox
# and you can find timm in dataset: https://www.kaggle.com/kozodoi/timm-pytorch-image-models
# then you can add these datasets to your notebook
!pip install ../input/torchsummary/torchsummary-1.5.1-py3-none-any.whl
!pip install ../input/pythonbox/python_box-5.4.1-py3-none-any.whl
!pip install ../input/timm-pytorch-image-models/pytorch-image-models-master
!pip install ../input/gradcam131/grad-cam-1.3.1
!pip install ../input/ttach003/ttach-0.0.3-py3-none-any.whl

In [None]:
import os
import pandas as pd
import numpy as np
import tqdm
from PIL import Image
import copy

from sklearn.model_selection import StratifiedKFold

from box import Box

from pytorch_grad_cam import GradCAMPlusPlus
from pytorch_grad_cam.utils.image import show_cam_on_image

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import DataLoader, Dataset
import torchvision.transforms as T
# https://rwightman.github.io/pytorch-image-models/
import timm

import matplotlib.pyplot as plt
plt.style.use('seaborn')

## Step 0. Configuration

Here we define a dictionary to store our parameters.

In [None]:
config = {
    'root': '../input/petfinder-pawpularity-score/',
    'device': 'cuda', # 'cpu' for cpu, 'cuda' for gpu
    'n_splits': 5,
    'seed': 2021,
    'train_batchsize': 64,
    'val_batchsize': 64,
    'epoch': 20,
    'learning_rate': 1e-5,
    'logger_interval': 1,
    'model_name': 'swin_tiny_patch4_window7_224',
    'pretrain_path': '../input/timmswin/swin_tiny_patch4_window7_224.pth',
    'eta_min': 1e-4, # note: larger than learning_rate, so the rate rises from 1e-5 toward 1e-4 within a cycle
    'T_0': 20
}

# transform key to attribute. it will be easier for us to refer to these parameters later
config = Box(config)

## Step 1. Load the data

If we are using a dataset like CIFAR-10, MNIST, SVHN, etc., we can directly use torchvision.datasets. However, if we would like to use our own data, we need to construct a custom Dataset that will help us load the data and perform some basic transformations.

The most important functions of a custom Dataset are `__len__` and `__getitem__`.

The `__len__` function returns the number of elements in the dataset, while `__getitem__` returns, for a given index, an image-label pair that PyTorch can accept.

In [None]:
# define Custom Dataset with pytorch
class PetfinderDataset(Dataset):

    def __init__(self, df, image_size=224, transform=None):
        self._X = df["Id"].values
        self._y = None
        if "Pawpularity" in df.keys():
            self._y = df["Pawpularity"].values
        if not transform:
            # we resize all the image to the same size
            self._transform = T.Compose([
                T.Resize([image_size, image_size]),
                T.ToTensor(), # transform the PIL image type to torch.tensor
            ])
        else:
            self._transform = transform

    def __len__(self):
        return len(self._X)

    def __getitem__(self, idx):
        image_path = self._X[idx]
        # given the index(path), read the raw image, and then transform it
        # image = read_image(image_path)  # this require the latest torchvision version
        image = Image.open(image_path)
        image = self._transform(image)
        # if we have label, then we return the image-label pair (for training)
        # if not, we directly return the image (for testing)
        if self._y is not None:
            label = self._y[idx]
            return image, label
        return image

In [None]:
df = pd.read_csv(os.path.join(config.root, 'train.csv'))
df['Id'] = df["Id"].apply(lambda x: os.path.join(config.root, "train", x + ".jpg")) # we transform the Id to its image path

train_val_set = PetfinderDataset(df)

print('# of data:', len(df))
print('range of label [{}, {}]'.format(df['Pawpularity'].min(), df['Pawpularity'].max()))

In [None]:
# we show some images here
plt.figure(figsize=(12, 12))
for idx in range(16):
    image, label = train_val_set[idx]
    plt.subplot(4, 4, idx+1)
    plt.imshow(image.permute(1, 2, 0))
    plt.axis('off')
    plt.title('Pawpularity: {}'.format(label))

## Step 2. Define Swin-Transformer



In [None]:
class Model(nn.Module):
    def __init__(self, name):
        super(Model, self).__init__()
        self.backbone = timm.create_model(name, 
                                          pretrained=False, # it would be very easy to set it to true
                                                            # but in kaggle we could not use internet to download it
                                          num_classes=0, 
                                          in_chans=3)
        
        state_dict = torch.load(config.pretrain_path, map_location=config.device)['model']
        del state_dict['head.weight'] # in the model, we don't have these two parameters actually
        del state_dict['head.bias']
        
        self.backbone.load_state_dict(state_dict)
        num_features = self.backbone.num_features
        self.fc = nn.Sequential(nn.Dropout(0.5),
                                nn.Linear(num_features, 1)
                               )
        
    def forward(self, x):
        f = self.backbone(x)
        out = self.fc(f)
        return out

In [None]:
import torchsummary
# here we show the summary of the model
model = Model(config.model_name)
torchsummary.summary(model, (3,224,224), device='cpu')

## Step 3. Train Our Model with Cross Validation

In [None]:
def test(model, test_loader):
    model.eval() # turn model into evaluation mode
    test_loss = 0
    with torch.no_grad():
        for data, target in test_loader:
            data, target = data.to(config.device), target.float().to(config.device)/100.
            output = model(data)
            test_loss += F.mse_loss(output.sigmoid().view(-1), 
                                    target.view(-1), reduction='sum').item()  # sum up batch loss

    test_loss /= len(test_loader.dataset)
    return np.sqrt(test_loss) # RMSE 

In [None]:
def mixup(x, y, alpha=1):
    assert alpha > 0, "alpha should be larger than 0"
    assert x.size(0) > 1, "Mixup cannot be applied to a single instance."
    
    lam = np.random.beta(alpha, alpha)
#     for the shape of lam, run the following two lines
#     import seaborn as sns
#     sns.distplot(np.random.beta(0.5,0.5, 1000), bins=100)
    rand_index = torch.randperm(x.size()[0]) # random permutation of images in the batch x
    mixed_x = lam * x + (1-lam) * x[rand_index, :]
    target_a, target_b = y, y[rand_index]
    return mixed_x, target_a, target_b, lam
    

In [None]:
# we split the dataset into folds for cross-validation
# here we treat the label "Pawpularity" as categorical data and use StratifiedKFold,
# although it is actually numerical data
skf = StratifiedKFold(
    n_splits=config.n_splits, shuffle=True, random_state=config.seed
)

In [None]:
# we keep record of the training in each fold
train_losses_fold = []
val_losses_fold = []
best_model_fold = []
learning_rate_fold = []

for fold, (train_idx, val_idx) in enumerate(skf.split(df["Id"], df["Pawpularity"])):
    
    print('================================ CV fold {} ================================'.format(fold))
    
    train_df = df.loc[train_idx].reset_index(drop=True)
    val_df = df.loc[val_idx].reset_index(drop=True)
    
    # we would like to apply some random transformations to our training data such that
    # our model can be more robust against different patterns in out-of-sample data
    train_transform = T.Compose([
        T.Resize([224, 224]), # resize the image to 3*224*224
        T.RandomHorizontalFlip(), # random flip the image horizontally
        T.RandomVerticalFlip(), # random flip the image vertically
        T.RandomAffine(15, translate=(0.1, 0.1), scale=(0.9, 1.1)), # Random affine transformation of the image keeping center invariant.
        T.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1), # randomly changes the brightness, saturation, and other properties of an image
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    # in validation set, we only convert our data to torch.float and do a normalization
    val_transform = T.Compose([
        T.Resize([224, 224]),
        T.ToTensor(),
        T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    ])
    
    train_set = PetfinderDataset(train_df, transform=train_transform)
    val_set = PetfinderDataset(val_df, transform=val_transform)
    
    # then we define the dataloader for training and validation
    # it tells the machine how to sample from our training/validation set
    train_loader = DataLoader(train_set, batch_size=config.train_batchsize, shuffle=True, num_workers=4)
    val_loader = DataLoader(val_set, batch_size=config.val_batchsize, num_workers=4)
    
    model = Model(config.model_name).to(config.device) # use GPU to accelerate the training. Kaggle gives us 30h every week.
    optimizer = optim.AdamW(model.parameters(), lr=config.learning_rate)
    # we anneal the learning rate along a cosine curve and restart it every T_0 epochs
    scheduler = optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, eta_min=config.eta_min, T_0=config.T_0)
    # https://pytorch.org/docs/stable/generated/torch.nn.BCEWithLogitsLoss.html
    criterion = nn.BCEWithLogitsLoss()
    
    train_losses = []
    val_losses = []
    learning_rates = []
    
    best_val_loss = np.inf
    best_model = None
    
    for epoch in range(config.epoch):
        print('\t=================== Epoch {} ==================='.format(epoch))
        
        model.train() # turn model into training mode
        batch_train_loss = []
        
        # iterate over each batch to update the model
        for batch_idx, (data, target) in tqdm.tqdm(enumerate(train_loader), total=len(train_loader)):
            data, target = data.to(config.device), target.float().to(config.device) / 100. # we transform the label to [0,1]
            optimizer.zero_grad() # very important. without this step, grad will accumulate
            
            if torch.rand(1)[0] < 0.5:
                mix_images, target_a, target_b, lam = mixup(data, target, alpha=0.5)
                output = model(mix_images)
                loss = lam * criterion(output.view(-1), target_a.view(-1)) + (1-lam) * criterion(output.view(-1), target_b)
            else:
                output = model(data)
                loss = criterion(output.view(-1), target.view(-1))
                
            loss.backward()
            optimizer.step() # update the model by the gradient
            
            batch_train_loss.append(loss.item())
        
        if epoch % config.logger_interval == 0:
            train_loss = np.sum(batch_train_loss)/len(train_loader) # BCEWithLogitsLoss loss
            val_loss = test(model, val_loader) * 100 # RMSE loss
            
            train_losses.append(train_loss)
            val_losses.append(val_loss)
            
            print('\t\t train loss: {:.4f}'.format(train_loss))
            print('\t\t val loss: {:.4f} -- best loss: {:.4f}'.format(val_loss, best_val_loss))
            
            # if we get a lower validation loss, then we record the model
            if val_loss < best_val_loss:
                best_val_loss = val_loss
                best_model = copy.deepcopy(model)

            learning_rates.append(optimizer.param_groups[0]['lr'])
                  
        scheduler.step()
    
    train_losses_fold.append(train_losses)
    val_losses_fold.append(val_losses)
    learning_rate_fold.append(learning_rates)
    best_model_fold.append(best_model)
             
    

### how to save our model in kaggle

1. Save the model by using model.save("model_name.h5") or another similar command. (Make sure to use the .h5 extension. That will create a single file for your saved model.) Using this command will save your model in your notebook's memory.

2. Save your notebook by going to Advanced Settings and selecting Always save output. Hit Save and then select Quick Save if you want your notebook to be saved as it is; otherwise it will run the whole notebook and then save it (which might take long depending on your model training phase etc.).

3. Go to notebook viewer (the saved notebook). Go to Output of notebook and create a private (or even public) dataset for that model.

4. Then load that dataset into any of your notebooks. You can load the model by using model = tf.keras.models.load_model("../input/dataset_name/model_name.h5"). You can even download the model file from the dataset for offline use.

I did not try the method above, which is written for Keras models; it is just for your reference. In this notebook we save the PyTorch weights with torch.save instead. https://www.kaggle.com/questions-and-answers/92749

In [None]:
## save the models

for fold, model in enumerate(best_model_fold):
    torch.save(model.state_dict(), 'swin_transformer_fold_{}.h5'.format(fold))

In [None]:
# # my pretrained weighted: https://www.kaggle.com/zhicongliang/pawpularity-swintransformer
# ## load the models
# best_model_fold = []
# for fold in range(5):
#     model = Model(config.model_name).to(config.device)
#     model.load_state_dict(torch.load('../input/pawpularity-swintransformer/swin_transformer_fold_{}.h5'.format(fold), map_location=torch.device(config.device)))
#     best_model_fold.append(model)

## Step 4. Visualize the training/validation curve

In [None]:
train_losses = np.array(train_losses_fold)
val_losses = np.array(val_losses_fold)

In [None]:
index = range(0, config.epoch, config.logger_interval)
fig = plt.figure(figsize=(16,6))
plt.subplot(121)
plt.plot(index, train_losses.mean(axis=0), label='Training Loss')
plt.legend(fontsize=15)
plt.xlabel('Epoch', fontsize=15)
plt.subplot(122)
plt.plot(index, val_losses.mean(axis=0), label='Validation Loss')
plt.legend(fontsize=15)
plt.xlabel('Epoch', fontsize=15)

## Step 5. Visualize the learning rate

In [None]:
learning_rate = np.array(learning_rate_fold)
index = range(0, config.epoch, config.logger_interval)
fig = plt.figure(figsize=(8,6))
plt.plot(index, learning_rate[0,:], label='learning rate')

## Step 6. Grad-Cam

In [None]:
# gradcam reshape_transform for vit
def reshape_transform(tensor, height=7, width=7):
    result = tensor.reshape(tensor.size(0),
                            height, width, tensor.size(2))

    # bring the channels to the first dimension,
    # like in CNNs.
    result = result.permute(0, 3, 1, 2)
    return result

In [None]:
## prepare the images we would like to visualize

from copy import deepcopy

norm_transform = T.Compose([
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])


org_images = [] # store the original image inside the range [0,1]
labels = []     # store the label
images = []     # store the normalized images as the input for the model
for idx in range(16):
    image, label = train_val_set.__getitem__(idx)
    images.append(norm_transform(deepcopy(image)).unsqueeze(0).float().to(config.device))
    org_images.append(image.unsqueeze(0))
    labels.append(label)
    
images = torch.cat(images)
org_images = torch.cat(org_images)
labels = torch.tensor(labels)

In [None]:
# load the first well-trained model and do the prediction
model = best_model_fold[0]
model = model.eval().to(config.device)
logits = model(images)
preds = logits.sigmoid().detach().cpu().squeeze(1).numpy() * 100
labels = labels.cpu().numpy()

In [None]:
## use gradcam for visualization
cam = GradCAMPlusPlus(
            model=model,
            target_layer=model.backbone.layers[-1].blocks[-1].norm1, 
            use_cuda=(config.device == 'cuda'), # use_cuda expects a boolean
            reshape_transform=reshape_transform)

grayscale_cams = cam(input_tensor=images, target_category=None, eigen_smooth=True)
org_images = org_images.numpy().transpose(0, 2, 3, 1)

In [None]:
plt.figure(figsize=(12, 12))
for it, (image, grayscale_cam, pred, label) in enumerate(zip(org_images, grayscale_cams, preds, labels)):
    plt.subplot(4, 4, it + 1)
    visualization = show_cam_on_image(image, grayscale_cam)
    plt.imshow(visualization)
    plt.title('pred: {:.1f} label: {}'.format(pred, label))
    plt.axis('off')

## Step 7. Make submission

In [None]:
df_test = pd.read_csv(os.path.join(config.root, 'test.csv'))
test_id = df_test.index
df_test['Id'] = df_test["Id"].apply(lambda x: os.path.join(config.root, "test", x + ".jpg")) # we transform the Id to its image path

test_transform = T.Compose([
    T.Resize([224, 224]),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
])

test_set = PetfinderDataset(df_test, transform=test_transform)
test_loader = DataLoader(test_set, batch_size=config.val_batchsize, num_workers=4)

# get the testing prediction
test_pred = np.zeros((df_test.shape[0],1))

for model in best_model_fold:
    model.eval() # make sure dropout is disabled at inference time
    fold_preds = []
    with torch.no_grad(): # no gradients are needed for inference
        for data in test_loader:
            data = data.to(config.device)
            output = model(data)
            fold_preds.append(output.sigmoid().to('cpu').numpy() * 100)

    test_pred += np.vstack(fold_preds)

# take the average over folds
test_pred = test_pred / len(best_model_fold)

submission = pd.read_csv(os.path.join(config.root, 'test.csv'))[['Id']]
submission['Pawpularity'] = test_pred
submission

In [None]:
submission.to_csv('submission.csv', index=False)

### Conclusion: 
This notebook gives a score of 18.27021, which ranks 544/1431 on the leaderboard (Oct. 29, 2021).