# CSE527 Homework 5 - 2
**Due date: 11:59 pm EST on Dec. 1, 2022 (Thu.)**

This semester, we will use Google Colab for the assignments, which gives everyone access to resources, such as GPUs, that some of us might not have on our local machines. You will need to use your Stony Brook (*.stonybrook.edu) account for coding and Google Drive to save your results.

Reading for HW5
--------------------------------------------------------------------------------
This time, to understand the task and the network we are using, you need to
understand some papers from Google Research and our nice colleagues:

Understand counting: https://www3.cs.stonybrook.edu/~minhhoai/papers/fewshot_counting_CVPR21.pdf

Understand attention: https://www.youtube.com/watch?v=ptuGllU5SQQ
Stanford CS224N NLP with Deep Learning | Winter 2021 | Lecture 9 - Self-Attention and Transformers

Understand ViT: https://github.com/google-research/vision_transformer
[An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale](https://arxiv.org/abs/2010.11929)

Few-shot counting using Vision Transformer
--------------------------------------------------------------------------------

In this programming assignment, we will train a network to
count objects in an image.
Find the dataset here:
https://drive.google.com/file/d/1QVr5_wRTiRiCDKRVaN_bdb3Rm_IJaD0I/view?usp=sharing
Find the pretrained model parameter here:
https://drive.google.com/file/d/1PujXdKtUAVdrCaEeeTorFO5G7trWvs8S/view?usp=sharing


Put the compressed file inside the data folder and uncompress it.
You will see the following structure:
- gt_density_map_adaptive_384_VarV2
- json_annotationCombined_384_VarV2.json
- Train_Test_Val_FSC_147.json
- images_384_VarV2

The two folders contain 6,146 images and their labels, and the two JSON files are used for data preprocessing. You will not need to modify the data preprocessing part.

This is an experimental model. Your tasks are:
1. Most parts of the model are completed except the \\
  **Fully Connected Layer (MlpBlock - part I)** (20 points), \\
  the **Attention (EncoderBlock - part II )** (30 points),  \\
  and the **Transformer (Encoder - part III)** (30 points). \\
  You need to implement these modules exactly as described so that the pre-trained weights can be loaded.
2. Because the whole ViT Encoder is heavy to finetune with Colab, you will need to modify the code so that **the last 3 encoder layers of the ViT model are not included in the Transformer**. (It is fine if the performance is bad since you will be deleting 3 layers).
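
One possible way to approach task 2 is sketched below. It is only an illustration under stated assumptions: toy `nn.Linear` layers stand in for the 12 encoder blocks, and the helper name `first_layers_state` is hypothetical. The sketch builds a 9-layer `nn.ModuleList` and copies into it only the weights of the first 9 pretrained layers.

```python
import torch
import torch.nn as nn

NUM_PRETRAINED = 12   # layers in the pretrained encoder
NUM_DROPPED = 3       # the assignment asks to leave out the last 3
NUM_KEPT = NUM_PRETRAINED - NUM_DROPPED


def first_layers_state(layers: nn.ModuleList, num_keep: int) -> dict:
    """Keep only the state-dict entries of the first `num_keep` layers.

    ModuleList keys look like '0.weight' or '11.bias', so the layer
    index is the text before the first dot.
    """
    return {
        key: value
        for key, value in layers.state_dict().items()
        if int(key.split(".")[0]) < num_keep
    }


# Toy stand-ins for the encoder blocks (hypothetical, for illustration only).
pretrained_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(NUM_PRETRAINED)])
truncated_layers = nn.ModuleList([nn.Linear(4, 4) for _ in range(NUM_KEPT)])

# The filtered dict has exactly the keys of the 9-layer list, so a strict load works.
truncated_layers.load_state_dict(first_layers_state(pretrained_layers, NUM_KEPT))
print(len(truncated_layers))
```

In the notebook, the same idea would presumably apply to `pretrain_model.transformer.encoder_layers` and a `CountingTransformer` built with `num_layers=9`.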

Note 0: refer to doc/More_Details.pdf for details   \\
Note 1: You need to copy your code into lib/VIT.py, which has the same structure as this Colab notebook \\
Note 2: The model takes a long time to train; let it run overnight and see what best loss value you can get. \\
HINT:
```shell
    (11): EncoderBlock(
      (norm1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (attn): SelfAttention(
        (query): LinearGeneral()
        (key): LinearGeneral()
        (value): LinearGeneral()
        (out): LinearGeneral()
      )
      (dropout): Dropout(p=0.1, inplace=False)
      (norm2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
      (mlp): MlpBlock(
        (fc1): Linear(in_features=768, out_features=3072, bias=True)
        (fc2): Linear(in_features=3072, out_features=768, bias=True)
        (act): GELU()
        (dropout1): Dropout(p=0.1, inplace=False)
        (dropout2): Dropout(p=0.1, inplace=False)
      )
    )
```
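
The `LinearGeneral()` entries in the hint are per-head projections built on `torch.tensordot`. The following minimal sketch traces the tensor shapes through one multi-head attention pass. It uses raw random weight tensors instead of the notebook's classes and reuses the query tensor as key and value, both simplifications made only to keep the shape bookkeeping visible.

```python
import torch

batch, tokens, emb_dim = 1, 5, 768
heads, head_dim = 12, 64            # 12 * 64 == 768

x = torch.randn(batch, tokens, emb_dim)
w_in = torch.randn(emb_dim, heads, head_dim)     # shaped like a (768,) -> (12, 64) projection
w_out = torch.randn(heads, head_dim, emb_dim)    # shaped like a (12, 64) -> (768,) projection

# Contract x's last axis with the weight's first axis: (b, n, 768) -> (b, n, 12, 64)
q = torch.tensordot(x, w_in, dims=([2], [0]))

# Put heads before tokens so each head attends independently: (b, 12, n, 64)
q = q.permute(0, 2, 1, 3)

# Scaled dot-product scores and attention weights: (b, 12, n, n)
scores = torch.matmul(q, q.transpose(-2, -1)) / head_dim ** 0.5
weights = torch.softmax(scores, dim=-1)

# Weighted sum of values, then tokens back in front of heads: (b, n, 12, 64)
ctx = torch.matmul(weights, q).permute(0, 2, 1, 3)

# Contract (heads, head_dim) back into the embedding axis: (b, n, 768)
out = torch.tensordot(ctx, w_out, dims=([2, 3], [0, 1]))
print(q.shape, scores.shape, out.shape)
```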

![img](./doc/model.JPG)

In [None]:

import os
from google.colab import drive
drive.mount('/content/gdrive')

Mounted at /content/gdrive


# Load Data (Unzipping every time can increase the speed)

In [None]:
os.chdir('/content/gdrive/My Drive/Colab Notebooks/CV_2022Fall_HW5/data/')
#!ls
#!unzip CV_HW5_data.zip

os.chdir('/content/gdrive/My Drive/Colab Notebooks/CV_2022Fall_HW5/')
!/opt/bin/nvidia-smi
!ls

Sat Nov 26 23:37:58 2022       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|   0  Tesla T4            Off  | 00000000:00:04.0 Off |                    0 |
| N/A   61C    P8    11W /  70W |      0MiB / 15109MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Proces

In [None]:

"""
# Counting Transformer Model
"""

import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np

# Position Embedding in the VIT
class PositionEmbs(nn.Module):
    def __init__(self, Max_num_patches, emb_dim, dropout_rate=0.1):
        super(PositionEmbs, self).__init__()
        self.pos_embedding = nn.Parameter(torch.randn(1, Max_num_patches, emb_dim))
        if dropout_rate > 0:
            self.dropout = nn.Dropout(dropout_rate)
        else:
            self.dropout = None

    def forward(self, x):
        patches = x.shape[0]
        pos_embedding = self.pos_embedding.squeeze()
        out = x + pos_embedding[0:patches]

        if self.dropout:
            out = self.dropout(out)
        out = out.unsqueeze(0)
        return out


class MlpBlock(nn.Module):
    """ Transformer Feed-Forward Block """
    def __init__(self, in_dim, mlp_dim, out_dim, dropout_rate=0.1):
        super(MlpBlock, self).__init__()
        """
            PART I: 20 points
            FC layer(with dropout) + GELU act layer + FC layer(with dropout)
        """
        """ STUDENT CODE START """
        # Attribute names follow the HINT printout (fc1, fc2, act, dropout1,
        # dropout2) so that the pre-trained state-dict keys match.
        self.fc1 = nn.Linear(in_dim, mlp_dim)
        self.fc2 = nn.Linear(mlp_dim, out_dim)
        self.act = nn.GELU()
        self.dropout1 = nn.Dropout(dropout_rate)
        self.dropout2 = nn.Dropout(dropout_rate)
        """ STUDENT CODE END """

    def forward(self, x):

        """ STUDENT CODE START """
        # FC -> GELU -> dropout -> FC -> dropout
        x = self.fc1(x)
        x = self.act(x)
        x = self.dropout1(x)
        x = self.fc2(x)
        out = self.dropout2(x)
        """ STUDENT CODE END """
        return out

# Generalized linear layer used to implement the self-attention projections
class LinearGeneral(nn.Module):
    def __init__(self, in_dim=(768,), feat_dim=(12, 64)):
        super(LinearGeneral, self).__init__()

        self.weight = nn.Parameter(torch.randn(*in_dim, *feat_dim))
        self.bias = nn.Parameter(torch.zeros(*feat_dim))

    def forward(self, x, dims):
        a = torch.tensordot(x, self.weight, dims=dims) + self.bias
        return a

# Multi-head attention
class SelfAttention(nn.Module):
    def __init__(self, in_dim, heads=8, dropout_rate=0.1):
        super(SelfAttention, self).__init__()

        """
            PART II: 40 points
            multihead SelfAttention part with Key, Query, Value
            use LinearGeneral class
        """
        """ STUDENT CODE BEGIN """
        # Attribute names follow the HINT printout (query, key, value, out)
        # so that the pre-trained state-dict keys match.
        self.heads = heads
        self.head_dim = in_dim // heads
        self.scale = self.head_dim ** 0.5

        self.query = LinearGeneral((in_dim,), (self.heads, self.head_dim))
        self.key = LinearGeneral((in_dim,), (self.heads, self.head_dim))
        self.value = LinearGeneral((in_dim,), (self.heads, self.head_dim))
        self.out = LinearGeneral((self.heads, self.head_dim), (in_dim,))
        if dropout_rate > 0:
            self.dropout = nn.Dropout(dropout_rate)
        else:
            self.dropout = None

        """ STUDENT CODE END"""

    def forward(self, x):

        """ STUDENT CODE BEGIN """
        # Project to per-head queries, keys and values: (b, n, heads, head_dim)
        q = self.query(x, dims=([2], [0]))
        k = self.key(x, dims=([2], [0]))
        v = self.value(x, dims=([2], [0]))

        # Move heads before tokens: (b, heads, n, head_dim)
        q = q.permute(0, 2, 1, 3)
        k = k.permute(0, 2, 1, 3)
        v = v.permute(0, 2, 1, 3)

        # Scaled dot-product attention weights: (b, heads, n, n)
        wt = torch.matmul(q, k.transpose(-2, -1)) / self.scale
        wt = F.softmax(wt, dim=-1)
        if self.dropout:
            wt = self.dropout(wt)

        # Weighted sum of values, then merge heads back: (b, n, in_dim)
        out = torch.matmul(wt, v)
        out = out.permute(0, 2, 1, 3)
        out = self.out(out, dims=([2, 3], [0, 1]))

        """ STUDENT CODE END """
        return out

# Encoder LayerNorm -> Self-attention -> Layernorm -> MLP
class EncoderBlock(nn.Module):
    def __init__(self, in_dim, mlp_dim, num_heads, dropout_rate=0.1, attn_dropout_rate=0.1):
        super(EncoderBlock, self).__init__()

        self.norm1 = nn.LayerNorm(in_dim)
        self.attn = SelfAttention(in_dim, heads=num_heads, dropout_rate=attn_dropout_rate)
        if dropout_rate > 0:
            self.dropout = nn.Dropout(dropout_rate)
        else:
            self.dropout = None
        self.norm2 = nn.LayerNorm(in_dim)
        self.mlp = MlpBlock(in_dim, mlp_dim, in_dim, dropout_rate)

    def forward(self, x):
        residual = x
        out = self.norm1(x)
        out = self.attn(out)
        if self.dropout:
            out = self.dropout(out)
        out += residual
        residual = out

        out = self.norm2(out)
        out = self.mlp(out)
        out += residual
        return out

# 12 Transformer layers
class Encoder(nn.Module):
    def __init__(self, emb_dim, mlp_dim, num_layers=12, num_heads=12, dropout_rate=0.1, attn_dropout_rate=0.0, Max_num_patches=512):
        super(Encoder, self).__init__()

        # positional embedding
        self.pos_embedding = PositionEmbs(Max_num_patches, emb_dim, dropout_rate)

        # encoder blocks
        in_dim = emb_dim
        self.encoder_layers = nn.ModuleList()
        """
            PART III: 20 points
            Apply 12 layers of transformer and one LayerNorm
        """
        """ STUDENT CODE START """
        for i in range(num_layers):
            l = EncoderBlock(in_dim, mlp_dim, num_heads, dropout_rate, attn_dropout_rate)
            self.encoder_layers.append(l)
        self.norm = nn.LayerNorm(in_dim)
        """ STUDENT CODE END """


    def forward(self, x):

        out = self.pos_embedding(x)

        """ STUDENT CODE START """
        for l in self.encoder_layers:
            out = l(out)
        out = self.norm(out)
        """ STUDENT CODE END """
        return out

# Upsampling and conv
class Decoder(nn.Module):
    def __init__(self):
        super(Decoder, self).__init__()

        self.decoder = nn.Sequential(
            nn.Conv2d(768, 96, 7, padding=3),
            nn.ReLU(),

            nn.UpsamplingBilinear2d(scale_factor=2),
            nn.Conv2d(96, 64, 5, padding=2),
            nn.ReLU(),
            nn.UpsamplingBilinear2d(scale_factor=2),
            nn.Conv2d(64, 32, 3, padding=1),
            nn.ReLU(),
            nn.UpsamplingBilinear2d(scale_factor=2),
            nn.Conv2d(32, 1, 1),
            nn.ReLU(),
            nn.UpsamplingBilinear2d(scale_factor=2),
        )
    # 4 * 4 * 768 -> 8 * 8 * 96 ->
    def forward(self, input, H, W, B):
        input = input.reshape(B, -1, H, W)
        out = self.decoder(input)
        return out

class CountingTransformer(nn.Module):
    def __init__(self,
                 patch_size=(16, 16),
                 emb_dim=768,
                 mlp_dim=3072,
                 num_heads=12,
                 num_layers=12,
                 attn_dropout_rate=0.0,
                 dropout_rate=0.1,
                 feat_dim=None,
                 Max_num_patches=2000):
        super(CountingTransformer, self).__init__()
        # embedding layer
        self.patch_size = patch_size
        fh, fw = self.patch_size

        # Image_patch -> token
        self.embedding = nn.Conv2d(3, emb_dim, kernel_size=(fh, fw), stride=(fh, fw))

        # image_patch -> token

        # Encoder
        self.transformer = Encoder(
            emb_dim=emb_dim,
            mlp_dim=mlp_dim,
            num_layers=num_layers,
            num_heads=num_heads,
            dropout_rate=dropout_rate,
            attn_dropout_rate=attn_dropout_rate,
            Max_num_patches=Max_num_patches)

        # Decoder
        self.decoder = Decoder()

    # Image and positive patches
    def forward(self, x, positive_token):

        # x: input image
        # positive_token: positive exemplar patches

        # Image shape and patch size
        imh = x.shape[2]
        imw = x.shape[3]
        fh, fw = self.patch_size

        # Patch num
        patch_num = int((imh / fh) * (imw / fw))

        # H, W is the size to reshape at the decoder
        H = int(x.shape[2] / fh)
        W = int(x.shape[3] / fw)
        B = x.shape[0]

        # (H * W) * token_dim -> H * W * C(token_dim)
        # 32 * 600 -> 8 * 4 * 600

        # Embed an image patch into a token
        emb = self.embedding(x)     # (n, c, gh, gw)
        # Embed positive exemplar
        pos_emb = self.embedding(positive_token)

        # Change the shape
        emb = emb.permute(0, 2, 3, 1)  # (n, gh, gw, c)
        b, h, w, c = emb.shape
        emb = emb.reshape(b, h * w, c)
        emb = emb.squeeze()
        pos_emb = pos_emb.squeeze()
        emb = torch.cat((emb, pos_emb))

        # Encoder with image patches and pos exemplar
        feat = self.transformer(emb)

        # Drop the exemplar embedding
        feat_throw_pos = feat[0,0:patch_num,:]

        # Density map
        density_map = self.decoder(feat_throw_pos, H, W, B)

        return density_map


#Check the pipeline
model = CountingTransformer(num_layers=12)
x = torch.randn((1, 3, 384, 256))
token = torch.randn((3, 3, 16, 16))
out = model(x, token)
print(out.shape)
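
The shape printed by the pipeline check can be worked out by hand from the code above. The sketch below does that arithmetic in plain Python; the function name `counting_shapes` is hypothetical, and the factor of 2 per stage is read off the four `UpsamplingBilinear2d(scale_factor=2)` layers in `Decoder`.

```python
def counting_shapes(im_h, im_w, patch=16, num_exemplars=3, num_upsample=4):
    """Token counts and density-map size for one image.

    Returns (grid_h, grid_w, patch_num, total_tokens, (out_h, out_w)).
    """
    grid_h, grid_w = im_h // patch, im_w // patch   # patch grid fed to the decoder
    patch_num = grid_h * grid_w                     # image tokens
    total_tokens = patch_num + num_exemplars        # image tokens + exemplar tokens
    scale = 2 ** num_upsample                       # four x2 upsampling stages
    return grid_h, grid_w, patch_num, total_tokens, (grid_h * scale, grid_w * scale)


# Same sizes as the pipeline check: a 384 x 256 image and 3 exemplar patches.
print(counting_shapes(384, 256))  # -> (24, 16, 384, 387, (384, 256))
```

Because the four upsampling stages undo the 16x reduction of the patch embedding, the density map has the same height and width as the input image. The total token count (387 here) also has to stay within `Max_num_patches`, since `PositionEmbs` slices the first `patches` rows of its embedding table.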


# Training
## Dataloader and Model

In [None]:
%cd /content/gdrive/MyDrive/Basak_Hritam_114783055_HW5_2
import cv2
import json
import copy
import os
import torch
import datetime
import numpy as np
import torch.optim as optim
import torch.nn.functional as F

from lib.dataset import FscBgDataset

os.chdir('/content/gdrive/My Drive/Colab Notebooks/CV_2022Fall_HW5/')
DATA_DIR = '/content/gdrive/My Drive/Colab Notebooks/CV_2022Fall_HW5/data/'
BATCH_SIZE = 16
PATCH_SIZE = (16,16)
Max_num_patches = 2500
LR = 5e-6
if torch.cuda.is_available():
    device = "cuda:0"
else:
    device = 'cpu'

/content/gdrive/MyDrive/Basak_Hritam_114783055_HW5_2


In [None]:

from torch.utils.data import DataLoader
train_dataset = FscBgDataset(DATA_DIR, 'train')
val_dataset = FscBgDataset(DATA_DIR, 'val')
test_dataset = FscBgDataset(DATA_DIR, 'test')
im_idx = np.arange(len(train_dataset))
train_loader = DataLoader(im_idx, shuffle=True, batch_size=BATCH_SIZE)
im_idx = np.arange(len(val_dataset))
val_loader = DataLoader(im_idx, shuffle=True, batch_size=BATCH_SIZE)
im_idx = np.arange(len(test_dataset))
test_loader = DataLoader(im_idx, shuffle=True, batch_size=BATCH_SIZE)

In [None]:

from lib.VIT import VisionTransformer
pretrain_path = '/content/gdrive/MyDrive/Basak_Hritam_114783055_HW5_2/model_save/imagenet21k+imagenet2012_ViT-B_16.pth'
pretrain_model = VisionTransformer(
          image_size=(384, 384),
          patch_size=(16, 16),
          emb_dim=768,
          mlp_dim=3072,
          num_heads=12,
          num_layers=12,
          num_classes=1000,
          attn_dropout_rate=0,
          dropout_rate=0.1)

#Load pre-trained VIT backbone
sdict = torch.load(pretrain_path)
pretrain_model.load_state_dict(sdict['state_dict'])


#Counting Transformer
model = CountingTransformer(num_layers=12, Max_num_patches=Max_num_patches)
model.transformer.encoder_layers.load_state_dict(pretrain_model.transformer.encoder_layers.state_dict())
model.to(device)
criterion = nn.MSELoss().to(device)
optimizer = optim.Adam(model.parameters(), lr=LR)

RuntimeError: ignored
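
The traceback for the `RuntimeError` above is not shown, so its cause is unknown here. One common reason `load_state_dict` raises a `RuntimeError` is that parameter names or shapes in the target module differ from those in the checkpoint. A minimal sketch of how such a mismatch could be diagnosed follows; the helper `compare_state_dicts` and the two toy modules are hypothetical and exist only for illustration.

```python
import torch.nn as nn


def compare_state_dicts(target: nn.Module, source_state: dict):
    """Return (missing, unexpected, mismatched) key lists for a planned load."""
    target_state = target.state_dict()
    missing = sorted(set(target_state) - set(source_state))
    unexpected = sorted(set(source_state) - set(target_state))
    mismatched = sorted(
        key
        for key in set(target_state) & set(source_state)
        if target_state[key].shape != source_state[key].shape
    )
    return missing, unexpected, mismatched


class Reference(nn.Module):
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(4, 8)
        self.act = nn.GELU()


class Renamed(nn.Module):
    def __init__(self):
        super().__init__()
        self.first = nn.Linear(4, 8)   # same layer, different attribute name
        self.act = nn.GELU()


missing, unexpected, mismatched = compare_state_dicts(Renamed(), Reference().state_dict())
print(missing)      # ['first.bias', 'first.weight']
print(unexpected)   # ['fc1.bias', 'fc1.weight']
print(mismatched)   # []
```

Running a check like this on `model.transformer.encoder_layers` against the pretrained encoder's state dict would list any attribute names that differ from the HINT printout.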

## Train and test

In [None]:
from lib.utils import TransformTrain

def train():
    SSE = 0
    SAE = 0
    train_mae = 0
    train_rmse = 0
    train_loss = 0
    cnt = 0
    starttime = datetime.datetime.now()
    for idx_batch in train_loader:
      optimizer.zero_grad()
      if debug:
          if cnt > 10:
            break
      for idx in idx_batch:
        cnt += 1
        train_sample = train_dataset[idx]
        im_id, image, boxes, dots, bg_mask_img, density = train_sample['im_id'], train_sample['image'], train_sample['boxes'], train_sample['dots'], train_sample['bg_mask_img'], train_sample['gt_density']
        sample = {'image': image, 'lines_boxes': boxes, 'gt_density': density}
        sample = TransformTrain(sample)
        image, boxes, GT_density = sample['image'].to(device), sample['boxes'].to(device), sample['gt_density'].to(device)
        boxes = boxes.squeeze()

        # Get Positive Token Normal Scale
        Positive_Token = None
        for box_idx in range(boxes.shape[0]):
          box = boxes[box_idx]
          temp_Positive_Token = image[:,int(box[1]):int(box[3]),int(box[2]):int(box[4])]
          temp_Positive_Token = temp_Positive_Token.unsqueeze(0)

          # reshape interpolate
          temp_Positive_Token = F.interpolate(temp_Positive_Token, size=(PATCH_SIZE[0], PATCH_SIZE[1]), mode='bilinear')
          if Positive_Token is None:
            Positive_Token = temp_Positive_Token
          else:
            Positive_Token = torch.cat((Positive_Token, temp_Positive_Token))

        # Get scaling pos token
        scaling_para = [0.8, 0.9, 1/0.9, 1/0.8]
        for scaling in scaling_para:
          scaled_boxes = boxes / scaling
          scaled_boxes = scaled_boxes.squeeze()
          scaled_boxes[:, 1:3] = torch.floor(scaled_boxes[:, 1:3])
          scaled_boxes[:, 3:5] = torch.ceil(scaled_boxes[:, 3:5])
          scaled_boxes[:, 3:5] = scaled_boxes[:, 3:5] + 1
          scaled_boxes[:, 3] = torch.clamp_max(scaled_boxes[:, 3], image.shape[1] - 1)
          scaled_boxes[:, 4] = torch.clamp_max(scaled_boxes[:, 4], image.shape[2] - 1)
          scaled_boxes[:, 1:3] = torch.clamp_min(scaled_boxes[:, 1:3], 0)
          scaled_boxes = scaled_boxes.squeeze()

          for box_idx in range(scaled_boxes.shape[0]):
            box = scaled_boxes[box_idx]
            temp_Positive_Token = image[:,int(box[1]):int(box[3]),int(box[2]):int(box[4])]
            temp_Positive_Token = temp_Positive_Token.unsqueeze(0)
            if temp_Positive_Token.shape[2]!=0 and temp_Positive_Token.shape[3]!=0:
              temp_Positive_Token = F.interpolate(temp_Positive_Token, size=(PATCH_SIZE[0], PATCH_SIZE[1]), mode='bilinear')
              Positive_Token = torch.cat((Positive_Token, temp_Positive_Token))

        if torch.cuda.is_available():
            image = image.cuda()
            GT_density = GT_density.cuda()
            Positive_Token = Positive_Token.cuda()

        # Feed to the Network
        # Reshape the image to get int patch num
        image = image.unsqueeze(0)
        if image.shape[-1] % PATCH_SIZE[0] != 0 or image.shape[-2] % PATCH_SIZE[0] != 0:
          new_h = (image.shape[-2] // PATCH_SIZE[0]) * 16
          new_w = (image.shape[-1] // PATCH_SIZE[1]) * 16
          image = F.interpolate(image, size=(new_h, new_w), mode='bilinear')

        gt_cnt = dots.shape[0]
        out_density = model(image, Positive_Token)

        #
        GT_density = F.interpolate(GT_density, size=(out_density.shape[2], out_density.shape[3]), mode='bilinear')
        GT_density = GT_density*(gt_cnt/GT_density.sum())

        #Loss 1
        loss = criterion(out_density, GT_density) + 1e-3 * (out_density.sum() - gt_cnt) ** 2

        #Loss 2
        #loss = criterion(out_density, GT_density)

        train_loss += loss.item()

        #Error
        rec_output = np.maximum(out_density.detach().cpu(), 0)
        pred_cnt = rec_output.sum().item()
        err = abs(gt_cnt - pred_cnt)
        train_mae += err
        train_rmse += (err**2)
        SAE += err
        SSE += err**2
        loss.backward()
        endtime = datetime.datetime.now()
      optimizer.step()
    print('TRAIN MAE: {:6.2f}, TRAIN RMSE: {:6.2f}, Train Loss: {}, Running Time:{}'.format(SAE/cnt, (SSE/cnt)**0.5, train_loss/cnt, endtime - starttime))
    MAE = SAE/cnt
    RMSE = (SSE/cnt)**0.5
    TRAIN_LOSS = train_loss/cnt
    return MAE, RMSE, TRAIN_LOSS

def val():
    SSE = 0
    SAE = 0
    train_mae = 0
    train_rmse = 0
    train_loss = 0
    cnt = 0
    starttime = datetime.datetime.now()
    for idx_batch in val_loader:
      if debug:
          if cnt > 10:
            break
      for idx in idx_batch:
        cnt += 1
        val_sample = val_dataset[idx]
        im_id, image, boxes, dots, bg_mask_img, density = val_sample['im_id'], val_sample['image'], val_sample['boxes'], val_sample['dots'], val_sample['bg_mask_img'], val_sample['gt_density']
        sample = {'image': image, 'lines_boxes': boxes, 'gt_density': density}
        sample = TransformTrain(sample)
        image, boxes, GT_density = sample['image'].to(device), sample['boxes'].to(device), sample['gt_density'].to(device)
        boxes = boxes.squeeze()

        #Get Positive Token Normal Scale
        Positive_Token = None
        for box_idx in range(boxes.shape[0]):
          box = boxes[box_idx]
          temp_Positive_Token = image[:,int(box[1]):int(box[3]),int(box[2]):int(box[4])]
          temp_Positive_Token = temp_Positive_Token.unsqueeze(0)
          temp_Positive_Token = F.interpolate(temp_Positive_Token, size=(PATCH_SIZE[0], PATCH_SIZE[1]), mode='bilinear')
          if Positive_Token is None:
            Positive_Token = temp_Positive_Token
          else:
            Positive_Token = torch.cat((Positive_Token, temp_Positive_Token))

        #Get scaling pos token
        scaling_para = [0.8, 0.9, 1/0.9, 1/0.8]
        for scaling in scaling_para:
          scaled_boxes = boxes / scaling
          scaled_boxes = scaled_boxes.squeeze()
          scaled_boxes[:, 1:3] = torch.floor(scaled_boxes[:, 1:3])
          scaled_boxes[:, 3:5] = torch.ceil(scaled_boxes[:, 3:5])
          scaled_boxes[:, 3:5] = scaled_boxes[:, 3:5] + 1
          scaled_boxes[:, 3] = torch.clamp_max(scaled_boxes[:, 3], image.shape[1] - 1)
          scaled_boxes[:, 4] = torch.clamp_max(scaled_boxes[:, 4], image.shape[2] - 1)
          scaled_boxes[:, 1:3] = torch.clamp_min(scaled_boxes[:, 1:3], 0)
          scaled_boxes = scaled_boxes.squeeze()

          for box_idx in range(scaled_boxes.shape[0]):
            box = scaled_boxes[box_idx]
            temp_Positive_Token = image[:,int(box[1]):int(box[3]),int(box[2]):int(box[4])]
            temp_Positive_Token = temp_Positive_Token.unsqueeze(0)
            if temp_Positive_Token.shape[2]!=0 and temp_Positive_Token.shape[3]!=0:
              temp_Positive_Token = F.interpolate(temp_Positive_Token, size=(PATCH_SIZE[0], PATCH_SIZE[1]), mode='bilinear')
              Positive_Token = torch.cat((Positive_Token, temp_Positive_Token))

        if torch.cuda.is_available():
            image = image.cuda()
            GT_density = GT_density.cuda()
            Positive_Token = Positive_Token.cuda()

        #Feed to the Network
        #Reshape the image to get int patch num
        image = image.unsqueeze(0)
        if image.shape[-1] % PATCH_SIZE[0] != 0 or image.shape[-2] % PATCH_SIZE[0] != 0:
          new_h = (image.shape[-2] // PATCH_SIZE[0]) * 16
          new_w = (image.shape[-1] // PATCH_SIZE[1]) * 16
          image = F.interpolate(image, size=(new_h, new_w), mode='bilinear')
        gt_cnt = dots.shape[0]
        out_density = model(image, Positive_Token)
        GT_density = F.interpolate(GT_density, size=(out_density.shape[2], out_density.shape[3]), mode='bilinear')
        GT_density = GT_density*(gt_cnt/GT_density.sum())

        #Error
        rec_output = np.maximum(out_density.detach().cpu(), 0)
        pred_cnt = rec_output.sum().item()
        err = abs(gt_cnt - pred_cnt)
        train_mae += err
        train_rmse += (err**2)
        SAE += err
        SSE += err**2
        endtime = datetime.datetime.now()
    print('VAL MAE: {:6.2f}, VAL RMSE: {:6.2f}, Running Time: {}'.format(SAE/cnt, (SSE/cnt)**0.5, endtime - starttime))
    MAE = SAE/cnt
    RMSE = (SSE/cnt)**0.5
    return MAE, RMSE

def test():
    SSE = 0
    SAE = 0
    train_mae = 0
    train_rmse = 0
    train_loss = 0
    cnt = 0
    starttime = datetime.datetime.now()
    for idx_batch in test_loader:
      if debug:
          if cnt > 10:
            break
      for idx in idx_batch:
        cnt += 1
        test_sample = test_dataset[idx]
        im_id, image, boxes, dots, bg_mask_img, density = test_sample['im_id'], test_sample['image'], test_sample['boxes'], test_sample['dots'], test_sample['bg_mask_img'], test_sample['gt_density']
        sample = {'image': image, 'lines_boxes': boxes, 'gt_density': density}
        sample = TransformTrain(sample)
        image, boxes, GT_density = sample['image'].to(device), sample['boxes'].to(device), sample['gt_density'].to(device)
        boxes = boxes.squeeze()

        #Get Positive Token Normal Scale
        Positive_Token = None
        for box_idx in range(boxes.shape[0]):
          box = boxes[box_idx]
          temp_Positive_Token = image[:,int(box[1]):int(box[3]),int(box[2]):int(box[4])]
          temp_Positive_Token = temp_Positive_Token.unsqueeze(0)
          temp_Positive_Token = F.interpolate(temp_Positive_Token, size=(PATCH_SIZE[0], PATCH_SIZE[1]), mode='bilinear')
          if Positive_Token is None:
            Positive_Token = temp_Positive_Token
          else:
            Positive_Token = torch.cat((Positive_Token, temp_Positive_Token))

        #Get scaling pos token
        scaling_para = [0.8, 0.9, 1/0.9, 1/0.8]
        for scaling in scaling_para:
          scaled_boxes = boxes / scaling
          scaled_boxes = scaled_boxes.squeeze()
          scaled_boxes[:, 1:3] = torch.floor(scaled_boxes[:, 1:3])
          scaled_boxes[:, 3:5] = torch.ceil(scaled_boxes[:, 3:5])
          scaled_boxes[:, 3:5] = scaled_boxes[:, 3:5] + 1
          scaled_boxes[:, 3] = torch.clamp_max(scaled_boxes[:, 3], image.shape[1] - 1)
          scaled_boxes[:, 4] = torch.clamp_max(scaled_boxes[:, 4], image.shape[2] - 1)
          scaled_boxes[:, 1:3] = torch.clamp_min(scaled_boxes[:, 1:3], 0)
          scaled_boxes = scaled_boxes.squeeze()

          for box_idx in range(scaled_boxes.shape[0]):
            box = scaled_boxes[box_idx]
            temp_Positive_Token = image[:,int(box[1]):int(box[3]),int(box[2]):int(box[4])]
            temp_Positive_Token = temp_Positive_Token.unsqueeze(0)
            if temp_Positive_Token.shape[2]!=0 and temp_Positive_Token.shape[3]!=0:
              temp_Positive_Token = F.interpolate(temp_Positive_Token, size=(PATCH_SIZE[0], PATCH_SIZE[1]), mode='bilinear')
              Positive_Token = torch.cat((Positive_Token, temp_Positive_Token))

        if torch.cuda.is_available():
            image = image.cuda()
            GT_density = GT_density.cuda()
            Positive_Token = Positive_Token.cuda()

        #Feed to the Network
        #Reshape the image to get int patch num
        image = image.unsqueeze(0)
        if image.shape[-1] % PATCH_SIZE[0] != 0 or image.shape[-2] % PATCH_SIZE[0] != 0:
          new_h = (image.shape[-2] // PATCH_SIZE[0]) * 16
          new_w = (image.shape[-1] // PATCH_SIZE[1]) * 16
          image = F.interpolate(image, size=(new_h, new_w), mode='bilinear')
        gt_cnt = dots.shape[0]
        out_density = model(image, Positive_Token)
        GT_density = F.interpolate(GT_density, size=(out_density.shape[2], out_density.shape[3]), mode='bilinear')
        GT_density = GT_density*(gt_cnt/GT_density.sum())

        #Error
        rec_output = np.maximum(out_density.detach().cpu(), 0)
        pred_cnt = rec_output.sum().item()
        err = abs(gt_cnt - pred_cnt)
        train_mae += err
        train_rmse += (err**2)
        SAE += err
        SSE += err**2
        endtime = datetime.datetime.now()
    print('TEST MAE: {:6.2f}, TEST RMSE: {:6.2f}, Running Time: {}'.format(SAE/cnt, (SSE/cnt)**0.5, endtime - starttime))
    MAE = SAE/cnt
    RMSE = (SSE/cnt)**0.5
    return MAE, RMSE


EPOCH = 150
debug = True
stats = list()
Log_Save_dir = './'
best_mae, best_rmse = 1e7, 1e7
# Train
for epoch in range(EPOCH):
  if debug:
    if epoch > 5:
      break
  model.train()
  train_mae, train_rmse, train_loss = train()
  model.eval()
  val_mae, val_rmse = val()
  stats.append((train_loss, train_mae, train_rmse, val_mae, val_rmse))
  stats_file = os.path.join(Log_Save_dir, "Log.txt")
  with open(stats_file, 'w') as f:
      for s in stats:
          f.write("%s\n" % ','.join([str(x) for x in s]))
  if best_mae >= val_mae:
    best_mae = val_mae
    best_rmse = val_rmse
    model_save_path = os.path.join(Log_Save_dir, "Best_Model.pth")
    torch.save(model.state_dict(), model_save_path)
  if (epoch + 1) % 5 == 0:
    print("\033[1;31;47mEpoch {}, Avg. Epoch Loss: {} Train MAE: {} Train RMSE: {} Val MAE: {} Val RMSE: {} Best Val MAE: {} Best Val RMSE: {} \033[0m".format(
              epoch + 1, stats[-1][0], stats[-1][1], stats[-1][2], stats[-1][3], stats[-1][4], best_mae, best_rmse))

# Test
model.load_state_dict(torch.load(model_save_path))
model.eval()
test_mae, test_rmse = test()
print("\033[1;33;44mTest MAE {}, Test RMSE {} \033[0m".format(test_mae, test_rmse))

NameError: ignored
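
The `SAE` and `SSE` accumulators in `train()`, `val()` and `test()` produce the mean absolute error and root mean squared error of the predicted counts. As a small worked example with made-up counts (the function name `count_metrics` is hypothetical), the same two metrics can be computed like this:

```python
import math


def count_metrics(gt_counts, pred_counts):
    """MAE and RMSE between ground-truth and predicted object counts."""
    errors = [abs(gt - pred) for gt, pred in zip(gt_counts, pred_counts)]
    n = len(errors)
    mae = sum(errors) / n                                # SAE / cnt
    rmse = math.sqrt(sum(e * e for e in errors) / n)     # (SSE / cnt) ** 0.5
    return mae, rmse


# Illustrative values only: absolute errors are 3, 4, 0, 0.
mae, rmse = count_metrics([10, 20, 30, 40], [13, 16, 30, 40])
print(mae, rmse)  # -> 1.75 2.5
```

RMSE weights large errors more heavily than MAE, so a few images with very wrong counts raise RMSE much more than MAE.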

PART IV: (20 points)

What are your results like? Why do you think the model doesn't perform better? How would you improve this model's accuracy?

Answer these questions using what you have learned from this course and
material provided above.

BONUS: (30 points)

Modify this network according to your explanation, run it again.
How does the loss value change compared with the original model?

When you do this part, save the code and the model in another file named
CV_2022FALL_hw5.bonus.ipynb
and attach a graph of your model, explaining the input and output parts clearly.

## Submission guidelines
---
We will grade your homework based on your submitted notebook file. We will check the notebook for both results and code. Please make sure you run your code and print out the results in the notebook before submitting (we expect to see the results before running your code by ourselves.)

You submit your homework by first creating a ***google shared link*** of a folder for your homework (described below), and then putting that link into the ***text submission section*** of your homework submission on Blackboard. ([How to submit your link?](https://drive.google.com/file/d/16-FlPSiu5n-pRezLfcbAvgYxXtGtrs16))

To generate the ***google shared link***, first create a folder named ***Surname_Givenname_SBUID_hw**** in your Google Drive with your CS account (or your SBU account if you don't have a CS account). The structure of the files in the folder should be exactly the same as the one you downloaded. For instance in this homework:

```
Surname_Givenname_SBUID_hw5
        |---CSE527-22F-HW5.ipynb
```
Note that this folder should be in your Google Drive with your account.

Then right click this folder, click ***Get shareable link***, in the People textfield, enter the TAs' email: ***haoyuwu@cs.stonybrook.edu*** and ***vhnguyen@cs.stonybrook.edu***. Make sure that the TAs who have the link **can edit**, ***not just*** **can view**, and also **uncheck** the **Notify people** box. ([How to share link?](https://drive.google.com/file/d/17R6j6yE8_8vXioOB3nNvbEPzxcI-rr_H) )

***IMPORTANT: Please do not make any modification to the folder and its files after the submission deadline***. (All modifications can be seen by the TAs via the revision history.) Note that in google colab, we will only grade the version of the code right before the timestamp of the submission made in blackboard.

The input and output paths are predefined; **DO NOT** change them (we assume that 'Surname_Givenname_SBUID_hw5' is your working directory, and all the paths are relative to this directory). The image read and write functions are already written for you. All you need to do is to fill in the blanks as indicated to generate proper outputs.


-- DO NOT change the folder structure, please just fill in the blanks. <br>

You are encouraged to post and answer questions on Piazza. Based on the amount of email that we have received in past years, there might be delays in replying to personal emails. Please ask questions on Piazza and send emails only for personal issues.

If you alter the folder structures, the grading of your homework will be significantly delayed and possibly penalized.

Be aware that your code will undergo plagiarism check both vertically and horizontally. Please do your own work.

Late submission penalty: <br>
There will be a 10% penalty per day for late submission. However, you will have 4 days throughout the whole semester to submit late without penalty. Note that the grace period is calculated by days instead of hours. If you submit the homework one minute after the deadline, one late day will be counted. Likewise, if you submit one minute after the deadline, the 10% penalty will be imposed if you are not using the grace period.
