In [1]:
# Again, the goal of this notebook is not to tune or improve accuracy but to understand the concepts behind DETR for object detection. Nothing here is original work; it is referenced material used to understand the concepts.

In [2]:
# Reference: https://colab.research.google.com/drive/1W8-2FOdawjZl3bGIitLKgutFUyBMA84q#scrollTo=92cR0XG1YDrJ&uniqifier=2

Note: The DEtection TRansformer (DETR) is an object detection model developed by the Facebook Research team which cleverly utilizes the Transformer architecture. 

The DETR model consists of a pretrained CNN backbone (such as ResNet), which produces a set of lower-resolution feature maps. These feature maps are flattened into a single sequence of features and added to a positional encoding, and the result is fed into a Transformer consisting of an encoder and a decoder, much like the encoder-decoder Transformer described in the original Transformer paper.

The output of the decoder is then fed into the prediction heads, which are feed-forward networks applied to each of a fixed number of decoder output vectors. Each output vector yields a class prediction and a predicted bounding box. The loss is a bipartite matching loss. The model makes a predefined number of predictions, and all of them are computed in parallel.

<p align="center">
  <img src="https://miro.medium.com/max/967/1*ROEemTct0f47Y2kDlAAF4Q.png" alt>
  <em><p align="center">DETR Architecture</p></em>
</p>

**CNN Backbone:**

Assume that our input image xᵢₘ has height H₀, width W₀, and three input channels. The CNN backbone is a (pretrained) CNN (usually ResNet), which we use to generate C lower-resolution feature maps of width W and height H (in practice, C=2048, W=W₀/32 and H=H₀/32).
This leaves us with C two-dimensional feature maps. Since we will pass these features into a transformer, they must be reformatted so that the encoder can process them as a sequence. This is done by flattening each feature map into a vector of length H⋅W, which gives a sequence of H⋅W tokens with one value per channel (the code below first reduces the channel count from C to the hidden dimension with a 1×1 convolution, `self.conv`). The flattened convolutional features are added to a spatial positional encoding, which can be either learned or predefined.
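
As a rough check on the H=H₀/32 rule, the sketch below traces the spatial size through the five stride-2 stages of a ResNet-50. The function names are hypothetical, and the arithmetic assumes the standard torchvision ResNet-50 layout, in which every stride-2 layer maps a size n to floor((n - 1) / 2) + 1.

```python
def downsample_once(n: int) -> int:
    """Output size of one stride-2 layer (7x7/pad 3, 3x3/pad 1, or 1x1/pad 0)."""
    return (n - 1) // 2 + 1


def backbone_output_size(n: int, num_stride2_stages: int = 5) -> int:
    """Spatial size after the five stride-2 stages (total stride 32)."""
    for _ in range(num_stride2_stages):
        n = downsample_once(n)
    return n


def sequence_length(h0: int, w0: int) -> int:
    """Number of tokens the encoder sees: H * W after flattening."""
    return backbone_output_size(h0) * backbone_output_size(w0)


print(backbone_output_size(800))   # 25, exactly 800 / 32
print(backbone_output_size(100))   # 4, i.e. 100 / 32 rounded up
print(sequence_length(100, 100))   # 16 tokens for a 100x100 input
```

For inputs that are not multiples of 32, the division rounds up, so W₀/32 should be read as approximate.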

**Transformer Architecture:**

The transformer is nearly identical to the original encoder-decoder architecture. The difference is that each decoder layer decodes all N objects in parallel, where N is the predefined number of predictions. The model also learns a set of N object queries, which are learned positional encodings (similar to those used in the encoder).

<p align="center">
  <img src="https://miro.medium.com/max/1400/0*cLjhFcQXKyq4akSO.png" alt>
  <em><p align="center">DETR Architecture</p></em>
</p>


“We observe that each slot learns to specialize on certain areas and box sizes with several operating modes.” — The DETR Authors

An intuitive way of understanding the object queries is to imagine that each object query is a person, and each person can ask the image, via attention, about a certain region. One object query will always ask about what is in the center of the image, another will always ask about what is in the bottom left, and so on.

**The Encoder** consists of $N$ encoder layers. Each encoder layer consists of a Multi-Head Self-Attention (MHSA) layer, an Add & Norm layer, a Feed Forward Neural Network, and another Add & Norm layer. This is nearly identical to the original Transformer Encoder from [2], except that we add our spatial positional encoding only to the Key and Query matrices, not to the Value matrix. Also note that the spatial encoding is added again in the decoder, to the Key matrix of the attention layer that follows the decoder's first MHSA and Normalization layer.
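
To make the "Key and Query only" point concrete, here is a minimal single-head sketch in NumPy. It is an illustration under simplifying assumptions (one head, no Add & Norm, random weights), and the function names are hypothetical. Note that the `SimpleDETR` class in this notebook takes a simpler route and adds the positional encoding to the encoder input once.

```python
import numpy as np


def softmax(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax, shifted for numerical stability."""
    shifted = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(shifted)
    return weights / weights.sum(axis=-1, keepdims=True)


def self_attention_with_pos(x, pos, w_q, w_k, w_v):
    """Single-head self-attention where `pos` is added to queries and keys only."""
    q = (x + pos) @ w_q          # queries see content + position
    k = (x + pos) @ w_k          # keys see content + position
    v = x @ w_v                  # values carry content only
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


rng = np.random.default_rng(0)
seq_len, dim = 16, 8             # e.g. a flattened 4x4 feature map, toy hidden size
x = rng.normal(size=(seq_len, dim))
pos = rng.normal(size=(seq_len, dim))
w_q, w_k, w_v = (rng.normal(size=(dim, dim)) for _ in range(3))

out = self_attention_with_pos(x, pos, w_q, w_k, w_v)
print(out.shape)                 # (16, 8)
```

Because the values never see `pos`, the position information only influences where each token attends, not what is mixed into the output.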

**The decoder** is more complicated than the encoder. The object queries are a set of $N$ vectors that are added to the key and query matrices of the decoder's self-attention layer. The output of the encoder, with the spatial positional encoding added, forms the key matrix of the Multi-Head Attention layer that follows.

The prediction heads consist of two feed-forward networks, which compute class predictions and bounding boxes. Note that the number of predictions is equal to the number of object queries. If the image contains fewer objects than there are object queries, the surplus predictions should output the "no object" class ∅.
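
As a small illustration of how the ∅ class is used when reading out predictions, the sketch below filters a list of per-query logits. It assumes that the last logit (index `num_classes`) stands for "no object", which matches the `num_classes+1` outputs of the classification head in the code below; the helper names are hypothetical.

```python
import math


def softmax(logits):
    """Softmax over a list of floats, shifted for numerical stability."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def keep_detections(all_logits, num_classes):
    """Return (query_index, class_index, probability) for queries that predict an object.

    Assumes index `num_classes` (the last logit) is the 'no object' class.
    """
    kept = []
    for query_index, logits in enumerate(all_logits):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if best != num_classes:
            kept.append((query_index, best, probs[best]))
    return kept


# Three object queries, one real class (index 0) plus 'no object' (index 1).
logits = [[2.0, -1.0], [-3.0, 1.5], [0.2, 0.1]]
print(keep_detections(logits, num_classes=1))
```

Here the second query is dropped because its most likely class is "no object"; the other two are kept with their class probabilities.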

In [4]:
import torch
import torch.nn as nn
from torchvision.models import resnet50


class SimpleDETR(nn.Module):
  """
  Minimal Example of the Detection Transformer model with learned positional embedding
  """

  def __init__(self, num_classes, hidden_dim, num_heads,
               num_enc_layers, num_dec_layers):
    
    super(SimpleDETR,self).__init__()
    self.num_classes = num_classes
    print(f"self.num_classes:{self.num_classes}")
    self.hidden_dim = hidden_dim
    print(f"self.hidden_dim:{self.hidden_dim}")
    self.num_heads = num_heads
    print(f"self.num_heads:{self.num_heads}")
    self.num_enc_layers = num_enc_layers
    print(f"self.num_enc_layers:{self.num_enc_layers}")
    self.num_dec_layers = num_dec_layers 
    print(f"self.num_dec_layers:{self.num_dec_layers}")

    self.backbone = nn.Sequential(
        *list(resnet50(pretrained=True).children())[:-2])
    
    print(f"self.backbone:{self.backbone}")

    self.conv = nn.Conv2d(2048, hidden_dim, 1)

    self.transformer = nn.Transformer(hidden_dim, num_heads,
                                      num_enc_layers, num_dec_layers)
    
    print(f"self.transformer:{self.transformer}")

    self.to_classes = nn.Linear(hidden_dim, num_classes+1)

    print(f"self.to_classes:{self.to_classes}")

    self.to_bbox = nn.Linear(hidden_dim, 4)
    print(f"self.to_bbox:{self.to_bbox}")
    
    # 100 learned object queries, each a vector of size hidden_dim
    self.object_query = nn.Parameter(torch.rand(100, hidden_dim))
    print(f"self.object_query:{self.object_query}")
    
    # Learned row and column positional embeddings (up to 50 each) of size
    # hidden_dim // 2; forward() concatenates them into the spatial encoding
    self.row_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))
    self.col_embed = nn.Parameter(torch.rand(50, hidden_dim // 2))

    print(f"self.row_embed:{self.row_embed}")
    print(f"self.col_embed:{self.col_embed}")


    self.states = dict({'conv_features':None,'H':None,'W':None,
                        'pos_enc':None,'object_query':self.object_query,
                        'pred_classes':None,'pred_bboxes':None})
    
    print(f"self.states:{self.states}")
    
  def forward(self, X):
    X = self.backbone(X)
    print(f"X:{X}")

    h = self.conv(X)
    print(f"h:{h}")

    self.conv_features = h.data
    print(f"conv_features:{self.conv_features}")

    self.states['conv_features'] = h.data
    
    print(f"h.shape:{h.shape}")
    H, W = h.shape[-2:]

    print(f"H,W:{H} ,{W}")
    self.states['H']=H
    self.states['W']=W
    

    pos_enc = torch.cat([
                         self.col_embed[:W].unsqueeze(0).repeat(H,1,1),
                         self.row_embed[:H].unsqueeze(1).repeat(1,W,1)
                         ],
                    dim=-1).flatten(0,1).unsqueeze(1)
    
    print(f"pos_enc:{pos_enc}")

    self.states['pos_enc'] = pos_enc.data
    
    h = self.transformer(pos_enc + h.flatten(2).permute(2,0,1),
                         self.object_query.unsqueeze(1))
    print(f"h:{h}")

    class_pred = self.to_classes(h)
    print(f"class_pred:{class_pred}")

    bbox_pred = self.to_bbox(h).sigmoid()
    print(f"bbox_pred:{bbox_pred}")

    self.states['pred_classes']=class_pred.detach()
    # Use the 'pred_bboxes' key that was declared in __init__
    self.states['pred_bboxes']=bbox_pred.detach()

    return class_pred, bbox_pred

In [5]:
detr = SimpleDETR(1,256,8,6,6)
X = torch.rand(1,3,100,100)
cls, box = detr(X)
box.size()

self.num_classes:1
self.hidden_dim:256
self.num_heads:8
self.num_enc_layers:6
self.num_dec_layers:6
self.backbone:Sequential(
  (0): Conv2d(3, 64, kernel_size=(7, 7), stride=(2, 2), padding=(3, 3), bias=False)
  (1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
  (2): ReLU(inplace=True)
  (3): MaxPool2d(kernel_size=3, stride=2, padding=1, dilation=1, ceil_mode=False)
  (4): Sequential(
    (0): Bottleneck(
      (conv1): Conv2d(64, 64, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv2): Conv2d(64, 64, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
      (bn2): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (conv3): Conv2d(64, 256, kernel_size=(1, 1), stride=(1, 1), bias=False)
      (bn3): BatchNorm2d(256, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
      (relu): ReLU(inplace=

torch.Size([100, 1, 4])

## Bipartite matching loss

Let $\hat{y}=\{\hat{y}_i\}_{i=1}^N$ be the set of predictions, where $\hat{y}_i=(\hat{c}_i,\hat{b}_i)$ is the tuple consisting of the predicted class (which can be the empty class) and a bounding box $\hat{b}_i=(\bar{x}_i,\bar{y}_i,w_i,h_i)$. The bar notation denotes the coordinates of the box center (the midpoint between the box edges), and $w_i$ and $h_i$ are the width and height of the box, respectively.
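
The center-based box format can be converted to corner coordinates with a few lines. This is a minimal sketch with hypothetical helper names; it assumes the box values are normalized to the range [0, 1] relative to the image size, which is consistent with the sigmoid applied to the box head above.

```python
def cxcywh_to_xyxy(box):
    """Convert (center_x, center_y, width, height) to (x_min, y_min, x_max, y_max)."""
    cx, cy, w, h = box
    return (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)


def to_pixels(box_xyxy, image_width, image_height):
    """Scale a normalized corner-format box to pixel coordinates."""
    x0, y0, x1, y1 = box_xyxy
    return (x0 * image_width, y0 * image_height, x1 * image_width, y1 * image_height)


box = (0.5, 0.5, 0.2, 0.4)             # normalized (cx, cy, w, h)
corners = cxcywh_to_xyxy(box)
print(corners)                          # approximately (0.4, 0.3, 0.6, 0.7)
print(to_pixels(corners, 100, 100))     # approximately (40, 30, 60, 70)
```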

# Bipartite matching & Hungarian Algorithm

Let $y$ denote the ground truth set, padded with the empty class $\varnothing$ so that it also has $N$ elements. Suppose that the loss between $y$ and $\hat{y}$ is $L$, and the loss between each $y_i$ and $\hat{y}_i$ is $L_i$. Since we are working on the level of sets, the loss $L$ must be permutation invariant, meaning that we get the same loss regardless of how we order the predictions. Thus, we want to find a permutation $\sigma\in S_N$ which assigns to each ground truth index $i$ the index $\sigma(i)$ of a prediction. Mathematically, we are solving for

$$
\hat{\sigma}=\arg\min_{\sigma\in S_N} \sum_{i=1}^N L_{i}(y_i, \hat{y}_{\sigma(i)}) \tag{1}
$$
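
For intuition, equation (1) can be solved by brute force when $N$ is tiny. The sketch below is only an illustration with a made-up cost matrix and a hypothetical function name; it checks all $N!$ permutations, whereas the Hungarian algorithm finds the same minimum in polynomial time (for example via `scipy.optimize.linear_sum_assignment`).

```python
from itertools import permutations


def best_assignment(cost):
    """Brute-force the permutation sigma minimizing sum_i cost[i][sigma[i]].

    cost[i][j] is the pairwise loss between ground truth i and prediction j.
    Only practical for small N, since it checks all N! permutations.
    """
    n = len(cost)
    best_sigma, best_total = None, float("inf")
    for sigma in permutations(range(n)):
        total = sum(cost[i][sigma[i]] for i in range(n))
        if total < best_total:
            best_sigma, best_total = sigma, total
    return best_sigma, best_total


cost = [
    [4.0, 1.0, 3.0],
    [2.0, 0.0, 5.0],
    [3.0, 2.0, 2.0],
]
sigma, total = best_assignment(cost)
print(sigma, total)   # (1, 0, 2) 5.0
```

Note that the greedy choice (taking the 0.0 entry for the second ground truth) is not part of the optimal assignment, which is why a global matching step is needed.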

Recall that our predictions consist of both a bounding box and a class. Let's now assume that the class prediction is actually a probability distribution over the set of classes (we take the softmax of the output to produce this). Then the total loss for the $i$th prediction is the loss generated by the class prediction plus the loss generated by the bounding box prediction. The authors of [[1]](http://arxiv.org/abs/1906.05909) define this loss as the difference between the bounding box loss and the predicted probability of the ground-truth class:

$$
\mathcal{L}_{match}(y_i,\hat{y}_{\sigma({i})})=
-\mathbb{I}_{\{c_i\neq \varnothing\}}\hat{p}_{\sigma(i)}(c_i) +
\mathbb{I}_{\{c_i\neq \varnothing\}}\mathcal{L}_{box}({b}_i,\hat{b}_{\sigma({i})}) \tag{2}
$$

where $\hat{p}_{\sigma(i)}(c_i)$ is the probability that prediction $\sigma(i)$ assigns to the ground-truth class $c_i$, and $\mathcal{L}_{box}$ is the loss resulting from the bounding box prediction. The above also states that the match loss is $0$ if $c_i=\varnothing$.
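
A matching cost matrix built from this definition might look like the sketch below. It is a simplified illustration: the function names and data layout are hypothetical, and a plain $L_1$ distance stands in for the full box loss.

```python
def l1_distance(box_a, box_b):
    """Sum of absolute coordinate differences between two boxes."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def match_cost_matrix(targets, predictions, box_cost=l1_distance):
    """cost[i][j] = -p_j(c_i) + box_cost(b_i, b_hat_j) for real targets, 0 for padding.

    targets:     list of (class_index, box), or None for a padded 'no object' slot
    predictions: list of (class_probabilities, box)
    """
    cost = []
    for target in targets:
        if target is None:                       # c_i is the empty class: cost is 0
            cost.append([0.0] * len(predictions))
            continue
        class_index, box = target
        cost.append([-probs[class_index] + box_cost(box, pred_box)
                     for probs, pred_box in predictions])
    return cost


targets = [(0, (0.5, 0.5, 0.2, 0.2)), None]
predictions = [
    ([0.1, 0.9], (0.1, 0.1, 0.1, 0.1)),   # mostly 'no object', far away
    ([0.8, 0.2], (0.5, 0.5, 0.2, 0.2)),   # confident and perfectly placed
]
print(match_cost_matrix(targets, predictions))
```

The second prediction gets the lowest cost for the real target, so a minimum-cost assignment pairs them and leaves the first prediction for the padded slot.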

$$
\mathcal{L}_{box}({b}_i,\hat{b}_{\sigma({i})}) =
\lambda_{iou}\mathcal{L}_{iou}({b}_i,\hat{b}_{\sigma({i})}) +
 \lambda_{L1}\|b_i-\hat{b}_{\sigma(i)}\|_1\tag{3}
$$

The box loss is computed as a linear combination of the $L_1$ loss (displacement) and the **Generalized Intersection-Over-Union** (GIOU) loss between the predicted and ground truth bounding boxes. The generalized form matters because plain IOU is zero for any two bounding boxes which don't intersect, no matter how far apart they are, so on its own it provides no meaningful signal in that case.

In the above equation, $\lambda_{iou}$ and $\lambda_{L1}$ are scalar hyperparameters. Notice that this sum combines errors generated from area and from distance. Why does this make sense? Think of equation $(3)$ as a total cost associated with the prediction $\hat{b}_{\sigma(i)}$, where the price of area errors is $\lambda_{iou}$ and the price of distance errors is $\lambda_{L1}$.

# GIOU:

$$
\mathcal{L}_{iou}({b}_i,\hat{b}_{\sigma({i})}) =
1-\Biggl(\frac{|b_i\cap \hat{b}_{\sigma(i)}|}{|b_i\cup\hat{b}_{\sigma(i)}|} - 
\frac{|B(b_i,\hat{b}_{\sigma(i)})\setminus (b_i\cup \hat{b}_{\sigma(i)})|}{|B(b_i,\hat{b}_{\sigma(i)})|} \Biggr)\tag{4}
$$

The first term in the parentheses is the **intersection over union** (IOU) function. The term $B(\cdot,\cdot)$ denotes the *smallest bounding box* containing both of its arguments, and $|\cdot|$ represents area.
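
The GIOU loss can be sketched in plain Python as follows. This is an illustration, not the reference implementation: the function names are hypothetical, and boxes are given in corner format `(x0, y0, x1, y1)` rather than the center format used earlier, because intersections are easier to compute that way.

```python
def area(box):
    """Area of a corner-format box (x0, y0, x1, y1); zero if the box is empty."""
    x0, y0, x1, y1 = box
    return max(0.0, x1 - x0) * max(0.0, y1 - y0)


def giou_loss(box_a, box_b):
    """1 - GIoU for two corner-format boxes with positive area."""
    inter = area((max(box_a[0], box_b[0]), max(box_a[1], box_b[1]),
                  min(box_a[2], box_b[2]), min(box_a[3], box_b[3])))
    union = area(box_a) + area(box_b) - inter
    enclosing = area((min(box_a[0], box_b[0]), min(box_a[1], box_b[1]),
                      max(box_a[2], box_b[2]), max(box_a[3], box_b[3])))
    iou = inter / union
    return 1.0 - (iou - (enclosing - union) / enclosing)


target = (0.0, 0.0, 1.0, 1.0)
print(giou_loss(target, target))                   # 0.0 for identical boxes
print(giou_loss(target, (2.0, 0.0, 3.0, 1.0)))     # disjoint, gap of 1
print(giou_loss(target, (4.0, 0.0, 5.0, 1.0)))     # disjoint, gap of 3: larger loss
```

Both disjoint boxes have an IOU of zero, yet the loss grows with the gap (about 1.33 versus 1.6 here), so the second term still tells the model which way to move the box.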

# Hungarian Loss:

Since we are predicting classes from a given number of known classes, class prediction is a classification problem, and thus we can use cross entropy loss for the class prediction error. We define the Hungarian loss function as the sum of the $N$ prediction losses, evaluated at the optimal assignment $\hat{\sigma}$ from equation $(1)$:

$$\mathcal{L}_{Hungarian}(y,\hat{y})=
\sum_{i=1}^N\Bigl[-\log{ \hat{p}_{\hat{\sigma}(i)}(c_i)} + 
\mathbb{I}_{\{c_i\neq \varnothing\}}\mathcal{L}_{box}({b}_i,\hat{b}_{\hat{\sigma}({i})})\Bigr]\tag{5}$$
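
Once the assignment is known, the Hungarian loss is a straightforward sum. The sketch below is a simplified illustration: the function names and data layout are hypothetical, the assignment `sigma` is supplied by hand, and a plain $L_1$ distance stands in for the full box loss.

```python
import math


def l1_distance(box_a, box_b):
    """Stand-in box loss: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(box_a, box_b))


def hungarian_loss(targets, predictions, sigma, box_loss=l1_distance):
    """Sum over i of -log p_sigma(i)(c_i), plus the box loss for real objects.

    targets:     list of (class_index, box); box is None for the 'no object' class
    predictions: list of (class_probabilities, box)
    sigma:       sigma[i] is the prediction index matched to target i
    """
    total = 0.0
    for i, (class_index, box) in enumerate(targets):
        probs, pred_box = predictions[sigma[i]]
        total += -math.log(probs[class_index])
        if box is not None:                      # indicator: c_i is not the empty class
            total += box_loss(box, pred_box)
    return total


NO_OBJECT = 1                                    # one real class (0) plus 'no object'
targets = [(0, (0.5, 0.5, 0.2, 0.2)), (NO_OBJECT, None)]
predictions = [
    ([0.1, 0.9], (0.1, 0.1, 0.1, 0.1)),
    ([0.8, 0.2], (0.5, 0.5, 0.3, 0.2)),
]
print(hungarian_loss(targets, predictions, sigma=(1, 0)))
```

Unlike the matching cost, the class term here applies to every slot, including the padded "no object" ones, so predictions that are matched to ∅ are still trained to output ∅.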

In [7]:
# There are various pretrained models available for Detr.