# Chapter - Working of RetinaNet

- Topic-1 Introduction to RetinaNet
- Topic-2 Network architecture design
- Topic-3 Anchor box design
- Topic-4 Encoding ground truth boxes
- Topic-5 Calculating loss functions

## Topic-1 Introduction to RetinaNet
RetinaNet is a one-stage object detection framework proposed by Tsung-Yi Lin, Kaiming He and their colleagues at FAIR (Facebook AI Research) in 2017. To understand how RetinaNet works end to end, students need to understand the following two papers written by the team:
- [Feature Pyramid Networks for Object Detection](https://arxiv.org/pdf/1612.03144.pdf) - addresses object localization and the accurate detection of both large and small objects.
- [Focal Loss for Dense Object Detection](https://arxiv.org/pdf/1708.02002.pdf) - addresses the class imbalance problem discussed previously.


Some of the key features of RetinaNet:
- With ResNeXt-101 as the backbone, RetinaNet reaches 40.8 mAP@(0.5:0.95), 61.1 mAP@0.5 and 44.1 mAP@0.75.
- According to the paper, it is faster than the two-stage detectors it was compared with while achieving a higher mAP.

In this chapter, we will look into the deeper workings of RetinaNet.

## Topic-2 Basic Network Architecture Design

We have already seen an architecture like this and learned how object detection is performed on it, so let's recap. Suppose we have an image of size (800 x 800) and a neural network subsamples it to a (50 x 50) feature map, as shown in the diagram below. The subsampling ratio here is 16 (800/50). The feature map then feeds two branches (heads): one predicts the location of the box relative to each anchor, and the other predicts the class of the box.
- The backbone network can be any image classification network, such as ResNet, ResNeXt, DenseNet or VGG.
- As the diagram shows, two parallel networks operate on the output feature map:

**Regression head**:  
It is responsible for predicting the location of the object with respect to the anchor box.  

**Classification head**:  
It is responsible for predicting the class of the object inside the box predicted relative to the anchor box. There is no separate objectness score because each class gets its own sigmoid output, and a sigmoid tells whether a particular object is present or not. So instead of asking which object's center falls in each anchor box, we ask, for every class independently, whether that class (say, cow) is present in the box. For example, suppose this is a 2-class detector (cow and rabbit). If the threshold is 0.5 and the output probabilities after the sigmoids (independent classifiers) are [0.4, 0.7], we can say that a rabbit is present in the predicted box with 0.7 confidence. Unlike softmax, these two values do not have to sum to 1. If the output probabilities are [0.1, 0.2], no class clears the threshold, so the box is treated as background.
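
To make the independent-sigmoid idea concrete, here is a minimal sketch in plain Python. The helper names `sigmoid` and `classes_present` are illustrative only and are not part of any library; the logits are chosen by hand so that the sigmoid outputs reproduce the probabilities from the example above.

```python
import math

def sigmoid(z):
    """Map a raw score (logit) to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def classes_present(logits, class_names, threshold=0.5):
    """Return (class, probability) for every class whose sigmoid clears the threshold."""
    probs = [sigmoid(z) for z in logits]
    return [(name, round(p, 2)) for name, p in zip(class_names, probs) if p >= threshold]

names = ["cow", "rabbit"]
# Hand-picked logits whose sigmoids are 0.4 / 0.7 and 0.1 / 0.2 respectively
logits_a = [math.log(0.4 / 0.6), math.log(0.7 / 0.3)]
logits_b = [math.log(0.1 / 0.9), math.log(0.2 / 0.8)]

print(classes_present(logits_a, names))  # only rabbit clears the threshold
print(classes_present(logits_b, names))  # empty list: treated as background
```

Because every class is scored on its own, more than one class (or none) can clear the threshold for the same box, which is why no separate objectness output is needed.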
    
![retinanet network](../images/retina_01.png)

### Problems with the basic architecture
Although convnets are fairly robust to variation in scale, all the top entries in the ImageNet and COCO challenges have used multi-scale testing on featurized image pyramids. For example, instead of sending only the (800 x 800) image through the network, we also send a (400 x 400) and a (1200 x 1200) version. With a subsampling ratio of 16, the feature map sizes for the (400 x 400), (800 x 800) and (1200 x 1200) images are (25 x 25), (50 x 50) and (75 x 75) respectively. Researchers used this approach because most of the algorithms discussed so far fail badly at detecting small objects, and enlarging or shrinking the image is one way of finding objects at different scales. As discussed in the first chapter, this technique is an image pyramid; we have used only 3 scales here, but more can be used. Image pyramids improve detection accuracy, but at the cost of increased inference time, since the network has to run once per scale.

![image pyramids](../images/retinanet_02.png)


### Feature Pyramids - An effective alternative to image pyramids
If we look carefully at the ResNet architecture, the spatial size of the feature map is halved after every resblock. The diagram below shows the feature map size of ResNet50 after every resblock. An obvious question is why we take features from only one resblock (the subsample-16 feature map) and not from the other resblocks as well. We will answer this shortly; in RetinaNet, with minor modifications to the network, the authors use the other resblocks too, as shown in the diagram below.

![Resnet](../images/retinanet_03.png)

Previously we used only stride 16 to determine the location. By using strides 8, 16, 32, 64 and 128 we get feature maps of different sizes. The intuition is that the feature maps of earlier layers are responsible for detecting small objects, while the feature maps of later layers are effective at finding large objects. One problem is that, after many convolution layers, the later layers lose spatial information about the object but carry strong semantics, whereas the earlier layers keep strong spatial information but lack strong object semantics. As discussed in the paper [Visualizing and Understanding Convolutional Networks](https://arxiv.org/pdf/1311.2901.pdf), in the earlier layers of a convolutional network the model mostly learns edges and colors, which on their own are not enough to identify objects. In the later layers the object semantics are very strong, but because of the repeated convolutions we lose location information. The diagram below is taken from [Visualizing and Understanding Convolutional Networks](https://arxiv.org/pdf/1311.2901.pdf) and clearly shows this phenomenon.  

![Resnet](../images/retinanet_04.png)

So the earlier feature map of size (100 x 100) has good localization capability but lacks the power to identify the object, whereas a later feature map of size (13 x 13) has strong object semantics but poor localization capability (because many convolutions were applied). To solve this problem, the authors came up with an approach called lateral connections using a top-down pathway.


### Lateral connections using a top-down pathway
- Since lower layers have strong localization features and layers higher in the network contain strong semantic features, combining the two gives features that are strong in both localization and semantics. This combination is done using lateral connections, as shown in the diagram.   

![Resnet](../images/retinanet_05.png)

- The green block shown in the diagram is a ResNet architecture. C3, C4, C5 are the outputs of Resblock3, Resblock4 and Resblock5 respectively. They have spatial sizes of 100 x 100, 50 x 50 and 25 x 25, with 512, 1024 and 2048 channels respectively.
- The channels of C3, C4 and C5 are reduced to 256 using a 1 x 1 conv layer. Let's call the results C3_reduced, C4_reduced and C5_reduced respectively.
- A 3 x 3 conv is applied to C5_reduced, which outputs **P5**. The P5 size is 25 x 25 x 256.
- C5_reduced is upsampled from 25 x 25 to 50 x 50 and added to C4_reduced. A 3 x 3 conv is applied to the sum, which outputs **P4**. The P4 size is 50 x 50 x 256.
- The 50 x 50 sum from the previous step (before its 3 x 3 conv) is upsampled to 100 x 100 and added to C3_reduced. A 3 x 3 conv is applied to the sum, which outputs **P3**. The P3 size is 100 x 100 x 256.
- A 3 x 3 conv block with stride 2 is applied to the Resblock5 output (C5), which gives an output size of 13 x 13. This is termed **P6**.
- A 3 x 3 conv block with stride 2 is applied to the P6 output, which gives an output size of 7 x 7. This is termed **P7**.
- The FPN finally gives [P3, P4, P5, P6, P7] as outputs.
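
The shape bookkeeping and the merge step above can be sketched in plain Python. This is an illustration only, not the library code: the helper names are made up for this example, nearest-neighbour upsampling is an assumption about how the coarser map is enlarged, and the 1 x 1 and 3 x 3 convolutions are left out so that only the upsample-and-add step remains visible.

```python
import math

def pyramid_sizes(image_side, levels=(3, 4, 5, 6, 7)):
    """Spatial side of each pyramid level: ceil(image_side / 2**level)."""
    return {f"P{level}": math.ceil(image_side / 2 ** level) for level in levels}

def upsample_nearest_2x(grid):
    """Double the height and width of a 2-D list by repeating every value."""
    out = []
    for row in grid:
        wide = [value for value in row for _ in range(2)]
        out.extend([wide, list(wide)])
    return out

def lateral_merge(top, lateral):
    """Upsample the coarser map and add it element-wise to the finer one."""
    upsampled = upsample_nearest_2x(top)
    return [[u + l for u, l in zip(up_row, lat_row)]
            for up_row, lat_row in zip(upsampled, lateral)]

print(pyramid_sizes(800))  # sides of P3..P7 for an 800 x 800 image

coarse = [[1, 2], [3, 4]]            # stands in for a reduced higher-level map
fine = [[10] * 4 for _ in range(4)]  # stands in for a reduced lower-level map
print(lateral_merge(coarse, fine))
```

Each value of the coarse map is spread over a 2 x 2 patch of the finer map, which is how semantic information from the top of the network reaches the high-resolution levels.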

### RetinaNet - Final Network
Two subnetworks, the classification network and the regression network, operate independently on each of the FPN outputs P3, P4, P5, P6 and P7. As shown in the diagram, each subnetwork consists of four 3 x 3 conv layers that leave the number of channels unchanged (the 256 channels of P3 remain 256 channels at their output). The final 3 x 3 conv layer of the classification network changes the number of channels to n_classes x n_anchors (in the diagram, n_classes = 80 and n_anchors = 9). The final 3 x 3 conv layer of the regression network changes the number of channels to 4 x n_anchors (each box has x, y, w, h predictions, hence 4, and n_anchors = 9). The outputs obtained from the classification and regression networks for P3, P4, P5, P6 and P7 are reshaped and concatenated. For an (800 x 800) input, the network therefore outputs results for 120087 anchor boxes, with the classification network giving the class and the regression network giving the location of the object.

![RetinaNet final network](../images/retinanet_06.png)
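
As a quick sanity check on these numbers, the arithmetic can be written out in plain Python. The feature map sides assume an (800 x 800) input, and `head_output_rows` is a hypothetical helper written for this example.

```python
def head_output_rows(level_sides, n_anchors=9):
    """Number of anchor predictions contributed by each pyramid level."""
    return {name: side * side * n_anchors for name, side in level_sides.items()}

level_sides = {"P3": 100, "P4": 50, "P5": 25, "P6": 13, "P7": 7}
n_classes, n_anchors = 80, 9

cls_channels = n_classes * n_anchors  # channels of the last classification conv
reg_channels = 4 * n_anchors          # channels of the last regression conv

rows = head_output_rows(level_sides, n_anchors)
total = sum(rows.values())

print(cls_channels, reg_channels)
print(rows)
# After reshaping and concatenating all levels:
print("classification output:", (total, n_classes))
print("regression output:", (total, 4))
```

A (100, 100, 720) classification map is reshaped to (90000, 80), and the same is done for every level before concatenation, so each row of the final output corresponds to exactly one anchor.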

- The same network can be built in Keras using the keras_retinanet package described below.
- The package supports several backbone networks.


### Keras_retinanet
keras_retinanet is a GitHub project by Fizyr and team. As the name suggests, it is a RetinaNet implementation in Keras. The [package](https://github.com/fizyr/keras-retinanet) supports various backbone networks. 

| Backbone family | Available networks |
|--------| ------- |
| DenseNet | densenet121, densenet169, densenet201 |
| MobileNet | mobilenet128, mobilenet160, mobilenet192, mobilenet224 |
| VGG | vgg16, vgg19 |
| ResNet | resnet50, resnet101, resnet152 |



To install keras_retinanet in your system, use the following instructions.

- Clone the repository in your system. 
```bash
git clone https://github.com/fizyr/keras-retinanet
```
- Ensure numpy is installed using 
```bash
pip install numpy --user
```

- In the repository, execute 
```bash
pip install . --user
```
Note that, because of inconsistencies in how tensorflow should be installed, this package does not declare a dependency on tensorflow, since pip would then try to install it (which, at least on Arch Linux, results in an incorrect installation). Please make sure tensorflow is installed as per your system's requirements.


Alternatively, you can run the code directly from the cloned repository, however you need to run 
```bash
python setup.py build_ext --inplace
```
to compile Cython code first.
Optionally, install pycocotools if you want to train / test on the MS COCO dataset by running 
```bash
pip install --user git+https://github.com/cocodataset/cocoapi.git#subdirectory=PythonAPI
```

```python
from keras_retinanet import models
import keras

backbone = models.backbone("resnet50")
inputs = keras.layers.Input(shape=(800, 800, 3))
retinanet = backbone.retinanet(80, num_anchors=None, modifier=None, inputs=inputs)
# retinanet.summary()
```


### Quiz
1) What is the total number of trainable params for the densenet121 backbone?  
A) 18,872,628  
B) 18,791,028  (Ans)  
C) 38,021,812   
D) 37,915,572   

```python 
from keras_retinanet import models
import keras 

backbone = models.backbone("densenet121")
inputs = keras.layers.Input(shape=(800, 800, 3))
retinanet = backbone.retinanet(80, num_anchors=None, modifier=None, inputs=inputs)
retinanet.summary()  # summary() prints the table itself and returns None
```

2) Which of the following statements is true about RetinaNet?  
A) RetinaNet uses lateral connections with a top-down pathway  
B) RetinaNet uses lateral connections with a bottom-up pathway  
C) It uses neither a top-down nor a bottom-up pathway  
D) It doesn't use lateral connections  

Ans) A 


3) Which of the following statements are true about RetinaNet?  
A) P3 is responsible for detecting smaller objects than P4, P5, P6, P7  
B) P2 is computationally expensive and is ignored.  
C) P2 has semantically rich features but poorly localized features   
D) P7 has well-localized features but semantically poor features  

Ans) A, B

## Topic-3 Anchor boxes

### 3.1 Anchor Boxes
As we discussed the design of anchor boxes in previous chapters, let's briefly recap here. Instead of using standard sliding-window anchor boxes, we modified the anchor boxes, primarily for two reasons:

1) Objects come in different shapes. Using a single box shape makes it difficult for the network to learn and identify objects of different shapes. In the following image, the aspect ratio (width/height) of the car is different from that of the human.  

![Images with car and human](../images/anchor_02.png)

2) Objects come in different sizes: small, medium and large. For example, see the following image, where the cheetah appears at different distances from the camera. If, instead of 1 anchor, we have 3 anchors at each location, each one is responsible for detecting small, medium and large objects respectively, which makes training the network simpler.
![cheetah](../images/cheetah.jpg)


So 3 different sizes combined with 3 different aspect ratios give a total of 9 anchors at each location, which helps us detect objects more easily and also helps the network train faster with higher accuracy. Next we will see how anchors are designed in RetinaNet.


### 3.2 RetinaNet anchor boxes 
- As mentioned in the network design, we have 5 stride values (8, 16, 32, 64, 128) for which we have to design anchors. The lower strides, which come with smaller **sizes**, are responsible for detecting small objects, and the higher strides, which come with larger **sizes**, are responsible for detecting large objects. 
- At each stride, the ratios and scales listed below give 9 anchors. Since we have 5 strides, we get a total of 45 distinct anchor shapes. Although each stride already targets small, medium or large objects, we still use scales to further distinguish sizes within each of these groups, so that objects at all scales are detected with high accuracy. The following are the most important parameters for designing anchor boxes.


```python
import numpy as np
sizes = [32, 64, 128, 256, 512]
strides = [8, 16, 32, 64, 128]
ratios = np.array([0.5, 1, 2])
scales = np.asarray([2**0, 2**(1.0/3.0), 2**(2.0/3.0)])
```

RetinaNet uses 9 anchor boxes per grid cell instead of the single one we mentioned before. At each location we have 9 anchor boxes of different ratios and scales. Let's generate anchors for **stride 8**. The feature map size is 100 x 100, so there should be a total of 100 * 100 * 9 = 90000 anchors. The following is the procedure we follow to generate the anchors for all the grid cells.

- We select the feature map for which to generate anchors and, based on that, select the stride and size of the anchors. For the feature map of size (100, 100) the size is 32 and the stride is 8.
- We generate the 9 anchors centered at cell (0, 0).
- We compute the center of every cell on the feature map.
- We shift the 9 anchors to every cell center and concatenate them to get one array.
- We repeat the same for the other strides.


1) Let's get the shapes of the feature maps 

```python
from keras_retinanet.utils.anchors import guess_shapes
image_shape = (800, 800, 3)
feature_map_shapes = guess_shapes(image_shape, [3, 4, 5, 6, 7])
print(feature_map_shapes)

#Out: [array([100, 100]), array([50, 50]), array([25, 25]), array([13, 13]), array([7, 7])]
```

2) We will choose the first element in the list. Once we can generate anchors for this feature map, we can run a for loop and generate anchors for the others as well.

```python
fe_size = feature_map_shapes[0]
size = sizes[0]
stride = strides[0]
print(fe_size, size, stride)

# Out: [100 100] 32 8
```


3) Generate anchors with Center (0, 0)

```python 
from keras_retinanet.utils.anchors import generate_anchors
anchors = generate_anchors(size, ratios, scales)
print(anchors)

# Out
# [[-22.627417   -11.3137085   22.627417    11.3137085 ]
#  [-28.50875898 -14.25437949  28.50875898  14.25437949]
#  [-35.91878555 -17.95939277  35.91878555  17.95939277]
#  [-16.         -16.          16.          16.        ]
#  [-20.1587368  -20.1587368   20.1587368   20.1587368 ]
#  [-25.39841683 -25.39841683  25.39841683  25.39841683]
#  [-11.3137085  -22.627417    11.3137085   22.627417  ]
#  [-14.25437949 -28.50875898  14.25437949  28.50875898]
#  [-17.95939277 -35.91878555  17.95939277  35.91878555]]
```
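
To see where these numbers come from, here is a dependency-free sketch that reproduces the printed values. It is written for illustration and is not the library's implementation; `make_base_anchors` is a hypothetical name. Note that, judging from the output above, the ratio here acts as height / width, so ratio 0.5 gives a wide box.

```python
import math

def make_base_anchors(base_size, ratios, scales):
    """Return len(ratios) * len(scales) boxes [x1, y1, x2, y2] centered at (0, 0)."""
    boxes = []
    for ratio in ratios:                      # ratio behaves as height / width
        for scale in scales:
            area = (base_size * scale) ** 2   # the area is fixed by size and scale
            width = math.sqrt(area / ratio)   # redistribute the area by the ratio
            height = width * ratio
            boxes.append([-width / 2, -height / 2, width / 2, height / 2])
    return boxes

ratios = [0.5, 1, 2]
scales = [2 ** 0, 2 ** (1.0 / 3.0), 2 ** (2.0 / 3.0)]

for box in make_base_anchors(32, ratios, scales):
    print([round(v, 4) for v in box])
```

All three boxes that share a scale have the same area; only the way that area is split between width and height changes with the ratio.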


4) Visualizing the anchor boxes using matplotlib 

```python
from matplotlib import pyplot as plt
%matplotlib inline 

plt.figure(figsize=(12, 8))
for i in anchors:
    x1, y1, x2, y2 = i
    plt.hlines(y1, x1, x2)
    plt.hlines(y2, x1, x2)
    plt.vlines(x1, y1, y2)
    plt.vlines(x2, y1, y2)
plt.title("Anchors at location (0, 0)")
plt.show()

```

![anchos at location 0, 0](../images/anchor_at_loc00.png)


5) Generate anchors at all the windows. To do this, we need the centers of every window (cell) on the feature map. 

```python
shift_x = (np.arange(0, fe_size[1]) + 0.5)
shift_y = (np.arange(0, fe_size[0]) + 0.5) 
shift_x, shift_y = np.meshgrid(shift_x, shift_y)
shifts = np.vstack((
        shift_x.ravel(), shift_y.ravel(),
        shift_x.ravel(), shift_y.ravel())).transpose()
print(shifts)

# array([[ 0.5,  0.5,  0.5,  0.5],
#        [ 1.5,  0.5,  1.5,  0.5],
#        [ 2.5,  0.5,  2.5,  0.5],
#        ...,
#        [97.5, 99.5, 97.5, 99.5],
#        [98.5, 99.5, 98.5, 99.5],
#        [99.5, 99.5, 99.5, 99.5]])
```


6) Visualizing all the window centers of the anchor boxes. Let's use Pillow for this. 

```python
from PIL import Image
from PIL import ImageDraw
img = np.uint8(np.ones((800, 800, 3)))
img = Image.fromarray(img) 
draw = ImageDraw.Draw(img)
for i in shifts:
    x = i[0]*stride ## Multiplying with stride to visualize this on image
    y = i[1]*stride
    draw.point((x, y), fill="red")

```

![Window centers of anchor boxes](../images/centers_img.png)

7) Now our job is to move the 9 anchors to each of these centers. To aggregate all of this, keras_retinanet has a function called **shift**. Using it, we get all 90000 anchors.
```python 
from keras_retinanet.utils.anchors import shift
shifted_anchors = shift(fe_size, stride, anchors)
print(shifted_anchors.shape)

##Out: (90000, 4)
```
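
Conceptually, the shifting step just adds every cell center (in image pixels) to every base anchor. The following plain-Python sketch shows the idea; `shift_anchors` is a hypothetical helper, and the row-major ordering with the base anchors innermost is an assumption chosen to match the slicing used in the next step. A single square base anchor is used to keep the printout short.

```python
def shift_anchors(fmap_shape, stride, base_anchors):
    """Place every base anchor at the center of every feature-map cell."""
    rows, cols = fmap_shape
    shifted = []
    for row in range(rows):
        for col in range(cols):
            cx = (col + 0.5) * stride   # cell center in image coordinates
            cy = (row + 0.5) * stride
            for x1, y1, x2, y2 in base_anchors:
                shifted.append([x1 + cx, y1 + cy, x2 + cx, y2 + cy])
    return shifted

base = [[-16.0, -16.0, 16.0, 16.0]]          # one 32 x 32 square anchor
boxes = shift_anchors((100, 100), 8, base)

print(len(boxes))
print(boxes[0])      # anchor at the first cell
print(boxes[5049])   # anchor at row 50, column 49
```

With 9 base anchors instead of 1, the same loop yields 100 * 100 * 9 = 90000 boxes, and the anchors of cell k occupy indices k * 9 to k * 9 + 8.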
8) Let's visualize some of these anchors using Pillow 

```python
for i in shifted_anchors[5049*9:5050*9]:
    draw.rectangle([(i[0], i[1]), (i[2], i[3])], fill=None, width=1)
```

![centers_with_anchors](../images/centers_with_anchors.png)

The image above clearly shows how anchors are laid out over the image. Now let's generate anchors on all the feature maps and combine them into one array so that we can proceed further. 


```python
from keras_retinanet.utils.anchors import AnchorParameters, guess_shapes
print(AnchorParameters.default.sizes)
print(AnchorParameters.default.scales)
print(AnchorParameters.default.ratios)
print(AnchorParameters.default.strides)

# Out
# [32, 64, 128, 256, 512]
# [1.        1.2599211 1.587401 ]
# [0.5 1.  2. ]
# [8, 16, 32, 64, 128]
```

- AnchorParameters contains all the default values we have talked about in this section.
- guess_shapes takes the image size as input and calculates the feature map sizes for this network. Note that if the network changes, we need to write our own guess_shapes function. Changing the network is out of scope for this course.

The keras_retinanet package has a function called anchors_for_shape, which generates all the anchors for a given image size. Although we have used (800, 800) here, we can use images of any size. Keeping that in mind, let's see how it works in code.

```python
from keras_retinanet.utils.anchors import anchors_for_shape
```

The anchors_for_shape function takes the following parameters:

| Argument | Default | Description |
| ----------------|------------| -----------------|
| image_shape| - | Tuple, the shape of the input image|
| pyramid_levels| - | List of pyramid levels to use|
| anchor_params| None| If None, the default values in AnchorParameters are used internally|
| shapes_callback| None | Function which calculates the feature map sizes. If None, the default guess_shapes function is used|


```python
image_size = (800, 800)
pyramid = [3, 4, 5, 6, 7]
all_anchors = anchors_for_shape(image_size, pyramid, AnchorParameters.default, guess_shapes)
print(all_anchors.shape)
#out: (120087, 4)
```

Anchor generation is done. We now need to assign a ground truth to each anchor and create a classification_target and a regression_target so that we can train the network. We will take this up in the next section, **Encoding Ground Truths**.

### Tasks
Consider an image of size (960, 960) and the default anchor parameters defined in keras_retinanet.utils.anchors.AnchorParameters.default. 

Using these parameters, answer the following questions.

Q1) What are the feature map sizes for P3, P4, P5?

Instructions:
- use guess_shapes in keras_retinanet.utils.anchors

Ans) [array([120, 120]), array([60, 60]), array([30, 30])]  

Solution)
```python
from keras_retinanet.utils.anchors import guess_shapes
image_shape = (960, 960, 3)
feature_map_shapes = guess_shapes(image_shape, [3, 4, 5])
print(feature_map_shapes)
```

Q2) How many anchors are generated on P3?

Instructions:
- Use only P3 feature map 
- Use anchors_for_shape in keras_retinanet.utils.anchors

Ans) 129600

Solution)
```python
from keras_retinanet.utils.anchors import anchors_for_shape, AnchorParameters, guess_shapes
image_size = (960, 960)
pyramid = [3]
all_anchors = anchors_for_shape(image_size, pyramid, AnchorParameters.default, guess_shapes)
print(all_anchors.shape)
```

Q3) How many anchors are present if we decide not to use P3?

Ans) 43101

Solution)

```python
from keras_retinanet.utils.anchors import anchors_for_shape, AnchorParameters, guess_shapes
image_size = (960, 960)
pyramid = [4, 5, 6, 7]
all_anchors = anchors_for_shape(image_size, pyramid, AnchorParameters.default, guess_shapes)
print(all_anchors.shape)
```

## Topic-4 Encoding ground truth boxes

In the last module we generated 120087 anchors for an image of size (800, 800) using 5 pyramid levels (P3, P4, P5, P6, P7). In this module we have to assign coordinate labels (4) and class labels (80 for COCO) to each anchor, and we will learn how that is done. Let's say we have an image of size (800, 800) that contains an object at [200, 300, 400, 500]. 
Previously we saw a scheme in which each grid cell has only one anchor and the ground truth is assigned to the grid cell that contains the object center. RetinaNet uses a much simpler approach: it calculates the intersection over union (IoU) of each anchor with the object and assigns the ground truth object to the anchors with a high IoU (the exact thresholds are given later in this topic). 

```python
from keras_retinanet.utils.compute_overlap import compute_overlap
import numpy as np


from PIL import Image
from PIL import ImageDraw
img = np.uint8(np.ones((800, 800, 3)))
img = Image.fromarray(img) 
draw = ImageDraw.Draw(img)

gt_box = np.asarray([200, 300, 400, 500]).reshape(1, -1) # A dummy ground truth box 
for i in gt_box:
    draw.rectangle([(i[0], i[1]), (i[2], i[3])], fill=None, width=1, outline="red")
```

![image with bbox](../images/bbox_img.png)

Let's compute the IoU between the anchors and gt_box. We can visualize the result using matplotlib.

```python
overlaps = compute_overlap(all_anchors.astype(np.float64), gt_box.astype(np.float64))
print("Max overlap with any anchor: ", overlaps.max())

## Out: Max overlap with any anchor:  0.9464433477433575
```
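
For readers who want to see what an overlap computation does, here is a minimal IoU function for two boxes in plain Python. It is a sketch of the standard definition rather than the library's compiled routine, so its values may differ slightly from `compute_overlap` depending on the pixel convention the library uses; the helper name `iou` is hypothetical.

```python
def iou(box_a, box_b):
    """Intersection over union of two [x1, y1, x2, y2] boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlapping rectangle (zero if the boxes are disjoint)
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

gt = [200, 300, 400, 500]
print(iou(gt, gt))                     # identical boxes
print(iou(gt, [600, 600, 700, 700]))   # disjoint boxes
print(iou(gt, [300, 300, 500, 500]))   # shifted by half a width
```

An anchor only reaches a high IoU when it matches the object in position, size and aspect ratio at the same time, which is why so few of the anchors end up overlapping strongly with a given object.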

```python
plt.hist(overlaps)
plt.show()
```
![Histogram Overlap](../images/histogram_overlap.png)


We can see that the majority of the anchor boxes do not intersect with the object. The max overlap with any anchor is ~0.946. Let's visualize the anchor with the maximum IoU. 

```python
max_iou_anchor_box = all_anchors[np.argmax(overlaps, axis=0)]
print(max_iou_anchor_box)

# [[202.40633392 298.40633392 405.59366608 501.59366608]]

for i in max_iou_anchor_box:
    draw.rectangle([(i[0], i[1]), (i[2], i[3])], fill=None, width=3, outline="green")
    
```

![Bbox ground truth anchor box](../images/bbox_gt_anchor.png)


As we can clearly see, the anchor box in green has a high IoU with the object. Now let's initialize two numpy arrays, one for classification and the other for regression.

```python
num_classes = 80 # consider it is a coco dataset
regression_batch  = np.zeros(( all_anchors.shape[0], 4 + 1))
labels_batch = np.zeros((all_anchors.shape[0], num_classes + 1))
print(regression_batch.shape, labels_batch.shape)

##(120087, 5) (120087, 81)
```

The regression batch has 4 columns that hold the box coordinate targets of the object. The final column is the anchor state: it tells whether the anchor is positive (1), negative (0) or should be ignored when calculating the loss (-1). The labels batch has n_classes columns plus 1 extra column that holds the same anchor state.


### Rules for assigning gt_box to the anchors
- If an anchor has IoU >= 0.5, the anchor is a positive anchor.
- If an anchor has 0.4 <= IoU < 0.5, the anchor needs to be ignored (we will shortly understand the reason).
- If an anchor has IoU < 0.4, it is a negative anchor.
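
These three rules map directly onto a small function. The sketch below uses the 1 / 0 / -1 encoding for positive, negative and ignored anchors; `anchor_state` is a hypothetical helper name.

```python
def anchor_state(iou_value, pos_threshold=0.5, neg_threshold=0.4):
    """Return 1 (positive), -1 (ignore) or 0 (negative) for one anchor."""
    if iou_value >= pos_threshold:
        return 1
    if iou_value >= neg_threshold:
        return -1
    return 0

ious = [0.95, 0.5, 0.45, 0.4, 0.39, 0.0]
print([anchor_state(v) for v in ious])
```

Only the positive and negative anchors contribute to the classification loss; the ignored band between the two thresholds keeps ambiguous anchors from pushing the network in either direction.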


#### Case-1: if an anchor has iou >= 0.5, the anchor is positive anchor.
Let's visualize the anchors which have a positive IoU (>= 0.5) with the gt_box. 


```python
pos_index = (overlaps >= 0.5).reshape(-1)
pos_anchors = all_anchors[pos_index]
print(pos_anchors.shape)
#(49, 4)
```

Visualizing the positive anchor boxes:

```python
for i in pos_anchors:
    draw.rectangle([(i[0], i[1]), (i[2], i[3])], fill=None, width=1, outline="yellow")
```

![pos boxes](../images/pos_boxes.png)

As we can see, all these anchors overlap with gt_box. In RetinaNet, the authors treat anchors with IoU >= 0.5 as positive anchors, meaning each of these anchor boxes is considered to contain the object. 


#### Case-2: if an anchor has 0.4 <= iou < 0.5, the anchor needs to be ignored

```python
ignore_index = (overlaps < 0.5).reshape(-1) & (overlaps >= 0.4).reshape(-1)  # 0.4 <= IoU < 0.5
ignore_anchors = all_anchors[ignore_index]
print(ignore_anchors.shape)
#out: (56, 4)
```

There are a total of 56 ignore anchors. We have to ignore these anchors while training because they only partly contain the object: we can neither say confidently which object is present within the anchor nor discard it as background, since part of an object is present. Hence they are ignored. To visualize them, use the following code.

```python
img = np.uint8(np.ones((800, 800, 3)))
img3 = Image.fromarray(img).copy()
draw = ImageDraw.Draw(img3)

for i in ignore_anchors:
    draw.rectangle([(i[0], i[1]), (i[2], i[3])], fill=None, width=1, outline="brown")

for i in gt_box:
    draw.rectangle([(i[0], i[1]), (i[2], i[3])], fill=None, width=5, outline="red")
```

![ignore boxes](../images/ignore_boxes.png)


#### Case-3: if an anchor has iou < 0.4, it is a negative anchor
- Anchors with IoU < 0.4 are considered negative (background). Since the numpy arrays are initialized with zeros, all the anchors other than the positive and ignore anchors are automatically treated as negative anchors.


### How are the classification and regression targets created?

We have the following with us:


| Name | Shape | Description |
| ----------------|------------| -----------------|
| labels_batch| (120087, 81) | 80 classes for labels and 1 column indicating whether it is a pos, neg or ignore anchor|
| regression_batch| (120087, 5) | 4 coordinates and 1 column indicating whether it is a pos, neg or ignore anchor| 
| pos_index| -| True for the anchors which are positive|
| ignore_index| - | True for the anchors which need to be ignored|

Since we have initialized both **labels_batch** and **regression_batch** with zeros, we need to set the last column to 1 for the positive anchors and to -1 for the ignore anchors.


```python
labels_batch[ignore_index, -1]       = -1
labels_batch[pos_index, -1]     = 1

regression_batch[ignore_index, -1]   = -1
regression_batch[pos_index, -1] = 1
```

Suppose the class index of the object is 20; then for the positive anchors we need to set column 20 to 1.

```python
labels_batch[pos_index, 20] = 1
```


For regression, the targets are the offsets of the ground truth box corners relative to the anchor box. Let x1, y1, x2, y2 be the ground truth box and tx1, ty1, tx2, ty2 be the anchor box.
```
anchor_height = ty2 - ty1
anchor_width = tx2 - tx1 

dx1 = (x1 - tx1)/ anchor_width
dx2 = (x2 - tx2)/ anchor_width
dy1 = (y1 - ty1)/ anchor_height
dy2 = (y2 - ty2)/ anchor_height

targets = np.stack((dx1, dy1, dx2, dy2))
targets = targets.T 

targets = (targets - mean)/std ## normalizing the targets
```
In keras_retinanet there is a function called bbox_transform which does this for us.

```python
from keras_retinanet.utils.anchors import bbox_transform
regression_batch[:, :-1] = bbox_transform(all_anchors, gt_box)
```
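
The encoding can also be followed on a single anchor in plain Python. This is a sketch under stated assumptions: `encode_box` and `decode_box` are hypothetical helpers, and the normalization constants mean = 0 and std = 0.2 are assumed values chosen for illustration, so check the library defaults before relying on them. The decode function is included only to show that the encoding is invertible.

```python
def encode_box(anchor, gt, mean=0.0, std=0.2):
    """Corner offsets of gt relative to the anchor, scaled by the anchor size."""
    ax1, ay1, ax2, ay2 = anchor
    gx1, gy1, gx2, gy2 = gt
    width, height = ax2 - ax1, ay2 - ay1
    raw = [(gx1 - ax1) / width, (gy1 - ay1) / height,
           (gx2 - ax2) / width, (gy2 - ay2) / height]
    return [(value - mean) / std for value in raw]   # normalize the targets

def decode_box(anchor, targets, mean=0.0, std=0.2):
    """Invert encode_box: recover box corners from normalized targets."""
    ax1, ay1, ax2, ay2 = anchor
    width, height = ax2 - ax1, ay2 - ay1
    d = [t * std + mean for t in targets]
    return [ax1 + d[0] * width, ay1 + d[1] * height,
            ax2 + d[2] * width, ay2 + d[3] * height]

anchor = [202.40633392, 298.40633392, 405.59366608, 501.59366608]
gt = [200.0, 300.0, 400.0, 500.0]

targets = encode_box(anchor, gt)
print(targets)
print(decode_box(anchor, targets))   # recovers gt up to rounding error
```

Dividing by the anchor width and height makes the targets independent of the anchor size, so the same regression head can serve every pyramid level.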

Though we have calculated these for all the anchors, the last column tells which anchors are positive, so while calculating the loss we can use that column to ignore the other rows.


The final two numpy arrays **labels_batch** and **regression_batch** are used as targets for training the network

### Tasks
For a ground truth object at [20, 30, 200, 300] with an image size of (960, 960), calculate the following.

1) max_iou with any anchor (up to three decimals)  
2) anchor with max_iou  
3) total number of pos_anchors if pos_threshold is 0.5 
4) total number of negative anchors (total_anchors - pos_anchors - ignore_anchors)  

Instructions: 
- Use default anchorparameters.

Solutions:  
1) 0.748
```python
from keras_retinanet.utils.anchors import anchors_for_shape, AnchorParameters, guess_shapes
from keras_retinanet.utils.compute_overlap import compute_overlap
import numpy as np

image_size = (960, 960)
pyramid = [3, 4, 5, 6, 7]
all_anchors = anchors_for_shape(image_size, pyramid, AnchorParameters.default, guess_shapes)
print(all_anchors.shape)


x = [20, 30, 200, 300]
gt_box = np.asarray(x).reshape(1, -1)

overlaps = compute_overlap(all_anchors.astype(np.float64), gt_box.astype(np.float64))
print("Max overlap with any anchor: ", overlaps.max())

```

2) [ 40.16242979,  32.32485958, 183.83757021, 319.67514042]

```python
max_iou_anchors = all_anchors[overlaps.argmax()]
print(max_iou_anchors)
```

3) 35
```python
pos_index = (overlaps >= 0.5).reshape(-1)
pos_anchors = all_anchors[pos_index]
print(pos_anchors.shape)
```

4) 172607
```python
pos_index = (overlaps >= 0.5).reshape(-1)
pos_anchors = all_anchors[pos_index]
print(pos_anchors.shape)


ignore_index = (overlaps < 0.5).reshape(-1) & (overlaps >= 0.4).reshape(-1)  # 0.4 <= IoU < 0.5
ignore_anchors = all_anchors[ignore_index]
print(ignore_anchors.shape)

neg_anchors = all_anchors.shape[0] - pos_anchors.shape[0] - ignore_anchors.shape[0]
print(neg_anchors)
```

## Topic-5 Loss functions
There are two loss functions, one for regression (coordinates) and the other for classification (labels).

### Focal loss for classification
- For classification, the authors use a novel loss function called **Focal loss**, a modification of the cross entropy loss that weights misclassified and hard examples more heavily than easily classified examples. 

\begin{equation*}
CE(p_{t}) = -\log(p_{t})
\end{equation*}

\begin{equation*}
FL(p_{t}) = -\alpha \, (1 - p_{t})^{\gamma} \, \log(p_{t})
\end{equation*}

where  
\begin{equation*}
p_{t} = p \quad \text{if} \quad y = 1
\end{equation*}

\begin{equation*}
p_{t} = 1 - p \quad \text{if} \quad y = 0
\end{equation*}


One of the fundamental problems in object detection is that there are far more background anchor boxes than foreground anchor boxes (49 positive anchor boxes out of ~120000 anchor boxes, as seen previously). Since the negative examples overwhelm the positive ones, it is important to give more weight to hard examples than to easily classifiable background examples. The authors of the RetinaNet paper use Focal loss to solve this problem. Let's understand Focal loss by considering the following scenarios. We will use an \alpha value of 0.25 and a \gamma value of 2.0.  

#### Scenario-1: Easy correctly classified example

Say we have an easily classified foreground object with p=0.9. The usual cross entropy loss for this example is


\begin{equation*}
CE(foreground) = -\log(0.9) = 0.1053
\end{equation*}

Now consider an easily classified background object with p=0.1. The usual cross entropy loss for this example is again the same

\begin{equation*}
CE(background) = -\log(1 - 0.1) = 0.1053
\end{equation*}


Now consider focal loss for both the cases above. We will use alpha=0.25 and gamma=2

\begin{equation*}
FL(foreground) = -0.25 \times (1 - 0.9)^2 \times \log(0.9) = 0.00026
\end{equation*}


\begin{equation*}
FL(background) = -0.25 \times (1 - (1 - 0.1))^2 \times \log(1 - 0.1) = 0.00026
\end{equation*}


#### Scenario-2: Misclassified example

Say we have a misclassified foreground object with p=0.1. The usual cross entropy loss for this example is

\begin{equation*}
CE(foreground) = -\log(0.1) = 2.3026
\end{equation*}

Now consider a misclassified background object with p=0.9. The usual cross entropy loss for this example is again the same

\begin{equation*}
CE(background) = -\log(1 - 0.9) = 2.3026
\end{equation*}

Now consider focal loss for both the cases above. We will use alpha=0.25 and gamma=2

\begin{equation*}
FL(foreground) = -0.25 \times (1 - 0.1)^2 \times \log(0.1) = 0.4663
\end{equation*}


\begin{equation*}
FL(background) = -0.25 \times (1 - (1 - 0.9))^2 \times \log(1 - 0.9) = 0.4663
\end{equation*}


#### Scenario-3: Very easily classified example

Say we have a very easily classified foreground object with p=0.99. The usual cross entropy loss for this example is

\begin{equation*}
CE(foreground) = -\log(0.99) = 0.01
\end{equation*}

Now consider a very easily classified background object with p=0.01. The usual cross entropy loss for this example is again the same

\begin{equation*}
CE(background) = -\log(1 - 0.01) = 0.01
\end{equation*}


Now consider focal loss for both the cases above. We will use alpha=0.25 and gamma=2

\begin{equation*}
FL(foreground) = -0.25 \times (1 - 0.99)^2 \times \log(0.99) = 2.5 \times 10^{-7}
\end{equation*}


\begin{equation*}
FL(background) = -0.25 \times (1 - (1 - 0.01))^2 \times \log(1 - 0.01) = 2.5 \times 10^{-7}
\end{equation*}


### Conclusion

Compared with cross entropy, the focal loss in each scenario is:

scenario-1: 0.1053/0.00026 ≈ 400 times smaller

scenario-2: 2.3026/0.4663 ≈ 5 times smaller

scenario-3: 0.01/0.00000025 ≈ 40,000 times smaller

These three scenarios clearly show that Focal loss gives very little weight to well-classified examples and a comparatively large weight to misclassified or hard examples.

This is the basic intuition behind the design of Focal loss. The authors tested different values of alpha and gamma and finally settled on the values mentioned above.
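
The scenarios can be reproduced with a few lines of plain Python. The sketch follows the simplified form used in this chapter, in which the same alpha is applied to foreground and background examples; the function names are illustrative only.

```python
import math

def cross_entropy(p, y):
    """Binary cross entropy for predicted probability p and label y (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    return -math.log(p_t)

def focal_loss(p, y, alpha=0.25, gamma=2.0):
    """Cross entropy scaled by alpha * (1 - p_t) ** gamma."""
    p_t = p if y == 1 else 1.0 - p
    return -alpha * (1.0 - p_t) ** gamma * math.log(p_t)

# Foreground examples (y = 1) with increasing confidence in the correct class
for name, p in [("misclassified", 0.1), ("easy", 0.9), ("very easy", 0.99)]:
    ce = cross_entropy(p, 1)
    fl = focal_loss(p, 1)
    print(f"{name}: CE={ce:.4f}  FL={fl:.7f}  CE/FL={ce / fl:.0f}")
```

The ratio CE/FL equals 1 / (alpha * (1 - p_t) ** gamma), so it grows quickly as p_t approaches 1; this is the mechanism that keeps the many easy background anchors from dominating the loss.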


### Smooth L1 loss for regression
Smooth L1 loss can be interpreted as a combination of L1 loss and L2 loss. It behaves like L1 loss when the absolute value of the argument is large, and like L2 loss when the absolute value of the argument is close to zero. 

```
f(x) = 0.5 * (sigma * x)^2          if |x| < 1 / sigma^2
       |x| - 0.5 / sigma^2          otherwise
```
sigma is a hyperparameter, set to 3 here, and x is the difference between the target and the predicted coordinates. Smooth L1 loss combines the advantages of L1 loss (steady gradients for large values of 𝑥) and L2 loss (fewer oscillations during updates when 𝑥 is small).
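
The piecewise definition translates directly into code. The following is a scalar sketch for illustration, with `smooth_l1` as a hypothetical helper name; a real training loss would apply it element-wise to tensors and average over the positive anchors.

```python
def smooth_l1(x, sigma=3.0):
    """Quadratic near zero, linear for large |x|; the switch point is 1 / sigma**2."""
    sigma_sq = sigma ** 2
    abs_x = abs(x)
    if abs_x < 1.0 / sigma_sq:
        return 0.5 * sigma_sq * x ** 2   # same as 0.5 * (sigma * x) ** 2
    return abs_x - 0.5 / sigma_sq

for x in [0.0, 0.05, 0.1, 0.5, 1.0, -2.0]:
    print(x, round(smooth_l1(x), 5))
```

The two branches meet at |x| = 1 / sigma^2 with the same value, so the loss is continuous, and a larger sigma narrows the quadratic region around zero.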


Now that we understand the network, anchors, target encoding and loss functions, we can proceed to train the network.


### Quiz

Q) Which loss function is used for the classification loss in RetinaNet?  
A) Focal loss   
B) Cross entropy loss   
C) Either A or B  
D) Neither A nor B  


Ans) A