### This is the final lesson with implementation of YOLO-v4

In Yolo-v3, the backbone-derived feature maps are then processed using an FPN-like structure.

The main advantage of this solution is the ability of the Feature Pyramid Network to collect as much information as possible from different levels of the backbone:

![](../img/fpn.png)

Thus, the pyramidal hierarchy of convolutional features is preserved, while strong semantics are available at every level of object recognition.

FPN achieves this by combining low-resolution but semantically strong feature maps with high-resolution but semantically weak ones, using a top-down pathway and lateral connections.

Yolo-v4 does not use the Feature Pyramid Network as in Yolo-v3. The new version of the neural network uses Spatial Pyramid Pooling and Path Aggregation Network instead of FPN technology.

These parts of the neural network are responsible for the correct final interpretation of the features received from the backbone to generate the most accurate predictions.

The SPP block is an important part of Yolo-v4, since it gives the neural network the ability to process input images or video streams of any resolution.

The input image format in Yolo-v3 is a resolution of 416x416 pixels. The described limitation is caused by the fact that at the stage of predicting the position of recognized objects in the image and their classes, fully connected layers of the neural network are used.

Unlike convolutional layers used in early layers of neural network architectures, fully connected layers have a fixed input size. Because of this, the image must be exactly 416x416 pixels, otherwise the algorithm will complete its work with an error.

To solve this problem, it is natural either to cut the region of interest out of the image (crop) or to warp the image to the desired resolution. The figure below shows the results of such deformations.

Cropping and warping an image to fit the resolution expected by the neural network can greatly distort the original information. As a result, the neural network is trained on a low-quality dataset, which noticeably degrades the quality of the final prediction.

![](../img/deformations.png)

Using the SPP block allows you to get rid of the problems described above. The SPP block is added after the last convolutional layer of the neural network.

The SPP layer combines the final feature maps and creates a fixed-length representation of them for the fully connected layers. In this way, the Spatial Pyramid Pooling block sits between the convolutional layers and the fully connected layers and aggregates the information collected in the deep layers of the neural network. This removes the need for low-quality image transformations (for example, crop or warp) before the image is fed into the algorithm.
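The fixed-length property can be illustrated with a minimal pure-Python sketch of pyramid pooling on a single 2D feature map. The function name `spatial_pyramid_pool` and the pyramid levels `(1, 2)` are illustrative assumptions, not part of the lesson code, and PyTorch is deliberately not used here.

```python
import math


def spatial_pyramid_pool(fmap, levels=(1, 2)):
    """Max-pool a 2D feature map into n x n bins for each level, then flatten."""
    h, w = len(fmap), len(fmap[0])
    pooled = []
    for n in levels:
        for i in range(n):
            # Row range of bin i: floor for the start, ceil for the end,
            # so that the bins together cover the whole map.
            r0, r1 = (i * h) // n, math.ceil((i + 1) * h / n)
            for j in range(n):
                c0, c1 = (j * w) // n, math.ceil((j + 1) * w / n)
                pooled.append(max(fmap[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


small = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
wide = [[c + 10 * r for c in range(7)] for r in range(4)]

# Both maps give 1*1 + 2*2 = 5 values, whatever their height and width.
print(len(spatial_pyramid_pool(small)), len(spatial_pyramid_pool(wide)))
```

Whatever the input resolution, the output length depends only on the chosen levels, which is what a layer with a fixed input size needs.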

![](../img/spp.png)

To implement the neck part in Yolo-v4 on the PyTorch framework, it is also necessary to pre-develop auxiliary modules that significantly simplify the architecture construction, reduce the amount of code and the possibility of making mistakes.

First, I will implement the Upsample class, which, like all classes in the backbone, will inherit from the parent nn.Module class.

The main purpose of this class is to reformat high-dimensional data matrices into a more workable format. Since the SPP block and other parts in the neck of Yolo-v4 often use the operations of gluing and concatenation, it is necessary to be able to quickly and correctly reformat the data dimension so that it matches the original feature map and the processed one.

This is especially true in residual layers.

#### Now let's implement a utility class called Upsample

In this class we want to manually change the way the data is laid out.

In [6]:
import torch
import torch.nn as nn
import torch.nn.functional as F  # needed for F.interpolate in forward()


class Upsample(nn.Module):
    def __init__(self):
        super(Upsample, self).__init__()

    def forward(self, x, target_size, inference=False):
        """
        Reshape the values into right format
        x: torch.Tensor - data
        target_size: tuple(int[4]) - dimensions of new array
        """
        assert (x.data.dim() == 4)
        # _, _, tH, tW = target_size

        if inference:

            #B = x.data.size(0)
            #C = x.data.size(1)
            #H = x.data.size(2)
            #W = x.data.size(3)

            # YOUR CODE HERE

            pass
        else:
            return F.interpolate(x, size=(target_size[2], target_size[3]), mode='nearest')

It is worth noting that in this implementation, there are no convolutional or fully connected layers of the neural network in the __init__ structure, therefore this class has no trainable parameters.

However, the Upsample class can operate in two modes. The inference flag means that the input is used solely for making predictions. It also allows you to keep the original number of batches and channels of input data, thereby changing only the height and width of the array.
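To make the effect on height and width concrete, here is a minimal pure-Python sketch of nearest-neighbour upsampling for one channel of one image. It is not the Upsample class itself: the function name `upsample_nearest` is hypothetical, and the index rule (source index = floor of the target index scaled by input size over output size) is an assumption about how the 'nearest' mode behaves.

```python
def upsample_nearest(fmap, target_h, target_w):
    """Nearest-neighbour resize of a 2D list to target_h x target_w."""
    h, w = len(fmap), len(fmap[0])
    return [[fmap[(r * h) // target_h][(c * w) // target_w]
             for c in range(target_w)]
            for r in range(target_h)]


fmap = [[1, 2],
        [3, 4]]
for row in upsample_nearest(fmap, 4, 4):
    print(row)
# [1, 1, 2, 2]
# [1, 1, 2, 2]
# [3, 3, 4, 4]
# [3, 3, 4, 4]
```

In the real class the same operation is applied to every channel of every image in the batch, so the first two dimensions of the tensor stay as they are.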

A batch is nothing more than the set of images processed by the neural network at one time. The batch size can be arbitrarily large, but the memory and computing power of the machine must be taken into account. As a rule, the batch size is a power of two, for example 2^5 = 32 images, since this suits the way computations are parallelized on the GPU.

The channels of the input data are the feature maps obtained in the previous step. The input image usually has three channels: red, green, and blue. Subsequently, the number of channels is changed by the convolutional layers of the neural network. When the inference flag is not set, the input data is interpolated to the target height and width.

In addition to the SPP block in the neck part of the Yolo-v4 neural network algorithm, PANet is used.

The use of the PANet algorithm in Yolo-v4 is due to the fact that this algorithm is an improved version of the FPN algorithm. First, just like in the Feature Pyramid Network, a pyramidal structure is created for transmitting information from different convolutional layers of the neural network.

However, the top-down connections are followed by an additional block that collects features even more efficiently, this time from bottom to top. This helps to significantly improve the subsequent localization of objects. After that, all the final feature maps are combined using a special Adaptive Feature Pooling block, which restores the localization information that is lost in the deep layers of the neural network.

The main difference is that in FPN features are collected exclusively from top to bottom, although this occurs at several levels. Path Aggregation Networks additionally collect features from bottom to top at the last stage. This preserves as much localization information as possible from the first layers of the neural network, which have weak representational power but carry a lot of information about the exact boundaries of the object in the image. The architecture of the Path Aggregation Network is shown below.

![](../img/panet.png)

The implementation of the methods described above for aggregating information from different feature maps is presented in the Neck class, which also inherits the base nn.Module class of the PyTorch framework.

The functional description of the developed child class will be divided into 2 parts: the class constructor (__init__ method) and data processing (forward method).

In [9]:
class Neck(nn.Module):
    def __init__(self, inference=False):
        super().__init__()
        self.inference = inference
        
        # !!!Use LeakyReLU activation function here

        self.conv1 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')
        self.conv2 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')
        self.conv3 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')

        self.maxpool1 = nn.MaxPool2d(kernel_size=5, stride=1, padding=5 // 2)
        self.maxpool2 = nn.MaxPool2d(kernel_size=9, stride=1, padding=9 // 2)
        self.maxpool3 = nn.MaxPool2d(kernel_size=13, stride=1, padding=13 // 2)

        # ADD your layers
        self.conv4 = None  # 2048, 512
        self.conv5 = None  # 512, 1024
        self.conv6 = None  # 1024, 512
        self.conv7 = None  # 512, 256
        
        # Initiate Upsample class here
        self.upsample1 = None

        self.conv8 = None  # 512, 256

        self.conv9 = None  # 512, 256
        self.conv10 = None  # 256, 512
        self.conv11 = None  # 512, 256
        self.conv12 = None  # 256, 512
        self.conv13 = None  # 512, 256
        self.conv14 = None  # 256, 512

        self.upsample2 = None

        # Same stuff with 256 and 128 feature maps
        self.conv15 = None

        self.conv16 = None
        self.conv17 = None
        self.conv18 = None
        self.conv19 = None
        self.conv20 = None

    def forward(self, input, downsample4, downsample3, inference=False):
        x1 = self.conv1(input)
        x2 = self.conv2(x1)
        x3 = self.conv3(x2)

        m1 = self.maxpool1(x3)
        m2 = self.maxpool2(x3)
        m3 = self.maxpool3(x3)
        # spp operation here (concatenation)
        spp = None

        x4 = self.conv4(spp)
        x5 = self.conv5(x4)
        x6 = self.conv6(x5)
        x7 = self.conv7(x6)

        up = self.upsample1(x7, downsample4.size(), self.inference)

        x8 = self.conv8(downsample4)
        
        x8 = torch.cat([x8, up], dim=1)

        x9 = self.conv9(x8)
        x10 = self.conv10(x9)
        x11 = self.conv11(x10)
        x12 = self.conv12(x11)
        x13 = self.conv13(x12)
        x14 = self.conv14(x13)

        up = self.upsample2(x14, downsample3.size(), self.inference)

        x15 = self.conv15(downsample3)

        x15 = torch.cat([x15, up], dim=1)

        x16 = self.conv16(x15)
        x17 = self.conv17(x16)
        x18 = self.conv18(x17)
        x19 = self.conv19(x18)
        x20 = self.conv20(x19)
        return x20, x13, x6

In the description of the Neck class, you can immediately see the use of the Upsample utility class described above.

The use of the SPP architecture in the Yolo-v4 neck computational graph is indicated by the self.conv1-7 convolutional layers and the three max pooling operations. Because the pooling layers use a stride of 1 and a padding of half the kernel size, they keep the height and width of the feature map unchanged while aggregating features within 2D windows of 5, 9 and 13 cells.

Specifically, max pooling selects the largest value in the window. The rest of the convolutional layers and the objects of the Upsample class are used to aggregate feature maps of different dimensions more accurately in the Path Aggregation Network.
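A quick shape check for the three pooling layers defined in Neck can be done without PyTorch. The sketch below uses the usual output-size formula for pooling (dilation 1, floor mode); the helper name `pool_output_size` and the 13x13 input size are assumptions chosen for illustration.

```python
def pool_output_size(size, kernel_size, stride=1, padding=0):
    """Spatial output size of a max pooling layer (dilation 1, floor mode)."""
    return (size + 2 * padding - kernel_size) // stride + 1


in_size, in_channels = 13, 512  # assumed example: a 13x13 map after conv3
sizes = [pool_output_size(in_size, k, stride=1, padding=k // 2)
         for k in (5, 9, 13)]
# The three pooled maps are concatenated with the unpooled map along channels.
spp_channels = in_channels * (len(sizes) + 1)
print(sizes, spp_channels)  # [13, 13, 13] 2048
```

With stride 1 and padding `k // 2`, every pooled map keeps the 13x13 size, and the channel count after concatenation matches the 2048 input channels noted for conv4.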

It is also worth paying attention to the fact that in the description of the Neck part, the activation function has been changed from Mish to LeakyReLU, since when training the classifier, LeakyReLU works more stably and faster.
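For reference, the two activation functions can be written in a few lines of plain Python. This is only a scalar sketch: the negative slope of 0.1 is an assumed value, since the Conv_Bn_Activation helper that sets it is not shown in this lesson.

```python
import math


def mish(x):
    """Mish activation: x * tanh(softplus(x))."""
    return x * math.tanh(math.log1p(math.exp(x)))


def leaky_relu(x, negative_slope=0.1):
    """LeakyReLU; the slope 0.1 is an assumed value, not taken from this lesson."""
    return x if x >= 0 else negative_slope * x


for x in (-2.0, 0.0, 2.0):
    print(x, round(mish(x), 4), leaky_relu(x))
```

LeakyReLU needs only a comparison and a multiplication, while Mish evaluates an exponential, a logarithm and a hyperbolic tangent per element, which is consistent with the speed argument above.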

The arguments of the forward method differ from those of the forward methods of the classes described above.

In addition to the data itself, the method for processing also needs feature maps obtained as a result of the work of the previously presented DownSample3 and DownSample4 classes in order to preserve the hierarchical structure of features.

The logic of the SPP block is expressed in the concatenation step: the original feature map and the three feature maps produced by the corresponding max pooling operations are concatenated into one tensor.

Further, from the declaration of the first up variable to the end of the forward method, there is a description of the Path Aggregation Network. Using the previously described Upsample class and the results of the Downsample3 and Downsample4 classes, you can implement a hierarchical structure for collecting features on the PANet principle.

For this, feature maps obtained after the SPP block and from the two previous levels are used. Directly in the script, the functionality written with the x4 variable and until the end of the forward method of the Neck class is responsible for this. As a result of the method, 3 processed feature maps are returned, the data and information from which will subsequently be used to classify objects and regress their coordinates in the image.

In neural network algorithms for object recognition and detection, the head is the final part of the algorithm. Based on the feature maps generated by the backbone and the data aggregated by the neck, it produces the final model prediction.

In object recognition and detection tasks, this prediction consists of the classes of the recognized objects and their positions in the image.

Yolo-v3 uses a unique head part that does an excellent job. For this reason, Yolo-v4 reuses the head from the previous version of the algorithm.

The basic principle of operation of the Yolo-v4 head is as follows: first, the image is divided into a grid, say 13x13 cells, and each cell contains n anchors, which are the basic answer of the algorithm for localizing objects.

In fact, each anchor describes a region of the image in which a potential object lies. The Yolo-v4 head uses 3 anchors for each grid cell. Then, using a single convolution operation, 5 + C variables are predicted for each grid cell and each anchor, where C is the number of classes that the neural network is trained to recognize. Each of these C values lies in the range [0, 1] and can be read as the probability that the potential object belongs to the corresponding class.

The remaining 5 values are the x, y coordinates of the central point of the object, its dimensions in the image (width and height), and the probability that an object lies within this particular cell and this particular anchor. As a result, the output array has the following dimension: (13, 13, 3, 5 + C)

The described principle is repeated three times on different feature maps from different levels to simplify the prediction of objects of different sizes by the neural network.
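The output dimensions at the three levels can be worked out with simple arithmetic. The sketch below assumes a 416x416 input and 80 classes as example values and uses the strides 8, 16 and 32 that appear in the head code; the helper name `head_output_shape` is hypothetical.

```python
def head_output_shape(input_size, stride, n_classes, anchors_per_cell=3):
    """Shape (grid, grid, anchors, 5 + C) of one prediction level."""
    grid = input_size // stride
    return (grid, grid, anchors_per_cell, 5 + n_classes)


# Assumed example values: a 416x416 input and 80 classes.
shapes = [head_output_shape(416, s, 80) for s in (8, 16, 32)]
total_boxes = sum(g * g * a for g, _, a, _ in shapes)
print(shapes)
print(total_boxes)  # 10647
```

Under these assumptions the grids are 52x52, 26x26 and 13x13, and each cell predicts 3 x 85 = 255 values, which is the same quantity as `output_ch = (4 + 1 + n_classes) * 3` in the Yolov4 class.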

In addition, the classifier becomes much better at predicting objects whose regions in the image strongly overlap.

To implement the head part of the Yolo-v4 neural network, we will first develop the YoloLayer helper class, which will also inherit from the parent class nn.Module of the PyTorch library.

The described class implements the functionality described earlier in this chapter for one of the feature maps. The Yolov4Head class, which uses three YoloLayer objects, is implemented as follows:

In [11]:
class Yolov4Head(nn.Module):
    def __init__(self, output_ch, n_classes, inference=False):
        super().__init__()
        self.inference = inference

        self.conv1 = Conv_Bn_Activation(128, 256, 3, 1, 'leaky')
        self.conv2 = Conv_Bn_Activation(256, output_ch, 1, 1, 'linear', bn=False, bias=True)

        self.yolo1 = YoloLayer(
                                anchor_mask=[0, 1, 2], num_classes=n_classes,
                                anchors=[12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401],
                                num_anchors=9, stride=8)

        self.conv3 = Conv_Bn_Activation(128, 256, 3, 2, 'leaky')

        self.conv4 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')
        self.conv5 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')
        self.conv6 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')
        self.conv7 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')
        self.conv8 = Conv_Bn_Activation(512, 256, 1, 1, 'leaky')
        self.conv9 = Conv_Bn_Activation(256, 512, 3, 1, 'leaky')
        self.conv10 = Conv_Bn_Activation(512, output_ch, 1, 1, 'linear', bn=False, bias=True)
        
        self.yolo2 = YoloLayer(
                                anchor_mask=[3, 4, 5], num_classes=n_classes,
                                anchors=[12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401],
                                num_anchors=9, stride=16)

        self.conv11 = Conv_Bn_Activation(256, 512, 3, 2, 'leaky')

        self.conv12 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')
        self.conv13 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')
        self.conv14 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')
        self.conv15 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')
        self.conv16 = Conv_Bn_Activation(1024, 512, 1, 1, 'leaky')
        self.conv17 = Conv_Bn_Activation(512, 1024, 3, 1, 'leaky')
        self.conv18 = Conv_Bn_Activation(1024, output_ch, 1, 1, 'linear', bn=False, bias=True)
        
        self.yolo3 = YoloLayer(
                                anchor_mask=[6, 7, 8], num_classes=n_classes,
                                anchors=[12, 16, 19, 36, 40, 28, 36, 75, 76, 55, 72, 146, 142, 110, 192, 243, 459, 401],
                                num_anchors=9, stride=32)

    def forward(self, input1, input2, input3):
        x1 = self.conv1(input1)
        x2 = self.conv2(x1)

        x3 = self.conv3(input1)

        x3 = torch.cat([x3, input2], dim=1)
        x4 = self.conv4(x3)
        x5 = self.conv5(x4)
        x6 = self.conv6(x5)
        x7 = self.conv7(x6)
        x8 = self.conv8(x7)
        x9 = self.conv9(x8)
        x10 = self.conv10(x9)

        x11 = self.conv11(x8)

        x11 = torch.cat([x11, input3], dim=1)

        x12 = self.conv12(x11)
        x13 = self.conv13(x12)
        x14 = self.conv14(x13)
        x15 = self.conv15(x14)
        x16 = self.conv16(x15)
        x17 = self.conv17(x16)
        x18 = self.conv18(x17)
        
        if self.inference:
            y1 = self.yolo1(x2)
            y2 = self.yolo2(x10)
            y3 = self.yolo3(x18)

            return get_region_boxes([y1, y2, y3])
        
        else:
            return [x2, x10, x18]

The YoloLayer class works directly only with those anchors that were created on a specific feature map.

Also, a threshold is used here to decide whether a given object lies in a specific region or not. If the overlap (IoU, or Intersection over Union) is greater than the threshold, then the trainable parameters of this region at a specific anchor on a specific feature map are updated to train the neural network.
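As a reminder of what the threshold is compared against, IoU (Intersection over Union) for two axis-aligned boxes can be computed as in the sketch below. The corner format `(x1, y1, x2, y2)` and the function name `iou` are assumptions made for illustration; the lesson code may store boxes differently.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2) corner coordinates."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    # Clamp at zero so that non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # 1 / 7, about 0.143
```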

Now, using the YoloLayer class, you can implement the Yolov4Head class, which contains all the above operations and outputs neural network predictions in the form of classes of recognized objects and their localization in the image.

Also, as in the case of the neck part of Yolo-v4, the head block is represented by two methods: __init__ (constructor) and forward (data processing). Moreover, it also inherits from the nn.Module base class. Both methods are shown in the Yolov4Head code above.

In the description of the class attributes, there are 3 objects of the YoloLayer class for predicting objects at different levels of the received feature maps. They differ in the input data and the size of the final localization of objects. The stride parameter increases with the depth of the feature maps.
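How the shared anchor list is split between the three levels can be seen with a short sketch. It assumes that the flat list stores (width, height) pairs and that `anchor_mask` indexes those pairs; the helper name `select_anchors` is hypothetical and is not part of YoloLayer.

```python
anchors = [12, 16, 19, 36, 40, 28, 36, 75, 76, 55,
           72, 146, 142, 110, 192, 243, 459, 401]

# Assumption: the flat list stores (width, height) pairs, one per anchor.
pairs = list(zip(anchors[0::2], anchors[1::2]))


def select_anchors(pairs, anchor_mask):
    """Pick the anchor boxes that one YoloLayer is responsible for."""
    return [pairs[i] for i in anchor_mask]


for stride, mask in ((8, [0, 1, 2]), (16, [3, 4, 5]), (32, [6, 7, 8])):
    print(stride, select_anchors(pairs, mask))
```

Under this reading, the level with stride 8 works with the three smallest boxes and the level with stride 32 with the three largest.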

Thus, deeper feature maps with a larger stride are better at predicting the localization of large objects in the image, while earlier feature maps with a smaller stride facilitate accurate prediction of small objects. It is also worth noting that the head block uses the LeakyReLU activation function, as in the neck part, and not the Mish activation function used in the backbone. In addition, before the final prediction of the neural network at each of the levels, a linear (identity) activation function is used.

In the implementation of the method, the previously described inference flag is used, which allows you to change the nature of the method depending on the type of neural network operation. When training, it is enough to return the obtained result for the operation of the error function and the backpropagation method; in turn, when the algorithm works directly in real-time prediction, it is necessary to correctly process the results for visual demonstration and further data processing. Also in this class, the Bottom Up part of the PANet architecture is visible when predicting features of different dimensions for several maps.

So, the classes backbone, neck and head of the Yolo-v4 neural network algorithm were developed. Next comes the implementation of the final class Yolov4, which contains all the above classes and is a convenient add-on for further work with the neural network.

In [12]:
class Yolov4(nn.Module):
    def __init__(self, yolov4conv137weight=None, n_classes=80, inference=False):
        super().__init__()

        output_ch = (4 + 1 + n_classes) * 3

        # backbone
        # Downsample classes
        self.down1 = None 
        self.down2 = None
        self.down3 = None
        self.down4 = None
        self.down5 = None

        # neck
        self.neck = None

        if yolov4conv137weight:
            _model = nn.Sequential(self.down1, self.down2, self.down3, self.down4, self.down5, self.neck)
            pretrained_dict = torch.load(yolov4conv137weight)

            model_dict = _model.state_dict()

            pretrained_dict = {k1: v for (k, v), k1 in zip(pretrained_dict.items(), model_dict)}

            model_dict.update(pretrained_dict)
            _model.load_state_dict(model_dict)
        
        # head: Yolov4Head class is here
        self.head = None


    def forward(self, input):
        d1 = self.down1(input)
        d2 = self.down2(d1)
        d3 = self.down3(d2)
        d4 = self.down4(d3)
        d5 = self.down5(d4)

        x20, x13, x6 = self.neck(d5, d4, d3)

        output = self.head(x20, x13, x6)
        return output

This class uses objects of the backbone, neck and head classes.

Also, functionality was developed for loading the weights of a pre-trained neural network. This implements transfer learning, in which the model is trained not from a random starting point but from weights obtained on a similar task.
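The key remapping used when loading the weights can be tried on ordinary dictionaries. The keys and values below are hypothetical stand-ins: real state dicts map parameter names to tensors.

```python
# Hypothetical stand-ins for state dicts; real ones map names to tensors.
pretrained_dict = {"models.0.conv.weight": "w0", "models.0.bn.weight": "w1",
                   "models.1.conv.weight": "w2"}
model_dict = {"0.conv1.weight": None, "0.bn1.weight": None}

# Same expression as in Yolov4.__init__: values are paired with the model's
# keys by position, and zip stops at the shorter of the two dictionaries.
remapped = {k1: v for (k, v), k1 in zip(pretrained_dict.items(), model_dict)}
model_dict.update(remapped)
print(model_dict)
```

The pairing is purely positional, so it relies on both dictionaries listing their parameters in the same order; the names in the pre-trained file are ignored.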

Thus, it is possible to achieve good recognition quality with a small amount of data. Also, the training of the neural network will be noticeably faster.