# Yolo V3 - Pytorch

En este notebook vamos a contruir el modelo de Yolo V3 en Pytorch e implementar los pesos pre-entrenados de forma que tengamos el modelo listo para ser usado en inferencia.

## Descarga del archivo de configuración

El código oficial de YOLO (darknet) utiliza un archivo de configuración para describr la topología de la red, bloque a bloque. Vamos a descargar el archivo para poder parsearlo y construir el modelo en Pytorch.

In [42]:
!wget https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg

--2018-05-22 17:34:45--  https://raw.githubusercontent.com/pjreddie/darknet/master/cfg/yolov3.cfg
Resolviendo raw.githubusercontent.com (raw.githubusercontent.com)... 151.101.132.133
Conectando con raw.githubusercontent.com (raw.githubusercontent.com)[151.101.132.133]:443... conectado.
Petición HTTP enviada, esperando respuesta... 200 OK
Longitud: 8342 (8,1K) [text/plain]
Grabando a: “yolov3.cfg.1”


2018-05-22 17:34:46 (18,2 MB/s) - “yolov3.cfg.1” guardado [8342/8342]



## Parseado del archivo de configuración

Si abrimos el archivo de configuración veremos los diferentes bloques que forman la red: capas convolucionales, capas residuales, upsampling, capas de concatenación y la capa de detección.

In [43]:
from __future__ import division

import torch 
import torch.nn as nn
import torch.nn.functional as F 
from torch.autograd import Variable
import numpy as np
import cv2

In [44]:
def parse_cfg(cfgfile):
    # abrimos el archivo
    file = open(cfgfile, 'r')
    # guardar las líneas sin líneas en blanco ni comentarios
    lines = file.read().split('\n')                        
    lines = [x for x in lines if len(x) > 0]                
    lines = [x for x in lines if x[0] != '#']              
    lines = [x.rstrip().lstrip() for x in lines]
    
    block = {}
    blocks = []
    # loop sobre las líneas del archivo
    for line in lines:
        # nuevo bloque
        if line[0] == "[":               
            if len(block) != 0:          
                blocks.append(block)     
                block = {}               
            block["type"] = line[1:-1].rstrip()     
        else:
            key,value = line.split("=") 
            block[key.rstrip()] = value.lstrip()
    blocks.append(block)

    return blocks

Podemos ver la información de cada bloque cambiando el id.

In [45]:
blocks = parse_cfg("yolov3.cfg")

In [46]:
block_id = 48
blocks[block_id]

{'activation': 'leaky',
 'batch_normalize': '1',
 'filters': '256',
 'pad': '1',
 'size': '1',
 'stride': '1',
 'type': 'convolutional'}

## Contruyendo los bloques

Ahora tenemos una lista con todos los bloques de la red. Usaremos esta lista para construir cada bloque con Pytorch. De las distintas capas del modelo, Pytorch contiene los tipos convolucional y upsampling. El resto de bloques los tenemos que definir.

In [47]:
class EmptyLayer(nn.Module):
    # para la capa router, haremos el concat en el forward de la red !
    def __init__(self):
        super(EmptyLayer, self).__init__()
        
class DetectionLayer(nn.Module):
    def __init__(self, anchors):
        super(DetectionLayer, self).__init__()
        self.anchors = anchors

Ahora iteramos sobre la lista creando los bloques y añadiéndolos al modelo.

In [48]:
def create_modules(blocks):
    net_info = blocks[0]     
    # modelos -> lista con los bloques
    module_list = nn.ModuleList()
    prev_filters = 3
    output_filters = []
    
    for index, x in enumerate(blocks[1:]):
        # cada bloque es un sequential (son varias capas: batchnorm, relu, ...)
        module = nn.Sequential()
        
        # capa convolcional
        if (x["type"] == "convolutional"):

            activation = x["activation"]
            try:
                batch_normalize = int(x["batch_normalize"])
                bias = False
            except:
                batch_normalize = 0
                bias = True

            filters= int(x["filters"])
            padding = int(x["pad"])
            kernel_size = int(x["size"])
            stride = int(x["stride"])

            if padding:
                pad = (kernel_size - 1) // 2
            else:
                pad = 0

            conv = nn.Conv2d(prev_filters, filters, kernel_size, stride, pad, bias = bias)
            module.add_module("conv_{0}".format(index), conv)

            if batch_normalize:
                bn = nn.BatchNorm2d(filters)
                module.add_module("batch_norm_{0}".format(index), bn)

            if activation == "leaky":
                activn = nn.LeakyReLU(0.1, inplace = True)
                module.add_module("leaky_{0}".format(index), activn)

        # capa upsampling
        elif (x["type"] == "upsample"):
            stride = int(x["stride"])
            upsample = nn.Upsample(scale_factor = 2, mode = "bilinear")
            module.add_module("upsample_{}".format(index), upsample)
        
        # capa router
        elif (x["type"] == "route"):
            x["layers"] = x["layers"].split(',')
            #Start
            start = int(x["layers"][0])
            #end
            try:
                end = int(x["layers"][1])
            except:
                end = 0
            
            if start > 0: 
                start = start - index
            if end > 0:
                end = end - index
            route = EmptyLayer()
            module.add_module("route_{0}".format(index), route)
            if end < 0:
                filters = output_filters[index + start] + output_filters[index + end]
            else:
                filters= output_filters[index + start]

        # capa residual
        elif x["type"] == "shortcut":
            shortcut = EmptyLayer()
            module.add_module("shortcut_{}".format(index), shortcut)
        
        # capa de detección
        elif x["type"] == "yolo":
            mask = x["mask"].split(",")
            mask = [int(x) for x in mask]

            anchors = x["anchors"].split(",")
            anchors = [int(a) for a in anchors]
            anchors = [(anchors[i], anchors[i+1]) for i in range(0, len(anchors),2)]
            anchors = [anchors[i] for i in mask]

            detection = DetectionLayer(anchors)
            module.add_module("Detection_{}".format(index), detection)
        
        module_list.append(module)
        prev_filters = filters
        output_filters.append(filters)
        
    return (net_info, module_list)

Veamos la lista de bloques creada:

In [49]:
blocks = parse_cfg("yolov3.cfg")
print(create_modules(blocks))

({'type': 'net', 'batch': '1', 'subdivisions': '1', 'width': '416', 'height': '416', 'channels': '3', 'momentum': '0.9', 'decay': '0.0005', 'angle': '0', 'saturation': '1.5', 'exposure': '1.5', 'hue': '.1', 'learning_rate': '0.001', 'burn_in': '1000', 'max_batches': '500200', 'policy': 'steps', 'steps': '400000,450000', 'scales': '.1,.1'}, ModuleList(
  (0): Sequential(
    (conv_0): Conv2d(3, 32, kernel_size=(3, 3), stride=(1, 1), padding=(1, 1), bias=False)
    (batch_norm_0): BatchNorm2d(32, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (leaky_0): LeakyReLU(negative_slope=0.1, inplace)
  )
  (1): Sequential(
    (conv_1): Conv2d(32, 64, kernel_size=(3, 3), stride=(2, 2), padding=(1, 1), bias=False)
    (batch_norm_1): BatchNorm2d(64, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True)
    (leaky_1): LeakyReLU(negative_slope=0.1, inplace)
  )
  (2): Sequential(
    (conv_2): Conv2d(64, 32, kernel_size=(1, 1), stride=(1, 1), bias=False)
    (batch_nor

## Definiendo la red

YOLO lleva a cabo detecciones en tres escalas distintas. Para poder procesarlas de forma similar usaremos la siguiente función:

In [50]:
def predict_transform(prediction, inp_dim, anchors, num_classes, CUDA = True):
    batch_size = prediction.size(0)
    stride =  inp_dim // prediction.size(2)
    grid_size = inp_dim // stride
    bbox_attrs = 5 + num_classes
    num_anchors = len(anchors)
    
    prediction = prediction.view(batch_size, bbox_attrs*num_anchors, grid_size*grid_size)
    prediction = prediction.transpose(1,2).contiguous()
    prediction = prediction.view(batch_size, grid_size*grid_size*num_anchors, bbox_attrs)

    anchors = [(a[0]/stride, a[1]/stride) for a in anchors]

    #Sigmoid the  centre_X, centre_Y. and object confidencce
    prediction[:,:,0] = torch.sigmoid(prediction[:,:,0])
    prediction[:,:,1] = torch.sigmoid(prediction[:,:,1])
    prediction[:,:,4] = torch.sigmoid(prediction[:,:,4])

    #Add the center offsets
    grid = np.arange(grid_size)
    a,b = np.meshgrid(grid, grid)

    x_offset = torch.FloatTensor(a).view(-1,1)
    y_offset = torch.FloatTensor(b).view(-1,1)

    if CUDA:
        x_offset = x_offset.cuda()
        y_offset = y_offset.cuda()

    x_y_offset = torch.cat((x_offset, y_offset), 1).repeat(1,num_anchors).view(-1,2).unsqueeze(0)

    prediction[:,:,:2] += x_y_offset
    
    #log space transform height and the width
    anchors = torch.FloatTensor(anchors)

    if CUDA:
        anchors = anchors.cuda()

    anchors = anchors.repeat(grid_size*grid_size, 1).unsqueeze(0)
    prediction[:,:,2:4] = torch.exp(prediction[:,:,2:4])*anchors
    
    prediction[:,:,5: 5 + num_classes] = torch.sigmoid((prediction[:,:, 5 : 5 + num_classes]))

    prediction[:,:,:4] *= stride

    return prediction

En la red, definimos el forward pass así como la función para cargar los pesos.

In [51]:
class Darknet(nn.Module):
    def __init__(self, cfgfile):
        super(Darknet, self).__init__()
        self.blocks = parse_cfg(cfgfile)
        self.net_info, self.module_list = create_modules(self.blocks)
    def forward(self, x, CUDA):
        modules = self.blocks[1:]
        outputs = {} 
        write = 0     
        for i, module in enumerate(modules):        
            module_type = (module["type"])

            if module_type == "convolutional" or module_type == "upsample":
                x = self.module_list[i](x)
            
            elif module_type == "route":
                layers = module["layers"]
                layers = [int(a) for a in layers]

                if (layers[0]) > 0:
                    layers[0] = layers[0] - i

                if len(layers) == 1:
                    x = outputs[i + (layers[0])]

                else:
                    if (layers[1]) > 0:
                        layers[1] = layers[1] - i

                    map1 = outputs[i + layers[0]]
                    map2 = outputs[i + layers[1]]

                    x = torch.cat((map1, map2), 1)

            elif  module_type == "shortcut":
                from_ = int(module["from"])
                x = outputs[i-1] + outputs[i+from_]

            elif module_type == 'yolo':        
                anchors = self.module_list[i][0].anchors
                inp_dim = int (self.net_info["height"])
                num_classes = int (module["classes"])
                x = x.data
                x = predict_transform(x, inp_dim, anchors, num_classes, CUDA)
                if not write:              
                    detections = x
                    write = 1
                else:       
                    detections = torch.cat((detections, x), 1)

            outputs[i] = x
                           
        return detections
    
    def load_weights(self, weightfile):
        fp = open(weightfile, "rb")
        header = np.fromfile(fp, dtype = np.int32, count = 5)
        self.header = torch.from_numpy(header)
        self.seen = self.header[3]

        weights = np.fromfile(fp, dtype = np.float32)

        ptr = 0
        for i in range(len(self.module_list)):
            module_type = self.blocks[i + 1]["type"]

            if module_type == "convolutional":
                model = self.module_list[i]
                try:
                    batch_normalize = int(self.blocks[i+1]["batch_normalize"])
                except:
                    batch_normalize = 0

                conv = model[0]

                if (batch_normalize):
                    bn = model[1]

                    num_bn_biases = bn.bias.numel()

                    bn_biases = torch.from_numpy(weights[ptr:ptr + num_bn_biases])
                    ptr += num_bn_biases

                    bn_weights = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases

                    bn_running_mean = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases

                    bn_running_var = torch.from_numpy(weights[ptr: ptr + num_bn_biases])
                    ptr  += num_bn_biases

                    bn_biases = bn_biases.view_as(bn.bias.data)
                    bn_weights = bn_weights.view_as(bn.weight.data)
                    bn_running_mean = bn_running_mean.view_as(bn.running_mean)
                    bn_running_var = bn_running_var.view_as(bn.running_var)

                    bn.bias.data.copy_(bn_biases)
                    bn.weight.data.copy_(bn_weights)
                    bn.running_mean.copy_(bn_running_mean)
                    bn.running_var.copy_(bn_running_var)

                else:
                    num_biases = conv.bias.numel()

                    conv_biases = torch.from_numpy(weights[ptr: ptr + num_biases])
                    ptr = ptr + num_biases

                    conv_biases = conv_biases.view_as(conv.bias.data)

                    conv.bias.data.copy_(conv_biases)

                num_weights = conv.weight.numel()

                conv_weights = torch.from_numpy(weights[ptr:ptr+num_weights])
                ptr = ptr + num_weights

                conv_weights = conv_weights.view_as(conv.weight.data)
                conv.weight.data.copy_(conv_weights)

El archivo con los pesos pre-entrenados lo podemos descargar [aquí](https://pjreddie.com/media/files/yolov3.weights). Se trata de un archivo binario que únicamente contiene los pesos guardados en serie como floats. Para poder cargar el archivo correctamente tenemos que tener en cuenta la forma en la que se guardaron. El orden corresponde con el del archivo de configuración.

In [52]:
!wget https://pjreddie.com/media/files/yolov3.weights

## Probando la red

Ahora podemos cargar el modelo con los pesos y probar con una imágen.

In [53]:
model = Darknet("yolov3.cfg")
model.load_weights("yolov3.weights")

img = cv2.imread("caprica.JPG")
img = cv2.resize(img, (416,416))          
img =  img[:,:,::-1].transpose((2,0,1))  # BGR -> RGB | H X W C -> C X H X W 
img = img[np.newaxis,:,:,:]/255.0       #Add a channel at 0 (for batch) | Normalise
img = torch.from_numpy(img).float()     
img = Variable(img)      
pred = model(img, torch.cuda.is_available())
print (pred)

  "See the documentation of nn.Upsample for details.".format(mode))


tensor([[[ 1.5565e+01,  1.5118e+01,  8.5422e+01,  ...,  7.2766e-07,
           5.6406e-08,  1.2622e-07],
         [ 1.7614e+01,  5.0516e+00,  1.0757e+02,  ...,  6.9028e-08,
           3.8804e-07,  9.4948e-07],
         [ 1.2173e+01,  1.1296e+00,  4.8232e+02,  ...,  1.2411e-06,
           5.0001e-06,  6.6611e-06],
         ...,
         [ 4.1122e+02,  4.1259e+02,  7.0144e+00,  ...,  1.4776e-03,
           8.1804e-04,  1.7272e-03],
         [ 4.1265e+02,  4.1080e+02,  6.3995e+00,  ...,  3.8574e-03,
           2.4466e-03,  2.1756e-03],
         [ 4.1100e+02,  4.1288e+02,  9.2589e+01,  ...,  2.0659e-03,
           1.4971e-03,  2.6143e-03]]])


Obtenemos un tensor de tamaño (batch size x 10647 x 85). Por cada imágen en el batch tenemos 10647 predicciones con los atributos de la bounding box (4 transformaciones sobre el anchor), la probabilidad de que haya un objeto en la caja y 80 valores que corresponden a la probabilidad de que el objeto detectado corresponda a cada una de las 80 clases en el COCO dataset (4+1+80=85).

In [54]:
pred.shape

torch.Size([1, 10647, 85])

## Non-maximal supression

Como hemos visto, YOLO nos da 10647 predicciones para una imágen de 416x416. Sin embargo, si en una imágen sólo tenemos uno o dos objetos, sólo queremos tener dos detecciones. 

Para ello, en primer lugar, vamos a establecer un valor mínimo para la probabilidad de que haya un objeto en la caja. Todas las predicciones que no superen este umbral, serán descartadas.

En segundo lugar usaremos el concepto de IoU (intersection over union) para quedarnos con una sola caja por cada objeto.

In [55]:
def unique(tensor):
    tensor_np = tensor.cpu().numpy()
    unique_np = np.unique(tensor_np)
    unique_tensor = torch.from_numpy(unique_np)
    
    tensor_res = tensor.new(unique_tensor.shape)
    tensor_res.copy_(unique_tensor)
    return tensor_res

def bbox_iou(box1, box2):

    #Get the coordinates of bounding boxes
    b1_x1, b1_y1, b1_x2, b1_y2 = box1[:,0], box1[:,1], box1[:,2], box1[:,3]
    b2_x1, b2_y1, b2_x2, b2_y2 = box2[:,0], box2[:,1], box2[:,2], box2[:,3]
    
    #get the corrdinates of the intersection rectangle
    inter_rect_x1 =  torch.max(b1_x1, b2_x1)
    inter_rect_y1 =  torch.max(b1_y1, b2_y1)
    inter_rect_x2 =  torch.min(b1_x2, b2_x2)
    inter_rect_y2 =  torch.min(b1_y2, b2_y2)
    
    #Intersection area
    inter_area = torch.clamp(inter_rect_x2 - inter_rect_x1 + 1, min=0) * torch.clamp(inter_rect_y2 - inter_rect_y1 + 1, min=0)
 
    #Union Area
    b1_area = (b1_x2 - b1_x1 + 1)*(b1_y2 - b1_y1 + 1)
    b2_area = (b2_x2 - b2_x1 + 1)*(b2_y2 - b2_y1 + 1)
    
    iou = inter_area / (b1_area + b2_area - inter_area)
    
    return iou

def write_results(prediction, confidence, num_classes, nms_conf = 0.4):
    conf_mask = (prediction[:,:,4] > confidence).float().unsqueeze(2)
    prediction = prediction*conf_mask
    
    box_corner = prediction.new(prediction.shape)
    box_corner[:,:,0] = (prediction[:,:,0] - prediction[:,:,2]/2)
    box_corner[:,:,1] = (prediction[:,:,1] - prediction[:,:,3]/2)
    box_corner[:,:,2] = (prediction[:,:,0] + prediction[:,:,2]/2) 
    box_corner[:,:,3] = (prediction[:,:,1] + prediction[:,:,3]/2)
    prediction[:,:,:4] = box_corner[:,:,:4]

    batch_size = prediction.size(0)

    write = False

    for ind in range(batch_size):
        image_pred = prediction[ind]          #image Tensor
           #confidence threshholding 
           #NMS
        
        max_conf, max_conf_score = torch.max(image_pred[:,5:5+ num_classes], 1)
        max_conf = max_conf.float().unsqueeze(1)
        max_conf_score = max_conf_score.float().unsqueeze(1)
        seq = (image_pred[:,:5], max_conf, max_conf_score)
        image_pred = torch.cat(seq, 1)
        
        non_zero_ind =  (torch.nonzero(image_pred[:,4]))
        try:
            image_pred_ = image_pred[non_zero_ind.squeeze(),:].view(-1,7)
        except:
            continue
        
        #For PyTorch 0.4 compatibility
        #Since the above code with not raise exception for no detection 
        #as scalars are supported in PyTorch 0.4
        if image_pred_.shape[0] == 0:
            continue 
            
        #Get the various classes detected in the image
        img_classes = unique(image_pred_[:,-1]) # -1 index holds the class index
        
        for cls in img_classes:
            #perform NMS
            
            #get the detections with one particular class
            cls_mask = image_pred_*(image_pred_[:,-1] == cls).float().unsqueeze(1)
            class_mask_ind = torch.nonzero(cls_mask[:,-2]).squeeze()
            image_pred_class = image_pred_[class_mask_ind].view(-1,7)

            #sort the detections such that the entry with the maximum objectness
            #confidence is at the top
            conf_sort_index = torch.sort(image_pred_class[:,4], descending = True )[1]
            image_pred_class = image_pred_class[conf_sort_index]
            idx = image_pred_class.size(0)   #Number of detections
            
            for i in range(idx):
                #Get the IOUs of all boxes that come after the one we are looking at 
                #in the loop
                try:
                    ious = bbox_iou(image_pred_class[i].unsqueeze(0), image_pred_class[i+1:])
                except ValueError:
                    break

                except IndexError:
                    break

                #Zero out all the detections that have IoU > treshhold
                iou_mask = (ious < nms_conf).float().unsqueeze(1)
                image_pred_class[i+1:] *= iou_mask       

                #Remove the non-zero entries
                non_zero_ind = torch.nonzero(image_pred_class[:,4]).squeeze()
                image_pred_class = image_pred_class[non_zero_ind].view(-1,7)
                
            batch_ind = image_pred_class.new(image_pred_class.size(0), 1).fill_(ind)      
            #Repeat the batch_id for as many detections of the class cls in the image
            seq = batch_ind, image_pred_class

            if not write:
                output = torch.cat(seq,1)
                write = True
            else:
                out = torch.cat(seq,1)
                output = torch.cat((output,out))
                
    try:
        return output
    except:
        return 0

Ahora ya tenemos todo lo necesario para usar la red.