


# Implementation of the Proposed Method in "Arbitrary Shape Scene Text Detection with Adaptive Text Region Representation"

**Authors: Maryam Sadat Hshemi, Sara Aein**

**Download paper: https://arxiv.org/abs/1905.05980**


In [0]:
from google.colab import drive
drive.mount('/content/drive')



The network in this paper has two parts. The first part is a Text_RPN that generates text-location proposals, and the second part is an LSTM network that refines these proposals.

## 1. Text_RPN

### Import Prerequisites

In [0]:
from keras import backend as K
import keras
from keras.models import Model
from keras.layers import Input, Dense, Conv2D, MaxPool2D, GlobalAveragePooling2D, multiply
import numpy as np

Using TensorFlow backend.


### Squeeze and Excitation Block

The backbone of the first part is an SE_VGG16 network. It is the VGG16 network with one difference: an SE block is placed after each pooling layer.

In [0]:
def SE_Block(in_block, name = '', ratio=16):
    shape = in_block.shape.as_list()
    filters = shape[-1]

    x = GlobalAveragePooling2D(name = name + '_GlobalAvgPool')(in_block)
    x = Dense(filters // ratio, activation='relu',use_bias= False, name = name + '_1')(x)
    x = Dense(filters, activation='sigmoid',use_bias= False, name = name + '_2')(x)
    x = multiply([in_block,x],name = name + '_multiply')
    return x
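
The arithmetic inside the block can be followed more easily outside Keras. The NumPy sketch below applies the same three steps (squeeze, excitation, scale) to a single feature map. The function name `se_block_numpy` and the random weights `w1` and `w2` are assumptions made for illustration; they are not part of the notebook or of trained weights.

```python
import numpy as np

def se_block_numpy(feature_map, w1, w2):
    """Squeeze-and-excitation on one (H, W, C) feature map, without biases."""
    # Squeeze: global average pooling gives one value per channel -> (C,)
    squeezed = feature_map.mean(axis=(0, 1))
    # Excitation: bottleneck Dense + ReLU, then Dense + sigmoid -> (C,)
    hidden = np.maximum(squeezed @ w1, 0.0)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))
    # Scale: broadcast the per-channel gates over every spatial position
    return feature_map * gates, gates

rng = np.random.default_rng(0)
channels, ratio = 64, 16
features = rng.standard_normal((8, 8, channels))
w1 = rng.standard_normal((channels, channels // ratio))  # hypothetical weights
w2 = rng.standard_normal((channels // ratio, channels))  # hypothetical weights
scaled, gates = se_block_numpy(features, w1, w2)
print(scaled.shape, gates.shape)
```

The output keeps the shape of the input, which is why the block can be dropped in after any pooling layer without changing the layers that follow.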

### SE_VGG16
This is the backbone network of Text_RPN.

In [0]:
def SE_VGG16(input_tensor = None):

  input_shape = (None, None, 3)

  if input_tensor is None:
    img_input = Input(shape=input_shape)
  else:
    if not K.is_keras_tensor(input_tensor):
      img_input = Input(tensor=input_tensor, shape=input_shape)
    else:
      img_input = input_tensor

  conv1_1 = Conv2D(filters = 64, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv1_1')(img_input)
  conv1_2 = Conv2D(filters = 64, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv1_2')(conv1_1)
  pool1 = MaxPool2D(pool_size = (2,2), strides = (2,2), name = 'pool1')(conv1_2)
  se1 = SE_Block(pool1,'se1')

  conv2_1 = Conv2D(filters = 128, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv2_1')(se1)
  conv2_2 = Conv2D(filters = 128, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv2_2')(conv2_1)
  pool2 = MaxPool2D(pool_size = (2,2), strides = (2,2), name = 'pool2')(conv2_2)
  se2 = SE_Block(pool2,'se2')

  conv3_1 = Conv2D(filters = 256, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv3_1')(se2)
  conv3_2 = Conv2D(filters = 256, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv3_2')(conv3_1)
  conv3_3 = Conv2D(filters = 256, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv3_3')(conv3_2)
  pool3 = MaxPool2D(pool_size = (2,2), strides = (2,2), name = 'pool3')(conv3_3)
  se3 = SE_Block(pool3, 'se3')

  conv4_1 = Conv2D(filters = 512, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv4_1')(se3)
  conv4_2 = Conv2D(filters = 512, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv4_2')(conv4_1)
  conv4_3 = Conv2D(filters = 512, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv4_3')(conv4_2)
  pool4 = MaxPool2D(pool_size = (2,2), strides = (2,2), name = 'pool4')(conv4_3)
  se4 = SE_Block(pool4, 'se4')

  conv5_1 = Conv2D(filters = 512, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv5_1')(se4)
  conv5_2 = Conv2D(filters = 512, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv5_2')(conv5_1)
  conv5_3 = Conv2D(filters = 512, kernel_size =(3,3), padding = 'same', activation ='relu', name = 'conv5_3')(conv5_2)
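  # pool5 pools conv5_3 and needs its own unique layer name.
  # Note: with five poolings the backbone stride is 32, whereas Config.rpn_stride and get_img_output_length use 16.
  pool5 = MaxPool2D(pool_size = (2,2), strides = (2,2), name = 'pool5')(conv5_3)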
  se5 = SE_Block(pool5,'se5')

  return se5

### Generate Anchor Boxes


### Region Proposal Network
After the SE_VGG16 backbone, an RPN must be trained to produce text-location proposals.
The structure of this RPN is the same as the RPN in the Faster R-CNN paper, except that its anchor sizes are different.
This section implements the core of the RPN, which outputs the text region and the class of each anchor.
Here the class indicates whether text is present or absent.

In [0]:
def rpn_layer(base_layers, num_anchors):
    x = Conv2D(512, (3, 3), padding='same', activation='relu', kernel_initializer='normal', name='rpn_conv1')(base_layers)
    
    x_class = Conv2D(num_anchors, (1, 1), activation='sigmoid', kernel_initializer='uniform', name='rpn_out_class')(x)
    x_regr = Conv2D(num_anchors * 4, (1, 1), activation='linear', kernel_initializer='zero', name='rpn_out_regress')(x)

    return [x_class, x_regr, base_layers]
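
As a quick sanity check on the two heads, the sketch below computes the shapes they produce for a given feature map. The helper name `rpn_output_shapes` and the 18 x 33 feature-map size are assumptions chosen for illustration; the anchor count follows the scales and ratios in the Config cell of this notebook.

```python
def rpn_output_shapes(feature_height, feature_width, num_anchors):
    """Shapes (without the batch axis) of the two RPN heads for one feature map."""
    # One text / no-text score per anchor at every position
    cls_shape = (feature_height, feature_width, num_anchors)
    # Four box offsets per anchor at every position
    regr_shape = (feature_height, feature_width, num_anchors * 4)
    return cls_shape, regr_shape

# 5 scales x 3 ratios, as in the Config cell of this notebook
num_anchors = 5 * 3
cls_shape, regr_shape = rpn_output_shapes(18, 33, num_anchors)
print(cls_shape, regr_shape)
```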

### ROI Pooling
The output of the RPN must be passed to an ROI Pooling layer, which converts the different proposals produced by the previous network to one fixed size.
This part is taken from the following repository:

[faster-rcnn-keras](https://github.com/shadow12138/faster-rcnn-keras)

In [0]:
from keras.layers import Input, Conv2D, MaxPooling2D
from keras.layers import TimeDistributed
from keras.layers import Flatten, Dense, Dropout
from keras.engine import Layer
from keras import backend as K
import tensorflow as tf


class RoiPoolingConv(Layer):
    """
    ROI pooling layer for 2D inputs.
    See Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition,
    K. He, X. Zhang, S. Ren, J. Sun
    # Arguments
        pool_size: int
            Size of pooling region to use. pool_size = 7 will result in a 7x7 region.
        num_rois: number of regions of interest to be used
    # Input shape
        list of two tensors [X_img, X_roi] with shape:
        X_img:
        `(1, rows, cols, channels)`
        X_roi:
        `(1, num_rois, 4)` list of rois, with ordering (x, y, w, h)
    # Output shape
        5D tensor with shape:
        `(1, num_rois, pool_size, pool_size, channels)`
    """
    def __init__(self, pool_size, num_rois, **kwargs):
        self.dim_ordering = K.image_dim_ordering()
        self.pool_size = pool_size
        self.num_rois = num_rois

        super(RoiPoolingConv, self).__init__(**kwargs)
    def build(self, input_shape):
        self.nb_channels = input_shape[0][3]
    def compute_output_shape(self, input_shape):
        return None, self.num_rois, self.pool_size, self.pool_size, self.nb_channels
    def call(self, x, mask=None):
        assert (len(x) == 2)

        # x[0] is image with shape (rows, cols, channels)
        img = x[0]

        # x[1] is roi with shape (num_rois,4) with ordering (x,y,w,h)
        rois = x[1]

        input_shape = K.shape(img)

        outputs = []

        for roi_idx in range(self.num_rois):
            x = rois[0, roi_idx, 0]
            y = rois[0, roi_idx, 1]
            w = rois[0, roi_idx, 2]
            h = rois[0, roi_idx, 3]

            x = K.cast(x, 'int32')
            y = K.cast(y, 'int32')
            w = K.cast(w, 'int32')
            h = K.cast(h, 'int32')

            # Resized roi of the image to pooling size (7x7)
            rs = tf.image.resize_images(img[:, y:y + h, x:x + w, :], (self.pool_size, self.pool_size))
            outputs.append(rs)

        final_output = K.concatenate(outputs, axis=0)

        # Reshape to (1, num_rois, pool_size, pool_size, nb_channels)
        # Might be (1, 4, 7, 7, 3)
        final_output = K.reshape(final_output, (1, self.num_rois, self.pool_size, self.pool_size, self.nb_channels))

        # permute_dimensions works like transpose; this identity order leaves the channels-last layout unchanged
        final_output = K.permute_dimensions(final_output, (0, 1, 2, 3, 4))

        return final_output
        
    def get_config(self):
        config = {'pool_size': self.pool_size,
                  'num_rois': self.num_rois}
        base_config = super(RoiPoolingConv, self).get_config()
        return dict(list(base_config.items()) + list(config.items()))

def get_img_output_length(width, height):
    def get_output_length(input_length):
        return input_length // 16

    return get_output_length(width), get_output_length(height)
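
The layer above crops each ROI and resizes the crop to `pool_size` x `pool_size`. For comparison, the classic formulation of ROI pooling takes the maximum inside each bin instead of resizing. The NumPy sketch below shows that variant for a single ROI; it is a different technique from the resize used in `RoiPoolingConv`, and the name `roi_max_pool` is hypothetical.

```python
import numpy as np

def roi_max_pool(feature_map, roi, pool_size):
    """Max-pool one ROI of an (H, W, C) feature map to (pool_size, pool_size, C).

    roi is (x, y, w, h) in feature-map coordinates, with w >= 1 and h >= 1.
    """
    x, y, w, h = roi
    region = feature_map[y:y + h, x:x + w, :]
    # Integer bin edges that split the region into pool_size rows and columns
    row_edges = np.linspace(0, h, pool_size + 1).astype(int)
    col_edges = np.linspace(0, w, pool_size + 1).astype(int)
    pooled = np.zeros((pool_size, pool_size, feature_map.shape[2]), dtype=feature_map.dtype)
    for i in range(pool_size):
        for j in range(pool_size):
            r0, c0 = row_edges[i], col_edges[j]
            # Every bin covers at least one cell, even when the ROI is smaller than pool_size
            r1 = max(row_edges[i + 1], r0 + 1)
            c1 = max(col_edges[j + 1], c0 + 1)
            pooled[i, j] = region[r0:r1, c0:c1].max(axis=(0, 1))
    return pooled

feature_map = np.arange(6 * 8 * 2, dtype=np.float32).reshape(6, 8, 2)
pooled = roi_max_pool(feature_map, roi=(1, 2, 4, 4), pool_size=2)
print(pooled.shape)
```

Either way, every proposal ends up with the same spatial size, which is what the following network needs.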


## 2. Text Refinement Network
Because the number of point pairs for each text is not known in advance, a recurrent network is needed in this part; the paper uses an LSTM network.

In [0]:
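def LSTM(roi_input, units=4):
  # keras.layers.LSTM takes the number of units first; the layer is then called on the input tensor.
  # units=4 is a placeholder, not a value taken from the paper.
  return keras.layers.LSTM(units)(roi_input)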

## 3. Setting Config
This section configures the general settings of the network: the type of backbone, the number of anchors and their scales, the input image size, the RPN stride, the number and labels of the classes, and so on.
This part is taken from the following repository:

[faster-rcnn-keras](https://github.com/shadow12138/faster-rcnn-keras)

In [0]:
class Config:

	def __init__(self):

		# Name of base network
		self.network = 'se_vgg16'

		# Anchor box scales
		# Note that if im_size is smaller, anchor_box_scales should be scaled
		# Original anchor_box_scales in the paper is [32, 64, 128, 256, 512]
		self.anchor_box_scales = [32, 64, 128, 256, 512]

		# Anchor box ratios
		self.anchor_box_ratios = [[0.5, 0.5], [1, 1], [2, 2]]

		# Size to resize the smallest side of the image
		# Original setting in paper is 600.
		self.im_size = 300		

		# image channel-wise mean to subtract
		self.img_channel_mean = [103.939, 116.779, 123.68]
		self.img_scaling_factor = 1.0

		# number of ROIs at once
		self.num_rois = 4

		# stride at the RPN (this depends on the network configuration)
		self.rpn_stride = 16

		self.balanced_classes = False

		# scaling the stdev
		self.std_scaling = 4.0
		self.classifier_regr_std = [8.0, 8.0, 4.0, 4.0]

		# overlaps for RPN
		self.rpn_min_overlap = 0.3
		self.rpn_max_overlap = 0.7

		# overlaps for classifier ROIs
		self.classifier_min_overlap = 0.1
		self.classifier_max_overlap = 0.5

		# placeholder for the class mapping, automatically generated by the parser
		self.class_mapping = {'text': 1, 'bg': 0}

		self.model_path = None
		self.training_annotation = None
		self.cfg_save_path = None
		self.classes_count = None
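
To see what these settings imply, the sketch below enumerates the anchor sizes for every scale and ratio. It assumes that each ratio pair multiplies the scale to give the anchor width and height, which is how the Faster R-CNN Keras code this notebook borrows from is commonly written; the helper name `anchor_sizes` is hypothetical.

```python
def anchor_sizes(scales, ratios):
    """Return (width, height) in pixels for every scale/ratio combination."""
    # Assumption: width = scale * ratio[0], height = scale * ratio[1]
    return [(scale * rw, scale * rh) for scale in scales for rw, rh in ratios]

scales = [32, 64, 128, 256, 512]
ratios = [[0.5, 0.5], [1, 1], [2, 2]]
anchors = anchor_sizes(scales, ratios)
print(len(anchors))   # 5 scales x 3 ratios
print(anchors[:3])
```

Under this assumption, both entries of each ratio pair are equal, so every anchor is square and the pairs act as extra scales rather than aspect ratios.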


## 4. Data

### Data Processing
This section defines a set of general operations that are applied to the data, such as reading the data and placing it in a list in a fixed order, resizing the images, and so on.
This part is taken from the following repository:

[faster-rcnn-keras](https://github.com/shadow12138/faster-rcnn-keras)

In [0]:
import copy
import cv2
import numpy as np
import os
import shutil

def get_new_img_size(width, height, img_min_side=300):
    if width <= height:
        f = float(img_min_side) / width
        resized_height = int(f * height)
        resized_width = img_min_side
    else:
        f = float(img_min_side) / height
        resized_width = int(f * width)
        resized_height = img_min_side

    return resized_width, resized_height

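def GetData(annotation_file_path, class_mapping):
    """
    Each line of the annotation file should look like the following:
    image_path x1,y1,x2,y2,cls1_id x1,y1,x2,y2,cls2_id
    for example: pic1.png 0,0,100,100,0 200,200,350,300,1
    :param annotation_file_path: path of the annotation file
    :param class_mapping: dict from class name to class id
    :return: list of image records, per-class box counts, class_mapping
    """

    classes_count = {}
    class_mapping2 = {}
    for key in class_mapping:
        classes_count[key] = 0
        class_mapping2[class_mapping[key]] = key

    all_data = []
    with open(annotation_file_path, 'r') as file:
        for line in file:
            line = line.strip().split()
            if not line:
                continue
            # Trailing 'x1,y1,x2,y2,cls_id' tokens are boxes; the tokens before
            # them form the image path, which may itself contain spaces.
            first_box = len(line)
            while first_box > 1 and line[first_box - 1].count(',') == 4:
                first_box -= 1
            filepath = ' '.join(line[:first_box])

            # The annotation format carries no image size, so read it from the image
            img = cv2.imread(filepath)
            if img is None:
                continue
            height, width = img.shape[:2]

            bboxes = []
            for bbox in line[first_box:]:
                x1, y1, x2, y2, cls_id = map(int, bbox.split(','))
                cls = class_mapping2[cls_id]
                bboxes.append({'class': cls, 'x1': x1, 'y1': y1, 'x2': x2, 'y2': y2})
                classes_count[cls] += 1

            all_data.append({
                'filepath': filepath,
                'height': height,
                'width': width,
                'bboxes': bboxes
            })
    return all_data, classes_count, class_mapping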

def create_dir(root, delete=False):
    if os.path.exists(root):
        if not delete:
            return
        shutil.rmtree(root)
    os.makedirs(root)

def format_img_size(img, C):
    """ formats the image size based on config """
    img_min_side = float(C.im_size)
    (height, width, _) = img.shape

    if width <= height:
        ratio = img_min_side / width
        new_height = int(ratio * height)
        new_width = int(img_min_side)
    else:
        ratio = img_min_side / height
        new_width = int(ratio * width)
        new_height = int(img_min_side)
    img = cv2.resize(img, (new_width, new_height), interpolation=cv2.INTER_CUBIC)
    return img, ratio

def get_real_coordinates(ratio, x1, y1, x2, y2):
    real_x1 = int(round(x1 // ratio))
    real_y1 = int(round(y1 // ratio))
    real_x2 = int(round(x2 // ratio))
    real_y2 = int(round(y2 // ratio))
    return real_x1, real_y1, real_x2, real_y2
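
A short worked example may help with the resizing arithmetic. The sketch below repeats the shorter-side rule of `get_new_img_size` and maps a coordinate back to the original image. The names `resize_to_min_side` and `to_original` are hypothetical, the 1280 x 720 input is only an illustration, and `to_original` uses plain division before rounding, which is a small deviation from `get_real_coordinates`.

```python
def resize_to_min_side(width, height, img_min_side=300):
    """Scale (width, height) so that the shorter side equals img_min_side."""
    if width <= height:
        factor = float(img_min_side) / width
        return img_min_side, int(factor * height), factor
    factor = float(img_min_side) / height
    return int(factor * width), img_min_side, factor

def to_original(coord, factor):
    """Map a coordinate in the resized image back to the original image."""
    return int(round(coord / factor))

new_w, new_h, factor = resize_to_min_side(1280, 720)
print(new_w, new_h)              # resized width and height
print(to_original(150, factor))  # row 150 of the resized image, in original pixels
```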

### Make Annotation File
In this section the ICDAR2015 dataset is read and its annotations are saved to a file in a specific format so that they can be used later.
In this dataset the location of each word is labelled with the four points at its four corners; we use only the top-left and bottom-right corners as its annotation. After the two corner points of the text, its class (text present) is recorded as 1.

In [0]:
import glob
import os

annotation_file_path = '/content/drive/My Drive/ICDAR2015/annotation_file.txt'
imagePathes = '/content/drive/My Drive/ICDAR2015/images/'
gtPathes = '/content/drive/My Drive/ICDAR2015/gt/'

files = glob.glob(imagePathes + '*.png')
with open(annotation_file_path, 'w+') as annotation_file:
    for imgPath in files:
        # Ground-truth files are named 'gt_<image name>.txt'
        imgName = os.path.splitext(os.path.basename(imgPath))[0]
        gtPath = 'gt_' + imgName + '.txt'

        annotationLine = imgPath
        # 'utf-8-sig' drops the byte-order mark that may start a ground-truth file
        with open(gtPathes + gtPath, 'r', encoding='utf-8-sig') as gtFile:
            for line in gtFile:
                fields = line.strip().split(',')
                if len(fields) < 8:
                    continue
                # Corners are x1,y1,...,x4,y4; keep the first (top-left) and the third (bottom-right)
                x1, y1 = map(int, fields[0:2])
                x2, y2 = map(int, fields[4:6])
                cls_id = 1
                annotationLine += ' {},{},{},{},{}'.format(x1, y1, x2, y2, cls_id)
        # One image per line
        annotation_file.write(annotationLine + '\n')
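
The conversion of a single ground-truth line can be checked in isolation. The sketch below assumes the ICDAR2015 line layout `x1,y1,x2,y2,x3,y3,x4,y4,transcription` with corners listed clockwise from the top-left; the helper name `gt_line_to_box` and the sample line are illustrative.

```python
def gt_line_to_box(gt_line, cls_id=1):
    """Turn 'x1,y1,x2,y2,x3,y3,x4,y4,text' into 'x1,y1,x3,y3,cls_id'."""
    fields = gt_line.strip().split(',')
    left, top = int(fields[0]), int(fields[1])       # first corner (top-left)
    right, bottom = int(fields[4]), int(fields[5])   # third corner (bottom-right)
    return '{},{},{},{},{}'.format(left, top, right, bottom, cls_id)

sample = '377,117,463,117,465,130,378,130,Genaxis Theatre'  # illustrative line
print(gt_line_to_box(sample))
```

Splitting on commas rather than on whitespace keeps transcriptions that contain spaces from breaking the parse.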

In this section the GetData function is called. It returns the training data, the number of samples in each class, and the mapping used for the classes.
This information is then stored in the configuration.
This part is taken from the following repository:

[faster-rcnn-keras](https://github.com/shadow12138/faster-rcnn-keras)

In [0]:
import random

def get_data():

    training_images, classes_count, class_mapping = GetData(C.training_annotation, C.class_mapping)
    C.class_mapping = class_mapping
    C.classes_count = classes_count

    # Shuffle the images with a fixed seed
    random.seed(1)
    random.shuffle(training_images)

    # Get the training data generator, which yields X, Y, image_data, debug_img, num_pos
    # (get_anchor_gt and get_img_output_length are the functions defined in this notebook)
    data_gen_train = get_anchor_gt(training_images, C, get_img_output_length, mode='train')
    return data_gen_train

## 5. Training

### Anchor
In this section the ground truth of each anchor is determined and used as its label.
This part is taken from the following repository:

[faster-rcnn-keras](https://github.com/shadow12138/faster-rcnn-keras)

In [0]:
import numpy as np
import cv2


def get_anchor_gt(all_img_data, C, img_length_calc_function, mode='train'):
    """ Yield the ground-truth anchors as Y (labels)
    Args:
        all_img_data: list(filepath, width, height, list(bboxes))
        C: config
        img_length_calc_function: function to calculate final layer's feature map (of base model) size according to input image size
        mode: 'train' or 'test'; 'train' mode need augmentation
    Returns:
        x_img: image data after resized and scaling (smallest size = 300px)
        Y: [y_rpn_cls, y_rpn_regr]
        img_data_aug: augmented image data (original image with augmentation)
        debug_img: show image for debug
        num_pos: show number of positive anchors for debug
    """
    while True:

        for img_data in all_img_data:
            try:

                # read in image, and optionally add augmentation

                if mode == 'train':
                    img_data_aug, x_img = augment(img_data, C, augment=True)
                else:
                    img_data_aug, x_img = augment(img_data, C, augment=False)

                (width, height) = (img_data_aug['width'], img_data_aug['height'])
                (rows, cols, _) = x_img.shape

                assert cols == width
                assert rows == height

                # get image dimensions for resizing
                (resized_width, resized_height) = get_new_img_size(width, height, C.im_size)

                # resize the image so that the smallest side has length C.im_size (300px here)
                x_img = cv2.resize(x_img, (resized_width, resized_height), interpolation=cv2.INTER_CUBIC)
                debug_img = x_img.copy()

                try:
                    y_rpn_cls, y_rpn_regr, num_pos = calc_rpn(C, img_data_aug, width, height, resized_width,
                                                              resized_height, img_length_calc_function)
                except:
                    continue

                # Zero-center by mean pixel, and preprocess image

                x_img = x_img[:, :, (2, 1, 0)]  # BGR -> RGB
                x_img = x_img.astype(np.float32)
                x_img[:, :, 0] -= C.img_channel_mean[0]
                x_img[:, :, 1] -= C.img_channel_mean[1]
                x_img[:, :, 2] -= C.img_channel_mean[2]
                x_img /= C.img_scaling_factor

                x_img = np.transpose(x_img, (2, 0, 1))
                x_img = np.expand_dims(x_img, axis=0)

                y_rpn_regr[:, y_rpn_regr.shape[1] // 2:, :, :] *= C.std_scaling

                x_img = np.transpose(x_img, (0, 2, 3, 1))
                y_rpn_cls = np.transpose(y_rpn_cls, (0, 2, 3, 1))
                y_rpn_regr = np.transpose(y_rpn_regr, (0, 2, 3, 1))

                yield np.copy(x_img), [np.copy(y_rpn_cls), np.copy(y_rpn_regr)], img_data_aug, debug_img, num_pos

            except Exception as e:
                print(e)
                continue
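
The image preprocessing inside the generator can be condensed into a few NumPy operations. The sketch below mirrors those steps (channel reordering, per-channel mean subtraction, scaling, batch axis) on a dummy image. The function name `preprocess_image` and the gray test image are assumptions made for illustration.

```python
import numpy as np

def preprocess_image(img_bgr, channel_mean, scaling_factor=1.0):
    """Reorder channels, subtract per-channel means and add a batch axis."""
    img = img_bgr[:, :, (2, 1, 0)].astype(np.float32)   # reverse the channel order
    img -= np.asarray(channel_mean, dtype=np.float32)   # broadcast over H and W
    img /= scaling_factor
    return np.expand_dims(img, axis=0)                  # (1, H, W, 3)

dummy = np.full((300, 533, 3), 128, dtype=np.uint8)     # hypothetical gray image
batch = preprocess_image(dummy, [103.939, 116.779, 123.68])
print(batch.shape, batch[0, 0, 0])
```

The result is already in the channels-last layout that the generator yields, so the pair of transposes used above is not needed in this condensed form.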

### IOU
To determine the class of an ROI, we need to compare its region with the ground truth that we have as the label. If their IOU exceeds a threshold, the ROI is, probabilistically, considered a correct detection.
The IOU function is implemented here for this purpose.
This part is taken from the following repository:

[faster-rcnn-keras](https://github.com/shadow12138/faster-rcnn-keras)

In [0]:
import copy
import numpy as np


def union(au, bu, area_intersection):
    area_a = (au[2] - au[0]) * (au[3] - au[1])
    area_b = (bu[2] - bu[0]) * (bu[3] - bu[1])
    area_union = area_a + area_b - area_intersection
    return area_union

def intersection(ai, bi):
    x = max(ai[0], bi[0])
    y = max(ai[1], bi[1])
    w = min(ai[2], bi[2]) - x
    h = min(ai[3], bi[3]) - y
    if w < 0 or h < 0:
        return 0
    return w * h

def iou(a, b):
    # a and b should be (x1,y1,x2,y2)

    if a[0] >= a[2] or a[1] >= a[3] or b[0] >= b[2] or b[1] >= b[3]:
        return 0.0

    area_i = intersection(a, b)
    area_u = union(a, b, area_i)

    return float(area_i) / float(area_u + 1e-6)

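def calc_iou(R, img_data, C, class_mapping):
    """Match each ROI to its best ground-truth box and build the classifier targets.
    ROIs are converted from (x1,y1,x2,y2) to (x,y,w,h) format.
    Args:
        R: ROI boxes, array of shape (num_rois, 4) in feature-map coordinates
        img_data: dict with 'bboxes', 'width' and 'height'
        C: config
        class_mapping: dict from class name to class index
    Returns:
        X (ROIs), Y1 (one-hot classes), Y2 (regression labels and targets), IoUs;
        four None values if no ROI reaches C.classifier_min_overlap
    """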
    bboxes = img_data['bboxes']
    (width, height) = (img_data['width'], img_data['height'])
    # get image dimensions for resizing
    (resized_width, resized_height) = get_new_img_size(width, height, C.im_size)

    gta = np.zeros((len(bboxes), 4))

    for bbox_num, bbox in enumerate(bboxes):
        # get the GT box coordinates, and resize to account for image resizing
        # gta[bbox_num, 0] = (40 * (600 / 800)) / 16 = int(round(1.875)) = 2 (x in feature map)
        gta[bbox_num, 0] = int(round(bbox['x1'] * (resized_width / float(width)) / C.rpn_stride))
        gta[bbox_num, 1] = int(round(bbox['x2'] * (resized_width / float(width)) / C.rpn_stride))
        gta[bbox_num, 2] = int(round(bbox['y1'] * (resized_height / float(height)) / C.rpn_stride))
        gta[bbox_num, 3] = int(round(bbox['y2'] * (resized_height / float(height)) / C.rpn_stride))

    x_roi = []
    y_class_num = []
    y_class_regr_coords = []
    y_class_regr_label = []
    IoUs = []  # for debugging only

    # R.shape[0]: number of bboxes (=300 from non_max_suppression)
    for ix in range(R.shape[0]):
        (x1, y1, x2, y2) = R[ix, :]
        x1 = int(round(x1))
        y1 = int(round(y1))
        x2 = int(round(x2))
        y2 = int(round(y2))

        best_iou = 0.0
        best_bbox = -1
        # Iterate through all the ground-truth bboxes to calculate the iou
        for bbox_num in range(len(bboxes)):
            curr_iou = iou([gta[bbox_num, 0], gta[bbox_num, 2], gta[bbox_num, 1], gta[bbox_num, 3]], [x1, y1, x2, y2])

            # Find the ground-truth bbox_num with the largest iou
            if curr_iou > best_iou:
                best_iou = curr_iou
                best_bbox = bbox_num

        if best_iou < C.classifier_min_overlap:
            continue
        else:
            w = x2 - x1
            h = y2 - y1
            x_roi.append([x1, y1, w, h])
            IoUs.append(best_iou)

            if C.classifier_min_overlap <= best_iou < C.classifier_max_overlap:
                # hard negative example
                cls_name = 'bg'
            elif C.classifier_max_overlap <= best_iou:
                cls_name = bboxes[best_bbox]['class']
                cxg = (gta[best_bbox, 0] + gta[best_bbox, 1]) / 2.0
                cyg = (gta[best_bbox, 2] + gta[best_bbox, 3]) / 2.0

                cx = x1 + w / 2.0
                cy = y1 + h / 2.0

                tx = (cxg - cx) / float(w)
                ty = (cyg - cy) / float(h)
                tw = np.log((gta[best_bbox, 1] - gta[best_bbox, 0]) / float(w))
                th = np.log((gta[best_bbox, 3] - gta[best_bbox, 2]) / float(h))
            else:
                print('roi = {}'.format(best_iou))
                raise RuntimeError

        class_num = class_mapping[cls_name]
        class_label = len(class_mapping) * [0]
        class_label[class_num] = 1
        y_class_num.append(copy.deepcopy(class_label))
        coords = [0] * 4 * (len(class_mapping) - 1)
        labels = [0] * 4 * (len(class_mapping) - 1)
        if cls_name != 'bg':
            label_pos = 4 * class_num
            sx, sy, sw, sh = C.classifier_regr_std
            coords[label_pos:4 + label_pos] = [sx * tx, sy * ty, sw * tw, sh * th]
            labels[label_pos:4 + label_pos] = [1, 1, 1, 1]
            y_class_regr_coords.append(copy.deepcopy(coords))
            y_class_regr_label.append(copy.deepcopy(labels))
        else:
            y_class_regr_coords.append(copy.deepcopy(coords))
            y_class_regr_label.append(copy.deepcopy(labels))

    if len(x_roi) == 0:
        return None, None, None, None

    # bboxes that iou > C.classifier_min_overlap for all gt bboxes in 300 non_max_suppression bboxes
    X = np.array(x_roi)
    # one hot code for bboxes from above => x_roi (X)
    Y1 = np.array(y_class_num)
    # corresponding labels and corresponding gt bboxes
    Y2 = np.concatenate([np.array(y_class_regr_label), np.array(y_class_regr_coords)], axis=1)

    return np.expand_dims(X, axis=0), np.expand_dims(Y1, axis=0), np.expand_dims(Y2, axis=0), IoUs
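
A small worked example makes the thresholds concrete. The sketch below computes the IoU of two overlapping boxes and applies the classifier overlap thresholds from the Config cell (0.1 and 0.5). The names `box_iou` and `label_roi` are hypothetical, and `box_iou` omits the `1e-6` term that `iou` adds to the denominator.

```python
def box_iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    inter_w = min(a[2], b[2]) - max(a[0], b[0])
    inter_h = min(a[3], b[3]) - max(a[1], b[1])
    if inter_w <= 0 or inter_h <= 0:
        return 0.0
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / float(area_a + area_b - inter)

def label_roi(best_iou, min_overlap=0.1, max_overlap=0.5):
    """Apply the classifier thresholds from Config to one ROI."""
    if best_iou < min_overlap:
        return None      # ROI is ignored
    if best_iou < max_overlap:
        return 'bg'      # hard negative example
    return 'text'

# Intersection is 5 x 5 = 25, union is 100 + 100 - 25 = 175
value = box_iou((0, 0, 10, 10), (5, 5, 15, 15))
print(round(value, 4), label_roi(value))
```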

### Loss
This part is taken from the following repository:

[faster-rcnn-keras](https://github.com/shadow12138/faster-rcnn-keras)

In [0]:
from keras import backend as K
import tensorflow as tf
from keras.losses import categorical_crossentropy

lambda_rpn_regr = 1.0
lambda_rpn_class = 1.0

lambda_cls_regr = 1.0
lambda_cls_class = 1.0

epsilon = 1e-4

# train.py
def rpn_loss_regr(num_anchors):
    """Loss function for rpn regression
    Args:
        num_anchors: number of anchors (15 with the Config above)
    Returns:
        Smooth L1 loss function
                           0.5*x*x (if x_abs <= 1)
                           x_abs - 0.5 (otherwise)
    """

    def rpn_loss_regr_fixed_num(y_true, y_pred):
        # x is the difference between true value and predicted value
        x = y_true[:, :, :, 4 * num_anchors:] - y_pred

        # absolute value of x
        x_abs = K.abs(x)

        # If x_abs <= 1.0, x_bool = 1
        x_bool = K.cast(K.less_equal(x_abs, 1.0), tf.float32)

        return lambda_rpn_regr * K.sum(
            y_true[:, :, :, :4 * num_anchors] * (x_bool * (0.5 * x * x) + (1 - x_bool) * (x_abs - 0.5))) / K.sum(
            epsilon + y_true[:, :, :, :4 * num_anchors])

    return rpn_loss_regr_fixed_num

# train.py
def rpn_loss_cls(num_anchors):
    """Loss function for rpn classification
    Args:
        num_anchors: number of anchors (15 with the Config above)
        y_true[:, :, :, :num_anchors]: validity mask, 1 where the anchor is labelled positive or negative
        y_true[:, :, :, num_anchors:]: class label, 1 where the anchor is positive
        Example with 9 anchors: mask [0,1,0,0,0,0,0,1,0] and labels [0,1,0,0,0,0,0,0,0]
        mean that the second anchor is positive and the eighth is negative
    Returns:
        lambda * sum(isValid * binary_crossentropy(y_true, y_pred)) / N
    """

    def rpn_loss_cls_fixed_num(y_true, y_pred):
        # The Keras 2 backend signature is K.binary_crossentropy(target, output)
        valid = y_true[:, :, :, :num_anchors]
        labels = y_true[:, :, :, num_anchors:]
        return lambda_rpn_class * K.sum(valid * K.binary_crossentropy(labels, y_pred)) / K.sum(epsilon + valid)

    return rpn_loss_cls_fixed_num

# train.py
def class_loss_regr(num_classes):
    """Loss function for classifier regression
    Args:
        num_classes: number of classes with box-regression targets
    Returns:
        Smooth L1 loss function
                           0.5*x*x (if x_abs <= 1)
                           x_abs - 0.5 (otherwise)
    """

    def class_loss_regr_fixed_num(y_true, y_pred):
        x = y_true[:, :, 4 * num_classes:] - y_pred
        x_abs = K.abs(x)
        x_bool = K.cast(K.less_equal(x_abs, 1.0), 'float32')
        return lambda_cls_regr * K.sum(
            y_true[:, :, :4 * num_classes] * (x_bool * (0.5 * x * x) + (1 - x_bool) * (x_abs - 0.5))) / K.sum(
            epsilon + y_true[:, :, :4 * num_classes])

    return class_loss_regr_fixed_num


def class_loss_cls(y_true, y_pred):
    return lambda_cls_class * K.mean(categorical_crossentropy(y_true[0, :, :], y_pred[0, :, :]))
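
The smooth L1 term used by both regression losses is easy to check by hand. The NumPy sketch below evaluates it on three values and applies the same masked normalisation as the Keras code (the mask selects the entries that count, and `epsilon` is added to every mask element before summing). The names `smooth_l1` and `masked_smooth_l1` are hypothetical.

```python
import numpy as np

def smooth_l1(diff):
    """Element-wise smooth L1: 0.5*x^2 where |x| <= 1, |x| - 0.5 elsewhere."""
    abs_diff = np.abs(diff)
    return np.where(abs_diff <= 1.0, 0.5 * diff * diff, abs_diff - 0.5)

def masked_smooth_l1(y_true, y_pred, mask, eps=1e-4):
    """Sum the loss over masked entries and divide by sum(eps + mask)."""
    return np.sum(mask * smooth_l1(y_true - y_pred)) / np.sum(eps + mask)

y_true = np.array([0.0, 0.0, 3.0])
y_pred = np.array([0.5, 2.0, 3.0])
mask = np.array([1.0, 1.0, 0.0])   # the third entry does not contribute
print(smooth_l1(y_true - y_pred))
print(masked_smooth_l1(y_true, y_pred, mask))
```

The quadratic branch keeps gradients small near zero, while the linear branch limits the influence of large errors.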

This section implements the training steps. First the backbone model is built, then the RPN, and finally the LSTM network. Each of them is then compiled separately and also in combination.

In [0]:
from keras.optimizers import Adam

# Configuration shared by the data generator and the model definition
C = Config()
C.training_annotation = annotation_file_path

data_gen_train = get_data()
input_shape_img = (None, None, 3)

img_input = Input(shape=input_shape_img)
roi_input = Input(shape=(None, 4))

# define the base network
shared_layers = SE_VGG16(img_input)

# define the RPN, built on the base layers
num_anchors = len(C.anchor_box_scales) * len(C.anchor_box_ratios) # 15
rpn = rpn_layer(shared_layers, num_anchors)

rn = LSTM(roi_input)

model_rpn = Model(img_input, rpn[:2])
model_rn = Model([img_input, roi_input], rn)

# rpn[:2] is a list, so the refinement output is wrapped in a list before concatenation
model_all = Model([img_input, roi_input], rpn[:2] + [rn])

optimizer = Adam(lr=0.001)
optimizer_rn = Adam(lr=1e-5)

model_rpn.compile(optimizer=optimizer, loss=[rpn_loss_cls(num_anchors), rpn_loss_regr(num_anchors)])
# model_rn.compile(optimizer=optimizer_rn, loss=,metrics={'dense_class_{}'.format(len(cfg.classes_count)): 'accuracy'})
# model_all.compile(optimizer='sgd', loss='mae')