# Fast Oriented Text Spotting with a Unified Network (FOTS)
These slides cover both the paper and my implementation of it, prepared for a knowledge-sharing session.

## Overview

<img src="figs/fots-overview.png"></img>

## Data
* In this paper (and other similar ones) the rotated bounding boxes are represented in the RBOX format.
* An RBOX coordinate has 5 dimensions $(t, b, l, r, \theta)$: the distances to the top, bottom, left and right edges of the box, plus a rotation angle.
* An $(x,y)$ location coupled with a $(t, b, l, r, \theta)$ RBOX coordinate fully describes a bounding box.
* For FOTS the RBOX labels are needed at every pixel. We also need pixel-wise labels for whether a pixel is part of a text area or not.
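
To make the RBOX format concrete, here is a minimal sketch that turns a pixel location plus its RBOX into the four corner points of the box. The function name `rbox_to_corners`, the image coordinate system (y pointing down) and the sign convention for $\theta$ are assumptions made for illustration; the paper and my implementation may use a different convention.

```python
import math


def rbox_to_corners(x, y, t, b, l, r, theta):
    """Return the corners (top-left, top-right, bottom-right, bottom-left) of a box.

    Assumption: (t, b, l, r) are measured from the pixel (x, y) along the box's
    own axes, and theta (radians) rotates those axes around the pixel.
    """
    # Corner offsets in the box's own, unrotated frame with the pixel as origin.
    local = [(-l, -t), (r, -t), (r, b), (-l, b)]
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    # Rotate each offset by theta and translate it to the pixel location.
    return [(x + u * cos_t - v * sin_t, y + u * sin_t + v * cos_t) for u, v in local]


corners = rbox_to_corners(10.0, 20.0, t=2.0, b=3.0, l=4.0, r=6.0, theta=0.0)
```

With $\theta = 0$ the result is simply the axis-aligned box spanning `x - l` to `x + r` and `y - t` to `y + b`. For any other angle the side lengths stay `l + r` and `t + b`.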

<img src="figs/fots-labels1.png" width="45%" align="left"/>
<img src="figs/fots-labels2.png" width="45%" align="left"/>

## Shared Features
* In the paper, the shared features are computed by first applying a ResNet-50 backbone network and then upscaling the feature maps from 1/32 to 1/4 of the original image size.
* The upscaling is done with bilinear resizing.
* It also combines lower-level features with higher-level features in a U-Net style.

```python
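import tensorflow as tf  # TensorFlow 1.x API (tf.layers, tf.variable_scope).


def build_shared_features_network(x, is_training):
    # `resnet_v1` is the backbone builder (not shown on these slides). As used here, it
    # returns the final feature map and the outputs of each block group.
    with tf.variable_scope('backbone'):
        x, block_groups = resnet_v1(data_format='channels_last')(x, is_training)

    def deconv_merge(x, y):
        # Resize x to the spatial size of y and add y (U-Net style skip connection).
        h = tf.shape(y)[1]
        w = tf.shape(y)[2]
        c = y.shape[3]

        # Project x to y's channel count so the two can be summed.
        x = tf.layers.conv2d(x, c, 3, 2, padding='same')
        x = tf.image.resize_bilinear(x, (h, w))
        x = tf.layers.batch_normalization(x, training=is_training)
        x = tf.nn.relu(x)
        x = x + y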

        return x

    x = deconv_merge(x, block_groups[-2])
    x = deconv_merge(x, block_groups[-3])
    x = deconv_merge(x, block_groups[-4])

    return x
```

## Text Detection Branch
* The text detection branch makes pixel-wise (downscaled) predictions of whether a pixel is part of a text area (classification) and predictions of the RBOX coordinates of those pixels (bounding box regression).

```python
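import functools
import math

import tensorflow as tf  # TensorFlow 1.x API.


def build_text_detection_network(x, max_detection_dist, is_training):
    conv = functools.partial(tf.layers.conv2d, padding='same')

    # One logit per pixel: is this pixel part of a text area?
    score_map_logits = conv(x, 1, 1, 1)

    # Four distances to the box edges, squashed into (0, max_detection_dist).
    dists = conv(x, 4, 1, 1)
    dists = tf.nn.sigmoid(dists) * max_detection_dist
    # Rotation angle in radians, squashed into (-pi/2, pi/2), i.e. (-90, 90) degrees.
    angle = conv(x, 1, 1, 1)
    angle = tf.nn.tanh(angle) * math.pi / 2
    geom_map = tf.concat((dists, angle), axis=3)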

    return score_map_logits, geom_map
```

### Text Detection Loss
* Sigmoid cross entropy for pixel classification loss.
* $1 - \cos(\theta_{pred} - \theta_{gt})$ for rotation regression loss.
* Intersection over union loss for predicted RBOX coordinates (without considering the rotation).
* The losses are masked out at pixels that we don't care about.
    * For classification we don't care about the pixels between the bounding box and the bounding box's corresponding "shrunk area".
    * For the regression losses we don't care about the pixels outside bounding boxes.
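
The regression terms and the masking can be sketched for a single pixel in plain Python. This is an illustration under assumptions, not the training code: the $-\log(\mathrm{IoU})$ form is the one used in EAST and is assumed here, and `angle_weight`, `eps` and the function names are hypothetical.

```python
import math


def rbox_regression_loss(pred, gt, angle_weight=1.0, eps=1e-6):
    """Per-pixel regression loss for two (t, b, l, r, theta) tuples."""
    t_p, b_p, l_p, r_p, theta_p = pred
    t_g, b_g, l_g, r_g, theta_g = gt
    # Both boxes are anchored at the same pixel, so (ignoring rotation) the overlap
    # follows from the smaller distance on each side.
    inter_h = min(t_p, t_g) + min(b_p, b_g)
    inter_w = min(l_p, l_g) + min(r_p, r_g)
    inter = inter_h * inter_w
    union = (t_p + b_p) * (l_p + r_p) + (t_g + b_g) * (l_g + r_g) - inter
    iou_loss = -math.log((inter + eps) / (union + eps))  # Assumed -log(IoU) form.
    angle_loss = 1.0 - math.cos(theta_p - theta_g)
    return iou_loss + angle_weight * angle_loss


def masked_mean(losses, mask):
    """Average per-pixel losses over the pixels where mask is 1."""
    total = sum(loss * m for loss, m in zip(losses, mask))
    return total / max(sum(mask), 1)


perfect = rbox_regression_loss((2, 3, 4, 6, 0.1), (2, 3, 4, 6, 0.1))
worse = rbox_regression_loss((1, 3, 4, 6, 0.4), (2, 3, 4, 6, 0.1))
```

A perfect prediction gives a loss of zero, and any mismatch in either the distances or the angle increases it. `masked_mean` is where the "don't care" pixels drop out of the average.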

## RoIRotate
* RoIRotate is the operation that transforms the predicted text regions into axis-aligned feature maps that are then passed into the text recognition branch.
* RoIRotate takes the shared feature map along with an $(x,y)$ coordinate and a $(t, b, l, r, \theta)$ RBOX geometry coordinate as input and returns an axis-aligned (unrotated) crop from the shared feature map.
* During training we use the ground truth locations and geometries of text regions as input to RoIRotate when extracting the axis-aligned feature maps to be used with the text recognition branch.

<img src="figs/fots-roi-rotate-visualization.png" width="30%"></img>

### How it works
* Essentially, for each pixel location in the extracted axis-aligned feature map we need to know which pixel location in the source feature map (the shared feature space) to look at and sample from.
* Thus we need a mapping between these coordinate systems.
$$
\mathbf{M}
\begin{pmatrix}
    x_s \\
    y_s \\
    1
\end{pmatrix}
=
\begin{pmatrix}
    x_t \\
    y_t \\
    1
\end{pmatrix}
$$
* $\mathbf{M}$ is constructed from the location in shared feature space, the RBOX geometry and a target height of the cropped out feature map.
* $\mathbf{M}$ encodes a series of transformations to map between these coordinate systems.
* When cropping out the axis aligned feature map, we need $\mathbf{M}^{-1}$ because we know the target locations and need to figure out the source locations.
* In practice, $\mathbf{M}^{-1}$ can be constructed directly by inverting each separate transformation matrix and multiplying the inverses in reverse order. This is easy because they are just rotation, translation and scaling matrices, and it is possibly also faster and more numerically stable than inverting $\mathbf{M}$ as a whole.
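
As a rough sketch of that direct construction, the snippet below builds $\mathbf{M}^{-1}$ as translation, rotation and inverse scaling, so that target pixel $(0, 0)$ lands on the top-left corner of the box in the source feature map. The helper names, the rotation sign convention and the rounding of the target width are assumptions for illustration.

```python
import math


def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)] for i in range(3)]


def translation(dx, dy):
    return [[1.0, 0.0, dx], [0.0, 1.0, dy], [0.0, 0.0, 1.0]]


def rotation(theta):
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def scaling(k):
    return [[k, 0.0, 0.0], [0.0, k, 0.0], [0.0, 0.0, 1.0]]


def roi_rotate_inverse_matrix(x, y, t, b, l, r, theta, target_height):
    """Return (M_inv, target_width), where M_inv maps target to source coordinates."""
    scale = target_height / (t + b)  # Keeps the aspect ratio of the box.
    c, s = math.cos(theta), math.sin(theta)
    # Top-left corner of the rotated box in the source feature map.
    corner_x = x - l * c + t * s
    corner_y = y - l * s - t * c
    # Forward M = S(scale) R(-theta) T(-corner), so the inverse is the product of
    # the individual inverses in reverse order.
    m_inv = matmul(translation(corner_x, corner_y),
                   matmul(rotation(theta), scaling(1.0 / scale)))
    target_width = int(math.ceil(scale * (l + r)))
    return m_inv, target_width


def apply(m, px, py):
    """Apply a 3x3 affine matrix to the point (px, py)."""
    return (m[0][0] * px + m[0][1] * py + m[0][2],
            m[1][0] * px + m[1][1] * py + m[1][2])


m_inv, width = roi_rotate_inverse_matrix(10.0, 20.0, 2.0, 2.0, 4.0, 6.0, 0.0, target_height=8)
top_left = apply(m_inv, 0, 0)
bottom_right = apply(m_inv, width, 8)
```

Each target pixel is pushed through `m_inv` to get a (generally fractional) source location, which is then sampled from the shared feature map, for example with bilinear interpolation.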

## Locality Aware NMS
* Since we have dense predictions with FOTS, there will be a lot of candidate bounding boxes for every actual box.
* NMS is a way to merge together bounding boxes if they are deemed similar enough, i.e. if they have an IoU over a certain threshold.
* This is done iteratively by going through the candidate boxes row by row.
* The EAST paper proposes a small change to NMS (locality-aware NMS) that makes it a bit faster, and that variant is what is implemented here.
* The current implementation is in TensorFlow, but there are possibly performance gains to be had by rewriting it as a custom C++ op.
* The current implementation also uses a general algorithm for computing the intersection area of two polygons. In this application we will always have convex quadrilaterals, for which there might be faster algorithms.
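
The idea can be sketched in a few lines of plain Python. To keep the IoU computation short, this sketch uses axis-aligned boxes `(x1, y1, x2, y2, score)` instead of rotated quadrilaterals, which is a simplification and not what my TensorFlow implementation does. All function names here are hypothetical.

```python
def iou(a, b):
    """IoU of two axis-aligned boxes (x1, y1, x2, y2, score)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def weighted_merge(a, b):
    """Average the coordinates weighted by score and accumulate the scores."""
    score = a[4] + b[4]
    coords = [(a[i] * a[4] + b[i] * b[4]) / score for i in range(4)]
    return tuple(coords) + (score,)


def standard_nms(boxes, threshold):
    """Keep the highest scoring boxes and drop boxes overlapping a kept one."""
    keep = []
    for box in sorted(boxes, key=lambda bx: bx[4], reverse=True):
        if all(iou(box, kept) <= threshold for kept in keep):
            keep.append(box)
    return keep


def locality_aware_nms(boxes, threshold=0.5):
    """Merge neighbouring boxes in a single row-by-row pass, then run standard NMS.

    Assumes `boxes` is ordered row by row, as it is when read off the dense
    prediction map.
    """
    merged, prev = [], None
    for box in boxes:
        if prev is not None and iou(box, prev) > threshold:
            prev = weighted_merge(box, prev)
        else:
            if prev is not None:
                merged.append(prev)
            prev = box
    if prev is not None:
        merged.append(prev)
    return standard_nms(merged, threshold)


result = locality_aware_nms([
    (0.0, 0.0, 10.0, 10.0, 0.9),
    (1.0, 0.0, 11.0, 10.0, 0.9),
    (50.0, 50.0, 60.0, 60.0, 0.8),
])
```

The speed-up comes from the first pass: most candidates are merged with their immediate predecessor, so the quadratic standard NMS only runs on the much smaller merged list.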

<img src="figs/fots-nms.png" width="65%" />

## Text Recognition Branch
* The text recognition branch takes the axis-aligned feature maps and first applies a CNN with height-wise pooling layers to compute a sequence feature map of height 1.
* This feature map is then fed into a two-layer bidirectional LSTM.
* The forward and backward LSTM outputs are summed and fed into a dense classification layer for frame-wise predictions.

```python
def build_text_recognition_network(x, widths, charset_size, dropout_rate, is_training):
    def conv_bn_relu(x, d, k, s):
        x = tf.layers.conv2d(x, d, k, s, padding='same')
        x = tf.layers.batch_normalization(x, training=is_training, axis=3)
        x = tf.nn.relu(x)
        return x

    def height_max_pool(x):
        return tf.layers.max_pooling2d(x, (2, 1), (2, 1), padding='valid')

    x = conv_bn_relu(x, 64, 3, 1)
    x = conv_bn_relu(x, 64, 3, 1)
    x = height_max_pool(x)
    x = conv_bn_relu(x, 128, 3, 1)
    x = conv_bn_relu(x, 128, 3, 1)
    x = height_max_pool(x)
    x = conv_bn_relu(x, 256, 3, 1)
    x = conv_bn_relu(x, 256, 3, 1)
    x = height_max_pool(x)
    x = tf.squeeze(x, axis=1)
    
    ...
    
    (x_fw, x_bw), _ = tf.nn.bidirectional_dynamic_rnn(
        tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(256),
        tf.contrib.cudnn_rnn.CudnnCompatibleLSTMCell(256),
        x,
        sequence_length=sequence_lengths,
        dtype=tf.float32)
    
    x = x_fw + x_bw
    x = tf.layers.dropout(x, rate=dropout_rate, training=is_training)
    x = tf.layers.dense(x, charset_size + 1)  # Last label for blank ctc prediction.

    return x, sequence_lengths
```

### Text Recognition Loss
* The paper uses CTC loss.
* CTC loss is a way to compute a loss (and make predictions) for a sequence of predictions where the labels are unaligned. That is, the prediction is a sequence and the label is another sequence, but they can be of different lengths and we don't know which label corresponds to which prediction.
* This is good because text comes in many different shapes and forms and each frame might have different levels of information.
* CTC adds a blank token and then (intuitively) considers all sequences of actual tokens and blank tokens that collapse to the label, summing the probabilities of these alignments between predictions and labels.
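
The intuition can be made concrete with a brute-force sketch that enumerates every possible alignment and sums the probabilities of those that collapse to the label. This is only feasible for tiny inputs (real implementations use dynamic programming instead), and the function names are hypothetical. The blank is placed at the last class index, matching the convention in the network code.

```python
import itertools
import math


def collapse(path, blank):
    """Merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for symbol in path:
        if symbol != prev and symbol != blank:
            out.append(symbol)
        prev = symbol
    return tuple(out)


def ctc_loss_brute_force(probs, label):
    """Negative log of the total probability of all alignments matching `label`.

    `probs` is a list of per-frame probability distributions over the classes,
    with the blank as the last class.
    """
    num_classes = len(probs[0])
    blank = num_classes - 1
    total = 0.0
    for path in itertools.product(range(num_classes), repeat=len(probs)):
        if collapse(path, blank) == tuple(label):
            p = 1.0
            for frame, symbol in zip(probs, path):
                p *= frame[symbol]
            total += p
    return -math.log(total)


# Two frames, two classes: index 0 is 'a', index 1 is the blank.
loss = ctc_loss_brute_force([[0.6, 0.4], [0.6, 0.4]], label=[0])
```

For this example, three alignments collapse to "a" (`a a`, `a blank` and `blank a`), so the label probability is 0.36 + 0.24 + 0.24 = 0.84.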

### Text Recognition Prediction
* Similarly, the paper uses CTC beam search decoding.
* E.g. for a very widely stretched 'a' there might be many predictions in the prediction sequence saying 'a' but in the output sequence it should only be one.
* The beam search then considers many such sequences to make the final best prediction sequence.
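
Below is a minimal sketch of CTC prefix beam search in plain Python, without a language model. It is not the TensorFlow decoder used in my implementation, and the function name and `beam_width` default are hypothetical.

```python
import collections


def ctc_prefix_beam_search(probs, blank, beam_width=8):
    """Return the most probable collapsed sequence and its probability.

    Each beam maps a collapsed prefix to [p_blank, p_non_blank]: the probability
    of all alignments producing that prefix which end in a blank or in a
    non-blank symbol respectively.
    """
    beams = {(): [1.0, 0.0]}
    for frame in probs:
        next_beams = collections.defaultdict(lambda: [0.0, 0.0])
        for prefix, (p_b, p_nb) in beams.items():
            for symbol, p in enumerate(frame):
                if symbol == blank:
                    # A blank never changes the collapsed prefix.
                    next_beams[prefix][0] += (p_b + p_nb) * p
                elif prefix and prefix[-1] == symbol:
                    # Repeated symbol: merged into the prefix unless a blank separated them.
                    next_beams[prefix][1] += p_nb * p
                    next_beams[prefix + (symbol,)][1] += p_b * p
                else:
                    next_beams[prefix + (symbol,)][1] += (p_b + p_nb) * p
        best = sorted(next_beams.items(), key=lambda kv: sum(kv[1]), reverse=True)
        beams = dict(best[:beam_width])
    prefix, (p_b, p_nb) = max(beams.items(), key=lambda kv: sum(kv[1]))
    return prefix, p_b + p_nb


# Index 0 is 'a', index 1 is the blank.
stretched, _ = ctc_prefix_beam_search([[0.9, 0.1]] * 4, blank=1)
tricky, tricky_prob = ctc_prefix_beam_search([[0.4, 0.6], [0.4, 0.6]], blank=1)
```

The first call is the widely stretched 'a': four frames all voting for 'a' decode to a single 'a'. The second shows why the search sums over alignments: the blank is the most likely symbol in every frame, yet the sequence "a" wins because several alignments contribute to it.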

## Evaluation
* Currently I'm computing precision, recall and f-measure to measure the performance of the text detection. Note that this recall metric should not be confused with the recall measured where the model was used in conjunction with the regexes defined for the Corotos marketplace.
* In most papers, a predicted box counts as a hit for precision and recall if it has an IoU > 0.5 with its ground truth box. I've done the same in this case.
    * There are some complexities involved with this discussed on the next subslide.
* For text recognition, i.e. predicting the text in a bounding box, I'm evaluating this with normalized Levenshtein edit distance.
* The normalized edit distance is probably what we want to use as the metric to look at to guide experiments.
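
For reference, a minimal sketch of the normalized edit distance is given below. Normalizing by the length of the longer string is an assumption; other conventions, such as dividing by the ground truth length, also exist.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # Deletion.
                            curr[j - 1] + 1,            # Insertion.
                            prev[j - 1] + (ca != cb)))  # Substitution or match.
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred, gt):
    """Edit distance divided by the longer length: 0.0 is a perfect match."""
    longest = max(len(pred), len(gt))
    return levenshtein(pred, gt) / longest if longest else 0.0


score = normalized_edit_distance("abcd", "abce")
```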

### Difficulty
* The final predicted bounding boxes and the ground truth bounding boxes are both in an undefined order.
* In order to compute the previously mentioned metrics, we need to first match predictions with ground truths.
* This is a combinatorial optimization problem (an assignment problem), and I've solved it with the *Hungarian algorithm* based on the IoU of each prediction/ground-truth pair.
    * TODO: Add image explaining it.
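
The matching step and the resulting metrics can be illustrated as follows. Note that this sketch deliberately swaps the Hungarian algorithm for a brute-force search over all assignments, which gives the same optimum but only scales to a handful of boxes. In practice a solver such as `scipy.optimize.linear_sum_assignment` would be used. The function names and the choice to maximize the total IoU are assumptions.

```python
import itertools


def best_assignment(iou_matrix, n_pred, n_gt):
    """Return the (pred, gt) index pairs maximizing the total IoU.

    `iou_matrix[p][g]` is the IoU between prediction p and ground truth g.
    Brute-force stand-in for the Hungarian algorithm.
    """
    if n_pred <= n_gt:
        candidates = (list(zip(range(n_pred), perm))
                      for perm in itertools.permutations(range(n_gt), n_pred))
    else:
        candidates = (list(zip(perm, range(n_gt)))
                      for perm in itertools.permutations(range(n_pred), n_gt))
    return max(candidates, key=lambda pairs: sum(iou_matrix[p][g] for p, g in pairs))


def detection_metrics(iou_matrix, n_pred, n_gt, iou_threshold=0.5):
    """Precision, recall and f-measure, counting matched pairs above the threshold as hits."""
    pairs = best_assignment(iou_matrix, n_pred, n_gt)
    hits = sum(1 for p, g in pairs if iou_matrix[p][g] > iou_threshold)
    precision = hits / n_pred if n_pred else 0.0
    recall = hits / n_gt if n_gt else 0.0
    f_measure = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f_measure


# Three predictions, two ground truth boxes.
ious = [[0.1, 0.8],
        [0.7, 0.2],
        [0.0, 0.3]]
pairs = best_assignment(ious, 3, 2)
precision, recall, f_measure = detection_metrics(ious, 3, 2)
```

Here predictions 0 and 1 are matched to ground truths 1 and 0, and prediction 2 is left unmatched and counts against precision.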

## Experiments

## Current State / Discussion
* We have little real world data to evaluate on.
    * None if we consider the metrics dealing with text detection.
* The current evaluation task is based on how well the model plus some regexes finds matches.
* For the data we have generated, the model reaches f-measures over 0.9.
    * However, this might be considered too easy since it's only a random split, rather than, for example, keeping certain fonts out of the training set and then evaluating on images containing them.
    * But it probably says something about what the model is capable of.
* On actual ad images, qualitatively, it seems a bit worse, but more analysis is needed.
* I think there are "quick" wins by improving how we generate data by trying to mimic real world data better.