![Image Introduce](3.TensorFlow-STN-Part1_files/ai.jpg)
<br><center>[Image Courtesy](https://www.technologyreview.com/s/601519/how-to-create-a-malevolent-artificial-intelligence/)</center>

In the last blog post, we introduced two very important concepts, **affine transformations** and **bilinear interpolation**, and mentioned that they would prove crucial in understanding Spatial Transformer Networks.

Today, we'll provide a detailed, section-by-section summary of the [Spatial Transformer Networks](https://arxiv.org/abs/1506.02025) paper, a concept originally introduced by researchers _Max Jaderberg, Karen Simonyan, Andrew Zisserman and Koray Kavukcuoglu_ of Google DeepMind.

Hopefully, it'll give you a clear understanding of the module and prove useful for its implementation in TensorFlow.

### Table of Contents

* [Motivation](#Motivation)
* [Pooling Operator](#Pooling-Operator)
* [Spatial Transformer Network](#Spatial-Transformer-Network)
 * [Localisation Network](#Localisation-Network)
 * [Parameterised Sampling Grid](#Parameterised-Sampling-Grid)
 * [Differentiable Image Sampling](#Differentiable-Image-Sampling)
* [Fun with STNs](#Fun-with-STNs)
 * [Distorted MNIST](#Distorted-MNIST)
 * [GTSRB dataset](#GTSRB-dataset)
* [Summary](#Summary)
* [References](#References)

## Motivation

When working on a classification task, it is usually desirable that our system be **robust** to input variations. By this, we mean to say that should an input undergo a certain "transformation" so to speak, our classification model should in theory spit out the same class label as before that transformation. A few examples of the "challenges" our image classification model may face include:

* **scale variation**: variations in size both in the real world and in the image.
* **viewpoint variation**: different object orientation with respect to the viewer.
* **deformation**: non-rigid bodies can be deformed and twisted into unusual shapes.

![Variation I](3.TensorFlow-STN-Part1_files/var1.png)![Variation II](3.TensorFlow-STN-Part1_files/var2.png)
<br><center>[Image Courtesy](http://cs231n.github.io/classification/)</center>

For illustration purposes, take a look at the images above. While the task of classifying them may seem trivial to a human being, recall that our computer algorithms only work with raw 3D arrays of brightness values, so a tiny change in an input image can alter every single pixel value in the corresponding array. Hence, our ideal image classification model should in theory be able to disentangle object pose and deformation from texture and shape.

For a different type of intuition, let's again take a look at the following cat images.

<table>
<tr>
<td><img src=3.TensorFlow-STN-Part1_files/cat3.jpg width=200 height=200/></td><td><img src=3.TensorFlow-STN-Part1_files/cat3_.jpg width=150 height=150/></td>
</tr>
<tr>
<td><img src=3.TensorFlow-STN-Part1_files/cat4.jpg width=200 height=200/></td><td><img src=3.TensorFlow-STN-Part1_files/cat4_.jpg width=150 height=150/></td>
</tr>    
</table>
<center><font color="gray">Left: Cat images which may present classification challenges. Right: Transformed images which yield a simplified classification pipeline.</font></center>

Would it not be extremely desirable if our model could go from left to right using some sort of crop and scale-normalize combination so as to simplify the subsequent classification task?

## Pooling Operator

It turns out that the pooling layers we use in our neural network architectures actually endow our models with a certain degree of spatial invariance. Recall that the pooling operator acts as a sort of downsampling mechanism. It progressively reduces the spatial size (width and height) of the feature map, operating independently on each depth slice, which cuts down the number of parameters and the computational cost in the layers that follow.

<table>
<tr><td><img src=3.TensorFlow-STN-Part1_files/pool.jpeg width=200 height=200/></td><td><img src=3.TensorFlow-STN-Part1_files/maxpool.jpeg width=400 height=200/></td></tr>
</table>
<center><font color="gray">Pooling layer downsamples the volume spatially. <b>Left</b>: In this example, the input volume of size [224x224x64] is pooled with filter size 2, stride 2 into output volume of size [112x112x64]. <b>Right</b>: 2x2 max pooling.</font>&emsp;[Image Courtesy](http://cs231n.github.io/convolutional-networks/#pool)</center>
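To make the shapes in the caption concrete, here is a minimal NumPy sketch of 2x2 max pooling with stride 2. The helper name `max_pool_2x2` and the random input are assumptions made for illustration only; in an actual TensorFlow model you would rely on the library's own pooling layer instead.

```python
import numpy as np

def max_pool_2x2(volume):
    """2x2 max pooling with stride 2 on an array of shape (H, W, C)."""
    h, w, c = volume.shape
    assert h % 2 == 0 and w % 2 == 0, "sketch assumes even height and width"
    # Split each spatial axis into (cells, 2), then take the max inside
    # every 2x2 cell. The depth axis is left untouched.
    cells = volume.reshape(h // 2, 2, w // 2, 2, c)
    return cells.max(axis=(1, 3))

# Hypothetical input volume matching the figure: 224x224 with 64 channels.
rng = np.random.default_rng(0)
volume = rng.standard_normal((224, 224, 64))
pooled = max_pool_2x2(volume)

print(volume.shape, "->", pooled.shape)  # (224, 224, 64) -> (112, 112, 64)
```

Notice that only the width and height are halved. The depth of 64 is preserved because each depth slice is pooled on its own.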

**How exactly does it provide invariance?** Well, think of it this way. The idea behind pooling is to take a complex input, split it up into cells, and "pool" the information from these complex cells to produce a set of simpler cells that describe the output. So for example, say we have 3 images of the number 7, each at a slightly different position. A pool over a small grid in each image would detect the number 7 regardless of its position in that grid, since we'd be capturing approximately the same information by aggregating pixel values.
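A toy example can make this intuition tangible. The sketch below assumes a 4x4 feature map containing a single activation (standing in for the detected "7") and a hypothetical `max_pool_2x2` helper. Moving the activation within a pooling cell leaves the output unchanged, while moving it into another cell does not.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even sides."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

def single_activation(row, col, size=4):
    """Feature map that is zero everywhere except at (row, col)."""
    fmap = np.zeros((size, size))
    fmap[row, col] = 1.0
    return fmap

original = max_pool_2x2(single_activation(0, 0))
small_shift = max_pool_2x2(single_activation(1, 1))  # same 2x2 cell
large_shift = max_pool_2x2(single_activation(2, 2))  # neighbouring cell

print(np.array_equal(original, small_shift))  # True
print(np.array_equal(original, large_shift))  # False
```

So the invariance we get is real but limited: it only covers shifts that stay inside one pooling cell.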

Now there are a few downsides to pooling which make it an undesirable operator. For one, pooling is **destructive**. With the common 2x2 filter and a stride of 2, max pooling discards 75% of feature activations, meaning we are guaranteed to lose exact positional information. Now you may be wondering why this is bad since we mentioned earlier that it endowed our network with some spatial robustness. Well, the thing is that positional information is invaluable in visual recognition tasks. Think of our cat classifier above. It may be important to know where the whiskers are positioned relative to, say, the snout. This can't be achieved when this is exactly the sort of information we throw away when we use max pooling.
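Here is a small sketch of that loss, assuming 2x2 max pooling with stride 2. The "whisker" and "snout" activations and their values are made up purely for illustration. Two feature maps in which the features sit at different distances from each other end up with exactly the same pooled output.

```python
import numpy as np

def max_pool_2x2(feature_map):
    """2x2 max pooling with stride 2 on a 2D array with even sides."""
    h, w = feature_map.shape
    return feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

WHISKER, SNOUT = 1.0, 2.0  # hypothetical activation values

# Map A: whisker and snout are direct neighbours (columns 1 and 2).
map_a = np.zeros((4, 4))
map_a[0, 1], map_a[0, 2] = WHISKER, SNOUT

# Map B: same features, but three columns apart (columns 0 and 3).
map_b = np.zeros((4, 4))
map_b[1, 0], map_b[1, 3] = WHISKER, SNOUT

pooled_a, pooled_b = max_pool_2x2(map_a), max_pool_2x2(map_b)
print(np.array_equal(pooled_a, pooled_b))  # True

# Only one value out of every four survives the pooling step.
discarded = 1 - pooled_a.size / map_a.size
print(discarded)  # 0.75
```

After pooling, there is no way to tell map A from map B, so the relative position of the two features is gone for good.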

Another limitation of pooling is that it is **local and predefined**. With a small receptive field, the effects of a pooling operator are only felt towards the deeper layers of the network, meaning intermediate feature maps may suffer from large input distortions. And remember, we can't just increase the receptive field arbitrarily because that would downsample our feature map too aggressively.

The main takeaway is that ConvNets are not invariant to relatively large input distortions. This limitation is due to having only a restricted, pre-defined pooling mechanism for dealing with spatial variation of the data. This is where Spatial Transformer Networks come into play!

<font color="gray">The pooling operation used in convolutional neural networks is a big mistake and the fact that it works so well is a disaster. (Geoffrey Hinton, Reddit AMA)</font>

## Spatial Transformer Network

### Localisation Network

### Parameterised Sampling Grid

### Differentiable Image Sampling

## Fun with STNs

### Distorted MNIST

### GTSRB dataset

## Summary

## References