<a href="https://colab.research.google.com/github/ldjlammers/Hide-and-Seek-CS4240/blob/master/BlogPost.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

![alt text](./in-text-images/HideandSeekVisual.jpg)

# The Hide and Seek method for classification and localisation

*By: Laurens Lammers and Mink van Oosterhout*

## 1. Introduction

In this blog post we try to reproduce a method called "Hide and Seek", proposed by Singh and Lee in their 2017 paper '[Hide-and-Seek](https://arxiv.org/abs/1704.04232): Forcing a Network to be Meticulous for Weakly-supervised Object and Action Localization'. In particular, we focus on object localization in images.

This blog post is ...


---

## 2. What is Hide and Seek?

'Hide and Seek' is a weakly-supervised framework that intends to improve object localization in images and action localization in videos. Instead of making algorithmic changes or relying on external data, the Hide and Seek method changes the input images. The key idea is to randomly hide patches of the training images, forcing the network to seek other relevant parts when the most discriminative part is hidden. This principle is visualised in the image at the top of this post. The advantage of such an approach is that it can be applied to any network designed for object localisation.

Most weakly-supervised localization methods identify discriminative patterns in the training data. These discriminative patterns are usually regions that occur frequently in one class and hardly ever in other classes. Due to variations within classes, and because they lean too much on classification alone, these approaches frequently fail to identify the whole extent of an object. Instead, they only localize the most discriminative part of the object. Two examples of this can be seen in the image below. Because the head of the rabbit and the cylinder of the revolver distinguish them most from other classes, the classifier (over)focuses on these regions.

![alt text](./in-text-images/discriminativeparts.jpg)

To address this problem, [Zhou et al.](https://arxiv.org/pdf/1512.04150.pdf) (2016) replaced max pooling after the final convolutional layer in a classification network with global average pooling. In this configuration, a single very small peak can no longer dictate the activation of an entire feature map. Instead, the network is forced to look beyond the most discriminative parts in order to achieve a high activation for a given feature map. However, this approach did not solve the problem entirely. The network can still avoid learning less discriminative parts by finding more than one highly discriminative part.
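The difference between the two pooling operations can be illustrated with a toy example. This is only a numerical sketch, not code from Zhou et al.: the 7x7 feature maps, their values, and the helper names `global_max_pool` and `global_avg_pool` are made up for illustration.

```python
import numpy as np

# Hypothetical 7x7 feature map that fires only on the single most
# discriminative spot of the object.
spike_only = np.zeros((7, 7))
spike_only[3, 3] = 5.0

# Hypothetical feature map with the same spike plus a weak response
# spread over the rest of the object.
spike_and_extent = np.full((7, 7), 0.1)
spike_and_extent[3, 3] = 5.0


def global_max_pool(feature_map):
    """Collapse a feature map to its single largest activation."""
    return feature_map.max()


def global_avg_pool(feature_map):
    """Collapse a feature map to the mean of all its activations."""
    return feature_map.mean()


for name, fmap in [("spike only", spike_only), ("spike + extent", spike_and_extent)]:
    print(f"{name}: max={global_max_pool(fmap):.3f}, avg={global_avg_pool(fmap):.3f}")
```

Max pooling returns the same value for both maps, so the weak responses over the rest of the object go unrewarded. Average pooling gives the second map a higher score, which is the incentive to cover more of the object.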

The authors of the Hide and Seek paper take a 'radically different' approach. Instead of making changes to a network architecture or (hyper)parameters, they modify the input image. By *hiding* patches of the input image, they force the network to *seek* less discriminative regions, hence the name 'Hide and Seek'. Because the patches are hidden at random, the most discriminative part of an image (e.g. a face or a characteristic shape) may be invisible to the model. Because different patches are hidden in each training epoch, the model sees different parts of an image each epoch and thus has to focus on multiple discriminative regions instead of just one.

Finally, an important aspect of the Hide and Seek method is that no patches are hidden during validation. Consequently, the distribution of the training data will differ from the distribution of the test data. But how will such a model generalize then? We will find out in the next section...

---

## 3. The paper's approach

In this section the Hide and Seek method is described in more detail. We start by laying out the general framework of the approach. After this, the specific aspects are explained one by one.

### Weakly-supervised object localization

Given a set of images $I_{set} = \{ I_1, I_2, ..., I_N\}$, the goal of weakly-supervised object localisation is to learn an object localizer that can predict both the category label and the bounding box of that object in a new test image. This is achieved by training a convolutional neural network with just the image-level labels. That is, no ground truth bounding boxes are used during training.

![alt text](./in-text-images/approach.jpg)


### Hiding random patches
The objective of hiding patches of the training images is to show many different parts of the image to the network while simultaneously training it for the classification task. By hiding the patches randomly, the network is forced to explore the less discriminative parts of the image, improving its localization abilities.

Given an input image $I$ of size $W \times H \times 3$, it is first divided into a grid with a fixed patch size of $S \times S \times 3$. Subsequently, each individual patch is hidden with probability $p_{hide} = 0.5$. This process is shown in the left part of the figure above. The new image $I'$ is fed into the CNN for classification. The set of hidden patches is drawn independently for each image, and for the same image a new random set of patches is hidden every epoch.

During testing, the entire image is given as input to the network. This can be seen in the right part of the image above. Since the network has already learned to focus on multiple parts of the image, it is not necessary to hide patches.
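The hiding step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `hide_patches`, the default patch size of 56 pixels, and the `fill_value` of 0 are our own assumptions. Only the grid layout and $p_{hide} = 0.5$ come from the description above.

```python
import numpy as np


def hide_patches(image, patch_size=56, p_hide=0.5, fill_value=0.0, rng=None):
    """Return a copy of `image` with random grid patches hidden.

    `image` is an array of shape (H, W, 3). `patch_size` (S) and
    `fill_value` are illustrative assumptions, not values from the text.
    """
    rng = np.random.default_rng() if rng is None else rng
    hidden = image.copy()  # leave the original image untouched
    height, width = image.shape[:2]
    # Walk over the S x S grid and hide each cell independently.
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            if rng.random() < p_hide:
                hidden[top:top + patch_size, left:left + patch_size, :] = fill_value
    return hidden


# Training: a fresh random mask is drawn on every call, i.e. every epoch.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))  # hypothetical stand-in for a training image
epoch_1_input = hide_patches(image, rng=rng)
epoch_2_input = hide_patches(image, rng=rng)

# Testing: the full image is used as-is, with no patches hidden.
test_input = image
```

Calling the function inside the data-loading step, rather than once before training, is what gives each image a new set of hidden patches every epoch.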


### Setting hidden pixel values
By hiding certain patches of the training images, a difference arises between the training and testing data distributions. For the trained network to generalize well to the test data, the distributions should be approximately equal. 

![alt text](./in-text-images/hiddenpixels.jpg)
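To make the distribution gap concrete, the sketch below compares two ways of filling hidden pixels on a synthetic dataset: with zeros, and with the mean RGB value of the training set. As far as we understand, the latter is the choice made in the Hide and Seek paper. The random "images", the 8x8 patch size, and all variable names are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for a training set: 200 RGB images of 32x32 pixels.
images = rng.random((200, 32, 32, 3))
mean_rgb = images.mean(axis=(0, 1, 2))  # one mean value per colour channel

# Hide each 8x8 patch with probability 0.5 (True marks a hidden patch),
# then expand the 4x4 patch mask to a per-pixel mask.
patch_mask = rng.random((200, 4, 4)) < 0.5
pixel_mask = patch_mask.repeat(8, axis=1).repeat(8, axis=2)[..., None]

zero_filled = np.where(pixel_mask, 0.0, images)
mean_filled = np.where(pixel_mask, mean_rgb, images)

print("full images :", images.mean(axis=(0, 1, 2)))
print("zero filled :", zero_filled.mean(axis=(0, 1, 2)))
print("mean filled :", mean_filled.mean(axis=(0, 1, 2)))
```

With zeros, the average input seen during training is roughly half of what the network sees at test time. With the mean fill, the average input stays close to that of the full images, so the first-order statistics of training and test inputs stay approximately equal.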

### Architectures

![alt text](./in-text-images/AlexNetScheme.jpg)
![alt text](./in-text-images/GoogLeNetScheme.png)