## Humpback Whale challenge  -- RESULTS!

A proposed solution to [this challenge](https://www.kaggle.com/c/humpback-whale-identification).

Prithvi Raju [email](nprihviraj24@gmail.com)

[Github Project](https://github.com/nprithviraj24/deep-learning/tree/master/few-shot-learning) <br />

#### ArXiv links to Reference papers: <br />

[In Defense of the Triplet Loss for Person Re-Identification](https://arxiv.org/abs/1703.07737) <br />    

[FaceNet](https://arxiv.org/abs/1503.03832)

<strong>Objective</strong>: identify individual whales in images. 
    
Constituents of dataset: 

    - train.zip - a folder containing the training images
    - train.csv - maps the training Image to the appropriate whale Id. Whales that are not predicted to have a label identified in the training data should be labeled as new_whale.
    - test.zip - a folder containing the test images to predict the whale Id
    - sample_submission.csv - a sample submission file in the correct format
    
    

### Evaluating the data

Before building the model, let's understand the data first. 

    import pandas as pd
    file = pd.read_csv('train.csv')
    ids = file.Id
    print(" Number of training samples: ", ids.shape)
    uni = file.Id.value_counts()
    gt1 = uni[uni>1]
    print(" Number of unique whales with pictures more than one: ", gt1)

Output:

     Number of training samples:  (9850,)
     Number of unique whales with pictures more than one:  new_whale    810
    w_1287fbc     34
    w_98baff9     27
    w_7554f44     26
    w_1eafe46     23
                ... 
    w_80c692d      2
    w_eb44149      2
    w_73cbacd      2
    w_17a3581      2
    w_0466071      2
    Name: Id, Length: 2031, dtype: int64


    import os
    path, dirs, files = next(os.walk("train/train"))
    file_count = len(files)
    print("Number of images in train folder: ", file_count)

    file.loc[1]

Output:

    Number of images in train folder:  9850

    Image    000466c4.jpg
    Id          w_1287fbc
    Name: 1, dtype: object

We can conclude that there are 4251 classes for only 9850 images. Most of the classes have only one training image.

## Few-shot learning using CNN block from Siamese network and  Batch hard strategy for optimization

<br />

<h2>Pipeline to create a matching network</h2>

>  Data preprocessing and augmentation 

    I will be using preprocessing steps done in popular notebooks. I will try to reason why certain preprocessing steps are crucial.

>  Matching Network
    
A method used to represent discrete variables in data manifold $ \mathbb{R} $ as continuous vectors.
   
>> Build an Encoder Network  

>> Generate "image" embeddings 

>> Pairwise distance between query samples and support sets.

>> Calculating predictions by taking weighted average of the support set labels with the normalised distance.


>  Batch hard strategy for addressing loss functions 

            By using Online triplet mining.
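
The last two matching-network steps in the outline above (pairwise distances, then a distance-weighted average of the support labels) can be sketched in plain Python. This is only an illustration under assumptions: the 2-D embeddings are made up, the distance is Euclidean, the weights are a softmax over negative distances, and the function names are hypothetical rather than taken from the project code.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def predict_from_support(query, support_embeddings, support_labels):
    """Return {label: weight} for one query embedding (weights sum to 1)."""
    distances = [euclidean(query, s) for s in support_embeddings]
    # Softmax over negative distances: closer support samples weigh more.
    exps = [math.exp(-d) for d in distances]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted vote: add each support sample's weight to its label.
    scores = {}
    for w, label in zip(weights, support_labels):
        scores[label] = scores.get(label, 0.0) + w
    return scores

# Toy support set: two images of whale "w_a", one of whale "w_b".
support = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0)]
labels = ["w_a", "w_a", "w_b"]
scores = predict_from_support((0.2, 0.1), support, labels)
best = max(scores, key=scores.get)
print(best, scores)
```

The query sits next to the two `w_a` embeddings, so almost all of the weight lands on that label.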




Building the model, I incorporate the ResNet block, wonderfully explained in this [blog](http://teleported.in/posts/decoding-resnet-architecture/) post. 

Why ResNet?

As a rule of thumb, ResNet has worked well in this case.

`` The idea is to form a subblock with a 1x1 convolution reducing the number of features, a 3x3 convolution and another 1x1 convolution to restore the number of features to the original. The output of these convolutions is then added to the original tensor (bypass connection). I use 4 such subblocks by block, plus a single 1x1 convolution to increase the feature count after each pooling layer. ``



### PyTorch model creation

The branch model is composed of 6 blocks, each block processing maps with smaller and smaller resolution, with intermediate pooling layers.

    Block 1 - 384x384
    Block 2 - 96x96
    Block 3 - 48x48
    Block 4 - 24x24
    Block 5 - 12x12
    Block 6 - 6x6

> Block 1 has a single convolution layer with stride 2 followed by 2x2 max pooling. Because of the high resolution, it uses a lot of memory, so a minimum of work is done here to save memory for subsequent blocks.

> Block 2 has two 3x3 convolutions similar to VGG. These convolutions are less memory intensive than the subsequent ResNet blocks, and are used to save memory. Note that after this, the tensor has dimension 96x96x64, which holds four times as many values as the initial 384x384x1 image (589,824 versus 147,456), so we can assume no significant information has been lost.

> Blocks 3 to 6 perform ResNet like convolution. I suggest reading the original paper, but the idea is to form a subblock with a 1x1 convolution reducing the number of features, a 3x3 convolution and another 1x1 convolution to restore the number of features to the original. The output of these convolutions is then added to the original tensor (bypass connection). I use 4 such subblocks by block, plus a single 1x1 convolution to increase the feature count after each pooling layer.

> The final step of the branch model is a global max pooling, which makes the model robust to the fluke not always being well centered.
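
The subblock described for Blocks 3 to 6 (1x1 reduce, 3x3, 1x1 restore, then the bypass addition) might look roughly like this in PyTorch. It is a sketch, not the code of the referenced notebook: the class name `BottleneckSubblock`, the channel counts (128 and 64), and the placement of batch normalization and ReLU are all assumptions made for illustration.

```python
import torch
import torch.nn as nn

class BottleneckSubblock(nn.Module):
    """1x1 reduce -> 3x3 -> 1x1 restore, added back to the input (bypass)."""

    def __init__(self, channels, reduced_channels):
        super().__init__()
        self.body = nn.Sequential(
            # 1x1 convolution reduces the number of features
            nn.Conv2d(channels, reduced_channels, kernel_size=1),
            nn.BatchNorm2d(reduced_channels),
            nn.ReLU(inplace=True),
            # 3x3 convolution works on the reduced feature count
            nn.Conv2d(reduced_channels, reduced_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(reduced_channels),
            nn.ReLU(inplace=True),
            # 1x1 convolution restores the original feature count
            nn.Conv2d(reduced_channels, channels, kernel_size=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Bypass connection: the block output has the same shape as x.
        return self.relu(x + self.body(x))

block = BottleneckSubblock(channels=128, reduced_channels=64)  # assumed sizes
x = torch.randn(2, 128, 48, 48)
y = block(x)
print(y.shape)
```

Because the output shape equals the input shape, several such subblocks can be stacked back to back inside one block.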


I'm acknowledging this [notebook](https://www.kaggle.com/martinpiotte/whale-recognition-model-with-score-0-78563) because I incorporate a few of its preprocessing techniques to make this work. When it comes to data preprocessing, I follow established rules of thumb because they make life so much easier. 

### Other interesting methods:

<br />

#### Generic One shot siamese network.
[Refer to this paper](https://www.cs.cmu.edu/~rsalakhu/papers/oneshot1.pdf)
- Learning a vector representation of a complex input, like an image, is an example of dimensionality reduction. 
- It uses a <strong>contrastive loss</strong>, which is optimized by bringing the embeddings of two images of the same class closer together and pushing embeddings of different classes apart.  


## Defining the Encoder Network Architecture 

I've tried to keep everything succinct and detailed. 

An encoder CNN architecture. Instead of an MLP, which uses linear, fully-connected layers, we can instead use:
* [Convolutional layers](https://pytorch.org/docs/stable/nn.html#conv2d), which can be thought of as stack of filtered images.
* [Maxpooling layers](https://pytorch.org/docs/stable/nn.html#maxpool2d), which reduce the x-y size of an input, keeping only the most _active_ pixels from the previous layer.
* The usual Linear + Dropout layers to avoid overfitting and produce a 10-dim output.
* Batch Normalization layer: The motivation behind it is purely statistical: it has been shown that normalized data, i.e., data with zero mean and unit variance, allows networks to converge much faster. So we want to normalize our mini-batch data, but, after applying a convolution, our data may no longer have zero mean and unit variance. So we apply batch normalization after each convolutional layer.
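
To make the "zero mean and unit variance" idea concrete, here is a tiny numeric sketch of the normalization step. It deliberately leaves out the learnable scale and shift that a real batch normalization layer also applies; the function name and the sample values are hypothetical.

```python
import math

def normalize_batch(values, eps=1e-5):
    """Shift to zero mean and scale to (almost) unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    # eps avoids division by zero when all values are equal
    return [(v - mean) / math.sqrt(var + eps) for v in values]

activations = [2.0, 4.0, 6.0, 8.0]  # made-up activations of one feature
normalized = normalize_batch(activations)

new_mean = sum(normalized) / len(normalized)
new_var = sum((v - new_mean) ** 2 for v in normalized) / len(normalized)
print(normalized)
print(new_mean, new_var)
```

The mean of the normalized values is zero and the variance is just below one because of the small `eps` term.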

### What about selecting the right kernel size?
We generally prefer smaller filters, like 3×3, 5×5 or 7×7, but which of these works best? 

<br />


#### In practice, it is rare to build your own encoder network from scratch, because so many existing architectures will do the job, and these architectures are implemented in different frameworks.

Example: 

ResNet18, VGG-16 etc. Slight modification to these networks will do the job. I used ResNet variations with a dense layer connected at the end.

#### A sample encoder architecture from a popular notebook for the same challenge is given below


    import torch
    import torch.nn as nn
    import torch.nn.functional as F


    class SiameseNetwork(nn.Module):
        def __init__(self):
            super().__init__()

            # Convolutional feature extractor (expects 1-channel images)
            self.cnn1 = nn.Sequential(
                nn.Conv2d(1, 96, kernel_size=11, stride=1),
                nn.ReLU(inplace=True),
                nn.LocalResponseNorm(5, alpha=0.0001, beta=0.75, k=2),
                nn.MaxPool2d(3, stride=2),

                nn.Conv2d(96, 256, kernel_size=5, stride=1, padding=2),
                nn.ReLU(inplace=True),
                nn.LocalResponseNorm(5, alpha=0.0001, beta=0.75, k=2),
                nn.MaxPool2d(3, stride=2),
                nn.Dropout2d(p=0.3),

                nn.Conv2d(256, 384, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(384, 256, kernel_size=3, stride=1, padding=1),
                nn.ReLU(inplace=True),
                nn.MaxPool2d(3, stride=2),
                nn.Dropout2d(p=0.3),
            )

            # Fully connected layers. 30976 = 256 * 11 * 11, which is the
            # flattened output of cnn1 for, e.g., a 1x105x105 input.
            self.fc1 = nn.Sequential(
                nn.Linear(30976, 1024),
                nn.ReLU(inplace=True),
                # Plain Dropout: the tensor is flat here, not a feature map
                nn.Dropout(p=0.5),

                nn.Linear(1024, 128),
                nn.ReLU(inplace=True),

                nn.Linear(128, 2),
            )

        def forward_once(self, x):
            # Encode one batch of images into embeddings
            output = self.cnn1(x)
            output = output.view(output.size(0), -1)
            output = self.fc1(output)
            return output

        def forward(self, input1, input2):
            # Only input1 is encoded and returned as the embedding;
            # input2 is accepted but not used.
            output1 = self.forward_once(input1)
            return output1


Please Note: This notebook only outlines how I've built the model, which is yet to be tested. Theoretically, this model should solve the problem well. 

### Approach to the solution:
    
[You can find my solution in this paper as well.](https://github.com/nprithviraj24/deep-learning/blob/master/few-shot-learning/Few-shot-learning-with-online-triplet-mining.pdf)
    
    
To train our model, we first need to make it learn which class an image belongs to. 

- Initially we take an image (class/label Z) and call it the anchor image. After encoding, this image is represented somewhere in a Euclidean space with D dimensions; let's call that location A.
- We take another image of the same class Z and call it the positive. It is represented in the same D-dimensional Euclidean space at, say, B.
- A third image is picked from a different class, say Y, and called the negative; it is represented in the same space at point C. The picture below captures anchor, positive and negative beautifully.

Now our objective is to train the model such that images of the <strong>same</strong> class are closer to each other than images of different classes. In short, let $d$ be a distance function (generally the $L_2$ distance, often in its squared form); then

<center>$d(\text{anchor}, \text{negative}) > d(\text{anchor}, \text{positive})$</center>
and the difference should be at least a margin. 

 
    
![Obama](obama.png)

### Why Triplet loss?

[Reference](https://www.youtube.com/watch?v=d2XB5-tuCWU)

Rather than calculating loss based on two examples ( contrastive loss ), triplet loss involves an anchor example and one positive or matching example (same class) and one negative or non-matching example (differing class).

The loss function penalizes the model such that the distance between the matching examples is reduced and the distance between the non-matching examples is increased. Explanation: 
For some distance on the embedding space d, the loss of a triplet (a,p,n) is:
<center> $L = \max(d(a,p) - d(a,n) + \text{margin},\ 0)$ </center>

We minimize this loss, which pushes $d(a,p)$ to 0 and $d(a,n)$ to be greater than $d(a,p)+margin$. As soon as n becomes an “easy negative”, the loss becomes zero.
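
As a small worked example of the formula above, the following sketch evaluates the loss for one anchor and positive against a nearby and a distant negative. The points, the margin of 1.0 and the helper names are made up for illustration, and the Euclidean distance is assumed for $d$.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def triplet_loss(anchor, positive, negative, margin):
    """L = max(d(a, p) - d(a, n) + margin, 0)."""
    d_ap = euclidean(anchor, positive)
    d_an = euclidean(anchor, negative)
    return max(d_ap - d_an + margin, 0.0)

anchor = (0.0, 0.0)
positive = (3.0, 4.0)        # d(a, p) = 5
near_negative = (0.0, 5.5)   # d(a, n) = 5.5, inside the margin
far_negative = (0.0, 10.0)   # d(a, n) = 10, well beyond the margin

loss_near = triplet_loss(anchor, positive, near_negative, margin=1.0)
loss_far = triplet_loss(anchor, positive, far_negative, margin=1.0)
print(loss_near)  # 0.5
print(loss_far)   # 0.0
```

The distant negative already satisfies the margin, so it contributes no loss and therefore no gradient.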



### So how do I build such a loss function optimally?

#### Keyword: Optimally.

Let me first acknowledge this [paper](https://arxiv.org/abs/1703.07737). 

            Images are now referred as embeddings.
            
Before calculating the loss, we need to sample only the relevant triplets (i.e. anchor, positive and negative). To explain this better, let's sort the triplets into three categories:
- Easy negative: $d(negative, anchor) > d(positive, anchor) + margin$. In this case $L$ is zero (recall the previous cell), so no gradient is propagated backwards.
- Hard negative: $d(negative, anchor) < d(positive, anchor)$. This means that the network performed poorly, and a significant gradient is computed to modify the weights.
- Semi-hard: triplets where the negative is not closer to the anchor than the positive, but which still have positive loss: $d(a,p)<d(a,n)<d(a,p)+margin$
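
The three categories can be written as a small decision function over the two distances. This is a sketch: the function name is hypothetical, and how the exact boundary cases (equal distances) are assigned is an arbitrary choice made here.

```python
def triplet_category(d_ap, d_an, margin):
    """Classify a triplet from d(a, p), d(a, n) and the margin."""
    if d_an < d_ap:
        return "hard"        # negative is closer to the anchor than the positive
    if d_an < d_ap + margin:
        return "semi-hard"   # correct order, but still inside the margin
    return "easy"            # loss is zero

print(triplet_category(d_ap=1.0, d_an=0.5, margin=0.2))  # hard
print(triplet_category(d_ap=1.0, d_an=1.1, margin=0.2))  # semi-hard
print(triplet_category(d_ap=1.0, d_an=2.0, margin=0.2))  # easy
```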

<br />

##### Offline triplet mining

- The first way to produce triplets is to find them offline, at the beginning of each epoch for instance. We compute all the embeddings on the training set, and then only select hard or semi-hard triplets. We can then train one epoch on these triplets.
- Concretely, we would produce a list of triplets (i,j,k). We would then create batches of these triplets of size B, which means we will have to compute 3B embeddings to get the B triplets, compute the loss of these B triplets and then backpropagate into the network.
- Overall this technique is not very efficient since we need to do a full pass on the training set to generate triplets. It also requires updating the offline mined triplets regularly.


##### Online triplet mining

The idea here is to compute useful triplets on the fly, for each batch of inputs. Given a batch of $B$ examples (for instance $B$ images of faces), we compute the $B$ embeddings and we then can find a maximum of $B^3$ triplets. Of course, most of these triplets are not valid (i.e. they don’t have 2 positives and 1 negative).
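
To see how few of the $B^3$ index combinations are valid, the sketch below enumerates them for a toy batch. The labels are made up and `valid_triplets` is a hypothetical helper; a triplet counts as valid here when anchor and positive are distinct images of the same whale and the negative belongs to a different whale.

```python
from itertools import product

def valid_triplets(labels):
    """All valid (anchor, positive, negative) index triplets in a batch."""
    n = len(labels)
    return [
        (a, p, neg)
        for a, p, neg in product(range(n), repeat=3)
        if a != p and labels[a] == labels[p] and labels[a] != labels[neg]
    ]

labels = ["w_a", "w_a", "w_b", "w_b", "w_c"]  # toy batch, B = 5
triplets = valid_triplets(labels)
print(len(labels) ** 3, len(triplets))  # 125 12
```

The whale with a single image (`w_c`) can only ever serve as a negative, which is one reason batches are usually built with several images per identity.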

Suppose that you have a batch of whale flukes as input of size $B=PK$, composed of $P$ different whales with $K$ images each. A typical value is $K=4$. The two strategies are:

- batch all:
    - select all the valid triplets, and average the loss on the hard and semi-hard triplets.
    - a crucial point here is to not take into account the easy triplets (those with loss 0), as averaging on them would make the overall loss very small.
    - this produces a total of $PK(K-1)(PK-K)$ triplets ($PK$ anchors, $K-1$ possible positives per anchor, $PK-K$ possible negatives).
- batch hard:
    - for each anchor, select the hardest positive (biggest distance $d(a,p)$) and the hardest negative (smallest distance $d(a,n)$) among the batch.
    - this produces $PK$ triplets.
    - the selected triplets are the hardest among the batch.
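
A minimal plain-Python sketch of the batch-hard strategy follows. It is not taken from `train.py`; the 2-D embeddings, the margin of 0.5, the Euclidean distance and the function name are assumptions chosen so the numbers are easy to check by hand.

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def batch_hard_triplet_loss(embeddings, labels, margin):
    """Mean triplet loss over one (hardest positive, hardest negative) per anchor."""
    losses = []
    for i, anchor in enumerate(embeddings):
        positives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if labels[j] == labels[i] and j != i]
        negatives = [euclidean(anchor, e) for j, e in enumerate(embeddings)
                     if labels[j] != labels[i]]
        if not positives or not negatives:
            continue  # this anchor cannot form a valid triplet
        hardest_positive = max(positives)   # farthest same-class sample
        hardest_negative = min(negatives)   # closest other-class sample
        losses.append(max(hardest_positive - hardest_negative + margin, 0.0))
    mean_loss = sum(losses) / len(losses) if losses else 0.0
    return mean_loss, len(losses)

# Toy batch with P = 2 whales and K = 2 images each.
embeddings = [(0.0, 0.0), (2.0, 0.0), (1.0, 0.0), (5.0, 0.0)]
labels = ["w_a", "w_a", "w_b", "w_b"]
loss, n_triplets = batch_hard_triplet_loss(embeddings, labels, margin=0.5)
print(loss, n_triplets)  # 2.0 4
```

One triplet is produced per anchor, so the batch of $PK = 4$ images yields 4 triplets, matching the count given for batch hard.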


### Important Note:

Machine learning practitioners generally hold that an approach that works on one dataset will not always yield similar results on a different dataset. So naturally, we should always explore different options. 

<br />

Since I prefer to write in PyTorch, there is equivalent PyTorch code for implementing the batch-hard strategy <strong>criterion</strong>.

## UPDATE!!
The cells below will explain the code and approach on how this specific model is built in PyTorch. 

#### Please Note: It is assumed that this notebook is read along with ```train.py``` file.

### Data Loading and DataLoader

WhaleDataset class: Extended from torch's Dataset class.
<strong>\__init__()</strong> and <strong>\__getitem__()</strong> are to be defined.

### Splitting
Training data is further split into Training and Validation.
Validation: I only considered classes that had more than 3 images, and chose one image from each of them to be part of validation.
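
One way such a split could be done is sketched below. It assumes the rule is "every class with more than 3 images gives up one randomly chosen image for validation", which is an interpretation of the description rather than the code in `train.py`; the function name, the `min_images` and `seed` parameters, and the file names are hypothetical.

```python
import random
from collections import defaultdict

def split_train_validation(samples, min_images=4, seed=0):
    """samples: list of (image_name, whale_id) pairs with unique image names."""
    by_class = defaultdict(list)
    for image, whale_id in samples:
        by_class[whale_id].append(image)

    rng = random.Random(seed)
    train, validation = [], []
    for whale_id, images in by_class.items():
        # Only classes with enough images hold one image out.
        held_out = rng.choice(images) if len(images) >= min_images else None
        for image in images:
            target = validation if image == held_out else train
            target.append((image, whale_id))
    return train, validation

samples = [
    ("a1.jpg", "w_a"), ("a2.jpg", "w_a"), ("a3.jpg", "w_a"), ("a4.jpg", "w_a"),
    ("b1.jpg", "w_b"), ("b2.jpg", "w_b"),
]
train, validation = split_train_validation(samples)
print(len(train), len(validation))  # 5 1
```

Classes with few images stay entirely in the training set, so every validation whale still has training examples to match against.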

### Encoder:
Converting a (224,224) image to a 500-D or 1000-D embedding needs a very deep convolutional neural network. A problem with deep CNNs is the vanishing gradient, so I had to choose a robust model such as ResNet, whose bypass connections make it far less prone to this problem. I have included a dense layer at the end to map the feature values to an embedding space with __D__ dimensions.

### Defining the Criterion (Loss) :  TripletLossCosine
CosDistance = 1 - CosineSimilarity.
The class receives anchor, positive and negative instances, and the triplet loss is calculated using the cosine distance.

        def forward(self, anchor, positive, negative):
            dist_to_positive = 1 - F.cosine_similarity(anchor, positive)
            dist_to_negative = 1 - F.cosine_similarity(anchor, negative)
            loss = F.relu(dist_to_positive - dist_to_negative + self.MARGIN)
            loss = loss.mean()
            return loss
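
To get a feel for the cosine distance used in this criterion, here is a plain-Python sketch with hand-picked 2-D vectors. The vectors and the margin of 0.5 are made up (the actual `MARGIN` value is not stated here), and `cosine_distance` is a hypothetical helper mirroring `1 - F.cosine_similarity`.

```python
import math

def cosine_distance(u, v):
    """1 - cosine similarity; ranges from 0 (same direction) to 2 (opposite)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)

anchor = (1.0, 0.0)
positive = (2.0, 0.0)    # same direction, different length
negative = (0.0, 3.0)    # orthogonal to the anchor
opposite = (-1.0, 0.0)   # opposite direction

d_pos = cosine_distance(anchor, positive)
d_neg = cosine_distance(anchor, negative)
d_opp = cosine_distance(anchor, opposite)
margin = 0.5  # assumed value for illustration
loss = max(d_pos - d_neg + margin, 0.0)
print(d_pos, d_neg, d_opp, loss)  # 0.0 1.0 2.0 0.0
```

Unlike the Euclidean distance, the cosine distance ignores the length of the embeddings and only compares their direction.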
            
 ### Train the model. 
 
 ``` Please Note: The selection of positives and negatives here does not follow the batch-hard strategy I proposed earlier. Choosing hard positives and negatives often gives faster convergence, but the model is also prone to getting stuck in local minima (if we are dealing with a non-convex loss function). ```
 - Negative of an anchor is chosen randomly from all the images in the training set that don't share the same label as the anchor.
 - Positive of an anchor is chosen randomly from images sharing the same label as the anchor. If there are none, then the anchor's positive instance is itself.
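
The two sampling rules above can be sketched as follows. This is an illustration rather than the code in `train.py`: the function name is hypothetical, indices stand in for images, and the labels are made up.

```python
import random

def sample_triplet(labels, anchor_index, rng):
    """Return (anchor, positive, negative) indices using random sampling."""
    anchor_label = labels[anchor_index]
    positives = [i for i, label in enumerate(labels)
                 if label == anchor_label and i != anchor_index]
    negatives = [i for i, label in enumerate(labels) if label != anchor_label]
    # Fall back to the anchor itself when the class has a single image.
    positive_index = rng.choice(positives) if positives else anchor_index
    negative_index = rng.choice(negatives)
    return anchor_index, positive_index, negative_index

labels = ["w_a", "w_a", "w_b", "w_c"]
rng = random.Random(0)
triplet_a = sample_triplet(labels, 0, rng)  # "w_a" has a second image
triplet_b = sample_triplet(labels, 2, rng)  # "w_b" has only one image
print(triplet_a, triplet_b)
```

When the anchor is its own positive, $d(a,p)$ is zero, so such a triplet only trains the network to push the negative at least a margin away.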
 
 ### Hyperparameters: 
 Due to limited usage, I could only train 15 images on 5 different models.
 The results of different models and hyperparameters are discussed in next cell.
 
 ### Testing
 First of all, we calculate the cosine similarity of each test image with each training image. The idea is to select the k nearest neighbours.
 
 `sklearn`'s cosine similarity function is used to calculate the cosine similarities. This yields a 2D kernel matrix of shape (n_samples_X, n_samples_Y).
  
  To test on an image, ```FindTopK()``` is used to get its nearest classes.
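
The nearest-class lookup can be illustrated with a small plain-Python sketch. It is not the `FindTopK()` from `train.py`: `top_k_labels` is a hypothetical stand-in, the embeddings are made up, and skipping repeated labels so that k distinct classes are returned is an assumption.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k_labels(test_embedding, train_embeddings, train_labels, k):
    """Labels of the k most similar training classes, most similar first."""
    ranked = sorted(
        zip(train_embeddings, train_labels),
        key=lambda pair: cosine_similarity(test_embedding, pair[0]),
        reverse=True,
    )
    result = []
    for _, label in ranked:
        if label not in result:  # keep each class only once
            result.append(label)
        if len(result) == k:
            break
    return result

train_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (-1.0, 0.0)]
train_labels = ["w_a", "w_a", "w_b", "w_c"]
nearest = top_k_labels((1.0, 0.2), train_embeddings, train_labels, k=2)
print(nearest)  # ['w_a', 'w_b']
```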

### RESULTS

Please find this [Tensorboard](https://tensorboard.dev/experiment/oDvUYRgeTjWOJhowpJs58w/#scalars) for results.

##### In Scalars: Runs signify:

```res18-ep_15-TIME_2019-11-29 15:35:11.639278 ```

<strong>res18</strong> >> architecture used. <br />
<strong>ep_15</strong> >> number of epochs. <br />
<strong>TIME_2019-11-29 15:35:11.639278</strong>  >> Program initiation TIMESTAMP.


NOTE: If I explicitly mention ```D-1000```, it indicates that images are projected to a 1000-D Euclidean space; otherwise, a 500-D Euclidean space.