# 1. One-Shot Learning

Solving the one-shot problem is one of the main challenges in face recognition. In most face recognition applications we need to recognize a person given only a single image, that is, just one example of that person’s face. Typically, deep learning algorithms don’t work well when there is only one training example. However, we will show how this problem can be tackled.

Let’s say that we have a database of 4 pictures of employees in one organization, one picture per employee. When someone shows up at the office, we need to identify who has arrived.

<img src="figures/face_recog_example.png" style="width:400px">

So the system, given only one image per person, has to determine whether the person who arrived is in the database. That is, it compares the new image against all images in the database and should output “same” for the image of the matching person and “different” for the images of every other person. This is depicted in the figure above for our mini database of 4 persons.

The one-shot learning problem is therefore a problem where a system has to learn from just one example of a person and still recognize that person again.

One approach we could try is to input the image of the person, feed it to a 𝑐𝑜𝑛𝑣𝑛𝑒𝑡, and have it output a label 𝑦 using a 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 unit. This really doesn’t work well, because if we have such a small training set, it is really not enough data to train a robust neural network for this task. Also, what if a new person joins our team? Then, we have one more person to recognize, so there should be one more output in the 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 output unit. Do we have to retrain the 𝑐𝑜𝑛𝑣𝑛𝑒𝑡 every time? That just doesn’t seem like a good approach.

<img src="figures/face_recog_convnet.jpg" style="width:700px">

**Instead, we are going to learn a “similarity” function. In particular, we want a neural network to learn a function 𝑑 that takes two images as input and outputs the degree of difference between them.**

<img src="figures/similarity_fn.png" style="width:600px">


If we have two images of the same person, we want 𝑑 to output a small number. In contrast, if the images are of two different people, we want 𝑑 to output a large number. At recognition time, if the degree of difference between the two images is less than some threshold 𝜏, which is a hyperparameter, then we predict that the two pictures are of the same person. On the other hand, if it is greater than 𝜏, we predict that these are different persons. This is how we address the face verification problem.
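The thresholding rule can be sketched in a few lines of Python. This is only an illustration: the learned function 𝑑 is replaced here by a plain squared Euclidean distance between small hand-made feature vectors, and the vectors, the names and the value of 𝜏 are made up for the example.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def is_same_person(features_1, features_2, tau):
    """Face verification: predict 'same person' when d is below tau."""
    return squared_distance(features_1, features_2) < tau


# Hypothetical feature vectors (assumption: a real system would use
# the output of a trained network instead).
alice_photo_1 = [0.10, 0.90, 0.30]
alice_photo_2 = [0.12, 0.88, 0.31]
bob_photo = [0.80, 0.20, 0.50]

print(is_same_person(alice_photo_1, alice_photo_2, tau=0.05))  # True
print(is_same_person(alice_photo_1, bob_photo, tau=0.05))      # False
```

In practice 𝜏 would be tuned on held-out pairs rather than fixed by hand.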

To use this idea for a recognition task, we take the new picture and use the function 𝑑 to compare it with the first image in our database. Since these are different persons, the output might be a very large number, let’s say 10 for this example. Next, we compare the new picture with the second image in our database. Because these two are the same person, the output will be a very small number. We then do the same for the other images in our database, and based on these comparisons we figure out that the new picture is actually the second person.

<img src="figures/similarity_fn_example.png" style="width:700px">

In contrast, if someone who is not in our database shows up, we again use the function 𝑑 to make all of these pairwise comparisons. Hopefully, 𝑑 will output a very large number for all four comparisons, and then we say that this person is not in the database. A great advantage of the function 𝑑 is that a new person can join the database easily, without the need to retrain the neural network.
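The recognition procedure described above could look roughly like the following sketch. It assumes each database entry is stored as a small feature vector and again uses a squared Euclidean distance in place of the learned 𝑑; the employee names, vectors and threshold are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def recognize(new_features, database, tau):
    """Compare against every stored entry and return the closest name,
    or None when even the closest entry is not below the threshold."""
    best_name, best_dist = None, float("inf")
    for name, stored in database.items():
        dist = squared_distance(new_features, stored)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist < tau else None


# Hypothetical mini database: one stored vector per employee.
database = {
    "employee_1": [0.9, 0.1],
    "employee_2": [0.1, 0.9],
    "employee_3": [0.5, 0.5],
    "employee_4": [0.2, 0.3],
}

print(recognize([0.12, 0.88], database, tau=0.05))  # employee_2
print(recognize([3.0, 3.0], database, tau=0.05))    # None

# A new person joins: add one entry, no retraining involved.
database["employee_5"] = [3.0, 3.1]
print(recognize([3.0, 3.0], database, tau=0.05))    # employee_5
```

Adding a person only changes the dictionary, which mirrors the point that the function 𝑑 itself does not have to be retrained.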


# 2. Siamese Network

The job of the function 𝑑, which we presented in the previous section, is to take two faces and tell us how similar or how different they are. A good way to accomplish this is to use a Siamese network.

We are used to seeing pictures of 𝑐𝑜𝑛𝑣𝑛𝑒𝑡𝑠 like the two networks in the picture below. We have an input image, denoted 𝑥(1), and through a sequence of 𝐶𝑜𝑛𝑣𝑜𝑙𝑢𝑡𝑖𝑜𝑛𝑎𝑙, 𝑃𝑜𝑜𝑙𝑖𝑛𝑔 and 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layers we end up with a feature vector.

Sometimes this feature vector is fed to a 𝑠𝑜𝑓𝑡𝑚𝑎𝑥 unit to make a classification, but we are not going to use that approach in this post. Instead, we are going to focus on the vector of 128 numbers computed by some 𝐹𝑢𝑙𝑙𝑦 𝑐𝑜𝑛𝑛𝑒𝑐𝑡𝑒𝑑 layer that is deeper in the network. We are going to give this list of 128 numbers a name, 𝑓(𝑥(1)), and we should think of 𝑓(𝑥(1)) as an encoding of the input image 𝑥(1). That means that we have taken the input image and represented it as a vector of 128 numbers. Next, to build a face recognition system we need to compare two pictures, let’s say the first picture with the second picture below. To do this, we feed the second picture to the same neural network with the same parameters and get a different vector of 128 numbers. This is our representation of the second picture. We also say that the picture is encoded in this way.

<img src="figures/siamese_network.png" style="width:700px">

We will call the encoding of the second picture 𝑓(𝑥(2)). Note that here we are using 𝑥(1) and 𝑥(2) just to denote two input images. They don’t necessarily have to be the first and second examples in our training set. Finally, if these encodings are a good representation of the two images, we can define the distance 𝑑 between 𝑥(1) and 𝑥(2). A common choice is the squared norm of the difference between the encodings of the two images.

$$d\left ( x^{\left ( 1 \right )},x^{\left ( 2 \right )} \right )=\left \| f\left ( x^{\left ( 1 \right )} \right )-f\left ( x^{\left ( 2 \right )} \right ) \right \|_{2}^{2}$$

This idea of running two identical convolutional neural networks on two different inputs and then comparing them is called a Siamese neural network architecture.
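To make the weight sharing concrete, here is a minimal sketch in plain Python. The encoder is a toy single linear layer standing in for the convolutional network 𝑓, and the 4-pixel "images", the 2-dimensional encoding and the weight values are all assumptions chosen so the arithmetic is easy to follow.

```python
def encode(image, weights):
    """Toy stand-in for f(x): one linear layer mapping a flattened image
    to an encoding with one number per weight row."""
    return [sum(w * x for w, x in zip(row, image)) for row in weights]


def siamese_distance(image_1, image_2, weights):
    """d(x1, x2) = ||f(x1) - f(x2)||^2, using the SAME weights for
    both inputs, which is what makes the two branches identical."""
    f1 = encode(image_1, weights)
    f2 = encode(image_2, weights)
    return sum((a - b) ** 2 for a, b in zip(f1, f2))


# Hypothetical shared parameters and inputs.
shared_weights = [[1.0, 0.0, -1.0, 0.0],
                  [0.0, 1.0, 0.0, -1.0]]
x1 = [1.0, 2.0, 3.0, 4.0]
x2 = [1.0, 2.0, 3.0, 4.0]   # identical to x1
x3 = [4.0, 3.0, 2.0, 1.0]

print(siamese_distance(x1, x2, shared_weights))  # 0.0
print(siamese_distance(x1, x3, shared_weights))  # 32.0
```

Because both inputs pass through the same `encode` call with the same weights, there is only one set of parameters to learn, and the distance is symmetric in its two arguments.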

How do we train this Siamese neural network? First, remember that these two neural networks share the same parameters, so there is really only one set of parameters to learn. We want to train the network so that the encoding it computes, plugged into the function 𝑑, tells us when two pictures are of the same person.

To put it more formally, the parameters of the neural network define an encoding 𝑓(𝑥(𝑖)). So, given any input image 𝑥(𝑖), the neural network outputs a 128-dimensional encoding 𝑓(𝑥(𝑖)). What we want to do is to learn parameters so that if two pictures 𝑥(𝑖) and 𝑥(𝑗) are of the same person, then the distance between their encodings is small. At the beginning of this section we used 𝑥(1) and 𝑥(2), but this can be any pair 𝑥(𝑖) and 𝑥(𝑗) from our training set.

<img src="figures/siamese_network_learning.png" style="width:700px">

In contrast, if 𝑥(𝑖) and 𝑥(𝑗) are of different persons, then we want that distance between their encodings to be large. 

Now we have a sense of what we want the neural network to output in terms of what would make a good encoding. However, we still don’t know how to define an objective function that makes our neural network learn such an encoding.

# 3. Triplet Loss

One way to learn the parameters of the neural network, which gives us a good encoding for our pictures of faces, is to define and apply gradient descent on the Triplet loss function. Let’s see what that means.

To apply the triplet loss we need to compare pairs of images. For example, given the pair of images on the left in the picture below, we want their encodings to be similar because these are the same person. On the other hand, given a pair of images on the right, we want their encodings to be quite different because these are different persons.

<img src="figures/learning_objective.png" style="width:700px">

In the terminology of the Triplet loss, we always look at one anchor image (A), and we want the distance between the anchor and a positive image (P) to be small, since the positive shows the same person. In contrast, we want the anchor and a negative image (N), which shows a different person, to be much further apart (to have a larger distance). So, we are always looking at three images at a time: an anchor image, a positive image and a negative image.

To formalize this, we want the parameters of our neural network, or our encodings, to have the following property (think of 𝑑 as a distance function):

$$\left \| f\left ( A \right )-f\left ( P \right ) \right \|^{2}\leq \left \| f\left ( A \right )-f\left ( N \right ) \right \|^{2}$$

$$d\left ( A,P \right ) \leq d\left ( A,N \right )$$

Now we are going to make a slight change to this expression:

$$\left \| f\left ( A \right )-f\left ( P \right ) \right \|^{2} - \left \| f\left ( A \right )-f\left ( N \right ) \right \|^{2} \leq 0$$

If 𝑓 always outputs 0, then both distances are 0 and the expression becomes 0−0=0, so by setting 𝑓(𝑖𝑚𝑔)=0⃗ the network can trivially satisfy this inequality. More generally, the same happens whenever the encoding of every image is identical to the encoding of every other image, because then again we get 0−0=0. We need to make sure that the neural network doesn’t just output 0 for all the encodings, or set all the encodings equal to each other. To prevent our network from doing that, we are going to modify this objective so that the difference doesn’t need to be just less than or equal to 0, it needs to be a bit smaller than 0. In particular, we say that it needs to be less than or equal to −𝛼, where 𝛼 is another hyperparameter (it is also called a margin).

$$\| f ( A )-f ( P ) \|^{2}- \| f ( A )-f ( N ) \|^{2} \leq 0 - \alpha$$

This prevents the neural network from outputting the trivial solutions. By convention, we usually write +𝛼 on the left side of the inequality.

$$\| f ( A )-f ( P ) \|^{2}- \| f ( A )-f ( N ) \|^{2}+ \alpha \leq 0$$
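As a quick check of why the margin matters (a worked step added here for illustration), consider a degenerate encoder that maps every image to the same vector. Both distances are then zero, and the left side of the constraint reduces to the margin itself:

$$\| f ( A )-f ( P ) \|^{2}- \| f ( A )-f ( N ) \|^{2}+ \alpha = 0 - 0 + \alpha = \alpha > 0$$

So for any 𝛼 > 0 the constant encoder violates the constraint, whereas with 𝛼 = 0 it would satisfy it.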

So, we want 𝑑(𝐴,𝑁) to be bigger than 𝑑(𝐴,𝑃) by at least the margin. To achieve this, we could either push 𝑑(𝐴,𝑃) down or push 𝑑(𝐴,𝑁) up, so that there is a gap of at least 𝛼 between the anchor-positive distance and the anchor-negative distance. This is the role of the margin parameter. Let’s define the Triplet loss function.

The Triplet loss function is defined on triplets of images. The positive example is of the same person as the anchor, while the negative example is of a different person than the anchor. Now, we are going to define the loss as follows:

$$L ( A,P,N )=\max( \| f ( A )-f( P ) \|^{2}- \| f ( A) -f( N ) \|^{2} + \alpha, 0)$$

As long as we achieve the goal of making $\| f ( A )-f( P )  \|^{2}-  \| f ( A) -f( N )  \|^{2} + \alpha$ less than or equal to zero, then the loss on this example is equal to zero. On the other hand, if this is greater than zero then we take the max so we get a positive loss.
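The behaviour of the max can be seen on two small cases. The sketch below works directly on encodings (plain lists standing in for 𝑓(A), 𝑓(P) and 𝑓(N)); the vectors and the margin value 0.2 are illustrative assumptions.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(||f(A)-f(P)||^2 - ||f(A)-f(N)||^2 + alpha, 0)."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


anchor = [0.0, 0.0]
positive = [0.1, 0.0]

# Negative far from the anchor: the constraint holds, so the loss is zero.
print(triplet_loss(anchor, positive, [1.0, 0.0]))  # 0.0

# Negative close to the anchor: the margin is violated, loss is about 0.12.
print(triplet_loss(anchor, positive, [0.3, 0.0]))
```

Once a triplet satisfies the margin, making the negative even further away does not reduce the loss any more, so such triplets stop contributing to training.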

This is how we define the loss on a single triplet, and the overall cost function for our neural network can be a sum over a training set of these individual losses on different triplets:

$$J=\sum_{i=1}^{m}L\left ( A^{\left ( i \right )},P^{\left ( i \right )},N^{\left ( i \right )} \right )$$
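Summing the per-triplet losses into the cost 𝐽 is then a one-liner. In this sketch each triplet is already given as three encodings, and the numbers are hypothetical; in a real system the encodings would come from the network being trained.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha):
    """Loss on a single (anchor, positive, negative) triplet."""
    return max(squared_distance(f_a, f_p) - squared_distance(f_a, f_n) + alpha, 0.0)


def cost(encoded_triplets, alpha=0.2):
    """J: sum of the individual triplet losses over the training set."""
    return sum(triplet_loss(f_a, f_p, f_n, alpha) for f_a, f_p, f_n in encoded_triplets)


encoded_triplets = [
    ([0.0, 0.0], [0.1, 0.0], [1.0, 0.0]),  # margin satisfied, contributes 0
    ([0.0, 0.0], [0.3, 0.0], [0.4, 0.0]),  # contributes about 0.13
    ([1.0, 1.0], [1.0, 1.5], [1.0, 1.2]),  # contributes about 0.41
]
print(cost(encoded_triplets))  # about 0.54
```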


Let’s imagine that we have a training set of 10,000 pictures of 1,000 different persons. Then, we take our 10,000 pictures and use them to generate triplets. Next, we train our learning algorithm using gradient descent on the cost function that we defined previously. Notice that in order to define this data set of triplets we need pairs of A and P, that is, pairs of pictures of the same person. So, for the purpose of training our system we need a data set with multiple pictures of the same person, at least for some people in our training set, so that we can form pairs of anchor and positive images. If we had just one picture of each person, we couldn’t train this system. After training, we can apply the system to our one-shot learning problem, where our face recognition system may have only a single picture of the person we are trying to recognize.
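One simple way to turn a labeled picture collection into triplets is sketched below. It samples triplets at random, which is only a baseline; the function name `make_triplets`, the labels and the seed are assumptions for illustration. The check at the top reflects the point above: without at least one person who has two or more pictures, no anchor-positive pair exists.

```python
import random
from collections import defaultdict


def make_triplets(labels, num_triplets, seed=0):
    """Sample (anchor, positive, negative) picture indices.

    labels[i] is the identity of picture i. Anchor and positive are two
    different pictures of one person; negative is a picture of another.
    """
    rng = random.Random(seed)
    by_person = defaultdict(list)
    for index, person in enumerate(labels):
        by_person[person].append(index)

    # Only people with 2+ pictures can supply an anchor-positive pair.
    anchor_people = [p for p, idxs in by_person.items() if len(idxs) >= 2]
    if not anchor_people or len(by_person) < 2:
        raise ValueError("need a person with 2+ pictures and a second person")

    triplets = []
    for _ in range(num_triplets):
        person = rng.choice(anchor_people)
        anchor, positive = rng.sample(by_person[person], 2)
        other = rng.choice([p for p in by_person if p != person])
        negative = rng.choice(by_person[other])
        triplets.append((anchor, positive, negative))
    return triplets


labels = ["ann", "ann", "bob", "cat", "cat", "cat"]
for a, p, n in make_triplets(labels, num_triplets=3):
    print(labels[a], labels[p], labels[n])
```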

## 3.1. How do we actually choose these triplets to form our training set?

The first thing that comes to mind for most people is to choose A, P and N randomly from our training set. The problem is that if we choose them at random, then the constraint is very easy to satisfy, because given two randomly chosen pictures of different people, chances are that A and N are much “more different” than A and P.

If A and N are two randomly chosen different persons, then there is a very high chance that $\| f ( A) -f( N )  \|^{2}$ will be much bigger than $\| f ( A )-f( P )  \|^{2} + \alpha$, and so the neural network won’t learn much from such a triplet. Instead, we want to construct our training set by choosing triplets A, P and N that represent “hard cases” to train on.

In particular, we want the constraint $d(A,P) + \alpha \leq d(A,N)$ to be satisfied for all triplets, so a triplet that is hard is one where 𝑑(𝐴,𝑃) is actually quite close to 𝑑(𝐴,𝑁).

In that case the learning algorithm has to try extra hard to push 𝑑(𝐴,𝑁) up or push 𝑑(𝐴,𝑃) down, so that there is at least a margin of 𝛼 between them. The effect of choosing these triplets is that it increases the computational efficiency of our learning algorithm. If we choose the triplets randomly, then too many triplets would be really easy, and gradient descent won’t do anything because the neural network will get them right pretty much all the time. So, by choosing hard triplets, the gradient descent procedure has to do some work to push 𝑑(𝐴,𝑃) further away from 𝑑(𝐴,𝑁).
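A simple way to look for hard cases, given the current encodings, is to pick the negative that is closest to the anchor and to keep only triplets that still violate the margin. This is one possible selection rule sketched under assumptions, not necessarily what any particular system uses; the helper names and vectors are hypothetical.

```python
def squared_distance(u, v):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_negative(f_a, negative_encodings):
    """Index of the candidate negative closest to the anchor."""
    return min(range(len(negative_encodings)),
               key=lambda i: squared_distance(f_a, negative_encodings[i]))


def is_hard(f_a, f_p, f_n, alpha):
    """True when the triplet still violates d(A,P) + alpha <= d(A,N)."""
    return squared_distance(f_a, f_p) + alpha > squared_distance(f_a, f_n)


anchor = [0.0, 0.0]
positive = [0.2, 0.0]
candidates = [[2.0, 0.0], [0.5, 0.0], [1.0, 1.0]]

index = hardest_negative(anchor, candidates)
print(index)                                                   # 1
print(is_hard(anchor, positive, candidates[index], alpha=0.3)) # True
print(is_hard(anchor, positive, candidates[0], alpha=0.3))     # False
```

Since the encodings change as the network trains, such a selection would have to be recomputed periodically during training.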

## 3.2. Training set using triplet loss

To train with the triplet loss, we need to take our training set and map it to a lot of triplets.

Here is a nice example. The Anchor and Positive are the same person, but the Anchor and Negative are different persons. We need to build a training set of such Anchor, Positive and Negative triplets. Then, we use gradient descent to minimize the cost function 𝐽 that we defined earlier. The gradients are backpropagated to all the parameters of the neural network in order to learn an encoding such that the function 𝑑 of two images is small when the two images are of the same person and large when they are of different persons.
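To show the mechanics of gradient descent on the triplet loss without a deep learning framework, the sketch below uses a toy linear encoder 𝑓(𝑥) = 𝑊𝑥 with a hand-derived gradient. This is an assumption made for illustration only: a real face recognition system would use a convolutional network and automatic differentiation, and the inputs, learning rate and step count here are made up.

```python
def matvec(W, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def triplet_loss_and_grad(W, a, p, n, alpha):
    """Triplet loss for f(x) = W x and its gradient with respect to W.

    With u = a - p and v = a - n the loss is max(|Wu|^2 - |Wv|^2 + alpha, 0);
    when it is positive, the gradient is 2 (Wu) u^T - 2 (Wv) v^T.
    """
    u = [ai - pi for ai, pi in zip(a, p)]
    v = [ai - ni for ai, ni in zip(a, n)]
    Wu, Wv = matvec(W, u), matvec(W, v)
    value = sum(c * c for c in Wu) - sum(c * c for c in Wv) + alpha
    if value <= 0:
        # Margin satisfied: zero loss and zero gradient.
        return 0.0, [[0.0] * len(row) for row in W]
    grad = [[2 * Wu[i] * u[j] - 2 * Wv[i] * v[j] for j in range(len(u))]
            for i in range(len(W))]
    return value, grad


def train(W, triplets, alpha=0.2, lr=0.5, steps=100):
    """Plain gradient descent, one update per triplet per step."""
    for _ in range(steps):
        for a, p, n in triplets:
            _, grad = triplet_loss_and_grad(W, a, p, n, alpha)
            W = [[w - lr * g for w, g in zip(w_row, g_row)]
                 for w_row, g_row in zip(W, grad)]
    return W


# Hypothetical 2-pixel "images"; initially the negative is closer than the positive.
triplet = ([1.0, 0.0], [0.8, 0.3], [0.9, -0.2])
W_init = [[1.0, 0.0], [0.0, 1.0]]

loss_before, _ = triplet_loss_and_grad(W_init, *triplet, alpha=0.2)
W_trained = train(W_init, [triplet])
loss_after, _ = triplet_loss_and_grad(W_trained, *triplet, alpha=0.2)
print(loss_before, loss_after)  # about 0.28, then 0.0
```

Once the margin is satisfied for a triplet, its gradient is zero and the parameters stop changing, which is the same effect that makes easy triplets uninformative during training.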

<img src="figures/triplets_example.png" style="width:700px">

That is it for the triplet loss and how we can use it to train a neural network to output a good encoding for the face recognition task. Today’s face recognition systems, especially the large-scale commercial ones, are trained on very large data sets. Data sets with more than a million images are not uncommon, so this is one domain where it is often useful to download someone else’s pretrained model rather than doing everything from scratch.