# Face Recognition

**Summary:**

- one-shot learning
- siamese network
- triplet loss
- face verification and binary classification

Face recognition + liveness detection = real-time face recognition

Face verification vs. face recognition

verification:
- input image + label
- output: whether input image is that of the claimed person

recognition:
- has a database of K persons
- get an input image
- output ID if the image is any of the K persons

## One Shot Learning

most face-recognition applications need to be able to recognize a person given one single training example.

image -> CNN -> softmax (C + 1)  
doesn't work well

one shot learning: learn from just one example of a person and still recognize that person again

we don't want to retrain the model every time

learn similarity function to solve the problem of one shot learning
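A minimal sketch of how a learned similarity function could serve both verification and recognition. The threshold `tau`, the function names, and the toy distance are assumptions for illustration, not part of these notes; in practice `d` would be computed by the trained network.

```python
def verify(d, image, reference, tau):
    """Face verification: is `image` the claimed person (a 1:1 check)?"""
    return d(image, reference) <= tau


def recognize(d, image, database, tau):
    """Face recognition: return the ID of the closest of the K persons, or None."""
    best_id, best_dist = None, float("inf")
    for person_id, reference in database.items():
        dist = d(image, reference)
        if dist < best_dist:
            best_id, best_dist = person_id, dist
    # Reject the match if even the closest person is too far away.
    return best_id if best_dist <= tau else None


def toy_d(a, b):
    """Toy stand-in for a learned distance: 'images' are plain numbers."""
    return abs(a - b)


# One reference example per person (hypothetical data).
database = {"alice": 1.0, "bob": 5.0}
```

Enrolling a new person only means adding one entry to `database`; nothing is retrained.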

## Siamese Network

Instead of feeding the encoding into a softmax output layer, we can design the NN to output a vector encoding $f(x^{(i)})$ and learn parameters such that if $x^{(i)}$ and $x^{(j)}$ are the same person, the distance $d$ between the encoding vectors is small:

$$d\!\left(x^{(i)}, x^{(j)}\right) = \big\|f\big(x^{(i)}\big) - f\big(x^{(j)}\big)\big\|^{2}_{2}$$

Conversely, if $x^{(i)}$ and $x^{(j)}$ are not the same person, you want $d$ to be large. Both images pass through the same network with the same parameters, and back-propagation adjusts those parameters so that both conditions hold.
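The distance above can be sketched in a few lines. The `encode` function here is a hypothetical stand-in for the CNN (a single linear layer), chosen only to show that both inputs go through the same weights.

```python
def encode(x, weights):
    """Hypothetical encoder f(x): one linear layer instead of a real CNN."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def distance(x_i, x_j, weights):
    """Squared L2 distance between the encodings of two inputs."""
    f_i = encode(x_i, weights)  # same weights for both inputs:
    f_j = encode(x_j, weights)  # this sharing is what makes it "Siamese"
    return sum((a - b) ** 2 for a, b in zip(f_i, f_j))


# Identity weights, so the encoding equals the input (illustration only).
identity = [[1, 0], [0, 1]]
```

With these toy weights, `distance([0, 0], [3, 4], identity)` evaluates to 25, and the distance of any input to itself is 0.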

## Triplet Loss

One way to learn the parameters of this NN is to define the *triplet loss* function and apply gradient descent to it. Let's define three different images:

- **A**: the anchor image -- a ground-truth image on which we make comparisons
- **P**: a positive example
- **N**: a negative example

From here, we want the distance between **A** and **P** to be small, and the distance between **A** and **N** to be large:

$$
d\!\left(\mathbf{A}, \mathbf{P}\right) \leq d\!\left(\mathbf{A}, \mathbf{N}\right)
$$

However, consider a case where the NN maps every image to the same encoding (for example, all zeros). Both distances are then 0, so this trivial solution would still satisfy the condition:

$$
d\!\left(\mathbf{A}, \mathbf{P}\right) - d\!\left(\mathbf{A}, \mathbf{N}\right) \leq 0
$$

To prevent this, we can add a *margin* hyperparameter $\alpha$ such that:

$$
d\!\left(\mathbf{A}, \mathbf{P}\right) - d\!\left(\mathbf{A}, \mathbf{N}\right) + \alpha \leq 0
$$

More formally, we want to define a loss function such that we minimize:

$$
\mathcal{L}(\mathbf{A}, \mathbf{P}, \mathbf{N}) = \max\big(d\!\left(\mathbf{A}, \mathbf{P}\right) - d\!\left(\mathbf{A}, \mathbf{N}\right) + \alpha, 0\big)
$$

$$
\mathcal{J} = \sum_{i=1}^{m} \mathcal{L}\big(\mathbf{A}^{(i)}, \mathbf{P}^{(i)}, \mathbf{N}^{(i)}\big)
$$
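As a rough sketch, the loss and cost above can be written directly on encoding vectors. The function names and the default margin of 0.2 are assumptions for illustration; a real implementation would compute this inside a deep learning framework so that gradients flow back to the encoder.

```python
def squared_l2(u, v):
    """Squared L2 distance between two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """max(d(A, P) - d(A, N) + alpha, 0) for one (anchor, positive, negative) triplet."""
    return max(squared_l2(f_a, f_p) - squared_l2(f_a, f_n) + alpha, 0.0)


def triplet_cost(triplets, alpha=0.2):
    """Sum of the triplet loss over all m triplets."""
    return sum(triplet_loss(f_a, f_p, f_n, alpha) for f_a, f_p, f_n in triplets)
```

Note how the margin rules out the trivial solution: if every encoding is identical, both distances are 0 and the loss equals `alpha` rather than 0.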

For the purpose of training, you need multiple images of the same person.

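During training, if **A**, **P**, and **N** are chosen randomly, the condition $d\!\left(\mathbf{A}, \mathbf{P}\right) + \alpha \leq d\!\left(\mathbf{A}, \mathbf{N}\right)$ is easily satisfied and the network learns little from such triplets. Instead, choose triplets that are hard to train on, i.e. those where $d\!\left(\mathbf{A}, \mathbf{P}\right) \approx d\!\left(\mathbf{A}, \mathbf{N}\right)$.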
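One simple way to find hard triplets, sketched under the assumption that encodings for candidate negatives are already available, is to pick the negative closest to the anchor. This is only one possible selection strategy, and the helper name is hypothetical.

```python
def squared_l2(u, v):
    """Squared L2 distance between two encoding vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def hardest_negative(f_a, negatives):
    """Index of the negative encoding closest to the anchor encoding."""
    distances = [squared_l2(f_a, f_n) for f_n in negatives]
    return min(range(len(negatives)), key=distances.__getitem__)


# Hypothetical encodings: the second negative lies closest to the anchor.
anchor = [0.0, 0.0]
negatives = [[5.0, 5.0], [1.0, 0.0], [3.0, 0.0]]
```

Here `hardest_negative(anchor, negatives)` selects index 1, the negative whose distance to the anchor is smallest and therefore most likely to violate the margin.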

Modern face recognition systems are trained on very large datasets, on the order of millions or even hundreds of millions of images.

## Face Verification and Binary Classification

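An alternative to the triplet loss is to treat face verification as binary classification: feed the two encodings into a logistic regression unit that outputs 1 if both images show the same person and 0 otherwise. The NN then learns the similarity function: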

$$
\hat{y} = \sigma \Big( \sum_{k=1}^{n} w_k \, \big| f\big(x^{(i)}\big)_k - f\big(x^{(j)}\big)_k \big| + b \Big)
$$

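There are other variations, such as replacing the absolute difference with the $\chi^2$ similarity, $\frac{\big(f(x^{(i)})_k - f(x^{(j)})_k\big)^2}{f(x^{(i)})_k + f(x^{(j)})_k}$.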

When a new face image comes in, compute its encoding, compare it with the pre-computed encodings stored in the database, and feed the element-wise differences into the logistic unit.
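The similarity function above can be sketched as follows. The weights `w` and bias `b` are made-up values for illustration; in practice they are learned together with the encoder, and the stored encodings come from the trained network.

```python
import math


def same_person_probability(f_new, f_stored, w, b):
    """Logistic unit on the weighted element-wise absolute differences."""
    z = sum(w_k * abs(a - c) for w_k, a, c in zip(w, f_new, f_stored)) + b
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical parameters: negative weights mean larger differences
# push the prediction toward "different person".
w = [-2.0, -2.0]
b = 1.0

# Encodings are pre-computed once per enrolled person (toy values).
stored_encoding = [0.0, 0.0]
```

Because the stored encodings are computed ahead of time, only the new image has to pass through the network at query time.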