# Computer vision project
## Mahsa Bahri - 98243011

In [None]:
from google.colab import drive
drive.mount('/content/drive')
import sys
sys.path.insert(0,'/content/drive/My Drive/ComputerVisionProject/FaceRecognition')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


# Face verification:

**Face recognition** is the technology that allows computers and machines to match images containing people's faces and their identities.

<div align="center"><img src=https://drive.google.com/uc?export=view&id=1xl32yoGng2q6YZq0yRr4n3hjQjjmd1Vo >
</div>

**Face verification**: For example, at some airports, you can pass through customs by letting a system scan your passport and then verifying that you (the person carrying the passport) are the correct person. A mobile phone that unlocks using your face is also using face verification. This is a 1:1 matching problem.


Face recognition can be divided into multiple steps. The image below shows an example of a face recognition pipeline.

1.   Face detection — Detecting one or more faces in an image.
2.   Feature extraction — Extracting the most important features from an image of the face.
3. Face classification — Classifying the face based on extracted features.



# FaceNet:
We are going to build a face recognition system based on the FaceNet model. But what exactly is FaceNet?

**FaceNet** is a deep neural network used for extracting features from an image of a person’s face.

How does FaceNet work?

<div align="center"><img src=https://arsfutura-production.s3.us-east-1.amazonaws.com/magazine/2019/10/face_recognition/facenet-brki.png width = 500></div>

FaceNet takes an image of the person’s face as input and outputs a vector of 128 numbers which represent the most important features of the face. In machine learning, this vector is called an embedding.

Ok, what do we do with these embeddings? How do we recognise a person using an embedding?

Embeddings are vectors and we can interpret vectors as points in the Cartesian coordinate system. That means we can plot an image of a face in the coordinate system using its embeddings.

<div align="center"><img src= https://arsfutura-production.s3.us-east-1.amazonaws.com/magazine/2019/10/face_recognition/facenet-brki-ana.png width = 500></div>

One possible way of recognising a person in an unseen image is to compute the image's embedding and then its distances to the embeddings of known people. If the embedding is close enough to the embeddings of person A, we say that the image contains the face of person A.
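The nearest-embedding idea can be sketched in a few lines of NumPy. This is only an illustration: the function name `identify`, the 3-dimensional toy vectors, and the threshold of 0.5 are made up for the example and are not part of FaceNet or of this project's code.

```python
import numpy as np

def identify(query, known, threshold):
    """Return (name, distance) of the closest known embedding.

    If even the closest embedding is farther away than `threshold`,
    the name is None (the face is treated as unknown).
    """
    best_name, best_dist = None, float("inf")
    for name, emb in known.items():
        dist = float(np.linalg.norm(query - emb))  # Euclidean distance
        if dist < best_dist:
            best_name, best_dist = name, dist
    if best_dist > threshold:
        return None, best_dist
    return best_name, best_dist

# Toy 3-dimensional "embeddings" (hypothetical; FaceNet produces 128 values).
known = {
    "person_A": np.array([0.0, 0.0, 1.0]),
    "person_B": np.array([1.0, 0.0, 0.0]),
}
query = np.array([0.1, 0.0, 0.9])

name, dist = identify(query, known, threshold=0.5)
print(name, dist)
```

The query lies much closer to `person_A` than to `person_B`, so it is assigned to `person_A`; a query far from both would come back as `None`.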




#### Channels-first notation

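* In this exercise, we will be using a pre-trained model which represents ConvNet activations using a **"channels first"** convention, that is, tensors are laid out as (batch, channels, height, width) rather than (batch, height, width, channels).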
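As a small illustration of the convention, the sketch below converts a dummy channels-last image into a channels-first batch with NumPy. The array contents are placeholders chosen for the example; only the shapes matter.

```python
import numpy as np

# A dummy 96x96 RGB image in channels-last layout: (height, width, channels).
img_hwc = np.zeros((96, 96, 3), dtype=np.float32)
img_hwc[..., 0] = 1.0  # fill the first (red) channel so it can be tracked

# Move the channel axis to the front: (channels, height, width).
img_chw = np.transpose(img_hwc, (2, 0, 1))

# Add a batch axis to obtain (batch, channels, height, width) with batch = 1.
batch = np.expand_dims(img_chw, axis=0)

print(img_hwc.shape, img_chw.shape, batch.shape)
```

After the transpose, `img_chw[0]` is the full red plane of the image, which is what a channels-first network expects as its first input channel.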

## 1 - Encoding face images into a 128-dimensional vector

### 1.1 - Using a ConvNet  to compute encodings

The FaceNet model takes a lot of data and a long time to train. So, following common practice in applied deep learning, let's load weights that someone else has already trained.
<br></br>

Let's start with importing packages:


In [None]:
from keras.models import Sequential, Model
from keras.layers import Conv2D, ZeroPadding2D, Activation, Input, concatenate
from tensorflow.keras.layers import BatchNormalization, Layer
from keras.layers.pooling import MaxPooling2D, AveragePooling2D
from keras.layers.core import Lambda, Flatten, Dense
from keras.initializers import glorot_uniform
from keras import backend as K
K.set_image_data_format('channels_first')
import cv2
import os
import sys
import numpy as np
from numpy import genfromtxt
import pandas as pd
import tensorflow as tf
from fr_utils import *
from inception_blocks_v2 import *

%matplotlib inline
%load_ext autoreload
%autoreload 2

np.set_printoptions(threshold=sys.maxsize)

The autoreload extension is already loaded. To reload it, use:
  %reload_ext autoreload


The key things you need to know are:

- This network uses 96x96 RGB images as its input. Specifically, it takes a face image (or a batch of $m$ face images) as a tensor of shape $(m, n_C, n_H, n_W) = (m, 3, 96, 96)$.
- It outputs a matrix of shape $(m, 128)$ that encodes each input face image into a 128-dimensional vector.

Run the cell below to create the model for face images.

In [None]:
FRmodel = faceRecoModel(input_shape=(3, 96, 96))
print("Total Params:", FRmodel.count_params())

Total Params: 3743280


By using a 128-neuron fully connected layer as its last layer, the model ensures that the output is an encoding vector of size 128. You then use the encodings to compare two face images as follows:


<img src= "https://drive.google.com/uc?export=view&id=1g1SiAAOZRVYTgVq0EfDl7rO2ZsI-NGP0" style="width:380px;height:150px;">

By computing the distance between two encodings and thresholding it, you can determine whether the two pictures show the same person.

So, an encoding is a good one if:
- The encodings of two images of the same person are quite similar to each other.
- The encodings of two images of different persons are very different.

The triplet loss function formalizes this, and tries to "push" the encodings of two images of the same person (Anchor and Positive) closer together, while "pulling" the encodings of two images of different persons (Anchor, Negative) further apart.

### 1.2 - The Triplet Loss

For an image $x$, we denote its encoding $f(x)$, where $f$ is the function computed by the neural network.

<img src= "https://drive.google.com/uc?export=view&id=1JJBID2gNhK08omJpURxd_KWncm1tK3dF" style="width:380px;height:150px;">


<!--
We will also add a normalization step at the end of our model so that $\mid \mid f(x) \mid \mid_2 = 1$ (means the vector of encoding should be of norm 1).
!-->

Training will use triplets of images $(A, P, N)$:  

- A is an "Anchor" image--a picture of a person.
- P is a "Positive" image--a picture of the same person as the Anchor image.
- N is a "Negative" image--a picture of a different person than the Anchor image.

These triplets are picked from our training dataset. We will write $(A^{(i)}, P^{(i)}, N^{(i)})$ to denote the $i$-th training example.

You'd like to make sure that an image $A^{(i)}$ of an individual is closer to the Positive image $P^{(i)}$ than to the Negative image $N^{(i)}$ by at least a margin $\alpha$:

$$\mid \mid f(A^{(i)}) - f(P^{(i)}) \mid \mid_2^2 + \alpha < \mid \mid f(A^{(i)}) - f(N^{(i)}) \mid \mid_2^2$$

You would thus like to minimize the following "triplet cost":

$$\mathcal{J} = \sum^{m}_{i=1} \large[ \small \underbrace{\mid \mid f(A^{(i)}) - f(P^{(i)}) \mid \mid_2^2}_\text{(1)} - \underbrace{\mid \mid f(A^{(i)}) - f(N^{(i)}) \mid \mid_2^2}_\text{(2)} + \alpha \large ] \small_+ \tag{3}$$

Here, we are using the notation "$[z]_+$" to denote $\max(z, 0)$.

Notes:
- The term (1) is the squared distance between the anchor "A" and the positive "P" for a given triplet; you want this to be small.
- The term (2) is the squared distance between the anchor "A" and the negative "N" for a given triplet; you want this to be relatively large. It has a minus sign preceding it because minimizing the negative of the term is the same as maximizing that term.
- $\alpha$ is called the margin. It is a hyperparameter that you pick manually. We will use $\alpha = 0.2$.

Most implementations also rescale the encoding vectors to have an L2 norm equal to one (i.e., $\mid \mid f(img) \mid \mid_2 = 1$); you won't have to worry about that in this assignment.
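For reference, rescaling to unit L2 norm can be sketched as follows. The helper name `l2_normalize` and the `eps` guard against division by zero are assumptions of this example, not part of the pre-trained model's code.

```python
import numpy as np

def l2_normalize(encodings, eps=1e-12):
    """Scale each row of `encodings` so that its L2 norm equals one."""
    norms = np.linalg.norm(encodings, axis=-1, keepdims=True)
    return encodings / np.maximum(norms, eps)  # eps avoids dividing by zero

# Two toy 2-dimensional "encodings" (hypothetical values).
enc = np.array([[3.0, 4.0],
                [0.0, 2.0]])
unit = l2_normalize(enc)
print(unit)
print(np.linalg.norm(unit, axis=-1))
```

With unit-norm encodings, the squared distance between any two encodings lies in $[0, 4]$, which makes a fixed margin $\alpha$ easier to interpret.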

**Exercise**: Implement the triplet loss as defined by formula (3). Here are the 4 steps:
1. Compute the distance between the encodings of "anchor" and "positive": $\mid \mid f(A^{(i)}) - f(P^{(i)}) \mid \mid_2^2$
2. Compute the distance between the encodings of "anchor" and "negative": $\mid \mid f(A^{(i)}) - f(N^{(i)}) \mid \mid_2^2$
3. Compute the formula per training example: $\mid \mid f(A^{(i)}) - f(P^{(i)}) \mid \mid_2^2 - \mid \mid f(A^{(i)}) - f(N^{(i)}) \mid \mid_2^2 + \alpha$
4. Compute the full formula by taking the max with zero and summing over the training examples:
$$\mathcal{J} = \sum^{m}_{i=1} \large[ \small \mid \mid f(A^{(i)}) - f(P^{(i)}) \mid \mid_2^2 - \mid \mid f(A^{(i)}) - f(N^{(i)}) \mid \mid_2^2+ \alpha \large ] \small_+ \tag{3}$$



In [None]:
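def triplet_loss(y_true, y_pred, alpha = 0.2):
    # y_true is unused; it is only here to match the Keras loss signature.
    # y_pred holds the anchor, positive and negative encodings, in that order.
    anchor, positive, negative = y_pred[0], y_pred[1], y_pred[2]
    # Squared distances anchor-positive and anchor-negative.
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis = -1)
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis = -1)
    # Per-example loss before clipping.
    basic_loss = pos_dist - neg_dist + alpha
    # Clip at zero and sum over the training examples.
    loss = tf.reduce_sum(tf.maximum(basic_loss, 0.0))
    return loss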
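To check the formula by hand, here is a NumPy-only version of the same computation applied to two hand-picked triplets. The function name `triplet_loss_np` and the toy encodings are made up for this check; the notebook itself uses the TensorFlow function above.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, alpha=0.2):
    """NumPy version of formula (3) for batches of encodings."""
    pos_dist = np.sum(np.square(anchor - positive), axis=-1)
    neg_dist = np.sum(np.square(anchor - negative), axis=-1)
    return float(np.sum(np.maximum(pos_dist - neg_dist + alpha, 0.0)))

# Two toy triplets with 2-dimensional encodings (hypothetical values).
anchor   = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [1.0, 0.0]])
negative = np.array([[1.0, 0.0], [0.5, 0.0]])

# Triplet 1: 0.01 - 1.00 + 0.2 = -0.79, clipped to 0 (margin already satisfied).
# Triplet 2: 1.00 - 0.25 + 0.2 =  0.95 (positive is farther than the negative).
loss = triplet_loss_np(anchor, positive, negative)
print(loss)
```

Only the second triplet contributes to the loss, because in the first one the negative is already farther from the anchor than the positive by more than the margin.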

## 2 - Loading the pre-trained model

FaceNet is trained by minimizing the triplet loss. But since training requires a lot of data and a lot of computation, we won't train it from scratch here. Instead, we load a previously trained model. Load a model using the following cell; this might take a couple of minutes to run.

In [None]:
FRmodel.compile(optimizer = 'adam', loss = triplet_loss, metrics = ['accuracy'])
load_weights_from_FaceNet(FRmodel)

Next we build a database containing one encoding vector per reference image. To generate an encoding we use `img_to_encoding(image_path, model)`, defined in the following cell, which runs the forward propagation of the model on the specified image.

In [None]:
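def img_to_encoding(image_path, model):
    # Load the image as an (H, W, 3) array; the model expects 96x96 inputs.
    img = np.asarray(tf.keras.preprocessing.image.load_img(image_path))
    # Reorder to channels-first (3, H, W) and scale pixel values to [0, 1].
    img = np.around(np.transpose(img, (2,0,1))/255.0, decimals=12)
    # Add the batch dimension: (1, 3, H, W).
    x_train = np.expand_dims(img, axis=0)
    embedding = model.predict_on_batch(x_train)
    return embedding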

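Now run the following code to build the database (represented as a Python dictionary). It maps each label (a person's name plus an image index) to a 128-dimensional encoding of that face image.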

In [None]:
database = {}
base_image_path = "/content/drive/My Drive/ComputerVisionProject/FaceRecognition/images/"

database["Messi1"] = img_to_encoding(base_image_path+"Messi1.jpeg", FRmodel)
database["Messi2"] = img_to_encoding(base_image_path+"Messi2.jpeg", FRmodel)
database["Brad Pitt1"] = img_to_encoding(base_image_path+"Brad_Pitt1.jpeg", FRmodel)
database["Brad Pitt2"] = img_to_encoding(base_image_path+"Brad_Pitt2.jpeg", FRmodel)
database["Dicaprio1"] = img_to_encoding(base_image_path+"dicaprio1.jpeg", FRmodel)
database["Dicaprio2"] = img_to_encoding(base_image_path+"dicaprio2.jpg", FRmodel)
database["Tom Hardy1"] = img_to_encoding(base_image_path+"Tom_Hardy1.jpg", FRmodel)
database["Tom Hardy2"] = img_to_encoding(base_image_path+"Tom_Hardy2.jpg", FRmodel)


## 3 - Verify

Implementation of the verification functions:


1. Compute the encoding of the image from `image_path`.
2. Compute the distance between this encoding and either the encoding of the identity stored in the database (`verify_by_name`) or the encoding of a second image (`verify_by_iamge`).
3. The result is "Same person" if the distance is less than 0.8, else "Different persons".



In [None]:
def verify_by_name(image_path, identity, database, model):
    encoding = img_to_encoding(image_path, model)
    dist = np.linalg.norm(encoding - database[identity])
    if dist < 0.8:
        print("Same person")
    else:
        print("Different persons")
    return dist

def verify_by_iamge(src_image_path, dest_image_path, database, model):
    src_encoding = img_to_encoding(src_image_path, model)
    dest_encoding = img_to_encoding(dest_image_path, model)
    dist = np.linalg.norm(src_encoding - dest_encoding)
    if dist < 0.8:
        print("Same person")
    else:
        print("Different persons")
    return dist


<img src="https://drive.google.com/uc?export=view&id=1Vhjmgzobmysgu3xe-_aKFCOfVqelo_7v" >
<img src="https://drive.google.com/uc?export=view&id=12NlFynmoW0posFko96Dhe4y6EX8ligmQ">

In [None]:
verify_by_iamge(base_image_path+"Messi1.jpeg", base_image_path+"Messi2.jpeg", database, FRmodel)

(1, 3, 96, 96)
(1, 3, 96, 96)
Same person


0.7961557


<img src="https://drive.google.com/uc?export=view&id=1GsfY09FrIoslko4NU8Ipl-n_S3_1E9-b" >

<img src="https://drive.google.com/uc?export=view&id=17EQ_2fY_wunhu1QI5yDdhFfr3dJhfO7p">




In [None]:
verify_by_iamge(base_image_path+"Tom_Hardy1.jpg", base_image_path+"Tom_Hardy2.jpg", database, FRmodel)

(1, 3, 96, 96)
(1, 3, 96, 96)
Same person


0.7357289

<img src="https://drive.google.com/uc?export=view&id=1GsfY09FrIoslko4NU8Ipl-n_S3_1E9-b" >
<img src="https://drive.google.com/uc?export=view&id=1Vhjmgzobmysgu3xe-_aKFCOfVqelo_7v" >


In [None]:
verify_by_iamge(base_image_path+"Tom_Hardy1.jpg", base_image_path+"Messi1.jpeg", database, FRmodel)

(1, 3, 96, 96)
(1, 3, 96, 96)
Different persons


0.8464964


<img src="https://drive.google.com/uc?export=view&id=1H_SuC0pDQ60YX1SHsc_UuHqDeS_5AxUN" >
<img src="https://drive.google.com/uc?export=view&id=1LnVLtRsqULwK-MmiNN5NIO4MjtzTbQI0" >


In [None]:
verify_by_iamge(base_image_path+"dicaprio1.jpeg", base_image_path+"dicaprio2.jpg", database, FRmodel)

(1, 3, 96, 96)
(1, 3, 96, 96)
Same person


0.52137023


<img src="https://drive.google.com/uc?export=view&id=1H_SuC0pDQ60YX1SHsc_UuHqDeS_5AxUN" >
<img src="https://drive.google.com/uc?export=view&id=1GsfY09FrIoslko4NU8Ipl-n_S3_1E9-b" >

In [None]:
verify_by_iamge(base_image_path+"dicaprio1.jpeg", base_image_path+"Tom_Hardy1.jpg", database, FRmodel)

Different persons


0.88185465


<img src="https://drive.google.com/uc?export=view&id=18NCBEe7sdlpRyGDHAU9nZdxHirMmlnTe" >
<img src="https://drive.google.com/uc?export=view&id=1-IPNEIAAwJK5P4qXcD840sU1pizI9HcJ" >

In [None]:
verify_by_iamge(base_image_path+"Brad_Pitt1.jpeg", base_image_path+"Brad_Pitt2.jpeg", database, FRmodel)

Same person


0.6201238

<img src="https://drive.google.com/uc?export=view&id=1Vhjmgzobmysgu3xe-_aKFCOfVqelo_7v" >
<img src="https://drive.google.com/uc?export=view&id=1H_SuC0pDQ60YX1SHsc_UuHqDeS_5AxUN" >

In [None]:
verify_by_iamge(base_image_path+"Messi1.jpeg", base_image_path+"dicaprio1.jpeg", database, FRmodel)

Different persons


0.8456635
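The distances printed above can be collected to see how much room the 0.8 threshold leaves. The sketch below only re-uses the numbers from the cell outputs; the dictionary names are made up for the example, and seven pairs are far too few to draw general conclusions.

```python
# Distances copied from the cell outputs above.
same_person = {
    "Messi1 / Messi2": 0.7961557,
    "Tom_Hardy1 / Tom_Hardy2": 0.7357289,
    "dicaprio1 / dicaprio2": 0.52137023,
    "Brad_Pitt1 / Brad_Pitt2": 0.6201238,
}
different_persons = {
    "Tom_Hardy1 / Messi1": 0.8464964,
    "dicaprio1 / Tom_Hardy1": 0.88185465,
    "Messi1 / dicaprio1": 0.8456635,
}

threshold = 0.8
max_same = max(same_person.values())        # hardest same-person pair
min_diff = min(different_persons.values())  # hardest different-person pair

# The threshold separates the two groups only if it falls strictly between them.
separable = max_same < threshold < min_diff
gap = min_diff - max_same
print(separable, gap)
```

On these pairs the threshold does separate the two groups, but the gap between the hardest same-person pair and the hardest different-person pair is only about 0.05, so the value 0.8 should be treated as tuned to this small sample.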

# Alternative Approaches to Similarity Learning

### Ranking Loss Functions: Metric Learning
Unlike other loss functions, such as cross-entropy loss or mean squared error loss, whose objective is to learn to directly predict a label, a value, or a set of values given an input, the objective of ranking losses is to predict relative distances between inputs. This task is often called metric learning.

Ranking loss functions are very flexible in terms of training data: we just need a similarity score between data points to use them. That score can be binary (similar / dissimilar). As an example, imagine a face verification dataset where we know which face images belong to the same person (similar) and which do not (dissimilar). Using a ranking loss function, we can train a CNN to infer whether two face images belong to the same person.
To use a ranking loss function, we first extract features from two (or three) input data points and get an embedded representation for each of them. Then we define a metric function to measure the similarity between those representations, for instance the Euclidean distance. Finally, we train the feature extractors to produce similar representations when the inputs are similar and distant representations when they are dissimilar.
We don’t even care about the values of the representations, only about the distances between them. Nevertheless, this training methodology has been shown to produce powerful representations for different tasks.

### Pairwise Ranking Loss
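As a rough illustration, a commonly used margin-based form of a pairwise ranking loss penalizes the distance itself for similar pairs and penalizes dissimilar pairs only when they are closer than a margin. The sketch below assumes that form; the function name, the margin of 1.0, and the toy vectors are choices made for this example.

```python
import numpy as np

def pairwise_ranking_loss(r_a, r_b, is_similar, margin=1.0):
    """Per-pair loss for batches of representations r_a and r_b.

    Similar pairs:    loss = d(r_a, r_b)
    Dissimilar pairs: loss = max(0, margin - d(r_a, r_b))
    where d is the Euclidean distance.
    """
    d = np.linalg.norm(r_a - r_b, axis=-1)
    return np.where(is_similar, d, np.maximum(0.0, margin - d))

# Three toy pairs of 2-dimensional representations (hypothetical values).
r_a = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
r_b = np.array([[0.3, 0.4], [0.3, 0.4], [3.0, 4.0]])
is_similar = np.array([True, False, False])

losses = pairwise_ranking_loss(r_a, r_b, is_similar)
print(losses)
```

The first two pairs have the same distance (0.5) but are penalized for opposite reasons: the similar pair for being apart at all, the dissimilar pair for being inside the margin. The third pair is dissimilar and already far apart, so it contributes nothing.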


### ArcFace Loss and the Angle Margin Penalty
There are many alternatives to the triplet loss; one of them is the ArcFace loss. It is based on the cross-entropy loss and aims to widen the decision margin between classes, which groups similar data points closer together.
ArcFace works on a hypersphere: it increases the angles between embeddings of different classes (inter-class) and decreases the angles between embeddings of the same class (intra-class). To do this, an angular margin penalty is added to the angle between the embedding and the weight vector of its true class before the logit is computed.

The angular margin penalty penalizes embeddings that stray far from their class and pulls the embeddings of each class closer together.
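A minimal NumPy sketch of the idea follows: normalize embeddings and class weights, take the angle between them, add a margin `m` to the true-class angle only, and scale by `s` before the usual cross-entropy. The function names and the toy values of `s`, `m`, the embeddings and the weights are assumptions of this example. The sketch also ignores the case where the angle plus the margin exceeds $\pi$, which practical implementations treat specially.

```python
import numpy as np

def arcface_logits(embeddings, weights, labels, s, m):
    """Scaled cosine logits with an additive angular margin on the true class."""
    # After L2 normalization, the dot product equals cos(theta).
    e = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos_theta = np.clip(e @ w.T, -1.0, 1.0)   # shape (batch, classes)
    theta = np.arccos(cos_theta)
    rows = np.arange(len(labels))
    theta[rows, labels] += m                  # margin on the true class only
    return s * np.cos(theta)

def cross_entropy(logits, labels):
    """Mean softmax cross-entropy, computed in a numerically stable way."""
    z = logits - logits.max(axis=1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

# Toy setup: 2 samples, 2 classes, 2-dimensional embeddings (hypothetical).
embeddings = np.array([[1.0, 0.0], [0.0, 2.0]])
weights = np.array([[1.0, 0.0], [0.0, 1.0]])  # one row per class
labels = np.array([0, 1])

without_margin = arcface_logits(embeddings, weights, labels, s=4.0, m=0.0)
with_margin = arcface_logits(embeddings, weights, labels, s=4.0, m=0.5)

loss_plain = cross_entropy(without_margin, labels)
loss_margin = cross_entropy(with_margin, labels)
print(loss_plain, loss_margin)
```

Even for embeddings that point exactly at their class weight, the margin lowers the true-class logit and therefore raises the loss, so training keeps pushing each class into a tighter angular region.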

# What are Vector Embeddings?
ML algorithms, like most software algorithms, need numbers to work with. Sometimes we have a dataset with columns of numeric values or values that can be translated into them (ordinal, categorical, etc). Other times we come across something more abstract like an entire document of text. We create vector embeddings, which are just lists of numbers, for data like this to perform various operations with them. A whole paragraph of text or any other object can be reduced to a vector. Even numerical data can be turned into vectors for easier operations.

<img src= "https://d33wubrfki0l68.cloudfront.net/4b4ae6760dab99a18438671111f77e28498a2fb4/1093c/images/vector_embeddings.jpg" style="width:380px;height:150px;">

### Creating Vector Embeddings
One way of creating vector embeddings is to engineer the vector values using domain knowledge. This is known as feature engineering. For example, in medical imaging, we use medical expertise to quantify a set of features such as shape, color, and regions in an image that capture the semantics. However, engineering vector embeddings requires domain knowledge, and it is too expensive to scale.

Instead of engineering vector embeddings, we often train models to translate objects to vectors. A deep neural network is a common tool for training such models. The resulting embeddings are usually high dimensional (up to two thousand dimensions) and dense (most values are non-zero). For text data, models such as Word2Vec, GloVe, and BERT transform words, sentences, or paragraphs into vector embeddings.
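One of the simplest operations on embeddings is comparing them. The sketch below computes the cosine similarity between made-up 4-dimensional vectors; the values and the labels `cat`, `kitten` and `car` are invented for illustration and do not come from any trained model.

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Hypothetical embeddings: related objects are given similar vectors.
cat = np.array([0.9, 0.8, 0.1, 0.0])
kitten = np.array([0.8, 0.9, 0.2, 0.0])
car = np.array([0.0, 0.1, 0.9, 0.8])

sim_related = cosine_similarity(cat, kitten)
sim_unrelated = cosine_similarity(cat, car)
print(sim_related, sim_unrelated)
```

A trained embedding model aims for the same pattern: vectors of semantically related inputs score a higher similarity than vectors of unrelated inputs.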


### References:

- The explanation of FaceNet was taken from this [website](https://arsfutura.com/magazine/face-recognition-with-facenet-and-mtcnn/).
- I got more information about similarity learning from these two websites: [website1](https://gombru.github.io/2019/04/03/ranking_loss/), [website2](https://towardsdatascience.com/novel-approaches-to-similarity-learning-e680c61d53cd#:~:text=Alternatives%3A%20ArcFace%20Loss%20and%20the,similar%20data%20points%20closer%20together.).
- You can read more about vector embeddings [here](https://www.pinecone.io/learn/vector-embeddings/).
- The pretrained model we use is inspired by Victor Sy Wang's implementation and was loaded using his [code](https://github.com/iwantooxxoox/Keras-OpenFace).
- Our implementation also took a lot of inspiration from this [repository](https://github.com/amanchadha/coursera-deep-learning-specialization/blob/master/C4%20-%20Convolutional%20Neural%20Networks/Week%204/Face%20Recognition/Face_Recognition_v3a.ipynb).
