<a href="https://colab.research.google.com/github/marcelarosalesj/e2e-vision-apps/blob/main/Week_3_Project_Face_ID.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#### This project is from [Abubakar Abid's](https://twitter.com/abidlabs) course: *Building Computer Vision Applications* on CoRise. Learn more about the course [here](https://corise.com/course/vision-applications).

# Week 3 Project: Building a Facial Identity Recognition System

Welcome to the third week's project for *Building Computer Vision Applications*!

This week, we are going to get familiar with the key steps of machine learning, with a particular focus on image embedding. Specifically, we will cover:

* finding pretrained image embedding models and using them on our own data 👾
* building an image dataset and uploading it to the Hugging Face Hub 📖
* measuring the performance of an image embedding model on test data and the real world 📈
* building a facial identity recognition app you can run on your phone or laptop 📷


# Introduction

[Face ID](https://en.wikipedia.org/wiki/Face_ID) was introduced by Apple in 2017 as an alternative to fingerprint-based authentication for iPhones. The way that Face ID works is that it uses infrared projectors that shine around 30,000 infrared dots onto a user's face. Then an infrared camera reads the reflections to come up with an infrared "image" of a person's face. Using neural networks, Face ID predicts if the recorded infrared image is similar enough to a stored profile, in which case the phone unlocks.

In this project, we will recreate the last part of this process: building an application that can recognize whether two faces belong to the same person, based on optical pictures (i.e. regular images, not infrared images) of their faces. This is quite a difficult problem because it requires us to perform two tasks simultaneously: (1) tell apart two people who may look quite similar, and (2) recognize that two photos of the same person, potentially taken in very different lighting, clothing, or other conditions, show the same person. To do this, we will use models for *image embedding*, which convert any image into a numerical vector called an embedding. Embeddings make images easier to compare, because the distance between two embeddings can be a meaningful signal of how similar the respective images are.
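To make the idea of comparing embeddings concrete, here is a minimal sketch of cosine similarity on hand-made vectors. The three vectors are made-up stand-ins for embeddings, not outputs of any model, and `cosine_sim` is a helper name chosen for this illustration.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between two vectors: 1 = same direction, 0 = orthogonal."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Made-up "embeddings" (real ones have hundreds of dimensions)
emb_a = [1.0, 2.0, 3.0]
emb_b = [2.0, 4.0, 6.0]    # same direction as emb_a, just longer
emb_c = [-3.0, 0.0, 1.0]   # orthogonal to emb_a: 1*(-3) + 2*0 + 3*1 = 0

sim_ab = cosine_sim(emb_a, emb_b)
sim_ac = cosine_sim(emb_a, emb_c)
print(sim_ab, sim_ac)
```

Because cosine similarity ignores vector length, `emb_a` and `emb_b` count as maximally similar even though one is twice as long as the other.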

By the end of this project, you'll have built an app that takes in two pictures from your webcam and will predict whether they are the same person or not. This can form the basis of facial identity recognition software. It will look something like this:

![](https://i.ibb.co/T0cDVLs/image.png)

# Step 0: Hardware Setup & Software Libraries

We will be using GPUs to run our machine learning models, so we need to make sure that our Colab notebook is set up correctly. Go to the menu bar and click on Runtime > Change runtime type > Hardware accelerator and **make sure it is set to GPU**. Your Colab notebook may restart once you make the change.

We're going to be using some fantastic open-source Python libraries to upload our dataset (`datasets`), load our model (`sentence-transformers`), evaluate our model (`scikit-learn`), and build a demo of our model (`gradio`). So let's go ahead and install all of these libraries. 

In [None]:
!pip install datasets huggingface_hub sentence-transformers scikit-learn gradio

In Week 2, you uploaded a model to your Hugging Face account programmatically. This week, you'll be uploading a dataset! The first step is to log in using your Hugging Face token:

In [None]:
from huggingface_hub import notebook_login

In [None]:
notebook_login()

# Step 1: Loading Pretrained Image Embedding Models

In this project, we will be loading several pretrained image embedding models and comparing their performance. In particular, we will compare:

* https://huggingface.co/sentence-transformers/clip-ViT-B-16
* https://huggingface.co/sentence-transformers/clip-ViT-B-32
* https://huggingface.co/sentence-transformers/clip-ViT-L-14

* **1a. Compare the models**

When considering which machine learning model to use for a particular task, there are several things to consider:
* The metrics that are relevant to you
* The size of the model
* The inference time of the model

Which of these models has the best reported performance on the model card? Which has the worst reported performance? [ANSWER HERE]


What task was the performance reported on? What does this task mean? [ANSWER HERE]


Which of these models has the largest size on disk? Which is the smallest? *Hint*: look for the PyTorch binary file. [ANSWER HERE]


What model do you expect to run the fastest? The slowest? [ANSWER HERE]



* **1b. Load one of the models**

Pick one of the three models above and load it using the [Sentence Transformers](https://www.sbert.net/) library.

In [None]:
from sentence_transformers import SentenceTransformer

model = # ANSWER HERE

* **1c. Use the model to embed a few photos with faces**

The following code downloads and displays 3 images from the web. We will use the `SentenceTransformer` you downloaded above to embed these images.

In [None]:
from PIL import Image
from io import BytesIO
import requests

urls = {
    "https://live.staticflickr.com/5551/14616229927_7ed70f7836_b.jpg": "Robert Downey Jr",
    "https://live.staticflickr.com/3849/14800476884_6dbda11c8c_b.jpg": "Robert Downey Jr",
    "https://live.staticflickr.com/8187/8138909428_2d9e94332a.jpg": "Gwyneth Paltrow"
}

image_list = []

for index, (url, label) in enumerate(urls.items()):
  response = requests.get(url)
  img = Image.open(BytesIO(response.content))
  image_list.append(img)
  print(label)
  display(img)

Use the `SentenceTransformer` model to embed these images.

In [None]:
img_emb = # ANSWER HERE

* **1d. Explore the embeddings**

* What are the dimensions of the images we downloaded? 

In [None]:
# ANSWER HERE

* What is the dimensionality of the embedding for each image?


In [None]:
# ANSWER HERE



* Do the dimensions of the images affect the dimensionality of the embedding? 

[ANSWER HERE]

* Finally, let's compare how similar the image embeddings are to each other. We will use [*cosine similarity*](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html), the metric we discussed in lecture to compare image similarity.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# ANSWER HERE

According to the embedding model:
* What is the cosine similarity between image 1 and image 2? [ANSWER HERE]
* What is the cosine similarity between image 1 and image 3? [ANSWER HERE]

Is this what you expected? [ANSWER HERE]

# Step 2: Finding and Uploading a Dataset

In order to test our embedding models more systematically, we'll need an entire dataset, not just a few samples. For this assignment, you will build your own dataset by downloading images of celebrities' faces. The purpose of this step is to think about how to build a representative dataset.

Here are some things to consider as you build your own dataset:

* Dataset diversity: choose **at least 6 different celebrities** (Can you choose celebrities of different ages, ethnicities, genders? What other considerations are important here?) 
* Dataset size: Since we are not training an image embedding model from scratch, but simply evaluating different models, we will not require a particularly large dataset. Please have **at least 3 images per celebrity** (so your total dataset size should be at least 18 images.)
* Dataset consistency: all of the images in the dataset should consist primarily of **celebrity faces** only
* Dataset balance: you may want to have a dataset that is relatively balanced among the different celebrities
* Dataset license: you should make sure to use images under a permissive license, such as Creative Commons.

We suggest using either [Openverse](https://wordpress.org/openverse/) or [Flickr](https://flickr.com/) to easily find images that are under a Creative Commons license. 

First, create a dictionary whose keys are image URLs and whose values are the names of the celebrities (similar to the `urls` dictionary in step 1c):

In [None]:
urls = {
    # image url 1: label 1
    # image url 2: label 2
    # ...    
}

Then, run the following code to download the images and save them into organized folders:

In [None]:
from PIL import Image
from io import BytesIO
import requests
import os

image_list = []
root_path = "celebrity_images/train"

def strip_invalid_filename_characters(filename) -> str:
    filename = filename.replace(" ", "_")
    return ("".join([char for char in filename if char.isalnum() or char in "._- "]))

# Save the images into an organized directory structure
for index, (url, label) in enumerate(urls.items()):
  class_path = os.path.join(root_path, strip_invalid_filename_characters(label))
  os.makedirs(class_path, exist_ok=True)

  response = requests.get(url)
  img = Image.open(BytesIO(response.content))
  filename = url.split("/")[-1]
  img.save(os.path.join(class_path, filename))

  image_list.append(img)
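The folder names created above matter because the `ImageFolder` builder infers each image's label from the name of its directory. The sketch below repeats the sanitizing helper on its own so you can check what folder name a label turns into. The example names are illustrative and not part of any required dataset.

```python
def strip_invalid_filename_characters(filename) -> str:
    # Same logic as the helper above: spaces become underscores,
    # then anything that is not alphanumeric or one of "._-" is dropped.
    filename = filename.replace(" ", "_")
    return "".join(char for char in filename if char.isalnum() or char in "._- ")

# Illustrative labels only
example_labels = ["Robert Downey Jr", "Lupita Nyong'o", "Samuel L. Jackson"]
folder_names = [strip_invalid_filename_characters(label) for label in example_labels]

for label, folder in zip(example_labels, folder_names):
    print(f"{label!r} -> celebrity_images/train/{folder}/")
```

One thing to watch for: the download loop names each file after the last segment of its URL, so two different URLs that end in the same filename for the same celebrity would overwrite each other.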

Now, load this dataset with the `datasets` library, using the `ImageFolder` dataset builder. You might find this reference useful: https://huggingface.co/docs/datasets/image_load#imagefolder

In [None]:
from datasets import load_dataset

dataset = # ANSWER HERE

In [None]:
dataset['train']

Answer a few questions about the dataset you've built:

* Dataset size: How many images are in your dataset? How many different "labels" are there? [ANSWER HERE] 

* Dataset diversity: What diversity / representativeness considerations did you take into account when building your dataset? [ANSWER HERE] 


* **Push the Dataset to the Hugging Face Hub**

Now that you have a dataset, upload it to the Hugging Face Hub so that you can share it with others! Here is some information about uploading datasets to the Hub: https://huggingface.co/docs/datasets/upload_dataset#upload-with-python


In [None]:
dataset.push_to_hub("celeb-identities")

Once you've uploaded your dataset, you should be able to preview it and see the number of samples and the label of each sample!

What is the URL of your dataset? [ANSWER HERE]

Please make sure that the dataset is **public**

# Step 3: Evaluating the Embedding Models on your Dataset

Now, let's evaluate each of the models on the dataset you've built. For this example, we will evaluate how well the embeddings from the same celebrity "cluster" together.

We now have to decide on a *metric* to measure the performance of our machine learning models. There are many different metrics for assessing the quality of clustering results. We will use the silhouette score (SS). This measure has a range of [-1, 1] and is calculated for each sample from the mean intra-cluster distance ($a$) and the mean nearest-cluster distance ($b$). The SS for a _single sample_ is $(b - a) / \max(a, b)$, where $a$ is the mean distance between the sample and the other samples in its own cluster, and $b$ is the mean distance between the sample and the samples in the nearest cluster that the sample is not a part of.

![](https://uploads-ssl.webflow.com/5f5148a709e16c7d368ea080/5f7dea907b8e8c7769e769c8_5f7c9650bc3b1ed0ad2247eb_silhouette_formula.jpg)

We then take the average value of the SS across all samples to get a single SS for our entire dataset.

In this case, we will be using the `sklearn.metrics.silhouette_score` function which takes in two required parameters: (1) a matrix consisting of the embeddings of a list of samples and (2) a list of labels. 
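To see what the formula does, here is a small NumPy sketch that computes the silhouette score by hand on toy data. It is an illustration of the definition, not the scikit-learn implementation. `silhouette_score_sketch` is a name invented for this example, and it assumes Euclidean distance (the default metric of `silhouette_score`) and at least two distinct labels.

```python
import numpy as np

def silhouette_score_sketch(embeddings, labels):
    """Mean of (b - a) / max(a, b) over all samples, using Euclidean distance."""
    X = np.asarray(embeddings, dtype=float)
    labels = np.asarray(labels)
    unique_labels = np.unique(labels)
    # Pairwise distance matrix, shape (n_samples, n_samples)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    scores = []
    for i in range(len(X)):
        own = labels == labels[i]
        own[i] = False                      # exclude the sample itself
        if not own.any():
            scores.append(0.0)              # convention: a lone sample scores 0
            continue
        a = dists[i, own].mean()            # mean intra-cluster distance
        b = min(dists[i, labels == other].mean()   # mean distance to nearest other cluster
                for other in unique_labels if other != labels[i])
        scores.append((b - a) / max(a, b))
    return float(np.mean(scores))

# Toy 1-D "embeddings": two tight, well-separated groups
points = [[0.0], [1.0], [10.0], [11.0]]
labels = [0, 0, 1, 1]
score = silhouette_score_sketch(points, labels)
print(round(score, 3))
```

For the first point, $a = 1$ and $b = (10 + 11) / 2 = 10.5$, so its score is $9.5 / 10.5 \approx 0.905$. Tight, well-separated groups like these push the overall score towards 1.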

Using each of these three models:
* https://huggingface.co/sentence-transformers/clip-ViT-B-16
* https://huggingface.co/sentence-transformers/clip-ViT-B-32
* https://huggingface.co/sentence-transformers/clip-ViT-L-14

create embeddings for all of the training images of the celebrities, and then compute the SS metric on those embeddings. Also record how long each model takes to compute the embeddings.
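One possible way to record the running time is to wrap the embedding call in a small timing helper, as sketched below. `timed_embed` and `fake_encode` are names invented for this illustration. The fake encoder only exists so the sketch runs without downloading a model; with a real `SentenceTransformer` you would pass `model.encode` and your list of PIL images instead.

```python
import time
import numpy as np

def timed_embed(encode_fn, images):
    """Call encode_fn(images) and return (embeddings, elapsed_seconds)."""
    start = time.perf_counter()
    embeddings = np.asarray(encode_fn(images))
    elapsed = time.perf_counter() - start
    return embeddings, elapsed

def fake_encode(images):
    # Stand-in for model.encode: one 4-dimensional vector per "image"
    return np.ones((len(images), 4))

embeddings, seconds = timed_embed(fake_encode, ["img1", "img2", "img3"])
print(embeddings.shape, f"{seconds:.4f}s")
```

The returned embeddings, together with the labels from your dataset, are the two inputs that `silhouette_score` needs. When timing real models, keep in mind that the first call on a GPU may include some one-off setup time.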


In [None]:
# Load all of the models 

models = [
    # ANSWER HERE
]

In [None]:
import time
from sklearn.metrics import silhouette_score

for model in models:
  # ANSWER HERE

What SS and running time do you get with each model? [ANSWER HERE]

If you had to pick one model to use for facial identity recognition, which one would it be? Why? 

[ANSWER HERE]

# Step 4: Choosing a Similarity Threshold

In order to use this model for facial identity recognition, we need to choose a _similarity threshold_. If two faces are similar enough to each other (above this threshold), we will classify them as being from the same person. If they are below this threshold, we will classify them as different people. Using the model you identified in the previous part, let's first compute the average similarity between each pair of embeddings, as well as the average similarity between each pair of embeddings of images that belong to the same celebrity.

Again, we'll find *cosine_similarity* quite useful.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

model = # ANSWER HERE
embeddings = # ANSWER HERE

similarities = # ANSWER HERE

* What is the average cosine similarity between **all images**?

In [None]:
import numpy as np

# ANSWER HERE

* What is the average cosine similarity between all images of the **same celebrity**?

In [None]:
import numpy as np

# ANSWER HERE

Based on the above calculations, what similarity threshold do you pick, and why? [ANSWER HERE]
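As a rough sketch of how the two averages above could be computed, the example below works on a toy similarity matrix rather than real model output. `pairwise_similarity_summary` is a name made up for this illustration. It counts each unordered pair once and skips the diagonal, since every image has similarity 1 with itself and would inflate the averages.

```python
import numpy as np

def pairwise_similarity_summary(similarities, labels):
    """Return (mean over all pairs, mean over same-label pairs, mean over different-label pairs)."""
    sims = np.asarray(similarities, dtype=float)
    labels = np.asarray(labels)
    i, j = np.triu_indices(len(labels), k=1)   # upper triangle: each pair once, no self-pairs
    pair_sims = sims[i, j]
    same = labels[i] == labels[j]
    return (float(pair_sims.mean()),
            float(pair_sims[same].mean()),
            float(pair_sims[~same].mean()))

# Toy similarity matrix for 4 images of 2 people (made-up numbers)
toy_sims = [
    [1.0, 0.9, 0.5, 0.4],
    [0.9, 1.0, 0.6, 0.5],
    [0.5, 0.6, 1.0, 0.8],
    [0.4, 0.5, 0.8, 1.0],
]
toy_labels = ["A", "A", "B", "B"]

mean_all, mean_same, mean_diff = pairwise_similarity_summary(toy_sims, toy_labels)
print(mean_all, mean_same, mean_diff)
```

A threshold somewhere between the same-person average and the different-person average is a natural starting point, though where exactly to put it depends on which kind of mistake you consider worse.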

# Step 5: Building a Demo

A high-level metric like the silhouette score doesn't give us a great idea of how the model will work when presented with new data from the real world. To understand this, we will build a web-based demo that we can open in a browser on our phones or computers to test our model.

The `gradio` library lets you build web demos of machine learning models with just a few lines of code. Learn more about Gradio here: https://gradio.app/getting_started/

Gradio lets you build machine learning demos simply by specifying (1) a prediction function, (2) the input type, and (3) the output type of your model. Write a prediction function that takes in two images and returns their similarity along with "SAME PERSON, UNLOCK PHONE" if the similarity is above your threshold, or "DIFFERENT PEOPLE, DON'T UNLOCK" if it is not.

In [None]:
import matplotlib.pyplot as plt

def predict(im1, im2):
  # ANSWER HERE
  if sim > THRESHOLD:  # THRESHOLD HERE: use the similarity threshold you chose in Step 4
    return sim, "SAME PERSON, UNLOCK PHONE"
  else:
    return sim, "DIFFERENT PEOPLE, DON'T UNLOCK"
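The decision logic can be tried out separately from the model. The sketch below builds a `predict` function from any embedding callable and a threshold. `make_predict`, the identity "embedder", and the value `0.8` are all assumptions made for this illustration. In the notebook, the embedding callable would be your model's `encode` method and the threshold the one you chose in Step 4.

```python
import numpy as np

def make_predict(embed, threshold):
    """Build a two-image predict function from an embedding callable and a similarity threshold."""
    def predict(im1, im2):
        e1 = np.asarray(embed(im1), dtype=float)
        e2 = np.asarray(embed(im2), dtype=float)
        # Cosine similarity, converted to a plain Python float for display
        sim = float(e1 @ e2 / (np.linalg.norm(e1) * np.linalg.norm(e2)))
        if sim > threshold:
            return sim, "SAME PERSON, UNLOCK PHONE"
        return sim, "DIFFERENT PEOPLE, DON'T UNLOCK"
    return predict

# Stand-in embedder: the "images" here are already vectors
predict_sketch = make_predict(embed=lambda v: v, threshold=0.8)

same_result = predict_sketch([1.0, 0.0], [2.0, 0.0])       # same direction
different_result = predict_sketch([1.0, 0.0], [0.0, 1.0])  # orthogonal
print(same_result)
print(different_result)
```

Keeping the embedder and the threshold as parameters makes it easy to swap in a different model or threshold later without touching the comparison logic.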

* **Build a Gradio web demo of your image classifier and `launch()` it**

Create a `gradio.Interface` and launch it! In this case, we've provided the Gradio code that you need to launch the demo.

In [None]:
import gradio as gr

interface = gr.Interface(fn=predict, 
                         inputs= [gr.Image(type="pil", sources=["webcam"]),  # Gradio 4+; on Gradio 3.x use source="webcam"
                                  gr.Image(type="pil", sources=["webcam"])], 
                         outputs= [gr.Number(label="Similarity"),
                                   gr.Textbox(label="Message")]
                         )

interface.launch(debug=True)

## Step 5b: Upload your Demo to Spaces (Optional)

Although we don't require it, we highly recommend that you upload your Gradio app to Hugging Face Spaces as it will allow you to easily share it with others or add it to your machine learning portfolio!

1. Create a new **public** Space with the code for your Gradio app. You might find this tutorial helpful: https://huggingface.co/blog/gradio-spaces (Note that in addition to uploading the code for your Gradio demo, you'll also need to upload the saved model files, as well as a `requirements.txt` file).
1. Once your app launches, please put the link to your Space here:

[ANSWER HERE]



# Step 6: Trying your Model with "Real World" Data!

* **Use the share link (or Space link if you completed 5b) created above to open up your app on your phone**

Now test your model on some real images -- of yourself or your friends. What do you notice about the performance of your model on your own images versus those on the training set?  

[ANSWER HERE]

# Bonus: Extensions

Now that you've worked through the project and have a functioning app, what else can we do to improve our results?

* **Extract faces from images**: One way to dramatically improve the performance of our app would be to do some preprocessing to [extract faces](https://www.digitalocean.com/community/tutorials/how-to-detect-and-extract-faces-from-an-image-with-opencv-and-python) from an image before it is passed into the image embedding. Can you implement this and see how your embedding results change? Does your Gradio app get better as well?
* **Systematically explore different similarity thresholds**: In step 4, you picked a single similarity threshold. How can you systematically find the "best" threshold? One way would be to try various similarity thresholds and compute, for each one, the false positive rate (the % of the time two different celebrities would be classified as the same person with this threshold) as well as the false negative rate (the % of the time the same celebrity would be classified as two different people with this threshold). We would then look for a threshold that balances a low false negative rate and a low false positive rate.
* **Try scraping a much larger dataset.** For this project, we manually downloaded a relatively small dataset. We would get more robust results if we were to programmatically scrape a larger dataset. There are many Python libraries that allow you to scrape image libraries (such as [flickrapi](https://gist.github.com/yunjey/14e3a069ad2aa3adf72dee93a53117d6)). Using such libraries, can you build a 10x larger dataset? How do your evaluation results change as a result? 
* **Finetune the image embedding model on faces**: We used general-purpose image embedding models as part of this exercise. To improve the performance of our models, we could actually *fine-tune* our embeddings on a dataset of facial images first. Such a [process is described here](https://huggingface.co/blog/how-to-train-sentence-transformers), and if you implement this, you'll notice a marked improvement in the results. Note that this requires building a larger dataset, so if you are interested in doing this extension, do the previous extension as well!
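For the threshold exploration described in the extensions above, a minimal sketch of the false positive / false negative computation might look like this. The pair similarities and candidate thresholds are made-up numbers, and `error_rates` is a name chosen for this illustration. In practice, the inputs would be the pairwise similarities from Step 4 and a flag saying whether each pair shows the same celebrity.

```python
import numpy as np

def error_rates(pair_sims, pair_is_same, threshold):
    """Return (false_positive_rate, false_negative_rate) for a given similarity threshold."""
    pair_sims = np.asarray(pair_sims, dtype=float)
    pair_is_same = np.asarray(pair_is_same, dtype=bool)
    predicted_same = pair_sims > threshold
    # FPR: share of different-person pairs wrongly predicted as the same person
    fpr = float(np.mean(predicted_same[~pair_is_same]))
    # FNR: share of same-person pairs wrongly predicted as different people
    fnr = float(np.mean(~predicted_same[pair_is_same]))
    return fpr, fnr

# Made-up pairwise similarities and ground truth
toy_pair_sims = [0.9, 0.8, 0.5, 0.4, 0.6, 0.5]
toy_pair_is_same = [True, True, False, False, False, False]

results = {t: error_rates(toy_pair_sims, toy_pair_is_same, t) for t in (0.45, 0.7, 0.85)}
for threshold, (fpr, fnr) in results.items():
    print(f"threshold={threshold}: FPR={fpr:.2f}, FNR={fnr:.2f}")
```

Raising the threshold trades false positives for false negatives. For an unlock-style application you might weigh false positives more heavily, since they let the wrong person in.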


