# Facial Recognition: Using Clustering and Principal Component Analysis

This assignment uses faces as a way to exercise skills in clustering and PCA.
* I discussed clustering and PCA in lecture, but please reach out if you need help getting started.

Execute this cell to import the libraries that I used for the PCA_KMeans notebook.

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

Execute the cell below to import the Olivetti faces data-set from AT&T.
* https://scikit-learn.org/stable/modules/generated/sklearn.datasets.fetch_olivetti_faces.html

This dataset can be used for classification, but we're going to use it to explore dimensionality reduction and clustering.

In [None]:
# import the face dataset
from sklearn.datasets import fetch_olivetti_faces
olivetti = fetch_olivetti_faces()

# print a description of the dataset
print(olivetti.DESCR)

# assign the numpy array of face data to a variable 'x'
x = olivetti.data

We now have a numpy array `x` which contains all the samples for our face exploration.

Find the number of elements in each dimension of the `x` array.

How many samples are there? How many feature variables? How many unique faces?
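As a reminder of the NumPy calls involved, here is a minimal sketch on a small made-up array. The names `demo_x` and `demo_target` are hypothetical stand-ins, not the Olivetti data. For the real dataset, the identity labels are available as `olivetti.target`.

```python
import numpy as np

# Hypothetical stand-in data: 12 samples, 6 features, 3 distinct labels.
rng = np.random.default_rng(0)
demo_x = rng.random((12, 6))
demo_target = np.repeat(np.arange(3), 4)  # four samples per label

# .shape gives the number of elements along each dimension (rows, columns).
n_samples, n_features = demo_x.shape

# np.unique returns the sorted distinct values; its size is the label count.
n_unique = np.unique(demo_target).size

print(n_samples, n_features, n_unique)
```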

Visualize the first face by executing the following:

In [None]:
fig = plt.figure(figsize=(2,2))
plt.imshow(x[0].reshape(64,64),
           cmap='gray')

Use PCA to create a new numpy array from `x` holding the principal component values that retain 95% of the variance of the original data.

How many principal components are retained? By what factor has the size of the data been reduced relative to `x`?
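One way to request a variance fraction from scikit-learn's `PCA` is to pass a float between 0 and 1 as `n_components`. The sketch below assumes synthetic low-rank data (`demo_x` is a hypothetical stand-in, not the faces) and sets `svd_solver="full"` explicitly so the fractional form is handled the same way across library versions.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 200 samples with 50 features that are
# driven by only 5 latent factors, plus a little noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 5))
mixing = rng.normal(size=(5, 50))
demo_x = latent @ mixing + 0.05 * rng.normal(size=(200, 50))

# A float in (0, 1) asks PCA to keep the fewest components whose
# cumulative explained variance exceeds that fraction.
pca = PCA(n_components=0.95, svd_solver="full")
demo_reduced = pca.fit_transform(demo_x)

n_kept = pca.n_components_                   # components actually retained
reduction_factor = demo_x.shape[1] / n_kept  # original features per component
print(n_kept, reduction_factor, pca.explained_variance_ratio_.sum())
```

The same attributes (`n_components_`, `explained_variance_ratio_`) are available after fitting on `x`.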

Let's take this reduced array and plot the face information that it still contains.

To transform the dimensionally-reduced data represented in the space of principal components back to the original feature space, you can execute the inverse transform:
* `pca.fit_transform(x)` gets the dimensionally-reduced array in the space of principal components
* `pca.inverse_transform(x_reduced)` transforms the dimensionally-reduced array back into the original feature space (the principal components that have been dropped are treated as having 0 values)
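The round trip described in the two bullets above can be sketched on synthetic data as follows (`demo_x` is a hypothetical stand-in). The last lines check that the variance lost in the round trip matches the fraction PCA reports as unexplained.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in data: 100 samples, 30 features, 4 latent factors.
rng = np.random.default_rng(1)
latent = rng.normal(size=(100, 4))
demo_x = latent @ rng.normal(size=(4, 30)) + 0.1 * rng.normal(size=(100, 30))

pca = PCA(n_components=0.95, svd_solver="full")
demo_reduced = pca.fit_transform(demo_x)              # (100, n_components_)
demo_restored = pca.inverse_transform(demo_reduced)   # back to (100, 30)

# Fraction of the centered variance lost by dropping components.
centered = demo_x - demo_x.mean(axis=0)
lost = np.sum((demo_x - demo_restored) ** 2) / np.sum(centered ** 2)
print(demo_reduced.shape, demo_restored.shape, lost)
```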

Using this inverse transform, make plots of 5 faces from the data that only retains 95% of the variance and compare them against plots of the same 5 faces made from the original data.
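For the side-by-side comparison, one possible layout is a two-row grid with originals on top and reconstructions below. The helper `plot_pairs` and the arrays `demo_orig` and `demo_rest` are hypothetical names used only for this sketch. For the faces, `side` would be 64.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_pairs(originals, restored, side):
    """Plot each original image above its reconstruction.

    originals, restored: arrays of shape (n_images, side * side).
    Returns the matplotlib Figure so it can be shown or saved.
    """
    n_images = originals.shape[0]
    fig, axes = plt.subplots(2, n_images, figsize=(2 * n_images, 4),
                             squeeze=False)
    for col in range(n_images):
        axes[0, col].imshow(originals[col].reshape(side, side), cmap="gray")
        axes[1, col].imshow(restored[col].reshape(side, side), cmap="gray")
        axes[0, col].set_title(f"original {col}")
        axes[1, col].set_title(f"restored {col}")
        axes[0, col].axis("off")
        axes[1, col].axis("off")
    return fig

# Hypothetical stand-in: 5 random 8x8 "images" and a noisy copy of each.
rng = np.random.default_rng(2)
demo_orig = rng.random((5, 64))
demo_rest = np.clip(demo_orig + 0.1 * rng.normal(size=(5, 64)), 0, 1)
fig = plot_pairs(demo_orig, demo_rest, side=8)
```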

Repeat the process for 3 other values of retained variance: 50%, 75%, and 99%. For each,
* print the number of principal components that are kept
* print the factor by which the dimensionality of the array has been reduced
* make plots of the same 5 faces
* comment on which facial features are still noticeable or not noticeable
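The bookkeeping for several variance levels fits naturally in a loop. This is only a sketch on synthetic data (`demo_x` and `kept` are hypothetical names), and it covers the two printing steps, not the plots or the commentary.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in: 150 samples, 40 features with decaying variance.
rng = np.random.default_rng(3)
scales = np.linspace(5.0, 0.1, 40)
demo_x = rng.normal(size=(150, 40)) * scales

kept = {}
for fraction in (0.50, 0.75, 0.99):
    pca = PCA(n_components=fraction, svd_solver="full").fit(demo_x)
    kept[fraction] = pca.n_components_
    factor = demo_x.shape[1] / pca.n_components_
    print(f"{fraction:.0%} variance: {pca.n_components_} components, "
          f"reduced by a factor of {factor:.1f}")
```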

Using the data that has been reduced while keeping 95% of the variance, run K-Means clustering with 80 clusters. (The dataset contains 40 distinct people, but the clustering here works better with a larger number of clusters.)

What are the cluster labels for the 5 faces you plotted above?

For each of your 5 faces, plot the other faces that are in the same cluster.
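To look up which samples share a cluster, the fitted model's `labels_` array can be compared against one sample's label. The sketch below assumes three well-separated synthetic groups (`demo_reduced` is a hypothetical stand-in for the PCA-reduced faces) and uses 3 clusters rather than 80.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in: 3 well-separated groups of 20 points in 4-D.
rng = np.random.default_rng(4)
centers = np.array([[0, 0, 0, 0],
                    [10, 10, 10, 10],
                    [-10, 10, -10, 10]], dtype=float)
demo_reduced = np.vstack(
    [c + rng.normal(scale=0.5, size=(20, 4)) for c in centers]
)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(demo_reduced)

# Cluster label of one chosen sample, then all other samples sharing it.
sample_index = 0
label = kmeans.labels_[sample_index]
same_cluster = np.flatnonzero(kmeans.labels_ == label)
others = same_cluster[same_cluster != sample_index]
print(label, others)
```

The indices in `others` can then be used to select rows of the original array for plotting.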

Do the faces in these clusters look the same or different?

To wrap up, let's consider the following hypothetical scenario:
1. Suppose that the faces in this dataset correspond to a collection of photos you've found on a phone.
2. The phone used an algorithm to cluster similar faces together as potentially being photos of the same person.
3. You take a photo of me in class, and the phone tries to match the new picture with one of the clusters it has identified.

**First:** Make sure that you have downloaded the "Ben1.jpg" file into the same directory as this notebook.

**Second:** execute the cell below.  It should output a matplotlib figure of my face.

In [None]:
# Execute this to import and look at the new picture of me:

import matplotlib.image as mpimg
from sklearn.preprocessing import MinMaxScaler

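# read the image file into a numpy array
im = mpimg.imread('Ben1.jpg')
# keep the first color channel and flatten it into a column of 4096 pixel values
im = im[:,:,0].reshape(4096,-1).astype('float32')
# rescale the pixel values to the range [0, 1]
im = MinMaxScaler().fit_transform(im)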

fig = plt.figure(figsize=(2,2))
plt.imshow(im.reshape(64,64),
           cmap='gray')

Identify which cluster my face is closest to and plot the other faces in that cluster.  You'll need to:
* Reshape my face array (stored now as variable `im`) into the proper shape
* Transform it into an array of principal component values
* Use the KMeans predict method on it
* Use the predicted cluster number to identify and plot the other faces in the same cluster
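The four steps above can be sketched on synthetic data as follows. Every array here (`demo_x`, `new_sample`, and so on) is a hypothetical stand-in. The point to carry over is that the new sample is reshaped to a single row and passed through the already fitted PCA and KMeans objects, rather than fitting new ones.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

# Hypothetical stand-in training data: 3 groups of 30 samples, 16 features.
rng = np.random.default_rng(5)
group_means = rng.normal(scale=5.0, size=(3, 16))
demo_x = np.vstack(
    [m + rng.normal(scale=0.3, size=(30, 16)) for m in group_means]
)

pca = PCA(n_components=0.95, svd_solver="full").fit(demo_x)
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(
    pca.transform(demo_x)
)

# A new flat sample (stand-in for `im`), generated near the second group.
new_sample = group_means[1] + rng.normal(scale=0.3, size=16)

# scikit-learn expects 2-D input: one row per sample.
new_row = new_sample.reshape(1, -1)
new_reduced = pca.transform(new_row)        # reuse the fitted PCA, no refit
new_label = kmeans.predict(new_reduced)[0]  # predict returns one label per row

# Indices of the training samples assigned to the same cluster.
matches = np.flatnonzero(kmeans.labels_ == new_label)
print(new_label, matches)
```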

Do I look like the other faces in that cluster?  What do you think the clustering algorithm picked up on to put my face in that cluster?

## Submit

* Save your work (File -> Save Notebook)
* Verify that your notebook runs without error by restarting the kernel (or closing and opening the notebook) and selecting the top menu item for Run -> Run All Cells.  It should run successfully all the way to the bottom.
* Save your notebook again.  Keep all the output visible when saving the final version.
* Submit the file through the Canvas Assignment.