# DATA DESCRIPTION

This project uses a set of face images hosted in a free online database.
The database is organized into four directories of increasing difficulty: faces94, faces95, faces96 and grimace. The last two are more challenging because of the variation in background and scale and the range of facial expressions in the images.

The whole set contains 7900 images belonging to 395 individuals. It includes people of different genders and races, as well as people with glasses and beards. Regarding age, most of the data corresponds to first-year undergraduate students between 18 and 20 years old, although some older people are also present.

The main features of each directory containing the images are mentioned below:

Faces94: this collection consists of pictures of a wide range of people, taken while they spoke in front of the camera. Because of the speech, this set introduces variation in facial expression.
Faces94 contains images of 153 individuals in portrait format, with male, female and malestaff subjects in separate directories. The background is plain green. There is no variation in head scale or image lighting for a given individual, but there is minor variation in head turn, tilt and slant, and considerable variation in facial expression.
Additionally, there is no hairstyle variation for a given individual, as the images were taken in a single session.

Faces95: this is a collection of pictures taken while each subject took one step forward towards the camera. This movement introduces significant variation in head scale among images of the same individual.
Faces95 holds images of 72 people in portrait format; both male and female subjects are shown. The background consists of a red curtain. Because the subjects moved during shooting, their shadows cause some variation in the background, and there is also variation in head scale and image lighting.
There is some variation in facial expression and, again, no hairstyle variation for a given individual, as the images in this directory were also taken in a single session.

Faces96: this collection holds pictures taken while each subject took one step forward towards the camera, as in Faces95. This movement again introduces significant variation in head scale among images of the same individual.
This dataset contains images of 152 individuals in square format, with both male and female subjects. The background consists of glossy posters, which makes it more complex than the backgrounds used in Faces94 and Faces95. There is large variation in head scale and image lighting, and some minor variation in the position of the face in the image, in head turn, tilt and slant, and in facial expression.
In this set there is no hairstyle variation either, as the images were taken in a single session.

Grimace: this directory holds a sequence of images taken while the subjects move their heads and make grimaces, which become more extreme towards the end of the sequence. The other features are similar to those in Faces95.
This dataset has 18 participants. All images are in portrait format and include both female and male subjects. The background is plain, and there is very little variation in head scale and image lighting, somewhat more in head turn, tilt and slant, and large variation in facial expression.
As in the other directories, there is no hairstyle variation, as the images were taken in a single session.

References

University of Essex. (2008, June 20). Description of the Collection of Facial Images [online]. Retrieved from https://cswww.essex.ac.uk/mv/allfaces/index.html


## Data preprocessing

To prepare the image dataset for the final modeling and evaluation stages, the image features must be standardized, both for the initial dataset (codename faces94) and for the external images used in later testing, under the following conditions:

**General Image Characteristics**:

* File format: `*.jpg`
* Grayscale images.
* Size 180x200 for the images in the dataset (codename faces94)

**Activities**:

* Shortlist the images to prepare a new dataset.
* Exclude from the dataset all images that are not in `*.jpg` format.
* Using the OpenCV library for Python, load the images and store the pixel matrices in arrays for numeric processing.
* Using the OpenCV library for Python, convert the images to grayscale and resize them to 180x200.
* Finally, the outcome is a new dataset with suitable images for testing and modeling face recognition with the Eigenfaces model, which applies Principal Component Analysis (PCA) to represent the face images in a low-dimensional space.
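
The activities above can be sketched roughly as follows. This is only an illustration, not the code in `ImageUtils`: the helper names `list_jpg_files` and `load_gray_resized` are hypothetical, and the sketch assumes that 180x200 means 180 pixels wide by 200 pixels high.

```python
from pathlib import Path

import numpy as np

TARGET_SIZE = (180, 200)  # (width, height), the order cv2.resize expects (assumed)


def list_jpg_files(root):
    """Return every .jpg file under root, sorted; other formats are skipped."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() == ".jpg"
    )


def load_gray_resized(paths, size=TARGET_SIZE):
    """Read each file as grayscale, resize it, and stack everything in one array."""
    import cv2  # OpenCV; imported here so list_jpg_files works without it

    images = []
    for path in paths:
        img = cv2.imread(str(path), cv2.IMREAD_GRAYSCALE)
        if img is None:  # cv2.imread returns None for unreadable files
            continue
        images.append(cv2.resize(img, size, interpolation=cv2.INTER_AREA))
    if not images:
        return np.empty((0, size[1], size[0]), dtype=np.uint8)
    return np.stack(images)  # shape (N, height, width) = (N, 200, 180)
```

A call such as `load_gray_resized(list_jpg_files("faces94"))`, where `faces94` stands for a hypothetical local copy of the directory, would return an array ready for the reshaping done in the PCA cells.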


In [None]:
import sys
sys.path.append('../utils/')

In [None]:
from ImageUtils import *

In [None]:
import numpy as np
import pandas as pd # Needs the package Pandas to be installed. Check Anaconda Environments and Packages.
from sklearn.decomposition import PCA # Needs SciKit Learn package to be installed. Check Anaconda Environments and Packages.
import matplotlib.pyplot as plt
from scipy.spatial.distance import mahalanobis
from collections import Counter

# DATASET FACES 94

In [None]:
face94_male = readFaces94MaleFaces(gray=True)
plt.imshow(face94_male[0], plt.cm.gray);

# Principal component analysis (PCA)

In [None]:
N, height, width = face94_male.shape

In [None]:
labels_faces = np.ones(N)

In [None]:
pca = PCA(n_components=200, whiten=True).fit(face94_male.reshape(N, height*width))
pca.components_.shape

In [None]:
rows = 4
cols = 6
plt.figure(figsize=(15,10))
for i in np.arange(rows * cols):
    plt.subplot(rows, cols, i + 1)
    plt.imshow(pca.components_[i].reshape(height, width), plt.cm.gray)
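
To make the link between `pca.components_` and the eigenfaces more concrete, the sketch below computes the principal axes directly with an SVD in NumPy, then projects images onto them and reconstructs them. It runs on synthetic data, the function names are hypothetical, and it does not whiten the projections (the scikit-learn call above uses `whiten=True`), so the numbers will not match that model exactly.

```python
import numpy as np


def eigenfaces_svd(images, n_components):
    """PCA via SVD on flattened, mean-centered images (no whitening)."""
    n = images.shape[0]
    X = images.reshape(n, -1).astype(np.float64)
    mean = X.mean(axis=0)
    # Rows of Vt are orthonormal principal axes, sorted by decreasing variance.
    _, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = (s[:n_components] ** 2) / (n - 1)
    return mean, components, explained_variance


def project(images, mean, components):
    """Coordinates of each image in the eigenface basis."""
    X = images.reshape(images.shape[0], -1).astype(np.float64)
    return (X - mean) @ components.T


def reconstruct(weights, mean, components, shape):
    """Approximate images rebuilt from their eigenface coordinates."""
    return (weights @ components + mean).reshape(-1, *shape)


rng = np.random.default_rng(0)
faces = rng.random((20, 8, 6))  # synthetic stand-in for (N, height, width)
mean, comps, var = eigenfaces_svd(faces, n_components=5)
weights = project(faces, mean, comps)
approx = reconstruct(weights, mean, comps, faces.shape[1:])
print(comps.shape, weights.shape, approx.shape)  # (5, 48) (20, 5) (20, 8, 6)
```

Each row of `comps`, reshaped to `(height, width)`, plays the role of one eigenface in the grid plotted above.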

# Mean face

In [None]:
mean_face = pca.mean_.reshape(height, width)
mean_face2 = np.mean(face94_male.reshape(N, height*width), axis=0).reshape(height, width)
fig = plt.figure(figsize=(8,10))
ax1 = fig.add_subplot(1,2,1)
plt.title("PCA mean")
ax1.imshow(mean_face, plt.cm.gray)
ax2 = fig.add_subplot(1,2,2)
plt.title("np mean")
ax2.imshow(mean_face2, plt.cm.gray)
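# Frobenius norm, i.e. the Euclidean distance between the two images.
# With ord=2 on a 2-D array, NumPy would return the spectral norm instead.
Dis = np.linalg.norm(mean_face - mean_face2)
print("Distance " + str(Dis))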

# Median face

In [None]:
median_face = np.median(face94_male.reshape(N, height*width), axis=0).reshape(height, width)
plt.imshow(median_face, cmap=plt.cm.gray);

In [None]:
#face94_male_projected = pca.transform(face94_male.reshape(N, height*width))

# Images of natural landscapes

The landscape images were obtained from the **ImageNet** database ([ImageNet database](http://image-net.org/)); the URLs of the individual images are listed [online](http://image-net.org/api/text/imagenet.synset.geturls?wnid=n13104059). We use the cv2 package to read and resize the images, and then create a NumPy array of the grayscale images.

In [None]:
landscapes = np.array(readLandsCapeImage(gray=True)) # Read dataset
plt.imshow(landscapes[45], plt.cm.gray); # show image

In [None]:
labels_landscapes = np.zeros(landscapes.shape[0])

In [None]:
dataset = np.vstack((face94_male, landscapes))
plt.imshow(dataset[-1], plt.cm.gray);

In [None]:
labels = np.concatenate((labels_faces, labels_landscapes))

In [None]:
dataset_N, height, width = dataset.shape

In [None]:
mean_with_noise = np.mean(dataset.reshape(dataset_N, height*width), axis=0).reshape(height, width)
plt.imshow(mean_with_noise, cmap=plt.cm.gray);

In [None]:
median_with_noise = np.median(dataset.reshape(dataset_N, height*width), axis=0).reshape(height, width)
plt.imshow(median_with_noise, cmap=plt.cm.gray);
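
The mean and the median are both computed on the mixed dataset because they react differently to contamination. The toy example below, built entirely on synthetic values chosen for illustration, shows the effect: a small share of much brighter images pulls the mean image away from the face level, while the median image barely moves.

```python
import numpy as np

rng = np.random.default_rng(0)
height, width = 4, 3

# 90 synthetic "faces": intensity around 100 with small noise.
faces = 100 + rng.normal(0, 1, size=(90, height, width))
# 10 synthetic "landscapes": much brighter images acting as contamination.
landscapes = 250 + rng.normal(0, 1, size=(10, height, width))
mixed = np.vstack((faces, landscapes))

flat = mixed.reshape(mixed.shape[0], -1)
mean_img = flat.mean(axis=0).reshape(height, width)
median_img = np.median(flat, axis=0).reshape(height, width)

# Average shift of each summary image away from the clean face level of 100.
mean_shift = np.abs(mean_img - 100).mean()
median_shift = np.abs(median_img - 100).mean()
print(f"mean shift: {mean_shift:.2f}, median shift: {median_shift:.2f}")
```

With 10% contamination the mean shifts by roughly 15 gray levels, whereas the median stays within about one gray level of the clean value.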

## Show atypical data distances

In [None]:
distance_info = getNormsAndDistanceInfoFromBaseImage(base_image=mean_with_noise, array_images=dataset, labels=labels)
visualizeOutlierInfo(distance_info,labels)

In [None]:
print(distance_info['falsitude_metrics'])
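
The helper `getNormsAndDistanceInfoFromBaseImage` is not shown in this notebook, so the following is only a guess at the kind of computation involved: the distance from each image to a base image under several p-norms, followed by a simple threshold. The function names and the `mean + k * std` rule are assumptions made for illustration, not the helper's actual logic.

```python
import numpy as np

# Norm orders used in this notebook; keys mirror its "Norm<p>" naming.
# Note that p = 0.71 is not a true norm (p < 1 breaks the triangle inequality).
P_VALUES = [1.0, 2.0, 3.0, np.inf, 2.5, 0.71]


def norm_distances(base_image, images, p_values=P_VALUES):
    """Distance of every image to base_image under each p-norm."""
    diffs = images.reshape(images.shape[0], -1).astype(np.float64)
    diffs = diffs - base_image.reshape(-1)
    return {f"Norm{p}": np.linalg.norm(diffs, ord=p, axis=1) for p in p_values}


def flag_outliers(distances, k=2.0):
    """Hypothetical rule: indices whose distance exceeds mean + k * std."""
    threshold = distances.mean() + k * distances.std()
    return np.flatnonzero(distances > threshold)


rng = np.random.default_rng(1)
images = rng.normal(100, 5, size=(50, 4, 3))
images[-1] += 120  # one obviously atypical image at index 49
base = images.mean(axis=0)

dists = norm_distances(base, images)
outliers = {name: flag_outliers(d) for name, d in dists.items()}
print({name: idx.tolist() for name, idx in outliers.items()})
```

Comparing the flagged indices across norms gives a rough idea of how sensitive the detection is to the choice of p.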

# Show atypical data distances (outliers interquartile range)

In [None]:
visualizeOutlierInfo2(distance_info,dataset,labels)

In [None]:
print(distance_info['falsitude_metrics_iqr'])
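
For reference, an interquartile-range rule can be sketched as below. This is the usual Tukey fence with `k = 1.5`, applied only on the upper side because a large distance is what marks an image as atypical; whether `visualizeOutlierInfo2` uses exactly this rule is an assumption.

```python
import numpy as np


def iqr_outlier_indices(distances, k=1.5):
    """Indices of distances above the upper Tukey fence Q3 + k * IQR."""
    q1, q3 = np.percentile(distances, [25, 75])
    upper = q3 + k * (q3 - q1)
    return np.flatnonzero(distances > upper)


distances = np.array([10.0, 11.0, 12.0, 11.5, 10.5, 12.5, 11.0, 40.0])
print(iqr_outlier_indices(distances))  # [7]
```

Unlike a rule based on the mean and standard deviation, the quartiles are barely affected by the outliers themselves.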

# Norms: Norm1.0  Norm2.0  Norm3.0  Norminf  Norm2.5  Norm0.71

In [None]:
selected_norm = "Norminf"
indices = distance_info["outliers"][selected_norm]["indices"]  # outlier indices for this norm
cols = 6
rows = max(1, int(np.ceil(indices.shape[0] / cols)))
plt.figure(figsize=(3 * cols, 3.5 * rows))  # figsize is in inches, not pixels
for i, idx in enumerate(indices):
    plt.subplot(rows, cols, i + 1)
    plt.imshow(dataset[idx], plt.cm.gray)

In [None]:
selected_norm = "Norm1.0"
selected_outliers = "outliersiqr"
Ind = distance_info[selected_outliers][selected_norm]['indices']
Distance = distance_info["norms"][selected_norm][Ind]
# Sort the outliers by distance so they can be shown from farthest to nearest.
Distance, Ind = zip(*sorted(zip(Distance, Ind)))
cols = 6
rows = int(np.ceil(len(Ind) / cols))
plt.figure(figsize=(3 * cols, 3.5 * rows))  # figsize is in inches, not pixels
for i in np.arange(len(Ind)):
    plt.subplot(rows, cols, i + 1)
    plt.imshow(dataset[Ind[-(i + 1)]], plt.cm.gray)  # largest distance first

In [None]:
dataset_norm = dataset/255
dataset_norm_cov = np.cov(dataset_norm.reshape(dataset_N, height*width))

np.linalg.det(dataset_norm_cov)
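
Two details of the last cell are worth spelling out. By default `np.cov` treats each row as a variable, so passing an array of shape `(N, height*width)` yields an `N x N` covariance between images, not between pixels. In addition, determinants of large covariance matrices easily underflow or overflow, so `np.linalg.slogdet` is often a safer way to inspect them. The small synthetic example below illustrates both points; it is a sketch and makes no claim about how the determinant is used later in the project.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.random((6, 20))  # 6 flattened synthetic "images" with 20 pixels each

cov_images = np.cov(X)                # rows as variables: shape (6, 6)
cov_pixels = np.cov(X, rowvar=False)  # columns as variables: shape (20, 20)

# slogdet returns the sign and the log of the absolute determinant.
sign, logdet = np.linalg.slogdet(cov_images)

# With only 6 observations, the 20 x 20 pixel covariance has rank at most 5,
# so it is singular and cannot be inverted directly.
rank_pixels = np.linalg.matrix_rank(cov_pixels)

print(cov_images.shape, cov_pixels.shape, rank_pixels)
```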