
In [16]:
from google.colab import drive
drive.mount('/content/drive')

Mounted at /content/drive


In [17]:
%cd /content/drive/My' 'Drive/Colab' 'Notebooks/Deep_Learning_for_Computer_Vision/

/content/drive/My Drive/Colab Notebooks/Deep_Learning_for_Computer_Vision


In [0]:
import tensorflow as tf

#Similarity Learning Algorithms
**Similarity learning** is the process of training a metric to compute the similarity between two entities. The metric can be Euclidean distance, cosine distance, or another custom distance function. Entities can be any data, such as images, videos, text, or tables. To compute a metric on images, a vector representation of each image is required.
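
As a minimal sketch of the two metrics named above, the NumPy snippet below computes Euclidean and cosine distance between small vectors. The helper names and the three-dimensional "embeddings" are made up for illustration; real image embeddings would come from a trained encoder.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two embedding vectors."""
    return float(np.linalg.norm(a - b))

def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    return float(1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical 3-dimensional embeddings of three images
image_a = np.array([1.0, 0.0, 0.0])
image_b = np.array([2.0, 0.0, 0.0])   # same direction as image_a, different length
image_c = np.array([0.0, 3.0, 4.0])   # orthogonal to image_a

print(euclidean_distance(image_a, image_b))  # 1.0
print(cosine_distance(image_a, image_b))     # 0.0
print(cosine_distance(image_a, image_c))     # 1.0
```

Note that `image_a` and `image_b` are at cosine distance 0 but Euclidean distance 1, so the choice of metric changes which entities count as similar.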

##Siamese Network
is a neural network model where the network is trained to distinguish between two inputs.

A Siamese network trains
a CNN to produce an embedding using two encoders. Each encoder is fed
one of the images of either a positive or a negative pair. A Siamese network
requires less data than other deep learning algorithms. Siamese networks
were originally introduced for comparing signatures.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch6/siamese.JPG?raw=true)

Siamese networks can also be used for one-shot learning. **One-shot learning** means learning from only one example. In this case, a single reference image is
shown, and the network can tell whether a new image is similar to it. For most similarity learning
tasks, both positive and negative pairs are required for training. Such pairs can
be formed from any dataset that is available for classification tasks, assuming
that the embeddings are compared with a Euclidean distance. Here, the main objective of the encoders is to differentiate one input from another.
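
One possible way to form such pairs from a labelled classification dataset is sketched below. The function name `make_pairs` and the toy labels are hypothetical; the only assumption is that two samples of the same class form a positive pair and two samples of different classes form a negative pair.

```python
import itertools

def make_pairs(labels):
    """Return (index_a, index_b, is_same_class) for every unordered pair of samples."""
    pairs = []
    for i, j in itertools.combinations(range(len(labels)), 2):
        pairs.append((i, j, int(labels[i] == labels[j])))
    return pairs

# Hypothetical class labels of five images
labels = [3, 7, 3, 1, 7]
pairs = make_pairs(labels)
positive_pairs = [(i, j) for i, j, same in pairs if same == 1]
negative_pairs = [(i, j) for i, j, same in pairs if same == 0]

print(positive_pairs)       # [(0, 2), (1, 4)]
print(len(negative_pairs))  # 8
```

Negative pairs usually far outnumber positive ones, so in practice they are often subsampled. Here the flag is 1 for a same-class pair; a loss that uses 0 as the target for similar pairs would need the flag inverted.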

###Contrastive Loss
differentiates images by similarity. 

The feature or latent layers of the two inputs are
compared using a similarity metric, and the network is trained against a target
similarity score. In the case of a positive pair, the target is 0, as both inputs are the
same. For negative pairs, the penalty is the maximum of 0 and the margin minus the
distance between the pair of latent vectors (cosine distance or regularised Euclidean
distance), so a negative pair stops contributing once it is farther apart than the margin.

The `contrastive_loss` is defined as below:

In [0]:
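def contrastive_loss(model_1, model_2, label, margin=0.1):
  # Squared Euclidean distance between the two embeddings, one value per pair
  distance = tf.reduce_sum(tf.square(model_1 - model_2), 1)
  # label == 1: dissimilar pair, penalised only while closer than the margin
  # label == 0: similar pair, penalised by its squared distance
  loss = label * tf.square(tf.maximum(0., margin-tf.sqrt(distance))) + (1 - label) * distance
  loss = 0.5 * tf.reduce_mean(loss)
  return loss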
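
To see how the two terms behave, here is a NumPy mirror of `contrastive_loss` evaluated on three hand-made pairs. The function name `contrastive_loss_np` and the toy embeddings are assumptions for illustration; the arithmetic follows the TensorFlow definition above.

```python
import numpy as np

def contrastive_loss_np(emb_1, emb_2, label, margin=0.1):
    """NumPy mirror of contrastive_loss: label 0 = similar pair, 1 = dissimilar pair."""
    squared_distance = np.sum(np.square(emb_1 - emb_2), axis=1)
    distance = np.sqrt(squared_distance)
    # Dissimilar pairs are penalised only while they are closer than the margin
    dissimilar_term = label * np.square(np.maximum(0.0, margin - distance))
    # Similar pairs are penalised by their squared distance
    similar_term = (1 - label) * squared_distance
    return 0.5 * np.mean(dissimilar_term + similar_term)

emb_1 = np.array([[0.0, 0.0], [0.0, 0.0], [0.0, 0.0]])
emb_2 = np.array([[0.3, 0.4], [0.3, 0.4], [0.03, 0.04]])
label = np.array([0.0, 1.0, 1.0])

# Pair 1 (similar, distance 0.5):     contributes 0.25
# Pair 2 (dissimilar, distance 0.5):  beyond the margin, contributes 0
# Pair 3 (dissimilar, distance 0.05): contributes (0.1 - 0.05)^2 = 0.0025
loss = contrastive_loss_np(emb_1, emb_2, label)
print(loss)  # 0.5 * (0.25 + 0 + 0.0025) / 3, roughly 0.0421
```

The second pair shows why a margin is useful: once a dissimilar pair is far enough apart, it no longer influences training.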

###Train a Siamese Network
We need two models.

Set up the layers and datasets first,

In [6]:
from tensorflow.examples.tutorials.mnist import input_data
mnist_data = input_data.read_data_sets('MNIST_data', one_hot=True)

input_size = 784
no_classes = 10
batch_size = 100
total_batches = 300

Extracting MNIST_data/train-images-idx3-ubyte.gz
Extracting MNIST_data/train-labels-idx1-ubyte.gz
Extracting MNIST_data/t10k-images-idx3-ubyte.gz
Extracting MNIST_data/t10k-labels-idx1-ubyte.gz


In [0]:
def add_variable_summary(tf_variable, summary_name):
  with tf.name_scope(summary_name + '_summary'):
    mean = tf.reduce_mean(tf_variable)
    tf.summary.scalar('Mean', mean)
    with tf.name_scope('standard_deviation'):
      standard_deviation = tf.sqrt(tf.reduce_mean(tf.square(tf_variable - mean)))
    tf.summary.scalar('StandardDeviation', standard_deviation)
    tf.summary.scalar('Maximum', tf.reduce_max(tf_variable))
    tf.summary.scalar('Minimum', tf.reduce_min(tf_variable))
    tf.summary.histogram('Histogram', tf_variable)

In [0]:
def convolution_layer(input_layer, filters, kernel_size=[3,3],
                      activation=tf.nn.relu):
  layer = tf.layers.conv2d(inputs=input_layer,
                           filters=filters,
                           kernel_size=kernel_size,
                           activation=activation)
  add_variable_summary(layer, 'convolution')
  return layer

In [0]:
def pooling_layer(input_layer, pool_size=[2,2], strides=2):
  layer = tf.layers.max_pooling2d(inputs=input_layer,
                                  pool_size=pool_size,
                                  strides=strides)
  add_variable_summary(layer, 'pooling')
  return layer

In [0]:
def dense_layer(input_layer, units, activation=tf.nn.relu):
  layer = tf.layers.dense(inputs=input_layer,
                          units=units,
                          activation=activation)
  add_variable_summary(layer, 'dense')
  return layer

Use the building blocks above to build a simple CNN,

In [0]:
def get_model(input_):
  input_reshape = tf.reshape(input_, [-1,28,28,1], name='input_reshape')
  convolution_layer_1 = convolution_layer(input_reshape, 64)
  pooling_layer_1 = pooling_layer(convolution_layer_1)
  convolution_layer_2 = convolution_layer(pooling_layer_1, 128)
  pooling_layer_2 = pooling_layer(convolution_layer_2)
  flattened_pool = tf.reshape(pooling_layer_2, [-1, 5*5*128], name='flattened_pool')
  dense_layer_bottleneck = dense_layer(flattened_pool, 1024)
  return dense_layer_bottleneck

The model defined will be used twice to define the encoders necessary for
Siamese networks.

Next, placeholders for both the models are defined. For every
pair, the similarity of the inputs is also fed as input. The models defined are the
same. The models can also be defined so that the weights are shared.
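
The idea of weight sharing can be illustrated without TensorFlow. In the sketch below, a single dense layer with ReLU stands in for the CNN encoder, and both branches use the same weight matrix. The sizes, names, and random inputs are hypothetical, and this is not how `get_model` is wired in this notebook.

```python
import numpy as np

rng = np.random.default_rng(0)

# One set of encoder weights (hypothetical sizes: 784 inputs -> 16-dim embedding)
shared_weights = rng.normal(scale=0.01, size=(784, 16))

def encode(batch, weights):
    """A single dense layer with ReLU standing in for the CNN encoder."""
    return np.maximum(0.0, batch @ weights)

top_batch = rng.random((4, 784))
bottom_batch = rng.random((4, 784))

# Both branches call the same function with the same weights
top_embedding = encode(top_batch, shared_weights)
bottom_embedding = encode(bottom_batch, shared_weights)

print(top_embedding.shape, bottom_embedding.shape)  # (4, 16) (4, 16)
# With shared weights, identical inputs always map to identical embeddings
print(np.array_equal(encode(top_batch, shared_weights), top_embedding))  # True
```

With shared weights, the distance between two embeddings depends only on the inputs, not on which branch processed which image.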

In [0]:
top_input = tf.placeholder(tf.float32, shape=[None, input_size])
bottom_input = tf.placeholder(tf.float32, shape=[None, input_size])
y_input = tf.placeholder(tf.float32, shape=[None, no_classes])

top_bottleneck = get_model(top_input)
bottom_bottleneck = get_model(bottom_input)
# Concatenate models for similarity learning
dense_layer_bottleneck = tf.concat([top_bottleneck, bottom_bottleneck], 1)

dropout_bool = tf.placeholder(tf.bool)
dropout_layer = tf.layers.dropout(inputs=dense_layer_bottleneck,
                                  rate=0.4,
                                  training=dropout_bool)
logits = dense_layer(dropout_layer, no_classes)

In [0]:
with tf.name_scope('loss'):
  softmax_cross_entropy = tf.nn.softmax_cross_entropy_with_logits_v2(labels=y_input,
                                                                     logits=logits)
  loss_operation = tf.reduce_mean(softmax_cross_entropy, name='loss')
  tf.summary.scalar('loss', loss_operation)

In [0]:
with tf.name_scope('optimizer'):
  optimizer = tf.train.AdamOptimizer().minimize(loss_operation)

In [0]:
with tf.name_scope('accuracy'):
  with tf.name_scope('correct_prediction'):
    predictions = tf.argmax(logits, 1)
    correct_predictions = tf.equal(predictions, tf.argmax(y_input, 1))
  with tf.name_scope('accuracy'):
    accuracy_operation = tf.reduce_mean(tf.cast(correct_predictions, tf.float32))
  tf.summary.scalar('accuracy', accuracy_operation)

In [0]:
session = tf.Session()
session.run(tf.global_variables_initializer())

merged_summary_operation = tf.summary.merge_all()
train_summary_writer = tf.summary.FileWriter('./tmp/ch6/train', session.graph)
test_summary_writer = tf.summary.FileWriter('./tmp/ch6/test')

test_images, test_labels = mnist_data.test.images, mnist_data.test.labels

Now we can train the network and visualize the result with TensorBoard.

In [0]:
for batch_no in range(total_batches):
  mnist_batch = mnist_data.train.next_batch(batch_size)
  train_images, train_labels = mnist_batch[0], mnist_batch[1]
  _, merged_summary = session.run([optimizer, merged_summary_operation],
                                  feed_dict={top_input:train_images,
                                             bottom_input:train_images,
                                             y_input:train_labels,
                                             dropout_bool:True}
                                 )
  train_summary_writer.add_summary(merged_summary, batch_no)
  if batch_no % 10 == 0:
    merged_summary, _ = session.run([merged_summary_operation, accuracy_operation],
                                    feed_dict={top_input:test_images,
                                               bottom_input:test_images,
                                               y_input:test_labels,
                                               dropout_bool:False}
                                   )
    test_summary_writer.add_summary(merged_summary, batch_no)

Two encoders are defined, and
their latent representations are concatenated before the layer on which the training loss is computed. The top and bottom
models are fed with data through separate placeholders.

####TensorBoard


In [20]:
# install ngrok
!wget https://bin.equinox.io/c/4VmDzA7iaHb/ngrok-stable-linux-amd64.zip
!unzip ngrok-stable-linux-amd64.zip


In [0]:
# run TensorBoard,
# locating the summary files
# training summary -> ./tmp/ch6/train
# testing summary -> ./tmp/ch6/test
LOG_DIR = './tmp/ch6'
get_ipython().system_raw(
      'tensorboard --logdir {} --host 0.0.0.0 --port 6006 &'.format(LOG_DIR))

In [0]:
# run ngrok
# run ngrok to tunnel TensorBoard port 6006 to the outside world
get_ipython().system_raw('./ngrok http 6006 &')

In [24]:
# Get URL
# access the colab TensorBoard web page
! curl -s http://localhost:4040/api/tunnels | python3 -c \
    "import sys, json; print(json.load(sys.stdin)['tunnels'][0]['public_url'])"

http://ae1404b5.ngrok.io


##FaceNet
solves the face verification problem. It learns one deep CNN that transforms a face image into an embedding.

The FaceNet architecture is shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch6/facenet.JPG?raw=true)

FaceNet takes a batch of face images and trains on them. In that batch, there will be
a few positive pairs. While computing the loss, the positive pairs and the closest few
negative pairs are considered. Mining selected pairs enables smooth training. If
all the negatives are pushed away all the time, the training is not stable.
A loss that compares three data points is called a triplet loss. Each image is considered
together with a positive and a negative match while computing the loss. The negatives are
pushed away only by a certain margin.

###Triplet Loss
learns the score vectors for the images. The score vectors of face
descriptors can be used to verify the faces in Euclidean space. The triplet loss is
similar to metric learning in the sense of learning a projection so that the inputs
can be distinguished. These projections or descriptors or score vectors are a
compact representation, hence can be considered as a dimensionality reduction
technique.

A **triplet** consists of an *anchor*, and *positive* and *negative faces*. An
*anchor* can be any face, and *positive faces* are the images of the same person.
The *negative image* may come from another person.

There will be a lot of negative faces for a given anchor. Selecting negatives that are
currently close to the anchor makes it harder for the encoder to distinguish the faces,
thereby making it learn better. This process is termed **hard negative mining**.
The closer negatives can be obtained with a threshold in Euclidean space.
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch6/triplet.JPG?raw=true)
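
A minimal sketch of hard negative mining, assuming identity labels are available for every embedding: for a given anchor, pick the closest sample that belongs to a different person. The function name `hardest_negative` and the toy two-dimensional embeddings are hypothetical.

```python
import numpy as np

def hardest_negative(embeddings, labels, anchor_index):
    """Index of the closest sample whose label differs from the anchor's."""
    distances = np.linalg.norm(embeddings - embeddings[anchor_index], axis=1)
    # Exclude the anchor itself and every sample of the same identity
    distances[labels == labels[anchor_index]] = np.inf
    return int(np.argmin(distances))

# Hypothetical 2-D face embeddings for two identities (0 and 1)
embeddings = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 0.0], [5.0, 5.0]])
labels = np.array([0, 0, 1, 1])

print(hardest_negative(embeddings, labels, anchor_index=0))  # 2
```

Sample 2 is chosen over sample 3 because it is the negative that currently lies closest to the anchor and is therefore the hardest one to separate.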

The `triplet_loss` function can be defined as:

In [0]:
def triplet_loss(anchor_face, positive_face, negative_face, margin):
  def get_distance(x, y):
    return tf.reduce_sum(tf.square(tf.subtract(x, y)), 1)
  
  positive_distance = get_distance(anchor_face, positive_face)
  negative_distance = get_distance(anchor_face, negative_face)
  total_distance = tf.add(tf.subtract(positive_distance, negative_distance), margin)
  return tf.reduce_mean(tf.maximum(total_distance, 0.0), 0)
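
For intuition, the same computation can be traced by hand with NumPy. The function name `triplet_loss_np` and the toy triplets below are assumptions for illustration; the formula mirrors `triplet_loss` above, which uses squared Euclidean distances.

```python
import numpy as np

def triplet_loss_np(anchor, positive, negative, margin):
    """NumPy mirror of triplet_loss using squared Euclidean distances."""
    positive_distance = np.sum(np.square(anchor - positive), axis=1)
    negative_distance = np.sum(np.square(anchor - negative), axis=1)
    return np.mean(np.maximum(positive_distance - negative_distance + margin, 0.0))

anchor = np.array([[0.0, 0.0], [0.0, 0.0]])
positive = np.array([[0.1, 0.0], [0.5, 0.0]])
negative = np.array([[1.0, 0.0], [0.6, 0.0]])

# Triplet 1: 0.01 - 1.00 + 0.2 < 0    -> contributes 0 (already separated)
# Triplet 2: 0.25 - 0.36 + 0.2 = 0.09 -> negative is still inside the margin
print(triplet_loss_np(anchor, positive, negative, margin=0.2))  # roughly 0.045
```

Only the second triplet produces a gradient, which is why selecting informative triplets matters for training.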

Every point has to be compared with
others to get the proper anchor and positive pairs. The mining of the triplets is
shown below:

In [0]:
from scipy.spatial.distance import cdist
import numpy as np

In [0]:
def mine_triplets(anchor, targets, negative_samples):
  # Pairwise cosine distances, shape [n_anchors, n_targets]
  QnT_dists = cdist(anchor, targets, 'cosine')
  # For each anchor i, find the targets at the same distance as targets[i]
  # (assumes targets[i] is the anchor's own entry, so these are its duplicates)
  QnQ_duplicated = [
      [target_index for target_index, dist in enumerate(QnT_dist) if dist == QnT_dist[query_index]]
    for query_index, QnT_dist in enumerate(QnT_dists)]
  # Mask the duplicates so they can never be selected as negatives
  for i, duplicates in enumerate(QnQ_duplicated):
    QnT_dists[i, duplicates] = np.inf

  # Keep the closest remaining targets and prepend the anchor index to each row
  QnT_dists_topk = QnT_dists.argsort(axis=1)[:, :negative_samples]
  top_k_index = np.array([np.insert(QnT_dist, 0, i) for i, QnT_dist in enumerate(QnT_dists_topk)])
  return top_k_index

The FaceNet model is a state-of-the-art method for
training similarity models for faces.

##DeepNet Model
is used to learn the embedding of faces for face verification tasks. It improves on FaceNet by taking multiple crops of the same face and passing them through several encoders to get a better embedding. This achieves better accuracy than FaceNet but takes more time to process.

The face crops
are made in the same regions and passed through their respective encoders. Then
all the layers are concatenated for training against the triplet loss.

##DeepRank
is used to rank images based on similarity. 

Images are passed through different models:

![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch6/deeprank.JPG?raw=true)

The triplet loss is computed and backpropagation is done here as well.

Then the image can be converted into a linear embedding for ranking. The DeepRank architecture is shown below:
![alt text](https://github.com/lblogan14/deep_learning_for_computer_vision/blob/master/notes_images/ch6/deeprank2.JPG?raw=true)

#Human Face Analysis
* **Face detection**: Finding the bounding boxes that locate faces
* **Facial landmark detection**: Finding the spatial points of facial features
such as nose, mouth and so on
* **Face alignment**: Transforming the face into a frontal face for further
analysis
* **Attribute recognition**: Finding attributes such as gender, smiling and so on
* **Emotion analysis**: Analysing the emotions of persons
* **Face verification**: Finding whether two images belong to the same person
* **Face recognition**: Finding an identity for the face
* **Face clustering**: Grouping the faces of the same person together

##Face Landmarks and Attributes
**Face landmarks** are the spatial points in a human face. The spatial points
correspond to locations of various facial features such as eyes, eyebrows, nose,
mouth, and chin. The number of points may vary from 5 to 78 depending on the
annotation. Face landmarks are also referred to as **fiducial-points**, **facial keypoints**, or **face pose**.

Applications of face landmarks:
* Alignment of faces for better face verification or face recognition
* To track faces in a video
* Facial expressions or emotions can be measured
* Helpful for diagnosis of medical conditions

###Learn the facial keypoints

In [0]:
image_size = 40
no_landmark = 10
no_gender_classes = 2
no_smile_classes = 2
no_glasses_classes = 2
no_headpose_classes = 5
batch_size = 100
total_batches = 300

In [0]:
tf.reset_default_graph()

Placeholders for various inputs:

In [0]:
image_input = tf.placeholder(tf.float32, shape=[None, image_size, image_size])

landmark_input = tf.placeholder(tf.float32, shape=[None, no_landmark])
gender_input = tf.placeholder(tf.float32, shape=[None, no_gender_classes])
smile_input = tf.placeholder(tf.float32, shape=[None, no_smile_classes])
glasses_input = tf.placeholder(tf.float32, shape=[None, no_glasses_classes])
headpose_input = tf.placeholder(tf.float32, shape=[None, no_headpose_classes])

CNN model construction:

In [0]:
image_input_reshape = tf.reshape(image_input, [-1, image_size, image_size, 1],
                             name='input_reshape')

convolution_layer_1 = convolution_layer(image_input_reshape, 16)
pooling_layer_1 = pooling_layer(convolution_layer_1)
convolution_layer_2 = convolution_layer(pooling_layer_1, 48)
pooling_layer_2 = pooling_layer(convolution_layer_2)
convolution_layer_3 = convolution_layer(pooling_layer_2, 64)
pooling_layer_3 = pooling_layer(convolution_layer_3)
convolution_layer_4 = convolution_layer(pooling_layer_3, 64)
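# With 3x3 'valid' convolutions and 2x2 pooling, a 40x40 input shrinks to 1x1 here,
# so flatten by the actual shape instead of hard-coding 5 * 5 * 64
flattened_pool = tf.layers.flatten(convolution_layer_4, name='flattened_pool')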
dense_layer_bottleneck = dense_layer(flattened_pool, 1024)
dropout_bool = tf.placeholder(tf.bool)
dropout_layer = tf.layers.dropout(
        inputs=dense_layer_bottleneck,
        rate=0.4,
        training=dropout_bool
    )
landmark_logits = dense_layer(dropout_layer, 10)
smile_logits = dense_layer(dropout_layer, 2)
glass_logits = dense_layer(dropout_layer, 2)
gender_logits = dense_layer(dropout_layer, 2)
headpose_logits = dense_layer(dropout_layer, 5)

The loss is computed individually for all the facial features,

In [0]:
landmark_loss = 0.5 * tf.reduce_mean(
    tf.square(landmark_input - landmark_logits))

gender_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        labels=gender_input, logits=gender_logits))

smile_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        labels=smile_input, logits=smile_logits))

glass_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        labels=glasses_input, logits=glass_logits))

headpose_loss = tf.reduce_mean(
    tf.nn.softmax_cross_entropy_with_logits(
        labels=headpose_input, logits=headpose_logits))

loss_operation = landmark_loss + gender_loss + \
                 smile_loss + glass_loss + headpose_loss

optimiser = tf.train.AdamOptimizer().minimize(loss_operation)
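
The combined objective is simply a sum of per-task losses. The NumPy sketch below reproduces that idea for one regression head and one classification head; the helper names and the single-sample batch are hypothetical and chosen so the result can be checked by hand.

```python
import numpy as np

def softmax_cross_entropy(labels, logits):
    """Mean cross-entropy between one-hot labels and softmax(logits)."""
    shifted = logits - logits.max(axis=1, keepdims=True)  # for numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-np.mean(np.sum(labels * log_probs, axis=1)))

def landmark_mse(targets, predictions):
    """Half the mean squared error, matching the 0.5 factor used above."""
    return float(0.5 * np.mean(np.square(targets - predictions)))

# Hypothetical batch of one face: 2 landmark coordinates and a 2-class attribute
landmark_targets = np.array([[0.5, 0.5]])
landmark_predictions = np.array([[0.7, 0.5]])
smile_labels = np.array([[1.0, 0.0]])
smile_logits = np.array([[0.0, 0.0]])   # uniform prediction over both classes

total = landmark_mse(landmark_targets, landmark_predictions) \
        + softmax_cross_entropy(smile_labels, smile_logits)
print(total)  # 0.01 + ln(2), roughly 0.703
```

Because the terms are summed without weights, a task with a larger loss scale (here the classification term) dominates the gradient; weighting the terms is a common adjustment.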

In [0]:
# Need to create the fiducial image data first
# a module named fiducial_data needs to be created and imported 
session = tf.Session()
session.run(tf.global_variables_initializer())
fiducial_test_data = fiducial_data.test

In [0]:
for batch_no in range(total_batches):
    fiducial_data_batch = fiducial_data.train.next_batch(batch_size)
    loss, _landmark_loss, _ = session.run(
        [loss_operation, landmark_loss, optimiser],
        feed_dict={
            image_input: fiducial_data_batch.images,
            landmark_input: fiducial_data_batch.landmarks,
            gender_input: fiducial_data_batch.gender,
            smile_input: fiducial_data_batch.smile,
            glasses_input: fiducial_data_batch.glasses,
            headpose_input: fiducial_data_batch.pose,
            dropout_bool: True
    })
    if batch_no % 10 == 0:
        loss, _landmark_loss = session.run(
            [loss_operation, landmark_loss],
            feed_dict={
                image_input: fiducial_test_data.images,
                landmark_input: fiducial_test_data.landmarks,
                gender_input: fiducial_test_data.gender,
                smile_input: fiducial_test_data.smile,
                glasses_input: fiducial_test_data.glasses,
                headpose_input: fiducial_test_data.pose,
                dropout_bool: False
            })

##Face Recognition
is the process of identifying a person from a digital image or a video.

###Compute the similarity between faces
The faces have to be
detected, followed by finding the fiducial points. The faces can be aligned with
the fiducial points. The aligned face can be used for comparison.

In [0]:
from scipy import misc
import tensorflow as tf
import numpy as np
import os
import facenet
print(facenet)
from facenet import load_model, prewhiten
import align.detect_face

In [0]:
tf.reset_default_graph()

Load and align the images:

In [0]:
def load_and_align_data(image_paths, image_size=160, margin=44, gpu_memory_fraction=1.0):
    minsize = 20
    threshold = [0.6, 0.7, 0.7]
    factor = 0.709

    print('Creating networks and loading parameters')
    with tf.Graph().as_default():
        gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=gpu_memory_fraction)
        sess = tf.Session(config=tf.ConfigProto(
            gpu_options=gpu_options, log_device_placement=False))
        with sess.as_default():
            pnet, rnet, onet = align.detect_face.create_mtcnn(sess, None)

    nrof_samples = len(image_paths)
    img_list = [None] * nrof_samples
    for i in range(nrof_samples):
        img = misc.imread(os.path.expanduser(image_paths[i]), mode='RGB')
        img_size = np.asarray(img.shape)[0:2]
        bounding_boxes, _ = align.detect_face.detect_face(
            img, minsize, pnet, rnet, onet, threshold, factor)
        det = np.squeeze(bounding_boxes[0, 0:4])
        bb = np.zeros(4, dtype=np.int32)
        bb[0] = np.maximum(det[0] - margin / 2, 0)
        bb[1] = np.maximum(det[1] - margin / 2, 0)
        bb[2] = np.minimum(det[2] + margin / 2, img_size[1])
        bb[3] = np.minimum(det[3] + margin / 2, img_size[0])
        cropped = img[bb[1]:bb[3], bb[0]:bb[2], :]
        aligned = misc.imresize(
            cropped, (image_size, image_size), interp='bilinear')
        prewhitened = prewhiten(aligned)
        img_list[i] = prewhitened
    images = np.stack(img_list)
    return images

Process the image paths to get the embeddings:

In [0]:
def get_face_embeddings(image_paths, model=''):
    images = load_and_align_data(image_paths)
    with tf.Graph().as_default():
        with tf.Session() as sess:
            load_model(model)
            images_placeholder = tf.get_default_graph().get_tensor_by_name("input:0")
            embeddings = tf.get_default_graph().get_tensor_by_name("embeddings:0")
            phase_train_placeholder = tf.get_default_graph().get_tensor_by_name("phase_train:0")
            feed_dict = {images_placeholder: images,phase_train_placeholder: False}
            emb = sess.run(embeddings, feed_dict=feed_dict)

    return emb

Compute the distance between the embeddings:

In [0]:
def compute_distance(embedding_1, embedding_2):
    dist = np.sqrt(np.sum(np.square(np.subtract(embedding_1, embedding_2))))
    return dist

which computes the Euclidean distance between the embeddings.

###Find the optimum threshold
Combining with the preceding functions, we can calculate the accuracy of this model:

In [0]:
import sys
import argparse
import os
import re
from sklearn.metrics import classification_report
from sklearn.metrics import accuracy_score

Use the following function to obtain the images:

In [0]:
def get_image_paths(image_directory):
  image_names = sorted(os.listdir(image_directory))
  image_paths = [os.path.join(image_directory, image_name) for image_name in image_names]
  return image_paths

The distances of the images are obtained when embeddings are passed:

In [0]:
def get_labels_distances(image_paths, embeddings):
  target_labels, distances = [], []
  for image_path_1, embedding_1 in zip(image_paths, embeddings):
    for image_path_2, embedding_2 in zip(image_paths, embeddings):
      if (re.sub(r'\d+', '', image_path_1)).lower() == (re.sub(r'\d+', '', image_path_2)).lower():
        target_labels.append(1)
      else:
        target_labels.append(0)
      distances.append(compute_distance(embedding_1, embedding_2))
  return target_labels, distances
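
The labelling rule inside `get_labels_distances` relies on file names: all digits are stripped and the rest is lowercased, so two paths count as the same person when they differ only in numbers and letter case. The sketch below isolates that rule; the helper name `identity_from_path` and the file names are made up for illustration.

```python
import re

def identity_from_path(image_path):
    """Strip every digit run and lowercase, as get_labels_distances does."""
    return re.sub(r'\d+', '', image_path).lower()

# Hypothetical file names: an identity followed by a running number
print(identity_from_path('faces/Alice_01.jpg'))  # faces/alice_.jpg
print(identity_from_path('faces/alice_17.jpg'))  # faces/alice_.jpg
print(identity_from_path('faces/Bob_01.jpg'))    # faces/bob_.jpg
```

Digits are removed everywhere in the path, including directory names, so this convention only works when numbers are not part of a person's identifier.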

Print out the threshold and accuracy:

In [0]:
def print_metrics(target_labels, distances):
  accuracies = []
  for threshold in range(50, 150, 1):
    threshold = threshold/100
    predicted_labels = [1 if dist <= threshold
                          else 0 for dist in distances]
    print('Threshold ', threshold)
    print(classification_report(target_labels, predicted_labels))
    accuracy = accuracy_score(target_labels, predicted_labels)
    print('Accuracy: ', accuracy)
    accuracies.append(accuracy)
  print('Highest accuracy: ', max(accuracies))
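
`print_metrics` reports the highest accuracy but not the threshold that produced it. A small variant that returns both might look like the sketch below; `best_threshold` and the toy labels and distances are hypothetical, and the grid matches the 0.50 to 1.49 sweep used above.

```python
import numpy as np

def best_threshold(target_labels, distances, thresholds):
    """Return (threshold, accuracy) with the highest accuracy; ties keep the smallest threshold."""
    target_labels = np.asarray(target_labels)
    distances = np.asarray(distances)
    best = (None, -1.0)
    for threshold in thresholds:
        # A pair is predicted "same person" when its distance is within the threshold
        predicted = (distances <= threshold).astype(int)
        accuracy = float(np.mean(predicted == target_labels))
        if accuracy > best[1]:
            best = (float(threshold), accuracy)
    return best

# Hypothetical pairs: label 1 = same person, label 0 = different people
target_labels = [1, 1, 1, 0, 0, 0]
distances = [0.553, 0.704, 0.958, 0.901, 1.2, 1.4]
thresholds = np.arange(50, 150) / 100   # 0.50, 0.51, ..., 1.49

threshold, accuracy = best_threshold(target_labels, distances, thresholds)
print(threshold, accuracy)
```

In this toy data one same-person pair (0.958) lies farther apart than one different-person pair (0.901), so no threshold classifies every pair correctly.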

Run the main code:

In [0]:
image_paths = get_image_paths(image_directory)
embeddings = get_face_embeddings(image_paths)
target_labels, distances = get_labels_distances(image_paths, embeddings)
print_metrics(target_labels, distances)

##Face Clustering
is the process of grouping images of the same person together.

The embeddings of faces can be extracted, and a clustering
algorithm such as K-means can be used to group the faces of the same person
together. TensorFlow provides an API called `tf.contrib.learn.KMeansClustering` for
the K-means algorithm.
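
To show what the clustering step does, here is a plain NumPy sketch of K-means (Lloyd's algorithm) applied to toy embeddings. It is not the TensorFlow API mentioned above; the function name `kmeans`, the naive initialisation, and the two-dimensional embeddings are assumptions for illustration.

```python
import numpy as np

def kmeans(points, k, iterations=10):
    """Plain Lloyd's algorithm: returns (cluster assignment per point, centroids)."""
    centroids = points[:k].copy()   # naive initialisation: the first k points
    for _ in range(iterations):
        # Assign each point to its nearest centroid
        distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        assignments = distances.argmin(axis=1)
        # Move each centroid to the mean of its assigned points
        for cluster in range(k):
            members = points[assignments == cluster]
            if len(members) > 0:
                centroids[cluster] = members.mean(axis=0)
    return assignments, centroids

# Hypothetical 2-D face embeddings of two people
embeddings = np.array([[0.0, 0.1], [5.0, 5.1], [0.1, 0.0], [5.1, 5.0], [0.0, 0.0]])
assignments, centroids = kmeans(embeddings, k=2)
print(assignments)  # [0 1 0 1 0]
```

K-means needs the number of people `k` in advance; when that is unknown, a clustering method that infers the number of groups may be a better fit.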