# Unsupervised Learning

In the previous chapter we used a neural network to classify images into one of three categories. We used a semi-automated method to label the data going into the network, because data labeling and cleansing are among the most time-intensive and error-prone steps in machine learning.

But what if we didn't have access to labeled data, or what if we didn't know what labels should be applied to a dataset? This is the realm of "unsupervised learning," where an algorithm is given just the data and no labels. Clustering is one of the more common unsupervised tasks, where an algorithm is asked to collect data into natural groupings.

We will be using a special neural network called an "autoencoder" in this exercise. Instead of outputting a classification, an autoencoder attempts to reconstruct the original data as best it can. What makes this an interesting task is that the network contains a "bottleneck," a section with fewer neurons than there are input/output pixels. This forces the network to compress large areas of an image into compact encodings. You can probably guess that autoencoders can be very useful for image compression and noise reduction.
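
To make the bottleneck idea concrete, here is a minimal numpy sketch of the shape flow through a single-hidden-layer autoencoder. It is an illustration only: the weights are random and untrained, the helper names (`encode`, `decode`) are hypothetical, and the sizes (2500 pixels, 256 bottleneck units) are assumed to match the network built later in this chapter.

```python
import numpy as np

rng = np.random.RandomState(0)

n_pixels = 50 * 50    # flattened 50x50 single-channel image
n_bottleneck = 256    # width of the narrow layer (assumed)

# Randomly initialized, untrained weights: this only shows the shapes.
w_enc = rng.normal(scale=0.01, size=(n_pixels, n_bottleneck))
w_dec = rng.normal(scale=0.01, size=(n_bottleneck, n_pixels))


def relu(a):
    return np.maximum(a, 0.0)


def encode(x):
    # Squeeze 2500 pixel values down to 256 activations.
    return relu(x.dot(w_enc))


def decode(code):
    # Expand the 256 activations back out to 2500 pixel values.
    return relu(code.dot(w_dec))


batch = rng.rand(4, n_pixels)   # four fake images with values in [0, 1)
code = encode(batch)
reconstruction = decode(code)

print(code.shape)            # (4, 256)
print(reconstruction.shape)  # (4, 2500)
```

Training would adjust `w_enc` and `w_dec` so that `reconstruction` is as close as possible to `batch`; the `code` array is what we later treat as a feature vector.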

What we will do in this experiment is build and train an autoencoder using the same data as Chapter 3. After the autoencoder is trained we will inspect its "bottleneck," which will be a fully connected layer. We will treat the activations in this layer as a feature vector that we will feed into a traditional clustering algorithm.

We will also introduce the idea of a TensorFlow Record file. This is a binary data format that replaces the flat file image representation that we used in Chapter 3. While this method of storing data is a bit more opaque, it has many advantages over the individual-image-file system.

In [None]:
%matplotlib inline
%load_ext autoreload 
%autoreload 2

# Math Stuff
import ee, scipy.misc, random, os
import numpy as np
from threading import Thread

# debug stuff
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
from tqdm import tqdm

# Tensorflow
import tensorflow as tf
from scipy import misc


## Data Acquisition

In this chapter we will be using TFRecords (TensorFlow Records) instead of flat files. Since downloading all those images took time, though, we will reuse the images we downloaded in Chapter 3.

In [None]:
CH3_DATA_DIR="./data/ch3"
CH4_DATA_DIR="./data/ch4"

TRAIN_TFRECORD_FILENAME = os.path.join(CH4_DATA_DIR, "ch4_train.tfrecords")
TEST_TFRECORD_FILENAME = os.path.join(CH4_DATA_DIR, "ch4_test.tfrecords")
    
TRAIN_IMG_DIR=os.path.join(CH3_DATA_DIR, "train_imgs")
TRAIN_IMG_LABELS=os.path.join(CH3_DATA_DIR, "train.txt")
TEST_IMG_DIR=os.path.join(CH3_DATA_DIR, "test_imgs")
TEST_IMG_LABELS=os.path.join(CH3_DATA_DIR, "test.txt")

# Create the data directory if necessary
if not os.path.exists(CH4_DATA_DIR):
    os.makedirs(CH4_DATA_DIR)

    

A `TFRecord` file is not random access; a reader starts at the first byte and reads serially until it reaches the end of the file. When we read from or write to such a file we must also tell TensorFlow how the data is stored. We do that by creating a `tf.train.Example` object for each example we encode. This object wraps a `tf.train.Features` object, which holds a map from feature names to `tf.train.Feature` values and describes exactly what we will be encoding.

This code was based on the example at https://indico.io/blog/tensorflow-data-inputs-part1-placeholders-protobufs-queues/.

In [None]:
#
# Create the TRAINING TFRecord
#
 
print "Recording training data in TFRecord format..."
# Open a TFRecord writer
proto_writer = tf.python_io.TFRecordWriter(TRAIN_TFRECORD_FILENAME)

# Iterate over every example and put it in the TFRecord
for line in tqdm(open(TRAIN_IMG_LABELS).read().splitlines()):
    png_path, label = line.split(',')
    img = misc.imread(png_path).flatten()
    
    proto_example = tf.train.Example(
        features=tf.train.Features( # a map of string to Feature proto objects
            feature={
                # A Feature contains one of either a int64_list,
                # float_list, or bytes_list
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
                'image': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=img.astype("int64")))
            }
        )
    )
    
    # use the proto object to serialize the example to a string
    serialized = proto_example.SerializeToString()
    # write the serialized object to disk
    proto_writer.write(serialized)
    
proto_writer.close()

#
# Create the TESTING TFRecord
#
 
print "Recording test data in TFRecord format..."
# Open a TFRecord writer
proto_writer = tf.python_io.TFRecordWriter(TEST_TFRECORD_FILENAME)

# Iterate over every example and put it in the TFRecord
for line in tqdm(open(TEST_IMG_LABELS).read().splitlines()):
    png_path, label = line.split(',')
    img = misc.imread(png_path).flatten()
    
    proto_example = tf.train.Example(
        features=tf.train.Features( # a map of string to Feature proto objects
            feature={
                # A Feature contains one of either a int64_list,
                # float_list, or bytes_list
                'label': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[int(label)])),
                'image': tf.train.Feature(
                    int64_list=tf.train.Int64List(value=img.astype("int64")))
            }
        )
    )
    
    # use the proto object to serialize the example to a string
    serialized = proto_example.SerializeToString()
    # write the serialized object to disk
    proto_writer.write(serialized)

proto_writer.close()


In [None]:
#
# Look inside the TFRecord to prove to ourselves that the records are correctly recorded.
#

i = 0
for serialized_example in tf.python_io.tf_record_iterator(TRAIN_TFRECORD_FILENAME):
    i = i + 1
    if i > 3:
        break
    example = tf.train.Example()
    example.ParseFromString(serialized_example)

    # traverse the Example format to get data
    image = example.features.feature['image'].int64_list.value
    label = example.features.feature['label'].int64_list.value[0]

    # display data
    print "Label:", label
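    # imshow expects RGB data as uint8 in [0, 255] (or floats in [0, 1]),
    # so cast the stored int64 pixel values to uint8 before plotting.
    img = np.array(image).astype("uint8")
    plt.imshow(img.reshape((50, 50, 3))); plt.show()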
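
The "read serially, stop when you have seen enough" pattern used above can be illustrated without TensorFlow. The sketch below uses a deliberately simplified, hypothetical record layout (an 8-byte length followed by the payload). It is not the actual TFRecord byte format, which also stores checksums, but it shows why such a file has to be read from the front.

```python
import io
import struct


def write_records(stream, payloads):
    """Append each payload as <8-byte little-endian length><payload bytes>."""
    for payload in payloads:
        stream.write(struct.pack("<Q", len(payload)))
        stream.write(payload)


def read_records(stream):
    """Yield payloads one at a time, front to back."""
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file
        (length,) = struct.unpack("<Q", header)
        yield stream.read(length)


buf = io.BytesIO()
write_records(buf, [b"first", b"second", b"third"])
buf.seek(0)

# Read only the first two records, then stop.
first_two = []
for i, record in enumerate(read_records(buf)):
    if i >= 2:
        break
    first_two.append(record)

print(first_two)
```

Because records have variable length and there is no index, the only way to find the start of record N is to read the lengths of the N-1 records before it.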


## Autoencoder Design

The autoencoder design is split into 2 parts: the data pipeline and the network architecture.

### Data Pipeline
Note that we are using TensorFlow functions (tf.foobar) instead of numpy functions (np.foobar). In this case the data pipeline is actually part of the TensorFlow graph, which will speed up our network considerably. In Chapter 3 we had to go out to the filesystem and load hundreds of images every iteration before calculations could start. Since TensorFlow is now in charge of loading its own data, it can preload data while it is doing neural network calculations, if that is most efficient. In any case, most of TensorFlow is written in C++ and optimized for neural network calculations, which will give us a performance boost.

One more point: while we encoded the entire 3-channel image into the TFRecord, a color image will probably be too complex for our architecture to learn. Our data pipeline will only keep the red channel.
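
As a rough guide to what the pipeline does to each image, here is a numpy sketch of the same steps: reshape the flat vector to 50x50x3, keep channel 0, flatten again, and scale to [0, 1]. The function name `keep_red_channel` is hypothetical, and the sketch assumes the images were stored in height, width, RGB order, as `misc.imread(...).flatten()` would produce for a 3-channel image.

```python
import numpy as np


def keep_red_channel(flat_rgb, height=50, width=50):
    """Mirror the pipeline's preprocessing for one flattened RGB image."""
    rgb = np.asarray(flat_rgb, dtype=np.float32).reshape(height, width, 3)
    red = rgb[:, :, 0]                           # channel 0 only
    return red.reshape(height * width) / 255.0   # flatten and scale to [0, 1]


# A fake image: red channel fully on, green half on, blue off.
fake_image = np.zeros((50, 50, 3), dtype=np.int64)
fake_image[:, :, 0] = 255
fake_image[:, :, 1] = 128

out = keep_red_channel(fake_image.flatten())
print(out.shape)             # (2500,)
print(out.min(), out.max())  # 1.0 1.0
```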

### Network Architecture
This section is pretty straightforward. Instead of the output being a 3-element vector, it will be a 2500-element vector representing a 50x50 pixel single-channel image. One key difference is that we are not using batch normalization in this network; we want the network to recreate the exact input image, not a normalized version.
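
The layer sizes used in the network below can be checked by hand. This small sketch walks the spatial size of the image through the three "valid" convolutions (kernel sizes 3, 3, and 4, stride 1) and the 2x2 max-pooling layers, assuming the standard output-size formula for unpadded convolution and pooling. The helper names are hypothetical.

```python
def conv_valid(size, kernel, stride=1):
    """Output width of an unpadded ('valid') convolution."""
    return (size - kernel) // stride + 1


def max_pool(size, pool=2, stride=2):
    """Output width of an unpadded max-pooling layer."""
    return (size - pool) // stride + 1


size = 50
trace = [size]
for kernel in (3, 3, 4):
    size = conv_valid(size, kernel)
    trace.append(size)
    size = max_pool(size)
    trace.append(size)

filters = 16
flat_units = size * size * filters

print(trace)       # [50, 48, 24, 22, 11, 8, 4]
print(flat_units)  # 256
```

This is why the flattening step reshapes to `4 * 4 * 16`: the last pooling layer produces a 4x4 map for each of its 16 filters.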

In [None]:
#
# Required input placeholders
#
training = tf.placeholder(dtype=tf.bool, name="is_training") # True if training, False if testing
batch_size = tf.placeholder(dtype=tf.int32, name="batch_size")

#
# Our data-loading pipeline.
#

def get_batch(proto_filename, batch_size):
    filename_queue = tf.train.string_input_producer([proto_filename], num_epochs=None)
    proto_reader = tf.TFRecordReader()

    # Examples from the TFRecord.
    _, serialized_example = proto_reader.read(filename_queue)

    # The serialized example is converted back to actual values
    # by describing the format of the objects to be returned
    features = tf.parse_single_example(
        serialized_example,
        features={
            # We know the length of both fields. If we did not,
            # tf.VarLenFeature could be used instead.
            'label': tf.FixedLenFeature([], tf.int64),
            'image': tf.FixedLenFeature([50*50*3], tf.int64)
        })

    # now we have the raw data
    label = features['label']
    image = features['image']

    # Reshape the data into the form the network expects.
    cast_img = tf.cast(image, tf.float32)
    threeD_img = tf.reshape(cast_img, [50,50, 3])
    # Keep only channel 0 (red).
    bw_img = tf.slice(threeD_img, [0, 0, 0], [50, 50, 1])
    oneD_img = tf.reshape(bw_img, [50*50])
    # Scale pixel values from [0, 255] to [0, 1].
    squashed_img = tf.divide(oneD_img, 255)

    # ...and batch it
    # Since a TFRecord can only be read sequentially, in order to shuffle the
    # data we use a tf.train.shuffle_batch() tensor. This function queues up
    # a bunch of examples and shuffles the data based on what it has queued up.
    # It's a good idea to know what this function is doing. Since we didn't
    # shuffle our data before storing it in the TFRecord, our first few passes
    # could be at risk of being a bit undershuffled. But since we don't have that many
    # data examples, everything should be pretty well shuffled after a few iterations.
    images_batch, labels_batch = tf.train.shuffle_batch(
        [squashed_img, label], batch_size=batch_size,
        capacity=2000,
        min_after_dequeue=1000)

    return images_batch, labels_batch


#
# Load the correct training/test data by using a conditional tensor.
#
train_images_batch, train_labels_batch = get_batch(TRAIN_TFRECORD_FILENAME, batch_size)
test_images_batch, test_labels_batch = get_batch(TEST_TFRECORD_FILENAME, batch_size)

images_batch, labels_batch = tf.cond(training,
                         lambda: (train_images_batch, train_labels_batch),
                         lambda: (test_images_batch, test_labels_batch)
                        )


#
# Model
#
x = images_batch
y_ = images_batch


# Ensure our images are the correct shape
input_layer = tf.reshape(x, [-1, 50, 50, 1])


#
# Convolutions
#

# Convolutional Layer #1
conv1 = tf.layers.conv2d(
    inputs=input_layer,
    filters=16,
    kernel_size=[3, 3],
    strides = [1,1],
    padding="valid",
    activation=tf.nn.relu)
pool1 = tf.layers.max_pooling2d(inputs=conv1, pool_size=[2, 2], strides=2)

# Convolutional Layer #2
conv2 = tf.layers.conv2d(
    inputs=pool1,
    filters=16,
    kernel_size=[3, 3],
    strides = [1,1],
    padding="valid",
    activation=tf.nn.relu)
pool2 = tf.layers.max_pooling2d(inputs=conv2, pool_size=[2, 2], strides=2)

# Convolutional Layer #3
conv3 = tf.layers.conv2d(
    inputs=pool2,
    filters=16,
    kernel_size=[4, 4],
    strides = [1,1],
    padding="valid",
    activation=tf.nn.relu)
pool3 = tf.layers.max_pooling2d(inputs=conv3, pool_size=[2, 2], strides=2)

# Dense Layers
conv3_flat = tf.reshape(pool3, [-1, 4 * 4 * 16])
fc1 = tf.layers.dense(inputs=conv3_flat, units=256, activation=tf.nn.relu)

# Here is our bottleneck: 256 activations, versus 2500 pixels at the input and output.
fc2 = tf.layers.dense(inputs=fc1, units=256, activation=tf.nn.relu)

# This layer uses information from the bottleneck to try to recreate the
# original image.
fc3 = tf.layers.dense(inputs=fc2, units=50*50, activation=tf.nn.relu)


#
# Loss
#

loss = None
train_op = None
y_pred = fc3
y_true = y_

# We're not using the softmax cross-entropy loss function here.
# Instead we're going to define the loss as the mean squared pixel
# difference (mean squared error) between the input and output images.
loss = tf.reduce_mean(tf.pow(y_true - y_pred, 2))
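
The comments in `get_batch` warn that the first few batches may be undershuffled. The plain-Python sketch below is a simplified model of a shuffle buffer, not TensorFlow's actual queue implementation: it keeps at least `min_after_dequeue` items in memory and emits a random one each time a new item arrives. The function name `shuffle_buffer` and the stream of 10,000 integers are assumptions made for illustration.

```python
import random


def shuffle_buffer(stream, min_after_dequeue, seed=0):
    """Yield items from `stream` in partially shuffled order."""
    rng = random.Random(seed)
    buffer = []
    for item in stream:
        buffer.append(item)
        if len(buffer) > min_after_dequeue:
            # Swap a random element to the end and emit it.
            idx = rng.randrange(len(buffer))
            buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
            yield buffer.pop()
    # The stream is exhausted: drain whatever is left in random order.
    rng.shuffle(buffer)
    for item in buffer:
        yield item


ordered = list(range(10000))   # stand-in for records stored in file order
out = list(shuffle_buffer(ordered, min_after_dequeue=1000))

first_batch = out[:256]
# Every element of the first batch comes from the front of the file,
# because nothing past position 1000 + 255 has been read yet.
print(max(first_batch) <= 1255)  # True
```

If the file were sorted by label, the first batches would contain only the labels stored at the front of the file. Shuffling before writing the TFRecord, or using a larger buffer, reduces that effect.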

In [None]:
#
# Train
#
from IPython.display import clear_output


global_step = tf.Variable(0, trainable=False)
starter_learning_rate = 1e-3
learning_rate = tf.train.exponential_decay(learning_rate = starter_learning_rate,
                                           global_step = global_step,
                                           decay_steps = 200,
                                           decay_rate = 0.96,
                                           staircase=True)


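# UPDATE_OPS is populated by layers such as batch normalization. This network
# has none, so the collection is empty and the dependency below is a no-op.
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):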
    
    optimizer = tf.train.AdamOptimizer(learning_rate).minimize(loss, global_step=global_step)


# Initialize tensorflow
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# start the data reader tensors
tf.train.start_queue_runners(sess=sess)

# And run 30k iterations
for i in range(30000):
    
    # Debug output
    if i%100 == 1:
        prediction, loss_out = sess.run( fetches = [y_pred, loss],
                                         feed_dict={training: False, batch_size:256})
        print "Step:", i, ", Loss:", loss_out
    
    # run an iteration
    optimizer.run(feed_dict={training: True, batch_size:256})
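
The learning-rate schedule above follows the staircase form of exponential decay: the rate is multiplied by `decay_rate` once every `decay_steps` steps. A small sketch of that formula, with the same constants as the training cell (the function name `staircase_decay` is hypothetical):

```python
def staircase_decay(step, base_lr=1e-3, decay_steps=200, decay_rate=0.96):
    """Learning rate after `step` updates, dropping once per `decay_steps`."""
    return base_lr * decay_rate ** (step // decay_steps)


for step in (0, 199, 200, 400, 29999):
    print("step %5d -> learning rate %.3e" % (step, staircase_decay(step)))
```

Over 30,000 iterations the rate is reduced 149 times, so by the end of training it is several hundred times smaller than the starting value of 1e-3.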



## Prediction

Remember, an autoencoder attempts to recreate the original image. In the next cell we can feed an image into the autoencoder and see what image the autoencoder is able to recreate.

In [None]:
# Prediction

# farm, city, mountain


img_in, prediction, loss_out, feature_vector = sess.run( fetches = [y_true,y_pred, loss, fc2],
                                         feed_dict={training: False, batch_size:1})

img_in = img_in.reshape((50,50))
prediction = prediction.reshape((50,50))

print "Prediction difference:", np.mean(np.square(img_in - prediction))

print "Original Image:"
plt.imshow(img_in, cmap='gray', vmin=0, vmax=1); plt.show()

print "Autoencoder-recreated Image:"
plt.imshow(prediction, cmap='gray', vmin=0, vmax=1); plt.show()


## Clustering

We can use the bottleneck activations as a "feature vector" for an image. We can feed these activations into a clustering algorithm to see if we can group similar images together.

The next section uses K-means clustering to group images together, using activations from the `fc2` layer of the autoencoder. Since K-means requires that we know 'K' ahead of time (and we don't know it), we will run the algorithm several times with different values of K and measure the total distance between the data points and their assigned cluster centers. We can then look for an 'elbow' in the resulting curve, the point where increasing K starts to give diminishing returns.
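
Reading an elbow off a plot is subjective. One possible heuristic, offered here only as a sketch, is to pick the K where the error curve bends most sharply, that is, where its second difference is largest. The function name and the error values below are made up for illustration and are not results from this notebook.

```python
import numpy as np


def elbow_by_second_difference(k_values, errors):
    """Return the K at which the error curve has its sharpest bend."""
    errors = np.asarray(errors, dtype=float)
    # Discrete second difference: e[i-1] - 2*e[i] + e[i+1]
    curvature = errors[:-2] - 2.0 * errors[1:-1] + errors[2:]
    # curvature[j] belongs to k_values[j + 1]
    return k_values[int(np.argmax(curvature)) + 1]


# Synthetic error curve: steady improvement up to K=5, then it flattens out.
ks = [2, 3, 4, 5, 6, 7, 8]
errs = [900, 700, 520, 350, 330, 315, 305]

print(elbow_by_second_difference(ks, errs))  # 5
```

K-means results vary from run to run, so on real data a heuristic like this is best treated as a starting point to check against the plot and the clustered images.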

In [None]:
#
# Clustering
#

from pylab import plot,show
from numpy import vstack,array
from numpy.random import rand
from scipy.cluster.vq import kmeans,vq
from scipy.spatial import distance

def featurize_image():
    img, feature_vector = sess.run( fetches = [y_true, fc2], feed_dict={training: False, batch_size:1})
    return img, feature_vector


#
# Calculate Error
#
def cluster_error(data, centroids, idx):
    sum_error = 0
    for i in range(idx.shape[0]):
        # distance from this sample to the centroid it was assigned to
        error = distance.euclidean(data[i, :], centroids[idx[i], :])
        sum_error += error
    return sum_error
        
        
#
# Collect Data
#
data_img_list = []

for i in range(500):
    img, feature_vector = featurize_image()
    sample = {
        'img': img,
        'feature_vector': feature_vector
    }
    data_img_list.append(sample)

data = np.squeeze(np.array([d['feature_vector'] for d in data_img_list]))
   


    
#
# K Means
#

k_list = []
error_list = []

for k in range(2,20):
    centroids,_ = kmeans(data,k)
    # assign each sample to a cluster
    idx,_ = vq(data,centroids)

    error = cluster_error(data, centroids, idx)
    print "k=", k, ":", error
    k_list.append(k)
    error_list.append(error)
    
plt.plot(k_list, error_list, '-o')
plt.show()
    
    
# In my experiments it seemed like K=5 gave the best results. But try
# different values and see what you get!
K=5

centroids,_ = kmeans(data,K)

# assign each sample to a cluster
idx,_ = vq(data,centroids)



for k in range(K):
    p = 1
    print "Cluster", k
    
    # collect images in class
    class_images = [data_img_list[i]
                    for i in range(idx.shape[0])
                    if idx[i] == k]
    
    # sort the image list based on each image's distance from the cluster center
    class_images.sort(
        key=lambda i_data: distance.euclidean(i_data['feature_vector'], centroids[k])
    )
    
    for image_data in class_images:
        if p <= 4*5:
            plt.subplot(4, 5, p)
            plt.axis('off')    
            plt.imshow(image_data['img'].reshape((50,50)), cmap='gray')
            p += 1
    plt.show()
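
For readers curious about the assignment step, here is a numpy sketch of what assigning each sample to its nearest centroid amounts to, together with the summed distance used as the error measure above. It is a plain re-implementation for illustration, not scipy's own code, and the four 2-D points are toy data.

```python
import numpy as np


def assign_to_nearest(data, centroids):
    """Return (index of nearest centroid, distance to it) for each row of data."""
    # diffs has shape (n_samples, n_centroids, n_features)
    diffs = data[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    idx = dists.argmin(axis=1)
    return idx, dists[np.arange(len(data)), idx]


# Toy data: two tight pairs of points and one centroid between each pair.
data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = np.array([[0.0, 0.5], [10.0, 10.5]])

idx, dist = assign_to_nearest(data, centroids)
total_error = dist.sum()

print(idx)          # [0 0 1 1]
print(total_error)  # 2.0
```

With the real feature vectors, `data` would have one 256-element row per image, and the same summed distance is what gets plotted against K.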
    
