K-means clustering
===========

The k-means clustering algorithm is an unsupervised learning method widely used in cluster analysis. The algorithm alternates between two main steps:

1. **Assignment**: each observation in the input space is assigned to its nearest centroid, called the Best Matching Unit (BMU), according to a distance measure.
2. **Update**: for each cluster, the mean of the assigned observations is computed and becomes the new centroid.
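
The two steps above can be sketched in plain NumPy. This is only a minimal illustration, not the TensorFlow version developed later: the function name `kmeans_step` and the toy data are assumptions made up for this example, and empty clusters are simply left where they are.

```python
import numpy as np

def kmeans_step(x, centroids):
    """One assignment + update pass of batch k-means.

    x:         array of shape (n_samples, n_features)
    centroids: array of shape (k, n_features)
    """
    # Assignment: squared Euclidean distance from every sample to every centroid
    distances = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    bmu_index = distances.argmin(axis=1)
    # Update: move each centroid to the mean of the samples assigned to it
    new_centroids = centroids.copy()
    for i in range(len(centroids)):
        members = x[bmu_index == i]
        if len(members) > 0:  # an empty cluster keeps its old centroid
            new_centroids[i] = members.mean(axis=0)
    return bmu_index, new_centroids

# Toy data (made up for illustration): two well-separated groups in 2D
x = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]])
centroids = x[[0, 2]].copy()  # initialize the centroids from two of the samples
for _ in range(5):
    bmu_index, centroids = kmeans_step(x, centroids)

print(bmu_index)  # cluster index of each sample
print(centroids)  # one row per centroid
```

On this toy input the first two samples end up in one cluster and the last two in the other, with each centroid sitting at the mean of its pair.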

In general the distance measure used is the [Euclidean distance](https://en.wikipedia.org/wiki/Euclidean_distance). A common alternative is the [cosine distance](https://en.wikipedia.org/wiki/Cosine_similarity), which is based on the inner product. The goal in k-means clustering is to minimize the within-cluster sum of squares (reconstruction error):

$$\underset{S}{\mathrm{argmin}} \sum_{i=1}^{k} \sum_{\bf{x} \in S_{i}} \lVert \bf{x} - \bf{\mu}_{i} \rVert^{2}$$

where $\bf{x}$ are the input vectors, $S = \{ S_{1},...,S_{k} \}$ is the set of $k$ clusters, and $\bf{\mu}_{i}$ is the mean of the vectors belonging to cluster $S_{i}$.
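
To see why the mean is the right choice for each centroid, keep the assignment fixed and set the gradient of the objective with respect to a single centroid to zero. The short derivation below uses the same symbols as the objective, with $|S_{i}|$ denoting the number of vectors in cluster $S_{i}$:

$$\frac{\partial}{\partial \bf{\mu}_{i}} \sum_{\bf{x} \in S_{i}} \lVert \bf{x} - \bf{\mu}_{i} \rVert^{2} = -2 \sum_{\bf{x} \in S_{i}} (\bf{x} - \bf{\mu}_{i}) = 0 \quad \Rightarrow \quad \bf{\mu}_{i} = \frac{1}{|S_{i}|} \sum_{\bf{x} \in S_{i}} \bf{x}$$

The same gradient, $-2 \sum_{\bf{x} \in S_{i}} (\bf{x} - \bf{\mu}_{i})$, is what a gradient-based optimizer follows when the centroids are treated as trainable parameters.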

Online k-means algorithm in TensorFlow
--------------------------------------------

Here I show you how to create an online version of the k-means algorithm. By *online* I mean that a batch of examples can be passed every time the training operation is called. This is especially useful when the dataset is too large to be processed at once. First of all we declare the input size and the number of clusters we want to use:

In [None]:
input_size = 6
number_centroids = 2 #the number of centroids
batch_size = 3 #the number of input vectors to pass

Now we can write the actual algorithm. The input and the initial centroids are passed through two placeholders. The distance used is the **Euclidean distance**. Remember that we want to minimize the within-cluster sum of squares (reconstruction error) between the input vectors and the centroids. Since the problem can be defined in terms of a loss function, we can use our usual gradient-based optimizers.

In [None]:
import tensorflow as tf

#Placeholders for the input array and the initial centroids
input_placeholder = tf.placeholder(dtype=tf.float32, shape=[None, input_size])
initial_centroids_placeholder = tf.placeholder(dtype=tf.float32, shape=[input_size, number_centroids])

#Matrix containing the centroids (one centroid per column)
kmeans_matrix = tf.Variable(tf.random_uniform(shape=[input_size, number_centroids], 
                                              minval=0.0, maxval=1.0, dtype=tf.float32))
assign_centroids_op = kmeans_matrix.assign(initial_centroids_placeholder)

#Here the distance is estimated and the BMU computed
difference = tf.expand_dims(input_placeholder, axis=1) - tf.expand_dims(tf.transpose(kmeans_matrix), axis=0)
euclidean_distance = tf.norm(difference, ord='euclidean', axis=2) #shape=(?, number_centroids)
bmu_index = tf.argmin(euclidean_distance, axis=1) #get the index of the BMU
bmu = tf.gather(kmeans_matrix, indices=bmu_index, axis=1) #take the centroids, shape=(input_size, ?)

#To minimize: within-cluster sum of squares (reconstruction error)
loss = tf.reduce_mean(tf.pow(input_placeholder - tf.transpose(bmu), 2))

#optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.01)
optimizer = tf.train.AdamOptimizer(learning_rate=0.05)

#Training operation
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())
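
The shapes in the graph above are easy to get wrong, so here is a NumPy sketch that mirrors the same computation step by step. It is an illustration under stated assumptions, not what TensorFlow runs internally: the helper names `forward` and `gradient_step` are made up, the gradient is derived by hand (only the gathered centroid columns receive a gradient, since the `argmin` itself is not differentiable), and plain gradient descent is swapped in for Adam.

```python
import numpy as np

input_size, number_centroids, batch_size = 6, 2, 3

def forward(x, kmeans_matrix):
    # (batch, 1, input_size) - (1, number_centroids, input_size)
    # broadcasts to (batch, number_centroids, input_size)
    difference = x[:, None, :] - kmeans_matrix.T[None, :, :]
    euclidean_distance = np.linalg.norm(difference, axis=2)  # (batch, number_centroids)
    bmu_index = euclidean_distance.argmin(axis=1)            # (batch,)
    bmu = kmeans_matrix[:, bmu_index]                        # (input_size, batch)
    loss = np.mean((x - bmu.T) ** 2)
    return bmu_index, bmu, loss

def gradient_step(x, kmeans_matrix, learning_rate):
    bmu_index, bmu, loss = forward(x, kmeans_matrix)
    # Derivative of the mean squared error with respect to each gathered centroid
    grad_bmu = 2.0 * (bmu.T - x) / x.size                    # (batch, input_size)
    # Accumulate the per-sample gradients on the centroid each sample selected
    grad_t = np.zeros((kmeans_matrix.shape[1], kmeans_matrix.shape[0]))
    np.add.at(grad_t, bmu_index, grad_bmu)
    return kmeans_matrix - learning_rate * grad_t.T, loss

rng = np.random.RandomState(0)
x = rng.random_sample((batch_size, input_size))   # stands in for input_placeholder
kmeans_matrix = x[:number_centroids].T.copy()     # centroids initialized from the inputs

losses = []
for _ in range(100):
    kmeans_matrix, loss = gradient_step(x, kmeans_matrix, learning_rate=0.5)
    losses.append(loss)
print("first loss:", losses[0], "last loss:", losses[-1])
```

Each step pulls a centroid a little toward the mean of the samples currently assigned to it, so the loss shrinks gradually instead of jumping to the mean in one update as the classic algorithm does.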

Now the graph is ready and we can start a session to try the algorithm. Here I simply create a batch of random input arrays. The initialization of the centroids is trivial: I take the first `number_centroids` input arrays and assign them to the centroids.

In [None]:
import numpy as np

sess = tf.Session()
sess.run(tf.global_variables_initializer())
input_array = np.random.random((batch_size,input_size))
print("\ninput_array.T")
print(input_array.T)
print("\nkmeans_matrix before assignment")
print(sess.run([kmeans_matrix]))
sess.run([assign_centroids_op], 
         {initial_centroids_placeholder: input_array[0:number_centroids,:].T})
print("\nkmeans_matrix after assignment")
print(sess.run([kmeans_matrix]))

print("\nStarting training...")
print_every = 100
for i in range(1000):
    output = sess.run([train_op, loss], {input_placeholder: input_array})
    if i % print_every == 0:
        print("Loss: " + str(output[1]))

print("\nkmeans_matrix final")
print(sess.run([kmeans_matrix]))

Training on the Iris dataset
--------------------------------

Now we can try the algorithm on a real dataset. We can use the [Iris Flower dataset](../iris/iris.ipynb) for instance. Each sample is described by four measurements of the flower: sepal length, sepal width, petal length, petal width. There are a total of three classes (0=Setosa, 1=Versicolor, 2=Virginica).
The TFRecord files for this dataset are already included in Tensorbag and we can load them in memory with the following snippet:

In [1]:
import tensorflow as tf

def _parse_function(example_proto):
    features = {"feature": tf.VarLenFeature(tf.float32),
                "label": tf.FixedLenFeature((), tf.int64, default_value=0)}
    parsed_features = tf.parse_single_example(example_proto, features)
    feature = tf.cast(parsed_features["feature"], tf.float32)
    feature = tf.sparse_tensor_to_dense(feature, default_value=0)
    label = parsed_features["label"]
    return feature, label

print("Loading the training datasets...")
tf_train_dataset = tf.data.TFRecordDataset("../iris/iris_train.tfrecord")
print("Parsing the training datasets...")
tf_train_dataset = tf_train_dataset.map(_parse_function)
print("Verifying types and shapes...")
print(tf_train_dataset.output_types)
print(tf_train_dataset.output_shapes)
print("Loading the test datasets...")
tf_test_dataset = tf.data.TFRecordDataset("../iris/iris_test.tfrecord")
print("Parsing the test datasets...")
tf_test_dataset = tf_test_dataset.map(_parse_function)
print("Verifying types and shapes...")
print(tf_test_dataset.output_types)
print(tf_test_dataset.output_shapes)

Loading the training datasets...
Parsing the training datasets...
Verifying types and shapes...
(tf.float32, tf.int64)
(TensorShape([Dimension(None)]), TensorShape([]))
Loading the test datasets...
Parsing the test datasets...
Verifying types and shapes...
(tf.float32, tf.int64)
(TensorShape([Dimension(None)]), TensorShape([]))


In [2]:
with tf.name_scope('train_dataset'):
    batch_size = 100
    num_epochs = 1000
    tf_train_dataset = tf_train_dataset.batch(batch_size)
    tf_train_dataset = tf_train_dataset.repeat(num_epochs)
    iterator = tf_train_dataset.make_one_shot_iterator()
    next_batch_features, next_batch_labels = iterator.get_next()

In [9]:
input_size = 4 #sepal length, sepal width, petal length, petal width
number_centroids = 3 #the number of centroids is equal to the number of the classes

#The input comes from the dataset iterator; a placeholder holds the initial centroids
input_placeholder = next_batch_features
initial_centroids_placeholder = tf.placeholder(dtype=tf.float32, shape=[input_size, number_centroids])

#Matrix containing the centroids (one centroid per column)
kmeans_matrix = tf.Variable(tf.random_uniform(shape=[input_size, number_centroids], 
                                              minval=0.0, maxval=1.0, dtype=tf.float32))
assign_centroids_op = kmeans_matrix.assign(initial_centroids_placeholder)

#Here the distance is estimated and the BMU computed
difference = tf.expand_dims(input_placeholder, axis=1) - tf.expand_dims(tf.transpose(kmeans_matrix), axis=0)
euclidean_distance = tf.norm(difference, ord='euclidean', axis=2) #shape=(?, 3)
bmu_index = tf.argmin(euclidean_distance, axis=1) #get the index of the BMU
bmu = tf.gather(kmeans_matrix, indices=bmu_index, axis=1) #take the centroids, shape=(input_size, ?)

#To minimize: within-cluster sum of squares (reconstruction error)
loss = tf.reduce_mean(tf.pow(input_placeholder - tf.transpose(bmu), 2))

#optimizer = tf.train.GradientDescentOptimizer(learning_rate=0.1)
#optimizer = tf.train.AdamOptimizer(learning_rate=0.05)
optimizer = tf.train.RMSPropOptimizer(learning_rate=0.1)

#Training operation
train_op = optimizer.minimize(loss=loss, global_step=tf.train.get_global_step())

In [12]:
sess = tf.Session()
sess.run(tf.global_variables_initializer())
print("\nkmeans_matrix before assignment")
print(sess.run([kmeans_matrix]))
#Take one batch from the iterator; its first rows initialize the centroids
input_array = sess.run([next_batch_features])[0]
sess.run([assign_centroids_op], 
         {initial_centroids_placeholder: input_array[0:number_centroids,:].T})
print("\nkmeans_matrix after assignment")
print(sess.run([kmeans_matrix]))

print("\nLoss before training")
output = sess.run([loss])
print(output)

print("\nStarting training...")
while True:
    try:
        output = sess.run([train_op, loss])
        print("Loss: " + str(output[1]))
    except tf.errors.OutOfRangeError:
        break
        
print("\nkmeans_matrix final")
print(sess.run([kmeans_matrix]))


kmeans_matrix before assignment
[array([[0.94429076, 0.33206284, 0.8901272 ],
       [0.24301016, 0.5296408 , 0.25928485],
       [0.9749198 , 0.8789613 , 0.9710479 ],
       [0.28901565, 0.8512291 , 0.80947495]], dtype=float32)]

kmeans_matrix after assignment
[array([[5.5, 4.9, 6.9],
       [3.5, 3. , 3.1],
       [1.3, 1.4, 5.1],
       [0.2, 0.2, 2.3]], dtype=float32)]

Loss before training
[0.3781]

Starting training...
Loss: 0.3781
Loss: 0.3708525
Loss: 0.36320505
Loss: 0.3556881
Loss: 0.3483276
Loss: 0.34114778
Loss: 0.33417183
Loss: 0.32742205
Loss: 0.32091957
Loss: 0.31468362
Loss: 0.3087318
Loss: 0.30307946
Loss: 0.29773933
Loss: 0.2927216
Loss: 0.28803366
Loss: 0.2836796
Loss: 0.27966052
Loss: 0.27597445
Loss: 0.27261636
Loss: 0.26957813
Loss: 0.2668487
Loss: 0.26441464
Loss: 0.26226008
Loss: 0.2603674
Loss: 0.25871715
Loss: 0.25728917
Loss: 0.25594294
Loss: 0.25453267
Loss: 0.25335968
Loss: 0.25239164
Loss: 0.25159797
Loss: 0.25095072
Loss: 0.25042483
Loss: 0.24999787
Loss
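
Once a centroid matrix is available, new observations can be assigned to a cluster by repeating the Assignment step on its own. The NumPy sketch below is a hedged illustration: the function name `assign_clusters` and the two sample measurements are made up, and because the final matrix is not shown in the output above, the centroids used here are the ones from the "after assignment" printout.

```python
import numpy as np

# Centroid matrix copied from the "kmeans_matrix after assignment" printout,
# shape (input_size, number_centroids): each column is one centroid
kmeans_matrix = np.array([[5.5, 4.9, 6.9],
                          [3.5, 3.0, 3.1],
                          [1.3, 1.4, 5.1],
                          [0.2, 0.2, 2.3]], dtype=np.float32)

def assign_clusters(samples, kmeans_matrix):
    """Return the index of the nearest centroid (BMU) for every row of samples."""
    # (n, 1, input_size) - (1, number_centroids, input_size)
    difference = samples[:, None, :] - kmeans_matrix.T[None, :, :]
    return np.linalg.norm(difference, axis=2).argmin(axis=1)

# Hypothetical measurements: sepal length, sepal width, petal length, petal width
samples = np.array([[5.1, 3.5, 1.4, 0.2],
                    [6.7, 3.0, 5.2, 2.3]], dtype=np.float32)
cluster_index = assign_clusters(samples, kmeans_matrix)
print(cluster_index)
```

The returned indices refer to centroid columns, not to class labels: k-means is unsupervised, so matching a cluster to Setosa, Versicolor or Virginica requires a separate comparison with the labels.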

**Copyright (c) 2018** Massimiliano Patacchiola, MIT License