# COMP 527: Implementing the k-means clustering algorithm

> In the assignment, you are required to cluster words belonging to four categories: animals, countries, fruits and veggies. The words are arranged into four different files. The first entry in each line is a word followed by 300 features (word embedding) describing the meaning of that word.

## Implementation
 
> (1) Implement the k-means clustering algorithm with Euclidean distance to cluster the instances into k clusters. (30 marks)

In [1]:
import numpy as np
# import scipy
import matplotlib.pyplot as plt

In [3]:
class Word():
    """Object class for a categorized word with data vector."""
    
    def __init__(self, name, vector, category):
        self.name = name
        self.vector = vector
        self.category = category
    
    def __repr__(self):
        return f'word: {self.name}'

In [2]:
def read_data(list_of_filenames):
    """Read in data."""

    collection = []

    for filename in list_of_filenames:
        data = open(filename).read().split('\n')[:-1]

        for word_data in data:
            split = word_data.split(' ')
            name = split[0]
            raw_list = split[1:]

            floats = []
            for x_string in raw_list:
                floats.append(float(x_string))

            vector = np.array(floats)

            collection.append( Word(name, vector, filename))

    return collection

In [4]:
categories = ['animals', 'countries', 'fruits', 'veggies']
words = read_data(categories)
print(len(words))

329


In [5]:
category = {}
for c in categories:
    category[c] = []
    for w in words:
        if w.category == c:
            category[c].append(w)
    print(c, len(category[c]))

animals 50
countries 161
fruits 58
veggies 60


In [6]:
for w in words:
    if w.vector.shape != words[0].vector.shape:
        print('ERROR', w)

In [7]:
words[0].vector[0:2]

array([-0.015926, -0.079864])

In [8]:
def flatten(data):
    """Return two-dimensional vectors."""

    flat = []

    for d in data:
        two_dim = d.vector[0:2]
        flat.append(Word(d.name, two_dim, d.category))

    return flat

In [9]:
flat = flatten(words)

In [10]:
flat[0].vector

array([-0.015926, -0.079864])

In [11]:
def euclidean_distance(u, v):
    """Return Euclidean distance between two np.array vectors."""

    return np.sqrt( (u - v).dot( u - v ))

In [12]:
def manhattan_distance(u, v):
    """Return Manhattan distance between two np.array vectors."""

    w = u - v
    distance = 0
    for x in w:
        distance += abs(x)
    
    return distance

In [13]:
def cosine_similarity(u, v):
    """Return Cosine similarity of two np.array vectors."""
    
    return u.dot(v)/( np.sqrt(u.dot(u)) * np.sqrt(v.dot(v)) )

In [14]:
def normalize(data):
    """Return normalized vectors (ie. parallel vector with unit magnitude)."""
    
    normalized_data = []
    
    for d in data:
        normalized_vector = d.vector / np.sqrt( d.vector.dot(d.vector) )
        normalized_data.append(Word(d.name, normalized_vector, d.category))
        
    return normalized_data

In [15]:
metrics = euclidean_distance, manhattan_distance, cosine_similarity

In [16]:
a = np.array([0,1])

In [17]:
b = np.array([1,0])

In [18]:
for metric in metrics:
    print(metric(a,b))

1.4142135623730951
2
0.0


In [22]:
class KMeans():
    
    def __init__(
                self, 
                k = 4, 
                data = words, 
                metric = euclidean_distance, 
                normalize = False, 
                max_iterations = 10**3, 
                seed = None,
                ):
        """
        Initialize KMeans Model.
        
        Args:
            k (int): number of clusters to divide data into.
            data (list): list of dicts which must each include
                the keys 'name' (string) and 'vector' (np.ndarray).
            metric (function): to measure distance between points.
            normalize (Boolean): whether or not to normalize vectors.
            iterations (int): when to stop if no convergence.
            seed (int): for reproducible (pseudo-)randomness.
        """
        
        self.k = k
        
        if normalize:
            self.data = normalize(data)
        else:
            self.data = data
        
        self.metric = metric
        
        self._upperbound, self._lowerbound = self._bounds()
        
        if seed:
            np.random.seed(seed)
        
        # we track centroid positions and cluster labels in nested dicts,
        # of the form dict_name[centroid_number][iteration_number]
        self._centroid = {}
        for centroid_number in range(k):
            self._centroid[centroid_number] = {}
        
        self._cluster = {}

        # we record cluster labels explicitly as well
        self._label = {}
        for datum in data:
            self._label[datum.name] = {}
        
        self.max_iterations = max_iterations
        for i in range(self.max_iterations):
            self._iteration = i
            self._iterate()
            if i > 0 and self._cluster[i] == self._cluster[i-1]:
                break
            
        
    def _bounds(self):
        """Find upper and lower bounds of data space."""
        
        upper = np.zeros(len(words[0].vector))
        lower = np.zeros(len(words[0].vector))
        
        for d in self.data:
            for i, x in enumerate(d.vector):
                upper[i] = max(upper[i], x)
                lower[i] = min(lower[i], x)
        
        return upper, lower
    
    
    def _start(self):
        """Generate starting positions for k centroids."""
        
        for centroid_number in range(self.k):
            self._centroid[centroid_number][0] = self._lowerbound \
                + np.random.random() * (self._upperbound - self._lowerbound)
        print('centroids successfully positioned')
    
    
    def _classify(self):
        """Assign each data point to cluster of nearest centroid."""
        
        self._cluster[self._iteration] = {}
        for centroid_number in range(self.k):
            self._cluster[self._iteration][centroid_number] = []
        
        for d in self.data:
            distances = [] 
            
            for centroid_number in range(self.k):
                
                distances.append(self.metric(d.vector, self._centroid[centroid_number][self._iteration]))
            
            closest_centroid = np.argmin(distances)
            
            self._cluster[self._iteration][closest_centroid].append(d)
            self._label[d.name][self._iteration] = closest_centroid
        
        print(f'classification successful for iteration {self._iteration}')

            
    def _reposition(self):
        """Move centroids to mean of each cluster."""
        
        for centroid_number in range(self.k):
            
            clustered = self._cluster[self._iteration - 1][centroid_number]
            
            if len(clustered) > 0:
                vector_sum = np.zeros(len(clustered[0].vector))
                
                for datum in clustered:
                    vector_sum += datum.vector

                cluster_mean = vector_sum / len(clustered)

                self._centroid[centroid_number][self._iteration] = cluster_mean

            else:
                # nothing assigned to this cluster
                self._centroid[centroid_number][self._iteration] = self._centroid[centroid_number][self._iteration -1] 
        
        print(f'repositioning successful for iteration {self._iteration}')
            
            
    def _stop():
        """Stop iterating and return results."""
        
        return 'done'
    
    def _iterate(self):
        """Position centroids and classify data by nearest centroid."""
        
        if self._iteration == 0:
            self._start()
        else:
            self._reposition()
        
        if self._iteration == self.max_iterations:
            self._stop()
        else:
            self._classify()
            
        

In [25]:
model = KMeans(data=flatten(words))

centroids successfully positioned


ValueError: operands could not be broadcast together with shapes (2,) (300,) 

In [26]:
upper = np.zeros(len(words[0].vector))
lower = np.zeros(len(words[0].vector))

In [27]:
data = words
for d in data:
    for i, x in enumerate(d.vector):
        upper[i] = max(upper[i], x)
        lower[i] = min(lower[i], x)

In [28]:
centroids[0][0] #= \
#                     lower + np.random.random() * (upper - lower)

NameError: name 'centroids' is not defined

In [None]:
lower

In [None]:
min(-1,0,1)

In [None]:
lower[0]

In [None]:
words[1].vector[0]

In [None]:
lower[0] = min(lower[0], words[1].vector[0])

In [None]:
lower[0]

In [None]:
lower[i] = min(lower[i], words[j].vector[i])

In [None]:
min.__doc__

In [None]:
for j in range(5):
    for i in range(5):
        upper[i] = max([upper[i], words[j].vector[i]])
        lower[i] = min([lower[i], words[j].vector[i]])
    print(j, words[j].vector[0:5],'\n', upper[0:5],'\n', lower[0:5],'\n', '\n\n\n')

In [None]:
wo

In [None]:
w = np.array([0,1,0])

In [None]:
w[1] = max(0,2)

In [None]:
w

In [None]:
for i, x in enumerate(words[2].vector): print(i,x)

In [None]:
lower

In [None]:
max

In [None]:
type(words[0]['vector'][0])

In [None]:
number = '0.53533'

In [None]:
float(number)

In [None]:
len(data['animals'])

In [None]:
]


animal = open('animals').read().split('\n')[:-1]

In [None]:
for a in animals:
    

In [None]:
animals = animals.split('\n')

In [None]:
a

## Initial Computation

> (2) Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)

## Normalize

> (3) Now re-run the k-means clustering algorithm you implemented in part (1) but normalise each feature vector to unit L2 length before computing Euclidean distances. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)

## Manhattan Distance

> (4) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use Manhattan distance over the unnormalised feature vectors. Vary the value of k from 1 to 10
and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)

## Normalized Manhattan Distance

> (5) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use Manhattan distance with L2 normalised feature vectors. Vary the value of k from 1 to 10 and
compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)

##  Cosine Similarity

> (6) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use cosine similarity as the distance (similarity) measure.Vary the value of k from 1 to 10 andcompute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)

## Compare and Discuss

> (7) Comparing the different clusterings you obtained in (2)-(6) discuss what is the best setting for k-means clustering for this dataset. (20 marks)