# COMP 527: Implementing the k-means clustering algorithm

> In the assignment, you are required to cluster words belonging to four categories: animals, countries, fruits and veggies. The words are arranged into four different files. The first entry in each line is a word followed by 300 features (word embedding) describing the meaning of that word.

## Implementation
 
> (1) Implement the k-means clustering algorithm with Euclidean distance to cluster the instances into k clusters. (30 marks)
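
Before the notebook's own cells, here is a minimal, self-contained sketch of the standard k-means loop (Lloyd's algorithm) that the task describes. It is an illustration under stated assumptions, not the notebook's final implementation: the function name `kmeans`, the `max_iter` cap, and the choice to start from k randomly picked data points are all assumptions of this sketch, and the toy data is invented for the example.

```python
import numpy as np


def euclidean(u, v):
    """Euclidean distance between two 1-D arrays."""
    return np.sqrt(np.sum((u - v) ** 2))


def kmeans(points, k, metric=euclidean, max_iter=100, seed=1):
    """Cluster the rows of `points` into k groups (Lloyd's algorithm).

    Returns (labels, centroids). Initial centroids are k distinct data
    points chosen at random, which is an assumption of this sketch.
    """
    rng = np.random.default_rng(seed)
    centroids = points[rng.choice(len(points), size=k, replace=False)].astype(float)
    labels = np.full(len(points), -1)

    for _ in range(max_iter):
        # Assignment step: each point joins its nearest centroid.
        new_labels = np.array(
            [np.argmin([metric(p, c) for c in centroids]) for p in points]
        )
        if np.array_equal(new_labels, labels):
            break  # no point changed cluster, so the algorithm has converged
        labels = new_labels

        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = points[labels == j]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[j] = members.mean(axis=0)

    return labels, centroids


# Toy data: two well-separated blobs in 2-D (illustrative values only).
toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans(toy, k=2)
print(labels)
```

Passing the distance as a `metric` argument keeps the loop unchanged when a different distance function is swapped in for the later parts of the assignment.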

In [1]:
import numpy as np
import scipy
import matplotlib.pyplot as plt

In [2]:
class Word():
    """Object class for a categorized word with data vector."""
    
    def __init__(self, name, vector, category):
        self.name = name
        self.vector = vector
        self.category = category
    
    def __repr__(self):
        return f'word: {self.name}'

In [3]:
def read_data(list_of_filenames):
    """Read words and their embedding vectors from the given files.

    Each line holds a word followed by its space-separated features;
    the filename is used as the word's category.
    """

    collection = []

    for filename in list_of_filenames:
        # The context manager closes the file once it has been read.
        with open(filename) as f:
            for line in f:
                split = line.split()
                if not split:  # skip blank lines
                    continue
                name = split[0]
                vector = np.array([float(x) for x in split[1:]])
                collection.append(Word(name, vector, filename))

    return collection

In [4]:
categories = ['animals', 'countries', 'fruits', 'veggies']
words = read_data(categories)
print(len(words))

329


In [6]:
category = {}
for c in categories:
    category[c] = []
    for w in words:
        if w.category == c:
            category[c].append(w)
    print(c, len(category[c]))

animals 50
countries 161
fruits 58
veggies 60


In [8]:
for w in words:
    if w.vector.shape != words[0].vector.shape:
        print('ERROR', w)

In [10]:
def euclidean_distance(u, v):
    """Return Euclidean distance between two np.array vectors."""

    return np.sqrt( (u - v).dot( u - v ))

In [11]:
def manhattan_distance(u, v):
    """Return Manhattan distance between two np.array vectors."""

    # Sum of absolute coordinate differences, computed without a Python loop.
    return np.abs(u - v).sum()

In [12]:
def cosine_similarity(u, v):
    """Return Cosine similarity of two np.array vectors."""
    
    return u.dot(v)/( np.sqrt(u.dot(u)) * np.sqrt(v.dot(v)) )

In [13]:
def normalize(u):
    """Return normalized vector (ie. parallel vector with unit magnitude)."""
    
    return u / np.sqrt( u.dot(u) )

In [15]:
euclidean_distance(category['animals'][0].vector, category['fruits'][3].vector)

7.609447771107847

In [None]:
class KMeans():
    
    def __init__(self, k = 5, data = words, metric = euclidean_distance, normalize = False, iterations = 10**10, seed = 1):
        """
        Initialize KMeans Model.
        
        Args:
            k (int): number of clusters to divide data into.
            data (list): list of Word objects, each with the
                attributes 'name' (string) and 'vector' (np.ndarray).
            metric (function): to measure distance between points.
            normalize (Boolean): whether or not to normalize vectors.
            iterations (int): when to stop if no convergence.
            seed (int32): for reproducible (pseudo-)randomness.
        """
        
        self.k = k
        self.data = data
        self.metric = metric
        self.normalize = normalize
        self.iterations = iterations
        self.seed = seed
        self._upperbound, self._lowerbound = self._bounds()
        self._centroids = self._start()
        self._iteration = 0
        self.labels = {}
    
    def _bounds(self):
        """Find upper and lower bounds of data space."""
        
        # Stack the vectors into an (n, d) matrix and take per-feature extremes.
        # Starting from zeros would force every bound to include 0.
        matrix = np.stack([d.vector for d in self.data])
        
        return matrix.max(axis=0), matrix.min(axis=0)
    
    def _start(self):
        """Generate starting positions for k centroids."""
        
        # Assumption: draw each centroid uniformly at random inside the
        # bounding box of the data; the original body was left unfinished.
        rng = np.random.default_rng(self.seed)
        
        return rng.uniform(self._lowerbound, self._upperbound, size=(self.k, len(self._upperbound)))

In [75]:
upper = np.zeros(len(words[0].vector))
lower = np.zeros(len(words[0].vector))

In [78]:
data = words
for d in data:
    for i, x in enumerate(d.vector):
        upper[i] = max(upper[i], x)
        lower[i] = min(lower[i], x)

In [80]:
upper

array([1.2946 , 0.9564 , 0.92792, 0.85099, 1.2966 , 0.8382 , 0.83656,
       0.96064, 0.95663, 0.9745 , 0.99057, 1.0373 , 1.5271 , 0.90659,
       1.5421 , 1.0302 , 1.1887 , 0.73925, 0.9368 , 0.97064, 0.94997,
       1.3882 , 0.74385, 0.86281, 0.91948, 0.66668, 0.84123, 1.0959 ,
       0.88543, 0.77246, 1.4636 , 1.226  , 1.3733 , 1.4071 , 1.0275 ,
       1.2237 , 1.0922 , 0.80705, 0.84601, 1.1418 , 1.3081 , 0.86901,
       1.0584 , 1.2829 , 0.67847, 0.99822, 0.9105 , 0.64708, 0.88737,
       0.9553 , 1.0997 , 0.70873, 0.9406 , 1.0906 , 0.80305, 1.0234 ,
       0.98461, 0.96752, 1.0902 , 0.72347, 0.76898, 1.0795 , 0.95018,
       0.86595, 1.1526 , 1.0572 , 1.1124 , 0.93094, 0.61421, 1.0535 ,
       0.79016, 1.0957 , 0.90486, 1.065  , 1.0284 , 1.0057 , 0.73638,
       0.97551, 0.98087, 1.0967 , 0.93825, 1.1122 , 0.96254, 0.97279,
       0.41297, 0.95982, 1.0984 , 0.92698, 1.172  , 1.2084 , 0.77948,
       1.1452 , 0.8467 , 0.87888, 1.1526 , 0.93815, 0.57699, 0.98814,
       1.0946 , 0.82

In [81]:
lower

array([-0.99684, -0.91587, -1.0514 , -1.4107 , -0.88439, -1.6586 ,
       -2.8027 , -0.95706, -1.0216 , -2.0161 , -1.3544 , -0.71916,
       -0.78765, -1.306  , -0.85387, -0.95628, -0.99129, -1.2773 ,
       -0.97502, -0.74652, -0.88368, -1.1614 , -0.68309, -0.79758,
       -0.9627 , -1.0844 , -1.0948 , -1.1589 , -0.8113 , -0.76359,
       -1.6845 , -0.91243, -0.84475, -0.89747, -0.4943 , -0.86348,
       -1.084  , -0.85391, -0.93993, -0.74393, -0.55972, -0.89634,
       -1.1886 , -1.0417 , -0.74994, -0.82764, -0.89737, -1.0151 ,
       -1.038  , -1.1633 , -0.84163, -0.93281, -0.81178, -0.60436,
       -1.4898 , -1.1582 , -1.1715 , -1.0008 , -0.5402 , -1.1432 ,
       -0.98529, -1.1117 , -1.0757 , -0.7376 , -1.3685 , -0.75747,
       -0.84723, -0.87765, -1.2229 , -0.79086, -1.3877 , -1.0092 ,
       -0.83267, -0.72224, -0.59768, -0.81945, -0.92068, -0.82162,
       -0.99162, -0.94463, -0.86352, -1.4489 , -0.75855, -1.5452 ,
       -1.3396 , -1.2483 , -0.97945, -1.5953 , -1.011  , -1.24

In [58]:
min(-1,0,1)

-1

In [63]:
lower[0]

0.08111

In [65]:
words[1].vector[0]

0.47727

In [71]:
lower[0] = min(lower[0], words[1].vector[0])

In [72]:
lower[0]

0.08111

In [60]:
lower[i] = min(lower[i], words[j].vector[i])

In [70]:
min.__doc__

'min(iterable, *[, default=obj, key=func]) -> value\nmin(arg1, arg2, *args, *[, key=func]) -> value\n\nWith a single iterable argument, return its smallest item. The\ndefault keyword-only argument specifies an object to return if\nthe provided iterable is empty.\nWith two or more arguments, return the smallest argument.'

In [76]:
for j in range(5):
    for i in range(5):
        upper[i] = max([upper[i], words[j].vector[i]])
        lower[i] = min([lower[i], words[j].vector[i]])
    print(j, words[j].vector[0:5],'\n', upper[0:5],'\n', lower[0:5],'\n', '\n\n\n')

0 [ 0.08111  -0.50285  -0.055975  0.45965  -0.30271 ] 
 [0.08111 0.      0.      0.45965 0.     ] 
 [ 0.       -0.50285  -0.055975  0.       -0.30271 ] 
 



1 [ 0.47727 -0.91587 -0.2977  -0.22489  0.55337] 
 [0.47727 0.      0.      0.45965 0.55337] 
 [ 0.      -0.91587 -0.2977  -0.22489 -0.30271] 
 



2 [-0.33575  0.38897 -0.41929 -0.33219  0.5317 ] 
 [0.47727 0.38897 0.      0.45965 0.55337] 
 [-0.33575 -0.91587 -0.41929 -0.33219 -0.30271] 
 



3 [ 0.2111   0.21763 -0.52638 -0.42277  0.84672] 
 [0.47727 0.38897 0.      0.45965 0.84672] 
 [-0.33575 -0.91587 -0.52638 -0.42277 -0.30271] 
 



4 [ 0.08111  -0.50285  -0.055975  0.45965  -0.30271 ] 
 [0.47727 0.38897 0.      0.45965 0.84672] 
 [-0.33575 -0.91587 -0.52638 -0.42277 -0.30271] 
 






In [34]:
w = np.array([0,1,0])

In [36]:
w[1] = max(0,2)

In [37]:
w

array([0, 2, 0])

In [27]:
for i, x in enumerate(words[2].vector): print(i,x)

0 -0.33575
1 0.38897
2 -0.41929
3 -0.33219
4 0.5317
5 -0.25839
6 -2.3869
7 -0.43443
8 -0.3976
9 -0.99356
10 0.47093
11 -0.16265
12 -0.13474
13 -1.306
14 0.34694
15 0.1215
16 -0.15811
17 -0.011231
18 -0.4656
19 -0.18031
20 0.026682
21 -0.028445
22 -0.44228
23 0.20955
24 0.044307
25 0.27514
26 -0.2314
27 -0.10864
28 -0.0087113
29 0.20522
30 0.36109
31 -0.35431
32 0.25217
33 0.26608
34 0.11942
35 -0.21606
36 0.073164
37 0.25023
38 0.24612
39 0.20797
40 -0.18702
41 -0.038054
42 0.23604
43 0.42484
44 0.10187
45 0.058443
46 -0.60782
47 -0.52279
48 -0.026276
49 -0.14402
50 0.22169
51 -0.1585
52 -0.81178
53 0.082893
54 -0.022136
55 -0.12966
56 0.17201
57 0.62484
58 -0.023122
59 -0.15704
60 -0.41946
61 -0.49499
62 0.056224
63 -0.081352
64 0.35428
65 0.15145
66 -0.26535
67 0.10071
68 -1.0047
69 0.34271
70 -0.003079
71 0.35994
72 0.4007
73 0.1518
74 0.11983
75 -0.30275
76 0.13739
77 -0.36725
78 0.3665
79 0.31037
80 0.513
81 0.20102
82 -0.34841
83 0.28565
84 -0.48071
85 0.21667
86 -0.37125
87 0.60

In [22]:
lower

array([-5.8577e-01, -3.7071e-01, -1.2452e-01, -6.0234e-01,  7.0299e-01,
       -1.4603e+00, -5.2778e-01, -3.5435e-02, -4.3165e-01, -1.0401e+00,
        2.6789e-01, -2.9573e-01,  8.7415e-01,  2.4446e-01,  3.6380e-01,
       -4.8924e-01, -3.0546e-01, -1.1816e+00,  5.2453e-01,  4.4108e-02,
        2.6787e-01, -1.5690e-02,  1.3511e-01,  3.5650e-01, -2.2939e-01,
        6.1426e-02, -2.4105e-01, -1.1017e-01, -5.1229e-01, -4.0380e-02,
        8.3256e-01, -2.5489e-01,  2.3119e-01,  4.6005e-02,  2.5584e-01,
       -1.7051e-01, -7.3765e-01,  5.6647e-01, -2.5144e-01,  6.0860e-01,
        3.7638e-01, -1.5358e-01,  2.3254e-01,  1.9730e-01, -7.8956e-03,
        1.4744e-01, -5.6575e-02, -6.7243e-03,  3.1051e-01, -6.7962e-01,
       -1.6831e-01,  2.4753e-01, -2.5611e-01,  6.4161e-01, -7.7726e-01,
       -3.1699e-01,  6.5896e-02,  4.7737e-01,  1.3933e-01,  3.9412e-01,
        2.4505e-02,  6.0498e-01, -4.8965e-01, -7.2412e-02, -5.4783e-01,
       -7.5747e-01, -8.4723e-01,  6.1058e-02, -1.8614e-01, -4.91

In [241]:
max

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0.

In [37]:
type(words[0].vector[0])

numpy.float64

In [31]:
number = '0.53533'

In [33]:
float(number)

0.53533

In [20]:
len(category['animals'])

50

In [13]:
with open('animals') as f:
    animals = f.read().split('\n')[:-1]

In [15]:
len(animals)

50

In [8]:
animals = animals.split('\n')

AttributeError: 'list' object has no attribute 'split'


## Initial Computation

> (2) Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
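
The task does not say how precision and recall are defined for a clustering. One common reading is pair counting: every pair of words is a decision, and a pair placed in the same cluster is a true positive when both words share a category. The sketch below follows that reading; the function name `pairwise_scores` and the toy labels are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_scores(true_labels, cluster_labels):
    """Pair-counting precision, recall and F-score of a clustering."""
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_class = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_class:
            tp += 1  # pair correctly grouped together
        elif same_cluster:
            fp += 1  # pair grouped together but from different categories
        elif same_class:
            fn += 1  # pair from the same category but split apart
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score


# Illustrative labels only: four words, two categories, two clusters.
truth = ['animals', 'animals', 'fruits', 'fruits']
clusters = [0, 0, 0, 1]
print(pairwise_scores(truth, clusters))
```

Under this definition k = 1 always gives a recall of 1, because no same-category pair is ever split, which is worth keeping in mind when reading the plots.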

## Normalize

> (3) Now re-run the k-means clustering algorithm you implemented in part (1) but normalise each feature vector to unit L2 length before computing Euclidean distances. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
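
As a rough sketch of the normalisation step, the helper below scales every row of a matrix to unit L2 length in one call. The name `normalize_rows` and the sample vectors are assumptions for illustration, and the code assumes no row is the zero vector. The last lines check a standard identity: for unit vectors, squared Euclidean distance equals 2 minus twice the cosine similarity.

```python
import numpy as np


def normalize_rows(matrix):
    """Scale each row of `matrix` to unit L2 length (rows must be non-zero)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / norms


# Illustrative vectors only.
vectors = np.array([[3.0, 4.0], [1.0, 0.0], [-2.0, 2.0]])
unit = normalize_rows(vectors)

u, v = unit[0], unit[2]
squared_distance = np.sum((u - v) ** 2)
cosine = u.dot(v)
# For unit vectors, squared Euclidean distance equals 2 - 2 * cosine similarity.
print(np.isclose(squared_distance, 2 - 2 * cosine))
```

Because of that identity, Euclidean distance on normalised vectors ranks neighbours in the same order as cosine similarity.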

## Manhattan Distance

> (4) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use Manhattan distance over the unnormalised feature vectors. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
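
A possible way to run the assignment step with Manhattan distance is to compute all point-to-centroid distances at once with broadcasting, as sketched below. The function name `manhattan_to_centroids` and the sample arrays are assumptions for illustration.

```python
import numpy as np


def manhattan_to_centroids(points, centroids):
    """Return an (n, k) matrix of L1 distances from points to centroids."""
    # points[:, None, :] has shape (n, 1, d); centroids[None, :, :] is (1, k, d).
    return np.abs(points[:, None, :] - centroids[None, :, :]).sum(axis=2)


# Illustrative values only.
points = np.array([[0.0, 0.0], [1.0, 2.0], [4.0, 4.0]])
centroids = np.array([[0.0, 1.0], [4.0, 3.0]])

distances = manhattan_to_centroids(points, centroids)
labels = distances.argmin(axis=1)  # index of the nearest centroid per point
print(distances)
print(labels)
```

One design point to be aware of: the mean minimises summed squared Euclidean distance, whereas the component-wise median minimises summed Manhattan distance (the k-medians variant). The task text does not say which update rule is expected, so keeping the mean update from part (1) is the literal reading.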

## Normalized Manhattan Distance

> (5) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use Manhattan distance with L2 normalised feature vectors. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)

## Cosine Similarity

> (6) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use cosine similarity as the distance (similarity) measure. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
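
Cosine similarity grows as vectors become more alike, so the "nearest centroid" rule has to pick the largest similarity rather than the smallest distance (equivalently, the smallest value of 1 minus the similarity). A minimal sketch of that assignment step follows; the function name `cosine_assign` and the sample arrays are assumptions for illustration, and the code assumes no zero vectors.

```python
import numpy as np


def cosine_assign(points, centroids):
    """Assign each row of `points` to the centroid with highest cosine similarity."""
    unit_points = points / np.linalg.norm(points, axis=1, keepdims=True)
    unit_centroids = centroids / np.linalg.norm(centroids, axis=1, keepdims=True)
    similarity = unit_points @ unit_centroids.T  # shape (n, k)
    return similarity.argmax(axis=1)


# Illustrative values only: the first two points share a direction
# but differ in length, so they land in the same cluster.
points = np.array([[1.0, 0.1], [10.0, 1.0], [0.2, 3.0]])
centroids = np.array([[1.0, 0.0], [0.0, 1.0]])
print(cosine_assign(points, centroids))
```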

## Compare and Discuss

> (7) Comparing the different clusterings you obtained in (2)-(6) discuss what is the best setting for k-means clustering for this dataset. (20 marks)