# COMP 527: Implementing the k-means clustering algorithm

> In the assignment, you are required to cluster words belonging to four categories: animals, countries, fruits and veggies. The words are arranged into four different files. The first entry in each line is a word followed by 300 features (word embedding) describing the meaning of that word.

## Implementation
 
> (1) Implement the k-means clustering algorithm with Euclidean distance to cluster the instances into k clusters. (30 marks)

In [2]:
import numpy as np
import scipy
import matplotlib.pyplot as plt

In [197]:
class Word():
    """Object class for a categorized word with data vector."""
    
    def __init__(self, name, vector, category):
        self.name = name
        self.vector = vector
        self.category = category
    
    def __repr__(self):
        return f'word: {self.name}'

In [198]:
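def read_data(list_of_filenames):
    """Read words and their embedding vectors from a list of files.

    Each line holds a word followed by its space-separated features.
    The filename doubles as the word's category.
    """

    collection = []

    for filename in list_of_filenames:
        # Use a context manager so the file is closed after reading.
        with open(filename) as f:
            data = f.read().split('\n')[:-1]

        for word_data in data:
            split = word_data.split(' ')
            name = split[0]
            # Convert the remaining fields to a float vector.
            vector = np.array(split[1:], dtype=float)

            collection.append(Word(name, vector, filename))

    return collection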

In [199]:
categories = ['animals', 'countries', 'fruits', 'veggies']
words = read_data(categories)
print(len(words))

329


In [200]:
sub = {}
for c in categories:
    sub[c] = []
    for w in words:
        if w.category == c:
            sub[c].append(w)
    print(c, len(sub[c]))

animals 50
countries 161
fruits 58
veggies 60


In [172]:
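# read_data expects a list of filenames; a bare string would be iterated character by character.
animals = read_data(['animals'])
countries = read_data(['countries'])
fruits = read_data(['fruits'])
veggies = read_data(['veggies'])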

In [173]:
categories = ['animals', 'countries', 'fruits', 'veggies']
# Keyed by category; a separate name keeps the flat `words` list intact.
words_by_category = {}
for c in categories:
    words_by_category[c] = read_data([c])

In [206]:
big_values = []
small_values = []
negative_values = []
for w in words:
    big_values.append(np.amax(w.vector))
    negative_values.append(np.amin(w.vector))
    small_values.append(np.amin(abs(w.vector)))

In [211]:
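# One entry per embedding dimension; separate arrays so the two names do not alias.
n_features = words[0].vector.shape[0]
space_max = np.zeros(n_features)
space_min = np.zeros(n_features)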

In [219]:
for w in words:
    if w.vector.shape != words[0].vector.shape:
        print('ERROR', w)

In [None]:
# Per-dimension maximum and minimum over all word vectors.
vectors = np.array([w.vector for w in words])
space_max = vectors.max(axis=0)
space_min = vectors.min(axis=0)

In [210]:
for w in words:
    space_max = np.amax(w.vector)

-0.91587

In [208]:
min(negative_values)

-3.3685

In [203]:
min(small_values)

3.5616e-06

In [177]:
def euclidean_distance(u, v):
    """Return Euclidean distance between two np.array vectors."""

    return np.sqrt( (u - v).dot( u - v ))

In [178]:
def manhattan_distance(u, v):
    """Return Manhattan distance between two np.array vectors."""

    # Sum of absolute coordinate differences, vectorized.
    return np.abs(u - v).sum()

In [179]:
def cosine_similarity(u, v):
    """Return Cosine similarity of two np.array vectors."""
    
    return u.dot(v)/( np.sqrt(u.dot(u)) * np.sqrt(v.dot(v)) )

In [180]:
def normalize(u):
    """Return normalized vector (ie. parallel vector with unit magnitude)."""
    
    return u / np.sqrt( u.dot(u) )

In [186]:
euclidean_distance(animals[0].vector, fruits[3].vector)

7.609447771107847

In [None]:
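class KMeans():
    
    def __init__(self, k = 5, data = None, metric = euclidean_distance, normalize = False, iterations = 10**10):
        """
        Initialize KMeans Model.
        
        Args:
            k (int): number of clusters to divide data into.
            data (list): list of Word objects, each with the
                attributes 'name' (string) and 'vector' (np.ndarray).
            metric (function): to measure distance between points.
            normalize (Boolean): whether or not to normalize vectors.
            iterations (int): when to stop if no convergence.
        """
        
        self.k = k
        # Fall back to the notebook-level `words` list if no data is given.
        self.data = words if data is None else data
        self.metric = metric
        self.normalize = normalize
        self.iterations = iterations
        self._upperbound, self._lowerbound = self.find_limits()
        self._centroids = {}
        self._iteration = 0
        self.labels = {}

    def find_limits(self):
        """Return per-dimension (max, min) over all data vectors.

        Assumed behaviour: the constructor calls this method, but the
        original cell does not define it.
        """
        vectors = np.array([w.vector for w in self.data])
        return vectors.max(axis=0), vectors.min(axis=0)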
    

In [37]:
type(words[0]['vector'][0])

float

In [31]:
number = '0.53533'

In [33]:
float(number)

0.53533

In [20]:
len(data['animals'])

50

In [13]:
animal = open('animals').read().split('\n')[:-1]

In [15]:
for a in animals:
    

50

In [8]:
animals = animals.split('\n')

AttributeError: 'list' object has no attribute 'split'

In [None]:
a

## Initial Computation

> (2) Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
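
The task does not say how precision and recall are defined for clusters. One common choice counts pairs of items: a pair is a true positive when both items share a category and a cluster. The sketch below assumes that pairwise definition; `pairwise_scores` is a hypothetical helper and the labels are toy values.

```python
from itertools import combinations

def pairwise_scores(true_labels, cluster_labels):
    """Hypothetical helper: pairwise precision, recall and F-score.

    A pair of items counts as a true positive when both items share a
    category and a cluster.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(true_labels)), 2):
        same_category = true_labels[i] == true_labels[j]
        same_cluster = cluster_labels[i] == cluster_labels[j]
        if same_cluster and same_category:
            tp += 1
        elif same_cluster:
            fp += 1  # grouped together but from different categories
        elif same_category:
            fn += 1  # same category but split across clusters

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Toy example: 4 items, 2 categories, one item put in the wrong cluster.
truth = ['animals', 'animals', 'fruits', 'fruits']
clusters = [0, 0, 0, 1]
p, r, f = pairwise_scores(truth, clusters)
```

Here one pair is a true positive, two are false positives and one is a false negative, so precision is 1/3, recall is 1/2 and the F-score is 0.4. Collecting the three values for each k from 1 to 10 and passing the lists to `plt.plot` would give the requested figure.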

## Normalize

> (3) Now re-run the k-means clustering algorithm you implemented in part (1) but normalise each feature vector to unit L2 length before computing Euclidean distances. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
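
Normalising to unit L2 length divides each vector by its Euclidean norm, as the `normalize` function above does for one vector. A vectorised sketch for a whole matrix follows; `normalize_rows` is a hypothetical name, and the code assumes no row is all zeros.

```python
import numpy as np

def normalize_rows(X):
    """Hypothetical helper: scale every row of X to unit L2 length.

    Assumes no row is the zero vector (division by zero otherwise).
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    return X / norms

# Toy vectors (not the assignment data).
toy = np.array([[3.0, 4.0], [0.0, 2.0]])
unit = normalize_rows(toy)
```

After normalisation only the direction of each embedding matters, so words with large-magnitude vectors no longer dominate the distances.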

## Manhattan Distance

> (4) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use Manhattan distance over the unnormalised feature vectors. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
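
One point worth keeping in mind when swapping in Manhattan distance: the mean minimises the sum of squared Euclidean distances, whereas the coordinate-wise median minimises the sum of Manhattan distances. The task as written only changes the distance, so keeping the mean update is a reasonable reading; the sketch below just illustrates the difference on a toy cluster, with `total_manhattan` as a hypothetical helper.

```python
import numpy as np

def total_manhattan(points, centre):
    """Sum of Manhattan (L1) distances from each point to centre."""
    return np.abs(points - centre).sum()

# Toy cluster with one outlier (not the assignment data).
cluster = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0], [10.0, 9.0]])

mean_centre = cluster.mean(axis=0)          # (3.25, 2.25)
median_centre = np.median(cluster, axis=0)  # (1.5, 0.0)

cost_mean = total_manhattan(cluster, mean_centre)      # 27.0
cost_median = total_manhattan(cluster, median_centre)  # 20.0
```

Using the median as the update step turns the procedure into what is usually called k-medians.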

## Normalized Manhattan Distance

> (5) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use Manhattan distance with L2 normalised feature vectors. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)

## Cosine Similarity

> (6) Now re-run the k-means clustering algorithm you implemented in part (1) but this time use cosine similarity as the distance (similarity) measure. Vary the value of k from 1 to 10 and compute the precision, recall, and F-score for each set of clusters. Plot k in the horizontal axis and precision, recall and F-score in the vertical axis in the same plot. (10 marks)
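
Cosine similarity grows as vectors get closer, so the assignment step has to pick the centroid with the largest similarity, or equivalently the smallest value of one minus the similarity. The sketch below assumes that conversion (`cosine_distance` is a hypothetical helper) and checks a standard identity: for unit vectors, the squared Euclidean distance equals twice the cosine distance.

```python
import numpy as np

def cosine_similarity(u, v):
    """Return cosine similarity of two np.array vectors."""
    return u.dot(v) / (np.sqrt(u.dot(u)) * np.sqrt(v.dot(v)))

def cosine_distance(u, v):
    """Hypothetical helper: turn the similarity into a dissimilarity."""
    return 1.0 - cosine_similarity(u, v)

# Toy vectors (not the assignment data).
u = np.array([1.0, 2.0, 3.0])
v = np.array([-2.0, 0.5, 4.0])

# Normalise both vectors to unit L2 length.
u_hat = u / np.linalg.norm(u)
v_hat = v / np.linalg.norm(v)

# ||u_hat - v_hat||^2 = 2 - 2 * cos(u, v) = 2 * cosine_distance(u, v)
squared_euclidean = (u_hat - v_hat).dot(u_hat - v_hat)
```

Because of this identity, ranking points by cosine distance to a fixed vector gives the same order as ranking them by Euclidean distance after both have been normalised, which is useful context when comparing this setting with part (3).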

## Compare and Discuss

> (7) Comparing the different clusterings you obtained in (2)-(6) discuss what is the best setting for k-means clustering for this dataset. (20 marks)