# E2: Implementing and Comparing K-Means Clustering Algorithms

### Objective
The objective of this project is to implement and compare different versions of the K-Means clustering algorithm. You will implement three versions: pure Python, NumPy arrays, and Cython. You will also profile each implementation and compare their performance as a function of problem size.
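
One way to compare implementations as a function of problem size is to time the same `fit` call on progressively larger inputs. The sketch below is a minimal, hypothetical harness: `make_data`, `time_fit`, and the stand-in `DummyModel` are names invented for illustration and are not part of the project code. Any object exposing a `fit(data)` method could be passed in place of `DummyModel`.

```python
import random
import time


def make_data(n, seed=0):
    """Generate n random 2-D integer points in [0, 100]."""
    rng = random.Random(seed)
    return [[rng.randint(0, 100), rng.randint(0, 100)] for _ in range(n)]


def time_fit(model_factory, sizes, repeats=3):
    """Return {size: best wall-clock seconds} for model_factory().fit(data)."""
    results = {}
    for n in sizes:
        data = make_data(n)
        best = float("inf")
        for _ in range(repeats):
            model = model_factory()  # fresh model for every repeat
            start = time.perf_counter()
            model.fit(data)
            best = min(best, time.perf_counter() - start)
        results[n] = best
    return results


class DummyModel:
    """Hypothetical stand-in with an O(n) fit; swap in a real k-means class."""

    def fit(self, data):
        self.total = sum(x + y for x, y in data)


if __name__ == "__main__":
    for n, seconds in time_fit(DummyModel, [1_000, 10_000, 100_000]).items():
        print(f"n={n:>7}: {seconds * 1e3:.3f} ms")
```

Taking the best of several repeats reduces the influence of background noise. Because k-means starts from random centroids, the number of iterations (and so the run time) also varies between runs, so reporting the spread alongside the minimum may be worthwhile.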

## Python Profiling

In [6]:
from pure import kmeans
import random

In [12]:
data = [[random.randint(0, 100), random.randint(0, 100)] for _ in range(10_000)]

kmeans_model = kmeans()

%timeit kmeans_model.fit(data)
%load_ext line_profiler
%lprun -f kmeans_model.fit kmeans_model.fit(data)

3.06 s ± 1.15 s per loop (mean ± std. dev. of 7 runs, 1 loop each)


Timer unit: 1e-09 s

Total time: 4.53508 s
File: /home/juan/Desktop/upystuff/HPC/E2/pure.py
Function: fit at line 13

Line #      Hits         Time  Per Hit   % Time  Line Contents
    13                                               def fit(self, data):
    14         1      25909.0  25909.0      0.0          self.centroids = random.sample(data, self.k)              
    15        37      10850.0    293.2      0.0          for _ in range(self.max_iterations):           
    16        37 4486266126.0    1e+08     98.9              self.clusters = self.addPoints(data)             
    17        37   48676728.0    1e+06      1.1              new_centroids = self.update()             
    18        37      59523.0   1608.7      0.0              if self.convergence(new_centroids):
    19         1        181.0    181.0      0.0                  break  
    20        36      41160.0   1143.3      0.0              self.centroids = new_centroids
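
The profile attributes 98.9% of the time in `fit` to the `addPoints` call. The helper bodies do not appear in the output, so the following is only a sketch of what a pure-Python class with this `fit` loop might look like. The bodies of `addPoints`, `update`, and `convergence`, the class name `KMeansSketch`, and the constructor defaults are all assumptions. It illustrates a plausible reason the assignment step dominates: each iteration evaluates a distance from every point to every centroid in interpreted Python.

```python
import math
import random


class KMeansSketch:
    """Hypothetical pure-Python k-means mirroring the profiled fit loop."""

    def __init__(self, k=3, max_iterations=100, tolerance=1e-4):
        # Defaults are assumptions; the original constructor is not shown.
        self.k = k
        self.max_iterations = max_iterations
        self.tolerance = tolerance

    def fit(self, data):
        self.centroids = random.sample(data, self.k)
        for _ in range(self.max_iterations):
            self.clusters = self.addPoints(data)
            new_centroids = self.update()
            if self.convergence(new_centroids):
                break
            self.centroids = new_centroids

    def addPoints(self, data):
        # n * k distance evaluations per call: the likely hot spot.
        clusters = [[] for _ in range(self.k)]
        for point in data:
            distances = [math.dist(point, c) for c in self.centroids]
            clusters[distances.index(min(distances))].append(point)
        return clusters

    def update(self):
        # Mean of each cluster; an empty cluster keeps its old centroid.
        new_centroids = []
        for cluster, old in zip(self.clusters, self.centroids):
            if cluster:
                new_centroids.append([sum(dim) / len(cluster) for dim in zip(*cluster)])
            else:
                new_centroids.append(list(old))
        return new_centroids

    def convergence(self, new_centroids):
        # Converged when no centroid moved more than the tolerance.
        return all(
            math.dist(old, new) <= self.tolerance
            for old, new in zip(self.centroids, new_centroids)
        )
```

In this sketch `update` touches each point once per iteration, while `addPoints` touches each point `k` times and performs a square root each time, which is consistent with the roughly 99% to 1% split in the profile.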

## NumPy Profiling


In [8]:
import numpy as np
from kmeans_numpy import numpy_kmeans

In [13]:
data_numpy = [[random.randint(0, 100), random.randint(0, 100)] for _ in range(10_000)]
kmeans_numpy = numpy_kmeans()
%timeit kmeans_numpy.fit(data_numpy)
%lprun -f kmeans_numpy.fit kmeans_numpy.fit(data_numpy)

116 ms ± 8.78 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Timer unit: 1e-09 s

Total time: 0.140973 s
File: /home/juan/Desktop/upystuff/HPC/E2/kmeans_numpy.py
Function: fit at line 10

Line #      Hits         Time  Per Hit   % Time  Line Contents
    10                                               def fit(self, data):
    11         1    1496657.0    1e+06      1.1          data = np.array(data)
    12         1        807.0    807.0      0.0          sample = data.shape[0]
    13         1      32741.0  32741.0      0.0          indices = random.sample(range(sample), self.k)
    14         1      11660.0  11660.0      0.0          self.centroids = data[indices, :]
    15                                           
    16        46       8892.0    193.3      0.0          for i in range(self.max_iterations):
    17        46    1334544.0  29011.8      0.9              distances = np.zeros((sample, self.k))
    18       506     367349.0    726.0      0.3              for i, centroid in enumerate(self.centroids):
    19       460   98242821.0 2
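
The listing is cut off at line 19, the statement inside the per-centroid loop. Its recorded time (about 0.098 s of the 0.141 s total) makes it the dominant cost. The statement itself is not recoverable from the output, so the sketch below only assumes it fills one column of `distances` with the Euclidean distance from every point to one centroid. The function names `distances_loop` and `distances_broadcast` are hypothetical. The second function shows a broadcasting alternative that removes the Python-level loop entirely.

```python
import numpy as np


def distances_loop(data, centroids):
    """Fill an (n, k) distance matrix one centroid (column) at a time."""
    distances = np.zeros((data.shape[0], centroids.shape[0]))
    for i, centroid in enumerate(centroids):
        # Vectorized over all n points; the Python loop runs only k times.
        distances[:, i] = np.linalg.norm(data - centroid, axis=1)
    return distances


def distances_broadcast(data, centroids):
    """Same matrix in one step via broadcasting: (n, 1, d) - (1, k, d)."""
    diff = data[:, np.newaxis, :] - centroids[np.newaxis, :, :]
    return np.sqrt((diff ** 2).sum(axis=2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    data = rng.integers(0, 101, size=(10_000, 2)).astype(float)
    # k = 10 matches the hit counts in the profile (460 hits / 46 iterations).
    centroids = data[:10]
    labels = distances_loop(data, centroids).argmin(axis=1)
    print(labels.shape)
```

The broadcast version allocates a temporary array of shape `(n, k, d)`, so for large `n` and `k` the column-by-column loop can be the more memory-friendly choice.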

## Cython Profiling


In [14]:
from kmeans_cython import numpy_kmeans as cython_kmeans
import numpy as np

data_cython = [[random.randint(0, 100), random.randint(0, 100)] for _ in range(10_000)]
kmeans_cython = cython_kmeans()
%timeit kmeans_cython.fit(data_cython)
%lprun -f kmeans_cython.fit kmeans_cython.fit(data_cython)

28.8 ms ± 5.62 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)


Timer unit: 1e-09 s