# Implementing KMeans

In this exercise we will implement the k-means clustering algorithm. For an introduction to how the algorithm works, I recommend reading:
- [K-Means Clustering Algorithm Overview](https://stanford.edu/~cpiech/cs221/handouts/kmeans.html)

The following figure illustrates the steps the algorithm follows to find two centroids (taken from the link above):

![K-Means algorithm](http://bigdata.cesga.es/files/kmeansViz.png)

## Dependencies

In [None]:
from __future__ import print_function
import math
from collections import namedtuple

## Parameters

In [None]:
# Number of clusters to find
K = 5
# Convergence threshold
THRESHOLD = 0.1
# Maximum number of iterations
MAX_ITERS = 20

## Load data

In [None]:
def parse_coordinates(line):
    "Extract the point coordinates (fields 3 and 4) from a comma-separated line"
    fields = line.split(',')
    return (float(fields[3]), float(fields[4]))
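
As a quick sanity check, the parser can be tried on a single line outside Spark. The sample record below is made up for illustration; the only thing it assumes about the real dataset is what the function itself assumes, namely that fields 3 and 4 hold the numeric coordinates.

```python
def parse_coordinates(line):
    "Extract the point coordinates (fields 3 and 4) from a comma-separated line"
    fields = line.split(',')
    return (float(fields[3]), float(fields[4]))

# Hypothetical record: the first three fields are placeholders,
# only the positions of fields 3 and 4 matter here
sample_line = 'a,b,c,33.5,-117.25'
coords = parse_coordinates(sample_line)
print(coords)  # (33.5, -117.25)
```

Note that a line with fewer than five fields, or with non-numeric values in those positions, would raise an exception, so malformed records may need to be filtered out first.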

In [None]:
# `sc` is assumed to be the SparkContext provided by the notebook kernel
data = sc.textFile('datasets/locations')

In [None]:
points = data.map(parse_coordinates)

## Useful functions

In [None]:
def distance(p1, p2):  
    "Calculate the squared distance between two given points"
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

def closest_centroid(point, centroids):
    "Return the index of the centroid closest to the given point, i.e. the cluster this point belongs to"
    distances = [distance(point, c) for c in centroids]
    shortest = min(distances)
    return distances.index(shortest)

def add_points(p1, p2):
    "Add two points of the same cluster; the sums are used later to compute the new centroids"
    return [p1[0] + p2[0], p1[1] + p2[1]]
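
The helpers above are plain Python, so they can be checked on a few hand-picked points before being used inside Spark transformations. The points and centroids below are illustrative values, not taken from the dataset.

```python
def distance(p1, p2):
    "Calculate the squared distance between two given points"
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

def closest_centroid(point, centroids):
    "Return the index of the centroid closest to the given point"
    distances = [distance(point, c) for c in centroids]
    return distances.index(min(distances))

def add_points(p1, p2):
    "Add two points coordinate by coordinate"
    return [p1[0] + p2[0], p1[1] + p2[1]]

# Illustrative centroids (not from the dataset)
example_centroids = [(0.0, 0.0), (10.0, 10.0)]

d = distance((1.0, 2.0), (4.0, 6.0))                    # 3**2 + 4**2
near = closest_centroid((1.0, 2.0), example_centroids)  # index of (0.0, 0.0)
far = closest_centroid((8.0, 9.0), example_centroids)   # index of (10.0, 10.0)
total = add_points((1.0, 2.0), (3.0, 4.0))

print(d, near, far, total)  # 25.0 0 1 [4.0, 6.0]
```

Because `distance` returns the squared distance, it is suitable for comparing which centroid is closer, but its values are not in the same units as the coordinates.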

## Iteratively calculate the centroids

In [None]:
%%time
# Initial centroids: we just take K randomly selected points
centroids = points.takeSample(False, K, 42)

# Just make sure the first iteration is always run
variation = THRESHOLD + 1
iteration = 0

while variation > THRESHOLD and iteration < MAX_ITERS:
    ...
        
print('Final centroids: {}'.format(centroids))
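
The body of the loop is left as the exercise. To see the shape of one iteration without Spark, here is a minimal sketch on a small in-memory list. It is not the notebook's solution: the function name `kmeans_local`, the sample points, and the choice of measuring `variation` as the sum of squared distances between old and new centroids are all assumptions made for illustration.

```python
def distance(p1, p2):
    "Calculate the squared distance between two given points"
    return (p1[0] - p2[0]) ** 2 + (p1[1] - p2[1]) ** 2

def closest_centroid(point, centroids):
    "Return the index of the centroid closest to the given point"
    distances = [distance(point, c) for c in centroids]
    return distances.index(min(distances))

def add_points(p1, p2):
    "Add two points coordinate by coordinate"
    return [p1[0] + p2[0], p1[1] + p2[1]]

def kmeans_local(points, centroids, threshold=0.1, max_iters=20):
    "Hypothetical in-memory k-means loop (illustration only, no Spark)"
    variation = threshold + 1  # make sure the first iteration runs
    iteration = 0
    while variation > threshold and iteration < max_iters:
        # Assignment step: accumulate per-cluster coordinate sums and counts
        sums, counts = {}, {}
        for p in points:
            idx = closest_centroid(p, centroids)
            sums[idx] = add_points(sums.get(idx, (0.0, 0.0)), p)
            counts[idx] = counts.get(idx, 0) + 1
        # Update step: each centroid moves to the mean of its points;
        # a centroid with no assigned points stays where it is
        new_centroids = list(centroids)
        for idx in sums:
            new_centroids[idx] = (sums[idx][0] / counts[idx],
                                  sums[idx][1] / counts[idx])
        # Assumed convergence measure: total squared centroid movement
        variation = sum(distance(old, new)
                        for old, new in zip(centroids, new_centroids))
        centroids = new_centroids
        iteration += 1
    return centroids

# Made-up points forming two obvious groups
sample_points = [(0.0, 0.0), (0.0, 2.0), (10.0, 10.0), (10.0, 12.0)]
final = kmeans_local(sample_points, [(0.0, 0.0), (10.0, 10.0)])
print('Final centroids: {}'.format(final))
```

In a Spark version, the assignment and summation would typically be expressed with RDD transformations such as `map` and `reduceByKey` rather than a Python `for` loop, so that the work is distributed across the cluster.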