# K-Means Clustering Algorithm

Author: Jacob McCabe

## Overview

This notebook will take a look at the K-Means Clustering algorithm. The main ideas covered will include:

1. What is clustering?
2. The K-Means algorithm
3. Picking $k$


## What is clustering?

Clustering is a branch of unsupervised machine learning. The goal is to assign points to 'clusters' based on similarity. It's often used for exploratory data analysis when we want to learn something about the data. With clustering we can find meaningful structure or organization in the data, discretize continuous values (vector quantization), and sometimes use the result as a step in semi-supervised learning.

## The algorithm

K-Means clustering groups data points into $k$ clusters, with the goal of minimizing the ***distortion***. Distortion measures how spread out the points of a cluster are around its centroid, typically as the sum of squared Euclidean distances from each point to its assigned centroid.
The initial conditions for the algorithm are the locations of the centroids. Starting from random locations, an iterative process settles into a local minimum of the distortion. Since this is a non-convex optimization problem, it can be good to run the algorithm several times with varying starting points and select the clustering with the lowest distortion.
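
For concreteness, one common way to write the distortion is the sum-of-squares form below (this is the quantity scikit-learn reports as `inertia_`). The symbols $J$, $x_i$, and $C_j$ are notation introduced here for illustration: $C_j$ is the set of points assigned to centroid $c_j$.

$$
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - c_j \rVert^2
$$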

1. The centroids $c_1, \ldots, c_k$ are initialized as random points in space.
2. Iterate until no points change clusters or a maximum number of iterations ('patience') is reached.
    - Assign each point to the nearest centroid based on Euclidean distance.
    - Update each centroid to be the 'center' (mean) of its cluster, which minimizes the distortion for that cluster.
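
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `kmeans`, its parameters, and the choice to initialize centroids by sampling $k$ data points (instead of arbitrary points in space) are assumptions made here for the example.

```python
import numpy as np


def kmeans(X, k, max_iter=100, seed=0):
    """Minimal K-Means sketch: returns (centroids, labels, distortion)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialize centroids by picking k distinct data points at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)

    for _ in range(max_iter):  # 'patience': cap on the number of iterations
        # Assignment step: squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # no point changed cluster, so we have converged
        labels = new_labels
        # Update step: move each centroid to the mean of its assigned points.
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # leave an empty cluster's centroid where it is
                centroids[j] = members.mean(axis=0)

    # Distortion: sum of squared distances from each point to its centroid.
    distortion = ((X - centroids[labels]) ** 2).sum()
    return centroids, labels, distortion


# Usage on two well-separated synthetic blobs (made-up data for illustration).
data_rng = np.random.default_rng(42)
X = np.vstack([
    data_rng.normal(0.0, 0.5, size=(50, 2)),
    data_rng.normal(10.0, 0.5, size=(50, 2)),
])
centroids, labels, distortion = kmeans(X, k=2)
print(centroids)
print(distortion)
```

Because the result depends on the random starting centroids, calling `kmeans` with several different `seed` values and keeping the run with the lowest distortion mirrors the restart strategy described earlier.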

## Picking $k$

A question that has probably gone through your head is "how do I pick $k$?". After all, since this is unsupervised, we have no idea how many clusters to look for. One option is the ***elbow method***. We run K-Means for a range of values of $k$, in principle from $k=1$ up to $k=n$, where $n$ is the number of data points, and plot the minimum distortion found against $k$. The resulting curve looks something like exponential decay. At $k=1$ the distortion is at its maximum, since every data point belongs to the same cluster, whereas at $k=n$ the distortion goes to zero, since each point is its own cluster. The best choice for $k$ is typically somewhere at the elbow of the graph, where adding more clusters stops reducing the distortion by much. This is somewhat subjective and the elbow can be difficult to find; it is only meant to give a ballpark estimate of $k$. In the image below, there is an elbow around $k=3$, so this would be the value to use.

<img src="images/elbow_method.png" height="40%" width="40%">
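
As a rough sketch of how the distortion values for such a plot could be collected, the snippet below fits scikit-learn's `KMeans` for a small range of $k$ and records `inertia_` (the sum of squared distances to the nearest centroid). The synthetic data from `make_blobs`, the range of $k$ from 1 to 8, and the variable names are assumptions chosen for illustration; they are not the data behind the image above.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Assumption: synthetic 2-D data with three blobs, used only for illustration.
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

ks = range(1, 9)
distortions = []
for k in ks:
    # n_init=10 reruns K-Means from 10 starting points and keeps the best run.
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    distortions.append(model.inertia_)

# Print distortion against k; plotting these pairs gives the elbow curve.
for k, d in zip(ks, distortions):
    print(k, round(d, 1))
```

Plotting `distortions` against `ks` (for example with matplotlib) and looking for the point where the curve flattens gives the ballpark estimate of $k$.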

## Resources

- [Image](https://medium.com/@mudgalvivek2911/machine-learning-clustering-elbow-method-4e8c2b404a5d)
- [scikit-learn KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html)


## Example using `KMeans` from `scikit-learn`