# Implementing K-Means from Scratch

In this notebook, you will implement $k$-means from scratch. Every data scientist should do this at least once.

In [None]:
import numpy as np
import pandas as pd

import matplotlib.pyplot as plt
%matplotlib inline

In [None]:
data = pd.read_csv("/data/wines.csv", sep=";")

**TODO:** Implement the `assign_points_to_clusters()` and `recalculate_cluster_centroids()` functions below.

In [None]:
def assign_points_to_clusters(data, centroids):
    """Assign each observation in data to the nearest centroid.
    
    Args:
      - data: an n x p Pandas DataFrame of observations.
      - centroids: a k x p Pandas DataFrame of cluster centroids.
    
    Returns:
      a vector of length n, consisting of numbers 0, 1, ..., k-1 
      indicating the cluster assignment of each observation.
    """
    k = centroids.shape[0]
    raise NotImplementedError
    
def recalculate_cluster_centroids(data, clusters):
    """Recalculate cluster centroids based on cluster assignments.
    
    Args:
      - data: an n x p Pandas DataFrame of observations
      - clusters: a vector of length n, with numbers 0, 1, ..., k-1,
                  indicating the cluster assignment of each observation.
    
    Returns:
      a k x p Pandas DataFrame of cluster centroids.
    """
    k = clusters.max() + 1
    raise NotImplementedError
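
The two steps described in the docstrings above can be illustrated on a tiny NumPy example. This is only a sketch under stated assumptions, not the required solution: it works on plain arrays rather than the DataFrames the functions above expect (a DataFrame can be converted with `.to_numpy()`), the toy points and starting centers are made up, and the helper names `nearest_center` and `mean_per_label` are hypothetical.

```python
import numpy as np

# Toy data (made up for illustration): 6 points in 2 dimensions.
points = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [9.0, 9.0], [9.0, 10.0], [10.0, 9.0],
])
# Two hypothetical starting centers.
centers = np.array([[1.0, 1.0], [8.0, 8.0]])

def nearest_center(points, centers):
    """Return, for each point, the index of the closest center."""
    # Broadcasting gives an array of shape (n, k, p) of coordinate differences.
    diffs = points[:, np.newaxis, :] - centers[np.newaxis, :, :]
    # Squared Euclidean distance from every point to every center: shape (n, k).
    sq_dists = (diffs ** 2).sum(axis=2)
    return sq_dists.argmin(axis=1)

def mean_per_label(points, labels, k):
    """Return a (k, p) array whose row j is the mean of the points labeled j."""
    # Assumes every label 0..k-1 occurs at least once;
    # an empty cluster would produce a row of NaN.
    return np.vstack([points[labels == j].mean(axis=0) for j in range(k)])

labels = nearest_center(points, centers)
new_centers = mean_per_label(points, labels, k=2)
print(labels)
print(new_centers)
```

Minimizing the squared distance picks the same center as minimizing the distance itself, so the square root can be skipped.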

The functions provided below run $k$-means and plot the clusters and centroids at each iteration. This code has already been written for you, but you should read it and make sure you understand it.

In [None]:
def plot(data, centroids, clusters, ax, title):
    k = centroids.shape[0]
    for i in range(k):
        ax.plot(data[clusters == i].iloc[:, 0], data[clusters == i].iloc[:, 1], 'x', alpha=.2)
        ax.plot(centroids.iloc[i, 0], centroids.iloc[i, 1], 'ko')
    ax.set_title(title)

def run_k_means(data, k):
    
    # initialize the centroids to k randomly selected observations from the data set
    centroids = data.sample(k)
    clusters = assign_points_to_clusters(data, centroids)
    
    # repeat the two steps below until the cluster assignments stop changing
    while True:
        # plot data
        fig, ax = plt.subplots(1, 2, figsize=(12, 5))
        plot(data, centroids, clusters, ax[0], "Assign Points to Clusters")
        
        # STEP 1: recalculate cluster centroids
        centroids = recalculate_cluster_centroids(data, clusters)
        
        # plot data
        plot(data, centroids, clusters, ax[1], "Recalculate Centroids")
        
        # STEP 2: assign points to nearest cluster
        new_clusters = assign_points_to_clusters(data, centroids)
        
        # if cluster assignments haven't changed, terminate the loop
        if all(new_clusters == clusters):
            break
        else:
            clusters = new_clusters
    
    return centroids, clusters

The code below calls `run_k_means()`, which in turn exercises the two functions you implemented above.

In [None]:
X = data[["total sulfur dioxide", "volatile acidity"]]

_, clusters = run_k_means(X, 2)

What do you notice about the clusters above? How would you correct this problem?
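
One thing that may be worth checking when answering the question above is the scale of each feature, since Euclidean distance is dominated by whichever column has the larger spread. The sketch below uses made-up values (not the wine data) under the same two column names, compares the per-column standard deviations, and standardizes each column. It is one possible approach, not necessarily the intended answer.

```python
import numpy as np
import pandas as pd

# Made-up values, chosen only so that the two columns have very different ranges.
toy = pd.DataFrame({
    "total sulfur dioxide": [30.0, 60.0, 120.0, 150.0],
    "volatile acidity": [0.2, 0.8, 0.3, 0.7],
})

# Per-column spread: the first column varies far more than the second.
print(toy.std())

# Standardize: subtract each column's mean and divide by its standard deviation.
scaled = (toy - toy.mean()) / toy.std()

# After standardizing, every column has mean 0 and standard deviation 1.
print(scaled.mean())
print(scaled.std())
```

Running the clustering on `scaled` instead of `toy` would let both features contribute comparably to the distance.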