<a href="https://colab.research.google.com/github/mathjams/machine-learning-basics/blob/main/k_means.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from utils import *

%matplotlib inline

K-means clusters data iteratively: it assigns each point to its closest centroid and then repositions each centroid at the mean of the points assigned to it. Each iteration repeats two steps:

1. Assign each training example $x^{(i)}$ to its closest centroid, and
2. Recompute each centroid as the mean of the points assigned to it.

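K-means always converges: neither step can increase the distortion (the average squared distance between each point and its assigned centroid), so the assignments stop changing after a finite number of iterations. The configuration it converges to may be a local optimum rather than the best possible clustering.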
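One way to check the convergence claim is to record the distortion after every iteration and confirm that it never goes up. The cell below is a minimal sketch on synthetic data, not the implementation used in the rest of this notebook; the helper name `run_kmeans_trace`, the toy blobs, and the seed are assumptions made for the illustration.

```python
import numpy as np

def run_kmeans_trace(X, centroids, max_iters=20):
    """Hypothetical helper: runs k-means and records the distortion per iteration."""
    centroids = centroids.astype(float).copy()
    history = []
    for _ in range(max_iters):
        # Step 1: assign each point to its closest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        idx = np.argmin(dists, axis=1)
        # Step 2: move each centroid to the mean of its assigned points.
        for k in range(centroids.shape[0]):
            members = X[idx == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
        # Distortion: mean squared distance to the assigned centroid.
        history.append(np.mean(np.sum((X - centroids[idx]) ** 2, axis=1)))
    return centroids, idx, history

# Toy data (assumed for illustration): two Gaussian blobs in 2D.
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
_, _, history = run_kmeans_trace(X_demo, X_demo[:2], max_iters=10)
print(history)
```

The printed values should be non-increasing and flatten out once the assignments stop changing.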

In [None]:
import numpy as np

def find_closest_centroids(X, centroids):
    """
    Computes the centroid memberships for every example

    Args:
        X (ndarray): (m, n) Input values
        centroids (ndarray): (K, n) centroids

    Returns:
        idx (array_like): (m,) closest centroids

    """
    idx = np.zeros(X.shape[0], dtype=int)
    for i in range(X.shape[0]):
        # Distance from example i to each of the K centroids
        distances = np.linalg.norm(X[i] - centroids, axis=1)
        # Index of the nearest centroid (the first one wins on ties)
        idx[i] = np.argmin(distances)
    return idx

In [None]:
def compute_centroids(X, idx, K):
    """
    Returns the new centroids by computing the means of the
    data points assigned to each centroid.

    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each
                       example in X. Concretely, idx[i] contains the index of
                       the centroid closest to example i
        K (int):       number of centroids

    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """

    # Useful variables
    m, n = X.shape

    centroids = np.zeros((K, n))
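    for k in range(K):
        # Points currently assigned to centroid k
        points = X[idx == k]
        # Guard against an empty cluster, which would otherwise divide by
        # zero and produce NaN; such a centroid is left at zeros here
        if len(points) > 0:
            centroids[k] = points.mean(axis=0)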

    return centroids

You should run k-means several times with different random initializations, because depending on where the centroids start, we can end up with different groupings. Then we can keep the final configuration with the lowest error (for example, the average squared distance between each point and its assigned centroid).
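The restart strategy could be sketched as follows. This is a self-contained illustration under stated assumptions, not the notebook's own pipeline: `kmeans_once` and `best_of_restarts` are hypothetical helpers, the error is taken to be the mean squared distance to the assigned centroid, and the number of restarts and iterations are arbitrary defaults.

```python
import numpy as np

def kmeans_once(X, K, rng, max_iters=10):
    """Hypothetical helper: one k-means run from a random initialization."""
    # Pick K distinct examples as the starting centroids.
    centroids = X[rng.permutation(X.shape[0])[:K]].astype(float)
    for _ in range(max_iters):
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        idx = np.argmin(dists, axis=1)
        for k in range(K):
            members = X[idx == k]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                centroids[k] = members.mean(axis=0)
    # Error of this run: mean squared distance to the assigned centroid.
    distortion = np.mean(np.sum((X - centroids[idx]) ** 2, axis=1))
    return centroids, idx, distortion

def best_of_restarts(X, K, n_restarts=10, seed=0):
    """Hypothetical helper: keep the run with the lowest distortion."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, K, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# Toy data (assumed for illustration): three Gaussian blobs in 2D.
rng = np.random.default_rng(1)
X_demo = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 4, 8)])
centroids, idx, distortion = best_of_restarts(X_demo, K=3, n_restarts=5)
print(centroids, distortion)
```

Passing a seeded generator into each run keeps the restarts reproducible while still giving every run a different starting point.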

In [None]:
def kMeans_init_centroids(X, K):
    """
    This function initializes K centroids that are to be
    used in K-Means on the dataset X

    Args:
        X (ndarray): Data points
        K (int):     number of centroids/clusters

    Returns:
        centroids (ndarray): Initialized centroids
    """

    # Randomly reorder the indices of examples
    randidx = np.random.permutation(X.shape[0])

    # Take the first K examples as centroids
    centroids = X[randidx[:K]]

    return centroids