# Clustering
A clustering algorithm groups data points that are similar to each other

# K-means
The K-means algorithm is a method to automatically cluster similar data points together

Given $m$ training examples $\{x^{(1)}, ..., x^{(m)}\}$, we want to group them into $K$ clusters

Steps
1. Randomly initialize $K$ cluster centroids ($\mu_1, \mu_2, ..., \mu_K$); the centroids are vectors with the same dimension as the training examples
2. Assign each data point to its closest centroid
3. Compute the mean of all data points assigned to each cluster centroid and move the centroid to that mean
4. Repeat from step 2 until the cluster centroids stop moving (converge)

Note: if a cluster has no points assigned to it, we can either eliminate that cluster or reinitialize it to a new position until at least one point is assigned to it
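The four steps and the empty-cluster note can be sketched in one loop. This is a minimal illustration, not the notebook's implementation: the function name `kmeans_until_converged`, the tolerance `tol`, and the choice to reinitialize an empty cluster to a random training example are all assumptions made for this example.

```python
import numpy as np

def kmeans_until_converged(X, init_centroids, tol=1e-6, max_iters=100, seed=0):
    """Hypothetical helper: repeat steps 2-3 until no centroid moves more than tol."""
    rng = np.random.default_rng(seed)
    centroids = np.array(init_centroids, dtype=float)  # copy of the step-1 centroids

    for _ in range(max_iters):
        # Step 2: squared distance from every point to every centroid, shape (m, K)
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)

        # Step 3: move each centroid to the mean of its assigned points
        new_centroids = np.empty_like(centroids)
        for k in range(centroids.shape[0]):
            members = X[idx == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)
            else:
                # Assumption: reinitialize an empty cluster to a random example
                new_centroids[k] = X[rng.integers(X.shape[0])]

        # Step 4: stop once the largest centroid movement is below tol
        shift = np.linalg.norm(new_centroids - centroids, axis=1).max()
        centroids = new_centroids
        if shift < tol:
            break

    return centroids, idx
```

Unlike a fixed iteration count, this stops as soon as the centroids settle; `max_iters` is only a safety cap.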

## K-means optimization

### Notations
$x^{(1)}, ..., x^{(m)}$: $m$ training examples in total

$c^{(i)}$: index of cluster ($1, 2, ..., K$) that the $i$th training example, $x^{(i)}$, is assigned to

$\mu_k$: $k$th cluster centroid

$\mu_{c^{(i)}}$: the cluster centroid with the index of cluster $c^{(i)}$ (cluster centroid that the $i$th training example, $x^{(i)}$, is assigned to)

## Assign each point to its closest centroid
To determine the cluster index $c^{(i)}$ for a point, we find the cluster $k$ that minimizes
$$ ||x^{(i)} - \mu_k||^2,$$
which is the squared distance between the point and the centroid

$x^{(i)}$: a vector representing $i$th training example

$\mu_k$: the coordinates of the $k$th cluster centroid
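The assignment rule can also be written without explicit loops by letting NumPy broadcasting build the full table of squared distances. This is a sketch under the assumption that `X` has shape `(m, n)` and `centroids` has shape `(K, n)`; the name `assign_clusters` is made up for this example.

```python
import numpy as np

def assign_clusters(X, centroids):
    """Hypothetical helper: index of the closest centroid for every row of X."""
    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n)
    diff = X[:, None, :] - centroids[None, :, :]
    # Squared Euclidean distance ||x^(i) - mu_k||^2, shape (m, K)
    sq_dist = (diff ** 2).sum(axis=2)
    # For each point, pick the k with the smallest squared distance
    return sq_dist.argmin(axis=1)

X = np.array([[1.84207953, 4.6075716],
              [5.65858312, 4.79996405],
              [6.35257892, 3.2908545],
              [2.90401653, 4.61220411],
              [3.23197916, 4.93989405]])
centroids = np.array([[3, 3], [6, 2], [8, 5]])
print(assign_clusters(X, centroids))  # [0 2 1 0 0]
```

Because the square root is monotonic, minimizing the squared distance picks the same centroid as minimizing the distance itself, so the root is never needed.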

## Compute the mean for each centroid
The following formula sets the new cluster centroid to the mean of all points assigned to it
$$\mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}$$

$\mu_k$: the updated coordinates of the $k$th cluster centroid

$C_k$: the set of training examples that are assigned to centroid $k$
  
$|C_k|$: the number of training examples in the set $C_k$
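As an illustration of the formula, the sums $\sum_{i \in C_k} x^{(i)}$ and the counts $|C_k|$ can be accumulated for all clusters at once. The sketch below assumes `idx` is an integer array of cluster indices in `0..K-1`; the name `centroid_means` and the choice to report an empty cluster as NaN are assumptions of this example.

```python
import numpy as np

def centroid_means(X, idx, K):
    """Hypothetical helper: mean of the points assigned to each of K clusters."""
    sums = np.zeros((K, X.shape[1]))
    np.add.at(sums, idx, X)                 # sums[k] = sum of x^(i) for i in C_k
    counts = np.bincount(idx, minlength=K)  # counts[k] = |C_k|

    # Assumption: an empty cluster has no mean, so it is reported as NaN
    means = np.full_like(sums, np.nan)
    nonempty = counts > 0
    means[nonempty] = sums[nonempty] / counts[nonempty, None]
    return means
```

Returning NaN for an empty cluster leaves the decision (eliminate or reinitialize) to the caller rather than hiding it inside the update step.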

## Cost function
The cost function computes the average squared distance between each training example and its assigned cluster centroid
$$J = \frac{1}{m} \sum\limits_{i = 1}^{m} ||x^{(i)} - \mu_{c^{(i)}}||^2$$

Each iteration of assigning points and updating centroids should leave the cost lower or unchanged, never higher. Once the cost converges, we can stop iterating
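The cost $J$ translates almost directly into NumPy. A minimal sketch, assuming `idx` holds integer cluster indices so that `centroids[idx]` lines up each example with its assigned centroid $\mu_{c^{(i)}}$; the function name `kmeans_cost` is not from the notebook.

```python
import numpy as np

def kmeans_cost(X, idx, centroids):
    """Hypothetical helper: average squared distance to the assigned centroid."""
    # centroids[idx] has shape (m, n): row i is mu_{c^(i)}
    diffs = X - centroids[idx]
    # Sum over features gives ||x^(i) - mu_{c^(i)}||^2, then average over examples
    return np.mean(np.sum(diffs ** 2, axis=1))

X_small = np.array([[0.0, 0.0], [0.0, 2.0], [4.0, 0.0]])
centroids_small = np.array([[0.0, 1.0], [4.0, 0.0]])
idx_small = np.array([0, 0, 1])
print(kmeans_cost(X_small, idx_small, centroids_small))
```

In this toy case the squared distances are 1, 1, and 0, so the cost is 2/3. Recording this value after every iteration is one way to check that the cost never goes up.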

# Code

In [1]:
import numpy as np
import matplotlib.pyplot as plt

In [2]:
# Find closest centroid
def find_closest_centroids(X, centroids):
    """
    Args:
        X (ndarray): (m, n) Input values      
        centroids (ndarray): (K, n) centroids
    
    Returns:
        idx (array_like): (m,) closest centroids
    """

    # Get total number of clusters
    K = centroids.shape[0]

    # Initialize the cluster index for each training example
    idx = np.zeros(X.shape[0], dtype=int)

    # Loop through each training example
    for i in range(X.shape[0]):
        
        # Define minimum distance & index
        min_distance = np.inf
        index = -1
        
        # Loop through each cluster
        for j in range(K):
            
            # Calculate the squared distance
            distance = (np.linalg.norm(X[i] - centroids[j])) ** 2
            
            # Update the minimum distance and index
            if distance < min_distance:
                min_distance = distance
                index = j
                
        idx[i] = index
        
    # Return a 1D array of length m holding the cluster index of each training example
    return idx

In [4]:
# Test
X = np.array([[1.84207953,4.6075716 ],
 [5.65858312,4.79996405],
 [6.35257892,3.2908545 ],
 [2.90401653,4.61220411],
 [3.23197916,4.93989405]])
initial_centroids = np.array([[3,3], [6,2], [8,5]])

idx = find_closest_centroids(X, initial_centroids)
print("First three elements in idx are:", idx[:3])

First three elements in idx are: [0 2 1]


In [6]:
# Update centroids after points are assigned
def compute_centroids(X, idx, K):
    """
    Args:
        X (ndarray):   (m, n) Data points
        idx (ndarray): (m,) Array containing index of closest centroid for each 
                       example in X. Concretely, idx[i] contains the index of 
                       the centroid closest to example i
        K (int):       number of centroids
    
    Returns:
        centroids (ndarray): (K, n) New centroids computed
    """
    
    # m: number of training example
    # n: number of features
    m, n = X.shape
    
    # Define output array
    centroids = np.zeros((K, n))
    
    # Looping through each cluster
    for j in range(K):
        
        # Define variables
        coord_sum = np.zeros((1,n))
        num_train = 0
        
        # Loop through each example
        for i in range(m):
            
            # Check whether the training example belongs to the jth cluster
            if idx[i] == j:
                coord_sum += X[i]
                num_train += 1
        
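        # Compute the mean and store it. An empty cluster has no mean, so mark
        # its centroid as NaN; it will then never be the closest centroid.
        if num_train > 0:
            centroids[j] = coord_sum / num_train
        else:
            centroids[j] = np.nan
         
    # Return the updated centroids: the mean of the data points assigned to each centroid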
    return centroids

In [7]:
# Test
K = 3
centroids = compute_centroids(X, idx, K)

print("The centroids are:", centroids)

The centroids are: [[2.65935841 4.71988992]
 [6.35257892 3.2908545 ]
 [5.65858312 4.79996405]]


In [11]:
# K Means algorithm

def run_kMeans(X, initial_centroids, max_iters=10):
    """
    Runs the K-Means algorithm on data matrix X, where each row of X
    is a single example
    """
    
    # Get number of training examples and features
    m, n = X.shape
    
    # Get number of clusters
    K = initial_centroids.shape[0]
    centroids = initial_centroids
    
    # Initialize an array of size m to store the index of the closest centroid for each training example
    idx = np.zeros(m, dtype=int)
    
    # Update centroids
    for i in range(max_iters):
        
        # Update idx
        idx = find_closest_centroids(X, centroids)
            
        # Compute new centroid positions
        centroids = compute_centroids(X, idx, K)
        
    return centroids, idx

In [13]:
# Set number of centroids and max number of iterations
K = 3
max_iters = 10
initial_centroids = np.array([[3,3],[6,2],[8,5]])

# Run K-Means
centroids, idx = run_kMeans(X, initial_centroids, max_iters)

## Random initialization
To randomly initialize the cluster centroids,
1. Select the number of clusters, $K$
2. Randomly pick $K$ training examples and set each cluster centroid equal to one of them

Issue: random initialization may result in a non-optimal clustering, depending on which training examples are selected.

Thus, we should run K-means multiple times, each with a different random initialization, and compute the final cost of each run. The run with the lowest cost gives the best clustering among those tried
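One way to sketch this restart strategy is shown below. The names `kmeans_once` and `best_of_n_inits`, the default of 50 restarts, and the fixed seed are assumptions of this example; an empty cluster simply keeps its previous position here to keep the sketch short.

```python
import numpy as np

def kmeans_once(X, K, rng, max_iters=10):
    """Hypothetical helper: one K-means run from a random initialization."""
    # Pick K distinct training examples as the initial centroids
    centroids = X[rng.permutation(X.shape[0])[:K]].astype(float)
    for _ in range(max_iters):
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        idx = d2.argmin(axis=1)
        for k in range(K):
            if np.any(idx == k):  # assumption: an empty cluster keeps its position
                centroids[k] = X[idx == k].mean(axis=0)
    # Final assignment and cost J for the final centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    idx = d2.argmin(axis=1)
    cost = d2[np.arange(X.shape[0]), idx].mean()
    return cost, centroids, idx

def best_of_n_inits(X, K, n_init=50, seed=0):
    """Run K-means n_init times and keep the run with the lowest cost."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, K, rng) for _ in range(n_init)]
    return min(runs, key=lambda run: run[0])
```

Calling `cost, centroids, idx = best_of_n_inits(X, K=3)` returns the lowest-cost run. More restarts raise the chance of avoiding a poor local optimum, at a proportional increase in compute.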

## Choosing the number of clusters
There is often no single correct number of clusters, so $K$ should be chosen based on how well the resulting clusters serve the model's actual purpose

# Code

In [14]:
# Random initialization
def random_init(X, K):
    """
    Args:
        X (ndarray): Data points 
        K (int):     number of centroids/clusters
    
    Returns:
        centroids (ndarray): Initialized centroids
    """
    
    # Randomize the indexes for training examples
    random_idx = np.random.permutation(X.shape[0])
    
    # Select the first K numbers from the randomized index list
    centroids = X[random_idx[:K]]
    
    
    return centroids