## K-Means Clustering

- Steps
    - Randomly select `k` data points as the initial centroids (`k = 2` in the example code)
    - For each iteration:
        - Compute the distance from each data point to every centroid and assign the point to the closest centroid
        - Compute new centroids as the mean of all the data points assigned to each centroid
        - If the centroids are unchanged from the previous iteration, the algorithm has converged; otherwise, continue to the next iteration

- **Steps of the K-Means Algorithm**

    - **Randomly Select Initial Centroids:**
        - Choose `k` data points from the dataset `X` at random to serve as the initial centroids. Sampling without replacement ensures that `k` different rows of `X` are picked, so the same data point is never used for two starting centroids.
    - **For Each Iteration:**
        - **Compute Distances:**
            - Calculate the distance (usually Euclidean distance) between each data point and each centroid.
        - **Assign Clusters:**
            - Assign each data point to the nearest centroid. This creates clusters where each data point belongs to the cluster defined by the nearest centroid.
        - **Update Centroids:**
            - Compute the new centroids by calculating the mean of all the data points that belong to each cluster. This updates the centroids to better represent the clusters.
        - **Check for Convergence:**
            - Compare the new centroids with the centroids from the previous iteration. If the centroids have not changed (or the change is below a certain threshold), the algorithm has converged and can be stopped. Otherwise, continue to the next iteration.
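
To make a single pass of the loop concrete, here is a minimal sketch of one assign-and-update iteration. The starting centroids are fixed by hand instead of sampled from the data, an assumption made only so the result is reproducible. The names `X_demo`, `start`, `nearest`, and `updated` are illustrative.

```python
import numpy as np

# Six 2-D points: three with x = 1 and three with x = 4.
X_demo = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]], dtype=float)

# Hand-picked starting centroids (assumption: chosen for reproducibility,
# not sampled from the data as the algorithm normally does).
start = np.array([[0.0, 2.0], [5.0, 2.0]])

# Compute distances: entry [i, j] is the distance from point i to centroid j.
dists = np.linalg.norm(X_demo[:, np.newaxis] - start, axis=2)

# Assign clusters: index of the nearest centroid for every point.
nearest = np.argmin(dists, axis=1)

# Update centroids: mean of the points assigned to each cluster.
updated = np.array([X_demo[nearest == j].mean(axis=0) for j in range(len(start))])

# Check for convergence: did the centroids move during this iteration?
converged = np.allclose(start, updated)

print("Assignments:", nearest)        # [0 0 0 1 1 1]
print("Updated centroids:", updated)  # rows (1, 2) and (4, 2)
print("Converged:", converged)        # False, because the centroids moved
```

A second pass starting from `updated` would leave the centroids unchanged, which is the convergence condition described in the last step.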

In [None]:
import numpy as np

def k_means(X, k=2, max_itr=100):
    # Randomly initialize centroids
    centroids = X[np.random.choice(X.shape[0], k, replace=False)]

    for _ in range(max_itr):
        # Calculate distances from each point to each centroid
        distances = np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)
        # print(distances)

        # Assign each point to the closest centroid
        labels = np.argmin(distances, axis=1)                   # np.argmin([3, 1, 2]) returns 1
        # print(labels)

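        # Calculate new centroids; if a cluster received no points, keep its
        # previous centroid instead of taking the mean of an empty slice (NaN)
        new_centroids = np.array([
            X[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])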

        # Check for convergence
        # np.all() : Checks if all elements in an array evaluate to True.
        # np.allclose() : Checks if all elements in two arrays are element-wise equal within a tolerance.
        if np.allclose(centroids, new_centroids):
            print("Convergence")
            break

        centroids = new_centroids

    return centroids, labels

# Example usage:
X = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
centroids, labels = k_means(X, k=2, max_itr=100)
print("Centroids:", centroids)
print("Labels:", labels)

Convergence
Centroids: [[2.5 3. ]
 [2.5 0. ]]
Labels: [0 0 1 0 0 1]


In [4]:
# import numpy as np

# def k_means(X, k, max_iters=100):
#     # randomly selects k data points from X to serve as the initial centroids
#     # The `replace=False` parameter ensures that the initial centroids are unique. This is important because if the same data point were selected multiple times, the initial centroids would not be distinct, which could lead to poor clustering results.
#     centroids = X[np.random.choice(X.shape[0], k, replace=False)]                       

#     for _ in range(max_iters):
#         # This line assigns each data point to the nearest centroid. np.linalg.norm calculates the Euclidean distance between each data point and each centroid. np.argmin finds the index of the closest centroid for each data point.
#         labels = np.argmin(np.linalg.norm(X[:, np.newaxis] - centroids, axis=2), axis=1)
#         new_centroids = np.array([X[labels == i].mean(axis=0) for i in range(k)])       # calculates the new centroids as the mean of the data points assigned to each cluster
#         if np.all(centroids == new_centroids):                                          # If the centroids do not change (i.e., the algorithm has converged), the loop breaks early.
#             break
#         centroids = new_centroids

#     return labels, centroids

# # Example usage
# X = np.array([[1, 2], [2, 3], [3, 4], [4, 5]])
# labels, centroids = k_means(X, 2)
# print("Labels:", labels)
# print("Centroids:", centroids)
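
The two versions above differ in their convergence test: one uses `np.allclose`, the other `np.all(centroids == new_centroids)`. The sketch below illustrates why that can matter for floating-point centroids. The arrays `a` and `b` are made-up values chosen to expose rounding error, not output from the clustering code.

```python
import numpy as np

# Two arrays that are mathematically equal but differ by floating-point rounding.
a = np.array([0.1 + 0.2, 1.0])
b = np.array([0.3, 1.0])

# Exact element-wise comparison: fails because 0.1 + 0.2 != 0.3 in binary floats.
exact = np.all(a == b)

# Tolerance-based comparison (defaults: rtol=1e-05, atol=1e-08): succeeds.
close = np.allclose(a, b)

print("np.all(a == b):   ", exact)   # False
print("np.allclose(a, b):", close)   # True
```

With exact equality, a loop whose centroids keep changing by rounding-sized amounts would run until `max_iters`, whereas a tolerance-based check stops as soon as the change is negligible.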

- `X[:, np.newaxis]`
    - This operation adds a new axis to the array `X`. If `X` has shape `(n, d)` (where `n` is the number of data points and `d` is the number of features), `X[:, np.newaxis]` will have shape `(n, 1, d)`. This is done to facilitate broadcasting in the next step.
- `X[:, np.newaxis] - centroids`
    - Here, `centroids` has shape `(k, d)`, where `k` is the number of clusters. When you subtract `centroids` from `X[:, np.newaxis]`, NumPy broadcasts the shapes `(n, 1, d)` and `(k, d)` to perform element-wise subtraction. The resulting array has shape `(n, k, d)`, where entry `[i, j]` is the `d`-dimensional difference vector between data point `i` and centroid `j`.
- `np.linalg.norm(X[:, np.newaxis] - centroids, axis=2)`
    - The `np.linalg.norm` function calculates the Euclidean distance (L2 norm) between each data point and each centroid. The `axis=2` argument specifies that the norm should be computed along the last axis (the feature dimension). The resulting array has shape `(n, k)`, where each element is the distance between a data point and a centroid.
- `np.argmin(..., axis=1)`
    - The `np.argmin` function finds the index of the minimum value along a specified axis. In this case, `axis=1` specifies that we want to find the index of the minimum distance for each data point. The resulting array has shape `(n,)`, where each element is the index of the nearest centroid for each data point.
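
The shape changes described above can be checked directly. The following is a small sketch with arbitrary sizes (`n = 6`, `k = 2`, `d = 3`) and placeholder data; the variable names are illustrative. It also compares the broadcast result against an explicit double loop.

```python
import numpy as np

n, k, d = 6, 2, 3  # arbitrary sizes chosen for the demonstration

X_demo = np.arange(n * d, dtype=float).reshape(n, d)  # (n, d) placeholder data
centroids_demo = X_demo[:k]                           # (k, d) first k rows as centroids

expanded = X_demo[:, np.newaxis]                # (n, 1, d)
diffs = expanded - centroids_demo               # (n, k, d) via broadcasting
distances = np.linalg.norm(diffs, axis=2)       # (n, k)
labels_demo = np.argmin(distances, axis=1)      # (n,)

for name, arr in [("expanded", expanded), ("diffs", diffs),
                  ("distances", distances), ("labels_demo", labels_demo)]:
    print(f"{name:12s} shape = {arr.shape}")

# Cross-check: the same distances computed with an explicit double loop.
loop_distances = np.array([[np.sqrt(np.sum((x - c) ** 2)) for c in centroids_demo]
                           for x in X_demo])
print("Matches loop version:", np.allclose(distances, loop_distances))
```

The broadcast version avoids the Python-level loops while producing the same `(n, k)` distance matrix.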