Write a program to cluster a set of points using K-means on the Iris
dataset. Use K=3 clusters and Euclidean distance as the
distance measure. Initialize each cluster mean randomly as one of the data
points. Iterate for at least 10 iterations. After the iterations are over, print the
final cluster mean for each of the clusters.

Algorithm:

1. Initialize Cluster Means: Randomly select 3 points from the dataset as the initial cluster centroids.
2. Assign Points to Nearest Centroid: For each point, calculate the Euclidean distance to each centroid and assign the point to the closest centroid.
3. Recalculate Centroids: After assigning all points, recalculate the centroids as the mean of all points assigned to each cluster.
4. Repeat: Repeat steps 2 and 3 until the centroids stop changing, up to a maximum of 10 iterations.

In [11]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

In [21]:
df=pd.read_csv('Iris.csv')
X=df.drop(columns=['Id','Species']).values

In [22]:
X.shape

(150, 4)

In [25]:
K = 3

# Randomly initialize the cluster centroids (select 3 random points from the dataset)
np.random.seed(42)
initial_centroids = X[np.random.choice(X.shape[0], K, replace=False)]

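# Euclidean distance between two points (shown for reference; kmeans below computes
# the same distances for all points at once with a vectorized np.linalg.norm call)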
def euclidean_distance(a, b):
    return np.linalg.norm(a - b)

# K-means algorithm
def kmeans(X, K, max_iter=10):
    # Initialize centroids
    centroids = initial_centroids
    for i in range(max_iter):
        # Step 1: Assign points to the nearest centroid
        labels = np.argmin(np.linalg.norm(X[:, np.newaxis] - centroids, axis=2), axis=1)
        
        # Step 2: Recalculate centroids
        new_centroids = np.array([X[labels == j].mean(axis=0) for j in range(K)])
        
        # If centroids did not move (within floating-point tolerance), stop early
        if np.allclose(centroids, new_centroids):
            print(f'Converged at iteration {i+1}')
            break
        
        centroids = new_centroids
    
    return centroids, labels

# Run K-means on the Iris dataset
final_centroids, final_labels = kmeans(X, K)

# Print the final centroids (cluster means)
print("Final Cluster Means (Centroids):")
print(final_centroids)
# print(final_labels)

Converged at iteration 6
Final Cluster Means (Centroids):
[[5.9016129  2.7483871  4.39354839 1.43387097]
 [5.006      3.418      1.464      0.244     ]
 [6.85       3.07368421 5.74210526 2.07105263]]


Step 1: Assign points to the nearest centroid

labels = np.argmin(np.linalg.norm(X[:, np.newaxis] - centroids, axis=2), axis=1)


X[:, np.newaxis] - centroids: X[:, np.newaxis] reshapes X from (n_samples, n_features) to (n_samples, 1, n_features), so broadcasting subtracts every centroid from every data point. The resulting array has shape (n_samples, K, n_features), where n_samples is the number of data points and K is the number of clusters.
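
The broadcasting step can be checked on a small made-up example. The sketch below is only an illustration: X_toy and centroids_toy are hypothetical arrays (4 points, 2 features, K = 2), not part of the Iris data.

```python
import numpy as np

# Hypothetical toy data: 4 points with 2 features each, and K = 2 centroids.
X_toy = np.array([[0.0, 0.0], [1.0, 0.0], [4.0, 4.0], [5.0, 4.0]])
centroids_toy = np.array([[0.0, 0.0], [4.0, 4.0]])

# Insert a length-1 axis so each point can be paired with every centroid.
expanded = X_toy[:, np.newaxis]      # shape (4, 1, 2)
diff = expanded - centroids_toy      # broadcasts to shape (4, 2, 2)

print(expanded.shape)  # (4, 1, 2)
print(diff.shape)      # (4, 2, 2)
print(diff[2, 0])      # point 2 minus centroid 0 -> [4. 4.]
```

Here diff[i, j] holds the difference vector between point i and centroid j, which is the layout the next step relies on.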

np.linalg.norm(..., axis=2): This computes the Euclidean distance between each data point and each centroid along the last axis (axis=2). The result is a distance matrix of shape (n_samples, K), where each row contains the distance of a data point to each centroid.
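
As a rough check of what the norm over axis=2 does, the sketch below compares it with the distance formula written out by hand. The array diff_toy is a hypothetical set of difference vectors (2 points, 2 centroids, 2 features), chosen so the distances are easy to verify.

```python
import numpy as np

# Hypothetical difference vectors, shape (2, 2, 2):
# diff_toy[i, j] is (point i) - (centroid j).
diff_toy = np.array([
    [[3.0, 4.0], [0.0, 0.0]],
    [[0.0, 0.0], [6.0, 8.0]],
])

# Norm over the feature axis collapses (2, 2, 2) to (2, 2).
dist = np.linalg.norm(diff_toy, axis=2)

# The same quantity written out: square, sum over features, square root.
manual = np.sqrt((diff_toy ** 2).sum(axis=2))

print(dist.shape)                 # (2, 2)
print(dist[0, 0])                 # 5.0, from the 3-4-5 triangle
print(np.allclose(dist, manual))  # True
```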

np.argmin(..., axis=1): This finds the index of the closest centroid (i.e., the centroid with the smallest distance for each data point). It returns an array of size n_samples (one index per data point), where each value corresponds to the index of the closest centroid.
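
The assignment step can be illustrated on a small distance matrix. The values in dist_toy below are made up for the example (4 points, 3 centroids); np.bincount is added here only to show how many points each cluster received.

```python
import numpy as np

# Hypothetical distance matrix: row i holds the distances from point i
# to each of the 3 centroids.
dist_toy = np.array([
    [0.5, 2.0, 3.1],
    [4.0, 0.2, 1.5],
    [2.2, 2.9, 0.1],
    [0.3, 0.9, 5.0],
])

# For each row, pick the column (centroid index) with the smallest distance.
labels_toy = np.argmin(dist_toy, axis=1)
print(labels_toy)  # [0 1 2 0]

# Count how many points were assigned to each of the 3 clusters.
counts = np.bincount(labels_toy, minlength=3)
print(counts)      # [2 1 1]
```

A count of zero for some cluster would mean that cluster is empty, in which case taking the mean of its points in the centroid update would produce NaN values.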

Question-21 for K=4