# k-means clustering 
- Author: Kianoosh Vadaei 
- Dataset: USPS



In [1]:
import numpy as np
from PIL import Image

**Code Description:**

This code processes five images of handwritten digits from the USPS dataset. Each image is split into non-overlapping 16x16 submatrices, and each submatrix is flattened into a 256-element vector. The vectors are collected into a NumPy array named `datas`, whose shape (number of vectors, 256) is displayed as the cell's output. The code assumes each image loads as a single-channel (grayscale) array whose height and width are multiples of 16.

In [None]:
all_vectors = []
for i in range(5):
    img = Image.open(f'./usps_{i+1}.jpg')
    img_array = np.array(img)
    # Assumes a single-channel (grayscale) image, so the array is 2-D
    row_size, column_size = img_array.shape
    for row_start in range(0, row_size, 16):
        for col_start in range(0, column_size, 16):
            # Extract the 16x16 submatrix and flatten it into a 256-element vector
            submatrix = img_array[row_start:row_start + 16, col_start:col_start + 16]
            flattened_vec = submatrix.flatten()
            all_vectors.append(flattened_vec)
# Stack all vectors into one array of shape (num_vectors, 256)
datas = np.array(all_vectors)
datas.shape
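
To make the block extraction concrete, here is a minimal sketch on a synthetic array instead of the USPS files. The names `fake_image`, `block_list`, and `block_vectors` are made up for this illustration, and the 32x48 size is an arbitrary choice that is a multiple of 16 in both directions.

```python
import numpy as np

# Hypothetical stand-in for one grayscale image: 32 rows x 48 columns
fake_image = (np.arange(32 * 48) % 256).astype(np.uint8).reshape(32, 48)

block = 16
block_list = []
for r in range(0, fake_image.shape[0], block):
    for c in range(0, fake_image.shape[1], block):
        # Each 16x16 block becomes one 256-element row vector
        block_list.append(fake_image[r:r + block, c:c + block].flatten())

block_vectors = np.array(block_list)
print(block_vectors.shape)  # 2 x 3 blocks, each flattened to 256 values
```

A 32x48 image yields 2 x 3 = 6 blocks, so the resulting array has shape (6, 256). Reshaping any row back to (16, 16) recovers the original block.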

**Code Description:**


This Python code implements the k-means clustering algorithm with the following functions:

- `initialize_centroids(data, k)`: Randomly selects k distinct data points as the initial centroids.
- `assign_to_clusters(data, centroids)`: Assigns each data point to its nearest centroid (Euclidean distance), forming clusters.
- `update_centroids(data, labels, k)`: Computes each new centroid as the mean of the data points in its cluster; an empty cluster gets an all-zero centroid.
- `kmeans(data, k, max_iters=100)`: Runs the algorithm until the centroids stop changing or `max_iters` iterations (100 by default) are reached. Returns the final centroids and cluster labels.
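
The assignment and update steps can also be written in standard notation. The symbols below are introduced only for this illustration and do not appear in the code: $x_1, \dots, x_n$ are the data vectors, $\mu_1, \dots, \mu_k$ the centroids, $c_i$ the label of point $i$, and $S_j$ the set of points assigned to cluster $j$.

```latex
c_i = \arg\min_{j \in \{1,\dots,k\}} \lVert x_i - \mu_j \rVert_2,
\qquad
\mu_j = \frac{1}{|S_j|} \sum_{i \in S_j} x_i,
\quad
S_j = \{\, i : c_i = j \,\}

J = \sum_{i=1}^{n} \lVert x_i - \mu_{c_i} \rVert_2^2
```

Neither step increases the within-cluster sum of squares $J$, which is why the iteration settles. The mean formula applies only to non-empty clusters ($|S_j| > 0$).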

In [None]:
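def initialize_centroids(data, k):
    # Pick k distinct rows of data as the starting centroids
    indices = np.random.choice(len(data), k, replace=False)
    return data[indices]

def assign_to_clusters(data, centroids):
    # Broadcasting gives an (n, k, d) array of differences; the norm over the
    # last axis yields an (n, k) matrix of Euclidean distances
    distances = np.linalg.norm(data[:, np.newaxis] - centroids, axis=2)
    return np.argmin(distances, axis=1)

def update_centroids(data, labels, k):
    new_centroids = np.zeros((k, data.shape[1]))
    for i in range(k):
        cluster_points = data[labels == i]
        # An empty cluster keeps the all-zero centroid set above
        if len(cluster_points) > 0:
            new_centroids[i] = np.mean(cluster_points, axis=0)
    return new_centroids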

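def kmeans(data, k, max_iters=100):
    # Work in floating point: image data is typically uint8, and subtracting
    # uint8 arrays wraps around modulo 256, which would corrupt the distances
    data = np.asarray(data, dtype=float)
    centroids = initialize_centroids(data, k)

    for _ in range(max_iters):
        labels = assign_to_clusters(data, centroids)
        new_centroids = update_centroids(data, labels, k)

        # Check for convergence: stop once the centroids no longer change
        if np.array_equal(centroids, new_centroids):
            break

        centroids = new_centroids

    return centroids, labels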
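
One detail is worth checking in the distance computation in `assign_to_clusters`. Image arrays are usually `uint8`, and subtracting `uint8` arrays wraps around modulo 256 rather than producing negative numbers. The sketch below uses two made-up 4-element vectors (not taken from the dataset) to show the effect, and why casting to float before subtracting gives the intended Euclidean distance.

```python
import numpy as np

# Two hypothetical 4-pixel "images" stored as uint8, as image arrays usually are
a = np.array([10, 200, 0, 255], dtype=np.uint8)
b = np.array([20, 100, 5, 250], dtype=np.uint8)

wrapped = a - b                             # uint8 arithmetic wraps modulo 256
exact = a.astype(float) - b.astype(float)   # float arithmetic keeps the sign

print(wrapped.tolist())  # values: 246, 100, 251, 5
print(exact.tolist())    # values: -10.0, 100.0, -5.0, 5.0

# The wrapped differences inflate the distance considerably
dist_wrapped = np.linalg.norm(wrapped)
dist_exact = np.linalg.norm(exact)
print(dist_wrapped > dist_exact)
```

Since a negative difference of -10 turns into 246, the wrapped distance can be far larger than the true one, which may send a point to the wrong cluster.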

**Code Description:**

This code applies k-means clustering with k=5 to the dataset `datas` and prints the final centroids and cluster labels. Because the initial centroids are chosen at random, the results can differ from run to run.

In [None]:
k = 5

# Run k-means clustering
centroids, labels = kmeans(datas, k)

print("Final Centroids:\n", centroids)
print("Labels:\n", labels)
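
Printing the raw label array is hard to read for more than a few points. A possible way to summarize it is to count the points per cluster with `np.bincount`. The label values below are invented for illustration; in the notebook they would be the `labels` returned by `kmeans`.

```python
import numpy as np

k = 5
# Hypothetical labels for 12 points; not actual output of the notebook
example_labels = np.array([0, 2, 2, 1, 4, 4, 4, 0, 2, 1, 2, 4])

# minlength=k makes sure empty clusters show up with a count of 0
cluster_sizes = np.bincount(example_labels, minlength=k)
for cluster_id, size in enumerate(cluster_sizes):
    print(f"cluster {cluster_id}: {size} points")
```

A cluster with zero points is worth noticing here, because `update_centroids` leaves the centroid of an empty cluster at all zeros.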

**Code Description:**

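This code reshapes each of the k=5 centroids obtained from k-means clustering back into a 16x16 image, saves it as a PNG file, and displays it. The output directory `kianoosh/k=5/centroids/` must already exist.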

In [None]:
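for i in range(k):
    # Reshape the 256-element centroid back into a 16x16 image
    tmp_centroid = centroids[i, :]
    centroid_matrix = tmp_centroid.reshape((16, 16))
    # Cast the float means to 8-bit pixel values (fractions are truncated)
    image_data = np.uint8(centroid_matrix)

    image = Image.fromarray(image_data)

    # The target directory must already exist
    image.save(f"kianoosh/k=5/centroids/centroid{i+1}.png")
    image.show()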
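
Centroids are float means of pixel intensities, so converting them to 8-bit pixels involves a choice. The sketch below compares a plain cast, which truncates the fractional part, with rounding to the nearest integer. The values in `centroid_values` are invented for illustration, and rounding is offered as an alternative rather than as what the notebook does.

```python
import numpy as np

# Hypothetical centroid values; real centroids are means of pixel intensities
centroid_values = np.array([0.4, 127.9, 254.6, 255.0])

# A plain cast drops the fractional part
truncated = centroid_values.astype(np.uint8)

# Round to the nearest integer and clip to the valid pixel range first
rounded = np.clip(np.rint(centroid_values), 0, 255).astype(np.uint8)

print(truncated.tolist())  # values: 0, 127, 254, 255
print(rounded.tolist())    # values: 0, 128, 255, 255
```

The difference is at most one intensity level per pixel, so it rarely matters visually, but rounding avoids a slight systematic darkening of the saved images.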
