## Image Clustering with K-Means

* By Mohamamd Hassan Heydari


***
**Simple from-scratch course on K-Means implementation in Python**
* https://youtu.be/5w5iUbTlpMQ?feature=shared

**Recommended full course on unsupervised learning**
* https://www.coursera.org/learn/unsupervised-learning-recommenders-reinforcement-learning?specialization=machine-learning-introduction

**GeeksForGeeks K-Means tutorial**
* https://www.geeksforgeeks.org/k-means-clustering-introduction/

***

Image clustering is one of the use cases of k-means: it lets us group unlabeled images. In a computer, each image can be represented as a tensor with (m, n, p) dimensions, where m and n are the height and width of the image and p holds the (r, g, b) color channels of each pixel. In our specific dataset, a set of handwritten digits stored in files named `usps_*.jpg` (16*16 is the USPS digit size; MNIST images are 28*28), the images are 16*16 in grayscale, so we don't need the RGB part of the pixel. We can simply represent each image as a 2D matrix, which is 16*16 here.
To train our k-means algorithm we need 1D vectors, so we flatten each image into shape (1, 16*16), or simply (1, 256): a vector with 256 features.
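
As a small illustration of the flattening step, here is a sketch that uses a made-up 16*16 array instead of a real digit image (the pixel values are placeholders, not data from the project). It shows both the plain 256-element vector and the (1, 256) row-vector form, and that reshaping recovers the original image.

```python
import numpy as np

# Hypothetical 16*16 grayscale image; the values 0..255 are placeholders
image = np.arange(256, dtype=np.uint8).reshape(16, 16)

# Flatten row by row into one vector with 256 features
vector = image.flatten()
print(vector.shape)  # (256,)

# The same data as a (1, 256) row vector
row_vector = image.reshape(1, -1)
print(row_vector.shape)  # (1, 256)

# Reshaping the vector back gives the original image
restored = vector.reshape(16, 16)
print(np.array_equal(restored, image))  # True
```

The same reshape is what later turns a 256-feature centroid back into a 16*16 picture.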


Our main image directory contains 5 big images, one for each of the digits 1 to 5. Each big image is 544*528 pixels and holds 34*33 = 1122 small images of that digit, each 16*16 pixels.
After training our model with several different numbers of cluster centroids, we save the centroids in their specific pre-built directories.
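
The tile arithmetic can be checked with a short sketch. It uses a random placeholder array in place of a real big image and assumes the array has 544 rows and 528 columns; the names `sheet`, `TILE`, and `flat_tiles` are only for illustration. The reshape-based slicing is one possible way to cut the tiles, shown here as an alternative to nested loops.

```python
import numpy as np

TILE = 16

# Placeholder for one big image (assumed 544 rows by 528 columns)
rng = np.random.default_rng(0)
sheet = rng.integers(0, 256, size=(544, 528), dtype=np.uint8)

rows = sheet.shape[0] // TILE  # 544 / 16 = 34
cols = sheet.shape[1] // TILE  # 528 / 16 = 33

# Split both axes into (tile index, offset inside tile), then bring the
# two tile indices to the front: result shape is (34, 33, 16, 16)
tiles = sheet.reshape(rows, TILE, cols, TILE).swapaxes(1, 2)

# One flattened 256-feature vector per tile
flat_tiles = tiles.reshape(rows * cols, TILE * TILE)
print(flat_tiles.shape)  # (1122, 256)

# Five big images give this many samples in total
print(5 * rows * cols)  # 5610
```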

In [None]:
import numpy as np # for our numerical calculations
from PIL import Image # to read and save images
import matplotlib.pyplot as plt # for showing the results

* First of all, we need to load our data. This function loads the 5 big images and slices each one into 1122 small images of 16*16 pixels, which gives 5 * 1122 = 5610 samples in total.

In [None]:
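def data_redaer():
    dataset = []
    for digit in range(5):
        # convert('L') guarantees a single grayscale channel per pixel
        img = Image.open(f'images/usps_{digit + 1}.jpg').convert('L')

        img_array = np.array(img)

        # row/col are separate names, so the outer loop variable is not shadowed
        for row in range(0, img_array.shape[0], 16):
            for col in range(0, img_array.shape[1], 16):

                # slicing Big 544*528 image to 16*16 small image
                small_image = img_array[row : row + 16, col : col + 16]

                # making our image a 1D Vector
                small_image = small_image.flatten()

                dataset.append(small_image)

    # float dtype avoids uint8 wrap-around when distances are computed later
    dataset = np.array(dataset, dtype=np.float64)

    return dataset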

* To initialize the first centroids, we pick k random samples from the dataset with this function:

In [None]:
def init_centroids(X, k):
    randidx = np.random.permutation(X.shape[0])
    centroids = X[randidx[:k]]
    return centroids
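
Because the starting centroids are random, each run can end in different clusters. A minimal sketch of a reproducible variant follows; the name `init_centroids_seeded` and the `seed` parameter are assumptions added for illustration and are not part of the original code.

```python
import numpy as np

def init_centroids_seeded(X, k, seed=0):
    """Pick k distinct rows of X as starting centroids, reproducibly."""
    rng = np.random.default_rng(seed)
    # replace=False guarantees that no sample is chosen twice
    chosen = rng.choice(X.shape[0], size=k, replace=False)
    # copy as float so later mean updates do not touch the dataset
    return X[chosen].astype(np.float64)

# Toy data: 10 samples with 4 features each
X_demo = np.arange(40, dtype=np.float64).reshape(10, 4)

c1 = init_centroids_seeded(X_demo, 3, seed=42)
c2 = init_centroids_seeded(X_demo, 3, seed=42)
print(c1.shape)                # (3, 4)
print(np.array_equal(c1, c2))  # True
```

Fixing the seed makes it easier to compare results across different values of k.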

* As k-means runs, it repeatedly updates the cluster centroids. Each centroid is moved to the mean of the samples currently assigned to it:

In [None]:
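def compute_centroids(X, idx, K):
    n = X.shape[1]  # number of features per sample
    centroids = np.zeros((K, n))

    for k in range(K) :
        # all samples currently assigned to cluster k
        points = X[idx == k]

        # an empty cluster keeps the all-zero row from the initialization
        if len(points) > 0:
            centroids[k] = np.mean(points, axis= 0)

    return centroids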
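
A tiny worked example of the update rule, using four made-up 2D points instead of 256-feature images (the arrays `X_demo` and `idx_demo` are illustrative only):

```python
import numpy as np

# Four toy samples and their current cluster assignments
X_demo = np.array([[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 14.0]])
idx_demo = np.array([0, 0, 1, 1])

# Each centroid is the feature-wise mean of its assigned samples
centroid_0 = X_demo[idx_demo == 0].mean(axis=0)
centroid_1 = X_demo[idx_demo == 1].mean(axis=0)

print(centroid_0)  # [1. 0.]
print(centroid_1)  # [11. 12.]
```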

* Each sample of the dataset should be assigned to its closest centroid. In this project, we do that with this code:

In [None]:
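def find_closest_centroid(X, centroids):
    K = centroids.shape[0]
    idx = np.zeros(X.shape[0], dtype=int)

    # work in floating point: subtracting uint8 pixel arrays would wrap around
    X = X.astype(float)
    centroids = centroids.astype(float)

    for i in range(X.shape[0]):
        distances = []
        for j in range(K):
            # Euclidean distance between sample i and centroid j
            norm_i_j = np.linalg.norm(X[i] - centroids[j])
            distances.append(norm_i_j)

        # index of the closest centroid
        idx[i] = np.argmin(distances)

    return idx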
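
The two nested loops can also be replaced by NumPy broadcasting. The following is a sketch of that alternative, not the code used in this project; the function name `find_closest_centroid_vectorized` is hypothetical. It trades memory for speed: the intermediate array has shape (samples, centroids, features), roughly 80 MB in float64 for 5610 samples, 7 centroids, and 256 features.

```python
import numpy as np

def find_closest_centroid_vectorized(X, centroids):
    """Return, for every row of X, the index of its nearest centroid."""
    # Cast to float so that uint8 pixel values cannot wrap around
    X = np.asarray(X, dtype=np.float64)
    centroids = np.asarray(centroids, dtype=np.float64)

    # (m, 1, n) - (1, K, n) broadcasts to (m, K, n)
    diffs = X[:, None, :] - centroids[None, :, :]

    # Euclidean distance of every sample to every centroid: shape (m, K)
    distances = np.linalg.norm(diffs, axis=2)

    return np.argmin(distances, axis=1)

# Toy check with uint8 data, like raw pixel values
X_demo = np.array([[0, 0], [1, 1], [9, 9], [10, 10]], dtype=np.uint8)
centroids_demo = np.array([[0, 0], [10, 10]], dtype=np.uint8)

print(find_closest_centroid_vectorized(X_demo, centroids_demo))  # [0 0 1 1]
```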

* In the main function of the algorithm, we use the previous functions to update the centroids and their assigned samples. We repeat this process until the maximum number of iterations is reached.

In [None]:
def run_k_means(X, initial_centroids, max_iters=100):
    m, n = X.shape
    K = initial_centroids.shape[0]
    centroids = initial_centroids
    idx = np.zeros(m, dtype=int)

    for i in range(max_iters):
        print(f'Epoch : {i} | K = {K}')
        # step 1: assign every sample to its closest centroid
        idx = find_closest_centroid(X, centroids)
        # step 2: move every centroid to the mean of its samples
        centroids = compute_centroids(X, idx, K)

    return centroids, idx
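
Running a fixed number of iterations is simple, but k-means often stops changing much earlier. Below is a sketch of a variant that stops once the centroids stop moving. It is an illustration under assumptions, not the project's method: the name `k_means_until_stable` and the `tol` parameter are hypothetical, and an empty cluster keeps its previous centroid here.

```python
import numpy as np

def k_means_until_stable(X, initial_centroids, max_iters=100, tol=1e-6):
    """Run k-means until centroids move less than tol (assumes max_iters >= 1)."""
    X = np.asarray(X, dtype=np.float64)
    centroids = np.asarray(initial_centroids, dtype=np.float64).copy()

    for iteration in range(max_iters):
        # Assignment step: nearest centroid for every sample
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        idx = np.argmin(distances, axis=1)

        # Update step: mean of the members; empty clusters stay where they are
        new_centroids = centroids.copy()
        for k in range(centroids.shape[0]):
            members = X[idx == k]
            if len(members) > 0:
                new_centroids[k] = members.mean(axis=0)

        # Total movement of all centroids in this iteration
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift <= tol:
            break

    return centroids, idx, iteration + 1

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])
start = np.array([[0.0, 0.0], [10.0, 10.0]])

final_centroids, final_idx, n_iters = k_means_until_stable(X_demo, start)
print(final_centroids)  # rows [0, 1] and [10, 11]
print(final_idx)        # [0 0 1 1]
print(n_iters)          # 2
```

On this toy data the first iteration moves the centroids and the second confirms that nothing changes, so the loop ends after 2 of the 100 allowed iterations.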

* Finally, in the main body of the code, we run k-means for different values of k (3 to 7) to examine how k affects the result. In the last part, we save the cluster centroids to their directories.

In [None]:
dataset = data_redaer()

for k in [3, 4, 5, 6, 7] :

    initial_centroids = init_centroids(dataset, k)
    # Run k-means clustering
    centroids, labels = run_k_means(dataset, initial_centroids)

    # Reshape the centroids into images
    centroids = centroids.reshape((k, 16, 16))

    if k == 5 :
        for item in centroids :
            plt.imshow(item, cmap='gray')
            plt.show()

    for i in range(k):

        image_data = np.uint8(centroids[i])
        image = Image.fromarray(image_data)

        image.save(f"centroids/{k}/centroid{i + 1}.png")
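
Looking at the centroid images is a visual check. One possible numeric complement, offered here only as a sketch, is the within-cluster sum of squared distances (often called inertia); lower values mean the samples sit closer to their centroids. The function name `inertia` and the toy arrays are assumptions for illustration.

```python
import numpy as np

def inertia(X, centroids, labels):
    """Sum of squared distances from each sample to its assigned centroid."""
    X = np.asarray(X, dtype=np.float64)
    centroids = np.asarray(centroids, dtype=np.float64)
    # centroids[labels] picks, for each sample, the centroid it belongs to
    diffs = X - centroids[labels]
    return float(np.sum(diffs ** 2))

X_demo = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 10.0], [10.0, 12.0]])

# One cluster: a single centroid at the overall mean
one_cluster = inertia(X_demo, np.array([[5.0, 6.0]]), np.array([0, 0, 0, 0]))

# Two clusters: one centroid per pair of nearby points
two_clusters = inertia(
    X_demo,
    np.array([[0.0, 1.0], [10.0, 11.0]]),
    np.array([0, 0, 1, 1]),
)

print(one_cluster)   # 204.0
print(two_clusters)  # 4.0
```

With the loop above, this would need the flattened centroids of shape (k, 256), that is, the values before they are reshaped into 16*16 images. Inertia always tends to fall as k grows, so it is more useful for spotting where the improvement levels off than for picking the smallest value.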
