# Image Clustering Analysis with PCA
This notebook implements an image clustering pipeline that:
1. Loads PNG images
2. Extracts features from the images
3. Applies PCA for dimensionality reduction
4. Uses the elbow and silhouette methods to choose the number of clusters
5. Performs K-means clustering
6. Visualizes the clustering results

## Step 1 - Setup
First, install the required packages. One way to do this is to create a virtual environment from requirements.txt.

- In VS Code, use the Command Palette:
    - Press Ctrl+Shift+P (Windows/Linux) or Cmd+Shift+P (macOS)
    - Type "Python: Create Environment..."
    - Select Venv as the virtual environment type
    - Select a recent Python interpreter (for example, 3.11.3 64-bit)
    - Select requirements.txt so the listed packages are installed
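
Once the environment exists, a quick check can confirm that the packages are importable before running the rest of the notebook. The snippet below is a minimal sketch and not part of the original notebook: `missing_packages` is a hypothetical helper, and the module names are simply the ones imported in Step 2.

```python
import importlib.util

def missing_packages(module_names):
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]

# Import names used in Step 2 (hypothetical check, for illustration only)
required = ["cv2", "numpy", "sklearn", "matplotlib"]
missing = missing_packages(required)
print("Missing packages:", missing if missing else "none")
```

If anything is reported as missing, the selected interpreter is probably not the virtual environment created above.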

## Step 2 - Import Dependencies

In [32]:
import cv2
import os
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
import matplotlib.pyplot as plt

# Configure matplotlib for inline display
%matplotlib inline

## Step 3 - Load the images
First, define helper functions that load each PNG image and turn it into a fixed-length feature vector.

In [33]:
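def extract_features(image_path):
    # Load the image with OpenCV (BGR channel order); imread returns None on failure
    image = cv2.imread(image_path)
    if image is None:
        raise ValueError(f"Could not read image: {image_path}")
    # Resize to a fixed size so every image yields the same number of features
    resized_image = cv2.resize(image, (100, 100))
    # Flatten the 100x100x3 pixel array into a 1D vector of 30,000 values
    return resized_image.flatten()

def load_images(folder_path):
    # Collect PNG files (case-insensitive), sorted for a reproducible order
    image_paths = sorted(
        os.path.join(folder_path, f)
        for f in os.listdir(folder_path)
        if f.lower().endswith('.png')
    )
    images = [extract_features(image_path) for image_path in image_paths]
    return image_paths, np.array(images)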


Next, load the images.

In [34]:
# set the input path
input_path = "./images"

image_paths, images = load_images(input_path)
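
Before moving on, it can help to confirm that the feature matrix has the expected shape: one row per image and 100 × 100 × 3 = 30,000 columns. The sketch below is an illustration only. `describe_feature_matrix` is a hypothetical helper, and it runs on a synthetic stand-in array rather than the real `images` matrix so that it works without the `./images` folder.

```python
import numpy as np

def describe_feature_matrix(images, height=100, width=100, channels=3):
    """Check that each row holds height * width * channels values and summarize the matrix."""
    expected = height * width * channels
    if images.ndim != 2 or images.shape[1] != expected:
        raise ValueError(f"Expected shape (n_images, {expected}), got {images.shape}")
    return {
        "n_images": images.shape[0],
        "n_features": images.shape[1],
        "dtype": str(images.dtype),
    }

# Synthetic stand-in for the matrix returned by load_images (4 random "images")
rng = np.random.default_rng(0)
demo_images = rng.integers(0, 256, size=(4, 100 * 100 * 3), dtype=np.uint8)
summary = describe_feature_matrix(demo_images)
print(summary)
```

In the notebook, calling `describe_feature_matrix(images)` would raise an error if the folder contained no PNG files, because the resulting array would not be two-dimensional.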

## Step 4 - Use PCA to reduce dimensionality
Use PCA to reduce each flattened image to at most 50 principal components, then plot the share of the total variance that each component explains.

In [None]:
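# Cap n_components: PCA cannot return more components than there are images or features
pca = PCA(n_components=min(50, *images.shape))
reduced_images = pca.fit_transform(images)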

# Plot the PCA variances
explained_variances = pca.explained_variance_ratio_

plt.figure(figsize=(12, 6))
plt.bar(range(1, len(explained_variances) + 1), explained_variances, color='skyblue')
plt.xlabel('Principal Component')
plt.ylabel('Explained Variance Ratio')
plt.title('Explained Variance by Principal Component')
plt.xticks(range(1, len(explained_variances) + 1))
plt.grid(axis='y', linestyle='--', alpha=0.7)
plt.show()
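
The bar chart shows per-component variance, but the cumulative total is often easier to act on. The following is a minimal sketch of how one might pick the smallest number of components that reaches a target share of variance. `components_for_variance` is a hypothetical helper, the 0.90 threshold is an assumed value rather than one taken from this notebook, and the demo ratios are made up.

```python
import numpy as np

def components_for_variance(explained_variance_ratio, threshold=0.90):
    """Return the smallest number of leading components whose cumulative ratio reaches threshold."""
    cumulative = np.cumsum(explained_variance_ratio)
    # Index of the first component at which the running total reaches the threshold
    first_index = int(np.searchsorted(cumulative, threshold, side="left"))
    # If the threshold is never reached, fall back to all available components
    return min(first_index + 1, len(cumulative))

# Made-up ratios for illustration; in the notebook, pass pca.explained_variance_ratio_
demo_ratios = [0.5, 0.3, 0.15, 0.05]
n_keep = components_for_variance(demo_ratios, threshold=0.90)
print(n_keep)
```

With real data, `components_for_variance(pca.explained_variance_ratio_)` gives a starting point for `n_components`; the right threshold depends on the image set.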

## Step 5 - Determine how many clusters to use
Determine the number of clusters to use by comparing the elbow and silhouette methods.

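First, use the elbow method: fit K-means for a range of cluster counts and look for the point where the within-cluster sum of squares (WCSS) stops dropping sharply.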

In [None]:
# Set the maximum clusters to consider
max_clusters = 10

wcss = []  # Within-cluster sum of squares

for i in range(1, max_clusters + 1):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(reduced_images)
    wcss.append(kmeans.inertia_)

# Plot the elbow curve
plt.figure(figsize=(10, 6))
plt.plot(range(1, max_clusters + 1), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Within-Cluster Sum of Squares (WCSS)')
plt.xticks(np.arange(1, max_clusters + 1))
plt.grid(True)
plt.show()
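
Reading the elbow off a plot is subjective. One common heuristic, sketched below under stated assumptions, picks the point farthest from the straight line joining the first and last points of the curve after scaling both axes to [0, 1]. `find_elbow` is a hypothetical helper, this is only one of several heuristics, and the WCSS values are made up for illustration.

```python
import numpy as np

def find_elbow(cluster_counts, wcss):
    """Return the cluster count whose point lies farthest from the chord of the WCSS curve."""
    ks = np.asarray(cluster_counts, dtype=float)
    values = np.asarray(wcss, dtype=float)
    # Scale both axes to [0, 1] so distances are not dominated by the WCSS magnitude
    # (assumes at least two distinct cluster counts and non-constant WCSS)
    x = (ks - ks.min()) / (ks.max() - ks.min())
    y = (values - values.min()) / (values.max() - values.min())
    # Perpendicular distance from each point to the line through the first and last points
    dx, dy = x[-1] - x[0], y[-1] - y[0]
    distances = np.abs(dx * (y - y[0]) - dy * (x - x[0])) / np.hypot(dx, dy)
    return int(ks[np.argmax(distances)])

# Made-up WCSS values for k = 1..10, for illustration only
demo_wcss = [1000, 400, 150, 120, 100, 90, 85, 80, 78, 76]
elbow_k = find_elbow(range(1, 11), demo_wcss)
print(elbow_k)
```

In the notebook, `find_elbow(range(1, max_clusters + 1), wcss)` would give a suggestion to compare against the plot, not a definitive answer.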

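Next, try the silhouette method, which scores how well each image fits its own cluster compared with the nearest other cluster. Scores range from -1 to 1, and higher values indicate better-separated clusters.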

In [None]:
# Set the maximum clusters to consider
max_clusters = 10

silhouette_scores = []  # Silhouette scores

for i in range(2, max_clusters + 1):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(reduced_images)
    labels = kmeans.labels_
    silhouette_scores.append(silhouette_score(reduced_images, labels))

# Plot the silhouette scores
plt.figure(figsize=(10, 6))
plt.plot(range(2, max_clusters + 1), silhouette_scores, marker='o', linestyle='--')
plt.title('Silhouette Method')
plt.xlabel('Number of Clusters')
plt.ylabel('Silhouette Score')
plt.xticks(np.arange(2, max_clusters + 1))
plt.grid(True)
plt.show()
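
Unlike the elbow curve, the silhouette plot has a direct reading: the cluster count with the highest score. The short sketch below shows that selection; `best_k_by_silhouette` is a hypothetical helper and the scores are made up for illustration.

```python
import numpy as np

def best_k_by_silhouette(cluster_counts, scores):
    """Return the cluster count with the highest silhouette score."""
    cluster_counts = list(cluster_counts)
    if len(cluster_counts) != len(scores):
        raise ValueError("cluster_counts and scores must have the same length")
    return cluster_counts[int(np.argmax(scores))]

# Made-up scores for k = 2..5, for illustration only
demo_scores = [0.31, 0.42, 0.38, 0.35]
best_k = best_k_by_silhouette(range(2, 6), demo_scores)
print(best_k)
```

In the notebook, the equivalent call would be `best_k_by_silhouette(range(2, max_clusters + 1), silhouette_scores)`. If the result disagrees with the elbow plot, it is worth inspecting the clusters for both candidate values.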

## Step 6 - Cluster the images using K-Means

Cluster the images using the number of clusters suggested by the elbow and silhouette methods, then plot the clusters along the first two principal components.

In [None]:
# Set the number of clusters (adjust based on the elbow and silhouette plots)
num_clusters = 5

# Cluster images using K-means; fit_predict fits the model and returns the labels
kmeans = KMeans(n_clusters=num_clusters, init='k-means++', random_state=42)
labels = kmeans.fit_predict(reduced_images)

# Plot clusters on a scatter plot
plt.figure(figsize=(8, 6))
for i in range(len(np.unique(labels))):
    plt.scatter(reduced_images[labels == i, 0], reduced_images[labels == i, 1], label=f'Cluster {i}')
    
plt.title('Cluster Visualization')
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.legend()
plt.show()
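
The scatter plot shows cluster shapes but not which files ended up together. A minimal sketch for mapping the labels back to file paths follows; `group_paths_by_cluster` is a hypothetical helper, and the paths and labels are made-up stand-ins for `image_paths` and `labels`.

```python
from collections import defaultdict

def group_paths_by_cluster(image_paths, labels):
    """Map each cluster id to the list of image paths assigned to it."""
    groups = defaultdict(list)
    for path, label in zip(image_paths, labels):
        groups[int(label)].append(path)
    # Sort by cluster id so the output order is stable
    return dict(sorted(groups.items()))

# Made-up paths and labels for illustration; in the notebook, pass image_paths and labels
demo_paths = ["./images/a.png", "./images/b.png", "./images/c.png", "./images/d.png"]
demo_labels = [1, 0, 1, 0]
clusters = group_paths_by_cluster(demo_paths, demo_labels)
for cluster_id, paths in clusters.items():
    print(f"Cluster {cluster_id}: {len(paths)} images")
```

This relies on `labels` being in the same order as `image_paths`, which holds in this notebook because both derive from the same list.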

## Next Steps
You might want to try:
- Adjusting the number of clusters to see how the groupings change
- Creating a collage of thumbnails of the original images
- Visualizing the images by cluster to verify the groupings
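
As a starting point for the collage idea, the sketch below tiles flattened 100×100 images into a single grid array. `make_collage` is a hypothetical helper, not part of the original notebook, and the demo uses random pixels in place of the real `images` matrix.

```python
import numpy as np

def make_collage(flat_images, cols=5, height=100, width=100, channels=3):
    """Tile flattened images into a grid, row by row; unused cells stay black."""
    flat_images = np.asarray(flat_images)
    n_images = flat_images.shape[0]
    tiles = flat_images.reshape(n_images, height, width, channels)
    rows = -(-n_images // cols)  # ceiling division
    collage = np.zeros((rows * height, cols * width, channels), dtype=tiles.dtype)
    for index, tile in enumerate(tiles):
        r, c = divmod(index, cols)
        collage[r * height:(r + 1) * height, c * width:(c + 1) * width] = tile
    return collage

# Random pixels standing in for 7 flattened images, for illustration only
rng = np.random.default_rng(0)
demo = rng.integers(0, 256, size=(7, 100 * 100 * 3), dtype=np.uint8)
collage = make_collage(demo, cols=3)
print(collage.shape)
```

To view one cluster in the notebook, something like `plt.imshow(make_collage(images[labels == 0])[..., ::-1])` should work; the `[..., ::-1]` reverses the channel order because OpenCV loads images as BGR while matplotlib expects RGB.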