# Synopsis
This experiment compresses image data by quantizing colors with KMeans clustering, then compares each quantized image with its original using the Euclidean distance between corresponding pixels as the metric. The lower the Euclidean distance, the closer the quantized pixel values are to the originals, and therefore the more faithful the compression. The experiment combines a quantitative assessment (processing numerical values) with a qualitative assessment (visually inspecting the quantized images).

To test this idea, the number of clusters is varied from 1 to 15 and the distances are plotted as a surface. The quantitative assessment calculates the average Euclidean distance over the whole data set for each number of clusters and looks for the lowest value, which indicates the optimal number of clusters for compression. The qualitative assessment inspects the quantized images to check that the expressiveness of the originals is preserved, since images carry both chromatic and textural properties. For the experiment to succeed, the qualitative and quantitative assessments must both agree that the compression was effective.
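To make the method concrete, here is a minimal sketch of color quantization and the distance metric. It does not reproduce `MushroomGarden`: the function names `quantize` and `mean_pixel_distance`, the hand-rolled k-means loop, and the random test image are all assumptions made for illustration, and the real class may rely on a library implementation instead.

```python
import numpy as np

def quantize(image, n_clusters, n_iters=10, seed=0):
    """Quantize an H x W x 3 image to at most n_clusters colors (basic k-means)."""
    rng = np.random.default_rng(seed)
    pixels = image.reshape(-1, 3).astype(np.float64)
    # Initialize centroids from randomly chosen pixels
    start = rng.choice(len(pixels), size=n_clusters, replace=False)
    centroids = pixels[start].copy()

    def assign(points, centers):
        # Index of the nearest centroid for every pixel
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        return dists.argmin(axis=1)

    for _ in range(n_iters):
        labels = assign(pixels, centroids)
        for k in range(n_clusters):
            members = pixels[labels == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)

    # Replace every pixel with the color of its nearest centroid
    labels = assign(pixels, centroids)
    return centroids[labels].reshape(image.shape)

def mean_pixel_distance(original, quantized):
    """Average Euclidean distance between corresponding pixels."""
    diff = original.astype(np.float64) - quantized.astype(np.float64)
    return float(np.linalg.norm(diff, axis=-1).mean())

# Illustrative input only: a small random image, not a mushroom photo
demo = np.random.default_rng(1).integers(0, 256, size=(16, 16, 3)).astype(np.uint8)
for k in (1, 4, 15):
    print(k, mean_pixel_distance(demo, quantize(demo, k)))
```

The sketch returns a float image; a real pipeline would likely round and cast back to `uint8` before saving or displaying the result.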

# Imports

In [None]:
# Automatically reload external code if changes are made
%load_ext autoreload
%autoreload 2

from mushrooms import MushroomGarden

In [None]:
from json import loads
from os.path import join
from matplotlib.animation import FuncAnimation
from mpl_toolkits.mplot3d import Axes3D

import cv2
import numpy as np
import matplotlib.pyplot as plt

In [None]:
# The class containing all logic needed for the experiment
mg = MushroomGarden()

# Data
## Examples
Display an example from each species of mushroom.

In [None]:
examples = []

fig = plt.figure(figsize=(20,8))

for idx, (species, filename, bgr_im) in enumerate(mg.random_example_set()):
    # Convert from BGR to RGB so matplotlib displays the colors correctly
    rgb_im = cv2.cvtColor(bgr_im, cv2.COLOR_BGR2RGB)
    rgb_im = mg.preprocess_image(rgb_im)

    plt.subplot(2, 6, idx + 1)
    
    plt.axis('off')
    plt.title(f'{species.upper()}:\n{filename}')
    plt.imshow(rgb_im)

## Image Count Distribution
Determine the distribution of data. This will inform how many samples are needed to perform the experiment with fair representation.

In [None]:
labels = mg.species
values = list(mg.metadata.values())

fig = plt.figure()
plt.bar(labels, values)
plt.title('# of images per mushroom type')
plt.xlabel('Mushroom species')
plt.xticks(rotation=50)
plt.ylabel('# of images')
plt.grid(axis='y', linestyle='--')
plt.show()

Given the imbalanced nature of the data, we will opt for an undersampling method of 200 images per mushroom type, ensuring that at least 10% of each type is represented.
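The sampling rule can be sketched as follows. This is an illustration under stated assumptions, not the logic inside `MushroomGarden`: the helper names `undersample` and `coverage`, the species names, and the image counts are all hypothetical.

```python
import random

def undersample(files_by_species, sample_size=200, seed=0):
    """Draw up to sample_size files per species, without replacement."""
    rng = random.Random(seed)
    samples = {}
    for species, files in files_by_species.items():
        k = min(sample_size, len(files))  # a species may have fewer files
        samples[species] = rng.sample(files, k)
    return samples

def coverage(files_by_species, samples):
    """Fraction of each species' images that ended up in the sample."""
    return {
        species: len(samples[species]) / len(files)
        for species, files in files_by_species.items()
    }

# Hypothetical counts, for illustration only
files_by_species = {
    'species_a': [f'a_{i}.jpg' for i in range(350)],
    'species_b': [f'b_{i}.jpg' for i in range(1500)],
}
samples = undersample(files_by_species)
for species, fraction in coverage(files_by_species, samples).items():
    print(f'{species}: {fraction:.1%} of images sampled')
```

With a fixed sample of 200, the 10% condition holds as long as no species has more than 2000 images.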

## Process all samples to produce average euclidean distance per no. of clusters
Gather 200 samples from each species and determine the average euclidean distance for a given number of clusters.

In [None]:
# This code block is commented out because the process duration is long.
# Run only once and reuse the saved outputs for further processing.
# Uncomment and rerun if the core logic changes.
# For faster iteration, use a smaller sample size.

# SAMPLE_SIZE = 200
# NO_CLUSTERS = 15

# for no_clusters in range(1, NO_CLUSTERS+1):
#     mg.process_all_samples(SAMPLE_SIZE, no_clusters)

If the cell above was run, the results will have been saved in the parent directory to prevent accidental overwriting. Create the directory `data/results` and move the results there.
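The move can also be scripted. The sketch below assumes the saved files follow the `{no_clusters}_results.json` naming used when the results are loaded later in this notebook; the helper name `collect_results` and the example source path are hypothetical.

```python
import shutil
from pathlib import Path

def collect_results(src_dir, dst_dir):
    """Move every '*_results.json' file from src_dir into dst_dir."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)  # create data/results if missing
    moved = []
    for path in sorted(src.glob('*_results.json')):
        target = dst / path.name
        if target.exists():
            # Skip rather than overwrite the output of an earlier run
            continue
        shutil.move(str(path), str(target))
        moved.append(target.name)
    return moved

# Example call, assuming the results sit one directory above the notebook:
# collect_results('..', Path('data') / 'results')
```

Skipping existing files keeps the same safeguard against accidental overwriting that motivated saving the results elsewhere in the first place.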

# Process results from various # of clusters
Gather all results into a single list for further processing.

This logic assumes that the directory `data/results` exists.

In [None]:
results = []
for no_clusters in range(1,16):
    with open(join('data', 'results', f'{no_clusters}_results.json')) as f:
        json_result = loads(f.read())
        
        results.append(json_result['avg_mush_euclidean'])

We can now begin the quantitative assessment.
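As a numerical companion to the plots, the per-cluster average described in the synopsis can be computed directly. This is a sketch under assumptions: it treats `results` as a list with one row per cluster count and one average distance per species, the helper names are hypothetical, and the toy values are made up purely to show the mechanics.

```python
import numpy as np

def mean_distance_per_cluster_count(results):
    """Average the per-species distances for each number of clusters."""
    table = np.asarray(results, dtype=float)  # (n_cluster_counts, n_species)
    return table.mean(axis=1)

def best_cluster_count(results, first_count=1):
    """Cluster count with the lowest average distance (rows start at first_count)."""
    means = mean_distance_per_cluster_count(results)
    return first_count + int(np.argmin(means))

# Made-up values: 3 cluster counts x 2 species
toy_results = [
    [5.0, 7.0],
    [2.0, 4.0],
    [3.0, 5.0],
]
print(mean_distance_per_cluster_count(toy_results))  # [6. 3. 4.]
print(best_cluster_count(toy_results))               # 2
```

Because every species contributes the same number of samples, averaging the per-species means weights each image equally.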

Display the results as an interactive 3D surface.

In [None]:
%matplotlib widget

# Build the grid for the surface plot: species along x, number of clusters along y
Z1 = np.array(results)  # shape: (# of cluster counts, # of species)
x = np.arange(1, Z1.shape[1] + 1)
y = np.arange(1, Z1.shape[0] + 1)
X, Y = np.meshgrid(x, y)

# Create the figure and axis objects
fig1 = plt.figure()
ax1 = fig1.add_subplot(111, projection='3d')

# Plot the surface
surf1 = ax1.plot_surface(X, Y, Z1, cmap='viridis')

# Set labels and title
ax1.set_xlabel('Mushroom species')
ax1.set_ylabel('# of clusters')
ax1.set_zlabel('Euclidean distance')
ax1.set_title('Surface plot of avg. euclidean distances')

# Show the plot
plt.show()

The results are a little difficult to visually inspect, so let's create an approximate logarithmic fit to the data so that we can interpret it more easily.

In [None]:
%matplotlib widget

Z2 = []
x2 = np.linspace(1, 100, 100)
y2 = np.linspace(1, no_clusters, no_clusters)
X, Y = np.meshgrid(x2, y2)

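for r in results:
    # Fit r ≈ a * log(species index) + b; np.polyfit takes the independent variable first
    species_idx = np.arange(1, len(r) + 1)
    x_fit = np.linspace(1, len(r), 100)
    coefficients = np.polyfit(np.log(species_idx), r, deg=1)
    z = np.polyval(coefficients, np.log(x_fit))
    # Negate so that lower distances appear as higher points on the surface
    Z2.append(-z)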

Z2 = np.array(Z2)
fig2 = plt.figure()
ax2 = fig2.add_subplot(111, projection='3d')

surf2 = ax2.plot_surface(X, Y, Z2, cmap='viridis')
ax2.set_xlabel('Corresponding species')
ax2.set_ylabel('# of clusters')
ax2.set_zlabel('-log euclid. distance')
ax2.set_title('Negative Logarithmic fit to avg. euclidean distances')
plt.show()

The approximate fit overemphasizes certain features of the original plot but still preserves relative distances well enough. From this plot we can see that the range of 2 to 6 clusters has the lowest Euclidean distance. We can now perform the qualitative assessment with this in mind.
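For reference, a logarithmic fit of the form y ≈ a·log(x) + b can be obtained with a degree-1 `np.polyfit` on log(x). The sketch below uses synthetic data that is exactly logarithmic (an assumption chosen so the recovered coefficients are easy to check); it is not the experiment's data.

```python
import numpy as np

# Synthetic, exactly logarithmic data: y = 2 * log(x) + 3
x = np.arange(1, 12)
y = 2.0 * np.log(x) + 3.0

# np.polyfit(independent, dependent, deg): a straight line in log(x)
slope, intercept = np.polyfit(np.log(x), y, deg=1)

# Evaluate the fitted curve on a denser grid for smooth plotting
x_fit = np.linspace(1, 11, 100)
y_fit = np.polyval([slope, intercept], np.log(x_fit))

print(round(slope, 6), round(intercept, 6))  # 2.0 3.0
```

Note the argument order: the independent variable (here log(x)) comes first and the values being fitted come second.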

# Image Quantization
The following cell defines a function for displaying quantized images and retrieves a single sample, which is reused for visual inspection at each chosen number of clusters.

In [None]:
# Take the first sample yielded by the example iterator
species, filename, bgr_im = next(mg.random_example_set())

def orig_quant_comparison(species, filename, orig, no_clusters):
    fig = plt.figure(figsize=(10, 4))

    filepath = join('data', 'species', species, filename)
    rgb_im = cv2.cvtColor(orig, cv2.COLOR_BGR2RGB)
    quant1 = mg.quantize_image(filepath, no_clusters)
    rgb_quant1 = cv2.cvtColor(quant1, cv2.COLOR_BGR2RGB)

    plt.subplot(1, 2, 1)
    plt.axis('off')
    plt.imshow(rgb_im)

    plt.subplot(1, 2, 2)
    plt.axis('off')
    plt.imshow(rgb_quant1)

    plt.show()


## Quantization with 2 clusters

In [None]:
orig_quant_comparison(species, filename, bgr_im, 2)

## Quantization with 3 clusters

In [None]:
orig_quant_comparison(species, filename, bgr_im, 3)

## Quantization with 4 clusters

In [None]:
orig_quant_comparison(species, filename, bgr_im, 4)

## Quantization with 5 clusters

In [None]:
orig_quant_comparison(species, filename, bgr_im, 5)

## Quantization with 6 clusters

In [None]:
orig_quant_comparison(species, filename, bgr_im, 6)

## Quantization with 15 clusters

In [None]:
orig_quant_comparison(species, filename, bgr_im, 15)

The cells above display quantized images for 2 to 6 clusters. It is clear that such a low number of clusters doesn't retain enough detail to fully express the qualities of the image. A 15-cluster quantized image is also provided for contrast. In this case, detail is preserved for both mushrooms and backgrounds.

# Conclusion
The quantitative assessment suggests that between 2 and 6 clusters (inclusive) is the optimal number for quantizing an image. However, the qualitative assessment doesn't support this hypothesis. Since both assessments must agree for a hypothesis to be potentially valid, the hypothesis is invalid and needs no further investigation in its current form. A different clustering method or a different metric might improve these results. Follow-up experiments would need to be conducted for all combinations of metrics and clustering algorithms.