# Homework08

Exercises to practice unsupervised learning with clustering

## Goals

- Get more practice with the ML flow: encode -> normalize -> train -> evaluate
- Understand the tradeoffs of modeling parameters
- Develop intuition for different clustering models and when to use them

### Setup

Run the following cells to install dependencies, download the helper files, and import all necessary libraries for this homework.

In [None]:
!pip install -r ./.devcontainer/requirements.txt

In [None]:
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/data_utils.py
!wget -q https://github.com/PSAM-5020-2025S-A/5020-utils/raw/main/src/image_utils.py

In [None]:
import json
import matplotlib.pyplot as plt
import pandas as pd

from os import listdir
from PIL import Image as PImage

from sklearn.preprocessing import OrdinalEncoder

from data_utils import object_from_json_url
from data_utils import StandardScaler
from data_utils import KMeansClustering, GaussianClustering

from image_utils import get_pixels, make_image

## Helmet Sizing

### Load Dataset

Let's load up the full [ANSUR](https://www.openlab.psu.edu/ansur2/) dataset that we looked at briefly in [Week 02](https://github.com/PSAM-5020-2025S-A/WK02) and then again in [Homework06](https://github.com/PSAM-5020-2025S-A/Homework06).

This is the dataset that has anthropometric information about U.S. Army personnel.

#### WARNING

Like we mentioned in class, this dataset is being used for these exercises due to the level of detail in the dataset and the rigorous process that was used in collecting the data.

This is a very specific dataset and should not be used to draw general conclusions about people, bodies, or anything else that is not related to the distribution of physical features of U.S. Army personnel.

In [None]:
# Load Dataset
ANSUR_FILE = "https://raw.githubusercontent.com/PSAM-5020-2025S-A/5020-utils/main/datasets/json/ansur.json"
ansur_data = object_from_json_url(ANSUR_FILE)

# Look at first 2 records
ansur_data[:2]

Let's load it into a `DataFrame`, like last week.

In [None]:
# Read into DataFrame
ansur_df = pd.json_normalize(ansur_data)
ansur_df.head()

### Unsupervised Learning

Let's pretend we are designing next-generation helmets with embedded over-the-ear headphones and we want to have a few options for sizes.

We could use clustering to see if there is a number of clusters that we can divide our population into, so each size covers a similar portion of the population.

We can follow similar steps to regression to create a clustering model that uses features about head and ear sizes:

1. Load dataset (done! 🎉)
2. Encode label features as numbers
3. Normalize the data
4. Separate the feature variables we want to consider (done below)
5. Pick a clustering algorithm
6. Determine number of clusters
7. Cluster data
8. Interpret results

For step $5$, it's fine to just pick an algorithm ahead of time to see what happens, but feel free to experiment and plot results for multiple clustering methods.
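
As a rough illustration of step 3, normalizing one feature amounts to subtracting its mean and dividing by its standard deviation. The sketch below uses only the standard library and made-up measurements. The `standardize()` helper is hypothetical, and the `StandardScaler` from `data_utils` may differ in its details.

```python
from statistics import fmean, pstdev

def standardize(values):
  # z-score: (value - mean) / population standard deviation
  mean = fmean(values)
  std = pstdev(values)
  return [(v - mean) / std for v in values]

# Hypothetical head circumferences in millimeters (not taken from ANSUR)
head_circumference_mm = [540, 560, 580, 600, 620]
scaled = standardize(head_circumference_mm)
print([round(z, 2) for z in scaled])
```

After scaling, the feature has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to the distances used for clustering.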

In [None]:
## Encode non-numerical features
genders = ['M', 'F']
# print(genders)
gender_encoder = OrdinalEncoder(categories=[genders])
gender_vals = gender_encoder.fit_transform(ansur_df[["gender"]].values)
ansur_df[["gender"]] = gender_vals 

In [None]:
display(ansur_df[["gender"]])

In [None]:
## Normalize the data
stdScaler = StandardScaler()
ansur_scaled_df = stdScaler.fit_transform(ansur_df)


In [None]:
## Separate the features we want to consider
ansur_features = ansur_scaled_df[["head.height", "head.circumference", "ear.length", "ear.breadth", "ear.protrusion"]]

In [None]:
## Create Clustering model
n_clusters = 7
km_model = KMeansClustering(n_clusters=n_clusters, random_state=1010)

## Run the model(s) on the data
km_predicted = km_model.fit_predict(ansur_features)


In [None]:
## Trying GaussianClustering model
gc_model = GaussianClustering(n_clusters=n_clusters, random_state=1010)

## Run the model(s) on the data
gc_predicted = gc_model.fit_predict(ansur_features)

In [None]:
km_predicted

In [None]:
gc_predicted

In [None]:
## Check errors
# score
print("KMeans score:",km_model.score(ansur_features.values))

# distance
print("KMeans distance score:", km_model.distance_score())
print("Gaussian distance score:", gc_model.distance_score())

# balance
print("KMeans balance score:", km_model.balance_score()) 
print("Gaussian balance score:", gc_model.balance_score())

In [None]:
# function copied from https://github.com/PSAM-5020-2025S-A/WK08
def plot_clusters(features, labels, clusters, title):
  xl, yl, zl = labels[:3]
  x = features[xl]
  y = features[yl]
  z = features[zl]

  # 2D
  plt.scatter(x, y, c=clusters, marker='o', linestyle='', alpha=0.5)
  plt.title(f"{title} clustering")
  plt.xlabel(xl)
  plt.ylabel(yl)
  plt.ylim([-2.2, 3])
  plt.show()

  plt.scatter(x, z, c=clusters, marker='o', linestyle='', alpha=0.5)
  plt.title(f"{title} clustering")
  plt.xlabel(xl)
  plt.ylabel(zl)
  plt.ylim([-2.2, 3])
  plt.show()

  # 3D
  fig = plt.figure(figsize=(8, 8))
  ax = fig.add_subplot(projection='3d')

  ax.scatter(x, y, z, c=clusters, marker='o', linestyle='', alpha=0.5)

  ax.set_title(f"{title} clustering")
  ax.set_xlabel(xl)
  ax.set_ylabel(yl)
  ax.set_zlabel(zl)
  ax.set_ylim(-2.5, 8)
  ax.set_zlim(-2.5, 2.5)

  plt.show()

In [None]:
## Plot clusters as function of 2 or 3 variables

# ansur_features = ansur_scaled_df[["head.height", "head.circumference", "ear.length", "ear.breadth", "ear.protrusion"]]

clusters = km_predicted["clusters"]

labels_1 = ["head.height", "head.circumference", "ear.length",]
plot_clusters(ansur_features, labels_1, clusters, "K-Means")

labels_ear = ["ear.length", "ear.breadth", "ear.protrusion"]
plot_clusters(ansur_features, labels_ear, clusters, "K-Means")

labels_2 = ["head.circumference", "ear.length", "ear.breadth", ]
plot_clusters(ansur_features, labels_2, clusters, "K-Means")

### Interpretation

<span style="color:hotpink;">
Which clustering algorithm did you choose?<br>
Did you try a different one?<br>
Do the clusters make sense ? Do they look balanced ?
</span>

I first tried the KMeans model:

- `balance_score` returns ~0.93, which is very close to 1, indicating that the clusters are balanced.
- `distance_score` returns ~1.42, which is a bit greater than 1 standard deviation for the scaled data. This doesn't seem great.

Looking at the graphs for KMeans for various features, a few feature combinations seem to cluster well, but in some instances the clusters don't seem very helpful.

From the feature combinations I tried out, these graphs show the most "clustery" clusters: `head.circumference` versus `ear.length`, and `head.height` versus `head.circumference`.

Then I tried the Gaussian model, which seems slightly worse than KMeans:

KMeans distance score: 1.4211849284287907<br>
Gaussian distance score: 1.458428409698391<br>
KMeans balance score: 0.9314583333333333<br>
Gaussian balance score: 0.88625

The distance scores are very similar, with KMeans a bit smaller, and KMeans is also a bit more balanced.

The next step would be to try out different numbers of clusters for each model!
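
To build some intuition for what a balance measure captures, here is one possible way to score how evenly points are spread across clusters: the entropy of the cluster sizes, normalized so that 1 means perfectly even. This is a hypothetical illustration. The `size_balance()` function is made up for this sketch, and the `balance_score()` method in `data_utils` may be computed differently.

```python
import math
from collections import Counter

def size_balance(cluster_ids):
  # Normalized entropy of cluster sizes: 1.0 means all clusters are the same size
  counts = Counter(cluster_ids)
  total = len(cluster_ids)
  if len(counts) < 2:
    return 1.0
  entropy = -sum((n / total) * math.log(n / total) for n in counts.values())
  return entropy / math.log(len(counts))

even = [0, 0, 1, 1, 2, 2]    # three clusters of equal size
skewed = [0, 0, 0, 0, 1, 2]  # one cluster dominates
print(round(size_balance(even), 3), round(size_balance(skewed), 3))
```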

## Figure Out How Many Clusters

Experiment with the number of clusters to see if the initial choice makes sense.

The [WK08](https://github.com/PSAM-5020-2025S-A/WK08) notebook had a for loop that can be used to plot errors versus number of clusters.

In [None]:
## Plot errors and pick how many clusters

# copied from WK08 notebook - edited to round scores to 2 decimal places for legibility
# try 2 - 9 clusters for K-Means Clustering (the end of range() is exclusive)
num_clusters = list(range(2,10))

# collect score, distance, silhouette and balance scores
score_scores = []
distance_scores = []
silhouette_scores = []
balance_scores = []

# get score, distance, silhouette and balance for different numbers of clusters
for n in num_clusters:
  mm = KMeansClustering(n_clusters=n)
  mm.fit_predict(ansur_features)
  score_scores.append(round(mm.score(ansur_features.values), 2))
  distance_scores.append(round(mm.distance_score(),2))
  silhouette_scores.append(round(mm.silhouette_score(),2))
  balance_scores.append(round(mm.balance_score(), 2))

In [None]:
print("score_scores", score_scores)
print("distance_scores", distance_scores)
print("silhouette_scores", silhouette_scores)
print("balance_scores", balance_scores)

In [None]:
# plot scores as function of number of clusters
plt.plot(num_clusters, score_scores, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Distance Squared Sum Score")
plt.title("K-means Clustering")
# plt.ylim([0, 3])
plt.show()

plt.plot(num_clusters, distance_scores, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Distance Score")
plt.title("K-means Clustering")
# plt.ylim([0, 3])
plt.show()

plt.plot(num_clusters, silhouette_scores, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Silhouette Score")
plt.title("K-means Clustering")
# plt.ylim([-1, 1])
plt.show()

plt.plot(num_clusters, balance_scores, marker='o')
plt.xlabel("Number of Clusters")
plt.ylabel("Balance Score")
plt.title("K-means Clustering")
plt.ylim([0, 1])
plt.show()

### Interpretation

<span style="color:hotpink;">
Based on the graphs of errors versus number of clusters, does it look like we should change the initial number of clusters ?<br>
How many clusters should we use ? Why ?
</span>

The graphs for each score change with the number of clusters, apart from balance, which hovers between ~0.9 and 1 for every cluster size.

Since the use case of this might be sizing for helmets, it could be interesting to look at 5 clusters for XS, S, M, L, XL sizes
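
If we go with 5 clusters, one way to turn cluster ids into size names is to order the clusters by a single feature of their centers. The sketch below assumes made-up center values for `head.circumference` in scaled units. Ordering by only one of the five features is a simplification, since clusters can also differ in ear measurements.

```python
# Hypothetical centers: cluster id -> scaled head.circumference of the cluster center
centers = {0: 0.4, 1: -1.3, 2: 1.6, 3: -0.5, 4: 0.9}
size_names = ["XS", "S", "M", "L", "XL"]

# Sort cluster ids from smallest to largest center, then pair them with size names
ordered_ids = sorted(centers, key=centers.get)
size_for_cluster = dict(zip(ordered_ids, size_names))
print(size_for_cluster)
```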

### Revise Number of Clusters

Re-run with the new number of clusters and plot the data in $2D$ or $3D$.

This can be the same graph as above.

In [None]:
## Re-run clustering with final number of clusters
## Create Clustering model
n_clusters_rerun = 5
km_model_rerun = KMeansClustering(n_clusters=n_clusters_rerun, random_state=1010)

In [None]:
## Run the model on the training data
km_predicted_rerun = km_model_rerun.fit_predict(ansur_features)

In [None]:
## Check errors
print("KMeans score:",km_model_rerun.score(ansur_features.values))
print("KMeans distance score:", km_model_rerun.distance_score())
print("KMeans balance score:", km_model_rerun.balance_score()) 
print("KMeans silhouette score:", km_model_rerun.silhouette_score()) 

In [None]:
## Plot in 3D
# function copied from https://github.com/PSAM-5020-2025S-A/WK08
def plot_clusters_3d(features, labels, clusters, title):
  xl, yl, zl = labels[:3]
  x = features[xl]
  y = features[yl]
  z = features[zl]
  
  # 3D
  fig = plt.figure(figsize=(8, 8))
  ax = fig.add_subplot(projection='3d')

  ax.scatter(x, y, z, c=clusters, marker='o', linestyle='', alpha=0.5)

  ax.set_title(f"{title} clustering")
  ax.set_xlabel(xl)
  ax.set_ylabel(yl)
  ax.set_zlabel(zl)
  ax.set_ylim(-2.5, 8)
  ax.set_zlim(-2.5, 2.5)

  plt.show()

# ansur_features = ansur_scaled_df[["head.height", "head.circumference", "ear.length", "ear.breadth", "ear.protrusion"]]
clusters_rerun = km_predicted_rerun["clusters"]

labels_1_rerun = ["head.height", "head.circumference", "ear.length",]
plot_clusters_3d(ansur_features, labels_1_rerun, clusters_rerun, "K-Means")

labels_ear_rerun = ["ear.length", "ear.breadth", "ear.protrusion"]
plot_clusters_3d(ansur_features, labels_ear_rerun, clusters_rerun, "K-Means")

labels_2_rerun = ["head.circumference", "ear.length", "ear.breadth", ]
plot_clusters_3d(ansur_features, labels_2_rerun, clusters_rerun, "K-Means")

### Interpretation

<span style="color:hotpink;">
Do these look better than the original number of clusters?
</span>

The graph for `head.height` versus `head.circumference` looks a bit worse, but I think the other two graphs look a bit better. Comparing 5 versus 7 clusters, I would move forward with 5 because it maps more clearly onto sizes.

## Image Organization

We have a dataset of about $600$ flower images that we might want to classify by species... eventually.

What we want to do first is take a look at all of the images and see what kind of images we have, what kind of colors our flowers have and see if there's any other visual information that could help us classify these images later.

We'll see how to use clustering and distances to organize our images by color to create a visualization that we can use to get to know our dataset.

### Load Dataset

The following cell downloads the dataset:

In [None]:
!wget -qO- https://github.com/PSAM-5020-2025S-A/5020-utils/releases/latest/download/flowers.tar.gz | tar xz

Then, we can take a look at a few of the images:

In [None]:
IMG_DIR = "./data/image/flowers"

In [None]:
display(PImage.open(f"{IMG_DIR}/00_001.png"))
display(PImage.open(f"{IMG_DIR}/15_001.png"))

### Find Representative Colors

The overall process for organizing our images by color will be something like this:

1. Iterate over all files in the `data/image/flowers` directory, open each image file and treat it as a dataset
   1. Load image into a `DataFrame` where each pixel is a row and R,G,B values are columns/features
   2. Cluster into $2$ - $16$ colors
   3. Pick $3$ or $4$ representative colors
   4. Store image filenames and their representative colors in a Python object
2. Once all images have been processed we can order our dataset by different color characteristics: white to black, red to blue, hue value, brightness
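
The clustering step in the list above can be sketched in plain Python to show the idea: assign each pixel to its nearest center, then move each center to the mean of its pixels, and repeat. This is a minimal sketch with a hypothetical `kmeans()` function, a handful of made-up pixels and hand-picked starting centers. It is not the implementation behind `KMeansClustering`.

```python
import math

def kmeans(points, centers, iterations=10):
  # Plain k-means: alternate between assigning points and updating centers
  centers = [tuple(c) for c in centers]
  assignments = []
  for _ in range(iterations):
    # Assignment step: index of the closest center for every point
    assignments = [
      min(range(len(centers)), key=lambda i: math.dist(p, centers[i]))
      for p in points
    ]
    # Update step: each center becomes the mean of its assigned points
    new_centers = []
    for i, old in enumerate(centers):
      members = [p for p, a in zip(points, assignments) if a == i]
      if members:
        new_centers.append(tuple(sum(ch) / len(members) for ch in zip(*members)))
      else:
        new_centers.append(old)  # leave empty clusters where they are
    if new_centers == centers:
      break  # centers stopped moving
    centers = new_centers
  return centers, assignments

# Made-up pixels: three reddish, three greenish
pixels = [(250, 10, 20), (240, 30, 40), (230, 20, 30),
          (20, 200, 30), (30, 220, 40), (10, 210, 20)]
initial_centers = [pixels[0], pixels[3]]
palette, cluster_ids = kmeans(pixels, initial_centers)
rounded_palette = [[round(ch) for ch in c] for c in palette]
print(rounded_palette, cluster_ids)
```

The returned centers play the role of the color palette, and the list of cluster ids says which palette color stands in for each pixel.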

### One Image

Let's step through the process of getting representative colors for one image, and then we can repeat this in a loop to process all of the flower images.

#### Open Image

The `PIL` library does all the work here:

In [None]:
# Open image
fname = "00_001.png"
pimg = PImage.open(f"{IMG_DIR}/{fname}").convert("RGB")

#### Put into `DataFrame`

We get the pixels and make a dataset/`DataFrame` out of them:

In [None]:
# Load into DataFrame
pxs = get_pixels(pimg)
pxs_df = pd.DataFrame(pxs, columns=["R", "G", "B"])

In [None]:
pxs_df

#### Cluster colors

Create a clustering object, cluster colors into $8$ clusters with `fit_predict()` and take a look at our color palette (`cluster_centers_`):

In [None]:
# Create Clustering object
km_model_flower = KMeansClustering(n_clusters=8)

# Cluster by color
km_predicted_flower = km_model_flower.fit_predict(pxs_df)


In [None]:
km_predicted_flower

In [None]:
# Take a look at the color palette (cluster_centers_)
km_model_flower.cluster_centers_

color_centers = [[round(r), round(g), round(b)] for r,g,b in km_model_flower.cluster_centers_]
print(color_centers)

In [None]:
# convert [0, 255] to [0, 1]
c = [(r/255, g/255, b/255) for r,g,b in color_centers]

heights = [1] * len(c) # placeholder data
fig, ax = plt.subplots(figsize=(8, 1))
ax.bar(range(len(c)), heights, color=c, edgecolor='black')
ax.get_yaxis().set_visible(False)
ax.get_xaxis().set_visible(False)
plt.show()


#### Checkpoint

<span style="color:hotpink;">
Does anything stand out about the colors?
</span>

The majority of the colors are a pink/red hue and green. This makes sense given the original image.

#### Reconstruct Image

Since we're only doing one image for now, let's take a look at the clustering result.

This is like in the lecture notebook. We'll start with an empty pixel array and as we iterate through the `DataFrame` of cluster ids we append the corresponding colors to it.

In [None]:
# create empty pixel array
clustered_pxs = []

# iterate through resulting list of cluster ids
#  append corresponding color value to pixel array

for gidx in km_predicted_flower["clusters"]:
  clustered_pxs.append(color_centers[gidx])

Now we can look at the image. If this next cell gives errors about using `float` values in images, just make sure the pixel values that are being appended above are all whole number `int` values.

In [None]:
display(make_image(clustered_pxs, width=pimg.size[0]))

#### Checkpoint

<span style="color:hotpink;">
How does changing the number of clusters affect the resulting image?<br>Try some lower values like <code>2</code> and <code>4</code>, and also some higher ones like <code>12</code> and <code>16</code>. Take a look at a different image.
</span>

In [None]:
# Create Clustering object
def cluster_flower_colors(n_clusters, fname):
  pimg = PImage.open(f"{IMG_DIR}/{fname}").convert("RGB")
  pxs = get_pixels(pimg)
  pxs_df = pd.DataFrame(pxs, columns=["R", "G", "B"])
  km_model_flower = KMeansClustering(n_clusters=n_clusters)
  km_predicted_flower = km_model_flower.fit_predict(pxs_df)
  color_centers = [[round(r), round(g), round(b)] for r,g,b in km_model_flower.cluster_centers_]
  clustered_pxs = []
  for gidx in km_predicted_flower["clusters"]:
    clustered_pxs.append(color_centers[gidx])
  display(make_image(clustered_pxs, width=pimg.size[0]))

In [None]:
cluster_flower_colors(2, "00_001.png")
cluster_flower_colors(10, "00_001.png")
cluster_flower_colors(16, "00_001.png")

In [None]:
cluster_flower_colors(2, "03_021.png")
cluster_flower_colors(10, "03_021.png")
cluster_flower_colors(16, "03_021.png")

In [None]:
cluster_flower_colors(2, "04_026.png")
cluster_flower_colors(10, "04_026.png")
cluster_flower_colors(16, "04_026.png")

With only 2 clusters the basic outline of each flower is more or less visible, but essentially all detail is lost.

With 16 clusters the image is pretty good! The difference between 10 and 16 is not that large: enough detail is retained with 10 that it may not be worth using more clusters if feature reduction is the goal.

#### Pick Colors

Ok, we have some representative colors for our images. We should keep more than one color, but maybe we don't have to keep all $8$.

We can use the `value_counts()` function of our `DataFrame` to see how many pixels are represented by each of our cluster colors:

In [None]:
# cluster ids and pixel counts, ordered by descending counts
ccounts = km_predicted_flower["clusters"].value_counts()
display(ccounts)

Since what we are really trying to do here is get some information about the colors of the flowers present in our images, and given the type of images we have, we can start by assuming that the flower colors will be in the top-$4$ clusters returned by `value_counts()`.

We can revisit this assumption later. We might also want to add some filters here to ignore sky and vegetation colors (blues and greens) and only keep flower colors.

For now, let's just grab the top-$4$ colors from `value_counts()`, remembering we want to keep their rounded `int` values and not the default `float` values in `cluster_centers_`.
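
The selection logic can be sketched without `pandas`, using `collections.Counter` in place of `value_counts()`. The `top_colors()` helper, the cluster ids and the centers below are all made up for illustration.

```python
from collections import Counter

def top_colors(cluster_ids, centers, k=4):
  # Count pixels per cluster, then keep the rounded colors of the k largest clusters
  counts = Counter(cluster_ids)
  return [[round(ch) for ch in centers[cid]] for cid, _ in counts.most_common(k)]

# Hypothetical float cluster centers and per-pixel cluster ids
centers = [[200.4, 30.6, 40.2], [20.1, 180.7, 30.3], [250.9, 250.2, 249.6]]
cluster_ids = [0, 0, 0, 1, 2, 2]
print(top_colors(cluster_ids, centers, k=2))
```

Rounding happens when the colors are collected, so the stored values are plain `int`s that can be written to `JSON` or used as pixel values later.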

In [None]:
# Object to keep colors for each file
file_info = {
  "filename": fname,
  "colors": []
}

# Go through ccounts.index (cluster ids ordered by descending pixel count)
# and add the top-4 colors to the "colors" key of the file_info object.
# color_centers holds the rounded int version of cluster_centers_.
for idx in ccounts.index[:4]:
  file_info["colors"].append(color_centers[idx])



In [None]:
display(file_info)

#### Checkpoint

<span style="color:hotpink;">
Why might we want to cluster into <code>8</code> or even <code>12</code> colors when in the end we're only keeping <code>4</code>?
</span>

If we only grouped into 4 colors, each color would have to average over a wider range of pixels and would be less specific. Grouping into a higher number lets us pick more specific colors and then keep just the 4 that cover the most pixels.

### Iterate and Cluster

We've processed one image, now let's process $600$... for-loops FTW!

We'll need to loop through all of the images in our directory and repeat the process above for each one of them.

We can create a function that takes a filename as input and returns the top-$4$ colors for that image, or... we can just put all of the clustering logic in the body of a for loop. Whichever is easiest.

Let's get started.

In [None]:
# list of all files in the flowers directory
flower_files = sorted([f for f in listdir(IMG_DIR) if f.endswith(".png")])

Here's the loop. In the end we want our `file_colors` list to have objects that have a filename and $4$ colors associated with each filename. Something like:

```py
[
  {
    "filename": "00_001.png",
    "colors": [[12,44,12], [112,144,62],  [12,84,112], [212,144,102]]
  },
  {
    "filename": "00_002.png",
    "colors": [[22,24,28], [112,114,122], [128,200,2], [250,240,230]]
  },
  ...
]
```

This can take a while to run (up to a minute for $600$ images). We can use slicing to test our logic on a subset of `flower_files` before processing all $600$ images.

In [None]:
def get_cluster_colors(fname, file_colors):
  # Appends the top-4 cluster colors of one image to the file_colors list.
  # Note: this re-fits the km_model_flower object created earlier (8 clusters).
  pimg = PImage.open(f"{IMG_DIR}/{fname}").convert("RGB")
  pxs = get_pixels(pimg)
  pxs_df = pd.DataFrame(pxs, columns=["R", "G", "B"])
  km_predicted_flower = km_model_flower.fit_predict(pxs_df)
  # round the float cluster centers to int RGB values
  color_centers = [[round(r), round(g), round(b)] for r,g,b in km_model_flower.cluster_centers_]
  # cluster ids ordered by descending pixel count
  ccounts = km_predicted_flower["clusters"].value_counts()
  for idx in ccounts.index[:4]:
    file_colors.append(color_centers[idx])

In [None]:
get_cluster_colors(flower_files[0], [])

In [None]:
# List to keep colors for each file
file_colors = []

for fname in flower_files:
  colors = []
  get_cluster_colors(fname, colors)
  file_colors.append({"filename": fname, "colors": colors})

In [None]:
# file_colors

#### Order Images (almost)

We have a list with objects that keep track of filenames and representative colors. We could create a `DataFrame` or csv dataset with these, but let's go ahead and just use this directly in this format.

What we want to do is re-order our list of objects, but using a `key` function that takes each object's colors into consideration.

We'll look into how to do this dynamically later, but for now let's order our images by something like _brightness_. It's _like_ brightness because what we'll do is measure how close each image is to the white color `(255,255,255)`.

We'll need some helper functions first:

- `color_distance()`: takes $2$ colors and returns the distance between them
- `min_color_distance()`: given a reference color and a list of colors, returns the distance between the reference color and the closest color in the list
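
Treating each color as a point in $3$-dimensional RGB space, the distance between two colors $c_0 = (r_0, g_0, b_0)$ and $c_1 = (r_1, g_1, b_1)$ is the usual Euclidean distance:

$$
d(c_0, c_1) = \sqrt{(r_0 - r_1)^2 + (g_0 - g_1)^2 + (b_0 - b_1)^2}
$$

For example, for $(0, 100, 0)$ and $(100, 100, 0)$ this gives $\sqrt{100^2 + 0^2 + 0^2} = 100$. The minimum distance between a reference color $c_{ref}$ and a list of colors $C$ is then:

$$
d_{min}(c_{ref}, C) = \min_{c \in C} d(c_{ref}, c)
$$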

In [None]:
import math
def color_distance(c0, c1):
  return math.sqrt(sum((a - b) ** 2 for a, b in zip(c0, c1)))


Some tests for the `color_distance()` function:

In [None]:
# Some tests for the color_distance() function
print(color_distance([0,0,0], [255,255,255]), "should be", 255*3**.5)
print(color_distance([0,100,0], [100,100,0]), "should be", 100)
print(color_distance([55,222,120], [91,51,192]), "should be", 189)
print(color_distance([147,207,246], [87,57,50]), "should be", 254)
print(color_distance([12,250,126], [112,10,195]), "should be", 269)
print(color_distance([106,71,61], [105,136,100]), "should be", 75.81)

In [None]:
# Returns the minimum distance between a reference color and the colors in a list
def min_color_distance(ref_color, color_list):
  # named min_dist to avoid shadowing the built-in min()
  min_dist = color_distance(color_list[0], ref_color)
  for c in color_list:
    dist = color_distance(c, ref_color)
    if dist < min_dist:
      min_dist = dist
  return min_dist

Three tests for the `min_color_distance()` function:

In [None]:
# Some tests for the min_color_distance() function
print(min_color_distance([0,0,0], [[255,255,255],[0,100,0],[100,100,0],[58,58,58]]), "should be", 100)
print(min_color_distance([0,0,0], [[255,255,255],[0,100,0],[100,100,0],[58,57,58]]), "should be", 99.88)
print(min_color_distance([91,51,192], [[147,207,246],[87,57,50],[12,250,126],[112,10,195]]), "should be", 46.16)

#### Order Images (for real now)

Alright. We have a function that can be used to order our images by their distance to a given color.

Let's order our images by how close they are to the brightest color `(255,255,255)`. We'll define a `key` function that, given an object from our `file_colors` list, returns how close that image is to the color `(255,255,255)`.

In [None]:
# TODO: implement function that returns how close our image is to the color white
def by_bright_dist(A):
  ref_color=[255,255,255]
  min_dist = min_color_distance(ref_color, A["colors"])
  return min_dist

Order the list and write out a `JSON` file with the image order.

In [None]:
file_colors_sorted = sorted(file_colors, key=by_bright_dist)

In [None]:
files_sorted = [A["filename"] for A in file_colors_sorted]

with open("./data/flower_order.json", "w") as ofp:
  json.dump(files_sorted, ofp)

### Viewing Results

We can check the results by running a webserver and looking at a simple web page that orders the images according to the resulting `JSON` file from above.

We'll make use of the [`Live Server`](https://marketplace.visualstudio.com/items?itemName=ritwickdey.LiveServer) VSCode extension.

We can start the server by clicking on the "_Go Live_" button towards the right hand side of the bar at the very bottom of our text editor:

<img src="./imgs/go_live.jpg" width="600px">

Clicking the "_Go Live_" button in Codespace should open up a new tab with a plain html navigation view of our repository. Clicking on the `html/` directory should open up a web page with all of the flower images. If not, you can use your Codespace url to try to find the web server address.

If your Codespace url is something like:<br>`https://mango-special-giggle-v6v7asd322f7p6.github.dev/`

Then, the webserver should be running at:<br>`https://mango-special-giggle-v6v7asd322f7p6-5500.app.github.dev/`
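
The URL pattern above can be written as a small helper. This is a sketch that assumes the server runs on port `5500`, as in the example address above. The `live_server_url()` function is hypothetical and only rewrites the string; it does not check that a server is actually running.

```python
def live_server_url(codespace_url, port=5500):
  # Turn https://NAME.github.dev/ into https://NAME-PORT.app.github.dev/
  prefix, suffix = "https://", ".github.dev"
  base = codespace_url.rstrip("/")
  if not (base.startswith(prefix) and base.endswith(suffix)):
    raise ValueError("not a Codespace editor URL")
  name = base[len(prefix):-len(suffix)]
  return f"{prefix}{name}-{port}.app.github.dev/"

print(live_server_url("https://mango-special-giggle-v6v7asd322f7p6.github.dev/"))
```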

### Review, Contemplate, Experiment

Yes, images with white parts are towards the beginning. The images towards the end, however, aren't necessarily the ones with dark flowers: they are the ones whose representative colors are all farthest from white `(255,255,255)`, which includes very saturated colors/images.

A couple of interesting experiments here could be:
- Decrease the number of clusters or the number of colors kept after clustering.
- Use different colors as the reference for the distance functions. For example, create `by_gold_dist()` or `by_purple_dist()` functions to use as the `key` for sorting.
- Order the list of cluster colors by [hue](https://stackoverflow.com/questions/23090019/fastest-formula-to-get-hue-from-rgb). This can be a bit tricky to get right because some colors, like white, black and gray, don't have a unique value for hue, but depend on other aspects of the color, like saturation and lightness, to be well-defined.
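
For the hue experiment, the standard library's `colorsys` module can convert RGB to HSV. The sketch below is one possible sort key. The `hue_sort_key()` function and its `min_saturation` threshold of `0.15` are assumptions made up for illustration: colors below that saturation are treated as grays and placed after all other colors, ordered by brightness instead of hue.

```python
import colorsys

def hue_sort_key(rgb, min_saturation=0.15):
  # colorsys expects channel values in [0, 1]
  r, g, b = (ch / 255 for ch in rgb)
  h, s, v = colorsys.rgb_to_hsv(r, g, b)
  if s < min_saturation:
    # near-gray: hue is not meaningful, so sort these last, by brightness
    return (1, v)
  return (0, h)

colors = [[0, 0, 255], [128, 128, 128], [255, 0, 0], [0, 255, 0], [255, 255, 255]]
print(sorted(colors, key=hue_sort_key))
```

Returning a tuple lets `sorted()` group the chromatic colors first and the grays second without any extra filtering step.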

In [None]:
# sort by closest to orange
orange = [255, 92, 0]

def by_orange_dist(A):
  ref_color=orange
  min_dist = min_color_distance(ref_color, A["colors"])
  return min_dist

In [None]:
file_colors_orange = sorted(file_colors, key=by_orange_dist)

files_orange = [A["filename"] for A in file_colors_orange]

with open("./data/flower_order.json", "w") as ofp:
  json.dump(files_orange, ofp)

In [None]:
# sort by closest to pink
pink = [255,141,161]

def by_pink_dist(A):
  ref_color=pink
  min_dist = min_color_distance(ref_color, A["colors"])
  return min_dist

In [None]:
file_colors_pink = sorted(file_colors, key=by_pink_dist)

files_pink = [A["filename"] for A in file_colors_pink]

with open("./data/flower_order.json", "w") as ofp:
  json.dump(files_pink, ofp)

### Interpretation

<span style="color:hotpink;">
What did you try ? What happened ?
</span>

`by_orange_dist` captured all of the orange flowers pretty well (better than the "brightest" sorting, I think), and there were some pretty good groupings of red, yellow, pink and then white that followed.

`by_pink_dist` worked pretty nicely as well, but a bit less well than the orange sorting.

### Conclusion

It's challenging to define a set of functions that will perfectly order our flowers by color without first having to define very specific color values for filtering and corner-cases. At a high-level, we can imagine that this is because color is a $3$-dimensional value, and we're using it to organize our images into a single-dimensional order.

The beginning of our ordering is usually pretty good, since there's only one way for a color to be _close_ to our reference color, but the ordering gets less consistent towards the end because there are many different ways for a color to be _far_ from the reference color.
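
A small example makes the point about being _far_ concrete. Cyan, magenta and yellow look nothing alike, yet each differs from white in exactly one channel by $255$, so all three sit at the same distance from white and a distance-based sort cannot tell them apart. The sketch uses `math.dist()` from the standard library, which computes the same Euclidean distance as `color_distance()`.

```python
import math

white = [255, 255, 255]
colors = {
  "cyan": [0, 255, 255],
  "magenta": [255, 0, 255],
  "yellow": [255, 255, 0],
}

# Every one of these colors differs from white in a single channel by 255
distances = {name: math.dist(white, c) for name, c in colors.items()}
print(distances)
```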

Next week we'll see a very powerful technique that, amongst other things, will help us get around this kind of "_dimensionality mismatch_".