## K-Means Clustering

This is a classic example of **unsupervised learning.** Take a look at sklearn's explanation:

https://scikit-learn.org/1.5/modules/clustering.html#k-means

K-means takes a dataset and groups similar data points into a number of clusters (*k*) that you choose.

Central concepts:

**inertia**: simplified, this measures how tightly the points in each cluster sit around that cluster's center (the sum of squared distances from each point to its centroid). Lower inertia means more compact clusters. You want inertia to be low while keeping the number of clusters small, understanding that the right balance depends on the context.

**centroids**: This is how k-means operates. It maps all of the data points in vector space and looks for the centers of clusters. A centroid is the center (mean) of a specific group.

**elbow graph**: this is a graph of inertia against the number of clusters, which you output to assess what the optimal number (*k*) of clusters is. It's not always perfectly dependable, but generally the graph takes a hard turn at the point where adding more clusters stops reducing inertia by much. That said, we often use trial and error to see which clusters are the most meaningful.

**uses**: k-means is used to cluster groups of text, handwritten digits, and many other datasets that are not pre-labeled.
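
To make inertia and centroids concrete, here is a minimal NumPy-only sketch on made-up 2-D points. The points, the label assignment, and the variable names are all assumptions for illustration; in practice scikit-learn's `KMeans` finds the assignment for you.

```python
import numpy as np

# Six made-up 2-D points forming two obvious groups (illustrative data only)
points = np.array([[1.0, 1.0], [1.0, 2.0], [2.0, 1.0],
                   [8.0, 8.0], [8.0, 9.0], [9.0, 8.0]])
# Assumed cluster assignment for each point
labels = np.array([0, 0, 0, 1, 1, 1])

# A centroid is the mean of the points assigned to a cluster
centroids = np.array([points[labels == k].mean(axis=0) for k in range(2)])

# Inertia is the sum of squared distances from each point to its own centroid
inertia = ((points - centroids[labels]) ** 2).sum()

print(centroids)
print(inertia)
```

If you moved a point far away from its group, its squared distance would grow and the inertia would rise with it.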


### K-means Clustering on Images for Palettes/Dominant Colors

Yes, we are jumping back into images!

This is a very simple way of engaging with unsupervised learning and assessing the process of clustering.

Below, I take us through the process of creating **color clusters**. For your homework, you'll be asked to run through this process multiple times with multiple images and assess the outcomes.

The only new library that needs to be installed is **OpenCV** (computer vision), which is imported as `cv2`. It is more powerful than Pillow in this context, and it loads images as NumPy arrays of color values that map well to K-means.

`pip install opencv-python`


**Import your libraries**

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import cv2
from sklearn.cluster import KMeans

Load in the image with cv2. OpenCV reads the channels in BGR order, so the colors will look wrong in this first plot.

In [None]:
my_img = cv2.imread("spirited_away2.jpg")
plt.imshow(my_img)

Convert the colors from BGR to RGB:

In [None]:
my_img=cv2.cvtColor(my_img,cv2.COLOR_BGR2RGB)
plt.imshow(my_img)
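
OpenCV stores channels as blue, green, red, while matplotlib expects red, green, blue. For a 3-channel image, the conversion amounts to reversing the channel order. The sketch below shows that idea with plain NumPy on a hypothetical 1x2 image; the array and its values are made up for illustration.

```python
import numpy as np

# Hypothetical 1x2 image in BGR order: one pure-blue pixel, one pure-red pixel
bgr = np.array([[[255, 0, 0], [0, 0, 255]]], dtype=np.uint8)

# Reversing the last axis turns B,G,R into R,G,B
rgb = bgr[..., ::-1]

print(rgb[0, 0])  # the blue pixel, now in RGB order
print(rgb[0, 1])  # the red pixel, now in RGB order
```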

Check the size (shape) of the image

In [None]:
print('Dimensions : ',my_img.shape)


**Downsize (optional)**

If you have a particularly large image, you should use this method to downsize it.
If your image is under a megabyte, it should be fine, but if the shape is greater than 2000 x 2000, the clustering will run much faster after downsizing.

By default, I've set this up to downsize an image by half:
`fx=0.5, fy=0.5`

You can adjust the scale factors based on how much you want to reduce the shape.

In [None]:


# my_img = cv2.resize(my_img, (0,0), fx=0.5, fy=0.5) 
# print('Dimensions : ',my_img.shape)
# plt.imshow(my_img)

**Bringing the RGB values into a data frame**

In [None]:
df = pd.DataFrame(my_img.reshape(-1, 3),
                    columns=['R', 'G', 'B'])
df
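
The `reshape(-1, 3)` call flattens the image grid into one row per pixel. Here is a small sketch on a hypothetical 2x2 image (the pixel values are made up) showing how the shape changes.

```python
import numpy as np

# Hypothetical 2x2 RGB image with shape (height, width, channels)
tiny_img = np.array([[[255, 0, 0], [0, 255, 0]],
                     [[0, 0, 255], [255, 255, 255]]], dtype=np.uint8)

# -1 lets NumPy infer the row count: 2 * 2 = 4 pixels, 3 channels each
pixels = tiny_img.reshape(-1, 3)

print(tiny_img.shape)
print(pixels.shape)
```

Pixels are listed row by row, so the last row of `pixels` is the bottom-right pixel of the image.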

#### Creating the elbow plot
Below, we extract the information into an array in order to run a test for inertia (how tightly each cluster is grouped around its center).

Generally, the point where the curve makes a hard turn and the drop on the y-axis flattens out is the number of clusters you want to try. But it's not a perfect measure. It gives you a decent idea.

That is why we are using images: your eyes will be a better judge of what the actual clusters of color might be.

In [None]:
# extract a NumPy array of RGB rows from our data
img_data = df[['R', 'G', 'B']].values

**Running the inertia calculation**

**This will take some time**. It fits K-means for every cluster count from 1 to 20 so we can visualize where we should set the cluster number.

In [None]:
# fit K-means for k = 1..20 and record the inertia of each fit
inerti = []
for i in range(1, 21):
    km = KMeans(n_clusters=i)
    km.fit(img_data)
    inerti.append(km.inertia_)
print(inerti)

Now we plot the outcome on an **elbow graph**

In [None]:
plt.plot(range(1, 21), inerti)
plt.xlabel('number of clusters (k)')
plt.ylabel('inertia')
plt.show()
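
If the turn in the curve is hard to see, one rough aid is to look at how much the inertia falls with each extra cluster. This is only a sketch of that idea, not a rule for choosing *k*, and the inertia values below are made up for illustration rather than taken from a real image.

```python
import numpy as np

# Made-up inertia values for k = 1..6 (illustrative only)
example_inertia = np.array([1000.0, 400.0, 200.0, 180.0, 170.0, 165.0])

# Fractional decrease in inertia gained by adding each extra cluster
drops = -np.diff(example_inertia) / example_inertia[:-1]

for k, drop in zip(range(2, 7), drops):
    print(f"k={k}: inertia fell by {drop:.0%}")
```

In this invented example the gains shrink sharply after k=3, which is where the elbow would sit on the graph.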

Depending on the number we judge, and it's really a judgment call here, **set the number of clusters** and **run the algorithm.**

In [None]:
from sklearn.cluster import KMeans
#set the cluster number
cluster_n = 5
# Compute kmeans
X = df.iloc[:,0:3].values
km = KMeans(n_clusters = cluster_n, init = 'k-means++')
y = km.fit_predict(X)

Add the clusters to our dataframe

In [None]:
df['cluster'] = y  

df

Now we want to isolate the **centroids** which are the mathematical centers of each cluster.

We are going to hold it in a couple different formats so that we can graph this in a few different ways.

In [None]:

centroid_array=km.cluster_centers_ #need this for later
centroids = centroid_array.tolist()
centroids

In [None]:
centroid_array

These are groups of RGB values.

Below, we also convert them to decimal values (i.e. from 0-255 to 0-1) so we can graph them.

In [None]:
for centroid in centroids:
    # scale each channel from 0-255 down to 0-1 (r, g, b avoids overwriting X)
    r, g, b = centroid
    centroid[0] = r/255
    centroid[1] = g/255
    centroid[2] = b/255

centroids
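
The same scaling can be done in one step on a NumPy array, since dividing an array divides every value in it. A minimal sketch, using hypothetical centroid values as a stand-in for `km.cluster_centers_`:

```python
import numpy as np

# Hypothetical centroids in 0-255 RGB (stand-in for km.cluster_centers_)
example_centroids = np.array([[255.0, 128.0, 0.0],
                              [51.0, 102.0, 204.0]])

# Dividing the array scales every value at once
normalized = example_centroids / 255

print(normalized)
```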

Now we add those values to our data frame!

In [None]:
for i in range(cluster_n):
    df.loc[df['cluster']==i, 'R_centroid'] = centroids[i][0]
    df.loc[df['cluster']==i, 'G_centroid'] = centroids[i][1]
    df.loc[df['cluster']==i, 'B_centroid'] = centroids[i][2]

df
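
Another way to think about this step: each pixel's label is an index into the list of centroids. With NumPy arrays, indexing by the label array looks up every pixel's centroid at once. The centroids and labels below are made up for illustration.

```python
import numpy as np

# Hypothetical normalized centroids for 2 clusters
example_centroids = np.array([[1.0, 0.5, 0.0],
                              [0.2, 0.4, 0.8]])
# Hypothetical cluster label for each of 4 pixels
example_labels = np.array([0, 1, 1, 0])

# Indexing with the label array picks one centroid row per pixel
per_pixel = example_centroids[example_labels]

print(per_pixel.shape)
print(per_pixel)
```

The result has one row per pixel, which is the same information the three centroid columns hold.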

**3-D Plot of the clusters**

Here we are going to plot the different clusters in 3-D space.

This is a decent way of visualizing what's happening when we are clustering.

(This also takes a bit of time!)


In [None]:
import matplotlib.pyplot as plt

kplot = plt.axes(projection='3d')
xline = np.linspace(0, 15, 1000)
yline = np.linspace(0, 15, 1000)
zline = np.linspace(0, 15, 1000)
kplot.plot3D(xline, yline, zline, 'black')

for i in range(cluster_n):
    cluster = df[df.cluster==i]
    cluster_color = cluster[["R_centroid", "G_centroid", "B_centroid"]].values.tolist()
    kplot.scatter3D(cluster.R, cluster.G, cluster.B, facecolors=cluster_color)

plt.axis('on')
plt.show()
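
If the 3-D plot is too slow, one option is to plot a random sample of pixels instead of all of them. The sketch below only shows how to draw the sample positions; the pixel count and sample size are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sample is repeatable

n_pixels = 100_000   # hypothetical number of rows in the data frame
sample_size = 5_000  # arbitrary choice; adjust to taste

# Pick row positions without repeats
sample_idx = rng.choice(n_pixels, size=sample_size, replace=False)

print(sample_idx.shape)
```

In the notebook, `n_pixels` would be `len(df)`, and `df.iloc[sample_idx]` would give the smaller data frame to plot.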

**Visualizing the percentage of each cluster**

Here we get the labels out of the clustering algorithm, and then we make a pie chart showing the dominance of the different color groups.

In [None]:
# print(centroid)
labels=km.labels_
print(labels)
labels=list(labels)

In [None]:
percent=[]
for i in range(len(centroid_array)):
  j=labels.count(i)
  j=j/(len(labels))
  percent.append(j)
print(percent)
plt.pie(percent,colors=np.array(centroid_array/255),labels=np.arange(len(centroid_array)))
plt.show()
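
The counting loop above can also be written with `np.bincount`, which counts how often each integer label appears. A minimal sketch with made-up labels:

```python
import numpy as np

# Made-up cluster labels for 8 pixels across 3 clusters
example_labels = np.array([0, 1, 1, 2, 1, 0, 1, 2])

# Count the pixels in each cluster, then convert counts to shares
counts = np.bincount(example_labels, minlength=3)
shares = counts / counts.sum()

print(counts)
print(shares)
```

The `minlength` argument keeps a zero entry for any cluster that happens to have no pixels.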


**getting a simplified data frame** that shows the information for our clusters.

In [None]:
each_cluster = df.groupby('cluster').first()
each_cluster

Generating an output so we can see the colors' HEX values and **see the palette**

In [None]:
# get the color values
palette_list = each_cluster[["R_centroid", "G_centroid", "B_centroid"]].values.tolist()
palette_list

In [None]:
# display the colors and hex values
from matplotlib.colors import to_hex
for color in palette_list:
    # color_array = [x/255 for x in color[0][0]]
    print(to_hex(color))
    plt.figure(figsize=(1, 1))
    plt.axis('off')
    plt.imshow([[color]]);
    plt.show();
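
To see what a HEX value encodes, here is a hand-rolled sketch of the conversion: each 0-1 channel is scaled to 0-255 and written as two hexadecimal digits. The `rgb_to_hex` helper is hypothetical and only illustrates the idea; it is not matplotlib's implementation.

```python
def rgb_to_hex(color):
    """Convert an (r, g, b) triple of 0-1 floats to a '#rrggbb' string."""
    r, g, b = (round(channel * 255) for channel in color)
    return f"#{r:02x}{g:02x}{b:02x}"

# Made-up orange color in 0-1 RGB
print(rgb_to_hex((1.0, 0.4, 0.0)))
```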

In [None]:
#display the image and colors
palette2 = np.array(palette_list)[np.newaxis, :, :]

plt.imshow(my_img);
plt.axis('off')
plt.show()
plt.imshow(palette2)
plt.axis('off')
plt.show()


The end!