# **Setup**
 
Reset the Python environment to clear it of any previously loaded variables, functions, or libraries. Then, import the libraries needed to complete the code Professor Melnikov presented in the video.

In this Jupyter notebook (JN) you will investigate **centroids** and **medoids**, which are representative points for groups of observations (i.e., points or vectors). The code below defines these two terms and then computes them for each simulated cluster and for all clusters at once.

Centroids and medoids will later be used to define [KMeans](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html) and [KMedoids](https://scikit-learn-extra.readthedocs.io/en/stable/generated/sklearn_extra.cluster.KMedoids.html) algorithms, which automatically identify clusters of similar observations.

In [None]:
%reset -f
from IPython.core.interactiveshell import InteractiveShell as IS
IS.ast_node_interactivity = "all"    # allows multiple outputs from a cell
import numpy as np, pandas as pd, matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Review**

Review the code Professor Melnikov used to define centroid and medoid.

## **Centroid and Medoid**

Consider a set of observations $\mathcal X:=\{x_1,...,x_n\}\subset\mathbb R^d$, $d\in\mathbb N$.

The **centroid** of $\mathcal X$ is the arithmetic mean of the points in $\mathcal X$:

$$x_{\text{centroid}}:=\frac{1}{n}\sum_{i=1}^n x_i$$
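
As a quick worked example (the three points are chosen purely for illustration and do not appear elsewhere in this notebook), take $\mathcal X=\{[0,0],[2,0],[1,3]\}$ in $\mathbb R^2$:

$$x_{\text{centroid}}=\frac{1}{3}\big([0,0]+[2,0]+[1,3]\big)=\frac{1}{3}[3,3]=[1,1]$$

Here the centroid $[1,1]$ is not one of the three points.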

<span style="color:black">The **medoid** of a set is a representative point that belongs to $\mathcal X$. Often, a medoid is defined as the point with the smallest sum of Euclidean distances to all other points in $\mathcal X$, but here we define it as the point in $\mathcal X$ closest to the centroid $x_{\text{centroid}}$. In general, a medoid can be described as:

$$x_{\text{medoid}}:=\underset{x\in \mathcal X}{\text{argmin}}\sum_{i=1}^n D(x,x_i)$$
    
<span style="color:black">Where $D$ is the desired distance function (for example, a typical Euclidean distance). In this formula you are looking for any element $x$ in the original set of points, $\mathcal X$, which minimizes the sum of all distances between $x$ and each other element $x_i$. Note that unlike $\text{min}$, $\text{argmin}$ returns the element that minimizes its argument, not the minimum value itself. Thus, $x_{\text{medoid}}$ is one of the elements in $\mathcal X$.

<span style="color:black">By definition, a medoid is guaranteed to be one of $x_1,...,x_n$, but a centroid typically is not. While a medoid is costlier to compute, it is often preferable when you want the representative point to be a point in the set. For example, you would compute a medoid if you would like to find the most representative book, phrase, film, person, etc. in a set. Both the centroid and the medoid lie "inside" the set they represent: the medoid is one of its points, and the centroid always falls within the convex hull of the points.
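
The two medoid definitions mentioned above can be compared side by side. The sketch below is only illustrative: the helper names `medoid_by_distance_sum` and `medoid_by_centroid` and the four sample points are assumptions made for this example. The first helper builds the full $n\times n$ distance matrix, which is why that definition is costlier than the centroid-based one.

```python
import numpy as np

def medoid_by_distance_sum(points):
    """Hypothetical helper: point with the smallest sum of Euclidean distances to all points."""
    P = np.asarray(points, dtype=float)             # shape (n, d)
    diffs = P[:, None, :] - P[None, :, :]           # all pairwise differences, shape (n, n, d)
    dist = np.sqrt((diffs ** 2).sum(axis=-1))       # dist[i, j] = ||x_i - x_j||
    return P[np.argmin(dist.sum(axis=1))]           # needs n*n distances

def medoid_by_centroid(points):
    """Hypothetical helper: point closest to the arithmetic mean of all points."""
    P = np.asarray(points, dtype=float)
    centroid = P.mean(axis=0)
    return P[np.argmin(((P - centroid) ** 2).sum(axis=1))]   # needs only n distances

pts = [[0, 1], [1, 3], [4, 2], [3, 1.5]]            # illustrative sample points
m_sum = medoid_by_distance_sum(pts)
m_cen = medoid_by_centroid(pts)
print(m_sum, m_cen)                                 # both pick the point [3, 1.5] here
```

For these points the two definitions agree, but in general they may select different points.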
    
<span style="color:black">Note that a set can contain several points that qualify as a medoid, and several points that coincide with the centroid. For example, in the set $\{1,1,1\}$, each value is both a medoid and the centroid. Other examples are:
    
1. <span style="color:black">$\{-1,0,0,1\}$ contains two points equal to the centroid: $0,0$
1. <span style="color:black">$\{-2,-1,1,2\}$ contains two medoids, $-1$ and $1$, which are equally close to the centroid, $0$, which is not in the set

These examples can be trivially extended to 2D (or $\mathbb R^d$) space by adding a zero (or any number $a$) as a second coordinate of each value, i.e. $\{[-1,0],[0,0],[0,0],[1,0]\}$, as you can (should) verify.
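
The tie in the second example, and its 2D extension, can be checked numerically. This is a minimal sketch; `all_medoids` is a hypothetical helper written for this check. It returns every point tied for the smallest distance to the centroid rather than just the first one.

```python
import numpy as np

def all_medoids(points):
    """Hypothetical helper: every point tied for the smallest squared distance to the centroid."""
    P = np.asarray(points, dtype=float)
    if P.ndim == 1:                                  # treat a flat list as 1D points
        P = P[:, None]
    d2 = ((P - P.mean(axis=0)) ** 2).sum(axis=1)     # squared distance of each point to the centroid
    return P[np.isclose(d2, d2.min())]               # keep all points that attain the minimum

ties_1d = all_medoids([-2, -1, 1, 2])                        # centroid is 0
ties_2d = all_medoids([[-2, 0], [-1, 0], [1, 0], [2, 0]])    # same example lifted to 2D
print(ties_1d.ravel())                                       # the two tied points, -1 and 1
print(ties_2d)
```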
    
<span style="color:black">The next cell defines four points/vectors in 2D, `a`, `b`, `c`, and `d`. A centroid `vCentroid` is computed using the formula above. The function `GetMedoid()` takes a list of points, `vX`, and computes a centroid, `vMean`, which is then used to compute a medoid as the point in `vX` closest to `vMean`.

In [None]:
ar = np.array # a shortcut 
a, b, c, d = ar([0,1]), ar([1, 3]), ar([4,2]), ar([3, 1.5])
vCentroid = np.mean([a, b, c, d], axis=0)    # Equivalently: (a + b + c + d) / 4

def GetMedoid(vX:'list of vectors'):
    vMean = np.mean(vX, axis=0)                               # compute centroid
    return vX[np.argmin([sum((x - vMean)**2) for x in vX])]   # pick a point closest to centroid

vMedoid = GetMedoid([a, b, c, d])

print(f'centroid = {vCentroid}')   # not among the points a,b,c,d
print(f'medoid   = {vMedoid}')     # precisely the point d

## **Cluster Center in 2D**

<span style="color:black">The next cell plots all four points in blue. The centroid is marked as a red circle (and is not any of the original points), while the medoid, i.e., "innermost" blue point, is marked with a red X.

In [None]:
df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D plane', s=100);
fmt = dict(markerfacecolor='none', ms=10) # other plot formatting parameters
plt.plot(vCentroid[0], vCentroid[1], 'ro', **fmt); # plot centroid as red circle
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20);    # plot medoid as red X

## **Cluster Center in 3D**

<span style="color:black">A 3D visualization is also possible with the [`scatter3D`](https://matplotlib.org/stable/api/_as_gen/mpl_toolkits.mplot3d.axes3d.Axes3D.html?highlight=scatter3d#mpl_toolkits.mplot3d.axes3d.Axes3D.scatter3D) method of matplotlib's `Axes3D` object (which is beyond the scope of this course). In general, 3D plots are harder to read because depth is lost when they are rendered as 2D images, although they might look fancy.
    
<span style="color:black">The next cell demonstrates a more complex example in a 3D plot. Here each point has three coordinates $(x,y,z)$, and the depth is difficult to judge (unless color shades are used).
    
<span style="color:black">All points are drawn from a [Gaussian](https://www.britannica.com/topic/normal-distribution) or normal distribution. The red point is the centroid and appears to be "inside" and "most" representative of the points.
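
One way to make the claim that the centroid sits "inside" the point cloud concrete is to check that each of its coordinates lies between the smallest and largest value of that coordinate in the sample. This always holds for an arithmetic mean. A minimal sketch, assuming NumPy's `default_rng` generator (so the sample differs from the one plotted in this notebook):

```python
import numpy as np

rng = np.random.default_rng(0)               # assumption: a different generator than np.random.seed(0)
pts = rng.normal(size=(100, 3))              # 100 points in 3D from a standard normal distribution
centroid = pts.mean(axis=0)

# The mean of each coordinate cannot fall outside that coordinate's observed range
inside_box = bool(np.all((pts.min(axis=0) <= centroid) & (centroid <= pts.max(axis=0))))
print(centroid, inside_box)
```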

In [None]:
np.random.seed(0)
n = 100
vXYZ = np.random.normal(size=n*3).reshape((n, 3))  # sample values from univariate Gaussian(0,1) distribution
mu = np.mean(vXYZ, axis=0)

ax = plt.axes(projection='3d')
ax.scatter3D(vXYZ[:,0], vXYZ[:,1], vXYZ[:,2]);
ax.scatter3D(mu[0], mu[1], mu[2], color='red');   # centroid in red ('red' is a color, not a colormap)
plt.title('Centroid in 3D');

## **Centers of Multiple Clusters**

The next few cells simulate multiple clusters, labeled 0, 1, and 2, and compute a centroid and a medoid for each cluster as well as for the overall set of points.

The scikit-learn (SKL) function [`make_blobs()`](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.make_blobs.html) takes the desired number of points, number of centers, number of features (i.e., point dimensions), random number generator seed, and dispersion (the standard deviation of each cluster).

<span style="color:black">Each of the resulting 150 points has two coordinates, $x_1, x_2$, and a label $y$ identifying the point's cluster.
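
To get a feel for what such a simulation involves, the sketch below draws isotropic Gaussian clusters around fixed centers with plain NumPy. It is an illustration only, not the scikit-learn implementation: `simulate_blobs`, its parameters, and the chosen centers are hypothetical.

```python
import numpy as np

def simulate_blobs(n_per_cluster, centers, std, seed=0):
    """Hypothetical helper: Gaussian clusters with a common standard deviation around given centers."""
    rng = np.random.default_rng(seed)
    centers = np.asarray(centers, dtype=float)                  # shape (k, d)
    X = np.vstack([rng.normal(loc=c, scale=std, size=(n_per_cluster, centers.shape[1]))
                   for c in centers])                           # k blocks of points, one per center
    y = np.repeat(np.arange(len(centers)), n_per_cluster)       # cluster label for every point
    return X, y

X_sim, y_sim = simulate_blobs(50, centers=[[0, 0], [5, 5], [-5, 5]], std=0.6)
print(X_sim.shape, np.bincount(y_sim))                          # 150 points, 50 per cluster
```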

In [None]:
from sklearn.datasets import make_blobs   # simulates points from multivariate Gaussian distributions
# ?make_blobs
X, y = make_blobs(n_samples=150, centers=3, n_features=2, random_state=0, cluster_std=0.6)
df = pd.DataFrame(X, columns=['x1', 'x2'])
df['y'] = y
df

To compute cluster centroids, simply group by the (known) cluster label and compute the arithmetic mean of each group.
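
The same group-wise mean can be checked by hand on a tiny labeled data set. The sketch below uses boolean masks in NumPy instead of a pandas group-by; the five points and their labels are made up for illustration.

```python
import numpy as np

X_toy = np.array([[0., 0.], [2., 0.], [10., 10.], [12., 14.], [11., 12.]])   # illustrative points
y_toy = np.array([0, 0, 1, 1, 1])                                            # illustrative labels

# One centroid per label: average only the rows that carry that label
centroids = np.array([X_toy[y_toy == k].mean(axis=0) for k in np.unique(y_toy)])
print(centroids)   # cluster 0 -> [1, 0], cluster 1 -> [11, 12]
```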

In [None]:
dfMeans = df.groupby('y').mean()  # centroids for each cluster
dfMeans

<span style="color:black">To compute cluster medoids, we again group by cluster label and apply the user-defined function (UDF) `GetMedoid()` to each group of points, where `dfX.values` is a matrix with the points as rows. Note that each medoid is one of the original points.
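
A small hand-checkable version of the per-cluster medoid computation is sketched here with NumPy masks. The helper `closest_to_centroid` and the six labeled points are assumptions made for this example; the helper follows the closest-to-centroid definition used in this notebook.

```python
import numpy as np

def closest_to_centroid(P):
    """Hypothetical helper: row of P closest to the mean of all rows."""
    c = P.mean(axis=0)
    return P[np.argmin(((P - c) ** 2).sum(axis=1))]

X_toy = np.array([[0., 0.], [2., 0.], [1., 1.],
                  [10., 10.], [12., 14.], [11., 11.]])   # illustrative points
y_toy = np.array([0, 0, 0, 1, 1, 1])                     # illustrative labels

# One medoid per label, computed only from the rows that carry that label
medoids = {int(k): closest_to_centroid(X_toy[y_toy == k]) for k in np.unique(y_toy)}
print(medoids)   # cluster 0 -> [1, 1], cluster 1 -> [11, 11]
```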

In [None]:
dfMedoids = df.groupby('y')[['x1','x2']].apply(lambda dfX: pd.Series(GetMedoid(dfX.values)))   # medoids for each cluster
dfMedoids.columns = ['x1', 'x2']
dfMedoids

<span style="color:black">The global centroid is computed similarly, but without using the label $y$ for the grouping. Here, all available points contribute to the calculation of the most representative point.
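
The global centroid is related to the cluster centroids: it equals their average weighted by cluster size, $\bar x=\frac{1}{n}\sum_k n_k\bar x_k$, where $n_k$ is the number of points in cluster $k$ and $\bar x_k$ is its centroid (notation introduced here for illustration). This follows from splitting the sum over all points into one sum per cluster. The sketch below checks the identity numerically on a small random sample with made-up labels.

```python
import numpy as np

rng = np.random.default_rng(0)
X_s = rng.normal(size=(12, 2))                       # 12 illustrative points in 2D
y_s = np.array([0] * 3 + [1] * 4 + [2] * 5)          # made-up labels with unequal cluster sizes

sizes = np.bincount(y_s)                             # n_k for each cluster
cluster_means = np.array([X_s[y_s == k].mean(axis=0) for k in range(3)])

weighted = (sizes[:, None] * cluster_means).sum(axis=0) / sizes.sum()   # size-weighted average
global_mean = X_s.mean(axis=0)                                          # mean over all points
print(np.allclose(weighted, global_mean))
```

When all clusters have the same size, the weights cancel and the global centroid is simply the average of the cluster centroids.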

In [None]:
dfGlobalMean = df[['x1','x2']].mean().to_frame().T  # global centroid
dfGlobalMean

<span style="color:black"> Similarly, the global medoid is also computed without aggregation over the labels, and again, the resulting point is one of the original points.

In [None]:
dfGlobalMedoid = pd.DataFrame(GetMedoid(df[['x1','x2']].values)).T
dfGlobalMedoid.columns = ['x1', 'x2']
dfGlobalMedoid

<span style="color:black"> The next cell defines a cosmetic UDF that scales the lightness of a color by a given factor. It is used to draw each cluster's centroid and medoid in a darker shade of that cluster's color.
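
The idea behind such a helper can be sketched with the standard-library `colorsys` module alone: convert RGB to HLS (hue, lightness, saturation), scale the lightness, and convert back. The RGB triple below is an arbitrary example value, not one of the notebook's palette colors.

```python
import colorsys

rgb = (0.8, 0.4, 0.2)                                  # arbitrary example color, channels in [0, 1]
h, l, s = colorsys.rgb_to_hls(*rgb)                    # note the order: hue, lightness, saturation

amount = 0.5                                           # factors below 1 darken the color
new_l = max(0.0, min(1.0, amount * l))                 # keep lightness within [0, 1]
darker = colorsys.hls_to_rgb(h, new_l, s)

h2, l2, s2 = colorsys.rgb_to_hls(*darker)              # round trip: only the lightness changed
print(rgb, darker)
```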

In [None]:
def Adjust_Lightness(color:'color name'='tan', amount=0.5) -> (float, float,float):
    '''Adjusts the color's brightness and returns a new color in RGB format
    color can be in a matplotlib-recognizable format (string color name, RGB, HLS, ...) '''
    import matplotlib.colors as mc, colorsys as cs
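    try: col = mc.cnames[color]             # look up a named color
    except (KeyError, TypeError): col = color   # not a named color: pass the value through unchanged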
    col = cs.rgb_to_hls(*mc.to_rgb(col))
    return cs.hls_to_rgb(col[0], max(0, min(1, amount * col[1])), col[2])

Adjust_Lightness('tan')

<span style="color:black"> Finally, all three clusters are plotted in distinguishing colors with the centroids and medoids plotted as well. Note that medoids are those that coincide with the original points, while centroids are those that do not. The cluster representations are "inside" their corresponding groups, while the global representations are inside the global set, although outside the individual clusters.

In [None]:
vPalette = np.array(['gray', 'plum', 'tan'])
vColors = vPalette[y]
vMedoidColors = [Adjust_Lightness(c) for c in vPalette]
vCentroidColors = [Adjust_Lightness(c) for c in vPalette]

ax = dfMedoids.plot.scatter('x1', 'x2', color=vMedoidColors, grid=True, s=100);                   # cluster medoids
dfMeans.plot.scatter('x1', 'x2', color=vCentroidColors, grid=True, ax=ax, marker='o', s=100);    # cluster centroids
dfGlobalMean.plot.scatter('x1', 'x2', color='red', grid=True, ax=ax, marker='x', s=100);         # global centroid
dfGlobalMedoid.plot.scatter('x1', 'x2', color='red', grid=True, ax=ax, marker='x', s=100);       # global medoid
df.plot.scatter('x1', 'x2', color=vColors, grid=True, figsize=[8, 8], title='Simulated cluster blobs', ax=ax);

<hr style="border-top: 2px solid #606366; background: transparent;">

# **Optional Practice**

Now you will practice building centroids and medoids for different groups of points.

As you work through these tasks, check your answers by running your code in the *#check solution here* cell, to see if you’ve gotten the correct result. If you get stuck on a task, click the **See solution** drop-down to view the answer.

## **Task 1**

In addition to points $a,b,c,d$, add a point $e$, which is the **centroid** of $a,b,c,d$ computed above. Then recompute the centroid and medoid of the five points. Do they differ from those above? Why or why not? (Plotting all points is not necessary, but may be helpful.)

<b>Hint:</b> Use <code>np.mean()</code> and <code>GetMedoid()</code> to compute a centroid and medoid, respectively.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
a, b, c, d, e = ar([0,1]), ar([1, 3]), ar([4,2]), ar([3, 1.5]), ar([2,1.875])
vCentroid, vMedoid = np.mean([a, b, c, d], axis=0), GetMedoid([a, b, c, d])
vCentroid1, vMedoid1 = np.mean([a, b, c, d, e], axis=0), GetMedoid([a, b, c, d, e])

print(f'centroid  = {vCentroid},  medoid   = {vMedoid}')
print(f'centroid1 = {vCentroid1}, medoid1   = {vMedoid1}')

df = pd.DataFrame([a, b, c, d, e], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D', s=100);
fmt = dict(markerfacecolor='none', ms=10) # other plot formatting parameters
plt.plot(vCentroid1[0], vCentroid1[1], 'ro', **fmt); # plot red centroid
plt.plot(vMedoid1[0], vMedoid1[1], 'rx', ms=20); # plot medoid (red X)
            </pre>When the average point is added to the list of points, the computed centroid doesn't change, but it now coincides with one of the points. Since this point is the closest to the centroid (it is the centroid), it is, by definition, the new medoid. You can check the first claim mathematically as follows. Say you have points $a,b$ with average $(a+b)/2$, and you add this average as an additional point, giving $a,b,(a+b)/2$. Then the new average is unchanged: $\frac{1}{3}(a+b+(a+b)/2)=(3a+3b)/6=(a+b)/2$. The same argument extends to points $x_1,x_2,...,x_n$ with mean $\bar x$: the new mean is $\frac{1}{n+1}(n\bar x+\bar x)=\bar x$.
</details> 
</font>

<hr>

## **Task 2**

In addition to points $a,b,c,d$, add a point $e$, which is the **medoid** of $a,b,c,d$ computed above. Then recompute the centroid and medoid of the five points. Do they differ from those above? Why or why not? (Plotting all points is not necessary, but may be helpful.)

<b>Hint:</b> Use <code>np.mean()</code> and <code>GetMedoid()</code> to compute a centroid and medoid, respectively.

In [None]:
# check solution here

<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
a, b, c, d, e = ar([0,1]), ar([1, 3]), ar([4,2]), ar([3, 1.5]), ar([3,1.5])
vCentroid, vMedoid = np.mean([a, b, c, d], axis=0), GetMedoid([a, b, c, d])
vCentroid1, vMedoid1 = np.mean([a, b, c, d, e], axis=0), GetMedoid([a, b, c, d, e])

print(f'centroid  = {vCentroid},  medoid   = {vMedoid}')
print(f'centroid1 = {vCentroid1}, medoid1   = {vMedoid1}')

df = pd.DataFrame([a, b, c, d, e], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D', s=100);
fmt = dict(markerfacecolor='none', ms=10) # other plot formatting parameters
plt.plot(vCentroid[0], vCentroid[1], 'go', **fmt); # plot old centroid (green circle)
plt.plot(vMedoid[0], vMedoid[1], 'gx', ms=20); # plot old medoid (green X)
plt.plot(vCentroid1[0], vCentroid1[1], 'ro', **fmt); # plot new centroid (red circle)
plt.plot(vMedoid1[0], vMedoid1[1], 'rx', ms=20); # plot new medoid (red X)
            </pre>Since we added the medoid, we essentially duplicated the point $d$. This pulls the centroid a bit closer to $d$, from $[2, 1.875]$ to $[2.2, 1.8]$, but the centroid still does not coincide with any of the original points. The point $d$ remains the closest point to both the old and the new centroid, so the medoid stays at the location of $d$.
</details> 
</font>

<hr>

## **Task 3**

Use `a`, `b`, `c`, and `d` to define the corners of a unit square, i.e., a square with sides of unit length and its lower-left corner at the origin. Then compute its centroid and medoid. Plotting is not necessary, but helps to visualize.

<b>Hint:</b> Use <code>np.mean()</code> and <code>GetMedoid()</code> to compute a centroid and medoid, respectively.

In [None]:
# check solution here



<font color=#606366>
    <details><summary><font color=#B31B1B>▶ </font>See <b>solution</b>.</summary>
<pre class="ec">
a, b, c, d = ar([0,0]),ar([1,0]),ar([0,1]),ar([1,1])
vCentroid = np.mean([a, b, c, d], axis=0)
vMedoid = GetMedoid([a, b, c, d])

print(f'centroid = {vCentroid}, medoid   = {vMedoid}')

df = pd.DataFrame([a, b, c, d], columns=['x', 'y'])
ax = df.plot.scatter('x', 'y', grid=True, title='Centroid in 2D', s=100);
plt.plot(vCentroid[0], vCentroid[1], 'ro', markerfacecolor='none', ms=10); # plot red centroid
plt.plot(vMedoid[0], vMedoid[1], 'rx', ms=20); # plot medoid (red X)
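</pre>The centroid is $[0.5, 0.5]$, and all four corners are equally distant from it, so each corner qualifies as a medoid. <code>GetMedoid()</code> returns the first one, $[0, 0]$, because <code>np.argmin()</code> returns the index of the first minimum it encounters.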
</details> 
</font>

<hr>