<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
     <img style="float: right; padding-right: 10px" width="100" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>



# Week 7 | Lab: Clustering

**Clemson University** </br>
**Instructor(s):** Tim Ransom </br>

------------------------------------------------------------------------
## Learning objectives

- List different types of clustering algorithms.
- Apply k-means clustering to a dataset.
- Interpret the results of a k-means clustering analysis.
- Compare and contrast k-means and hierarchical clustering.
- Visualize clusters using scatter plots.

-----------------

In [None]:
""" RUN THIS CELL TO GET THE RIGHT FORMATTING """
import requests
from IPython.display import HTML
css_file = 'https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/css/cpsc6300.css'
styles = requests.get(css_file).text
HTML(styles)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.preprocessing import StandardScaler
from sklearn import preprocessing
from sklearn.cluster import KMeans

from matplotcheck.base import PlotTester
from matplotlib.patches import PathPatch
%matplotlib inline 


<div class="exercise"><b>Question 1:</b> </div>
Why is clustering important in data analysis and machine learning? (Select the most appropriate answer)

- 1. It is a supervised learning technique used for classification tasks.
- 2. It helps in discovering hidden patterns and segmenting data without prior labels.
- 3. Clustering is only useful for preprocessing and does not have real-world applications.
- 4. It is used solely for feature selection in deep learning models.


In [None]:
# your code here
raise NotImplementedError

# Clustering Algorithms

We will now walk through three clustering algorithms, first discussing
them at a high level, then showing how to implement them with Python
libraries. Let's first load and scale our data, so that no single
dimension dominates the distance calculations simply because of its
units or range:

In [None]:
# loads our data and displays the first few rows
multishapes = pd.read_csv("data/multishapes.csv")
multishapes.head()

In [None]:
ms_df = multishapes[['x','y']]
ms_df.describe()

In [None]:
# scale our data
scaled_df = pd.DataFrame(preprocessing.scale(ms_df), 
                         index=multishapes['shape'], 
                         columns=ms_df.columns)
scaled_df.describe()
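
As a rough check of what `preprocessing.scale` does, the sketch below standardizes a small array by hand and compares the result with `sklearn`. The array `toy` is made up for illustration and is not the multishapes data.

```python
import numpy as np
from sklearn import preprocessing

# Hypothetical toy data (not the lab dataset): two features on very different scales
toy = np.array([[1.0, 100.0],
                [2.0, 300.0],
                [3.0, 500.0],
                [4.0, 700.0]])

# Manual standardization: subtract the column mean, divide by the population std (ddof=0)
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

# sklearn's scale() should agree with the manual version
scaled = preprocessing.scale(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0), scaled.std(axis=0))
```

After scaling, each column has mean 0 and standard deviation 1, so both features contribute comparably to Euclidean distances.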

In [None]:
# plots our data
msplot = scaled_df.plot.scatter(x='x',y='y',c='Black',title="Multishapes data",figsize=(11,8.5))
msplot.set_xlabel("X")
msplot.set_ylabel("Y")
plt.show()

## 1. k-Means clustering:

### Code (via `sklearn`):

In [None]:
from sklearn.cluster import KMeans

<div class="exercise"><b>Exercise 1</b>:     </div>

-   Create a KMeans object named `ms_kmeans` using 3 clusters and a `random_state`
    of 109.
-   Fit the `ms_kmeans` object to the `scaled_df`

In [None]:
"""Write your code for Exercise-1 here:"""

# your code here
raise NotImplementedError

Now that we've run k-Means, we can look at various attributes of our
clusters. Full documentation is
[here](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html).

In [None]:
display(ms_kmeans.cluster_centers_)
display(ms_kmeans.labels_[0:10])

### Plotting

Take note of matplotlib's `c=` argument to color items in the plot,
along with our stacking two different plotting functions in the same
plot.

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))

# Scatter plot of the data points
scatter = ax.scatter(scaled_df['x'], scaled_df['y'], c=ms_kmeans.labels_)

# Highlight cluster centers
ax.scatter(ms_kmeans.cluster_centers_[:, 0],
           ms_kmeans.cluster_centers_[:, 1], 
           c='r', marker='h', s=100)

# add titles and labels
ax.set_title('Clustered Data with KMeans')
ax.set_xlabel('X-axis')
ax.set_ylabel('Y-axis')

plt.show()

<div class="exercise"><b>Question 2:</b></div>

### **Importance of Feature Scaling in Clustering**  

Clustering algorithms like **k-Means** rely on **distance metrics** (e.g., Euclidean distance) to assign points to clusters. If features have **different scales** (e.g., age in years vs. income in dollars), clustering results may be **skewed**.  

**Question:**  
Should we always scale our data before applying clustering algorithms like k-Means?  

**Select the most appropriate answer:**  

- 1. No, clustering algorithms are not affected by feature scaling.  
- 2. Yes, because clustering methods like k-Means rely on distance metrics, and unscaled features with larger ranges can dominate the clustering process.  
- 3. Scaling is only required when dealing with categorical data.  
- 4. Scaling should be avoided as it distorts the natural relationships in the data.  

**Store your answer in an integer variable named `answer` in the code cell below.**


In [None]:
# your code here
raise NotImplementedError

## 1.1 Quality of Clusters: Inertia

Inertia measures the total squared distance from points to their
cluster's centroid. We want this distance to be relatively
small, but increasing the number of clusters will naturally make it
smaller. In the extreme, if every point has its own cluster, the
inertia is 0, which is obviously not a useful way to cluster. One
way to determine a reasonable number of clusters is to simply try many
different clusterings as we vary **k**, and each time, measure the
overall inertia.
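
To make the definition concrete, here is a minimal sketch that recomputes inertia by hand and compares it with the `inertia_` attribute of a fitted `KMeans`. The data in `toy` are synthetic and purely illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs (not the lab dataset)
rng = np.random.default_rng(0)
toy = np.vstack([rng.normal(0, 0.5, size=(50, 2)),
                 rng.normal(5, 0.5, size=(50, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy)

# Inertia by hand: squared distance from each point to its assigned centroid, summed
assigned_centers = km.cluster_centers_[km.labels_]
manual_inertia = ((toy - assigned_centers) ** 2).sum()

print(manual_inertia, km.inertia_)
```

The two printed numbers should agree up to floating-point error.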

In [None]:
wss = []
for i in range(1,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    wss.append(fitx.inertia_)

plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), wss, 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Inertia')
plt.title('The Elbow Method showing the optimal $k$')
plt.show()

Look for the place(s) where the inertia stops decreasing as much (i.e., the
'elbow' of the curve). It seems that 4 would be a good number of
clusters, as a higher *k* yields diminishing returns.

## 1.2 Quality of Clusters: Silhouette

Let's say we have a data point $i$, and the cluster it belongs to is
referred to as $C(i)$. One way to measure the quality of a cluster
$C(i)$ is to measure how close its data points are to each other
(within-cluster) compared to nearby, other clusters $C(j)$. This is what
`Silhouette Scores` provide for us. The range is \[-1,1\]; 0 indicates a
point on the decision boundary (equal average closeness to points
intra-cluster and out-of-cluster), and negative values mean that datum
might be better in a different cluster.

Specifically, let $a(i)$ denote the average distance data point $i$ is
to the other points in the same cluster:</br>
</br>
<center>
$a(i) = \frac{1}{|C_i| - 1} \sum_{j \in C_i, i \neq j} d(i, j)$
</center>
</br>
</br>
Similarly, we can compute the average distance from data point $i$
to the points in each **other** cluster. The smallest of these averages
(the one for the nearest neighboring cluster) is denoted by $b(i)$:</br>
</br>
<center>
$b(i) = \min_{C_k \neq C_i} \frac{1}{|C_k|} \sum_{j \in C_k} d(i, j)$
</center>
</br>
</br>
Hopefully our data point $i$ is much closer, on average, to points
within its own cluster (i.e., $a(i)$) than it is to its closest
neighboring cluster (i.e., $b(i)$). The silhouette score quantifies this as
$s(i)$:</br>
</br>
<center>
$s(i) = \frac{b(i) - a(i)}{\max \{ a(i), b(i) \}}, \text{ if } |C_i| > 1$
</center>
</br>
**NOTE:** If data point $i$ belongs to its own cluster (no other
points), then the silhouette score is set to 0 (otherwise, $a(i)$ would
be undefined).
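
The definitions of $a(i)$, $b(i)$, and $s(i)$ above can be sketched directly in code. The helper `silhouette_by_hand` and the arrays `toy` and `labels` are hypothetical names introduced only for this illustration; the result is compared against `sklearn`'s `silhouette_samples()`.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import silhouette_samples

# Hypothetical toy data and cluster labels (not the lab dataset)
toy = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                [5.0, 5.0], [5.0, 6.0],
                [10.0, 0.0], [11.0, 0.0]])
labels = np.array([0, 0, 0, 1, 1, 2, 2])

dists = cdist(toy, toy)  # pairwise Euclidean distances d(i, j)

def silhouette_by_hand(i):
    own = labels == labels[i]
    # a(i): mean distance to the other points in i's own cluster
    # (d(i, i) = 0, so summing over the whole cluster and dividing by |C_i| - 1 works)
    a = dists[i, own].sum() / (own.sum() - 1)
    # b(i): smallest mean distance to the points of any other cluster
    b = min(dists[i, labels == k].mean()
            for k in np.unique(labels) if k != labels[i])
    return (b - a) / max(a, b)

manual = np.array([silhouette_by_hand(i) for i in range(len(toy))])
print(np.allclose(manual, silhouette_samples(toy, labels)))
```

This sketch assumes every cluster has at least two points, so the singleton case from the note above never arises.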

The silhouette score plotted below is the **overall average** across all
points in our dataset.

The `silhouette_score()` function is available in `sklearn`. We can
loop over values of $k$, fit k-Means for each, and plot the average
silhouette score of each clustering.

In [None]:
from sklearn.metrics import silhouette_score

# the silhouette score is undefined for k=1, so we use 0 as a placeholder
scores = [0]
for i in range(2,11):
    fitx = KMeans(n_clusters=i, init='random', n_init=5, random_state=109).fit(scaled_df)
    score = silhouette_score(scaled_df, fitx.labels_)
    scores.append(score)

plt.figure(figsize=(11,8.5))
plt.plot(range(1,11), np.array(scores), 'bx-')
plt.xlabel('Number of clusters $k$')
plt.ylabel('Average Silhouette')
plt.title('Average silhouette score for each $k$')
plt.show()


### 1.2.1 Visualizing all Silhouette scores for a particular clustering

Below, we borrow from an `sklearn` example. The second plot may be
overkill.

-   The second plot is just the scaled data.
-   If you only need the raw silhouette scores, use the
    `silhouette_samples()` function

In [None]:
from sklearn.metrics import silhouette_samples, silhouette_score
import matplotlib.cm as cm
#modified code from http://scikit-learn.org/stable/auto_examples/cluster/plot_kmeans_silhouette_analysis.html

def silplot(X, clusterer, pointlabels=None):
    cluster_labels = clusterer.labels_
    n_clusters = clusterer.n_clusters

    # Create a subplot with 1 row and 2 columns
    fig, (ax1, ax2) = plt.subplots(1, 2)
    fig.set_size_inches(11,8.5)

    # The 1st subplot is the silhouette plot
    # The silhouette coefficient can range from -1, 1 but in this example all
    # lie within [-0.1, 1]
    ax1.set_xlim([-0.1, 1])

    # The (n_clusters+1)*10 is for inserting blank space between silhouette
    # plots of individual clusters, to demarcate them clearly.
    ax1.set_ylim([0, len(X) + (n_clusters + 1) * 10])

    # The silhouette_score gives the average value for all the samples.
    # This gives a perspective into the density and separation of the formed
    # clusters
    silhouette_avg = silhouette_score(X, cluster_labels)
    print("For n_clusters = ", n_clusters,
          ", the average silhouette_score is ", silhouette_avg,".",sep="")

    # Compute the silhouette scores for each sample
    sample_silhouette_values = silhouette_samples(X, cluster_labels)

    y_lower = 10
    for i in range(n_clusters):
        # Aggregate the silhouette scores for samples belonging to
        # cluster i, and sort them
        ith_cluster_silhouette_values = \
            sample_silhouette_values[cluster_labels == i]

        ith_cluster_silhouette_values.sort()

        size_cluster_i = ith_cluster_silhouette_values.shape[0]
        y_upper = y_lower + size_cluster_i

        color = cm.nipy_spectral(float(i) / n_clusters)
        ax1.fill_betweenx(np.arange(y_lower, y_upper),
                          0, ith_cluster_silhouette_values,
                          facecolor=color, edgecolor=color, alpha=0.7)

        # Label the silhouette plots with their cluster numbers at the middle
        ax1.text(-0.05, y_lower + 0.5 * size_cluster_i, str(i))

        # Compute the new y_lower for next plot
        y_lower = y_upper + 10  # 10 for the 0 samples

    ax1.set_title("The silhouette plot for the various clusters.")
    ax1.set_xlabel("The silhouette coefficient values")
    ax1.set_ylabel("Cluster label")

    # The vertical line for average silhouette score of all the values
    ax1.axvline(x=silhouette_avg, color="red", linestyle="--")

    ax1.set_yticks([])  # Clear the yaxis labels / ticks
    ax1.set_xticks([-0.1, 0, 0.2, 0.4, 0.6, 0.8, 1])

    # 2nd Plot showing the actual clusters formed
    colors = cm.nipy_spectral(cluster_labels.astype(float) / n_clusters)
    ax2.scatter(X[:, 0], X[:, 1], marker='.', s=200, lw=0, alpha=0.7,
                c=colors, edgecolor='k')
    xs = X[:, 0]
    ys = X[:, 1]

    if pointlabels is not None:
        for i in range(len(xs)):
            ax2.text(xs[i],ys[i],pointlabels[i])

    # Labeling the clusters
    centers = clusterer.cluster_centers_
    # Draw white circles at cluster centers
    ax2.scatter(centers[:, 0], centers[:, 1], marker='o',
                c="white", alpha=1, s=200, edgecolor='k')

    for i, c in enumerate(centers):
        ax2.scatter(c[0], c[1], marker='$%d$' % int(i), alpha=1,
                    s=50, edgecolor='k')

    ax2.set_title("The visualization of the clustered data.")
    ax2.set_xlabel("Feature space for the 1st feature")
    ax2.set_ylabel("Feature space for the 2nd feature")

    plt.suptitle(("Silhouette analysis for KMeans clustering on sample data "
                  "with n_clusters = %d" % n_clusters),
                 fontsize=14, fontweight='bold')


In [None]:
# run k-means with 3 clusters
ms_kmeans = KMeans(n_clusters=3, init='random', n_init=3, random_state=109).fit(scaled_df)

# plot a fancy silhouette plot
silplot(scaled_df.values, ms_kmeans)

<div class="exercise"><b>Exercise 2:</b></div>

### **Optimizing k-Means Clustering using Silhouette Scores**

Using the **optimal number of clusters** suggested by the silhouette scores (the $k$ with the highest average silhouette in the plot of average silhouette against $k$ above):

1. **Fit a new k-Means model** named `ms_kmeans_optimal` using the **optimal number of clusters**.
2. **Plot the clusters** as we originally did with k-Means.
3. **Plot the silhouette scores** just like in the previous cells.
4. Compare the clustering results for **3 clusters** versus the **optimal number of clusters** found using silhouette scores.

**Instructions:**
- Create a new `KMeans` object named `ms_kmeans_optimal` using the **optimal number of clusters**.
- Fit `ms_kmeans_optimal` to `scaled_df`.
- Generate a **scatter plot** of the clusters.
- Generate a **silhouette plot** to visualize the quality of the clustering.

**Write your code in the cell below.**


In [None]:
"""Write your code for Exercise-2 here:"""

# your code here
raise NotImplementedError

## 1.3 Quality of Clusters: Gap Statistic

The gap statistic compares within-cluster distances (like in
silhouette), but instead of comparing against the second-best existing
cluster for that point, it compares our clustering's overall average to
the average we'd see if the data were generated at random (we'd expect
randomly generated data not to have any inherent patterns
that can be easily clustered).

In essence, the within-cluster distances (in the elbow plot) will go
down just because we have more clusters. We additionally calculate how
much they'd go down on non-clustered data with the same spread as our
data and subtract that trend out to produce the plot below.
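
The idea can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the implementation used by the `gap_statistic` package: the helper `gap_sketch` is hypothetical, it uses k-Means inertia as the within-cluster dispersion, and it draws reference data uniformly over the bounding box of the data.

```python
import numpy as np
from sklearn.cluster import KMeans

def gap_sketch(X, k, n_refs=10, seed=0):
    """Rough gap value for one k: mean log-inertia on uniform
    reference data minus log-inertia on the real data X."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)

    def log_inertia(data):
        km = KMeans(n_clusters=k, n_init=5, random_state=seed).fit(data)
        return np.log(km.inertia_)

    # Reference datasets: uniform noise with the same spread as X (no cluster structure)
    ref = [log_inertia(rng.uniform(lo, hi, size=X.shape)) for _ in range(n_refs)]
    return np.mean(ref) - log_inertia(X)

# Hypothetical toy data: three tight blobs (not the lab dataset)
rng = np.random.default_rng(1)
toy = np.vstack([rng.normal(c, 0.2, size=(40, 2))
                 for c in ([0, 0], [4, 0], [2, 4])])

gaps = {k: gap_sketch(toy, k) for k in range(1, 6)}
print(gaps)
```

A large gap at some $k$ means the real data cluster much more tightly at that $k$ than structureless data with the same spread would.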

In [None]:
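# If you need to install the gap_statistic package, run the following on your local machine
!pip install --upgrade pip
!pip install gap-stat --only-binary :all:
# alternatively, install the development version from GitHub
# (GitHub no longer serves the unauthenticated git:// protocol, so use https)
!pip install git+https://github.com/milesgranger/gap_statistic.git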

In [None]:
from gap_statistic import OptimalK

gs_obj = OptimalK()

n_clusters = gs_obj(scaled_df.values, n_refs=50, cluster_array=np.arange(1, 15))
print('Optimal clusters: ', n_clusters)

In [None]:
gs_obj.gap_df

In [None]:
gs_obj.plot_results() # makes nice plots

If we wish to add error bars to help us decide how many clusters to use,
the following code displays them:

In [None]:
def display_gapstat_with_errbars(gap_df):
    gaps = gap_df["gap_value"].values
    diffs = gap_df["diff"]

    err_bars = np.zeros(len(gap_df))
    err_bars[1:] = diffs[:-1] - gaps[:-1] + gaps[1:]

    plt.scatter(gap_df["n_clusters"], gap_df["gap_value"])
    plt.errorbar(gap_df["n_clusters"], gap_df["gap_value"], yerr=err_bars, capsize=6)
    plt.xlabel("Number of Clusters")
    plt.ylabel("Gap Statistic")
    plt.show()

display_gapstat_with_errbars(gs_obj.gap_df)


For more information about the `gap_statistic` package, please see [the full
documentation here](https://github.com/milesgranger/gap_statistic).

## 2. Agglomerative Clustering
### Code (via `scipy`):

There are many different cluster-merging criteria, one of which is
Ward's criterion. Ward's criterion aims to keep the total within-cluster
variance as low as possible, so at each step it merges the two clusters
whose merger increases it the least. `scipy`'s `ward()` function
implements this method.
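
To illustrate what "harming the objective least" means, the sketch below computes how much the within-cluster sum of squares grows when two clusters are merged, both directly and with the standard closed form $\frac{|A||B|}{|A|+|B|}\lVert \bar{A} - \bar{B} \rVert^2$. The clusters `A` and `B` and the helper `sse` are made up for illustration.

```python
import numpy as np

def sse(points):
    """Within-cluster sum of squared distances to the cluster mean."""
    return ((points - points.mean(axis=0)) ** 2).sum()

# Hypothetical toy clusters (not the lab dataset)
A = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
B = np.array([[4.0, 4.0], [5.0, 4.0]])

# Direct computation: growth in total within-cluster SSE if A and B merge
increase = sse(np.vstack([A, B])) - (sse(A) + sse(B))

# Closed form for the same quantity:
# |A||B| / (|A| + |B|) * squared distance between the two cluster means
closed_form = (len(A) * len(B) / (len(A) + len(B))
               * ((A.mean(axis=0) - B.mean(axis=0)) ** 2).sum())

print(increase, closed_form)
```

At each step, Ward's method picks the pair of clusters for which this increase is smallest.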

In [None]:
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

plt.figure(figsize=(11,8.5))
dist_mat = pdist(scaled_df, metric="euclidean")
ward_data = hac.ward(dist_mat)

hac.dendrogram(ward_data);

How do you read a plot like the above? What are valid options for number
of clusters, and how can you tell? Are some more valid than others? Does
it make sense to compute silhouette scores for an agglomerative
clustering? If we wanted to compute silhouette scores, what would we
need for this to be possible?

## Lessons:

-   It's expensive: O(n<sup>3</sup>) time complexity and
    O(n<sup>2</sup>) space complexity.
-   Many choices for linkage criteria
-   Every node gets clustered (no child left behind)
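
As a small illustration of the quadratic memory cost, the sketch below checks how many pairwise distances `pdist` stores for a few sample sizes. The data are random and purely illustrative.

```python
import numpy as np
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
sizes = (10, 100, 1000)

# pdist returns a condensed distance matrix: one entry per unordered pair of points
pair_counts = {n: pdist(rng.normal(size=(n, 2))).shape[0] for n in sizes}

for n in sizes:
    print(n, pair_counts[n], n * (n - 1) // 2)
```

Multiplying the number of points by 10 multiplies the number of stored distances by roughly 100.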


## 3. DBSCAN Clustering

### Code (via `sklearn`):

`DBSCAN` uses an intuitive notion of denseness to define clusters, rather than defining clusters by a central point as in k-means.

DBSCAN is implemented in good ol' sklearn, but there aren't great
automated tools for searching for the optimal `eps` (epsilon) parameter. For
full documentation, please [visit this
page](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html).

In [None]:
from sklearn.cluster import DBSCAN

plt.figure(figsize=(11,8.5))

fitted_dbscan = DBSCAN(eps=0.2).fit(scaled_df)

plt.scatter(scaled_df['x'], scaled_df['y'], c=fitted_dbscan.labels_);

**Note:** the dark purple dots are not clustered with anything else.
DBSCAN treats them as noise and gives them the label `-1`. You can
validate this by setting `eps` to a very small value and increasing
`min_samples` to a high value. Under these conditions nothing clusters,
and all dots become dark purple.
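
A minimal sketch of how noise points show up in `labels_`, using synthetic data (one dense blob plus two isolated points) rather than the lab dataset. The parameter values here are chosen only for this toy example.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: one dense blob plus two far-away isolated points
rng = np.random.default_rng(0)
blob = rng.normal(0, 0.1, size=(50, 2))
outliers = np.array([[5.0, 5.0], [-5.0, 5.0]])
toy = np.vstack([blob, outliers])

labels = DBSCAN(eps=0.5, min_samples=5).fit(toy).labels_

# Noise points carry the label -1; real clusters are numbered from 0
n_noise = int((labels == -1).sum())
n_clusters = len(set(labels) - {-1})
print(n_clusters, n_noise)
```

Because `-1` is the smallest label, noise points get the lowest color in the default colormap when `labels_` is passed to `c=`.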

In [None]:
from sklearn.neighbors import NearestNeighbors

# x-axis is each individual data point, sorted by increasing distance
# y-axis is the distance to its (min_samples - 1)-th nearest neighbor
# (kneighbors on the fitted data counts each point as its own nearest neighbor)
def plot_epsilon(df, min_samples):
    fitted_neighbors = NearestNeighbors(n_neighbors=min_samples).fit(df)
    distances, indices = fitted_neighbors.kneighbors(df)
    dist_to_nth_nearest_neighbor = distances[:,-1]
    plt.plot(np.sort(dist_to_nth_nearest_neighbor))
    plt.xlabel("Index\n(sorted by increasing distances)")
    plt.ylabel("{}-NN Distance (epsilon)".format(min_samples-1))
    plt.tick_params(right=True, labelright=True)


<div class="exercise"><b>Exercise 3:</b></div>

### **Experimenting with DBSCAN Parameters**
DBSCAN clustering uses the **epsilon (eps)** parameter to define the neighborhood size and **min_samples** to determine the minimum number of points required to form a dense cluster.  

#### **Tasks:**
1. Experiment with the **DBSCAN** clustering algorithm by **changing the epsilon (`eps`)** value and the **min_samples** parameter.
2. Identify the **default value** for `min_samples` (since the code above does not explicitly set it).
3. Use the `plot_epsilon()` function to **inspect how far each data point is from its nearest neighbor**.
4. Compute the **Nth nearest neighbor distance** for `min_samples` values **ranging from 1 to 10** and store the results in variables:
   - `value_a = f(1)`
   - `value_b = f(2)`

**This exercise is for experimental practice and will not be graded.**


In [None]:
plot_epsilon(scaled_df, 3)

## Lessons:

-   Can cluster non-linear relationships very well; potential for more
    natural, arbitrarily shaped groupings
-   Does not require specifying the \# of clusters (i.e., **k**); the
    algorithm determines it
-   Robust to outliers
-   Very sensitive to the parameters (requires strong knowledge of the
    data)
-   Doesn't guarantee that every (or ANY) item will be clustered

# END