<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
     <img style="float: right; padding-right: 10px" width="100" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png"> </div>

# Week 7 | Homework: Clustering

**Clemson University** <br>
**Instructor(s):** Tim Ransom <br>

------------------------------------------------------------------------
## Learning objectives

- List different types of clustering algorithms.
- Apply k-means clustering to a dataset.
- Interpret the results of a k-means clustering analysis.
- Compare and contrast k-means and hierarchical clustering.
- Visualize clusters using scatter plots.

-----------------

# About

This homework assesses your knowledge of clustering concepts and their implementation in Python using scipy and scikit-learn.
As presented in class, scikit-learn is a Python library that provides many tools for machine learning model development and analysis.
In addition to numerous foundational scientific computing methods, scipy provides hierarchical clustering support. You may refer to the course lectures and labs while completing this assignment. For complete
information, you may reference:

-   Python documentation [here](https://www.python.org/)
-   scikit-learn documentation
    [here](https://scikit-learn.org/stable/index.html)
-   scipy documentation [here](https://scipy.org/)
-   Pandas documentation [here](https://pandas.pydata.org/)
-   matplotlib documentation [here](https://matplotlib.org/)
-   seaborn documentation [here](https://seaborn.pydata.org/).

## Setup Instructions

In the exercises below, you will use data from the following file. Make
sure you have copied it to the appropriate location (e.g.,
`YOUR_COURSEDIR/data`):

-   seeds_cluster.csv

Before beginning the exercises, execute the setup code cells below to
import the required Python packages and load the data.

To begin, run the formatting cell and then import the Python packages
that are required for this homework:

In [None]:
""" RUN THIS CELL TO GET THE RIGHT FORMATTING """
import requests
from IPython.display import HTML
css_file = 'https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/css/cpsc6300.css'
styles = requests.get(css_file).text
HTML(styles)

In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, silhouette_samples
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

Execute the code cell below to load data from the
`seeds_cluster.csv` file into a Pandas DataFrame, `df_seeds`.
The columns include information about geometrical properties of kernels
belonging to different varieties of wheat:

-   `area` - the surface area, $A$, of the kernel
-   `perim` - the perimeter, $P$, of the kernel
-   `compact` - the compactness, $C = 4\pi A / P^2$
-   `klength` - the length of the kernel
-   `kwidth` - the width of the kernel
-   `asym` - asymmetry coefficient
-   `kglen` - the length of the kernel groove
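
To build some intuition for the `compact` column, the sketch below evaluates the compactness formula for two simple shapes. The helper name `compactness` and the example shapes are illustrative assumptions and are not part of the dataset or the assignment.

```python
import math

def compactness(area, perimeter):
    """Return C = 4*pi*A / P^2 for a planar shape."""
    return 4 * math.pi * area / perimeter ** 2

# A circle of radius r has A = pi*r^2 and P = 2*pi*r, so C equals 1.
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A unit square has A = 1 and P = 4, so C = pi/4 (about 0.785).
square_c = compactness(1.0, 4.0)

print(round(circle_c, 3), round(square_c, 3))
```

A circle attains the largest possible value, so kernels with compactness closer to 1 are rounder.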

In [None]:
df_seeds = pd.read_csv('data/seeds_cluster.csv')
display(df_seeds.head(3))

<div class="exercise"><b>Exercise 1</b>: </div>

- In the code cell below, use the scikit-learn `StandardScaler` to center
and scale the values in the `df_seeds` dataframe. 
- Store the result in a new dataframe called `df_seeds_scaled`. 
- When creating the new dataframe, you should provide input arguments to ensure the columns and index for the new dataframe are the same as those in `df_seeds`.

**Hint:** See Part 1 of lab 7.
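
As a reminder of the general pattern, here is a minimal sketch on a hypothetical toy dataframe (`df_toy` and its columns are made up for illustration, not taken from the seeds data). It assumes the default scikit-learn configuration, in which `fit_transform` returns a NumPy array, so the column and index labels have to be re-attached explicitly.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data, not the seeds dataset.
df_toy = pd.DataFrame(
    {"height": [1.0, 2.0, 3.0, 4.0], "weight": [10.0, 20.0, 30.0, 40.0]},
    index=["a", "b", "c", "d"],
)

scaler = StandardScaler()
scaled = scaler.fit_transform(df_toy)  # plain NumPy array by default

# Re-attach the original labels so rows and columns still line up.
df_toy_scaled = pd.DataFrame(scaled, columns=df_toy.columns, index=df_toy.index)

# Each column should now have mean 0 and (population) standard deviation 1.
print(df_toy_scaled.mean().round(6).tolist())
print(df_toy_scaled.std(ddof=0).round(6).tolist())
```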

In [None]:
"""Your code for Exercise 1 here:"""

# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 2</b>: </div>

- In the code cell below, use the inertia (aka elbow) method to select the optimal number of clusters, $K$, when clustering the `df_seeds_scaled` data with the scikit-learn `KMeans` clustering method.
- Your solution should consider values of $K\in\left[1,10\right]$ (including 1 and 10).
- Store the inertia value for each $K$ in a list called `inertia_values`.
- Create a plot with the values of $K$ on the x-axis and the inertia values on the y-axis.
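
For reference, inertia is the sum of squared distances from each point to its assigned cluster centroid. The sketch below checks this on a hypothetical four-point dataset (the array `X` and the choice of two clusters are assumptions for illustration only) by comparing a manual calculation with the `inertia_` attribute of a fitted `KMeans` model.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two tight groups on a line.
X = np.array([[0.0], [1.0], [10.0], [11.0]])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up the centroid assigned to each point, then sum the squared distances.
assigned_centers = model.cluster_centers_[model.labels_]
manual_inertia = float(((X - assigned_centers) ** 2).sum())

print(manual_inertia, model.inertia_)
```

Because inertia can only stay the same or shrink as more clusters are added, the elbow method looks for the point where the decrease levels off rather than for a minimum.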

In [None]:
k_values = range(1, 11)

In [None]:
"""Your code for Exercise 2 here:"""

# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 3:</b></div>  
Based on the elbow method plot from the previous exercise, what is the optimal number of clusters, and why? Select the most appropriate answer.

- **1.** The optimal number of clusters is **2** because the inertia drops significantly at $K = 2$, and further increases in $K$ do not improve clustering quality.
- **2.** The optimal number of clusters is **3** because the elbow point occurs at $K = 3$, where inertia begins to decrease at a slower rate, indicating diminishing returns in clustering quality.
- **3.** The optimal number of clusters is **4** because the inertia continues to drop noticeably until $K = 4$, suggesting additional structure in the data.
- **4.** There is no single optimal number of clusters; both **3 and 4** could be reasonable choices based on the trade-off between simplicity and capturing additional structure.

**Store your answer in an integer variable named `answer` in the code cell below.**


In [None]:
# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 4</b>: </div>

In the code cell below:

-   set the variable `K` to the optimal number of *KMeans* clusters you
    identified in the previous exercise
-   create a scikit-learn `KMeans` model named `cluster_model` with
    `n_clusters` equal to the optimal `K` and `random_state` = 42
-   fit the model on the `df_seeds_scaled` data.
-   print the shape of the cluster centers

If you identified multiple candidates, select only one. (No points will
be deducted from this exercise for an incorrect value of `K`).

In [None]:
"""Write your code for Exercise 4 here:"""

# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 5:</b></div> 

Unlike the multishapes example we saw in the lab exercise, the seeds dataset has seven (7) features that were used for clustering. This means each cluster centroid consists of seven values, placing it in a seven-dimensional space. Since we cannot visualize clusters in this space, we can instead examine them in two-dimensional space by selecting pairs of features.

For example, we can create a scatter plot of **seed perimeter vs. seed compactness**, where each data point is colored according to its assigned cluster.

In the code cell below, complete the function **`plot_cluster`** to visualize clusters based on two selected features.

The function takes the following inputs:

- `df` *(pd.DataFrame)* - A Pandas dataframe containing the dataset.
- `c1` *(str)* - The column name of the first feature (x-axis).
- `c2` *(str)* - The column name of the second feature (y-axis).
- `labels` *(array-like)* - The cluster label for each row in `df`.

Your completed function should:
- Create a scatter plot with **`df[c1]`** as the x-axis and **`df[c2]`** as the y-axis.
- Color the data points according to their cluster assignments.
- Label the x-axis and y-axis with the feature names.

Use the function to plot the clusters for **perimeter** vs. **compactness**.


In [None]:
def plot_cluster(df, c1, c2, labels):
    """Your code for Exercise 5 here:"""
    # your code here
    raise NotImplementedError

# use the function to plot the clusters for perim vs compact
colors = cm.rainbow(np.linspace(0, 1, K))
plot_cluster(df_seeds_scaled, 'perim', 'compact', cluster_model.labels_)

<div class="exercise"><b>Exercise 6:</b></div> 

In this exercise, you will perform **agglomerative hierarchical clustering** on the `df_seeds_scaled` dataset and visualize the results using a **dendrogram**.

Follow these steps:

- Use the **`scipy.cluster.hierarchy`** module (imported as `hac`) to perform the clustering.
- Set **Euclidean** as the spatial distance metric.
- Use [complete linkage](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.complete.html#scipy.cluster.hierarchy.complete) as the linkage function.
- Store the linkage matrix in a variable called **`linkage_matrix`**.
- Use **`hac.dendrogram()`** to plot the dendrogram and visualize the hierarchical clustering structure.

### **Validation Criteria:**
Your implementation will be tested on the following criteria:
1. **Correct Data Type**: The `linkage_matrix` should be a NumPy array.
2. **Correct Shape**: It should have a shape of `(n-1, 4)`, where `n` is the number of samples.
3. **Correct Clustering Parameters**:
   - The **linkage method** must be `"complete"`.
   - The **distance metric** must be `"euclidean"`.
4. **Cluster Distance Progression**: The merge distances in the third column (index 2) of the `linkage_matrix` should be **non-decreasing**.
5. **Correct Number of Merges**: The number of rows in `linkage_matrix` should be **one less** than the number of samples.
6. **Dendrogram Execution**: The dendrogram plot should execute without errors.
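
To make the layout of a SciPy linkage matrix concrete, the following sketch builds one for a hypothetical set of five points on a line (the array `X` is made up for illustration). Each row describes one merge as `[cluster_i, cluster_j, merge_distance, cluster_size]`.

```python
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

# Hypothetical toy data: five points on a line.
X = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])

# Condensed pairwise Euclidean distances, then complete-linkage merging.
Z = hac.linkage(pdist(X, metric="euclidean"), method="complete")

print(Z.shape)   # (n - 1, 4) for n = 5 samples
print(Z[:, 2])   # merge distances, one per row
print(Z[-1, 3])  # the final merge contains all samples
```

With complete linkage, the distance between two clusters is the largest pairwise distance between their members, which is why the final merge here happens at 20 (the gap between the points at 0 and 20).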

Write your implementation in the code cell below.


In [None]:
"""Write your code for Exercise 6 here:"""

# your code here
raise NotImplementedError

<div class="exercise"><b>Exercise 7:</b></div>  
In hierarchical clustering, forming a specific number of clusters requires "cutting" the dendrogram at an appropriate distance threshold. Based on the dendrogram created in the previous exercise, at approximately what distance should you cut the dendrogram to form **three (3) clusters**?

Select the most appropriate answer:

- **1.** Around **10**, as three clusters are clearly visible at this level.
- **2.** Around **15**, since it maintains balanced cluster sizes while still forming three groups.
- **3.** Around **20**, as this level creates three well-separated clusters.
- **4.** Around **25**, since cutting at a higher distance ensures three distinct groups.
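
As an illustration of what "cutting" means in code, the sketch below applies `hac.fcluster` with `criterion="distance"` to a hypothetical five-point dataset. The data and the threshold `t=3.0` are assumptions chosen for the toy example and say nothing about the right threshold for the seeds dendrogram.

```python
import numpy as np
import scipy.cluster.hierarchy as hac

# Hypothetical toy data: five points on a line.
X = np.array([[0.0], [1.0], [5.0], [6.0], [20.0]])
Z = hac.linkage(X, method="complete", metric="euclidean")

# Merges occur at distances 1, 1, 6, and 20. Cutting between 1 and 6 keeps
# the first two merges and leaves three clusters: {0, 1}, {5, 6}, and {20}.
labels = hac.fcluster(Z, t=3.0, criterion="distance")

print(len(set(labels)))
```

In a dendrogram, this corresponds to drawing a horizontal line at the threshold and counting how many vertical branches it crosses.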

**Store your answer in an integer variable named `answer` in the code cell below.**


In [None]:
# your code here
raise NotImplementedError

# END