<div class='heading'>
    <div style='float:left;'><h1>CPSC 4300/6300: Applied Data Science</h1></div>
     <img style="float: right; padding-right: 10px" width="100" src="https://raw.githubusercontent.com/bsethwalker/clemson-cs4300/main/images/clemson_paw.png">
</div>

**Clemson University**<br>
**Fall 2024**<br>
**Instructor(s):** Aaron Masino <br>

## Homework 4: Clustering
This homework assesses your knowledge of clustering concepts and their implementation using the Python libraries scipy and scikit-learn. As presented in class, scikit-learn is a Python library that provides many tools for machine learning model development and analysis. In addition to numerous foundational scientific computing methods, scipy provides hierarchical clustering support. You may refer to the course lectures and labs while completing this assignment. For complete information, you may reference:
-  Python documentation [here](https://www.python.org/)
-  scikit-learn documentation [here](https://scikit-learn.org/stable/index.html)
-  scipy documentation [here](https://scipy.org/)
-  Pandas documentation [here](https://pandas.pydata.org/)
-  matplotlib documentation [here](https://matplotlib.org/)
-  seaborn documentation [here](https://seaborn.pydata.org/)


# Setup Instructions
In the exercises below, you will use data from the following files. Make sure you have copied these to the appropriate location (e.g., _YOUR_COURSE_DIR/data_):
- seeds_cluster.csv

### Before beginning the exercises:
Execute the first three code cells to import the required Python packages, mount Google Drive, and load the data.

To begin, first import the Python packages that are required for this homework:

In [None]:
import matplotlib.pyplot as plt
import matplotlib.cm as cm
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.metrics import silhouette_score, silhouette_samples
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

SEED = 42

In [None]:
# mount the google drive - this is necessary to access supporting resources
from google.colab import drive
drive.mount("/content/drive")

Execute the code cell below to load data from the _seeds_cluster.csv_ file into a Pandas DataFrame, `df_seeds`. The columns contain geometrical properties of kernels belonging to different varieties of wheat:
- `area` - the surface area, $A$, of the kernel
- `perim` - the perimeter, $P$, of the kernel
- `compact` - the compactness, $C = 4\pi A / P^2$
- `klength` - the length of the kernel
- `kwidth` - the width of the kernel
- `asym` - asymmetry coefficient
- `kglen` - the length of the kernel groove
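
As a quick illustration of the compactness formula above, the sketch below evaluates $C = 4\pi A / P^2$ for two made-up shapes. The function name `compactness` and the shapes are assumptions chosen for illustration; they are not part of the dataset.

```python
import math


def compactness(area: float, perimeter: float) -> float:
    """Return C = 4*pi*A / P^2 for a shape with the given area and perimeter."""
    return 4 * math.pi * area / perimeter ** 2


# A circle of radius 2: A = pi * r^2, P = 2 * pi * r
r = 2.0
circle_c = compactness(math.pi * r ** 2, 2 * math.pi * r)

# A 1-by-4 rectangle: A = 4, P = 10
rect_c = compactness(4.0, 10.0)

print(round(circle_c, 3), round(rect_c, 3))
```

A circle attains the maximum value of 1, and elongated shapes score lower, so `compact` can be read as a measure of how round a kernel is.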

In [None]:
########### DO NOT MODIFY THIS CODE #############
df_seeds = pd.read_csv('/content/drive/MyDrive/Colab Notebooks/cpsc-4300-6300/data/seeds_cluster.csv')
display(df_seeds.head(3))

# Exercise 1 (1 point)
In the code cell below, use the scikit-learn `StandardScaler` to center and scale the values in the `df_seeds` dataframe. Store the result in a new dataframe called `df_seeds_scaled`. When creating the new dataframe, you should provide input arguments to ensure the columns and index for the new dataframe are the same as those in `df_seeds`.

Hint: See Part 1 of lab 5.
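
To make "center and scale" concrete, the sketch below standardizes a single made-up feature by hand using only the Python standard library. With its default settings, `StandardScaler` applies the same transformation to each column: it subtracts the column mean and divides by the population standard deviation. The values are invented for illustration and this is not a solution to the exercise.

```python
import statistics

# Made-up values for a single feature (not taken from the seeds data)
values = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

mean = statistics.fmean(values)
std = statistics.pstdev(values)  # population standard deviation (divides by n)

# z-score: subtract the mean, then divide by the standard deviation
scaled = [(v - mean) / std for v in values]

print(scaled)
print(statistics.fmean(scaled), statistics.pstdev(scaled))
```

After scaling, the feature has mean 0 and standard deviation 1, so features measured on different scales contribute comparably to the Euclidean distances used in clustering.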

In [None]:
df_seeds_scaled = None

########### START YOUR CODE HERE #############


# Exercise 2 (2 points)
In the code cell below, use the inertia (aka elbow) method to select the optimal number of clusters, $K$, when clustering the `df_seeds_scaled` data with the scikit-learn `KMeans` clustering method. Your solution should consider values of $K\in\left[1,10\right]$ (including 1 and 10). Create a plot with the values of $K$ on the x-axis and the inertia values on the y-axis.
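
As a reminder of what is being plotted, inertia is the sum of squared Euclidean distances from each point to its nearest cluster centroid; a fitted scikit-learn `KMeans` model reports it in its `inertia_` attribute. The sketch below computes it by hand for four made-up points and hand-picked centroids. The helper `inertia` and the data are assumptions for illustration only.

```python
def inertia(points, centroids):
    """Sum of squared Euclidean distances from each point to its nearest centroid."""
    total = 0.0
    for p in points:
        # squared distance from p to every centroid; keep the smallest
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids)
    return total


# Two well-separated groups of two points each (made-up data)
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]

one_cluster = inertia(points, [(5.0, 1.0)])                # centroid of all points
two_clusters = inertia(points, [(0.0, 1.0), (10.0, 1.0)])  # one centroid per group

print(one_cluster, two_clusters)
```

Here the inertia falls sharply when moving from one centroid to two. Adding further centroids to this data could only remove the small remaining within-group spread, and that flattening is the "elbow" to look for in the plot.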

In [None]:
########### START YOUR CODE HERE #############


# Exercise 3 (1 point)
Based on the plot you created in the previous exercise, what do you think is the optimal number of clusters and why? If you think there is more than one acceptable choice, indicate the possible choices and describe why you think there is not a single definitive choice. Enter your answer in the markdown cell below.

########### YOUR ANSWER HERE #############<br/><br/>


# Exercise 4 (1 point)
In the code cell below:
- set the variable `K` to the optimal number of _KMeans_ clusters you identified in the previous exercise
- create a scikit-learn `KMeans` model named `cluster_model` with `n_clusters` equal to the optimal `K`
- fit the model on the `df_seeds_scaled` data.
- print the shape of the cluster centers

If you identified multiple candidates, select only one. (No points will be deducted from this exercise for an incorrect value of `K`).

In [None]:
K = None
cluster_model = None

########### START YOUR CODE HERE #############


# Exercise 5 (2 points)
Unlike the multishapes example we saw in class, the `seeds` dataset has seven (7) features that were used to perform clustering. Notice that each cluster centroid includes seven (7) values. This is because the centroids lie in a seven (7) dimensional space. We cannot visualize the clusters in this space. Instead, we can view the clusters in two-dimensional space by selecting a pair of features. For example, we can create a scatter plot of seed area vs. seed compactness and view the clusters by coloring the data points based on their cluster. Counting the pairs where the x-axis and y-axis are the same feature, there are $7^2=49$ possible feature pairs.
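
The pair count above can be checked by enumerating the column names listed earlier. The sketch below also counts the pairs that remain if axis order is ignored and a feature is not paired with itself; that second count is an addition for comparison, not something the exercise requires.

```python
from itertools import combinations, product

features = ["area", "perim", "compact", "klength", "kwidth", "asym", "kglen"]

# Every (x-axis, y-axis) choice, including a feature plotted against itself
ordered_pairs = list(product(features, repeat=2))

# Unordered pairs of two different features
distinct_pairs = list(combinations(features, 2))

print(len(ordered_pairs), len(distinct_pairs))
```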

Let's look at one such pair. In the code cell below, complete the `plot_cluster` function. The function takes the following inputs:
- `df` - a Pandas dataframe
- `c1` - the column name of the first feature
- `c2` - the column name of the second feature
- `labels` - the cluster label for each row in `df`

Your completed function should create a scatter plot using the values for the first feature, `df[c1]`, for the x-axis coordinate, and the second feature, `df[c2]`, for the y-axis coordinate. The color of the markers in the scatter plot should indicate the cluster assignment for the datum. The plot x and y labels should be set to the feature names.

Hint: see lab 5 part 2

In [None]:
def plot_cluster(df, c1, c2, labels):
  ########### START YOUR CODE HERE #############

  return None


########### DO NOT MODIFY THIS CODE #############
# use the function to plot the clusters for perim vs compact
colors = cm.rainbow(np.linspace(0, 1, K))
plot_cluster(df_seeds_scaled, 'perim', 'compact',  cluster_model.labels_)

# Exercise 6 (2 points)
In the code cell below, use `scipy.cluster.hierarchy` (imported above as `hac`) to perform agglomerative clustering on the `df_seeds_scaled` data and plot the dendrogram. In your solution, use `euclidean` as the spatial distance metric. Use [complete linkage](https://docs.scipy.org/doc/scipy/reference/generated/scipy.cluster.hierarchy.complete.html#scipy.cluster.hierarchy.complete) as the linkage function.

In [None]:
########### START YOUR CODE HERE #############


# Exercise 7 (1 point)
In order to form a desired number of clusters it is necessary to "cut" the dendrogram. At approximately what value would you cut the dendrogram above to form three (3) clusters? Enter your answer in the markdown cell below.  
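
To illustrate what cutting a dendrogram means, the sketch below builds a complete-linkage hierarchy on seven made-up one-dimensional points and cuts it with `hac.fcluster`. The data and variable names are assumptions for illustration; the cut height found here has no bearing on the seeds data.

```python
import numpy as np
import scipy.cluster.hierarchy as hac
from scipy.spatial.distance import pdist

# Three obvious groups on a line (made-up data), shaped as 7 rows x 1 feature
X = np.array([0.0, 1.0, 2.0, 50.0, 51.0, 100.0, 102.0]).reshape(-1, 1)

# Each row of Z records one merge; column 2 holds the merge height
Z = hac.linkage(pdist(X, metric="euclidean"), method="complete")

# The third-from-last merge leaves 3 clusters; the second-from-last leaves 2.
# Any cut height strictly between those two merge heights yields 3 clusters.
lower, upper = Z[-3, 2], Z[-2, 2]
cut = (lower + upper) / 2

labels = hac.fcluster(Z, t=cut, criterion="distance")
n_clusters = len(set(labels))

print(lower, upper, cut, n_clusters)
```

On a dendrogram plot, this corresponds to drawing a horizontal line at height `cut` and counting the vertical branches it crosses.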

########### YOUR ANSWER HERE #############<br/><br/>