[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/jkitchin/s26-06642/blob/main/dsmles/assignments/hw11-clustering.ipynb)

# Homework 11: Clustering

**Due:** One week after Lecture 11

**Points:** 10

Apply clustering methods to discover patterns in chemical data.

In [None]:
! pip install -q pycse
from pycse.colab import pdf

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from scipy.cluster.hierarchy import dendrogram, linkage

## Problem 1: K-Means Clustering (4 points)

Cluster process operating conditions to identify operating regimes.

In [None]:
np.random.seed(42)

# Three operating regimes
regime1 = np.random.multivariate_normal([350, 5], [[100, 0], [0, 0.5]], 50)
regime2 = np.random.multivariate_normal([450, 8], [[80, 0], [0, 0.8]], 40)
regime3 = np.random.multivariate_normal([400, 3], [[60, 0], [0, 0.3]], 35)

operations = np.vstack([regime1, regime2, regime3])
op_data = pd.DataFrame(operations, columns=['temperature', 'pressure'])

plt.scatter(op_data['temperature'], op_data['pressure'], alpha=0.6)
plt.xlabel('Temperature (K)')
plt.ylabel('Pressure (atm)')
plt.title('Process Operating Points')
plt.show()

**1a.** Scale the data and apply K-Means with k=3. Plot the results colored by cluster.
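
As a rough illustration of the scale-then-cluster workflow, here is a minimal sketch on made-up data (not the homework data). The names `toy`, `toy_scaled`, and `toy_labels` are hypothetical, and the two-blob data set is an assumption chosen only to make the result easy to check.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical, not the homework data): two well-separated blobs
# whose two features live on very different scales.
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal([300.0, 1.0], [5.0, 0.1], size=(30, 2)),
    rng.normal([500.0, 9.0], [5.0, 0.1], size=(30, 2)),
])

# Standardize each column to zero mean and unit variance so that neither
# feature dominates the Euclidean distances K-Means relies on.
scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)

# n_init and random_state are set explicitly so the run is reproducible.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = km.fit_predict(toy_scaled)

# Number of points assigned to each cluster.
print(np.bincount(toy_labels))
```

To color a scatter plot by cluster, the label array can be passed to `plt.scatter` through its `c` argument.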

In [None]:
# Your code here


**1b.** Use the elbow method to determine the optimal number of clusters. Plot inertia vs k for k=1 to 8.
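
The elbow method refits K-Means for a range of k and records the inertia (the within-cluster sum of squared distances) each time. A minimal sketch on made-up three-blob data, assuming hypothetical names `toy`, `ks`, and `inertias`:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (hypothetical): three tight blobs in two dimensions.
rng = np.random.default_rng(1)
centers = ([0, 0], [4, 0], [2, 4])
toy = np.vstack([rng.normal(c, 0.3, size=(25, 2)) for c in centers])

# Fit K-Means once per candidate k and store the resulting inertia.
ks = range(1, 7)
inertias = []
for k in ks:
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(toy)
    inertias.append(km.inertia_)

for k, inertia in zip(ks, inertias):
    print(k, round(inertia, 1))
```

Plotting `inertias` against `ks` shows the "elbow": the k after which adding another cluster reduces the inertia only slightly.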

In [None]:
# Your code here


**1c.** Calculate the silhouette score for k=2, 3, 4, 5. Which k gives the highest score?
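
The silhouette score is only defined for two or more clusters, which is why the comparison starts at k=2. A minimal sketch on made-up data, with hypothetical names `toy`, `scores`, and `best_k`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Toy data (hypothetical): three tight blobs in two dimensions.
rng = np.random.default_rng(1)
centers = ([0, 0], [4, 0], [2, 4])
toy = np.vstack([rng.normal(c, 0.3, size=(25, 2)) for c in centers])

# Score each candidate k; values closer to 1 mean tighter, better-separated clusters.
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(toy)
    scores[k] = silhouette_score(toy, labels)

# The k with the highest mean silhouette score.
best_k = max(scores, key=scores.get)
print(best_k)
```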

In [None]:
# Your code here


**1d.** Report the cluster centers (in original units). What do they represent physically?
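
When K-Means is fit on standardized data, `cluster_centers_` is expressed in scaled units. One way to recover the original units is the scaler's `inverse_transform`. The sketch below uses made-up data and hypothetical variable names:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): two blobs centered near (300, 1) and (500, 9).
rng = np.random.default_rng(0)
toy = np.vstack([
    rng.normal([300.0, 1.0], [5.0, 0.1], size=(30, 2)),
    rng.normal([500.0, 9.0], [5.0, 0.1], size=(30, 2)),
])

scaler = StandardScaler()
toy_scaled = scaler.fit_transform(toy)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(toy_scaled)

# Centers in scaled units (roughly -1 and +1 in each column here).
centers_scaled = km.cluster_centers_

# Undo the standardization to express the centers in the original units.
centers_original = scaler.inverse_transform(centers_scaled)
print(np.round(centers_original, 2))
```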

In [None]:
# Your code here


## Problem 2: Hierarchical Clustering (3 points)

Group catalysts by the similarity of their surface area, pore volume, and acidity.

In [None]:
np.random.seed(42)

catalyst_props = pd.DataFrame({
    'sample': [f'Cat_{i}' for i in range(20)],
    'surface_area': [120, 125, 180, 175, 190, 150, 155, 145, 200, 210,
                    115, 130, 185, 170, 195, 160, 140, 148, 205, 195],
    'pore_volume': [0.3, 0.32, 0.5, 0.48, 0.52, 0.4, 0.38, 0.42, 0.55, 0.58,
                   0.28, 0.35, 0.51, 0.47, 0.53, 0.41, 0.39, 0.43, 0.56, 0.54],
    'acidity': [1.2, 1.1, 2.1, 2.0, 2.2, 1.5, 1.6, 1.4, 2.4, 2.5,
               1.0, 1.3, 2.15, 1.95, 2.25, 1.55, 1.45, 1.5, 2.45, 2.35]
})

catalyst_props.head()

**2a.** Create a dendrogram using Ward linkage. How many natural clusters do you see?
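
A dendrogram is drawn from a linkage matrix, and large jumps in the merge heights suggest where the natural clusters are. The sketch below, on made-up data with hypothetical names, builds the Ward linkage matrix and reads the cluster count off the largest jump; it is one simple heuristic, not the only way to read a dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (hypothetical): two tight groups of five points each.
rng = np.random.default_rng(2)
toy = np.vstack([rng.normal(c, 0.2, size=(5, 2)) for c in ([0, 0], [5, 5])])

# Each row of Z records one merge: [cluster_i, cluster_j, distance, new_size].
Z = linkage(toy, method="ward")
merge_heights = Z[:, 2]
print(np.round(merge_heights, 2))

# Heuristic: cut the tree just below the largest jump in merge height.
gaps = np.diff(merge_heights)
n_clusters = len(toy) - (int(np.argmax(gaps)) + 1)

# Flat cluster labels for that number of clusters.
flat_labels = fcluster(Z, t=n_clusters, criterion="maxclust")
print(n_clusters, flat_labels)
```

Passing `Z` to `scipy.cluster.hierarchy.dendrogram` draws the tree, and its `labels` argument accepts a list of sample names for the leaves.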

In [None]:
# Your code here


**2b.** Apply Agglomerative Clustering to get 3 clusters. Which catalysts are in each cluster?
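
Once cluster labels are stored as a DataFrame column, `groupby` can list the members of each cluster. A minimal sketch on a made-up table (the columns `name`, `x`, and `y` are hypothetical):

```python
import pandas as pd
from sklearn.cluster import AgglomerativeClustering

# Toy table (hypothetical): six named samples with two numeric features.
toy = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E", "F"],
    "x": [1.0, 1.1, 0.9, 5.0, 5.2, 4.9],
    "y": [2.0, 2.1, 1.9, 8.0, 8.1, 7.9],
})

# Ward-linkage agglomerative clustering on the numeric columns only.
model = AgglomerativeClustering(n_clusters=2, linkage="ward")
toy["cluster"] = model.fit_predict(toy[["x", "y"]])

# Collect the sample names that fall in each cluster.
members = toy.groupby("cluster")["name"].apply(list)
print(members)
```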

In [None]:
# Your code here


**2c.** Describe the characteristics of each cluster (e.g., "high surface area, high pore volume").
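
Per-cluster averages are a convenient starting point for describing what distinguishes each group. The sketch below assumes a made-up table that already carries a hypothetical `cluster` column:

```python
import pandas as pd

# Toy table (hypothetical): features plus previously assigned cluster labels.
toy = pd.DataFrame({
    "x": [1.0, 1.1, 0.9, 5.0, 5.2, 4.9],
    "y": [2.0, 2.1, 1.9, 8.0, 8.1, 7.9],
    "cluster": [0, 0, 0, 1, 1, 1],
})

# Mean of every feature within each cluster.
summary = toy.groupby("cluster").mean()
print(summary.round(2))
```

Comparing the rows of `summary` against the overall column means indicates whether a cluster is "high" or "low" in each property.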

In [None]:
# Your code here


## Problem 3: DBSCAN and Outliers (3 points)

Use density-based clustering to identify anomalous batches.

In [None]:
np.random.seed(42)

# Normal batches in two operating modes
mode1 = np.random.multivariate_normal([50, 350], [[4, 0], [0, 100]], 60)
mode2 = np.random.multivariate_normal([70, 450], [[4, 0], [0, 100]], 40)

# Add some outliers (equipment malfunctions)
outliers = np.array([[30, 380], [90, 400], [55, 550], [45, 250]])

all_batches = np.vstack([mode1, mode2, outliers])
batch_data = pd.DataFrame(all_batches, columns=['yield', 'temperature'])

plt.scatter(batch_data['yield'], batch_data['temperature'], alpha=0.6)
plt.xlabel('Yield (%)')
plt.ylabel('Temperature (K)')
plt.title('Batch Yield vs. Temperature')
plt.show()

**3a.** Scale the data, then apply DBSCAN with eps=0.5 and min_samples=5. How many points are flagged as outliers (label = -1)?
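
DBSCAN assigns the label -1 to points that do not belong to any dense region, so counting outliers amounts to counting that label. A minimal sketch on made-up data, with hypothetical names `dense`, `far`, and `n_noise`:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Toy data (hypothetical): one dense blob plus two isolated points.
rng = np.random.default_rng(3)
dense = rng.normal([0.0, 0.0], 0.5, size=(50, 2))
far = np.array([[10.0, 10.0], [-10.0, 10.0]])
toy = np.vstack([dense, far])

# eps is a distance in scaled units, so standardize before clustering.
toy_scaled = StandardScaler().fit_transform(toy)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(toy_scaled)

# Noise points carry the label -1; every other label is a cluster id.
n_noise = int(np.sum(labels == -1))
n_clusters = len(set(labels.tolist())) - (1 if -1 in labels else 0)
print(n_noise, n_clusters)
```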

In [None]:
# Your code here


**3b.** Plot the results with outliers highlighted in a different color.
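
One way to highlight noise points is to build a boolean mask from the labels and draw the two groups with separate `scatter` calls. The points and labels below are hypothetical stand-ins for what DBSCAN might return:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical points and labels (-1 marks noise).
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.9, 1.1],
    [5.0, 5.0], [5.1, 4.9],
    [9.0, 0.0], [0.0, 9.0],
])
labels = np.array([0, 0, 0, 1, 1, -1, -1])

# True where a point was labeled as noise.
is_noise = labels == -1

fig, ax = plt.subplots()
# Clustered points, colored by cluster id.
ax.scatter(points[~is_noise, 0], points[~is_noise, 1],
           c=labels[~is_noise], cmap="viridis", label="clustered")
# Noise points, drawn in a contrasting color and marker.
ax.scatter(points[is_noise, 0], points[is_noise, 1],
           c="red", marker="x", label="noise (label = -1)")
ax.set_xlabel("feature 1")
ax.set_ylabel("feature 2")
ax.legend()
plt.show()
```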

In [None]:
# Your code here


**3c.** When would you use DBSCAN vs K-Means for clustering process data?

*Your answer here:*

