In [1]:
import warnings
warnings.filterwarnings('ignore')


import pandas as pd
import numpy as np
from plotnine import *

from sklearn.preprocessing import StandardScaler #Z-score variables

from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

from sklearn.metrics import silhouette_score

%matplotlib inline

# GMM

## Some Review

### Gaussian Mixture Models
Gaussian Mixture Models fit with Expectation Maximization (we'll call this EM for short) is a clustering algorithm that's similar to k-means, except it doesn't assume every cluster is spherical (equal variance in every direction). That means clusters can be ellipses rather than just circles or spheres. For example, the graph on the left shows roughly spherical clusters, whereas the graph on the right shows non-spherical clusters.

<img src="https://drive.google.com/uc?export=view&id=1BslkqKXSuxYNcpAhFLlsVsBSdWUCWY3W"/>

Discussion: Which algorithm would work better for the clusters on the right? Which algorithm would work better for the clusters on the left?


The process for fitting EM is similar to k-means except for two main differences:

1. Instead of estimating ONLY cluster means/centers, we also estimate the variance for each predictor.
2. Instead of hard assignment (where each data point belongs to only 1 cluster), GMMs use soft assignment (where each data point has a probability of being in EACH cluster).
    - because there is no hard assignment, the cluster centers/means and variances are calculated using EVERY data point weighted by the probability that the data point belongs to that cluster. Data points that are unlikely to belong to a cluster barely affect the center/mean and variance of that cluster, whereas data points that are very likely to belong to a cluster have a larger influence on the center/mean and variance of that cluster.

This means that when clusters are NOT spherical, EM will be able to accommodate that, while k-means will not.
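To make the hard vs. soft assignment difference concrete, here is a minimal sketch on made-up data (the two stretched blobs and the `_demo` names are assumptions for illustration, not part of the classwork data). K-means returns one label per point, while `predict_proba` on a fitted `GaussianMixture` returns one probability per cluster for each point.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

# Synthetic data (assumed for illustration): two wide, flat 2-D blobs
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0, 0], scale=[3.0, 0.5], size=(200, 2))
blob_b = rng.normal(loc=[0, 3], scale=[3.0, 0.5], size=(200, 2))
X_demo = np.vstack([blob_a, blob_b])

# K-means: hard assignment, exactly one label per data point
km_demo = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X_demo)
hard_labels = km_demo.labels_

# GMM: soft assignment, one probability per cluster for every data point
gmm_demo = GaussianMixture(n_components=2, random_state=0).fit(X_demo)
soft_probs = gmm_demo.predict_proba(X_demo)

# Taking the most probable cluster turns soft assignments into hard labels
gmm_labels = soft_probs.argmax(axis=1)

print(hard_labels[:5])
print(soft_probs[:5].round(3))

# With the default covariance_type="full", each cluster gets its own
# covariance matrix: shape is (n_components, n_features, n_features)
print(gmm_demo.covariances_.shape)
```

Each row of `soft_probs` sums to 1, and points that sit between the two blobs will have probabilities that are less extreme than points deep inside one blob.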

### Math Takeaways
We only do math when it'll help you understand the algorithm! So here are some takeaways this math should help you understand:

- GMM does soft assignment: every data point belongs to every cluster with some probability
- Data points that are more likely to be in a cluster have more influence over its parameters
- GMM uses the EM algorithm to iteratively update the cluster distributions. It first assigns a responsibility to each data point (Expectation step), and then uses those responsibilities to calculate weighted means and variances for each cluster (Maximization step)
- Responsibilities measure the probability of a data point being in each cluster (technically the posterior probability).
- Responsibilities contain information about how common a cluster is as well as the likelihood of a data point belonging to that cluster
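The takeaways above can be sketched as a single EM iteration written out by hand. This is only an illustration under assumptions: the toy data, the starting guesses for `weights`, `means`, and `covs`, and all variable names are made up, and a real fit (like sklearn's) repeats these two steps until the likelihood stops improving.

```python
import numpy as np
from scipy.stats import multivariate_normal

# Toy data (assumed for illustration): two round 2-D blobs
rng = np.random.default_rng(1)
X_toy = np.vstack([
    rng.normal([0, 0], 1.0, size=(100, 2)),
    rng.normal([4, 4], 1.0, size=(100, 2)),
])
N, d = X_toy.shape
K = 2

# Hypothetical starting guesses for the cluster parameters
weights = np.full(K, 1 / K)                  # how common each cluster is
means = np.array([[1.0, 1.0], [3.0, 3.0]])   # cluster centers
covs = np.array([np.eye(d) for _ in range(K)])

# Expectation step: responsibility = (cluster weight * density at the point),
# normalized so each row sums to 1 across clusters
dens = np.column_stack([
    weights[k] * multivariate_normal(means[k], covs[k]).pdf(X_toy)
    for k in range(K)
])
resp = dens / dens.sum(axis=1, keepdims=True)

# Maximization step: weighted updates using EVERY data point
Nk = resp.sum(axis=0)                        # "effective" size of each cluster
weights = Nk / N
means = (resp.T @ X_toy) / Nk[:, None]
covs = np.array([
    ((resp[:, k, None] * (X_toy - means[k])).T @ (X_toy - means[k])) / Nk[k]
    for k in range(K)
])

print(resp[:3].round(3))
print(means.round(2))
```

Notice that `weights[k]` appears inside the responsibilities, which is why they carry information about how common a cluster is as well as how well the point fits that cluster's Gaussian.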


Let's redo our Beyonce clustering with GMM!

In [4]:
# 1. Load the data + standardize
beyonce = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/Beyonce_data.csv")

predictors = ["energy", "danceability", "valence"]

X = beyonce[predictors].copy()  # copy so we can safely add a "clusters" column later

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 2. Create empty model
gmm = GaussianMixture(n_components=4)  # K = 4, to match the plot titles below

# 3. Fit model + predict
labels = gmm.fit_predict(X_scaled)


# Add cluster labels to the original DataFrame
X["clusters"] = labels

# 4. Plot the results. We are plotting each combo of predictors
print(ggplot(X, aes(x="energy", y="danceability", color="factor(clusters)")) +
      geom_point() + theme_minimal() +
      labs(x="Energy", y="Danceability", title="GMM Clustering Results for K = 4",
           color="Clusters"))

print(ggplot(X, aes(x="energy", y="valence", color="factor(clusters)")) +
      geom_point() + theme_minimal() +
      labs(x="Energy", y="Valence", title="GMM Clustering Results for K = 4",
           color="Clusters"))

print(ggplot(X, aes(x="valence", y="danceability", color="factor(clusters)")) +
      geom_point() + theme_minimal() +
      labs(x="Valence", y="Danceability", title="GMM Clustering Results for K = 4",
           color="Clusters"))

Just like in K-Means, it is not always clear how many clusters to use. In K-Means, we calculated the silhouette scores for many values of K and selected the best. While you *can* calculate a silhouette score for a GMM, it might not be the best option: GMM allows for oblong clusters, so cohesion + separation (what the silhouette score is built on) might not be the best choice for a performance metric.

Another measure of cluster performance is the [**Bayesian Information Criterion**](https://github.com/scikit-learn/scikit-learn/blob/main/sklearn/mixture/_gaussian_mixture.py#L862) (BIC). The BIC measures how well your model fits the data while penalizing model complexity, and **lower** values of BIC are better.

$$ BIC = \underbrace{- 2 log(\hat{L})}_\text{goodness of fit} + \underbrace{k*log(N)}_\text{complexity penalty} $$

- $\hat{L}$ is the maximum likelihood of the model ($\prod_{n = 1}^N \sum_{k=1}^K w_k \mathcal{N}(x_n | \mu_k, \Sigma_k)$; from above)
- $N$ is the number of data points
- $k$ is the *number of parameters* in the model (just remember, the more clusters, the more parameters)

When a clustering solution is *good*, its likelihood will be *high*. So $-2 log(\hat{L})$ will be *low*.


BIC also **penalizes complexity** by adding on the $k*log(N)$ term. The more parameters we have to estimate ($k$), the higher $k*log(N)$ will be, so BIC *penalizes* models for having a lot of parameters. If adding parameters doesn't improve the fit of the model (measured by $- 2 log(\hat{L})$) by enough to offset the penalty, we don't want them. This is similar to the **LASSO** and **Ridge** penalties, where each parameter has to "pull its weight" in order to be "worth" the penalty.

In summary, we choose models with lower BIC values.
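As a sanity check on the formula, here is a small sketch that computes the BIC by hand and compares it to `GaussianMixture.bic()`. The data and the `_bic` names are made up for illustration. The parameter count assumes sklearn's default `covariance_type="full"`, where each cluster has `d` means and `d*(d+1)/2` covariance entries, plus `K - 1` free cluster weights.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Synthetic data (assumed for illustration): two round 2-D blobs
rng = np.random.default_rng(2)
X_bic = np.vstack([
    rng.normal([0, 0], 1.0, size=(150, 2)),
    rng.normal([5, 5], 1.0, size=(150, 2)),
])
N, d = X_bic.shape
K = 2

gmm_bic = GaussianMixture(n_components=K, random_state=0).fit(X_bic)

# score() returns the AVERAGE log-likelihood per data point,
# so multiply by N to get the total log-likelihood, log(L-hat)
log_L_hat = gmm_bic.score(X_bic) * N

# Number of free parameters for full covariance matrices:
# means + covariance entries + (K - 1) cluster weights
n_params = K * d + K * d * (d + 1) // 2 + (K - 1)

manual_bic = -2 * log_L_hat + n_params * np.log(N)

print(manual_bic)
print(gmm_bic.bic(X_bic))
```

The two printed numbers should agree. Adding a cluster adds `d + d*(d+1)/2 + 1` parameters here, so the penalty term grows quickly as K increases.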



Let's try clustering the wine data and measuring the BIC scores.

In [6]:
wine = pd.read_csv("https://raw.githubusercontent.com/cmparlettpelleriti/CPSC392ParlettPelleriti/master/Data/wineLARGE.csv")

# drop rows with missing values and reset the row index
wine.dropna(inplace = True)
wine.reset_index(drop = True, inplace = True)  # drop=True avoids adding an extra "index" column

# grab data we want to cluster
feats = ["citric.acid", "residual.sugar"]

X = wine[feats].copy()  # copy so we can safely add a "clusters" column later

# standardize
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# create dictionary to store the BIC score for each k
metrics = {"bic": [], "k": []}

for i in range(2,20):
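    gmm = GaussianMixture(n_components=i)  # one model per candidate number of clusters
    labels = gmm.fit_predict(X_scaled)
    bic_score = gmm.bic(X_scaled)  # lower is better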

    metrics["bic"].append(bic_score)
    metrics["k"].append(i)

df = pd.DataFrame(metrics)


In [8]:
print(ggplot(df, aes(x = "k", y = "bic")) +
  geom_line() + theme_minimal() +
    labs(x = "K", y = "BIC Score",
         title = "BIC Scores for Different Ks"))

In [9]:
gmm = GaussianMixture(#)

labels = gmm.fit_predict(X_scaled)

# Add cluster labels to the original DataFrame
X["clusters"] = labels

# Plot the results. We are plotting each combo of predictors
print(ggplot(X, aes(x = "citric.acid", y = "residual.sugar", color = "factor(clusters)")) +
      geom_point() +
      theme_minimal() +
      scale_color_discrete(name = "Cluster") +
      labs(x = "Citric Acid",
           y = "Residual Sugar",
           title = "#TODO Cluster Solution"))

## ICA


Now, you're going to fit multiple clustering algorithms on each dataset below.

For each dataset:
- Make a ggplot of the data
- fit a K-Means Model
- fit a GMM
- make a ggplot with colored clusters for each model

Either choose k by making a plot and using your own judgment, or try out different values of k and see which works best according to the BIC.

See how well GMM and KM perform. Do both do well? Does one do better than the other? Do both do poorly?
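If it helps to see the overall workflow in one place, here is a rough sketch on synthetic data (not one of the classwork datasets). The stretched blobs, the column names `x` and `y`, the range of k values, and the names `demo`, `Z`, `bic_by_k`, and `best_k` are all assumptions for illustration, so adapt them to each dataset.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: three blobs, stretched by a linear transformation
X_raw, _ = make_blobs(n_samples=300, centers=3, random_state=0)
stretch = np.array([[0.6, -0.6], [-0.4, 0.8]])
demo = pd.DataFrame(X_raw @ stretch, columns=["x", "y"])

# Z-score the features before clustering
Z = StandardScaler().fit_transform(demo[["x", "y"]])

# Try several values of k and keep the one with the lowest BIC
bic_by_k = {
    k: GaussianMixture(n_components=k, random_state=0).fit(Z).bic(Z)
    for k in range(2, 7)
}
best_k = min(bic_by_k, key=bic_by_k.get)

# Fit both models with the same k and store the labels for plotting
demo["km"] = KMeans(n_clusters=best_k, n_init=10, random_state=0).fit_predict(Z)
demo["gmm"] = GaussianMixture(n_components=best_k, random_state=0).fit_predict(Z)

print(best_k)
print(demo.head())
```

From here, each label column can be mapped to color in a ggplot, for example with `color="factor(km)"` or `color="factor(gmm)"`, the same way the earlier cells plot `factor(clusters)`.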

### Very Distinct Clusters

In [None]:
d1 = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/GMM_Classwork_01.csv")



In [None]:
# K-Means


In [None]:
# GMM


TODO: Reflection

 Do both do well? Does one do better than the other? Do both do poorly?





### Cluster in Cluster

In [None]:
d2 = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/KMEM4.csv")



In [None]:
# K-Means


In [None]:
# GMM


TODO: Reflection

 Do both do well? Does one do better than the other? Do both do poorly?

### Oblong Clusters

In [None]:
d3 = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/GMM_Classwork_02.csv")


In [None]:
# K-Means


In [None]:
# GMM


TODO: Reflection

 Do both do well? Does one do better than the other? Do both do poorly?

### Clusters with different variances

In [None]:
d4 = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/GMM_Classwork_03.csv")


In [None]:
# K-Means


In [None]:
# GMM


TODO: Reflection

 Do both do well? Does one do better than the other? Do both do poorly?

### Uneven sized Clusters

In [None]:
d5 = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/GMM_Classwork_04.csv")


In [None]:
# K-Means


In [None]:
# GMM


TODO: Reflection

 Do both do well? Does one do better than the other? Do both do poorly?

TODO: Reflection

What cautions will you now take when doing K-Means? In other words, what issues did this classwork present that might change how you apply clustering to real data?

### Cluster Stability
You may have already noticed this, but K-Means and EM will often give different solutions each time they run. Run the following cells multiple times and notice how different (or not) the results are.

In [10]:
d6 = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/KMEM6.csv")

feat = ["x", "y"]
X = d6[feat]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

km = KMeans(n_clusters=9)
pred = km.fit_predict(X_scaled)

ggplot(d6, aes("x", "y", color = pred)) + geom_point() + theme_minimal() + theme(legend_position = "none")

In [11]:
d7 = pd.read_csv("https://raw.githubusercontent.com/katherinehansen2/CPSC392Hansen/refs/heads/main/data/KMEM5.csv")

feat = ["x", "y"]
X = d7[feat]

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

gmm = GaussianMixture(n_components=9)
pred = gmm.fit_predict(X_scaled)

ggplot(d7, aes("x", "y", color = pred)) + geom_point() + theme_minimal() + theme(legend_position = "none")

TODO: Reflection

What do you think could cause this instability?
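One way to put a number on the run-to-run differences is to refit the same model with different `random_state` values and compare the results. The sketch below uses made-up overlapping blobs (not the KMEM data), and the names `X_stab`, `runs`, `inertias`, and `ari` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic data (assumed for illustration): nine overlapping blobs
X_stab, _ = make_blobs(n_samples=500, centers=9, cluster_std=2.0, random_state=0)

# Fit K-Means five times, each with a single run and a different seed
runs = [
    KMeans(n_clusters=9, n_init=1, random_state=seed).fit(X_stab)
    for seed in range(5)
]

# Inertia (within-cluster sum of squares) for each run
inertias = [run.inertia_ for run in runs]
print(np.round(inertias, 1))

# Adjusted Rand index between two runs: 1 means identical groupings
ari = adjusted_rand_score(runs[0].labels_, runs[1].labels_)
print(ari)

# Fixing random_state makes a fit repeatable on the same data
labels_a = KMeans(n_clusters=9, n_init=10, random_state=42).fit_predict(X_stab)
labels_b = KMeans(n_clusters=9, n_init=10, random_state=42).fit_predict(X_stab)
print(np.array_equal(labels_a, labels_b))
```

The same idea works for `GaussianMixture`, which also accepts `random_state` and `n_init`; there you could compare the `lower_bound_` attribute across runs instead of inertia.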

## Final Thoughts

I hope this classwork doesn't scare you away from clustering. Clustering is an incredibly useful tool! However, it's not a perfect tool, and like all the other models we've learned, you have to be careful and thoughtful in how you apply it. Now that you've completed this classwork, you should have a much better idea of the cautions you may need to take.