# Week 3 Class 2
# In-Class Activity: Gaussian Mixture Models (GMM) with EM

**Pairs:** 2 students per group  



## Team Member Names


1.   Member 1
2.   Member 2



## Objectives
- Implement **K-Means** (hard clustering) and **GMM (EM)** (soft clustering) on the Wine dataset.
- Compare **ARI** and **silhouette** between the two methods.
- Interpret differences and explain EM succinctly.


## Imports
These are common packages you will need. You may add more if necessary.


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Plotting
plt.rcParams.update({'figure.figsize': (6, 4), 'axes.grid': True})


## Dataset Loading & Scaling
We load the Wine dataset and standardize each feature to zero mean and unit variance. **Labels are for evaluation only** and must not be passed to either clustering model.


In [None]:
# Data loading & scaling
wine = load_wine()
X = wine.data  # features
y = wine.target  # labels (evaluation only)
feature_names = list(wine.feature_names)

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

print(f"Samples: {X.shape[0]}, Features: {X.shape[1]}")
print("First 3 standardized rows:\n", np.round(X_scaled[:3], 3))
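A quick sanity check of the standardization step can be sketched as follows. This is a minimal illustration, not part of the graded work; the names `X_raw`, `X_std`, `raw_spread`, `col_means`, and `col_stds` are introduced here only for the example.

```python
import numpy as np
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

# Reload the data so this check is self-contained (names are illustrative).
X_raw = load_wine().data
X_std = StandardScaler().fit_transform(X_raw)

# Raw features sit on very different scales, so distance-based methods
# would be dominated by the features with the largest spread.
raw_spread = X_raw.std(axis=0)
print("Largest / smallest raw std:", round(raw_spread.max() / raw_spread.min(), 1))

# After standardization every column should have mean ~0 and std ~1.
col_means = X_std.mean(axis=0)
col_stds = X_std.std(axis=0)
print("Max |column mean|:", np.abs(col_means).max())
print("Column std range:", col_stds.min(), col_stds.max())
```

If the column means are not close to 0 or the standard deviations are not close to 1, the scaler was probably applied to the wrong array.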


## Visualization
Use two feature indices to visualize clusters in 2D. Edit `feat_i` and `feat_j` below if desired.


In [None]:
def plot_clusters_2d(X_scaled, labels, feature_names, feat_i=0, feat_j=1, title=""):
    """Scatter two standardized features, coloring each point by its cluster label."""
    if labels is None:
        raise ValueError("'labels' is None. Run your algorithm to generate labels first.")
    if not isinstance(feat_i, int) or not isinstance(feat_j, int):
        raise ValueError("Feature indices must be integers.")
    if feat_i == feat_j:
        raise ValueError("Choose two different feature indices for 2D plotting.")
    if feat_i < 0 or feat_j < 0 or feat_i >= X_scaled.shape[1] or feat_j >= X_scaled.shape[1]:
        raise ValueError("Feature index out of range.")

    x = X_scaled[:, feat_i]
    y2 = X_scaled[:, feat_j]
    plt.figure()
    plt.scatter(x, y2, c=labels, s=24)
    plt.xlabel(feature_names[feat_i])
    plt.ylabel(feature_names[feat_j])
    plt.title(title if title else f"Clusters on {feature_names[feat_i]} vs {feature_names[feat_j]}")
    plt.show()

# Choose features to visualize
feat_i = 0  # e.g., 'alcohol'
feat_j = 6  # e.g., 'flavanoids'
print(f"Using features: {feature_names[feat_i]} (index {feat_i}) vs {feature_names[feat_j]} (index {feat_j})")


---
## Step 1: Quick Summary
In 1–2 sentences, explain **why scaling matters** for this activity.


_Write your short summary here._

---
## Step 2: K-Means Baseline (k=3)
**Instructions:**
1. Fit **KMeans(n_clusters=3, random_state=0)** on `X_scaled`.
2. Save predicted labels as `kmeans_labels`.
3. Compute **ARI** against `y` and **silhouette_score** using `X_scaled` and `kmeans_labels`.
4. Print both metrics (3 decimals).
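The same fit, score, and print pattern can be sketched on synthetic data. The blobs, the explicit `n_init=10`, and the names `X_toy`, `y_toy`, `toy_labels`, `toy_ari`, and `toy_sil` are assumptions made for this illustration; your own cell should use `X_scaled` and `y` from the Wine dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Toy data with three well-separated groups (illustration only, not Wine).
X_toy, y_toy = make_blobs(
    n_samples=300,
    centers=[[-5, 0], [0, 5], [5, 0]],
    cluster_std=1.0,
    random_state=0,
)

# Fit K-Means and get one hard label per sample.
km = KMeans(n_clusters=3, n_init=10, random_state=0)
toy_labels = km.fit_predict(X_toy)

# ARI compares against the true labels; silhouette uses only the data and predictions.
toy_ari = adjusted_rand_score(y_toy, toy_labels)
toy_sil = silhouette_score(X_toy, toy_labels)
print(f"ARI: {toy_ari:.3f}, silhouette: {toy_sil:.3f}")
```

ARI is invariant to how the cluster ids are numbered, so there is no need to match cluster 0 to class 0 before scoring.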


In [None]:
# --- K-Means (Student) ---
# TODO: Implement K-Means, produce `kmeans_labels`, compute ARI and silhouette.

kmeans_labels = None  # replace with your labels


### 📊 Visualization: K-Means (Provided)
This cell will visualize your **K-Means** clustering using the two selected features.


In [None]:
# Visualize K-Means clusters
if kmeans_labels is None:
    raise ValueError("Run Step 2 first to set `kmeans_labels`.")
plot_clusters_2d(X_scaled, kmeans_labels, feature_names, feat_i=feat_i, feat_j=feat_j, title="K-Means Clusters")


---
## Step 3: Gaussian Mixture Model with EM (k=3), *Student implements*
**Reminder:**
- **E-step:** compute responsibilities (soft assignments).
- **M-step:** update means, covariances, and mixture weights; iterate until convergence.
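The two steps in the reminder can be sketched in plain NumPy. This is a simplified illustration under stated assumptions: two components, hand-picked starting means, a fixed 20 iterations instead of a convergence test, and a small ridge term `reg` on the covariances. The helper names `e_step` and `m_step` are hypothetical, and library implementations typically work in log space for numerical stability.

```python
import numpy as np
from scipy.stats import multivariate_normal

def e_step(X, weights, means, covs):
    """Responsibilities r[n, k]: posterior probability of component k for point n."""
    dens = np.column_stack([
        w * multivariate_normal.pdf(X, mean=m, cov=c)
        for w, m, c in zip(weights, means, covs)
    ])
    return dens / dens.sum(axis=1, keepdims=True)

def m_step(X, resp, reg=1e-6):
    """Re-estimate weights, means, and covariances from the responsibilities."""
    n, d = X.shape
    nk = resp.sum(axis=0)                    # effective number of points per component
    weights = nk / n
    means = (resp.T @ X) / nk[:, None]
    covs = []
    for k in range(resp.shape[1]):
        diff = X - means[k]
        cov = (resp[:, k, None] * diff).T @ diff / nk[k]
        covs.append(cov + reg * np.eye(d))   # ridge keeps the covariance invertible
    return weights, means, np.array(covs)

# Toy 2D data: two Gaussian blobs centered near (-3, -3) and (3, 3).
rng = np.random.default_rng(0)
X_demo = np.vstack([
    rng.normal(-3, 1, size=(100, 2)),
    rng.normal(3, 1, size=(100, 2)),
])

# Hand-picked starting parameters (an assumption for this sketch).
weights = np.array([0.5, 0.5])
means = np.array([[-1.0, -1.0], [1.0, 1.0]])
covs = np.array([np.eye(2), np.eye(2)])

# Alternate E and M steps for a fixed number of iterations.
for _ in range(20):
    resp = e_step(X_demo, weights, means, covs)
    weights, means, covs = m_step(X_demo, resp)

print("Estimated means:\n", np.round(means, 2))
print("Estimated weights:", np.round(weights, 2))
```

Each row of `resp` sums to 1, which is what makes the assignment soft: a point between two components contributes partially to both parameter updates.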

**Instructions:**
1. Fit **GaussianMixture(n_components=3, random_state=0)** on `X_scaled`.
2. Predict labels as `gmm_labels` (e.g., `gmm.predict(X_scaled)` or argmax over responsibilities).
3. Compute **ARI** and **silhouette** for GMM predictions.
4. (Optional) Inspect convergence info such as `gmm.n_iter_` and `gmm.lower_bound_`.
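How soft and hard assignments relate in scikit-learn can be sketched on synthetic data. The toy blobs and the names `X_toy`, `gmm_toy`, `resp`, and `hard` are assumptions for this illustration; your own cell should fit on `X_scaled`.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.mixture import GaussianMixture

# Toy data with three well-separated groups (illustration only, not Wine).
X_toy, _ = make_blobs(
    n_samples=300,
    centers=[[-5, 0], [0, 5], [5, 0]],
    cluster_std=1.0,
    random_state=0,
)

gmm_toy = GaussianMixture(n_components=3, random_state=0).fit(X_toy)

resp = gmm_toy.predict_proba(X_toy)  # soft assignments, shape (n_samples, 3)
hard = gmm_toy.predict(X_toy)        # hard labels: most probable component

print("Rows sum to 1:", np.allclose(resp.sum(axis=1), 1.0))
print("predict matches argmax:", np.array_equal(hard, resp.argmax(axis=1)))
print("Converged:", gmm_toy.converged_, "after", gmm_toy.n_iter_, "iterations")
print("Final lower bound:", gmm_toy.lower_bound_)
```

If `converged_` is `False`, the fit stopped at `max_iter` before reaching the tolerance, so the reported labels and metrics should be treated with caution.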


In [None]:
# GMM with EM
# TODO: Implement GMM with EM, produce `gmm_labels`, compute ARI and silhouette.

gmm_labels = None  # replace with your labels


### Visualization: GMM
This cell will visualize your **GMM** clustering using the two selected features.


In [None]:
if gmm_labels is None:
    raise ValueError("Run Step 3 first to set `gmm_labels`.")
plot_clusters_2d(X_scaled, gmm_labels, feature_names, feat_i=feat_i, feat_j=feat_j, title="GMM (EM) Clusters")


---
## Step 4: Compare & Interpret
Create a small table/dict with both models' metrics and write a brief comparison (4–6 sentences).


In [None]:
# Metric summary
# TODO: Summarize metrics for K-Means and GMM into a small structure and print.
# results = {
#     'kmeans': {'ari': kmeans_ari, 'silhouette': kmeans_sil},
#     'gmm': {'ari': gmm_ari, 'silhouette': gmm_sil},
# }
# results


_Write your comparison here._

---
## Step 5: EM in Your Own Words (Student)
In 2–3 sentences, explain how the **E-step** and **M-step** work inside GMM on this dataset.


_Write your explanation here._

---
## Step 6: Real-World Notes (Pair Work)
List 3–4 bullet points on when to prefer **GMM (EM)** over K-Means, and when K-Means might still be a good choice.


-
-
-
-


## Final Deliverables
Check that your notebook covers the following:
1. Printed **ARI** and **silhouette** for both K-Means and GMM.
2. A short comparison paragraph.
3. A concise explanation of EM in your own words.
4. Real-world bullet points.

