# Clustering the `penguins` dataset with *K-means*

**Goal:** The `penguins` dataset contains observations from **3 different species**. Use *K-means* to cluster the data points in their **original 4-dimensional space**, then visualize the cluster labels in 2D scatter plots. Interpret the results, and note any **strange patterns** that arise from viewing the clustering in only two dimensions.

In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

RAND = 42

## Dataset

In [None]:
X_scaled = pd.read_csv("../datasets/penguins/simple/X_scaled.csv", index_col=0, header=0)
y = pd.read_csv("../datasets/penguins/simple/y.csv", index_col=0, header=0)

In [None]:
# TODO: Do some exploratory data analysis
# ...
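
One possible starting point for the exploratory analysis is sketched below. It runs on a small synthetic stand-in so that it works on its own; the column names `flipper_length_mm` and `body_mass_g`, the placeholder species labels, and the helper `summarize_features` are assumptions for illustration and are not taken from the CSV files. With the real data, `X_scaled` and the species column of `y` would take their place.

```python
import numpy as np
import pandas as pd

def summarize_features(X, labels):
    """Return per-feature summary statistics and per-group feature means."""
    overall = X.describe().T[["mean", "std", "min", "max"]]
    by_group = X.groupby(np.asarray(labels)).mean()
    return overall, by_group

# Synthetic stand-in for the scaled feature matrix (hypothetical column names).
rng = np.random.default_rng(42)
columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
X_demo = pd.DataFrame(rng.normal(size=(90, 4)), columns=columns)
species_demo = np.repeat(["A", "B", "C"], 30)  # placeholder species labels

overall, by_group = summarize_features(X_demo, species_demo)
print(overall.round(2))        # spread of each feature
print(by_group.round(2))       # how feature means differ between groups
print(X_demo.corr().round(2))  # pairwise feature correlations
```

Checking the per-feature spread first is useful because K-means relies on Euclidean distances, so features on very different scales would dominate the clustering.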

## K-means model

In [None]:
from sklearn.cluster import KMeans

# TODO: Find clusters in the dataset with K-means
# TODO: experiment with hyperparameters

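# first, create the model
# n_clusters=3 follows from the three species named in the goal; max_iter=300 and
# n_init=10 are only starting values, so change them as part of the experiment
kmeans = KMeans(n_clusters=3, max_iter=300, n_init=10, random_state=RAND)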

# then, fit the data
kmeans.fit(X_scaled)

# then, access assigned cluster labels
clustering_result = kmeans.labels_
clustering_ids = np.unique(clustering_result)
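
The hyperparameter experiment could be organized along the lines of the sketch below. It uses synthetic 4-dimensional blobs from `make_blobs` as a stand-in for `X_scaled` and `y`, and the tried `n_init` values are arbitrary choices. For each setting it records `inertia_` (the within-cluster sum of squared distances) and `n_iter_`, then compares the final cluster ids with the known groups in a contingency table.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

RAND = 42

# Synthetic 4-D stand-in for X_scaled and y (three groups of points).
X_demo, y_demo = make_blobs(n_samples=300, centers=3, n_features=4, random_state=RAND)

rows = []
for n_init in (1, 5, 10):  # arbitrary values to compare
    model = KMeans(n_clusters=3, max_iter=300, n_init=n_init, random_state=RAND)
    model.fit(X_demo)
    rows.append({"n_init": n_init, "inertia": model.inertia_, "n_iter": model.n_iter_})

results = pd.DataFrame(rows)
print(results)

# Cluster ids are arbitrary integers, so compare them with the known groups
# through a contingency table rather than by direct equality.
table = pd.crosstab(
    pd.Series(y_demo, name="true group"),
    pd.Series(model.labels_, name="cluster"),
)
print(table)
```

A contingency table is used because cluster `0` need not correspond to the first species; only the pattern of counts in the table is meaningful.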

## Visualize the clustering in 2D

In [None]:
def visualize_kmeans_clustering(model: KMeans, X, f1='bill_length_mm', f2='bill_depth_mm', figsize=(6, 5)):
	"""Scatter-plot features f1 vs. f2 of X, colored by the cluster labels of a fitted model."""
	_clustering_result = model.labels_
	_clustering_ids = np.unique(_clustering_result)

	plt.figure(figsize=figsize)
	cmap = plt.cm.tab10
	colors = cmap(_clustering_result)

	F1 = X.loc[:, f1]
	F2 = X.loc[:, f2]

	plt.scatter(F1, F2, s=40, color=colors, alpha=.75)
	plt.title(f"Features {f1} vs. {f2}")
	plt.xlabel(f1)
	plt.ylabel(f2)

	# legend
	handles = [
		plt.Line2D([], [], marker='o', linestyle='', label=f'Cluster {i}', color=cmap(i))
		for i in _clustering_ids
	]
	plt.legend(handles=handles, loc='center left', bbox_to_anchor=(1.02, 0.90))
	plt.gca().set_aspect("equal")
	plt.show()
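
To look at every 2D view of the 4-dimensional clustering, the feature pairs can be enumerated with `itertools.combinations`. The sketch below only builds the list of pairs; the helper `feature_pairs` and the last two column names are assumptions for illustration.

```python
from itertools import combinations

def feature_pairs(columns):
    """Return every unordered pair of feature names, in column order."""
    return list(combinations(columns, 2))

# Hypothetical column names; with the real data use X_scaled.columns.
columns = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"]
pairs = feature_pairs(columns)

print(len(pairs))  # number of distinct 2-D views
for f1, f2 in pairs:
    print(f1, "vs.", f2)
```

In the notebook, each pair could then be passed on as `visualize_kmeans_clustering(kmeans, X_scaled, f1=f1, f2=f2)`.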

In [None]:
# TODO: Observe the clustering result in (pairwise) 2D plots
# TODO: What do you observe?
# TODO: K-means looks for roughly spherical clusters and assigns each sample to its nearest centroid.
#       So why do points of different colors mix in some plots without a clear separation?

visualize_kmeans_clustering(...)
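
One way to investigate the last question is to compare the nearest centroid in the full 4-dimensional space with the nearest centroid when only two features are kept. The sketch below does this on synthetic data whose groups are deliberately separated mainly in the third and fourth dimensions; the data, the helper `nearest_centroid`, and the choice of the first two columns as the "plotted" features are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans

def nearest_centroid(points, centroids):
    """Index of the closest centroid (Euclidean distance) for every point."""
    distances = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return distances.argmin(axis=1)

# Three synthetic groups that differ mostly in dimensions 3 and 4.
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0, 0.0, 0.0],
                    [0.5, 0.0, 4.0, 0.0],
                    [0.0, 0.5, 0.0, 4.0]])
X_demo = np.vstack([rng.normal(loc=c, scale=1.0, size=(100, 4)) for c in centers])

model = KMeans(n_clusters=3, n_init=10, random_state=42).fit(X_demo)
centroids = model.cluster_centers_

# Nearest centroid using all four features vs. only the two plotted ones.
labels_4d = nearest_centroid(X_demo, centroids)
labels_2d = nearest_centroid(X_demo[:, :2], centroids[:, :2])

mismatch = np.mean(labels_4d != labels_2d)
print(f"Share of points whose nearest centroid changes in the 2-D view: {mismatch:.2f}")
```

The cluster assignment uses distances in all four dimensions, while a scatter plot discards two of them, so points that are far apart in 4D can land next to each other in a 2D view.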