# Clustering Penguins using K-means

In this section, we'll explore clustering using the popular **penguins dataset**. The penguins dataset contains measurements for three penguin species: *Adelie*, *Chinstrap*, and *Gentoo*. We’ll use **K-means clustering** to group similar penguins together based on their physical measurements. This notebook demonstrates how unsupervised learning can help us identify natural groupings in data without using labels.

![image.png](https://www.gabemednick.com/post/penguin/featured_hu23a2ff6767279debab043a6c8f0a6157_878472_720x0_resize_lanczos_2.png)

### Objectives
1. **Data Preparation**: Load and clean the dataset, focusing on the relevant features.
2. **Standardization**: Standardize the features to improve clustering performance, as K-means is sensitive to the scale of the data.
3. **Applying K-means**: Perform K-means clustering and add the cluster labels to the dataset.
4. **Visualization with PCA**: Use **Principal Component Analysis (PCA)** to reduce the dimensions of the data for a clear 2D visualization.
5. **Comparison with Actual Species**: Visualize the clusters alongside the true species labels to assess clustering accuracy.

### Dataset Features
For clustering, we’ll use the following numerical features:
- **Bill Length (mm)**
- **Bill Depth (mm)**
- **Flipper Length (mm)**
- **Body Mass (g)**

These features will help us capture similarities and differences between individual penguins. Through clustering, we’ll see if K-means can naturally group these penguins in a way that aligns with their species.

### 1. Load the Libraries and the Dataset
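We'll load the data with `seaborn`, whose `load_dataset` function fetches the penguins dataset from seaborn's online example-data repository and returns it as a pandas DataFrame.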

In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib.patches as mpatches
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
import seaborn as sns

# Load the penguins dataset
df = sns.load_dataset("penguins")

# Drop rows with missing values
df = df.dropna()

### 2. Feature Selection
**Feature selection** is the process of identifying and keeping the most relevant variables, or "features," from a dataset before building a model. Focusing on the most informative features improves model performance, reduces computational cost, and limits overfitting by removing irrelevant or redundant data. Since K-means works on numerical data, which columns would make good features?

In [None]:
# Feature selection: We'll use only the numerical features for clustering
X = df[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]

# df already has no missing values, so this is a no-op safeguard;
# X must keep exactly the same rows as df so the cluster labels line up later
X = X.dropna()

### 3. Standardization
**Standardization** is a preprocessing technique that transforms each feature in a dataset so that it has a **mean of zero** and a **standard deviation of one**. This is particularly important in K-means clustering, as it ensures that all features contribute equally to the clustering process. K-means relies on Euclidean distance to assign points to clusters, so unstandardized features with larger numerical ranges (e.g., body mass in grams versus bill length in millimeters) could disproportionately influence the results. In mathematical terms, for each feature $x$ in the dataset, the standardized value $z$ for each data point is calculated as:
$$
z = \frac{x - \mu}{\sigma}
$$
Where:
- $\mu$ is the mean of the feature $x$,
- $\sigma$ is the standard deviation of the feature $x$.
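
To see why scale matters, here is a small sketch with three hypothetical penguins (the numbers are made up for illustration, not taken from the dataset). Penguin `b` differs from `a` mainly in bill length, while `c` differs mainly in body mass. On the raw values, the mass difference dominates the Euclidean distance simply because grams produce larger numbers than millimeters.

```python
import math

# Hypothetical penguins as (bill_length_mm, body_mass_g); values are illustrative only
a = (39.0, 3500.0)
b = (50.0, 3550.0)  # very different bill length, similar body mass
c = (39.5, 4500.0)  # similar bill length, very different body mass

# Euclidean distances on the raw, unstandardized values
d_ab = math.dist(a, b)
d_ac = math.dist(a, c)

print(f"distance a-b: {d_ab:.1f}")
print(f"distance a-c: {d_ac:.1f}")
```

The 11 mm gap in bill length barely registers next to the 1000 g gap in body mass, which is the imbalance that standardization removes.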

##### Step-by-Step Process
1. **Calculate the Mean** $\mu$ of each feature: This gives the average value of each feature across all data points.
2. **Calculate the Standard Deviation** $\sigma$ of each feature: This measures the spread or variability of the feature values around the mean.
3. **Apply the Transformation**: For each data point, subtract the feature's mean and divide by its standard deviation.

After standardization, each feature has a **mean of 0** and a **standard deviation of 1**, placing all features on a similar scale. This ensures that K-means clustering (or other algorithms that use distance measures) is not disproportionately influenced by features with larger numerical ranges, leading to more meaningful clusters that reflect the true relationships between data points.
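
The three steps above can be sketched by hand with NumPy. This is a minimal illustration on a hypothetical toy array (`toy` and its values are assumptions, not dataset rows). It uses the population standard deviation (`ddof=0`), which is also what `StandardScaler` uses.

```python
import numpy as np

# Hypothetical toy measurements: columns are bill length (mm) and body mass (g)
toy = np.array([
    [39.0, 3500.0],
    [45.0, 4200.0],
    [50.0, 5600.0],
    [42.0, 3900.0],
])

# Step 1: mean of each feature (column)
mu = toy.mean(axis=0)

# Step 2: population standard deviation of each feature (ddof=0 is NumPy's default)
sigma = toy.std(axis=0)

# Step 3: subtract the mean and divide by the standard deviation
z = (toy - mu) / sigma

# Each column of z now has mean ~0 and standard deviation ~1
print(z.mean(axis=0))
print(z.std(axis=0))
```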

##### How to Interpret:
- Values close to 0 indicate that the feature value is near the mean for that feature.
- Negative values indicate that the feature value is below the mean.
- Positive values indicate that the feature value is above the mean.

In [None]:
# Standardize the features to improve clustering performance
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

### 4. Apply K-means Clustering
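With the features standardized, applying K-means takes two steps: create a `KMeans` model with the desired number of clusters, then call `fit_predict` to fit the model and obtain a cluster label for each penguin.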

In [None]:
# Initialize a K-means model with n_clusters=3 because there are three species.
# random_state is fixed (42 is an arbitrary choice) so the results are reproducible.
kmeans = KMeans(n_clusters=3, random_state=42)

# Fit the model to the standardized features (X_scaled), which finds the centroids
# for the three clusters, then predict the cluster of each data point and
# store the assignments in a new 'cluster' column of the dataframe
df['cluster'] = kmeans.fit_predict(X_scaled)

# Examine the results
df
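
For intuition about what `fit_predict` does, here is a minimal NumPy sketch of the classic K-means loop (Lloyd's algorithm): assign each point to its nearest centroid, move each centroid to the mean of its points, and repeat until nothing changes. This is not scikit-learn's implementation, which by default uses the smarter k-means++ initialization. The function name `kmeans_sketch` and the toy data are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Minimal Lloyd's algorithm: alternate assignment and centroid updates."""
    rng = np.random.default_rng(seed)
    # Initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: Euclidean distance from every point to every centroid
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Update step: move each centroid to the mean of its assigned points
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        # Stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two tight, well-separated groups
X_toy = np.array([[0.0, 0.0], [0.1, 0.2], [0.2, 0.1],
                  [5.0, 5.0], [5.1, 4.9], [4.9, 5.2]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
```

Note that the cluster numbers themselves are arbitrary: which group ends up as cluster 0 depends on the random initialization.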

### 5. Visualize the Clusters using PCA (Principal Component Analysis) for Dimensionality Reduction 

**Principal Component Analysis (PCA)** is a technique for reducing the number of dimensions (features) in a dataset while preserving as much of the dataset's variability as possible. In the context of the penguin dataset, which contains features like `bill_length_mm`, `bill_depth_mm`, `flipper_length_mm`, and `body_mass_g`, PCA can help visualize patterns or clusters by reducing these multiple features into two or three dimensions.

##### Why Use PCA?

With multiple features, data visualization and interpretation become challenging, especially in higher-dimensional space. By reducing the dataset to two or three dimensions, PCA enables us to:
- Visualize the data in a more interpretable way.
- Identify patterns, clusters, or similarities between data points.
- Simplify the dataset for downstream analysis or clustering (e.g., K-means clustering).

##### How PCA Works

1. **Standardization**: PCA is sensitive to the scale of the data, so the features are standardized first, giving each one a mean of 0 and a standard deviation of 1. Note that scikit-learn's `PCA` only centers the data and does not rescale it, which is why we pass it the already standardized `X_scaled`.

2. **Covariance Matrix Calculation**: PCA calculates the covariance matrix, which captures how features vary together. Features with high covariance contain redundant information, so PCA seeks to combine them.

3. **Eigenvalues and Eigenvectors**: From the covariance matrix, PCA calculates eigenvalues and eigenvectors. 
   - **Eigenvectors** represent directions in the feature space (called "principal components") along which the data has the most variance.
   - **Eigenvalues** indicate the amount of variance along each principal component.

4. **Selecting Principal Components**: PCA ranks principal components by their eigenvalues, selecting the top components that capture the majority of the dataset's variability.

5. **Transforming the Data**: The original features are projected onto the selected principal components, effectively reducing the dataset's dimensionality.
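
The five steps above can be sketched directly in NumPy. This is an illustration on hypothetical random data (`X_demo` is an assumption, not the penguin features), not a replacement for scikit-learn's `PCA`, which uses a different numerical method and may flip the sign of a component.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 100 samples, 4 features; features 0 and 1 are strongly correlated
base = rng.normal(size=(100, 1))
X_demo = np.hstack([
    base + 0.1 * rng.normal(size=(100, 1)),
    2 * base + 0.1 * rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
    rng.normal(size=(100, 1)),
])

# 1. Standardize each feature
Z = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# 2. Covariance matrix of the standardized features (columns are variables)
cov = np.cov(Z, rowvar=False)

# 3. Eigenvalues and eigenvectors (eigh is for symmetric matrices, ascending order)
eigvals, eigvecs = np.linalg.eigh(cov)

# 4. Rank components by eigenvalue, largest first, and keep the top two
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
components = eigvecs[:, :2]

# 5. Project the data onto the selected principal components
Z_2d = Z @ components

# Share of the total variance captured by the two kept components
explained_ratio = eigvals[:2] / eigvals.sum()
print(Z_2d.shape)
print(explained_ratio)
```

In the notebook itself, the equivalent information is available after fitting as `pca.explained_variance_ratio_`.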

In [None]:
# Use PCA for dimensionality reduction to 2D
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

In [None]:
# Plot the PCA-transformed data with cluster labels
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['cluster'], cmap='viridis', s=25)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Penguin Clusters (PCA-Reduced Dimensions)')

# Create a legend for the clusters. K-means cluster IDs are arbitrary,
# so they are labeled by number rather than by species name.
cluster_ids = sorted(df['cluster'].unique())
cluster_colors = plt.cm.viridis([i / max(cluster_ids) for i in cluster_ids])
legend_patches = [mpatches.Patch(color=color, label=f'Cluster {i}') for color, i in zip(cluster_colors, cluster_ids)]
plt.legend(handles=legend_patches, title="Cluster")

plt.show()

In [None]:
# Take a look at the actual species
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=pd.factorize(df['species'])[0], cmap='viridis', s=25)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Penguin Species Distribution (PCA-Reduced Dimensions)')

# Create legend for species
unique_species = df['species'].unique()
species_colors = plt.cm.viridis([i / (len(unique_species) - 1) for i in range(len(unique_species))])
legend_patches = [mpatches.Patch(color=color, label=species) for color, species in zip(species_colors, unique_species)]
plt.legend(handles=legend_patches, title="Species")

plt.show()
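
Beyond comparing the two plots by eye, a cross-tabulation gives a numerical view of how clusters line up with species. The sketch below uses hypothetical labels purely for illustration (they are not results from this notebook). In the notebook, the same call would be `pd.crosstab(df['species'], df['cluster'])`.

```python
import pandas as pd

# Hypothetical labels for illustration only, not actual clustering results
species = pd.Series(
    ['Adelie', 'Adelie', 'Adelie', 'Gentoo', 'Gentoo', 'Chinstrap', 'Chinstrap'],
    name='species',
)
cluster = pd.Series([2, 2, 0, 1, 1, 0, 0], name='cluster')

# Rows are species, columns are cluster IDs, cells count the penguins in each pair
table = pd.crosstab(species, cluster)
print(table)
```

A clustering that matches the species well has one dominant cell per row. Because cluster IDs are arbitrary, what matters is that pattern and not which number each species receives. If a single summary score is wanted, `sklearn.metrics.adjusted_rand_score` is one option that is insensitive to how the clusters are numbered.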