# Clustering Pokémon using K-means Part 01

In this notebook, we’ll explore **K-means clustering** to group Pokémon based on their attributes. Clustering helps us discover natural groupings in the data without requiring predefined labels, allowing us to understand patterns within the Pokémon universe based on similar strengths and capabilities.

![image.png](https://images.lobbes.nl/images/landingspagina/00-bij-extra-tekst/pokemon-extra.jpg)

### Objectives

1. **Data Preparation**: Load and clean the Pokémon dataset, selecting relevant features such as HP, Attack, Defense, and Speed.
2. **Standardization**: Standardize the selected features to ensure that each attribute contributes equally to the clustering process.
3. **Finding Optimal Clusters**: Use the **Elbow Method** to determine the optimal number of clusters for the dataset, which helps in achieving meaningful groupings.
4. **Applying K-means Clustering**: Perform K-means clustering on the standardized data and assign each Pokémon to a cluster.
5. **Visualization with PCA**: Use **Principal Component Analysis (PCA)** to reduce the dataset to two dimensions, making it easier to visualize the clusters.
6. **Cluster Analysis**: Interpret the clusters by comparing the average attributes within each cluster, helping to identify unique groups or "types" of Pokémon based on their attributes.

### Dataset Features

We’ll focus on the following numerical features for clustering:
- **HP**: The health points of a Pokémon.
- **Attack**: Physical strength used to damage opponents.
- **Defense**: Ability to withstand physical attacks.
- **Sp. Atk**: Special attack strength for non-physical moves.
- **Sp. Def**: Defense against non-physical moves.
- **Speed**: How quickly a Pokémon can act in battles.

By clustering Pokémon based on these features, we can group those with similar traits, potentially uncovering roles like "high-defense Pokémon," "speedy attackers," or "balanced all-rounders."

### Steps Overview

1. **Elbow Method**: First, we’ll plot the **Elbow Curve** to determine the ideal number of clusters. The "elbow" point, where the rate of decrease in distortion slows, gives a good estimate for the optimal $k$.
2. **Applying K-means Clustering**: With the chosen $k$, we’ll assign each Pokémon to a cluster.
3. **Visualizing Clusters**: We’ll reduce the dataset dimensions with PCA to visualize the clusters in a 2D plot.
4. **Cluster Interpretation**: Finally, we’ll examine the average stats of Pokémon within each cluster to interpret the groupings.



### 1. Load the Libraries and the Dataset

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Pokémon dataset
df = pd.read_csv('pokemon.csv')

# Inspect the first few rows of the dataset
df.head()

### 2. Feature Selection and Standardization

In [None]:
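# Select the relevant features for clustering
features = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Drop rows with missing values in these features from df itself,
# so that df stays aligned row-for-row with X when cluster labels are assigned later
df = df.dropna(subset=features).reset_index(drop=True)
X = df[features]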

In [None]:
# Standardize the features (mean 0, standard deviation 1) so each stat contributes equally to the distances
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
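
To make the effect of standardization concrete, here is a minimal sketch on a made-up toy array (the values are illustrative and not taken from the Pokémon dataset). It checks that `StandardScaler` matches a manual z-score computed with the population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stats (made-up values, not from the Pokémon dataset)
toy = np.array([
    [45.0, 49.0],
    [60.0, 62.0],
    [80.0, 82.0],
    [255.0, 10.0],
])

toy_scaled = StandardScaler().fit_transform(toy)

# Manual z-score using the population standard deviation (ddof=0),
# which is what StandardScaler uses
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

print(np.allclose(toy_scaled, manual))
print(toy_scaled.mean(axis=0).round(6), toy_scaled.std(axis=0).round(6))
```

After scaling, a large-range column such as the first one no longer dominates the Euclidean distances that K-means relies on.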

### 3. Determine Optimal Number of Clusters
While it was easy to determine the number of clusters for the Penguins dataset, this dataset is trickier: how many clusters do we need? To answer this, we can use the Elbow Technique, a method for estimating a suitable number of clusters (denoted $k$) in K-means clustering. Choosing the right $k$ is essential for producing meaningful and interpretable clusters: too few clusters merge distinct groups, while too many split natural groups apart. The Elbow Technique helps by showing how clustering quality changes as $k$ increases.

##### How the Elbow Technique Works

1. **Calculate Inertia (Distortion) for Each $k$**: 
   - Inertia, or **within-cluster sum of squares (WCSS)**, measures how close the data points in a cluster are to the cluster’s center (centroid). It is calculated as the sum of squared distances from each point to its assigned centroid.
   - As the number of clusters increases, inertia generally decreases, as data points are closer to their centroids in smaller clusters.

2. **Plot Inertia Against $k$**:
   - Compute inertia for a range of $k$ values (e.g., 1 to 10).
   - Plot $k$ on the x-axis and the inertia on the y-axis.

3. **Identify the "Elbow" Point**:
   - Look for the point on the plot where the inertia decreases sharply and then starts to level off. This point often resembles an "elbow" shape.
   - The idea is that after this point, adding more clusters provides diminishing returns in reducing inertia, indicating that increasing $k$ further doesn’t lead to significantly better clustering.

4. **Choose $k$ at the Elbow**:
   - The optimal number of clusters is generally chosen at the elbow point, as this $k$ provides a good balance between minimizing inertia and keeping a manageable number of clusters.
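
The definition of inertia in step 1 can be checked by hand. The sketch below uses synthetic data from `make_blobs` as a stand-in for the scaled Pokémon features (an assumption for illustration only) and compares a manual sum of squared distances with scikit-learn's `inertia_` attribute.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data standing in for the scaled Pokémon features
X_demo, _ = make_blobs(n_samples=300, centers=4, n_features=2, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X_demo)

# Look up the centroid assigned to each point, then sum the squared distances
assigned_centroids = km.cluster_centers_[km.labels_]
manual_inertia = float(((X_demo - assigned_centroids) ** 2).sum())

print(manual_inertia, km.inertia_)
```

The two numbers should agree up to floating-point error, which confirms that the elbow plot is simply this quantity tracked across values of $k$.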

##### Things to Consider
The elbow is usually read off the graph by eye, so different readers may settle on different values of $k$, and any automated rule is only a heuristic. [Schubert (2022)](https://arxiv.org/abs/2212.12189) even argues that we should stop using the elbow criterion for k-means altogether. The paper discusses better alternatives, such as the variance-ratio criterion (VRC) of Calinski and Harabasz, the Bayesian Information Criterion (BIC), or the gap statistic, which should be preferred instead. However, the Elbow Technique remains one of the most widely used methods for choosing the number of clusters and, in my experience, performs well enough.
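
As a rough illustration of one of these alternatives, the sketch below scores several values of $k$ with the variance-ratio criterion, using `calinski_harabasz_score` from scikit-learn. The data is synthetic (three well-separated blobs chosen for this example), so this only shows the mechanics and not the result on the Pokémon dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import calinski_harabasz_score

# Three well-separated synthetic blobs (illustrative data only)
X_demo, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

vrc_scores = {}
for k in range(2, 8):  # the score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    vrc_scores[k] = calinski_harabasz_score(X_demo, labels)

# Higher is better: pick the k with the largest score
best_k = max(vrc_scores, key=vrc_scores.get)
print(vrc_scores)
print("k with the highest VRC:", best_k)
```

Unlike the elbow plot, this criterion yields a maximum rather than a bend, which removes some of the subjectivity when choosing $k$.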




In [None]:
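# Fit K-means for k = 1..10 and record the inertia (distortion) of each fit
distortions = []
K = range(1, 11)
for k in K:
    # A fixed random_state makes the curve reproducible; n_init=10 keeps the best of 10 initializations
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    distortions.append(kmeans.inertia_)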

In [None]:
# Plot the elbow graph
plt.figure(figsize=(8, 5))
plt.plot(K, distortions, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Distortion')
plt.title('Elbow Method for Optimal k')
plt.show()

In [None]:
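# Rough heuristic: pick the k where the curve bends most sharply (largest second difference); confirm against the plot
second_diff = [distortions[i - 1] - 2 * distortions[i] + distortions[i + 1] for i in range(1, len(distortions) - 1)]
optimal_k = K[second_diff.index(max(second_diff)) + 1]
print("Estimated elbow (k):", optimal_k)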

### 4. Apply K-means Clustering

In [None]:
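# Initialize the model with the chosen number of clusters (k = 4 here; adjust if your elbow plot suggests otherwise)
kmeans = KMeans(n_clusters=4, n_init=10, random_state=42)

# Fit the model and assign each Pokémon to a cluster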
df['cluster'] = kmeans.fit_predict(X_scaled)
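
To see what the cluster assignment means, the following sketch (on synthetic blobs, an assumption for illustration) checks that each label returned by `fit_predict` is simply the index of the nearest centroid in Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Well-separated synthetic blobs standing in for the scaled features
X_demo, _ = make_blobs(
    n_samples=200,
    centers=[[0, 0], [8, 8], [-8, 8]],
    cluster_std=1.0,
    random_state=1,
)

km = KMeans(n_clusters=3, n_init=10, random_state=1)
labels = km.fit_predict(X_demo)

# Distance from every point to every centroid: shape (n_samples, n_clusters)
dists = np.linalg.norm(X_demo[:, None, :] - km.cluster_centers_[None, :, :], axis=2)
nearest = dists.argmin(axis=1)

print(np.array_equal(labels, nearest))
```

Note that the label numbers themselves are arbitrary: cluster 2 in one run may correspond to cluster 0 in another unless `random_state` is fixed.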

### 5. Visualize the Clusters using PCA (Principal Component Analysis) for Dimensionality Reduction 

In [None]:
# Reducing dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

In [None]:

# Plotting the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['cluster'], cmap='viridis', s=10)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Pokémon Clusters (PCA-Reduced Dimensions)')
plt.show()
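
A 2D PCA plot only shows part of the picture, so it helps to check how much variance the two components retain. The sketch below does this on synthetic correlated features (a stand-in for the six stats, generated here purely for illustration) using the `explained_variance_ratio_` attribute.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Six correlated synthetic features standing in for the six stats
base = rng.normal(size=(500, 2))
derived = base @ rng.normal(size=(2, 4)) + 0.5 * rng.normal(size=(500, 4))
X_demo = np.hstack([base, derived])

X_demo_scaled = StandardScaler().fit_transform(X_demo)
pca_demo = PCA(n_components=2).fit(X_demo_scaled)

# Fraction of the total variance captured by each component
ratios = pca_demo.explained_variance_ratio_
print(ratios, ratios.sum())
```

If the two components capture only a small share of the variance, clusters that overlap in the scatter plot may still be well separated in the full feature space.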

### 6. Cluster Analysis
Finally, we can examine the clusters by grouping the Pokémon by their assigned cluster labels and looking for commonalities.

In [None]:
# Explore a cluster
df[df['cluster'] == 2]

In [None]:
# Display cluster statistics to interpret the results
cluster_summary = df.groupby('cluster')[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].mean()
cluster_summary
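
The per-cluster means are closely related to the K-means centroids. As a minimal sketch on a made-up table (the column names mirror two of the stats, the values are invented), the code below maps the centroids back to the original units with `inverse_transform` and compares them with the grouped means. It also prints the cluster sizes, which are worth checking before interpreting any averages.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Small made-up table; column names mirror two of the stats used above
toy = pd.DataFrame({
    'Attack': [30, 35, 40, 110, 120, 130],
    'Speed': [90, 95, 100, 20, 25, 30],
})

scaler_demo = StandardScaler()
toy_scaled = scaler_demo.fit_transform(toy)

km = KMeans(n_clusters=2, n_init=10, random_state=0)
toy['cluster'] = km.fit_predict(toy_scaled)

# Centroids live in standardized space; map them back to the original units
centroids = pd.DataFrame(
    scaler_demo.inverse_transform(km.cluster_centers_),
    columns=['Attack', 'Speed'],
)
means = toy.groupby('cluster')[['Attack', 'Speed']].mean()
sizes = toy['cluster'].value_counts().sort_index()

print(centroids, means, sizes, sep='\n\n')
```

Once K-means has converged, the back-transformed centroids match the per-cluster means, so either view can be used to describe a cluster's typical stats.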