# Clustering Pokémon using K-means Part 02

But what about categorical data? In this notebook, we’ll further explore **K-means clustering** to group Pokémon based on both their numerical attributes (e.g., HP, Attack, Speed) and their categorical **types** (e.g., Fire, Water, Electric). 

![image.png](https://images.lobbes.nl/images/landingspagina/00-bij-extra-tekst/pokemon-extra.jpg)

### Objectives

1. **One-Hot Encoding**: Use one-hot encoding to transform the categorical `Type 1` and `Type 2` columns into binary features, so that K-means can treat each type as its own feature.

#### Categorical Features (Pokémon Types)
- **Type 1**: Primary type of each Pokémon (e.g., Water, Fire, Electric).
- **Type 2**: Secondary type, where applicable (e.g., Flying for Charizard or Poison for Bulbasaur).

Incorporating these categorical type features with numerical attributes allows us to group Pokémon in a more nuanced way. By clustering based on both types and stats, we may discover groupings like "high-defense Water Pokémon" or "speedy Electric Pokémon."

### 1. Load the Libraries and the Dataset

In [None]:
# Import necessary libraries
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Load the Pokémon dataset
df = pd.read_csv('pokemon.csv')

# Inspect the first few rows of the dataset
df.head()

### 2. Numerical Feature Selection

In [None]:
# Numerical stats used for clustering
stat_columns = ['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']

# Drop rows with missing stats from df itself, so the stats, the type columns,
# and the cluster labels assigned later all stay row-aligned
df = df.dropna(subset=stat_columns).reset_index(drop=True)
numerical_features = df[stat_columns]

### 3. Categorical Feature Selection with One-Hot Encoding

**One-Hot Encoding** is a technique used to convert categorical data (text or labels) into a numerical format that can be utilized by machine learning models. It transforms each unique category in a column into a separate binary feature, where each row is marked with a `1` or `0` to indicate the presence or absence of that category.

For example, suppose the Pokémon data has a `Type 1` column with the categories *Water*, *Fire*, and *Grass*. One-hot encoding converts these into separate binary columns: `type_Water`, `type_Fire`, and `type_Grass`. Each Pokémon then has a `1` in the column for its type and `0`s in the others.

##### How One-Hot Encoding Works

1. **Identify Unique Categories**: One-hot encoding first identifies all unique values (categories) in a categorical column.
2. **Create Binary Columns**: Each unique category becomes a new binary column, with a `1` or `0` to indicate if that category is present in each row.
3. **Drop the Original Column**: Once the categorical data is encoded, the original categorical column is typically removed to avoid redundancy.
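
The three steps above can be sketched by hand with plain pandas. This is only an illustration on a made-up four-row frame (the `toy` DataFrame and its values are assumptions, not the real Pokémon dataset); it is meant to show the mechanics, not to replace a library encoder.

```python
import pandas as pd

# Hypothetical toy data, not the real Pokémon dataset
toy = pd.DataFrame({"Type 1": ["Water", "Fire", "Grass", "Water"]})

# Step 1: identify the unique categories (in order of first appearance)
categories = toy["Type 1"].unique()

# Step 2: create one binary column per category
for category in categories:
    toy[f"type_{category}"] = (toy["Type 1"] == category).astype(int)

# Step 3: drop the original categorical column
encoded = toy.drop(columns="Type 1")
print(encoded)
```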

##### Example of One-Hot Encoding

For a `Type 1` column with values:

| Type 1 |
|--------|
| Water  |
| Fire   |
| Grass  |
| Water  |

The one-hot encoded version would look like this:

| type_Water | type_Fire | type_Grass |
|------------|-----------|------------|
| 1          | 0         | 0          |
| 0          | 1         | 0          |
| 0          | 0         | 1          |
| 1          | 0         | 0          |
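
The table can be reproduced with `pd.get_dummies`, the pandas helper this notebook relies on. The sketch below uses the same four made-up rows; note that pandas orders the resulting columns alphabetically, so they appear as Fire, Grass, Water rather than in the order shown in the table. Passing `dtype=int` is a choice made here to get `0`/`1` rather than booleans.

```python
import pandas as pd

# Hypothetical toy column matching the example table
toy = pd.DataFrame({"Type 1": ["Water", "Fire", "Grass", "Water"]})

# One call performs the encoding; prefix="type" names the columns type_<category>
encoded = pd.get_dummies(toy["Type 1"], prefix="type", dtype=int)
print(encoded)
```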


In [None]:
# One-hot encode the 'Type 1' and 'Type 2' columns as 0/1 integers
type_1_dummies = pd.get_dummies(df['Type 1'], prefix='type', dtype=int)
type_2_dummies = pd.get_dummies(df['Type 2'], prefix='type', dtype=int)

# Combine the two encodings; fill_value=0 covers any type that appears in only
# one of the two columns (a plain `+` would produce NaN for it)
types_combined = type_1_dummies.add(type_2_dummies, fill_value=0)

# Cap values at 1 to keep the encoding binary
types_combined = types_combined.clip(upper=1).astype(int)

# Merge the combined types with the numerical features, aligned on the row index
df_features = numerical_features.join(types_combined)
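
To see what combining the two encodings does for a dual-type Pokémon, here is a small sketch on three made-up rows (the `toy` frame is an assumption for illustration). It also shows a pitfall: when the two type columns do not contain exactly the same set of types, adding the encoded frames directly leaves `NaN` in the unmatched columns, which `add(..., fill_value=0)` avoids.

```python
import pandas as pd

# Hypothetical rows; None marks a Pokémon without a secondary type
toy = pd.DataFrame({
    "Name": ["Charizard", "Bulbasaur", "Pikachu"],
    "Type 1": ["Fire", "Grass", "Electric"],
    "Type 2": ["Flying", "Poison", None],
})

type_1 = pd.get_dummies(toy["Type 1"], prefix="type", dtype=int)
type_2 = pd.get_dummies(toy["Type 2"], prefix="type", dtype=int)

# The two frames have different columns here, so a plain `+` would yield NaN;
# add(..., fill_value=0) treats a missing column as all zeros instead
combined = type_1.add(type_2, fill_value=0).clip(upper=1).astype(int)
combined.insert(0, "Name", toy["Name"])
print(combined)
```

Each row ends up with a `1` for every type the Pokémon has, so a dual-type Pokémon has two `1`s and a single-type Pokémon has one.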



In [None]:
# Standardize every feature to mean 0 and unit variance. K-means is distance-based,
# so unscaled stats would otherwise dominate the 0/1 type columns
scaler = StandardScaler()
X_scaled = scaler.fit_transform(df_features)
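
As a rough illustration of what the scaling step does, the sketch below compares `StandardScaler` with the same calculation written by hand: subtract each column's mean and divide by its population standard deviation. The small array is made up for this example, with one stat-like column and one binary type-like column.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical values: one stat-like column and one binary type column
X = np.array([[45.0, 1.0], [60.0, 0.0], [80.0, 0.0], [120.0, 1.0]])

X_scaled = StandardScaler().fit_transform(X)

# Same result by hand: subtract the column mean, divide by the population std
X_manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(X_scaled, X_manual))
print(X_scaled.mean(axis=0).round(6), X_scaled.std(axis=0).round(6))
```

After scaling, both columns sit on a comparable scale, so neither one dominates the Euclidean distances that K-means uses.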

### 4. Determine Optimal Number of Clusters


In [None]:
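# Fit K-means for k = 1..10 and record the inertia (within-cluster sum of squares)
distortions = []
K = range(1, 11)
for k in K:
    # A fixed random_state makes the curve reproducible; n_init=10 reruns guard against poor initializations
    kmeans = KMeans(n_clusters=k, n_init=10, random_state=42)
    kmeans.fit(X_scaled)
    distortions.append(kmeans.inertia_)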

In [None]:
# Plot the elbow graph
plt.figure(figsize=(8, 5))
plt.plot(K, distortions, marker='o')
plt.xlabel('Number of clusters (k)')
plt.ylabel('Distortion')
plt.title('Elbow Method for Optimal k')
plt.show()
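
The quantity plotted on the y-axis is scikit-learn's `inertia_`: the sum of squared distances from each point to the centroid of its assigned cluster. The sketch below recomputes it by hand on six made-up 2-D points (an assumption for illustration, not the Pokémon features) to show what the elbow curve actually measures.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical 2-D points forming two loose groups
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [9.0, 8.5], [8.5, 9.5]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# Look up, for every point, the centroid of the cluster it was assigned to
assigned_centroids = kmeans.cluster_centers_[kmeans.labels_]

# Sum of squared distances between points and their centroids
manual_inertia = ((X - assigned_centroids) ** 2).sum()

print(manual_inertia, kmeans.inertia_)
```

Inertia always decreases as k grows, which is why the elbow method looks for the point where the decrease flattens out rather than for the minimum.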

In [None]:
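# Heuristic elbow estimate: the k where the curve bends most sharply (largest second difference); confirm it against the plot
second_diff = [distortions[i - 1] - 2 * distortions[i] + distortions[i + 1] for i in range(1, len(distortions) - 1)]
optimal_k = K[second_diff.index(max(second_diff)) + 1]
print("Estimated elbow (k):", optimal_k)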

### 5. Apply K-means Clustering

In [None]:
# Initialize the model with k = 5; adjust this to the elbow seen in the plot above
kmeans = KMeans(n_clusters=5, n_init=10, random_state=42)

# Fit the model and assign each Pokémon its cluster label
df['cluster'] = kmeans.fit_predict(X_scaled)

### 6. Visualize the Clusters using PCA (Principal Component Analysis) for Dimensionality Reduction 

In [None]:
# Reducing dimensions with PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)

In [None]:
# Plotting the clusters
plt.figure(figsize=(10, 6))
plt.scatter(X_pca[:, 0], X_pca[:, 1], c=df['cluster'], cmap='viridis', s=10)
plt.xlabel('PCA Component 1')
plt.ylabel('PCA Component 2')
plt.title('Pokémon Clusters (PCA-Reduced Dimensions)')
plt.show()
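
A 2-D PCA plot shows only part of the structure in the data, and with six stats plus many type columns the two components may capture a modest share of the total variance. One way to check is `explained_variance_ratio_`. The sketch below uses randomly generated data (an assumption for illustration) rather than the Pokémon features.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical data: 100 samples, 5 features, two of them strongly correlated
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = X[:, 0] + 0.1 * rng.normal(size=100)

pca = PCA(n_components=2).fit(X)

# Fraction of the total variance captured by each of the two components
print(pca.explained_variance_ratio_)
print("Total variance kept:", pca.explained_variance_ratio_.sum())
```

If the total is low, clusters that overlap in the scatter plot may still be well separated in the full feature space.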

### 7. Cluster Analysis
Finally, we can examine the clusters by grouping the Pokémon by their assigned cluster labels and looking for commonalities.

In [None]:
# Explore a cluster
df[df['cluster'] == 1].head(10)

In [None]:
# Display cluster statistics to interpret the results
cluster_summary = df.groupby('cluster')[['HP', 'Attack', 'Defense', 'Sp. Atk', 'Sp. Def', 'Speed']].mean()
cluster_summary
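
Because the clustering also used the type columns, it can help to look at which types dominate each cluster, not only the mean stats. The sketch below does this with `pd.crosstab` on a made-up frame; the cluster labels in it are invented for illustration and are not results from the Pokémon dataset.

```python
import pandas as pd

# Hypothetical cluster assignments, for illustration only
toy = pd.DataFrame({
    "Type 1": ["Water", "Water", "Fire", "Electric", "Fire", "Water"],
    "cluster": [0, 0, 1, 2, 1, 2],
})

# Count how many Pokémon of each primary type fall into each cluster
type_counts = pd.crosstab(toy["cluster"], toy["Type 1"])
print(type_counts)

# Most common primary type per cluster (ties resolve to the first column)
print(type_counts.idxmax(axis=1))
```

In this notebook, the equivalent call would presumably be `pd.crosstab(df['cluster'], df['Type 1'])`, assuming the `cluster` column has been assigned as in section 5.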