# K-means in Python

This notebook walks you through implementing the k-means algorithm using the `scikit-learn` library.

Start by importing the `pandas`, `numpy`, `seaborn`, and `matplotlib` libraries as before, and then import the necessary classes from the `sklearn` library.

In [None]:
import warnings
warnings.simplefilter(action='ignore', category=FutureWarning)

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

Let's load up the same penguin data, but this time we'll remove any categorical variables from the data set so we can see how well the k-means algorithm can cluster our data without knowing the species information.

In [None]:
# Load the data
penguins = pd.read_csv('data/penguins_unlabeled.csv')

Next, we'll clean up the data a bit by removing rows with missing values and selecting only the columns that hold length and mass measurements.

In [None]:
# Preprocess the data
penguins = penguins.dropna()  # Remove rows with missing values

# Select features for clustering
X = penguins[['bill_length_mm', 'bill_depth_mm', 'flipper_length_mm', 'body_mass_g']]

Then, these columns need to be scaled so that they're all on a similar scale. `sklearn` comes with a standard scaler, which converts each value in a column to its corresponding z-score.

In [None]:
# Normalize the features
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
X_scaled
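
To make the z-score conversion concrete, the sketch below computes it by hand and compares the result with `StandardScaler`. The array `X_demo` is made-up data for illustration, not the penguin measurements. It assumes the scaler's default settings, under which each column is centered on its mean and divided by its population standard deviation.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements (not the penguin data): bill length (mm) and body mass (g)
X_demo = np.array([[39.1, 3750.0],
                   [46.5, 4200.0],
                   [50.0, 5700.0],
                   [42.3, 3400.0]])

# z-score by hand: subtract the column mean, divide by the column standard deviation
# (np.std uses ddof=0 by default, i.e. the population standard deviation)
z_manual = (X_demo - X_demo.mean(axis=0)) / X_demo.std(axis=0)

# The same conversion done by the scaler
z_sklearn = StandardScaler().fit_transform(X_demo)

print(np.allclose(z_manual, z_sklearn))
```

After scaling, every column has mean 0 and standard deviation 1, so millimetres and grams contribute on equal footing to the distances k-means computes.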

In [None]:
# Apply k-means clustering (random_state is an arbitrary seed for reproducible results)
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=42)
penguins['cluster'] = kmeans.fit_predict(X_scaled)
penguins
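
For a closer look at what `fit_predict` returns, here is a minimal sketch on synthetic data. The blobs generated by `make_blobs`, the `_demo` variable names, and the `n_init=10` and `random_state=0` settings are assumptions for illustration and are not part of the penguin workflow.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data: 150 points with 4 features, drawn around 3 centers
X_demo, _ = make_blobs(n_samples=150, n_features=4, centers=3, random_state=0)

# random_state fixes the initialization so repeated runs give the same result
kmeans_demo = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0)
labels_demo = kmeans_demo.fit_predict(X_demo)

print(labels_demo[:10])                    # one cluster index (0, 1, or 2) per row
print(kmeans_demo.cluster_centers_.shape)  # (3, 4): one center per cluster
print(kmeans_demo.inertia_)                # sum of squared distances to the closest center
```

The cluster indices are arbitrary identifiers. Without a fixed `random_state`, the numbering can change between runs even when the grouping itself is the same.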

In [None]:
# Visualize the results (using the first two features)
sns.scatterplot(data=penguins, x='bill_length_mm', y='bill_depth_mm', hue='cluster', palette="viridis")

# Add labels
plt.xlabel('Bill Length (mm)')
plt.ylabel('Bill Depth (mm)')
plt.title('K-means Clustering of Penguins')
plt.show()

In [None]:
# Print cluster centers, converted from z-scores back to the original units
cluster_centers = scaler.inverse_transform(kmeans.cluster_centers_)
cluster_centers_df = pd.DataFrame(cluster_centers, columns=X.columns)
print("Cluster Centers:")
print(cluster_centers_df)
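
The centers that k-means finds live in z-score space, which is why the cell above passes them through `inverse_transform`. As a rough illustration, the sketch below checks that this step amounts to multiplying by each column's standard deviation and adding back its mean. `X_demo` and `centers_z` are made-up values, not results from the penguin data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up measurements: bill length (mm) and body mass (g)
X_demo = np.array([[39.1, 3750.0],
                   [46.5, 4200.0],
                   [50.0, 5700.0],
                   [42.3, 3400.0]])
scaler_demo = StandardScaler().fit(X_demo)

# Hypothetical cluster centers expressed as z-scores
centers_z = np.array([[-1.0, -0.5],
                      [0.0, 0.0],
                      [1.2, 1.5]])

# inverse_transform maps z-scores back to the original units
centers_original = scaler_demo.inverse_transform(centers_z)

# The same mapping by hand: z * standard deviation + mean, column by column
centers_manual = centers_z * scaler_demo.scale_ + scaler_demo.mean_

print(np.allclose(centers_original, centers_manual))
```

A center at z-score 0 in every column maps back to the column means, which gives a quick sanity check when reading the table of centers.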

In [None]:
# Print the number of data points in each cluster
print("\nCluster Sizes:")
print(penguins['cluster'].value_counts().sort_index())