# K-Means Clustering: Grouping Data Without Labels

Welcome to the ninth notebook in our **Machine Learning Basics for Beginners** series! After learning about Support Vector Machines for classification, let's explore **K-Means Clustering**, a popular unsupervised learning algorithm used to group data points into clusters without predefined labels.

**What You'll Learn in This Notebook:**
- What K-Means Clustering is and when to use it.
- How K-Means works in simple terms.
- A hands-on example of grouping customers based on shopping behavior.
- An interactive exercise to adjust the number of clusters and observe changes.
- Visualizations to understand how data points are grouped into clusters.

Let's get started!

## 1. What is K-Means Clustering?

**K-Means Clustering** is an unsupervised learning algorithm that groups similar data points into a specified number of clusters (K) without using predefined labels. It’s used to find natural patterns or structures in data.

- **Goal**: Partition data into K clusters where each data point belongs to the cluster with the nearest center (centroid), minimizing the total squared distance between points and the centroid of their cluster.
- **When to Use It**: Use K-Means when you have unlabeled data and want to discover inherent groupings, such as segmenting customers, organizing documents, or identifying patterns in images.
- **Examples**:
  - Grouping customers into similar buying behavior segments for targeted marketing.
  - Organizing articles or posts into topics based on content similarity.
  - Compressing image colors by grouping similar pixel colors into clusters.
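
For readers who like formulas, the goal above is commonly written as the within-cluster sum of squares. The notation here is introduced only for illustration: $C_k$ is the set of points assigned to cluster $k$, and $\mu_k$ is the centroid (mean) of that cluster.

$$
J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \lVert x_i - \mu_k \rVert^2
$$

A smaller $J$ means points sit closer to their own centroid. K-Means tries to reduce this quantity, although it is not guaranteed to find the smallest possible value.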

**Analogy**: Imagine you’re organizing a pile of mixed fruits on a table into groups without knowing their types. You decide to make, say, 3 groups (K=3) and start placing fruits close to each other based on size or color. Over time, you adjust the groups so that similar fruits are together. K-Means does this by grouping data points based on their similarity.

## 2. How Does K-Means Clustering Work?

K-Means Clustering might seem tricky at first, but it’s based on a straightforward iterative process. Here’s how it works step by step:

1. **Choose K**: Decide the number of clusters (K) you want to form. This is a user-defined parameter and often requires some experimentation.
2. **Initialize Centroids**: Randomly place K points (called centroids) in the data space as the initial centers of the clusters.
3. **Assign Points to Clusters**: For each data point, calculate the distance to all centroids and assign the point to the cluster with the closest centroid.
4. **Update Centroids**: Recalculate the new center (centroid) of each cluster by taking the average of all points assigned to that cluster.
5. **Repeat**: Repeat steps 3 and 4 until the centroids no longer move significantly (convergence) or a maximum number of iterations is reached.
6. **Result**: The data points are grouped into K clusters, and each cluster has a centroid representing its center.
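
The steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not how `scikit-learn` implements K-Means: the function name `kmeans_sketch` and its parameters (`max_iters`, `tol`, `seed`) are made up for this example, and it assumes the initial centroids are chosen by picking K of the data points at random, which is one common way to do step 2.

```python
import numpy as np

def kmeans_sketch(X, k, max_iters=100, tol=1e-6, seed=0):
    """Minimal K-Means loop following the steps above (illustrative only)."""
    rng = np.random.default_rng(seed)
    # Step 2: pick k distinct data points as the initial centroids (an assumption)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(max_iters):
        # Step 3: distance from every point to every centroid, then take the nearest
        distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # Step 4: move each centroid to the mean of its assigned points
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        # Step 5: stop once the centroids barely move
        shift = np.linalg.norm(new_centroids - centroids)
        centroids = new_centroids
        if shift < tol:
            break
    return labels, centroids

# Small made-up dataset: three low-value points and two high-value points
X = np.array([[15, 5], [20, 8], [50, 20], [55, 25], [18, 6]], dtype=float)
labels, centroids = kmeans_sketch(X, k=2)
print("Labels:", labels)
print("Centroids:\n", centroids)
```

Which group is called 0 and which is called 1 depends on the random starting points, so only the grouping itself is meaningful, not the label numbers.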

**Analogy**: Think of K-Means as organizing a party where you want to form K dance groups. You start by randomly picking K people as the leaders (centroids). Everyone joins the closest leader’s group. Then, each leader moves to the center of their group. People might switch groups if another leader is now closer. This repeats until the groups stabilize.

**Key Advantage**: K-Means is simple, fast, and works well for spherical or compact clusters. However, it can be sensitive to the initial placement of centroids and requires you to specify K in advance.

## 3. Example: Grouping Customers with K-Means

Let’s apply K-Means to a small dataset representing customer shopping behavior. We’ll group customers based on two features: annual spending (in thousands of dollars) and frequency of purchases (number of times per year).

**Dataset** (simplified):
- Annual Spending: 15, 20, 50, 55, 18
- Purchase Frequency: 5, 8, 20, 25, 6

We’ll use Python’s `scikit-learn` library to apply K-Means Clustering with K=2 (let’s say we want two customer segments). Focus on the steps and output, not the code details.

**Instructions**: Run the code below to see how K-Means groups customers and visualizes the clusters and centroids.

In [None]:
# Import necessary libraries
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

# Our small dataset
X = np.array([[15, 5], [20, 8], [50, 20], [55, 25], [18, 6]])  # Features: spending, frequency

# Create and fit the K-Means model with K=2
model = KMeans(n_clusters=2, n_init=10, random_state=42)  # n_init set explicitly; its default differs across scikit-learn versions
model.fit(X)

# Get cluster labels for each data point
labels = model.labels_
print(f"Cluster assignments for each customer: {labels}")

# Get the centroids of the clusters
centroids = model.cluster_centers_
print(f"Centroids (centers) of the clusters:\n{centroids}")

# Visualize the clusters and centroids
plt.scatter(X[labels == 0][:, 0], X[labels == 0][:, 1], color='blue', label='Cluster 0', alpha=0.8)
plt.scatter(X[labels == 1][:, 0], X[labels == 1][:, 1], color='red', label='Cluster 1', alpha=0.8)
plt.scatter(centroids[:, 0], centroids[:, 1], color='black', marker='x', s=200, label='Centroids')
plt.xlabel('Annual Spending (thousands of $)')
plt.ylabel('Purchase Frequency (times/year)')
plt.title('K-Means Clustering: Customer Segments (K=2)')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
print("- Blue dots are customers in Cluster 0.")
print("- Red dots are customers in Cluster 1.")
print("- Black 'X' marks are the centroids (centers) of each cluster.")
print("- Customers are grouped based on similarity in spending and frequency.")
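
To connect the output back to the "Update Centroids" step, the sketch below checks that each centroid reported by the model is the average of the customers in that cluster. It also recomputes the within-cluster sum of squared distances by hand and compares it with the model's `inertia_` attribute. It refits the same model so that it can run on its own; `n_init=10` is set explicitly here as an assumption to keep behavior consistent across `scikit-learn` versions.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[15, 5], [20, 8], [50, 20], [55, 25], [18, 6]])
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Compare each model centroid with the plain average of its members
for cluster_id in range(2):
    members = X[model.labels_ == cluster_id]
    manual_mean = members.mean(axis=0)
    print(f"Cluster {cluster_id}: members={members.tolist()}")
    print(f"  mean of members: {manual_mean}")
    print(f"  model centroid : {model.cluster_centers_[cluster_id]}")

# Recompute the within-cluster sum of squared distances by hand
manual_inertia = sum(
    ((X[model.labels_ == c] - model.cluster_centers_[c]) ** 2).sum()
    for c in range(2)
)
print(f"Manual sum of squares: {manual_inertia:.3f}")
print(f"model.inertia_       : {model.inertia_:.3f}")
```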

## 4. Interactive Exercise: Adjust the Number of Clusters

Now it’s your turn to experiment with K-Means! In this exercise, you can choose the number of clusters (K) and see how the grouping of customers changes. You’ll also add a new customer data point and see which cluster it ends up in. Note that the new customer is added to the dataset before clustering, so it can also shift the centroids.

**Instructions**:
- Run the code below.
- Enter a value for K (number of clusters, between 1 and 5).
- Enter values for a new customer’s Annual Spending and Purchase Frequency.
- Observe how the clusters form and which cluster the new customer belongs to.

In [None]:
# Interactive exercise for K-Means Clustering
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

print("Welcome to the 'Adjust Number of Clusters' Exercise!")
print("You’ll choose the number of clusters (K) and see how customers are grouped.")

# Original dataset
X = np.array([[15, 5], [20, 8], [50, 20], [55, 25], [18, 6]])

# Ask user for number of clusters (K)
try:
    k = int(input("Enter number of clusters (K, between 1 and 5): "))
    if k < 1 or k > 5:
        raise ValueError("K must be between 1 and 5")
except ValueError as e:
    print(f"Invalid input: {e}. Defaulting to K=2.")
    k = 2

# Ask user to add a new customer data point
try:
    new_spend = float(input("Enter Annual Spending for new customer (e.g., 30): "))
    new_freq = float(input("Enter Purchase Frequency for new customer (e.g., 10): "))
    new_customer = np.array([[new_spend, new_freq]])
    X_with_new = np.vstack([X, new_customer])
    print(f"Added customer: Spending={new_spend}, Frequency={new_freq}.")
except ValueError:
    new_customer = np.array([[30, 10]])
    X_with_new = np.vstack([X, new_customer])
    print("Invalid input. Defaulting to Spending=30, Frequency=10.")

# Fit K-Means with chosen K
model = KMeans(n_clusters=k, n_init=10, random_state=42)  # n_init set explicitly; its default differs across scikit-learn versions
model.fit(X_with_new)

# Get cluster labels
labels = model.labels_
new_customer_cluster = labels[-1]
print(f"Cluster assignments for all points (including new customer): {labels}")
print(f"New customer assigned to Cluster {new_customer_cluster}")

# Get centroids
centroids = model.cluster_centers_
print(f"Centroids (centers) of the clusters:\n{centroids}")

# Visualize the clusters and centroids
colors = ['blue', 'red', 'green', 'purple', 'orange']
for i in range(k):
    plt.scatter(X_with_new[labels == i][:, 0], X_with_new[labels == i][:, 1], 
                color=colors[i], label=f'Cluster {i}', alpha=0.8)
plt.scatter(new_customer[0][0], new_customer[0][1], color='black', marker='*', s=200, label='New Customer')
plt.scatter(centroids[:, 0], centroids[:, 1], color='black', marker='x', s=200, label='Centroids')
plt.xlabel('Annual Spending (thousands of $)')
plt.ylabel('Purchase Frequency (times/year)')
plt.title(f'K-Means Clustering: Customer Segments (K={k})')
plt.legend()
plt.grid(True)
plt.show()

print("Look at the plot above:")
for i in range(k):
    print(f"- {colors[i].capitalize()} dots are customers in Cluster {i}.")
print("- Black '*' is the new customer you added.")
print("- Black 'X' marks are the centroids (centers) of each cluster.")
print("- Customers are grouped based on similarity in spending and frequency.")
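
The exercise above clusters the new customer together with the original data. A different approach, sketched below, is to fit the model on the original customers only and then ask which existing cluster a new customer is closest to, using `predict`. The new customer values (30, 10) are just example numbers, and `n_init=10` is an explicit choice made for this sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

X = np.array([[15, 5], [20, 8], [50, 20], [55, 25], [18, 6]])
model = KMeans(n_clusters=2, n_init=10, random_state=42).fit(X)

# Example new customer (hypothetical values); the model is NOT refit
new_customer = np.array([[30, 10]])
assigned = model.predict(new_customer)[0]
distances = model.transform(new_customer)[0]  # distance to each centroid

print(f"New customer assigned to Cluster {assigned}")
print(f"Distances to each centroid: {distances}")
```

With this approach the centroids stay where they were, so the new customer cannot change the existing segments.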

## 5. Key Considerations for K-Means Clustering

K-Means is a versatile and easy-to-use clustering algorithm, but it has some limitations and considerations to keep in mind:

- **Choosing K**: The number of clusters (K) must be specified in advance. Picking the wrong K can lead to poor clustering (too few clusters merge distinct groups; too many split similar groups). Techniques like the "elbow method" can help: plot the within-cluster sum of squares against K and look for the point where adding more clusters stops reducing it by much.
- **Sensitive to Initialization**: The initial placement of centroids can affect the final clusters, so different runs might yield different results. Setting `random_state` (as in our code) makes a run reproducible but does not by itself improve the result. Smarter initialization such as K-Means++ (the default `init` in scikit-learn's `KMeans`) and running several initializations via `n_init` help reduce this sensitivity.
- **Assumes Spherical Clusters**: K-Means works best when clusters are roughly spherical and of similar size. It struggles with elongated or irregularly shaped clusters.
- **Sensitive to Outliers**: Outliers can distort centroids and pull clusters away from meaningful groupings. Preprocessing to remove or handle outliers is often necessary.
- **Feature Scaling**: Like many distance-based algorithms, K-Means is sensitive to the scale of features. If one feature has a much larger range, it can dominate the clustering. Normalize or standardize features before clustering.
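
As a rough sketch of the elbow method mentioned above, the code below fits K-Means for several values of K on our small customer dataset and plots the within-cluster sum of squares (available as `inertia_` in `scikit-learn`). With only five customers the curve is purely illustrative; on real data you would try a wider range of K. The explicit `n_init=10` is an assumption of this sketch.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

X = np.array([[15, 5], [20, 8], [50, 20], [55, 25], [18, 6]])

k_values = list(range(1, 6))  # K cannot exceed the number of points (5 here)
inertias = []
for k in k_values:
    model = KMeans(n_clusters=k, n_init=10, random_state=42).fit(X)
    inertias.append(model.inertia_)
    print(f"K={k}: within-cluster sum of squares = {model.inertia_:.2f}")

# Plot the curve and look for the bend (the "elbow")
plt.plot(k_values, inertias, marker='o')
plt.xlabel('Number of clusters (K)')
plt.ylabel('Within-cluster sum of squares')
plt.title('Elbow Method')
plt.xticks(k_values)
plt.grid(True)
plt.show()
```

The value always shrinks as K grows, reaching zero when every point is its own cluster, so the aim is to find where the improvement levels off rather than to find the minimum.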

**Analogy**: K-Means is like sorting laundry into piles. If you pick too few piles (K), you mix socks with shirts. If you pick too many, you split similar items unnecessarily. If a giant blanket (outlier) is in the mix, it might mess up your piles. And if you measure socks by length and shirts by weight without adjusting, your grouping won’t make sense.
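
To make the feature scaling point concrete, here is a minimal sketch that standardizes the two features before clustering. It uses `StandardScaler` and `make_pipeline` from `scikit-learn`; chaining them this way is one reasonable option, not the only one, and `n_init=10` is again an explicit choice for this example.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X = np.array([[15, 5], [20, 8], [50, 20], [55, 25], [18, 6]])

# Standardize: each feature ends up with mean 0 and standard deviation 1
X_scaled = StandardScaler().fit_transform(X)
print("Feature means after scaling:", X_scaled.mean(axis=0).round(6))
print("Feature std devs after scaling:", X_scaled.std(axis=0).round(6))

# Chain scaling and clustering so both steps are always applied together
pipeline = make_pipeline(
    StandardScaler(),
    KMeans(n_clusters=2, n_init=10, random_state=42),
)
labels = pipeline.fit_predict(X)
print("Cluster assignments on scaled features:", labels)
```

In this dataset both features already point to the same two groups, so scaling does not change the result. It matters more when features have very different ranges, such as income in dollars next to age in years.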

Despite these challenges, K-Means is widely used for its simplicity and effectiveness, especially as a starting point for clustering tasks.

## 6. Key Takeaways

- **K-Means Clustering** is an unsupervised learning algorithm that groups data into a specified number of clusters (K) based on similarity, without needing labels.
- It works by iteratively assigning points to the nearest centroid and updating centroids until convergence.
- Use it for tasks like customer segmentation, document grouping, or image compression when you want to find natural patterns in unlabeled data.
- Be aware of limitations: you must choose K, it’s sensitive to initial centroids, outliers, and feature scale, and it assumes roughly spherical clusters.

You’ve now learned a fundamental unsupervised learning technique! K-Means introduces the concept of finding structure in data without guidance, which is powerful for exploratory analysis.

**What's Next?**
Move on to **Notebook 10: Naive Bayes** to learn about a probabilistic classification algorithm based on Bayes’ theorem. See you there!