# HPXPy K-Means Clustering Demo

This notebook implements K-Means clustering using HPXPy, demonstrating:
1. Iterative MapReduce pattern (assign points -> update centroids)
2. Data-parallel operations across large datasets
3. How computation naturally partitions for distributed execution

K-Means is a classic distributed computing benchmark because:
- Data can be partitioned across localities (each owns a subset of points)
- Each iteration has a Map phase (local) and Reduce phase (global)
- Communication is minimal: only centroid updates need synchronization

**Note:** For proper scalability testing with multiple thread counts, run the `kmeans_clustering_demo.py` script instead.

In [None]:
import time
import numpy as np
import hpxpy as hpx

hpx.init(num_threads=4)

## Generate Synthetic Clustered Data

In [None]:
n_points = 500_000
n_clusters = 10
n_iterations = 20

print("Configuration:")
print(f"  Data points: {n_points:,}")
print(f"  Clusters: {n_clusters}")
print(f"  Iterations: {n_iterations}")

# Generate clustered data
np.random.seed(42)

points_per_cluster = n_points // n_clusters
data_list = []
for i in range(n_clusters):
    center = np.random.randn(2) * 10
    cluster_points = center + np.random.randn(points_per_cluster, 2)
    data_list.append(cluster_points)

data_np = np.vstack(data_list).astype(np.float64)
np.random.shuffle(data_np)

print(f"\nGenerated {len(data_np):,} 2D data points")

## K-Means with HPXPy

In [None]:
# Extract x and y coordinates
x_np = data_np[:, 0].copy()
y_np = data_np[:, 1].copy()

# Convert to HPXPy arrays
x = hpx.from_numpy(x_np)
y = hpx.from_numpy(y_np)

# Initialize centroids (random points from data)
np.random.seed(123)
centroid_idx = np.random.choice(n_points, n_clusters, replace=False)
centroids_x = x_np[centroid_idx].copy()
centroids_y = y_np[centroid_idx].copy()

# Warm up: run one reduction before timing so one-time startup costs are excluded
_ = hpx.sum(x)

# Time K-Means iterations
start = time.perf_counter()

for iteration in range(n_iterations):
    # === MAP PHASE: Assign each point to nearest centroid ===
    min_distances = None
    assignments_np = None
    
    for k in range(n_clusters):
        # Distance squared: (x - cx)^2 + (y - cy)^2
        dx = x - centroids_x[k]
        dy = y - centroids_y[k]
        dist_sq = dx * dx + dy * dy
        
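        if min_distances is None:
            # First centroid: every point starts out assigned to cluster 0
            min_distances = dist_sq
            assignments_np = np.zeros(n_points, dtype=np.int64)
        else:
            # Update assignments where this centroid is closer.
            # The comparison and element-wise minimum run in NumPy, so each
            # pass converts both distance arrays to NumPy and back to HPXPy;
            # that round-trip is included in the measured time.
            dist_sq_np = dist_sq.to_numpy()
            min_dist_np = min_distances.to_numpy()
            closer = dist_sq_np < min_dist_np
            assignments_np[closer] = k
            min_distances = hpx.from_numpy(np.minimum(min_dist_np, dist_sq_np))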
    
    # === REDUCE PHASE: Update centroids ===
    for k in range(n_clusters):
        # Create mask for points in cluster k
        mask = (assignments_np == k).astype(np.float64)
        mask_hpx = hpx.from_numpy(mask)
        
        # Sum of coordinates for points in this cluster
        sum_x = float(hpx.sum(x * mask_hpx))
        sum_y = float(hpx.sum(y * mask_hpx))
        count = float(hpx.sum(mask_hpx))
        
        if count > 0:
            centroids_x[k] = sum_x / count
            centroids_y[k] = sum_y / count

elapsed = time.perf_counter() - start

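# Compute final inertia: sum of squared distances from each point to its
# nearest centroid, using the centroids from the last assignment step
# (that is, before the final centroid update)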
total_inertia = float(hpx.sum(min_distances))

print("\nK-Means Results:")
print(f"  Time: {elapsed*1000:.2f} ms")
print(f"  Inertia: {total_inertia:.2f}")
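
For reference, the inertia reported above is the K-Means objective evaluated with the centroids used in the last assignment step. Writing $N$ for `n_points`, $K$ for `n_clusters`, $(x_i, y_i)$ for the data points, and $(c_{x,k}, c_{y,k})$ for the centroids, it can be stated as:

$$
J = \sum_{i=1}^{N} \min_{1 \le k \le K} \left[ (x_i - c_{x,k})^2 + (y_i - c_{y,k})^2 \right]
$$

The bracketed term is the same squared distance that the map phase computes as `dist_sq`.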

## Compare with NumPy

In [None]:
def numpy_kmeans(data, n_clusters, n_iterations):
    """NumPy reference K-Means implementation."""
    n_points = len(data)
    x, y = data[:, 0], data[:, 1]
    
    # Initialize centroids
    np.random.seed(123)
    centroid_idx = np.random.choice(n_points, n_clusters, replace=False)
    centroids_x = data[centroid_idx, 0].copy()
    centroids_y = data[centroid_idx, 1].copy()
    
    start = time.perf_counter()
    
    for _ in range(n_iterations):
        # Compute distances to all centroids
        distances = np.zeros((n_points, n_clusters))
        for k in range(n_clusters):
            distances[:, k] = (x - centroids_x[k])**2 + (y - centroids_y[k])**2
        
        # Assign to nearest centroid
        assignments = np.argmin(distances, axis=1)
        
        # Update centroids
        for k in range(n_clusters):
            mask = assignments == k
            if np.sum(mask) > 0:
                centroids_x[k] = np.mean(x[mask])
                centroids_y[k] = np.mean(y[mask])
    
    elapsed = time.perf_counter() - start
    
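    # Compute inertia from the distances of the last assignment step
    # (before the final centroid update), matching the HPXPy version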
    min_distances = np.min(distances, axis=1)
    inertia = np.sum(min_distances)
    
    return elapsed, inertia

np_time, np_inertia = numpy_kmeans(data_np, n_clusters, n_iterations)

print("NumPy K-Means:")
print(f"  Time: {np_time*1000:.2f} ms")
print(f"  Inertia: {np_inertia:.2f}")
# A ratio above 1 means the HPXPy loop (timed in the previous cell) was faster
print(f"\nHPXPy speedup over NumPy: {np_time/elapsed:.2f}x")
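
The per-centroid loop in the reference implementation can also be expressed with NumPy broadcasting. The sketch below is illustrative only: `assign_loop` and `assign_broadcast` are hypothetical helper names, not part of this notebook or of HPXPy, and the four-point dataset is made up so the result is easy to check by hand.

```python
import numpy as np


def assign_loop(px, py, cx, cy):
    """Assign each point to its nearest centroid, one centroid at a time."""
    distances = np.zeros((len(px), len(cx)))
    for k in range(len(cx)):
        distances[:, k] = (px - cx[k]) ** 2 + (py - cy[k]) ** 2
    return np.argmin(distances, axis=1)


def assign_broadcast(px, py, cx, cy):
    """Same assignment, built as one (n_points, n_clusters) broadcast."""
    distances = (px[:, None] - cx[None, :]) ** 2 + (py[:, None] - cy[None, :]) ** 2
    return np.argmin(distances, axis=1)


# Four points and two centroids, chosen so the nearest centroid is obvious
px = np.array([0.0, 0.5, 9.5, 10.0])
py = np.array([0.0, 0.5, 9.5, 10.0])
cx = np.array([0.0, 10.0])
cy = np.array([0.0, 10.0])

labels_loop = assign_loop(px, py, cx, cy)
labels_broadcast = assign_broadcast(px, py, cx, cy)
print(labels_loop, labels_broadcast)
```

Both versions allocate an `(n_points, n_clusters)` distance matrix, so memory use grows with the number of clusters.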

## Visualization

In [None]:
try:
    import matplotlib.pyplot as plt
    
    # Subsample for plotting
    sample_size = min(5000, n_points)
    indices = np.random.choice(n_points, sample_size, replace=False)
    
    plt.figure(figsize=(10, 8))
    
    # Color by assignment
    plt.scatter(data_np[indices, 0], data_np[indices, 1], 
                c=assignments_np[indices], cmap='tab10', s=1, alpha=0.5)
    
    # Plot centroids
    plt.scatter(centroids_x, centroids_y, c='black', marker='x', s=200, linewidths=3)
    
    plt.xlabel('x')
    plt.ylabel('y')
    plt.title(f'K-Means Clustering ({n_clusters} clusters, {n_points:,} points)')
    plt.show()
except ImportError:
    print("matplotlib not available for visualization")

## Distributed K-Means: How It Scales Across Nodes

K-Means has a natural MapReduce structure that suits distributed execution:

### Iteration Structure

**MAP PHASE (Local - No Communication)**
- Each locality processes its local data points
- Compute distance from each point to all K centroids
- Assign each point to nearest centroid
- Compute local partial sums: Σx, Σy, count per cluster

**REDUCE PHASE (Global - Minimal Communication)**
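- All-reduce to combine the partial sums across localities
- Every locality divides the global Σx and Σy by the global count to obtain the new centroids
- Communication: only 3×K floats (Σx, Σy, and count for each of the K clusters) per locality per iteration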
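
The two phases above can be simulated on a single machine. The sketch below is a NumPy-only illustration, assuming that each chunk produced by `np.array_split` stands in for one locality and that a plain sum over the partial results stands in for the all-reduce. `local_partial_sums` is a hypothetical helper and the data are random; no HPXPy distributed API is used.

```python
import numpy as np


def local_partial_sums(px, py, cx, cy):
    """Map phase for one simulated locality: (3, K) array of sum-x, sum-y, count."""
    k = len(cx)
    dist = (px[:, None] - cx[None, :]) ** 2 + (py[:, None] - cy[None, :]) ** 2
    labels = np.argmin(dist, axis=1)
    return np.stack([
        np.bincount(labels, weights=px, minlength=k),
        np.bincount(labels, weights=py, minlength=k),
        np.bincount(labels, minlength=k).astype(np.float64),
    ])


rng = np.random.default_rng(0)
px = rng.normal(size=1000)
py = rng.normal(size=1000)
cx = np.array([-1.0, 0.0, 1.0])
cy = np.array([-1.0, 0.0, 1.0])

# Map phase: each simulated locality works only on its own chunk of points
n_localities = 4
partials = [
    local_partial_sums(chunk_x, chunk_y, cx, cy)
    for chunk_x, chunk_y in zip(np.array_split(px, n_localities),
                                np.array_split(py, n_localities))
]

# Reduce phase: element-wise sum of the partial results (stand-in for all-reduce)
global_sums = np.sum(partials, axis=0)

# Each simulated locality contributed exactly 3 * K floats
floats_per_locality = partials[0].size

# Centroid update; clusters with no points keep their previous centroid
counts = global_sums[2]
safe_counts = np.maximum(counts, 1.0)
new_cx = np.where(counts > 0, global_sums[0] / safe_counts, cx)
new_cy = np.where(counts > 0, global_sums[1] / safe_counts, cy)

# Reference: the same sums computed over all points in one place
single = local_partial_sums(px, py, cx, cy)
print(np.allclose(global_sums, single), floats_per_locality)
```

The combined partial sums match the single-pass result up to floating-point rounding, which is why only the `(3, K)` array needs to cross locality boundaries.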

### Scaling Projection

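The projection below assumes 1,000,000 points and K = 10 clusters (30 floats per locality per iteration). The speedups are estimates, not measurements.

| Localities | Points/Locality | Total Communication per Iteration | Expected Speedup |
|------------|-----------------|-----------------------------------|------------------|
| 1 | 1,000,000 | 0 | 1x |
| 4 | 250,000 | 120 floats | ~4x |
| 16 | 62,500 | 480 floats | ~16x |
| 64 | 15,625 | 1,920 floats | ~60x |
| 256 | ~3,906 | 7,680 floats | ~200x |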

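Each locality contributes O(K) values per iteration, independent of the number of data points, so the ratio of computation to communication improves as the dataset grows. Total traffic does grow linearly with the number of localities while the work per locality shrinks, which is consistent with the projected speedup falling below linear at 64 and 256 localities. For large datasets, K-Means therefore scales close to linearly.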
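
The communication figures in the table follow directly from the 3×K rule. A minimal sketch of that arithmetic, assuming K = 10, 1,000,000 points, 8-byte `float64` values, and no communication when there is a single locality; `floats_per_iteration` is a hypothetical helper:

```python
def floats_per_iteration(n_localities, k):
    """Total floats sent per iteration: 3 per cluster per locality."""
    if n_localities == 1:
        return 0  # all data is local, nothing to exchange
    return 3 * k * n_localities


k = 10
total_points = 1_000_000
bytes_per_float = 8  # float64

rows = []
for n_localities in (1, 4, 16, 64, 256):
    n_floats = floats_per_iteration(n_localities, k)
    points_each = total_points // n_localities
    rows.append((n_localities, points_each, n_floats, n_floats * bytes_per_float))
    print(f"{n_localities:>4} localities: {points_each:>9,} points each, "
          f"{n_floats:>5,} floats ({n_floats * bytes_per_float:,} bytes) per iteration")
```

Even at 256 localities the total is a few tens of kilobytes per iteration, which is small next to the per-locality distance computations.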

In [None]:
hpx.finalize()
print("Demo complete!")