# **Problem Statement**  
## **28. Apply KMeans to a dataset and evaluate clusters using Silhouette Score.**

Apply KMeans clustering to a dataset and evaluate the quality of clustering using the Silhouette Score.

The solution should:
- Cluster data using KMeans
- Compute Silhouette Score
- Compare brute-force and optimized implementations

### Constraints & Example Inputs/Outputs

### Constraints
- Dataset is numerical
- K ≥ 2
- Euclidean distance
- No label information available (unsupervised)

### Example Input:
```python
X = [[1,2], [1,4], [1,0],
     [10,2], [10,4], [10,0]]
k = 2

```

### Expected Output:
```text
Cluster Labels: [0, 0, 0, 1, 1, 1]
Silhouette Score: ~0.7
```

### Solution Approach

**Step 1: Initialize Centroids**
- Randomly choose k distinct data points as the initial centroids.

**Step 2: Assign Clusters**
- Assign each point to the nearest centroid using Euclidean distance.

**Step 3: Update Centroids**
- Recompute centroids as the mean of assigned points.

**Step 4: Iterate Until Convergence**
- Repeat the assignment and update steps until the centroids stop changing (within a numerical tolerance) or a maximum number of iterations is reached.

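**Step 5: Evaluate Using Silhouette Score**
For each point, the silhouette coefficient compares two quantities:
- Intra-cluster cohesion (a): the mean distance to the other points in the same cluster
- Inter-cluster separation (b): the mean distance to the points in the nearest other cluster

The coefficient is (b - a) / max(a, b), which lies between -1 and 1. The Silhouette Score is its mean over all points; values near 1 indicate compact, well-separated clusters.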
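As a rough cross-check on `silhouette_score`, the per-point definition can be written directly in NumPy. This is a minimal sketch rather than sklearn's implementation: the function name `silhouette_from_scratch` is made up for illustration, and it assumes the common convention that a point in a singleton cluster scores 0.

```python
import numpy as np

def silhouette_from_scratch(X, labels):
    """Mean silhouette coefficient, s(i) = (b - a) / max(a, b), using Euclidean distance."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Full (n, n) matrix of pairwise Euclidean distances
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # singleton cluster: score stays 0 (assumed convention)
        # a: mean distance to the other members of the point's own cluster
        a = dists[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dists[i, labels == other].mean()
                for other in np.unique(labels) if other != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return float(scores.mean())

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
score = silhouette_from_scratch(X, [0, 0, 0, 1, 1, 1])
print(round(score, 4))
```

The explicit distance matrix makes the quadratic cost of the metric visible, which is worth keeping in mind for large datasets.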

### Solution Code

In [1]:
# Import Libraries
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


In [2]:
# Approach 1: Brute Force Approach (KMeans from Scratch)
# Step 1: Distance Function
def euclidean_distance(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

# Step 2: Brute Force KMeans
def kmeans_bruteforce(X, k, max_iters=100):
    X = np.array(X)
    n_samples, n_features = X.shape
    
    # Randomly initialize centroids
    np.random.seed(42)
    centroids = X[np.random.choice(n_samples, k, replace=False)]
    
    for _ in range(max_iters):
        # Assign clusters
        clusters = [[] for _ in range(k)]
        labels = []
        
        for x in X:
            distances = [euclidean_distance(x, c) for c in centroids]
            cluster = np.argmin(distances)
            clusters[cluster].append(x)
            labels.append(cluster)
        
        # Update centroids
        new_centroids = np.array([
            np.mean(cluster, axis=0) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ])
        
        if np.allclose(centroids, new_centroids):
            break
        
        centroids = new_centroids
    
    return np.array(labels), centroids


In [3]:
# Run Brute Force KMeans
X = np.array([[1,2], [1,4], [1,0],
              [10,2], [10,4], [10,0]])

labels_brute, centroids_brute = kmeans_bruteforce(X, k=2)
labels_brute, centroids_brute


(array([0, 1, 0, 0, 1, 0]),
 array([[5.5, 1. ],
        [5.5, 4. ]]))

In [4]:
# Silhouette Score (Brute Force)
silhouette_brute = silhouette_score(X, labels_brute)
silhouette_brute


-0.14822328775848606
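The negative score indicates that this run settled in a poor local optimum: the labels `[0, 1, 0, 0, 1, 0]` group the two points whose second coordinate is 4, instead of separating the points by their first coordinate as in the expected output. One way to see this, sketched below, is to compare the within-cluster sum of squared distances (often called inertia) of the two labelings. The helper name `inertia` is an illustrative choice, not part of the original solution.

```python
import numpy as np

def inertia(X, labels):
    """Sum of squared distances from each point to the mean of its cluster."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        members = X[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

print(inertia(X, [0, 1, 0, 0, 1, 0]))  # 125.5 (labels from the run above)
print(inertia(X, [0, 0, 0, 1, 1, 1]))  # 16.0  (labels from the expected output)
```

Both labelings are fixed points of the assign/update loop, so the random choice of initial centroids alone decides which one a single run reaches.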

### Alternative Solution

In [5]:
# Approach 2: Optimized Approach (sklearn KMeans)
# Optimized KMeans
kmeans = KMeans(n_clusters=2, random_state=42)
labels_opt = kmeans.fit_predict(X)
centroids_opt = kmeans.cluster_centers_

labels_opt, centroids_opt


(array([0, 0, 0, 1, 1, 1], dtype=int32),
 array([[ 1.,  2.],
        [10.,  2.]]))

In [6]:
# Silhouette Score (Optimized)
silhouette_opt = silhouette_score(X, labels_opt)
silhouette_opt


0.7133477791749615
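Because the Silhouette Score needs no ground-truth labels, it can also be used to compare candidate values of k. The loop below is a small sketch of that idea on the same six points; passing `n_init=10` explicitly is an assumption made here so the behaviour does not depend on the sklearn version's default.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])

scores = {}
# silhouette_score requires 2 <= number of clusters <= n_samples - 1
for k in range(2, len(X)):
    labels = KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(best_k, scores)
```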

### Alternative Approaches

**Brute Force Alternatives**
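- Different centroid initialization (for example, several random restarts, keeping the best run)
- More iterations
- Distance optimizations (for example, vectorized NumPy distance computation instead of Python loops)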
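The first and last of these ideas can be combined in a short sketch: distances are computed with NumPy broadcasting instead of nested Python loops, and the algorithm is restarted from several random initializations, keeping the run with the lowest within-cluster sum of squares. The names `kmeans_once` and `kmeans_restarts`, the default of 20 restarts, and the seed are all assumptions for illustration.

```python
import numpy as np

def kmeans_once(X, k, rng, max_iters=100):
    """One KMeans run from a random initialization; returns labels, centroids, inertia."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iters):
        # (n, k) matrix of squared distances, computed by broadcasting
        d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(centroids, new_centroids):
            break
        centroids = new_centroids
    # Final assignment and inertia with respect to the returned centroids
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    inertia = d2[np.arange(len(X)), labels].sum()
    return labels, centroids, float(inertia)

def kmeans_restarts(X, k, n_restarts=20, seed=0):
    """Run kmeans_once several times and keep the lowest-inertia result."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
labels, centroids, inertia = kmeans_restarts(X, k=2)
print(labels, inertia)
```

Restarts do not remove the sensitivity to initialization; they only make it unlikely that every run lands in a poor local optimum.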

**Optimized / Advanced**
- KMeans++ initialization (the default `init` in sklearn's `KMeans`)
- DBSCAN
- Hierarchical Clustering
- Gaussian Mixture Models (GMM)
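Of these, KMeans++ is the smallest change to the from-scratch code, since it only replaces the initialization step. The sketch below shows the seeding idea: the first centroid is drawn uniformly, and each later one is drawn with probability proportional to its squared distance from the nearest centroid chosen so far. The function name `kmeans_pp_init` and the seed are illustrative, and the sketch assumes the data contains at least k distinct points.

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    """KMeans++-style seeding: spread the initial centroids apart."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # First centroid: a uniformly random data point
    centroids = [X[rng.integers(len(X))]]
    for _ in range(k - 1):
        chosen = np.array(centroids)
        # Squared distance from each point to its nearest chosen centroid
        d2 = ((X[:, None, :] - chosen[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # Sample the next centroid with probability proportional to d2;
        # already-chosen points have d2 == 0 and cannot be picked again
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)

X = np.array([[1, 2], [1, 4], [1, 0],
              [10, 2], [10, 4], [10, 0]])
init = kmeans_pp_init(X, k=2)
print(init)
```

On this dataset the points in the opposite group carry most of the probability mass for the second draw, which makes a bad starting pair much less likely than with uniform sampling.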

### Test Cases

In [7]:
# Test Case 1: Correct Number of Clusters
assert len(set(labels_brute)) == 2
print("Test Case 1 Passed")


Test Case 1 Passed


In [8]:
# Test Case 2: Silhouette Score Range
assert -1 <= silhouette_brute <= 1
print("Test Case 2 Passed")


Test Case 2 Passed


In [9]:
# Test Case 3: Optimized and brute-force runs produce the same number of clusters
assert len(set(labels_opt)) == len(set(labels_brute))
print("Test Case 3 Passed")


Test Case 3 Passed
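Matching cluster counts do not show that the two labelings describe the same partition, and cluster ids are arbitrary, so comparing label arrays element by element is not reliable either. A hedged alternative is sklearn's `adjusted_rand_score`, which is invariant to relabeling and equals 1.0 only when the partitions coincide. The arrays below are copied from the outputs shown earlier.

```python
from sklearn.metrics import adjusted_rand_score

labels_expected = [0, 0, 0, 1, 1, 1]   # sklearn result above
labels_swapped = [1, 1, 1, 0, 0, 0]    # same partition, cluster ids swapped
labels_brute_run = [0, 1, 0, 0, 1, 0]  # brute-force result above

ari_same = adjusted_rand_score(labels_expected, labels_swapped)
ari_diff = adjusted_rand_score(labels_expected, labels_brute_run)

print(ari_same)  # 1.0
print(ari_diff)  # below 1: the partitions differ
```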


In [11]:
# Test Case 4: Larger Random Dataset
np.random.seed(0)
X_large = np.random.rand(100, 2)

labels_large, _ = kmeans_bruteforce(X_large, k=3)
score_large = silhouette_score(X_large, labels_large)

assert score_large > 0
print("Test Case 4 Passed")


Test Case 4 Passed


## Complexity Analysis

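### Brute Force KMeans
- Time: O(n × k × d × i)
- Space: O((n + k) × d) for the data points and the centroids

### Optimized (sklearn)
- Time: O(n × k × d × i) per initialization, with lower constant factors from vectorized, compiled code
- Space: O((n + k) × d)

### Silhouette Score
- Time: O(n² × d), since it needs the distances between all pairs of points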

Where:
- n = samples
- k = clusters
- d = dimensions
- i = iterations
