# 3. Unsupervised Learning
In unsupervised learning, we are given a training data set $\{x^{(1)}, \ldots, x^{(n)}\}$ with $x^{(i)} \in \mathbb{R}^d$ for $i = 1, \ldots, n$, but no labels $y^{(i)}$. In the clustering setting considered here, our task is to group the data into a few cohesive groups or *clusters*.

## 3.1 K-Means Algorithm
The k-means clustering algorithm is a popular unsupervised learning algorithm, commonly used in fields such as image segmentation, market segmentation, and customer profiling. We first randomly initialize $k$ cluster centroids. We then assign each data point in the training set to its nearest centroid and update each centroid to the average of the data points assigned to it. These two steps are repeated until convergence, that is, until the assignments no longer change. More formally:

1. Initialize cluster centroids $\mu_1, \ldots, \mu_k \in \mathbb{R}^d$ randomly.
1. Repeat until convergence:

    1. for every $i$, set 
    $$c^{(i)} := \text{argmin}_{j} \|x^{(i)} - \mu_j \|$$
    2. for every $j$, set 
    $$\mu_j := \frac{\sum_{i = 1}^{n} \mathbb{1}_{\{c^{(i)} = j\}}x^{(i)}}{\sum_{i = 1}^{n} \mathbb{1}_{\{c^{(i)} = j\}}}.$$
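
To make the two steps concrete, here is a minimal sketch of a single iteration on a toy data set. The four points and the two starting centroids are made-up values chosen for illustration, and the vectorized distance computation is just one way to write the assignment step.

```python
import numpy as np

# Toy data (illustrative values): four points in R^2 and k = 2 initial centroids.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[1.0, 1.0], [8.0, 1.0]])

# Assignment step: distances[i, j] = ||x_i - mu_j||, so distances has shape (n, k).
distances = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
c = distances.argmin(axis=1)

# Update step: each centroid becomes the mean of the points assigned to it.
new_centroids = np.array([X[c == j].mean(axis=0) for j in range(len(centroids))])

print(c.tolist())              # [0, 0, 1, 1]
print(new_centroids.tolist())  # [[0.0, 1.0], [10.0, 1.0]]
```

After this single pass the centroids sit at the means of the left and right pairs of points. A second pass would leave both the assignments and the centroids unchanged, which is what convergence means here.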
    
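Below is a NumPy implementation of this algorithm. It initializes the centroids by sampling $k$ distinct training points and runs for at most `iterations` passes.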

In [10]:
import numpy as np
import collections

In [19]:
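class KMeans:
    def __init__(self, num_clusters=3, iterations=100):
        self.num_clusters = num_clusters
        self.iterations = iterations  # maximum number of assignment/update passes

    def fit(self, X):
        # Work in floating point so centroid means are not truncated for integer input.
        X = np.asarray(X, dtype=float)
        n_samples, d_features = X.shape
        # Initialize the centroids as k distinct training points (fancy indexing returns a copy).
        self.centroids = X[np.random.choice(n_samples, self.num_clusters, replace=False)]
        for i in range(self.iterations):
            # Assignment step: group each point under its nearest centroid.
            clusters = collections.defaultdict(list)
            for j in range(n_samples):
                distances = np.linalg.norm(X[j] - self.centroids, axis=1)
                clusters[int(np.argmin(distances))].append(X[j])
            # Update step: move each non-empty cluster's centroid to the mean of its points.
            # A centroid with no assigned points keeps its previous position.
            old_centroids = self.centroids.copy()
            for k in clusters:
                self.centroids[k] = np.mean(clusters[k], axis=0)
            # Stop early once the centroids no longer move.
            if np.allclose(old_centroids, self.centroids):
                break
        return self

    def predict(self, X):
        # Return, for each row of X, the index of the nearest learned centroid.
        X = np.asarray(X, dtype=float)
        clusters = []
        for j in range(len(X)):
            distances = np.linalg.norm(X[j] - self.centroids, axis=1)
            clusters.append(int(np.argmin(distances)))
        return clusters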

In [20]:
x1 = np.random.randn(5,2) + 5
x2 = np.random.randn(5,2) - 5
X = np.concatenate([x1,x2], axis=0)

# Initialize the KMeans object with k=2
kmeans = KMeans(num_clusters=2)

# Fit the k-means model to the dataset
kmeans.fit(X)

cluster_assignments = kmeans.predict(X)

# Print the cluster assignments
print(cluster_assignments)

# Print the learned centroids
print(kmeans.centroids)

[1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
[[-4.71687224 -4.56277632]
 [ 5.73413132  5.9735165 ]]
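
Because the initialization is random, the label given to each cluster (0 or 1) can differ from run to run. One quantity that does not depend on the labeling is the sum of squared distances from each point to its assigned centroid, often called the distortion. The sketch below computes it with a hypothetical helper `distortion`, which is not part of the class above, on made-up toy values.

```python
import numpy as np

def distortion(X, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid.

    Hypothetical helper for illustration; not part of the KMeans class.
    """
    X = np.asarray(X, dtype=float)
    centroids = np.asarray(centroids, dtype=float)
    assignments = np.asarray(assignments)
    # centroids[assignments] picks, for every point, the centroid it is assigned to.
    diffs = X - centroids[assignments]
    return float(np.sum(diffs ** 2))

# Toy values (illustrative): every point is at distance 1 from its centroid.
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
mu = np.array([[0.0, 1.0], [10.0, 1.0]])

print(distortion(X_toy, mu, [0, 0, 1, 1]))        # 4.0
# Swapping the cluster labels (and the centroid order) leaves the value unchanged.
print(distortion(X_toy, mu[::-1], [1, 1, 0, 0]))  # 4.0
```

With the fitted model above, the same helper could be called as `distortion(X, kmeans.centroids, kmeans.predict(X))` to compare runs that start from different initializations.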
