# Worksheet 05

Name:  Haoxiang Huo
UID: U13668934

### Topics

- Cost Functions
- Kmeans

### Cost Function

Solving Data Science problems often starts by defining a metric for evaluating candidate solutions, should you find any. This metric is called a cost function. Data Science then works backward to find a process / algorithm that produces solutions optimizing that cost function.

For example, suppose you are asked to cluster three points A, B, C into two non-empty clusters. If someone gave you the solution `{A, B}, {C}`, how would you evaluate whether this is a good solution?

Notice that because the clusters need to be non-empty and all points must be assigned to a cluster, it must be that two of the three points will be together in one cluster and the third will be alone in the other cluster.

In the above solution, if A and B are closer to each other than A is to C and B is to C, then this is a good solution. The smaller the distance between the two points in the same cluster (here A and B), the better the solution. So we can define our cost function to be that distance (between A and B here)!

The algorithm / process would involve clustering together the two closest points and putting the third in its own cluster. This process optimizes for that cost function because no other pair of points could have a lower distance (although it could equal it).
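
The process described above can be sketched in a few lines of Python. This is only an illustration: the coordinates assigned to A, B and C are hypothetical (the worksheet does not give any), and `pair_cost` is a helper name introduced here.

```python
from itertools import combinations

# Hypothetical 1-D coordinates for the three points (not from the worksheet).
points = {"A": 0.0, "B": 1.0, "C": 5.0}

def pair_cost(pair):
    """Cost of a solution: distance between the two points that share a cluster."""
    p, q = pair
    return abs(points[p] - points[q])

# Every valid solution is determined by which pair of points shares a cluster,
# so minimizing the cost means picking the closest pair.
best_pair = min(combinations(points, 2), key=pair_cost)
loner = (set(points) - set(best_pair)).pop()

print("clustered together:", sorted(best_pair))
print("alone:", loner)
print("cost:", pair_cost(best_pair))
```

With these coordinates the closest pair is A and B, so the sketch recovers the solution `{A, B}, {C}` discussed above.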

### K means

a) (1-dimensional clustering) Walk through Lloyd's algorithm step by step on the following dataset:

`[0, .5, 1.5, 2, 6, 6.5, 7]` (note: each of these is a 1-dimensional data point)

Given the initial centroids:

`[0, 2]`


Iteration 1:
Assignments: {0, 0.5} and {1.5, 2, 6, 6.5, 7};
Updated centroids: [0.25, 4.6];

Iteration 2:
Assignments: {0, 0.5, 1.5, 2} and {6, 6.5, 7};
Updated centroids: [1, 6.5];

Iteration 3:
Assignments don't change from Iteration 2 and centroids remain [1, 6.5]. Algorithm converges.

Final Clusters:
Cluster 1: [0, .5, 1.5, 2];
Cluster 2: [6, 6.5, 7]
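
The hand computation can be checked with a short script. The following is a minimal sketch of Lloyd's algorithm for 1-dimensional data, not the class implementation; `lloyd_1d` is a name introduced here, and the sketch assumes that no cluster ever becomes empty (true for this dataset).

```python
def lloyd_1d(data, centroids):
    """Run Lloyd's algorithm on 1-D data; return the centroids after each iteration."""
    history = []
    while True:
        # Assignment step: each point joins its nearest centroid
        # (ties go to the lower-indexed centroid).
        clusters = [[] for _ in centroids]
        for x in data:
            nearest = min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
            clusters[nearest].append(x)
        # Update step: each centroid moves to the mean of its cluster.
        # Assumes no cluster is empty, which holds for this dataset.
        new_centroids = [sum(c) / len(c) for c in clusters]
        history.append(new_centroids)
        if new_centroids == centroids:
            return history
        centroids = new_centroids

history = lloyd_1d([0, .5, 1.5, 2, 6, 6.5, 7], [0, 2])
for i, c in enumerate(history, start=1):
    print(f"Iteration {i}: centroids = {c}")
```

The third iteration reproduces the centroids of the second, which is how the sketch detects convergence.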

b) Describe in plain English what the cost function for k means is.

The cost function for k-means measures how far data points are from the center of their assigned clusters: it is the sum, over all data points, of the squared distance between each point and the center of its cluster. It's a way to gauge how well the points fit within their clusters. The goal of the k-means algorithm is to minimize this cost function, meaning we want data points to be as close as possible to the center of their respective clusters. If all the data points were exactly at the center of their clusters, the cost would be zero, which would be the ideal scenario. In most cases points are spread out, and the cost function tells us how spread out they are.
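
As a rough illustration, the cost can be computed directly for two ways of clustering the dataset from part (a). This is a sketch for 1-dimensional data only; `kmeans_cost` is a helper name introduced here, and it takes each cluster's center to be the mean of its points.

```python
def kmeans_cost(clusters):
    """Sum of squared distances from each point to the mean of its cluster (1-D)."""
    total = 0.0
    for cluster in clusters:
        centroid = sum(cluster) / len(cluster)
        total += sum((x - centroid) ** 2 for x in cluster)
    return total

# Two candidate clusterings of the dataset from part (a).
tight = kmeans_cost([[0, .5, 1.5, 2], [6, 6.5, 7]])
loose = kmeans_cost([[0, .5], [1.5, 2, 6, 6.5, 7]])

print("tight clustering cost:", tight)
print("loose clustering cost:", loose)
```

The clustering whose points sit closer to their centers gets the lower cost, which is what the description above says k-means tries to achieve.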

c) For the same number of clusters K, why could there be very different solutions to the K means algorithm on a given dataset?

The most common reason is initialization: Lloyd's algorithm starts from initial centroids that are usually chosen at random, and it only converges to a local minimum of the cost function, so different starting centroids can lead to different final clusters. The order in which data points are processed can also affect the outcome in online or mini-batch versions of k-means. Finally, if the data isn't well separated, or if the specified number of clusters (K) doesn't match the actual structure in the data, many different clusterings have similar costs and the algorithm can settle on any of them.
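
A small example can make the sensitivity to starting centroids concrete. The sketch below is an assumption-laden illustration, not the class implementation: the four data points and both sets of starting centroids are hypothetical, `lloyd` is a helper written here, and it assumes no cluster ever becomes empty (true for both runs).

```python
import numpy as np

def lloyd(data, centers):
    """Plain Lloyd's algorithm; returns the final labels and the final cost."""
    centers = np.asarray(centers, dtype=float)
    while True:
        # Squared distance from every point to every center: shape (n, k).
        d2 = ((data[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Assumes no cluster goes empty, which holds for both runs below.
        new_centers = np.array(
            [data[labels == j].mean(axis=0) for j in range(len(centers))]
        )
        if np.allclose(new_centers, centers):
            return labels, d2.min(axis=1).sum()
        centers = new_centers

# Four corners of a wide rectangle (hypothetical data, chosen for illustration).
data = np.array([[0, 0], [0, 1], [4, 0], [4, 1]], dtype=float)

labels_a, cost_a = lloyd(data, [[0, 0], [4, 0]])  # one starting center per side
labels_b, cost_b = lloyd(data, [[0, 0], [0, 1]])  # both starting centers on the left

print("run A:", labels_a, "cost", cost_a)
print("run B:", labels_b, "cost", cost_b)
```

Both runs use the same data and the same K, yet one groups the points left/right and the other top/bottom, and the two final costs differ. Each result is stable under further iterations, so the algorithm has no way to escape the worse one.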

d) Does Lloyd's Algorithm always converge? Why / why not?

Yes, it always converges. The objective of Lloyd's algorithm (or k-means) is to minimize the sum of squared distances from each point to its assigned centroid. In each step of the algorithm, either:

a) A point's assignment to a centroid changes, which means that the distance to its new centroid is shorter than the distance to its old centroid. This ensures a decrease in the overall cost.

Or, 
b) the centroids are recalculated, which by definition will be at the mean of its assigned points, minimizing the sum of squared distances for that cluster.

Because there are only finitely many ways to assign data points to centroids, and the overall cost never increases from one step to the next, the algorithm cannot cycle through assignments forever: it eventually reaches a point where no reassignment lowers the cost, and it stops. Note that this stopping point is a local minimum of the cost and not necessarily the best possible clustering.
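
The argument can be observed numerically by recording the cost after every assignment step and every update step. The following is a sketch on the 1-dimensional dataset from part (a); the `cost` helper and the fixed three iterations are choices made here for illustration, and the sketch assumes no cluster becomes empty.

```python
def cost(data, labels, centroids):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum((x - centroids[j]) ** 2 for x, j in zip(data, labels))

data = [0, .5, 1.5, 2, 6, 6.5, 7]
centroids = [0.0, 2.0]
costs = []

for _ in range(3):
    # Assignment step: move each point to its nearest centroid.
    labels = [
        min(range(len(centroids)), key=lambda j: abs(x - centroids[j]))
        for x in data
    ]
    costs.append(cost(data, labels, centroids))
    # Update step: move each centroid to the mean of its points
    # (assumes no cluster is empty, which holds here).
    centroids = [
        sum(x for x, j in zip(data, labels) if j == c) / labels.count(c)
        for c in range(len(centroids))
    ]
    costs.append(cost(data, labels, centroids))

print(costs)
```

The recorded costs never go up from one entry to the next, and once they stop changing the algorithm has converged.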

e) Follow along in class with the implementation of K means

In [None]:
import numpy as np
from PIL import Image as im
import matplotlib.pyplot as plt
import sklearn.datasets as datasets

centers = [[0, 0], [2, 2], [-3, 2], [2, -4]]
X, _ = datasets.make_blobs(n_samples=300, centers=centers, cluster_std=1, random_state=0)

class KMeans():

    def __init__(self, data, k):
        self.data = data
        self.k = k
        self.assignment = [-1 for _ in range(len(data))]
        self.snaps = []
    
    def distance(self, x, y):
        return np.linalg.norm(x - y)
    
    def snap(self, centers):
        TEMPFILE = "temp.png"

        fig, ax = plt.subplots()
        ax.scatter(self.data[:, 0], self.data[:, 1], c=self.assignment)
        ax.scatter(centers[:,0], centers[:, 1], c='r')
        fig.savefig(TEMPFILE)
        plt.close(fig)
        self.snaps.append(im.fromarray(np.asarray(im.open(TEMPFILE))))

    def initialize(self):
        return self.data[np.random.choice(range(len(self.data)), self.k, replace=False)]
    
    def assign(self, centers):
        for i in range(len(self.data)):
            # Start with center 0 as the best candidate, then look for a closer one.
            min_dist = self.distance(centers[0], self.data[i])
            self.assignment[i] = 0
            for j in range(1, len(centers)):
                dist = self.distance(centers[j], self.data[i])
                if dist < min_dist:
                    min_dist = dist
                    self.assignment[i] = j
        return 
    
    def is_diff_clusters(self, centers, new_centers):
        for i in range(len(centers)):
            if self.distance(centers[i], new_centers[i]) != 0:
                return True
        return False
    
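    def get_centers(self):
        centers = []
        # Only clusters that currently own at least one point appear in
        # set(self.assignment), so an emptied cluster yields no center here.
        for i in set(self.assignment):
            clusters = [self.data[j] for j in range(len(self.data)) if self.assignment[j] == i]
            # The new center is the mean of the points assigned to cluster i.
            centers.append(np.mean(clusters, axis=0))
        return np.array(centers)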
    
    
    def lloyds(self):
        centers = self.initialize()
        self.assign(centers)
        self.snap(centers)
        new_centers = self.get_centers()
        while self.is_diff_clusters(centers, new_centers):
            self.assign(new_centers)
            self.snap(new_centers)
            centers = new_centers.copy()
            new_centers = self.get_centers()
        return
            

kmeans = KMeans(X, 6)
kmeans.lloyds()
images = kmeans.snaps

images[0].save(
    'kmeans.gif',
    optimize=False,
    save_all=True,
    append_images=images[1:],
    loop=0,
    duration=500
)