# Worksheet 05

Name: Yuhan Peng <br>
UID: U25596256

### Topics

- Cost Functions
- Kmeans

### Cost Function

Solving a data science problem often starts with defining a metric for evaluating candidate solutions, before you know how to find any. This metric is called a cost function. You then work backwards and look for a process / algorithm that finds solutions optimizing that cost function.

For example, suppose you are asked to cluster three points A, B, C into two non-empty clusters. If someone gave you the solution `{A, B}, {C}`, how would you judge whether this is a good solution?

Notice that because the clusters need to be non-empty and all points must be assigned to a cluster, it must be that two of the three points will be together in one cluster and the third will be alone in the other cluster.

In the above solution, if A and B are closer to each other than A is to C and than B is to C, then this is a good solution. The smaller the distance between the two points in the same cluster (here A and B), the better the solution. So we can define our cost function to be that distance (between A and B here)!

The algorithm / process would cluster together the two closest points and put the third in its own cluster. This process optimizes the cost function because no other pair of points could have a smaller distance (although another pair could tie).
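The three-point case can be sketched in a few lines of Python. The coordinates for A, B, and C below are made up for illustration, and `closest_pair_clustering` is a hypothetical helper name, not something defined in this worksheet.

```python
from itertools import combinations
import math

def closest_pair_clustering(points):
    """Split three labelled points into {closest pair}, {remaining point}.

    `points` maps a label to its coordinates. Returns (pair, singleton, cost),
    where cost is the distance between the two points that share a cluster.
    """
    # Evaluate the cost function (pair distance) for every possible pair
    # and keep the pair with the smallest distance.
    pair = min(combinations(sorted(points), 2),
               key=lambda p: math.dist(points[p[0]], points[p[1]]))
    cost = math.dist(points[pair[0]], points[pair[1]])
    singleton = next(name for name in points if name not in pair)
    return set(pair), {singleton}, cost

# Hypothetical coordinates: A and B are close together, C is far away.
example = {"A": (0.0, 0.0), "B": (1.0, 0.0), "C": (5.0, 5.0)}
pair, single, cost = closest_pair_clustering(example)
print(pair, single, cost)
```

With these assumed coordinates the pair `{A, B}` has distance 1, which is smaller than either distance involving C, so `{A, B}, {C}` is the solution that minimizes the cost.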

### K means

a) (1-dimensional clustering) Walk through Lloyd's algorithm step by step on the following dataset:

`[0, .5, 1.5, 2, 6, 6.5, 7]` (note: each of these is a 1-dimensional data point)

Given the initial centroids:

`[0, 2]`

1. Randomly pick k centers {𝝻1, … , 𝝻k}<br>
2. Assign each point in the dataset to its closest center<br>
3. Compute the new centers as the means of each cluster<br>
4. Repeat 2 & 3 until convergence<br>

Walk-through with initial centroids `[0, 2]`:

1. centroids: [0, 2]<br>
    groups: [0, 0.5], [1.5, 2, 6, 6.5, 7]<br>
2. centroids: [0.25, 4.6]<br>
    groups: [0, 0.5, 1.5, 2], [6, 6.5, 7]<br>
3. centroids: [1, 6.5]<br>
    groups: [0, 0.5, 1.5, 2], [6, 6.5, 7]<br>
4. centroids: [1, 6.5]<br>
    The centroids and the groups no longer change, so the algorithm has converged.
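The hand trace above can be reproduced with a short script. This is a minimal sketch for 1-D data only; `lloyd_1d` is an illustrative name, and keeping the old centroid for an empty group is an assumption (it does not come up for this dataset).

```python
def lloyd_1d(points, centroids):
    """Run Lloyd's algorithm on 1-D data and return a trace of every iteration."""
    trace = []
    while True:
        # Assignment step: each point goes to its nearest centroid
        # (ties go to the lower-indexed centroid).
        groups = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda j: abs(p - centroids[j]))
            groups[nearest].append(p)
        trace.append((list(centroids), groups))
        # Update step: each centroid becomes the mean of its group.
        # Assumption: an empty group keeps its previous centroid.
        new_centroids = [sum(g) / len(g) if g else c for g, c in zip(groups, centroids)]
        if new_centroids == centroids:
            return trace
        centroids = new_centroids

trace = lloyd_1d([0, .5, 1.5, 2, 6, 6.5, 7], [0, 2])
for step, (cents, groups) in enumerate(trace, start=1):
    print(step, cents, groups)
```

The script records three iterations; the third one finds that the update step leaves the centroids at `[1, 6.5]`, which corresponds to the convergence step of the hand trace.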

b) Describe in plain English what the cost function for k means is.

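It is the total squared distance between each point and the center of the cluster it is assigned to, added up over all points. K means looks for the assignment and centers that make this total as small as possible, so points in the same cluster end up close to their shared center.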
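The usual k-means cost is the sum, over all points, of the squared distance from the point to the mean of its cluster. The sketch below evaluates that cost for two groupings of the 1-D dataset from part a); `kmeans_cost` is an illustrative name and the function assumes 1-D points and non-empty clusters.

```python
def kmeans_cost(clusters):
    """Sum of squared distances from each point to the mean of its cluster (1-D)."""
    total = 0.0
    for cluster in clusters:
        center = sum(cluster) / len(cluster)
        total += sum((p - center) ** 2 for p in cluster)
    return total

# Grouping reached at convergence in part a).
good = kmeans_cost([[0, .5, 1.5, 2], [6, 6.5, 7]])
# Grouping after the very first assignment step in part a).
bad = kmeans_cost([[0, .5], [1.5, 2, 6, 6.5, 7]])
print(good, bad)
```

The converged grouping has a much lower cost than the first grouping, which matches the intuition that its clusters are tighter.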

c) For the same number of clusters K, why could there be very different solutions to the K means algorithm on a given dataset?

1. **Initialization**: One of the primary reasons is the initialization of the centroids. K-means is sensitive to the initial placement of centroids. If centroids are initialized differently, the algorithm might converge to different local optima. This is why multiple runs of k-means with different initializations are often performed, and the solution with the best (lowest) objective value is chosen.

2. **Local Optima**: The k-means objective function has multiple local optima. Depending on the starting conditions, the algorithm might converge to any of these optima. Thus, two runs might produce two different solutions, both of which are locally optimal but not necessarily globally optimal.

3. **Order of Data Points**: The order in which data points are considered can also influence the clustering outcome, especially in online or incremental versions of k-means. If data points are processed in a different sequence in two separate runs, centroids might be updated differently leading to different final solutions.

4. **Convergence Criteria**: The stopping criteria for the algorithm can also affect the final solution. For example, if one run is stopped after a fixed number of iterations while another run is allowed to continue until the assignments no longer change, the solutions might differ.

5. **Choice of Distance Metric**: Even though the canonical k-means algorithm uses the Euclidean distance, variations of the algorithm might use other distance metrics. The choice of metric can influence the shape and compactness of clusters and, therefore, the final solution.

6. **Tie-breaking**: If two centroids are equidistant from a data point, a decision has to be made regarding which cluster the data point should be assigned to. Different runs might resolve such ties differently.

7. **Presence of Noise and Outliers**: K-means is sensitive to noise and outliers. A slight perturbation in the data, or a slightly different handling of outliers, can lead to different clustering solutions.

8. **Data Preprocessing**: The way data is preprocessed, including scaling, normalization, or transformation, can influence the relative distances between data points, leading to different clustering solutions.
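The effect of initialization can be seen on a tiny made-up dataset: the four corners of a 4-by-1 rectangle with K = 2. The sketch below is a plain NumPy version of Lloyd's algorithm (`lloyd` is an illustrative name, and it assumes no cluster ever becomes empty), run from two hand-picked sets of starting centers.

```python
import numpy as np

def lloyd(points, centers, max_iter=100):
    """Plain Lloyd's algorithm. Returns (final centers, cost).

    Assumption: no cluster becomes empty during the run.
    """
    points = np.asarray(points, dtype=float)
    centers = np.asarray(centers, dtype=float)
    for _ in range(max_iter):
        # Squared distance from every point to every center, shape (n_points, k).
        d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        new_centers = np.array([points[labels == j].mean(axis=0)
                                for j in range(len(centers))])
        if np.array_equal(new_centers, centers):
            break
        centers = new_centers
    # Cost: squared distance from each point to its nearest final center.
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return centers, d2.min(axis=1).sum()

corners = [[0, 0], [0, 1], [4, 0], [4, 1]]
# Start A pairs the points along the long side; start B along the short side.
centers_a, cost_a = lloyd(corners, [[2, 0], [2, 1]])
centers_b, cost_b = lloyd(corners, [[0, 0.5], [4, 0.5]])
print(cost_a, cost_b)
```

Both runs stop immediately because their starting centers are already stable, yet start A ends with a cost of 16 and start B with a cost of 1. Both are fixed points of the algorithm, but only the second is the better clustering.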

d) Does Lloyd's Algorithm always converge? Why / why not?

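Yes. Each assignment step and each center-update step can only lower the cost or leave it unchanged, and there are only finitely many ways to assign the points to K clusters. The algorithm therefore cannot keep finding new, strictly better assignments forever, so it must eventually reach a state where the assignments no longer change. Note that this guarantees convergence to a local optimum, not necessarily to the global one.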
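The monotone decrease of the cost can be checked numerically on the 1-D dataset from part a). The sketch below records the cost after every assignment step and every update step; `lloyd_costs` is an illustrative name, and keeping the old centroid for an empty cluster is an assumption.

```python
def lloyd_costs(points, centroids):
    """Run 1-D Lloyd's algorithm and record the cost after each half-step."""
    costs = []
    while True:
        # Assignment step: label each point with the index of its nearest centroid.
        labels = [min(range(len(centroids)), key=lambda j: (p - centroids[j]) ** 2)
                  for p in points]
        costs.append(sum((p - centroids[l]) ** 2 for p, l in zip(points, labels)))
        # Update step: move each centroid to the mean of its members.
        # Assumption: an empty cluster keeps its previous centroid.
        new_centroids = []
        for j, c in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            new_centroids.append(sum(members) / len(members) if members else c)
        costs.append(sum((p - new_centroids[l]) ** 2 for p, l in zip(points, labels)))
        if new_centroids == centroids:
            return costs
        centroids = new_centroids

costs = lloyd_costs([0, .5, 1.5, 2, 6, 6.5, 7], [0, 2])
print(costs)
```

Every entry in `costs` is no larger than the one before it, and the sequence settles at 3.0 once the assignments stop changing.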

e) Follow along in class the implementation of Kmeans

In [1]:
import numpy as np
from PIL import Image as im
import matplotlib.pyplot as plt
import sklearn.datasets as datasets

centers = [[0, 0], [2, 2], [-3, 2], [2, -4]]
X, _ = datasets.make_blobs(n_samples=300, centers=centers, cluster_std=1, random_state=0)
print(X)

class KMeans():

    def __init__(self, data, k):
        self.data = data
        self.k = k
        self.assignment = [-1 for _ in range(len(data))]
        self.snaps = []
    
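    def snap(self, centers):
        # Save a scatter plot of the current assignment and centers as one GIF frame.
        TEMPFILE = "temp.png"

        fig, ax = plt.subplots()
        # Plot the data held by this instance rather than the global X.
        ax.scatter(self.data[:, 0], self.data[:, 1], c=self.assignment)
        ax.scatter(centers[:, 0], centers[:, 1], c='r')
        fig.savefig(TEMPFILE)
        plt.close(fig)
        # Copy the pixels into memory so the frame survives TEMPFILE being overwritten.
        self.snaps.append(im.fromarray(np.asarray(im.open(TEMPFILE))))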

    def initialize(self):
        return self.data[np.random.choice(range(len(self.data)),self.k, replace = False)]
    
    def distance(self,x,y):
        return  np.linalg.norm(x-y)
    
                         
    def assign(self, centers):
        for i in range(len(self.data)):
            delta = [float('inf'),0]
            for j in range(len(centers)):
                distance = self.distance(centers[j],self.data[i])
                if distance<delta[0]:
                    delta[0] = distance 
                    delta[1] = j 
            
            self.assignment[i] = delta[1]


            
    def get_centers(self):
        # Recompute each center as the mean of the points assigned to it.
        centers = []

        for i in sorted(set(self.assignment)):
            # Use self.data here; the bare name `data` is not defined in this scope.
            cluster = [self.data[j] for j in range(len(self.data)) if self.assignment[j] == i]
            centers.append(np.mean(cluster, axis=0))

        return np.array(centers)

    def is_diff_centers(self, centers, new_centers):
        # True if any coordinate of any center changed between iterations.
        return not np.array_equal(centers, new_centers)

    def lloyds(self):
        # Start from k randomly chosen data points and record the first frame.
        centers = self.initialize()
        self.assign(centers)
        self.snap(centers)
        new_centers = self.get_centers()

        # Alternate assignment and update steps until the centers stop moving.
        while self.is_diff_centers(centers, new_centers):
            self.assign(new_centers)
            centers = new_centers
            self.snap(centers)
            new_centers = self.get_centers()


kmeans = KMeans(X, 4)
kmeans.lloyds()
images = kmeans.snaps

# Combine the saved frames into an animated GIF (500 ms per frame, looping forever).
images[0].save(
    'kmeans.gif',
    optimize=False,
    save_all=True,
    append_images=images[1:],
    loop=0,
    duration=500
)

[[ 0.95008842 -0.15135721]
 [ 3.95591231  2.39009332]
 [-3.35343175  0.38352581]
 [ 1.5444675   2.01747916]
 [ 0.46566244 -1.53624369]
 [ 2.15650654  2.23218104]
 [ 3.31913688 -4.88241882]
 [ 0.66778835 -5.96862469]
 [ 1.45713852  2.41605005]
 [ 0.67229476  0.40746184]
 [ 2.94246812 -4.26759475]
 [-0.10321885  0.4105985 ]
 [-0.69456786 -0.14963454]
 [-0.23960406 -3.59850094]
 [ 2.52327666  1.82845367]
 [ 1.75798017 -2.48173883]
 [ 1.48248096 -4.97882986]
 [-3.56931205  2.26990435]
 [-3.39522898  0.84057948]
 [-3.97110457  2.3148172 ]
 [ 0.76103773  0.12167502]
 [ 0.85253135  1.56217996]
 [-0.50965218 -0.4380743 ]
 [-0.43515355  1.84926373]
 [-3.46684555  0.58309389]
 [-0.34791215  0.15634897]
 [-2.629175    2.14206181]
 [ 1.69098703  0.32399619]
 [-2.1994352   2.07826018]
 [ 2.28634369  2.60884383]
 [-3.01568211  2.16092817]
 [ 2.03700572 -3.23209759]
 [ 0.89561666  2.05216508]
 [-3.19065349  1.60515049]
 [ 0.0481959  -4.65989173]
 [ 0.4170616   2.61037938]
 [ 1.60055097  2.37005589]