# Python Assignment --- $K$-Means

## Description

Implement the $K$-Means algorithm and apply it to a specific 2-dimensional dataset.

*Use `Run All` to generate the data and produce the standard output.*
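
As a rough illustration of what one $K$-Means iteration does (assign every point to its nearest center, then move every center to the mean of its points), here is a minimal sketch on made-up points. The function `one_step` and the `toy_` values are hypothetical names used only for this example, and the sketch assumes that no cluster ever becomes empty.

```python
def one_step(points, centers):
    """Run one assignment + update step; return (buckets, new_centers)."""
    buckets = [[] for _ in centers]
    for i, p in enumerate(points):
        # Squared Euclidean distance from p to every center; pick the nearest.
        d2 = [sum((p[j] - c[j]) ** 2 for j in range(2)) for c in centers]
        buckets[d2.index(min(d2))].append(i)
    # New center = mean of the assigned points (assumes no bucket is empty).
    new_centers = [
        [sum(points[i][j] for i in b) / len(b) for j in range(2)]
        for b in buckets
    ]
    return buckets, new_centers


# Made-up points: two near the origin, two near (10, 10).
toy_points = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0], [11.0, 10.0]]
toy_buckets, toy_centers = one_step(toy_points, [toy_points[0], toy_points[2]])
print(toy_buckets)   # [[0, 1], [2, 3]]
print(toy_centers)   # [[0.5, 0.0], [10.5, 10.0]]
```

Repeating this step a fixed number of times is one simple way to structure the full algorithm.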

## Generating data

This section defines the inputs: `k` is the number of clusters $k$, `rep` is the number of iterations, and `data` is a list of two-element lists, each of which can be interpreted as a point. In addition, `cid` holds the indices (into `data`) of the initial center points.
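
To make the input format concrete, here is a small sketch with made-up values. The `toy_` names are hypothetical and not part of the assignment; the sketch only shows how `cid` selects the initial centers from `data`.

```python
# Toy inputs in the same format as the assignment (all values are made up).
toy_k = 2                      # number of clusters
toy_rep = 3                    # number of iterations to run
toy_data = [[0.0, 0.0], [0.5, 0.2], [9.0, 9.5], [10.0, 10.0]]  # list of [x, y] points
toy_cid = [0, 2]               # indices into toy_data of the initial centers

# The initial centers are the points that toy_cid refers to.
toy_initial_centers = [toy_data[c] for c in toy_cid]
print(toy_initial_centers)     # [[0.0, 0.0], [9.0, 9.5]]
```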

In [None]:
import random

In [None]:
mu = [[0., 0.], [2., 3.], [-1., 4.], [10., 10.]]
sigma = [[1., 1.], [0.7, 0.9], [0.6, 0.4], [2., 3.]]
num = [50, 150, 100, 200]
k = 4

In [None]:
seed = 0
random.seed(seed)

In [None]:
data = []
for i in range(k):
    data += [[random.gauss(mu[i][j], sigma[i][j]) for j in range(2)] for r in range(num[i])]

In [None]:
print(data[:5])

In [None]:
rep = 40
cid = [0, 1, 2, 3]

(A simple visualization, which you may skip.)

In [None]:
import matplotlib.pyplot

In [None]:
matplotlib.pyplot.scatter([point[0] for point in data], [point[1] for point in data])
matplotlib.pyplot.show()

## Key

Skip this part on your first reading. The key is the reference solution used to generate the standard output.

In [None]:
def key_distance2(point1, point2):
    return sum((point1[i] - point2[i])**2 for i in range(2))

In [None]:
def key_centroid(data, pid):
    """Mean of the points whose indices are in pid (pid must be non-empty)."""
    return [sum(data[p][j] for p in pid) / len(pid) for j in range(2)]

In [None]:
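def key_k_means(k, data, cid, rep):
    """Reference K-Means: return the k buckets of point indices (assumes rep >= 1)."""

    # Initial centers are the data points whose indices are listed in cid.
    center = [data[cid[i]] for i in range(k)]

    for r in range(rep):

        # One empty bucket of point indices per cluster.
        bucket = [[] for i in range(k)]

        # Assignment step: put each point index into the bucket of its nearest center.
        for i, e in enumerate(data):
            min_bucket, min_distance2 = -1, float("inf")
            for b in range(k):
                d = key_distance2(center[b], e)
                if d < min_distance2:
                    min_bucket, min_distance2 = b, d
            bucket[min_bucket].append(i)

        # Update step: move each center to the centroid of its bucket.
        center = [key_centroid(data, bucket[i]) for i in range(k)]

    # Buckets from the last assignment step.
    return bucket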

In [None]:
key_bucket = key_k_means(k, data, cid, rep)

## Standard output

In [None]:
for i in range(k):
    print("{0}: {1!r}".format(i, key_bucket[i]))

## Your implementation

Give your answer as a list `bucket` of $k$ lists, one per cluster, where each inner list contains the indices (into `data`) of the points assigned to that cluster. Try to reproduce the standard output shown in the previous section.
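
One possible way to check a result against the key is sketched below. `same_clusters` is a hypothetical helper, not part of the assignment. It assumes that clusters are compared position by position and that the order of indices inside a cluster does not matter.

```python
def same_clusters(a, b):
    """Return True if two lists of index lists describe the same clusters.

    Assumption: clusters are compared position by position, and the order
    of the indices inside each cluster is ignored.
    """
    return len(a) == len(b) and all(sorted(x) == sorted(y) for x, y in zip(a, b))


# Made-up examples of matching and non-matching results.
print(same_clusters([[0, 1], [2, 3]], [[1, 0], [3, 2]]))  # True
print(same_clusters([[0, 1], [2, 3]], [[0], [1, 2, 3]]))  # False
```

Once `bucket` is filled in, `same_clusters(bucket, key_bucket)` would be expected to return `True` if your implementation follows the same steps as the key.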

In [None]:
# Your code here

In [None]:
for i in range(k):
    print("{0}: {1!r}".format(i, bucket[i]))