# Week 10: K-means Clustering

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">

In this lab we'll implement the k-means algorithm in 3 parts:
    
    
- Initialise centroids
- Assign points to centroids
- Update centroid locations to cluster means

</div>

## Setup

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import os
import requests
from IPython.display import HTML

In [None]:
# Load stylesheet
HTML(requests.get('https://raw.githubusercontent.com/melbournebioinformatics/COMP90014/main/data/2023/style/custom.css').text)

In [None]:
# Handy function to fetch our files
def fetch_file(url, outpath='.'):
    response = requests.get(url)
    if response.status_code == 200:
        print('File found!')
        # Get the filename from the URL
        filename = os.path.basename(url).split('?', 1)[0]
        # Construct the filepath using the specified directory and filename
        filepath = os.path.join(outpath, filename)
        # Create the directory if it doesn't exist
        if not os.path.exists(outpath):
            print(f'Creating output dir: {outpath}')
            os.makedirs(outpath)
        # Check if the file already exists in the specified directory
        if os.path.exists(filepath):
            print(f'{filename} already exists in {outpath}. Skip download.')
        else:
            with open(filepath, 'wb') as f:
                f.write(response.content)
            print(f'Saved to: {filepath}')
    else:
        print(f'File not found: Code {response.status_code}')

In [None]:
files = ['kmeans_utilities.py']

for filename in files:
    url = f'https://github.com/melbournebioinformatics/COMP90014/blob/main/data/2023/Workshop_10/src/{filename}?raw=true'
    fetch_file(url,outpath='src')

## Data 

We will represent points in the space as tuples, and use lists of tuples for our dataset, like so:

In [None]:
# Five two-dimensional points
example_data_2d = [(2,3),(5,3.4),(1.3,0.2),(3.1,3),(2.2,4)]

# Five three-dimensional points
example_data_3d = [(2,3,1.2),(5,3.4,4),(1.3,0.2,5.2),(3.1,3,3),(2.2,4,2)]

These algorithms can be written more efficiently using numpy arrays. All the hints are set up to assume that you will write your functions using lists of tuples, but if you're experienced with Python, you can try using numpy arrays instead. Numpy implements vectorised maths and so it will be faster to run, and in many cases more concise to write.
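
As a rough illustration of what "vectorised" means here, the sketch below converts a list of tuples to a numpy array and back, and computes the distance from every point to one reference point without an explicit loop. The names `pts`, `arr` and `ref` are made up for this example and are not used elsewhere in the lab.

```python
import numpy as np

# Hypothetical example data: three 2D points as a list of tuples
pts = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]

# List of tuples -> array of shape (N, d)
arr = np.array(pts)

# Distance from every point to one reference point, with no explicit loop
ref = np.array([0.0, 0.0])
dists = np.sqrt(((arr - ref) ** 2).sum(axis=1))

# Mean of all points along each dimension
mean_point = arr.mean(axis=0)

# Array -> list of tuples again
back = [tuple(p) for p in arr]

print(arr.shape)
print(dists)
print(mean_point)
```

The same two conversions let you move between the list-of-tuples style used in the hints and the array style at any point in the exercises.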

We'll use 2D data points so it's easy to visualise our results. This code generates some "real" clusters probabilistically, and a smattering of random points all over the space.

In [None]:
# cluster1 is centred at (1,1) and has standard deviation 0.2, and 40 points
cluster1 = np.random.randn(40,2)*0.2+np.array([[1,1]])

# cluster2 is centred at (2,1), and has standard deviation 0.2, and 40 points
cluster2 = np.random.randn(40,2)*0.2+np.array([[2,1]])

# cluster3 is centred at (1.5,2), and has standard deviation 0.2, and 40 points
cluster3 = np.random.randn(40,2)*0.2+np.array([[1.5,2]])

background = np.random.uniform(low=[0,0],high=[3,3],size=(30,2))

In [None]:
# Merge our different datasets into one array
points_array = np.concatenate([cluster1,cluster2,cluster3,background])

# We'll represent the points as a list of tuples
points = [tuple(p) for p in points_array]

# Show the first five
print(points[:5])

In [None]:
# Plot the points from our list of tuples
x_values = [x for (x,y) in points]
y_values = [y for (x,y) in points]
plt.scatter(x_values, y_values)

In [None]:
# Note: If using a numpy array instead of a list, we'd write
plt.scatter(points_array[:,0],points_array[:,1])

## K-means 

The function to initialise centroids is provided for you. This function returns a list of k centroids, randomly placed. Notice that it is completely random - we could improve this function by trying to space the centroids far apart from one another.

In [None]:
def initialise_centroids(data, k):
    """
    Place centroids randomly into range of data of arbitrary dimension.
    Takes a list of N data points.
    Returns a list of k centroids, each of which will be a tuple of the same
    dimensionality as the data points.
    """
    d = len(data[0])
    
    # Convert the data to a numpy array so min/max can be taken per dimension
    arr = np.array(data)
    minvals = np.min(arr,axis=0)
    maxvals = np.max(arr,axis=0)
    centroids = np.random.uniform(low=minvals,high=maxvals,size=(k,d))
    
    # Return our centroids as a list of tuples
    
    return [tuple(c) for c in centroids]
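
One way to space the initial centroids apart, as suggested above, is a farthest-point heuristic. The sketch below is only an illustration: `initialise_centroids_spread` is a hypothetical name, and unlike the provided function it picks centroids from the data points themselves rather than from anywhere in the data range.

```python
import math
import random

def initialise_centroids_spread(data, k, seed=None):
    """
    Hypothetical alternative initialiser: pick the first centroid at random
    from the data, then repeatedly pick the data point that is farthest
    from its nearest already-chosen centroid.
    """
    rng = random.Random(seed)
    centroids = [tuple(rng.choice(data))]
    while len(centroids) < k:
        # For each point, the distance to its nearest chosen centroid
        nearest = [min(math.dist(p, c) for c in centroids) for p in data]
        # Take the point whose nearest centroid is farthest away
        farthest_index = nearest.index(max(nearest))
        centroids.append(tuple(data[farthest_index]))
    return centroids

toy = [(0, 0), (0.1, 0), (5, 5), (5.1, 5), (10, 0)]
print(initialise_centroids_spread(toy, 3, seed=1))
```

A heuristic like this tends to pick outliers as centroids, so on noisy data such as ours it is not necessarily better than random placement.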

### Exercise 1: Assign points to centroids

<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">
<b>Challenge:</b> Complete the function `assign_points()`. Given a list of k centroids and a list of N points, find the closest centroid to each point. Your function should return a list of N integers. Each integer should be a number from 0 to k-1, corresponding to the closest centroid for that point.
    
- [ ] Take in an array of shape (k,d) representing centroid coordinates, 
- [ ] and an array of shape (N,d) representing data coordinates.
- [ ] Assign each point to its closest centroid.
- [ ] Return a list of N integers where each value is between 0 and k-1, corresponding to the closest centroid for that point.
    
</div>



In [None]:
# This function is provided for you. It calculates Euclidean distance between two points.
from src.kmeans_utilities import euclidean_distance, plot_kmeans

print(euclidean_distance((1,1),(3,3)))
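
If you are curious what such a helper might look like, here is a minimal sketch. It assumes the provided function computes the standard straight-line distance; `euclidean_distance_sketch` is a made-up name and the actual code in `kmeans_utilities.py` may differ.

```python
import math

def euclidean_distance_sketch(p, q):
    """Straight-line distance between two points of equal dimension."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    # Square the difference in each dimension, sum, then take the square root
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

print(euclidean_distance_sketch((1, 1), (3, 3)))  # sqrt(8), roughly 2.828
```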

In [None]:
def assign_points(centroids, data):
    """
    Assign each point to its closest centroid.
    Take in an array of shape (k,d) representing centroid coordinates,
    and an array of shape (N,d) representing data coordinates.
    Return a list or array of N values where each value is between 0 and k-1
    and represents the centroid that the data point has been assigned to.
    """
    closest_centroids = []
    
    # Here is code for a wrong answer, which just assigns every point to cluster 0
    # (i.e. to the first centroid in the list).
    # Change this code to assign points to their nearest centroids.
    N = len(data)
    closest_centroids = [0]*N
    
    ### BEGIN SOLUTION
    # Reset the placeholder list so the result holds exactly N assignments
    closest_centroids = []

    for point in data:
        distances = []
        
        for centroid_position in centroids:
            dist = euclidean_distance(point,centroid_position)
            distances.append(dist)
        
        min_dist = min(distances)
        closest_centroid = distances.index(min_dist)
        closest_centroids.append(closest_centroid)
        
    ### END SOLUTION        
    
    return closest_centroids
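
For those trying the numpy route, the assignment step can be sketched with broadcasting. This is one possible approach under the assumption that both inputs convert cleanly to float arrays; `assign_points_np` is a hypothetical name, not part of the lab utilities.

```python
import numpy as np

def assign_points_np(centroids, data):
    """Vectorised sketch: index of the nearest centroid for each point."""
    c = np.asarray(centroids, dtype=float)   # shape (k, d)
    x = np.asarray(data, dtype=float)        # shape (N, d)
    # Broadcasting gives differences of shape (N, k, d)
    diffs = x[:, np.newaxis, :] - c[np.newaxis, :, :]
    sq_dists = (diffs ** 2).sum(axis=2)      # shape (N, k)
    # argmin over the centroid axis; ties go to the lowest index
    return sq_dists.argmin(axis=1).tolist()

demo_points = [(2, 3), (5, 3.4), (1.3, 0.2), (3.1, 3), (2.2, 4)]
demo_centroids = [(2, 2), (4, 4)]
print(assign_points_np(demo_centroids, demo_points))  # [0, 1, 0, 1, 1]
```

Squared distances are enough here because taking the square root does not change which centroid is closest.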

In [None]:
# Should return [0, 1, 0, 1, 1]
example_centroids = [(2,2),(4,4)]
assign_points(example_centroids, example_data_2d)

In [None]:
# Should return [1, 0, 1, 1, 1]
example_centroids = [(5,2,0),(3,1,1)]
assign_points(example_centroids, example_data_3d)

In [None]:
# Let's test our function by assigning datapoints to their nearest randomly initialised centroid.

# Init centroids 
random_centroids = initialise_centroids(points, 3)
# Assign points to centroid
clusters = assign_points(random_centroids, points)
# Plot the clusters
plot_kmeans(points, random_centroids, clusters, 3)

### Exercise 2: Calculate mean centroids


<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">

<b>Challenge:</b> Complete the function `calculate_mean_centroids()`. This function should take in the list of data points, the list of assignments to clusters, and k, and return a list of centroids.

The function `average_point()` is provided for you. Given a list of points, it finds the mean. You need to pass it the correct points for each cluster.

- [ ] Take as input: list of data point coordinates; list of cluster assignments; value of k
- [ ] Find the new mean centroid position for each cluster 
- [ ] Return a list of k centroid coordinates

</div>




In [None]:
# This function is provided for you
from src.kmeans_utilities import average_point

In [None]:
def calculate_mean_centroids(data, assignments, k):
    """
    Take list of N data points (a list of tuples)  
    and a list of N centroid assignments, 
    and return a list of k centroids.
    """
    
    # Here is an incorrect solution that just sets each centroid to (0,0)
    # (or (0,0,0), or (0,0,0,0) etc, depending on the dimension of the data points)
    # Replace this code so that the averages are calculated for each cluster.
    N = len(data)
    d = len(data[0])
    # Set every centroid to zero (wrong!)
    zero_centroid = tuple([0]*d) 
    updated_centroids = [zero_centroid]*k
    
    ### BEGIN SOLUTION
    updated_centroids = []
    
    for cluster in range(k):
        points = [point for point, assignment in zip(data, assignments) if assignment == cluster]
        updated_centroids.append(average_point(points))
    
    ### END SOLUTION   
    
    return updated_centroids
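
A numpy version of the update step might look like the sketch below. `calculate_mean_centroids_np` is a hypothetical name, and leaving an empty cluster as a row of NaN values is a choice made for this example only; how the provided `average_point()` treats an empty list is not shown in this notebook.

```python
import numpy as np

def calculate_mean_centroids_np(data, assignments, k):
    """Vectorised sketch: mean position of the points in each cluster."""
    x = np.asarray(data, dtype=float)     # shape (N, d)
    labels = np.asarray(assignments)      # shape (N,)
    centroids = np.full((k, x.shape[1]), np.nan)
    for cluster in range(k):
        members = x[labels == cluster]    # rows assigned to this cluster
        if len(members) > 0:
            centroids[cluster] = members.mean(axis=0)
        # An empty cluster keeps its NaN row (an assumption of this sketch)
    return centroids

demo_points = [(2, 3), (5, 3.4), (1.3, 0.2), (3.1, 3), (2.2, 4)]
print(calculate_mean_centroids_np(demo_points, [0, 1, 0, 1, 1], 2))
```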

In [None]:
# Should return centroids [( 1.65,  1.6 ), (3.43333333,  3.46666667)]
calculate_mean_centroids(example_data_2d, [0, 1, 0, 1, 1], 2)

In [None]:
# Should return centroids [(5, 3.4, 4), (2.15, 2.55, 2.85)]
calculate_mean_centroids(example_data_3d, [1, 0, 1, 1, 1], 2)

In [None]:
# Assign datapoints to a new randomly initialised centroid as before:

# Init centroids 
random_centroids = initialise_centroids(points, 3)

# Assign points to centroid
clusters = assign_points(random_centroids, points)

# Plot the clusters
plot_kmeans(points, random_centroids, clusters, 3)

In [None]:
# Now let's update centroid locations to mean point of each cluster:

updated_centroids = calculate_mean_centroids(points, clusters, 3)

# Plot the clusters with updated centroid locations 
# Note: We have not updated the cluster assignments yet, just the centroids!

plot_kmeans(points, updated_centroids, clusters, 3)

### Exercise 3: K-means clustering


<div style="color: rgb(27,94,32); background: rgb(200,230,201); border: solid 1px rgb(129,199,132); padding: 10px;">

<b>Challenge:</b> Complete the `kmeans()` function to carry out k-means clustering. You can use the functions you created in the first two exercises. You only need to fill in the missing lines in the iterative loop.

- [ ] Take as input: List of data coordinates as tuples; value of k
- [ ] Return a tuple of (<b>centroids</b>, <b>cluster_assignments</b>)
- [ ] <b>centroids</b> is a list of centroid points, and each centroid is a tuple
- [ ] <b>cluster_assignments</b> is a list of N numbers representing cluster assignments,
     where each number is between 0 and k-1
    
</div>

In [None]:
# These functions are provided for you (and used already below)
from src.kmeans_utilities import points_equal, plot_kmeans

In [None]:
def kmeans(data, k):
    """
    Implement k-means clustering on a given set of points.
    data should be a list of N points, where each point is a tuple.
    Returns a tuple of (centroids, cluster_assignments), where 
    centroids is a list of centroid points, and each centroid is a tuple; and
    cluster_assignments is a list of N numbers representing cluster assignments,
    where each number is between 0 and k-1.
    """
    
    N = len(data)
    d = len(data[0])
    centroids = initialise_centroids(data, k)
    cluster_assignments = assign_points(centroids, data)
    old_centroids = [(0,)*d]*k  # unlikely to be equal to centroids at start
    
    # We will use the stopping condition "no change in centroid location"
    while not points_equal(centroids, old_centroids):
        old_centroids = centroids
        
        ### Fill in the iterative k-means steps 
        # Optionally, if you'd like to plot what is happening at each step, 
        # uncomment the following line
        #plot_kmeans(data, centroids, cluster_assignments, k)
        
        ### BEGIN SOLUTION
    
        cluster_assignments = assign_points(centroids, data)
        centroids = calculate_mean_centroids(data, cluster_assignments, k)
        print(centroids)
        #plot_kmeans(data, centroids, cluster_assignments, k)
        
        ### END SOLUTION  
        
    return (centroids, cluster_assignments)
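
The loop above stops only when the centroids stop moving. Other stopping rules can be combined with it, for example an iteration cap and a small movement tolerance. The sketch below is a self-contained illustration in plain Python: the names `centroids_converged`, `kmeans_step` and `kmeans_bounded` are hypothetical, and it takes the starting centroids as an argument so that the result is repeatable.

```python
import math

def centroids_converged(old, new, tol=1e-6):
    """True if no centroid moved farther than tol (hypothetical helper)."""
    return all(math.dist(a, b) <= tol for a, b in zip(old, new))

def kmeans_step(data, centroids):
    """One assignment step followed by one update step, in plain Python."""
    labels = [min(range(len(centroids)),
                  key=lambda j: math.dist(p, centroids[j])) for p in data]
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, lab in zip(data, labels) if lab == j]
        if members:
            new_centroids.append(tuple(sum(v) / len(members) for v in zip(*members)))
        else:
            new_centroids.append(old)   # keep an empty cluster's centroid in place
    return new_centroids, labels

def kmeans_bounded(data, centroids, max_iter=100, tol=1e-6):
    """Stop on convergence or after max_iter iterations, whichever is first."""
    labels = []
    for iteration in range(1, max_iter + 1):
        new_centroids, new_labels = kmeans_step(data, centroids)
        # Stop if centroids barely moved or if no assignment changed
        done = centroids_converged(centroids, new_centroids, tol) or new_labels == labels
        centroids, labels = new_centroids, new_labels
        if done:
            break
    return centroids, labels, iteration

toy = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(kmeans_bounded(toy, [(0, 0), (1, 1)]))
```

The iteration cap guards against very slow convergence, while the tolerance avoids extra passes when centroids are only shifting by floating-point noise.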


Now test on the 2D data we created at the top of the notebook:

In [None]:
k = 3
centroids, assignments = kmeans(points, k)
print('List of centroid coordinates: ')
print(centroids)
print('List of cluster assignments: ')
print(assignments)

In [None]:
# And finally let's view our final cluster assignments:
plot_kmeans(points, centroids, assignments, k)
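
Because the centroids start at random positions, different runs can end in different clusterings. One common way to compare runs is the within-cluster sum of squared distances, which is the quantity k-means tries to reduce. The sketch below uses a hypothetical helper name, `within_cluster_ss`, and a tiny hand-made dataset.

```python
import math

def within_cluster_ss(data, centroids, assignments):
    """Sum of squared distances from each point to its assigned centroid."""
    return sum(math.dist(p, centroids[a]) ** 2 for p, a in zip(data, assignments))

toy = [(0, 0), (0, 2), (10, 0), (10, 2)]
# Two candidate clusterings of the same four points
good = within_cluster_ss(toy, [(0, 1), (10, 1)], [0, 0, 1, 1])
poor = within_cluster_ss(toy, [(5, 0), (5, 2)], [0, 1, 0, 1])
print(good, poor)
```

Lower values indicate tighter clusters, so you could run `kmeans()` several times and keep the result with the smallest value.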

## Bonus - DBSCAN
We're going to use the scikit-learn (`sklearn`) package to run DBSCAN, following the tutorial outlined [here](https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html#sphx-glr-auto-examples-cluster-plot-dbscan-py).

In [None]:
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs, make_circles, make_moons
from sklearn.preprocessing import StandardScaler

In [None]:
# Generate sample data
centers = [[1, 1], [-1.5, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,random_state=0)
# Other shapes to test:
#X, labels_true = make_circles(n_samples=750, factor=0.5, noise=0.05)
#X, labels_true = make_moons(n_samples=750, noise=0.05)

X = StandardScaler().fit_transform(X)

**Go ahead and play with the epsilon values and min sample values below to check the effects they have on the clusters**

In [None]:
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X) ## <- CHANGE THESE VALUES AND PLOT THE EFFECTS 

core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: {}'.format(n_clusters_))
print('Estimated number of noise points: {}'.format( n_noise_))

In [None]:
# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]
    class_member_mask = (labels == k)

    # Plot the core samples of this label (large markers)
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # Plot the non-core samples of this label (small markers):
    # border points for a cluster, or noise points when k == -1
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)
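
The large and small markers in the plot correspond to core and non-core points. To make those roles concrete, here is a small plain-Python sketch that labels each point as core, border or noise for a given `eps` and `min_samples`. `classify_points` is a hypothetical helper written for this illustration; it does not form clusters and is not how scikit-learn implements DBSCAN internally.

```python
import math

def classify_points(points, eps, min_samples):
    """
    Label each point 'core', 'border' or 'noise' (sketch only).
    A point counts itself as a neighbour, as scikit-learn's
    min_samples does.
    """
    # Indices of all points within eps of each point (including itself)
    neighbours = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    is_core = [len(n) >= min_samples for n in neighbours]
    roles = []
    for i in range(len(points)):
        if is_core[i]:
            roles.append('core')
        elif any(is_core[j] for j in neighbours[i]):
            roles.append('border')   # not dense itself, but near a core point
        else:
            roles.append('noise')    # not near any core point
    return roles

toy = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
print(classify_points(toy, eps=1.0, min_samples=3))
```

Try raising `min_samples` or lowering `eps` on the toy data and watch points move from core to border to noise.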

## Revision Questions

<div style="color: rgb(0,96,100); background: rgb(178,235,242); border: solid 1px rgb(77,208,225); padding: 10px;">
<b>Question 1:</b> Which of the following conditions are possible termination criteria for K-Means?

A. After a fixed number of iterations are done.
    
B. After point assignment to clusters no longer changes between iterations.

C. After centroids no longer change (or change very little) between iterations.
 
</div>

=== BEGIN MARK SCHEME ===

A,B,C 

=== END MARK SCHEME ===

YOUR ANSWER HERE

<div style="color: rgb(0,96,100); background: rgb(178,235,242); border: solid 1px rgb(77,208,225); padding: 10px;">
<b>Question 2:</b> Which type of points are removed by the DBSCAN algorithm?

A. Noise points
    
B. Points within epsilon distance of a core point
    
C. Border points with fewer than minPts within epsilon distance
    
</div>

=== BEGIN MARK SCHEME ===

A

=== END MARK SCHEME ===

YOUR ANSWER HERE

<div style="color: rgb(0,96,100); background: rgb(178,235,242); border: solid 1px rgb(77,208,225); padding: 10px;">
<b>Question 3:</b> Which of the following is true of DBSCAN?
    
    
A. DBSCAN can find arbitrarily shaped clusters.
    
B. DBSCAN is not sensitive to input parameters.
    
C. DBSCAN has a notion of noise and is robust to outliers.
    
D. DBSCAN does not require one to specify the number of clusters.
    
</div>

=== BEGIN MARK SCHEME ===

A, C, D

=== END MARK SCHEME ===

YOUR ANSWER HERE