# BMI ML Bootcamp #2 - Unsupervised Learning

In [None]:
import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

from sklearn.cluster import AgglomerativeClustering
from scipy.cluster.hierarchy import dendrogram

### Make blobby dataset

Generate a random dataset to cluster. If n_samples is an int and centers is None, 3 clusters are generated. If n_samples is a list of ints, len(n_samples) clusters are generated, and the cluster at each index gets the number of points listed at that index.

In [None]:
X, y = make_blobs(n_samples = ?, centers = ?)
plt.scatter(X[:,0], X[:,1])
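As a quick illustration of the list form of n_samples described above, here is a minimal sketch. The sizes 50, 100, 150 and the random_state are arbitrary choices for the demo, not values the exercise requires.

```python
import numpy as np
from sklearn.datasets import make_blobs

# Assumed values for illustration: three clusters of unequal size.
# When n_samples is a list, centers must be None or an explicit array of centers.
X_demo, y_demo = make_blobs(n_samples=[50, 100, 150], centers=None, random_state=0)

print(X_demo.shape)         # (n_points, n_features); n_features defaults to 2
print(np.bincount(y_demo))  # number of points carrying each cluster label
```

The returned y_demo holds the true cluster label of each point, which is handy later for checking how well a clustering recovers the blobs.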

## k-Means Clustering

The skeleton of k-means clustering is implemented for you below. Fill in the blanks (or feel free to erase what I did, if it interferes with your implementation). Do you get the clusters that you would expect? How does changing k impact the clusters that you get?

In [None]:
def assign_points(centroids, data):
    clusters = []
    for i in range(len(centroids)):
        clusters.append([])
        
    for point in data:
        distances = [np.linalg.norm(point - centroid) for centroid in centroids] 
        clusters[np.argmin(distances)].append(point)
    return clusters
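If you prefer integer labels over lists of points, the same nearest-centroid rule can be written with NumPy broadcasting. This is only a sketch of an alternative; assign_labels and the toy arrays are hypothetical names that are not part of the skeleton.

```python
import numpy as np

def assign_labels(centroids, data):
    """Return, for each point, the index of its nearest centroid."""
    centroids = np.asarray(centroids, dtype=float)
    data = np.asarray(data, dtype=float)
    # distances[i, j] is the Euclidean distance from point i to centroid j
    distances = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    return np.argmin(distances, axis=1)

# Toy data: two points near the origin, two points near (5, 5)
toy_data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
toy_centroids = np.array([[0.0, 0.0], [5.0, 5.0]])
toy_labels = assign_labels(toy_centroids, toy_data)
print(toy_labels)  # [0 0 1 1]
```

Labels in this form can be passed directly as the c argument of plt.scatter.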

In [None]:
def run_kmeans(k, data):
    
    #initialize random centroids
    new_centroids = ?
    
    #starting clusters
    clusters = ?
    
    # NaN placeholders (one per centroid) so the loop body runs at least once
    old_centroids = [np.full((2,), np.nan)] * k
    
    iterations = 0
    while not np.array_equal(old_centroids,new_centroids): #if the centroids moved
        old_centroids = ? #Save old centroids
        new_centroids = ? #Calculate new centroids
        clusters = ? #assign new clusters
        iterations += 1
    
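    # By this point, clusters should be a list of k lists of points.
    # Recover one label per point by finding the cluster that contains it
    # (this assumes data has no exact duplicate points).
    labels = [[sum((point == cpoint).all() for cpoint in cluster) for cluster in clusters].index(1) for point in data]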
    return labels, iterations
    

Show a scatter plot of your data, colored by cluster label. Does the labelling look reasonable?

Run k-means clustering 10 times. How many iterations does the algorithm take to converge, on average?
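One way to set up the repeated-run experiment is sketched below. It uses scikit-learn's KMeans with random initialization as a stand-in for your own run_kmeans, so the iteration counts will not match your implementation exactly; the dataset and seeds are assumptions for the demo.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=3, random_state=0)

iteration_counts = []
for seed in range(10):
    # init="random" and n_init=1 give a single run from random starting centroids
    model = KMeans(n_clusters=3, init="random", n_init=1, random_state=seed).fit(X_demo)
    iteration_counts.append(model.n_iter_)  # iterations this run took

mean_iterations = float(np.mean(iteration_counts))
print(iteration_counts, mean_iterations)
```

With your own function, the loop body would instead collect the second value returned by run_kmeans(k, X).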

Now implement k-means++ for the initial centroid selection, then run k-means++ 10 times. Do you see a decrease in the number of iterations that the algorithm takes to converge?
(https://en.wikipedia.org/wiki/K-means%2B%2B has the k-means++ initialization pseudocode)
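For reference, the seeding step can be sketched as follows: pick the first centroid uniformly at random, then pick each further centroid with probability proportional to its squared distance from the nearest centroid chosen so far. The function name kmeans_pp_init and its rng argument are hypothetical, and this is one possible reading of the pseudocode rather than a definitive implementation.

```python
import numpy as np

def kmeans_pp_init(k, data, rng=None):
    """Pick k starting centroids from data using k-means++ style seeding."""
    rng = np.random.default_rng(rng)
    data = np.asarray(data, dtype=float)
    # First centroid: a data point chosen uniformly at random
    centroids = [data[rng.integers(len(data))]]
    for _ in range(1, k):
        # Squared distance from every point to its nearest chosen centroid
        diffs = data[:, None, :] - np.asarray(centroids)[None, :, :]
        sq_dist = np.min(np.sum(diffs ** 2, axis=2), axis=1)
        # Sample the next centroid with probability proportional to sq_dist
        # (assumes the points are not all identical, so the sum is nonzero)
        probs = sq_dist / sq_dist.sum()
        centroids.append(data[rng.choice(len(data), p=probs)])
    return np.array(centroids)

demo_points = np.random.default_rng(0).normal(size=(60, 2))
init_centroids = kmeans_pp_init(3, demo_points, rng=1)
print(init_centroids)
```

Because already-chosen points have zero squared distance, they cannot be selected twice, which tends to spread the starting centroids out.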

In [None]:
def run_kmeans_pp(k, data):
    
    # initialize centroids using the k-means++ approach
    new_centroids = ?

    # Fill in the rest from your k-means implementation

In [None]:
Plot the initial k-means centroids and the k-means++ centroids. Do the k-means++ centroids look different?

## Evaluating clusters

Run k-means with various values of k, and plot the silhouette score for the resulting clusters. What value of k seems to be the best? Does that match your intuition when visually inspecting the data?
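A sketch of the sweep over k is shown below. It uses scikit-learn's KMeans in place of your own implementation and a deliberately well-separated dataset (the centers and cluster_std are assumptions for the demo), so the best k is easy to see.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Three well-separated blobs, chosen so the expected answer is unambiguous
X_demo, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 0], [10, 10]],
                       cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores, best_k)
```

To plot the result, something like plt.plot(list(scores), list(scores.values())) should work; higher scores indicate tighter, better-separated clusters.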

## Agglomerative clustering

Using the scikit-learn agglomerative clustering implementation, cluster your data for a few values of n_clusters. Does this give you similar clusters to those you saw in k-means?

In [None]:
fig, axes = plt.subplots(1, 4, figsize=(20,5))

for ax_index, i in enumerate(range(?)):
    labels = AgglomerativeClustering(n_clusters = ?).fit_predict(X)
    axes[ax_index].scatter(X[:,0], X[:,1], c = labels)
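To compare the agglomerative result with k-means numerically rather than by eye, one option is the adjusted Rand index, which ignores how the labels are numbered. This is a sketch on an assumed, well-separated dataset, using scikit-learn's KMeans as a stand-in for your own implementation.

```python
from sklearn.cluster import AgglomerativeClustering, KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X_demo, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 0], [10, 10]],
                       cluster_std=1.0, random_state=0)

agglo_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X_demo)
kmeans_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_demo)

# 1.0 means the two partitions are identical up to a relabelling of clusters
agreement = adjusted_rand_score(agglo_labels, kmeans_labels)
print(agreement)
```

On overlapping blobs the agreement usually drops below 1, which is a useful prompt for looking at where the two methods disagree.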

In [None]:
#taken from https://scikit-learn.org/stable/auto_examples/cluster/plot_agglomerative_dendrogram.html

def plot_dendrogram(model, **kwargs):
    # Create linkage matrix and then plot the dendrogram

    # create the counts of samples under each node
    counts = np.zeros(model.children_.shape[0])
    n_samples = len(model.labels_)
    for i, merge in enumerate(model.children_):
        current_count = 0
        for child_idx in merge:
            if child_idx < n_samples:
                current_count += 1  # leaf node
            else:
                current_count += counts[child_idx - n_samples]
        counts[i] = current_count

    linkage_matrix = np.column_stack([model.children_, model.distances_,
                                      counts]).astype(float)

    # Plot the corresponding dendrogram
    dendrogram(linkage_matrix, **kwargs)

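Using the plot_dendrogram function above, visualize the cluster hierarchy from a model fit with distance_threshold = 0 and n_clusters = None (these settings make scikit-learn build the full tree and store the merge distances that the function needs). Does the dendrogram reflect the same clusters that you saw?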
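The fitting step can be sketched as follows. The small dataset is an assumption for the demo, and the linkage matrix is rebuilt inline (mirroring plot_dendrogram) with no_plot=True so the structure can be inspected without drawing anything.

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=30, centers=3, random_state=0)

# distance_threshold=0 with n_clusters=None builds the full merge tree
model = AgglomerativeClustering(distance_threshold=0, n_clusters=None).fit(X_demo)

# Count the samples under each merge node, as plot_dendrogram does
n_samples = len(model.labels_)
counts = np.zeros(model.children_.shape[0])
for i, merge in enumerate(model.children_):
    counts[i] = sum(1 if child < n_samples else counts[child - n_samples]
                    for child in merge)

linkage_matrix = np.column_stack([model.children_, model.distances_, counts]).astype(float)
tree = dendrogram(linkage_matrix, no_plot=True)  # returns the layout as a dict
print(linkage_matrix.shape, len(tree["leaves"]))
```

In the notebook itself, passing the fitted model to plot_dendrogram(model) draws the same tree; keyword arguments such as truncate_mode="level", p=3 are forwarded to scipy's dendrogram.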

Scikit-learn has a bunch of other clustering algorithms, including DBSCAN, nearest neighbor, affinity propagation, etc. This tutorial shows their performance: https://scikit-learn.org/stable/auto_examples/cluster/plot_cluster_comparison.html. Try importing a few and playing around with them. Which are the fastest? Do some fail on globular clusters, like those you generated?
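As a starting point, here is a sketch that runs and times DBSCAN on tight, well-separated blobs. The eps and min_samples values are guesses that suit this particular demo dataset and will likely need tuning on your own data.

```python
import time
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X_demo, _ = make_blobs(n_samples=300, centers=[[-10, -10], [0, 0], [10, 10]],
                       cluster_std=0.5, random_state=0)

start = time.perf_counter()
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_demo)
elapsed = time.perf_counter() - start

# DBSCAN marks noise points with the label -1, so exclude it when counting clusters
n_found = len(set(labels.tolist()) - {-1})
print(n_found, elapsed)
```

Unlike k-means, DBSCAN is not told how many clusters to find; the count falls out of the density parameters, which is why it behaves differently on overlapping blobs.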