# Getting Set Up
##### To compare clustering algorithms we’ll need two things: libraries to load and cluster the data, and visualization tools to look at the results of clustering.


In [1]:
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import sklearn.cluster as cluster
import time

%matplotlib inline
sns.set_context('poster')
sns.set_color_codes()
plot_kwds = {'alpha': 0.25, 's': 80, 'linewidths': 0}



# Loading the Data
##### We need some data to work with. To make this more interesting, we’ll use an artificial dataset that gives clustering algorithms a challenge: some non-globular clusters, some noise, and so on; the sorts of things we expect to crop up in messy real-world data.
##### The dataset is two-dimensional for visualization purposes.


In [None]:
data = np.load('clusterable_data.npy')

plt.scatter(data.T[0], data.T[1], c='b', **plot_kwds)
frame = plt.gca()
frame.axes.get_xaxis().set_visible(False)
frame.axes.get_yaxis().set_visible(False)
plt.show()
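
##### If 'clusterable_data.npy' is not at hand, a rough stand-in can be generated. The sketch below is an assumption, not the original dataset: the function name, cluster positions, and sizes are all hypothetical, chosen only to mix globular clusters, one non-globular cluster, and background noise.

```python
import numpy as np

def make_stand_in_data(n_per_cluster=200, n_noise=100, seed=0):
    """Hypothetical stand-in for clusterable_data.npy: blobs, an arc, and noise."""
    rng = np.random.default_rng(seed)
    # Two roughly globular clusters with different spreads.
    blob_a = rng.normal(loc=(-0.3, 0.3), scale=0.03, size=(n_per_cluster, 2))
    blob_b = rng.normal(loc=(0.3, -0.2), scale=0.06, size=(n_per_cluster, 2))
    # One non-globular cluster: points scattered along a half circle.
    theta = rng.uniform(0.0, np.pi, size=n_per_cluster)
    arc = np.column_stack((0.25 * np.cos(theta), 0.25 * np.sin(theta) - 0.1))
    arc += rng.normal(scale=0.01, size=arc.shape)
    # Uniform background noise over the bounding square.
    noise = rng.uniform(-0.5, 0.5, size=(n_noise, 2))
    return np.vstack((blob_a, blob_b, arc, noise))

stand_in = make_stand_in_data()
print(stand_in.shape)  # (700, 2)
```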


# Utility Function for Clustering and Plotting
##### To start, let's set up a utility function to perform clustering and plot the results. We will also time the clustering algorithm and display the time taken.


In [None]:
def plot_clusters(data, algorithm, args, kwds):
    """Fit `algorithm` on `data`, time the fit, and scatter-plot the labels."""
    # Time only the clustering step, not the plotting.
    start_time = time.time()
    labels = algorithm(*args, **kwds).fit_predict(data)
    end_time = time.time()

    # One palette color per cluster; noise points (label -1) are drawn in black.
    palette = sns.color_palette('deep', labels.max() + 1)
    colors = [palette[x] if x >= 0 else (0.0, 0.0, 0.0) for x in labels]

    plt.scatter(data.T[0], data.T[1], c=colors, **plot_kwds)
    # Hide the axes: only the shape of the clusters matters here.
    frame = plt.gca()
    frame.axes.get_xaxis().set_visible(False)
    frame.axes.get_yaxis().set_visible(False)
    plt.title('Clusters found by {}'.format(algorithm.__name__), fontsize=24)
    plt.text(-0.5, 0.7, 'Clustering took {:.2f} s'.format(end_time - start_time), fontsize=14)
    plt.show()
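
##### The utility function only assumes that the algorithm is a class whose instances offer `fit_predict`. The sketch below isolates that contract and the timing step, without any plotting. `SignSplit` and `timed_fit_predict` are hypothetical names introduced for illustration, and `time.perf_counter` is swapped in as a monotonic alternative to `time.time`.

```python
import time
import numpy as np

class SignSplit:
    """Hypothetical toy estimator: labels points by the sign of one coordinate."""

    def __init__(self, axis=0):
        self.axis = axis

    def fit_predict(self, data):
        return (data[:, self.axis] > 0).astype(int)

def timed_fit_predict(data, algorithm, args, kwds):
    """Instantiate `algorithm`, run fit_predict, and return (labels, seconds)."""
    start = time.perf_counter()
    labels = algorithm(*args, **kwds).fit_predict(data)
    elapsed = time.perf_counter() - start
    return labels, elapsed

points = np.array([[-1.0, 2.0], [0.5, -1.0], [2.0, 3.0]])
labels, elapsed = timed_fit_predict(points, SignSplit, (), {'axis': 0})
print(labels)  # [0 1 1]
```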


# K-Means Clustering
##### K-Means is a commonly used clustering algorithm. It partitions the data into a specified number of clusters, which we know is six for our dataset. Let's see how K-Means performs.


In [None]:
plot_clusters(data, cluster.KMeans, (), {'n_clusters': 6})
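
##### To make the partitioning idea concrete, here is a minimal sketch of the classic assign-then-update loop (Lloyd's algorithm) in plain NumPy. It is not scikit-learn's implementation; `lloyd_kmeans` and its parameters are hypothetical, and initialization is simply a random choice of data points.

```python
import numpy as np

def lloyd_kmeans(data, n_clusters, n_iter=50, seed=0):
    """Plain Lloyd iterations: assign points to the nearest center, then re-average."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from randomly chosen, distinct data points.
    centers = data[rng.choice(len(data), size=n_clusters, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest center for every point.
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: move each non-empty cluster's center to the mean of its points.
        new_centers = centers.copy()
        for k in range(n_clusters):
            members = data[labels == k]
            if len(members) > 0:
                new_centers[k] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                [10.0, 0.0], [10.1, 0.0], [10.2, 0.0]])
toy_labels, toy_centers = lloyd_kmeans(toy, n_clusters=2)
print(toy_labels)
```

##### Because every point is assigned to its nearest center, noise points always end up in some cluster, which is worth keeping in mind when reading the plot.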


# Affinity Propagation Clustering
##### Affinity Propagation uses a graph-based approach in which points exchange messages to elect exemplars, and clusters form around those exemplars. It does not require the number of clusters to be specified, but other parameters such as 'preference' and 'damping' need to be tuned.


In [None]:
plot_clusters(data, cluster.AffinityPropagation, (), {'preference': -5.0, 'damping': 0.95})
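
##### One way to see what 'preference' does: the algorithm works on a matrix of pairwise similarities, and the preference is the value placed on its diagonal, that is, how strongly each point is considered as its own exemplar. The sketch below builds only that input matrix (not the message passing), assuming negative squared Euclidean distance as the similarity. `similarity_with_preference` is a hypothetical helper, and the median fallback is an assumption about a common default rather than a statement of any library's exact behavior.

```python
import numpy as np

def similarity_with_preference(data, preference=None):
    """Similarity matrix with the preference written onto the diagonal."""
    data = np.asarray(data, dtype=float)
    # Similarity = negative squared Euclidean distance (closer points score higher).
    diffs = data[:, None, :] - data[None, :, :]
    sim = -np.sum(diffs ** 2, axis=2)
    if preference is None:
        # Assumed fallback: median of the off-diagonal similarities.
        off_diag = sim[~np.eye(len(data), dtype=bool)]
        preference = np.median(off_diag)
    # Lower (more negative) preferences tend to yield fewer exemplars.
    np.fill_diagonal(sim, preference)
    return sim

pair = np.array([[0.0, 0.0], [3.0, 4.0]])
sim = similarity_with_preference(pair, preference=-5.0)
print(sim)
```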


# Mean Shift Clustering
##### Mean Shift is another centroid-based algorithm that does not require specifying the number of clusters. It uses kernel density estimation to place centroids at the maxima of the density function.


In [None]:
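# Recent scikit-learn releases accept MeanShift parameters as keywords only,
# so the bandwidth is passed by name rather than positionally.
plot_clusters(data, cluster.MeanShift, (), {'bandwidth': 0.175, 'cluster_all': False})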
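
##### The hill-climbing behind Mean Shift can be sketched in a few lines. This is a simplified version with a flat kernel, not scikit-learn's implementation: every point is repeatedly moved to the mean of the data points within `bandwidth` of it until it stops moving. `mean_shift_modes` is a hypothetical name, and merging the resulting modes into labels is reduced here to rounding.

```python
import numpy as np

def mean_shift_modes(data, bandwidth, n_iter=100, tol=1e-6):
    """Shift every point to the mean of its neighbours until it stops moving."""
    data = np.asarray(data, dtype=float)
    modes = data.copy()
    for _ in range(n_iter):
        # Flat kernel: weight 1 for data points within `bandwidth`, else 0.
        dists = np.linalg.norm(modes[:, None, :] - data[None, :, :], axis=2)
        weights = (dists <= bandwidth).astype(float)
        # Each mode starts on a data point, so its neighbourhood is non-empty.
        shifted = weights @ data / weights.sum(axis=1, keepdims=True)
        converged = np.abs(shifted - modes).max() < tol
        modes = shifted
        if converged:
            break
    return modes

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                [10.0, 0.0], [10.1, 0.0], [10.2, 0.0]])
modes = mean_shift_modes(toy, bandwidth=1.0)
# Points that climbed to the same density peak share a (rounded) mode.
distinct_modes = np.unique(np.round(modes, 3), axis=0)
print(len(distinct_modes))  # 2
```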


# Spectral Clustering
##### Spectral Clustering transforms the data using the Laplacian of the graph induced by the distances between points. It then uses a standard clustering algorithm like K-Means on the transformed space.


In [None]:
plot_clusters(data, cluster.SpectralClustering, (), {'n_clusters': 6})
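
##### The transformation step can be illustrated with a small sketch, under stated assumptions: an RBF affinity with a hypothetical `gamma`, the symmetric normalized Laplacian, and row-normalized eigenvectors as the new coordinates. This is one common variant, not necessarily what scikit-learn does internally, and `spectral_embedding_sketch` is a made-up name.

```python
import numpy as np

def spectral_embedding_sketch(data, n_components, gamma=1.0):
    """Embed points using the bottom eigenvectors of a normalized graph Laplacian."""
    data = np.asarray(data, dtype=float)
    # Graph induced by distances: RBF affinity, no self-loops.
    sq_dists = np.sum((data[:, None, :] - data[None, :, :]) ** 2, axis=2)
    affinity = np.exp(-gamma * sq_dists)
    np.fill_diagonal(affinity, 0.0)
    # Symmetric normalized Laplacian: L = I - D^(-1/2) W D^(-1/2).
    inv_sqrt_deg = 1.0 / np.sqrt(affinity.sum(axis=1))
    laplacian = np.eye(len(data)) - inv_sqrt_deg[:, None] * affinity * inv_sqrt_deg[None, :]
    # eigh returns eigenvalues in ascending order; keep the smallest ones.
    _, eigvecs = np.linalg.eigh(laplacian)
    embedding = eigvecs[:, :n_components]
    # Row-normalize so that each point lies on the unit sphere.
    return embedding / np.linalg.norm(embedding, axis=1, keepdims=True)

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                [10.0, 0.0], [10.1, 0.0], [10.2, 0.0]])
embedded = spectral_embedding_sketch(toy, n_components=2)
# Cosine similarity between embedded points: near 1 within a group, near 0 across.
gram = embedded @ embedded.T
print(np.round(gram, 3))
```

##### In the embedded space the two groups collapse to two nearly orthogonal directions, which is why a simple algorithm such as K-Means can separate them afterwards.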


# Agglomerative Clustering
##### Agglomerative Clustering starts with each point as its own cluster and merges the closest pairs of clusters iteratively. It can be used to build a hierarchy of clusters and then cut it to form a flat clustering.


In [None]:
plot_clusters(data, cluster.AgglomerativeClustering, (), {'n_clusters': 6, 'linkage': 'ward'})
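
##### The merge loop can be written naively as follows. This is a deliberately slow sketch for small inputs, not how scikit-learn implements it: `ward_agglomerate` is a hypothetical name, and the merge cost is the Ward criterion (the increase in within-cluster sum of squares caused by a merge).

```python
import numpy as np

def ward_agglomerate(data, n_clusters):
    """Repeatedly merge the pair of clusters with the smallest Ward cost."""
    data = np.asarray(data, dtype=float)
    # Start with every point as its own cluster (lists of point indices).
    clusters = [[i] for i in range(len(data))]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pa, pb = data[clusters[a]], data[clusters[b]]
                gap = pa.mean(axis=0) - pb.mean(axis=0)
                # Ward cost: size-weighted squared distance between centroids.
                weight = len(pa) * len(pb) / (len(pa) + len(pb))
                cost = weight * float(gap @ gap)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    # Cutting the hierarchy at n_clusters gives the flat labelling.
    labels = np.empty(len(data), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                [10.0, 0.0], [10.1, 0.0], [10.2, 0.0]])
print(ward_agglomerate(toy, n_clusters=2))  # [0 0 0 1 1 1]
```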


# DBSCAN Clustering
##### DBSCAN is a density-based clustering algorithm that forms clusters from dense regions in the data. It does not require the number of clusters to be specified, but it does need other parameters such as the neighborhood radius 'eps'.


In [None]:
plot_clusters(data, cluster.DBSCAN, (), {'eps': 0.025})
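
##### A compact sketch of the density idea, assuming the usual convention that a core point has at least `min_samples` points (itself included) within `eps`. `simple_dbscan` is a hypothetical, quadratic-memory illustration rather than scikit-learn's implementation; points that no cluster reaches keep the noise label -1.

```python
import numpy as np

def simple_dbscan(data, eps, min_samples=5):
    """Label dense regions; points reachable from no core point stay at -1."""
    data = np.asarray(data, dtype=float)
    n = len(data)
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    neighbors = [np.flatnonzero(dists[i] <= eps) for i in range(n)]
    # A point counts itself, so an isolated point has exactly one neighbour.
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])
    labels = np.full(n, -1)
    cluster_id = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue
        # Grow a new cluster outward from this unlabelled core point.
        labels[i] = cluster_id
        stack = [i]
        while stack:
            j = stack.pop()
            if not is_core[j]:
                continue  # border points join a cluster but do not expand it
            for nb in neighbors[j]:
                if labels[nb] == -1:
                    labels[nb] = cluster_id
                    stack.append(nb)
        cluster_id += 1
    return labels

toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                [10.0, 0.0], [10.1, 0.0], [10.2, 0.0],
                [5.0, 5.0]])
print(simple_dbscan(toy, eps=0.25, min_samples=3).tolist())  # [0, 0, 0, 1, 1, 1, -1]
```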


# HDBSCAN Clustering
##### HDBSCAN extends DBSCAN by allowing for varying density clusters and removing the need to specify an epsilon parameter. It uses a hierarchy of clusters and selects the most stable clusters.


In [None]:
import hdbscan

plot_clusters(data, hdbscan.HDBSCAN, (), {'min_cluster_size': 15})
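
##### One ingredient that lets HDBSCAN cope with varying density is that it compares points by a density-aware distance rather than the raw one. The sketch below computes such a distance, commonly called mutual reachability: the larger of the raw distance and the two points' core distances. It covers only this first step, not the hierarchy or the stability-based selection. `mutual_reachability` is a hypothetical helper, and whether the k-th neighbour counts the point itself is an assumption that may differ from the `hdbscan` library.

```python
import numpy as np

def mutual_reachability(data, k=5):
    """Pairwise max(core_k(a), core_k(b), d(a, b)) for all pairs of points."""
    data = np.asarray(data, dtype=float)
    dists = np.linalg.norm(data[:, None, :] - data[None, :, :], axis=2)
    # Core distance: distance to the k-th nearest neighbour
    # (column 0 of each sorted row is the point itself at distance 0).
    core = np.sort(dists, axis=1)[:, k]
    # Points in sparse regions have large core distances and are pushed apart.
    return np.maximum(dists, np.maximum(core[:, None], core[None, :]))

line = np.array([[0.0, 0.0], [1.0, 0.0], [3.0, 0.0], [10.0, 0.0]])
mreach = mutual_reachability(line, k=1)
print(mreach)
```

##### In this toy example the isolated point at x = 10 has a core distance of 7, so every distance involving it is at least 7, while distances inside the denser group are left almost unchanged.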
