# Unsupervised Learning: Clustering

This notebook walks through an example of clustering with DBSCAN, using k-means as a point of comparison.

Based on *Machine Learning for Physics and Astronomy*, Viviana Acquaviva (2023), and Jake VanderPlas's book [Python Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/05.00-machine-learning.html).



In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs #create blobs
from sklearn.cluster import KMeans
from scipy.spatial.distance import cdist
from sklearn import metrics

### "Smiley face" distribution
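We generate sets of points arranged as a smiley face. The resulting shapes are not convex. A set is convex if, for any two points within it, the line segment connecting them also lies entirely within the set. The ring that outlines the face fails this test: a segment joining two points on opposite sides crosses the interior, where the ring has no points.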
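To make the convexity definition concrete, here is a minimal sketch that tests the midpoint of two points on an idealized ring. The helper `on_ring` and its tolerance `tol` are hypothetical names introduced for illustration, and the sketch ignores the small vertical jitter that the notebook adds to the ring.

```python
from math import cos, sin, pi, hypot

def on_ring(x, y, h=1.0, k=2.0, r=1.0, tol=0.05):
    """True if (x, y) lies within `tol` of the circle of radius r centred at (h, k)."""
    return abs(hypot(x - h, y - k) - r) <= tol

# Two points on opposite sides of the ring (theta = 0 and theta = pi)
p = (1.0 + cos(0.0), 2.0 + sin(0.0))
q = (1.0 + cos(pi), 2.0 + sin(pi))

# Midpoint of the segment joining them: the centre of the circle
mid = ((p[0] + q[0]) / 2, (p[1] + q[1]) / 2)

print(on_ring(*p), on_ring(*q), on_ring(*mid))  # True True False
```

Both endpoints belong to the ring but the midpoint does not, so the ring is not a convex set.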


In [None]:
from math import pi, cos, sin
from random import random

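def point(h, k, r):
    """Return a random point on the circle of radius r centred at (h, k),
    with a small upward jitter (up to 0.2) added to the y coordinate."""
    theta = random() * 2 * pi
    return h + cos(theta) * r, k + sin(theta) * r + 0.2*random()

# Outline of the face: 100 points on a unit circle centred at (1, 2)
xy = [point(1,2,1) for _ in range(100)]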

In [None]:
X1, y1 = make_blobs(n_samples=10, centers=[(0.5,2.5)],
                       cluster_std=0.05, random_state=1)

X2, y2 = make_blobs(n_samples=10, centers=[(1.5,2.5)],
                       cluster_std=0.05, random_state=2)

X3, y3 = make_blobs(n_samples=10, centers=[(1,1.7)],
                       cluster_std=0.05, random_state=2)

In [None]:
X3_stretch = np.array([X3[:,0]*3, X3[:,1]]) #for the mouth :)

In [None]:
plt.axes().set_aspect('equal', 'datalim')
plt.scatter(*zip(*xy))
plt.scatter(X1[:,0],X1[:,1])
plt.scatter(X2[:,0],X2[:,1])
plt.scatter(X3_stretch.T[:,0]-1.9,X3_stretch.T[:,1])

plt.show()

We now stack all the sets of points into a single array `X`.


In [None]:
X = np.vstack([xy,X1,X2,np.array([X3_stretch.T[:,0]-1.9,X3_stretch.T[:,1]]).T])

In [None]:
#X

### Clustering with k-means

In [None]:
kmeans = KMeans(n_clusters=4, n_init = 10, random_state=32) #you can change k as you wish
kmeans.fit(X)
y_kmeans = kmeans.predict(X)
centers = kmeans.cluster_centers_

plt.scatter(X[:, 0], X[:, 1], c=y_kmeans, s=10, cmap='viridis')
plt.scatter(centers[:, 0], centers[:, 1], c='black', s=100, alpha=0.5);

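The point sets we created are neither convex nor globular, so k-means does not perform well: it assigns every point to its nearest centroid, which carves the plane into convex regions that cannot follow a shape like the ring.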
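One way to see why k-means struggles here: it labels each point by its nearest centroid, so every cluster region is convex. The sketch below checks this numerically with NumPy only. The centroids, the random points, and the helper `assign` are hypothetical and chosen for illustration; they are not the fitted `kmeans.cluster_centers_` from the cell above.

```python
import numpy as np

def assign(points, centers):
    """Label each point with the index of its nearest centre (Euclidean distance)."""
    d = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return d.argmin(axis=1)

rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [2.0, 0.0], [1.0, 2.0]])  # hypothetical centroids
pts = rng.uniform(-1, 3, size=(200, 2))                   # hypothetical sample points
labels = assign(pts, centers)

# Midpoints of all pairs of points, labelled by the same nearest-centre rule
mids = (pts[:, None, :] + pts[None, :, :]) / 2
mid_labels = assign(mids.reshape(-1, 2), centers).reshape(len(pts), len(pts))

# For pairs that share a label, the midpoint should carry that label too
same = labels[:, None] == labels[None, :]
expected = np.broadcast_to(labels[:, None], same.shape)
all_convex = bool(np.all(mid_labels[same] == expected[same]))
print(all_convex)
```

Because a midpoint of two points in the same region always stays in that region, no choice of centroids can produce a ring-shaped cluster that encloses other clusters.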








### Clustering with DBSCAN

In [None]:
from sklearn.cluster import DBSCAN

#Code adapted from: https://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a clustering algorithm that identifies dense regions of the data space as clusters. It can detect groups of any shape while isolating points that do not belong to any cluster (noise, or outliers). It uses two key parameters: `eps` (the maximum distance at which two points are considered neighbors) and `min_samples` (the minimum number of points within `eps` required for a point to be considered a core point; in scikit-learn this count includes the point itself). Clusters are grown from core points that are linked through their neighborhoods, non-core points that lie within `eps` of a core point join its cluster as border points, and the remaining points in low-density areas are labeled as noise. This makes DBSCAN particularly useful for data where clusters are not globular or convex and where noise or isolated points are present.

In each iteration, DBSCAN selects an unvisited point and checks whether it has enough neighbors within the `eps` distance to qualify as a core point. If it does, a new cluster is started and expanded to include all of its neighbors, then the neighbors of any of those that are themselves core points, and so on. If the point does not have enough neighbors, it is provisionally marked as noise; it may still be absorbed later as a border point of some cluster. The process repeats until every point has been assigned to a cluster or labeled as noise.
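The procedure described above can be sketched in a few lines of NumPy. This is a minimal illustration, not scikit-learn's implementation: the function name `dbscan_sketch` and the toy array `X_demo` are hypothetical, the full distance matrix makes it quadratic in memory, and it assumes the same convention as scikit-learn, where a point counts as its own neighbor.

```python
import numpy as np
from collections import deque

def dbscan_sketch(X, eps, min_samples):
    """Minimal DBSCAN sketch; returns one label per point, with -1 for noise."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances (O(n^2) memory, fine for small data)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Neighbourhood of each point; it includes the point itself
    neighbors = [np.flatnonzero(dist[i] <= eps) for i in range(n)]
    is_core = np.array([len(nb) >= min_samples for nb in neighbors])

    labels = np.full(n, -1)
    cluster = 0
    for i in range(n):
        if labels[i] != -1 or not is_core[i]:
            continue  # already assigned, or not able to seed a cluster
        labels[i] = cluster
        queue = deque([i])
        while queue:
            j = queue.popleft()
            for m in neighbors[j]:
                if labels[m] == -1:
                    labels[m] = cluster
                    if is_core[m]:
                        queue.append(m)  # only core points keep expanding
        cluster += 1
    return labels

# Two tight groups and one isolated point (hypothetical toy data)
X_demo = np.array([[0, 0], [0.1, 0], [0, 0.1],
                   [5, 5], [5.1, 5], [5, 5.1],
                   [10, 0]])
print(dbscan_sketch(X_demo, eps=0.3, min_samples=2))
```

On this toy input the first three points share one label, the next three share another, and the isolated point keeps the noise label `-1`.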

In [None]:
# #############################################################################
# Calculate DBSCAN
db = DBSCAN(eps=0.25, min_samples=2).fit(X) #parameters: eps and min_samples
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)

# #############################################################################

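# Plot the result: large markers are core samples, small ones are non-core
# (border or noise) points
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]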

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Let's see how the results change as we vary the `eps` parameter.

In [None]:
# #############################################################################
# Calculate DBSCAN

for i,eps in enumerate([0.2, 0.25, 0.3, 0.35]): #iterates for several eps values

    plt.figure(figsize = (6,6))

    db = DBSCAN(eps=eps, min_samples=2).fit(X)

    core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
    core_samples_mask[db.core_sample_indices_] = True
    labels = db.labels_

    # Number of clusters in labels, ignoring noise if present
    n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
    n_noise_ = list(labels).count(-1)

    print('Estimated number of clusters: %d' % n_clusters_)
    print('Estimated number of noise points: %d' % n_noise_)

    # Plot the result: large markers are core samples, small ones are non-core
    unique_labels = set(labels)
    colors = [plt.cm.Spectral(each)
              for each in np.linspace(0, 1, len(unique_labels))]
    for k, col in zip(unique_labels, colors):
        if k == -1:
            # Black used for noise.
            col = [0, 0, 0, 1]

        class_member_mask = (labels == k)

        xy = X[class_member_mask & core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=10)

        xy = X[class_member_mask & ~core_samples_mask]
        plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

    # Raw string so that '\e' in '\epsilon' is not read as an escape sequence
    plt.title(r'$\epsilon$ = %0.2f; estimated number of clusters: %d' % (eps, n_clusters_))

    plt.savefig('DBSCAN_'+str(i)+'.pdf', dpi = 300)



The conclusion here is that both choosing a clustering scheme and evaluating its results are challenging. To make sense of a clustering result, we need some understanding of the structure of our data, but this is precisely what we aim to uncover when applying clustering. These algorithms and their results must be approached with caution.
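As one illustration of why evaluation is delicate, the sketch below computes the silhouette coefficient for k-means and DBSCAN labelings of the same data. The two concentric rings are hypothetical toy data, not the smiley face, and the parameter values are assumptions chosen so that DBSCAN separates the rings. The silhouette coefficient is generally higher for compact, convex clusters, so it may not favor the labeling that looks right by eye.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import silhouette_score

# Two noise-free concentric rings of radius 1 and 3 (hypothetical toy data)
theta = np.linspace(0, 2 * np.pi, 100, endpoint=False)
ring = np.column_stack([np.cos(theta), np.sin(theta)])
X_rings = np.vstack([1.0 * ring, 3.0 * ring])

# k-means with two clusters versus DBSCAN (eps and min_samples are assumptions)
labels_km = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_rings)
labels_db = DBSCAN(eps=0.3, min_samples=3).fit_predict(X_rings)

score_km = silhouette_score(X_rings, labels_km)
score_db = silhouette_score(X_rings, labels_db)

print('DBSCAN clusters:', len(set(labels_db) - {-1}))
print('silhouette, k-means: %.3f' % score_km)
print('silhouette, DBSCAN : %.3f' % score_db)
```

Comparing the two scores with a plot of each labeling shows how much a single internal metric can and cannot say about a clustering.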