DBSCAN is a simple yet powerful algorithm capable of identifying any number of clusters of any shape. It is robust to outliers, and it has just two hyperparameters (eps and min_samples). It works well if all the clusters are dense enough and are well separated by low-density regions.

In [1]:
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

In [2]:
X, y = make_moons(n_samples=1000, noise=0.05)

In [3]:
dbscan = DBSCAN(eps=0.05, min_samples=5) 

dbscan.fit(X)

DBSCAN(algorithm='auto', eps=0.05, leaf_size=30, metric='euclidean',
       metric_params=None, min_samples=5, n_jobs=None, p=None)

In [4]:
dbscan.labels_  # cluster label of each instance

array([ 1,  0,  1,  1,  2,  1,  3,  0,  2,  0,  0,  4,  5,  1,  4,  3,  5,
        2,  2,  3,  2,  1,  3,  2,  7,  2,  3,  2,  2, -1,  9,  1,  3,  5,
        0,  2,  2,  1,  3,  5,  2,  0,  5,  3,  5,  1,  0,  2,  4,  2,  3,
        2, -1, -1,  0, -1, -1,  3, -1,  3,  4,  1,  1,  2,  2,  3,  5,  3,
        0,  0,  3,  1,  4,  2,  0,  5,  2,  4,  3,  1,  2,  0,  1,  5,  1,
        4,  2,  2,  1,  1,  3,  3,  2, -1,  5,  3, -1,  5,  1,  3,  1,  1,
        3,  4, -1,  1,  4,  5,  6, -1, -1,  2,  2,  1,  2,  2,  2,  2, -1,
        4,  6,  3,  1,  2,  8,  6,  1,  4,  1,  4,  2,  2,  6,  1, -1, -1,
        3,  3,  6,  1,  0,  1,  6,  1,  3,  4,  3,  3,  0,  3,  3,  0,  2,
        1,  1,  5,  0,  1,  3,  3,  4,  2,  1,  2,  1,  4,  3,  1,  0,  3,
        3,  3,  2,  1,  3,  1,  1,  4,  2,  3,  0,  1,  5,  0,  4,  4,  4,
        4,  7,  4,  3, -1,  2,  5,  4,  1,  0,  1,  3,  6,  3,  3,  1,  2,
        2, -1,  1,  2, -1,  3,  1,  2,  0,  6,  3,  3,  1,  3,  5,  3,  0,
        0,  1, -1,  1, -1

In [5]:
dbscan.labels_.shape

(1000,)

### Some instances have a cluster index equal to -1, which means that the algorithm considers them anomalies.
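To make the -1 label concrete, here is a minimal sketch that counts clusters and anomalies for a fitted model. The helper name summarize_labels, the random_state=42 seed, and the second eps value of 0.2 are assumptions added for illustration. They are not part of the run above, so the counts will differ from the output shown earlier.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons


def summarize_labels(labels):
    """Return (number of clusters, number of anomalies) for DBSCAN labels."""
    labels = np.asarray(labels)
    n_noise = int(np.sum(labels == -1))
    # -1 is not a cluster, so exclude it from the cluster count
    n_clusters = len(set(labels.tolist()) - {-1})
    return n_clusters, n_noise


# random_state is an assumption, used only to make the sketch reproducible
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)

narrow = DBSCAN(eps=0.05, min_samples=5).fit(X)
wide = DBSCAN(eps=0.2, min_samples=5).fit(X)  # 0.2 is an illustrative value

print("eps=0.05:", summarize_labels(narrow.labels_))
print("eps=0.20:", summarize_labels(wide.labels_))
```

With min_samples fixed, increasing eps can turn anomalies into cluster members but never the reverse, so the anomaly count for the wider neighborhood is never larger.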

The indices of the core instances are available in the core_sample_indices_ instance variable, and the core instances themselves are available in the components_ instance variable.

In [6]:
len(dbscan.core_sample_indices_)

780

In [7]:
len(dbscan.components_)

780
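The relationship between these two attributes, and what makes an instance a core instance, can be checked directly. The sketch below relies on scikit-learn's convention that a core instance has at least min_samples instances (itself included) within distance eps. The brute-force distance matrix and the random_state=42 seed are illustrative choices, not part of the run above.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

eps, min_samples = 0.05, 5
# random_state is an assumption, used only to make the sketch reproducible
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=eps, min_samples=min_samples).fit(X)

# components_ holds the rows of X selected by core_sample_indices_
same_rows = np.array_equal(dbscan.components_, X[dbscan.core_sample_indices_])

# Brute-force check of the core-instance rule: count the instances within
# eps of each instance (the instance itself counts, since its distance is 0)
dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
neighbor_counts = (dists <= eps).sum(axis=1)
core_by_hand = np.flatnonzero(neighbor_counts >= min_samples)

# Instances that belong to a cluster without being core are border instances
is_core = np.zeros(len(X), dtype=bool)
is_core[dbscan.core_sample_indices_] = True
n_border = int(np.sum(~is_core & (dbscan.labels_ != -1)))

print("components_ matches X[core_sample_indices_]:", same_rows)
print("core instances found by hand:", len(core_by_hand))
print("border instances:", n_border)
```

Core instances always receive a cluster label, so every -1 label belongs to a non-core instance.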

### NOTE! 
Somewhat surprisingly, the DBSCAN class does not have a predict() method, although it has a fit_predict() method. In other words, it cannot predict which cluster a new instance belongs to. The rationale is that different classification algorithms can be better for different tasks, so the authors decided to let the user choose which one to use. Moreover, adding prediction yourself is not hard. For example, let’s train a KNeighborsClassifier:

In [9]:
from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors=50)

In [12]:
# Train on the core instances only, using their cluster labels as targets
knn.fit(dbscan.components_, dbscan.labels_[dbscan.core_sample_indices_])

KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=50, p=2,
                     weights='uniform')

Now, given a few new instances, we can predict which cluster they most likely
belong to and even estimate a probability for each cluster:

In [13]:
import numpy as np

X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]]) 
knn.predict(X_new)

array([3, 0, 4, 1])

In [14]:
knn.predict_proba(X_new)

array([[0.3 , 0.  , 0.  , 0.66, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.04],
       [0.96, 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.  , 0.04, 0.  , 0.  ],
       [0.  , 0.12, 0.  , 0.  , 0.56, 0.28, 0.  , 0.  , 0.  , 0.04, 0.  ],
       [0.  , 0.98, 0.  , 0.  , 0.  , 0.  , 0.  , 0.02, 0.  , 0.  , 0.  ]])
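Because the classifier was trained only on core instances, it assigns every new instance to some cluster, even one that lies far from all of them. Here is a minimal sketch of one way to flag such instances as anomalies: look up the nearest core instance with kneighbors() and reject the prediction when that instance is farther away than a cutoff. The cutoff max_distance = 0.2 and the random_state=42 seed are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons
from sklearn.neighbors import KNeighborsClassifier

# random_state is an assumption, used only to make the sketch reproducible
X, _ = make_moons(n_samples=1000, noise=0.05, random_state=42)
dbscan = DBSCAN(eps=0.05, min_samples=5).fit(X)

# Cluster labels of the core instances, in the same order as components_
core_labels = dbscan.labels_[dbscan.core_sample_indices_]
knn = KNeighborsClassifier(n_neighbors=50)
knn.fit(dbscan.components_, core_labels)

X_new = np.array([[-0.5, 0], [0, 0.5], [1, -0.1], [2, 1]])

max_distance = 0.2  # illustrative cutoff, not a recommended value

# Distance to, and index of, the single nearest core instance
dist, idx = knn.kneighbors(X_new, n_neighbors=1)
y_pred = core_labels[idx.ravel()]        # label of the nearest core instance
y_pred[dist.ravel() > max_distance] = -1  # too far from any cluster

print(y_pred)
```

The instances at [-0.5, 0] and [2, 1] sit roughly 0.5 away from the noise-free moon curves, so they fall beyond the cutoff and receive the label -1.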