# DBSCAN

Continuing with clustering algorithms, there is another one that, unlike k-Means, does not require you to specify the number of clusters in advance.

This algorithm is known as DBSCAN, or *Density-Based Spatial Clustering of Applications with Noise*.

DBSCAN groups points according to how densely they are packed in space. Points that lie close to each other in a dense region are assigned to the same cluster, while isolated points in low-density regions are labeled as noise.

The implementation of `DBSCAN` is within the `sklearn.cluster` module:

In [None]:
from sklearn.cluster import DBSCAN


Let's generate a dataset with 5 blobs, that is, 5 ground-truth clusters:

In [None]:
from sklearn.datasets import make_blobs

X, y_true = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)


Since `DBSCAN` does not require specifying the number of clusters, we can initialize it with its default values and then use `fit_predict` to obtain the assigned clusters:

In [None]:
dbscan = DBSCAN()
labels = dbscan.fit_predict(X)


If we review the labels, we'll see that some have the value `-1`; these are the points that were identified as noise:

In [None]:
labels
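
Reading the raw array gets tedious with 500 points, so a short summary can help. The sketch below regenerates the same dataset and counts how many points fall into each label; the variable names (`unique_labels`, `counts`, `n_clusters`, `n_noise`) are only illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Same dataset and default DBSCAN as in the cells above
X, y_true = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)
labels = DBSCAN().fit_predict(X)

# Count how many points received each label (-1 marks noise)
unique_labels, counts = np.unique(labels, return_counts=True)
n_noise = int(np.sum(labels == -1))
n_clusters = len(unique_labels) - (1 if n_noise > 0 else 0)

for label, count in zip(unique_labels, counts):
    name = "noise" if label == -1 else f"cluster {label}"
    print(f"{name}: {count} points")

print(f"{n_clusters} clusters found, {n_noise} noise points")
```

Note that the number of clusters found does not have to match the 5 blobs we generated; it depends on the density of the data and on the hyperparameters.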


We can visualize the clusters with the `view_dbscan` helper, imported from the local `utils` module:

In [None]:
from utils import view_dbscan

view_dbscan(X, y_true, [("Predicted labels", dbscan)])


## Arguments of `DBSCAN`

DBSCAN has several arguments, but the most important ones to consider are:

 - `eps`: The neighborhood radius that defines the maximum distance between two points to be considered neighbors.
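 - `min_samples`: The minimum number of points (counting the point itself) that must lie within `eps` of a point for it to be considered a core point, that is, part of a dense region that can form a cluster. Values of `min_samples` that are too small can turn noise into many tiny clusters, while values that are too large can cause fewer points to be grouped and more of them to be labeled as noise.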
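
To see how these two arguments interact, it can help to split the points into the three roles DBSCAN distinguishes: core points, border points (assigned to a cluster but not dense enough to be core), and noise. The sketch below is a minimal illustration that uses the fitted estimator's `core_sample_indices_` and `labels_` attributes; the values `eps=0.5` and `min_samples=5` are scikit-learn's defaults, written out explicitly, and the mask names are our own.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)

# eps=0.5 and min_samples=5 are the defaults, spelled out for clarity
dbscan = DBSCAN(eps=0.5, min_samples=5).fit(X)

# Core points: at least min_samples points (itself included) within eps
core_mask = np.zeros(len(X), dtype=bool)
core_mask[dbscan.core_sample_indices_] = True

# Noise points: not reachable from any core point
noise_mask = dbscan.labels_ == -1

# Border points: assigned to a cluster, but not core themselves
border_mask = ~core_mask & ~noise_mask

print(f"core: {core_mask.sum()}, border: {border_mask.sum()}, noise: {noise_mask.sum()}")
```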

## Visualization of the arguments

### `eps`

The neighborhood radius that defines the maximum distance between two points to be considered neighbors. Values of `eps` that are too small can cause fewer points to be grouped or even all points to be classified as noise, while values that are too large can group points that shouldn't be together.

In [None]:
# Fit one DBSCAN model per candidate radius; min_samples keeps its default
eps_list = [0.01, 0.1, 0.3, 0.5, 1]

trained_dbscans = []
for eps_value in eps_list:
    dbscan = DBSCAN(eps=eps_value)
    dbscan.fit(X)
    trained_dbscans.append((f"eps = {eps_value}", dbscan))

view_dbscan(X, y_true, trained_dbscans)
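
If the plotting helper is not at hand, the same effect can be checked numerically. The sketch below refits DBSCAN for each `eps` value and prints the number of clusters and noise points; `summarize` is a hypothetical helper written for this example, not part of scikit-learn.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)


def summarize(labels):
    """Return (number of clusters, number of noise points) for DBSCAN labels."""
    n_clusters = len(set(labels.tolist()) - {-1})
    n_noise = int(np.sum(labels == -1))
    return n_clusters, n_noise


eps_summary = {}
for eps_value in [0.01, 0.1, 0.3, 0.5, 1]:
    labels = DBSCAN(eps=eps_value).fit_predict(X)
    eps_summary[eps_value] = summarize(labels)
    n_clusters, n_noise = eps_summary[eps_value]
    print(f"eps = {eps_value}: {n_clusters} clusters, {n_noise} noise points")
```

With `min_samples` fixed, a larger radius can only make neighborhoods denser, so the number of noise points never increases as `eps` grows.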


### `min_samples`

The minimum number of points (counting the point itself) that must lie within `eps` of a point for it to be considered a core point. `min_samples` values that are too small can turn noise into many tiny clusters, while values that are too large can cause fewer points to be clustered and more of them to be labeled as noise.

In [None]:
# Fit one DBSCAN model per candidate min_samples; eps keeps its default
min_samples_list = [1, 3, 5, 20, 50]

trained_dbscans = []
for min_samples_value in min_samples_list:
    dbscan = DBSCAN(min_samples=min_samples_value)
    dbscan.fit(X)
    trained_dbscans.append((f"min_samples = {min_samples_value}", dbscan))

view_dbscan(X, y_true, trained_dbscans)
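
The extreme case `min_samples=1` is worth checking numerically: because the count includes the point itself, every point qualifies as a core point and nothing is labeled as noise. The sketch below is one way to confirm this on our dataset; the dictionary names are only illustrative.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)

noise_counts = {}
cluster_counts = {}
for min_samples_value in [1, 3, 5, 20, 50]:
    labels = DBSCAN(min_samples=min_samples_value).fit_predict(X)
    # -1 marks noise; every other label is a cluster id
    noise_counts[min_samples_value] = int(np.sum(labels == -1))
    cluster_counts[min_samples_value] = len(set(labels.tolist()) - {-1})
    print(
        f"min_samples = {min_samples_value}: "
        f"{cluster_counts[min_samples_value]} clusters, "
        f"{noise_counts[min_samples_value]} noise points"
    )
```

With `eps` fixed, raising `min_samples` makes it harder to be a core point, so the number of noise points can only stay the same or grow.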


## Choosing hyperparameters

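To measure the quality of a hyperparameter configuration for DBSCAN, we can use the metrics we discussed previously, such as the Silhouette coefficient, the Calinski-Harabasz index, or the Davies-Bouldin index, and keep the configuration that scores best. Keep in mind that these metrics treat the noise label `-1` as just another cluster unless those points are excluded first.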
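
As a rough sketch of this idea, the code below scores a few `eps` values with the Silhouette coefficient. It makes two assumptions of its own: noise points are dropped before scoring, and a configuration with fewer than two clusters gets no score, since `silhouette_score` needs at least two labels. The helper `score_dbscan` is hypothetical, written only for this example.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=500, centers=5, cluster_std=0.8, random_state=0)


def score_dbscan(X, eps, min_samples=5):
    """Return (silhouette or None, number of clusters, fraction of noise)."""
    labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
    clustered = labels != -1  # assumption: ignore noise when scoring
    n_clusters = len(set(labels[clustered].tolist()))
    noise_fraction = float(np.mean(~clustered))
    # silhouette_score needs between 2 and n_samples - 1 distinct labels
    if n_clusters < 2 or n_clusters >= int(clustered.sum()):
        return None, n_clusters, noise_fraction
    score = float(silhouette_score(X[clustered], labels[clustered]))
    return score, n_clusters, noise_fraction


results = {eps: score_dbscan(X, eps) for eps in [0.1, 0.3, 0.5, 1.0]}
for eps, (score, n_clusters, noise_fraction) in results.items():
    shown = "n/a" if score is None else f"{score:.3f}"
    print(f"eps = {eps}: silhouette = {shown}, clusters = {n_clusters}, noise = {noise_fraction:.1%}")
```

Dropping noise can flatter configurations that discard many points, so it is worth reading the score together with the noise fraction rather than on its own.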

You can also use secondary, business-related metrics to define the best values.

## Comparison with k-Means

DBSCAN is more suitable than k-Means when the number of clusters is unknown, the clusters have non-convex shapes, or the data contains noise or outliers. Because it relies on a single `eps` value for the whole dataset, however, it can struggle when clusters have very different densities. In general, DBSCAN is a good choice when you want a solution that makes fewer ad-hoc assumptions about the number of clusters and the shape of the data.
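
To make the point about non-convex shapes concrete, here is a small sketch on the two-moons toy dataset from `sklearn.datasets.make_moons`, which is not used elsewhere in this notebook. The dataset size, the noise level, and `eps=0.2` are assumptions chosen for illustration, and the adjusted Rand index is used only because true labels are available here.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_moons
from sklearn.metrics import adjusted_rand_score

# Two interleaved half-circles: a classic non-convex toy dataset
X_moons, y_moons = make_moons(n_samples=1000, noise=0.05, random_state=0)

# k-Means must be told the number of clusters and assumes roughly convex groups
kmeans_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_moons)

# DBSCAN only needs a density definition; eps=0.2 is an illustrative guess
dbscan_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X_moons)

# Adjusted Rand index: 1.0 means perfect agreement with the true labels
ari_kmeans = adjusted_rand_score(y_moons, kmeans_labels)
ari_dbscan = adjusted_rand_score(y_moons, dbscan_labels)

print(f"k-Means ARI: {ari_kmeans:.3f}")
print(f"DBSCAN ARI:  {ari_dbscan:.3f}")
```

On data like this, k-Means tends to cut each moon with a straight boundary, while DBSCAN can follow the curved shapes, provided `eps` suits the spacing of the points.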

Now you know two clustering algorithms available in scikit-learn. It's time to explore other unsupervised learning techniques.