# Density-Based Spatial Clustering of Applications with Noise (DBSCAN)

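DBSCAN is a density-based clustering algorithm: it groups points that lie in regions of high density and labels points in low-density regions as noise, so the number of clusters does not have to be specified in advance.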
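To make the idea concrete, here is a minimal pure-Python sketch of the textbook DBSCAN procedure. It is a toy O(n²) illustration, not how cuML or scikit-learn implement it; the names `dbscan_sketch` and `region_query` are hypothetical helpers introduced only for this example.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i], including i itself."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan_sketch(points, eps, min_samples):
    """Toy O(n^2) DBSCAN; returns one label per point, -1 for noise."""
    labels = [None] * len(points)      # None means "not visited yet"
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_samples:
            labels[i] = NOISE          # may later be claimed as a border point
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point: reachable but not core
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_samples:
                queue.extend(j_neighbors)  # j is also core: keep expanding
    return labels

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),       # dense group A
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0),  # dense group B
          (50.0, 50.0)]                              # isolated point
labels = dbscan_sketch(points, eps=1.5, min_samples=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The two dense groups become clusters 0 and 1, and the isolated point is labelled -1 (noise), the same convention used by the `labels_` attribute later in this notebook.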

The model accepts array-like inputs: host data as NumPy arrays, device data as Numba device arrays or any object that implements `__cuda_array_interface__`, and cuDF DataFrames.

For information about the cuDF format, refer to the [cuDF documentation](https://rapidsai.github.io/projects/cudf/en/latest/).

For information about cuML's DBSCAN implementation, refer to the [cuML API documentation](https://rapidsai.github.io/projects/cuml/en/latest/api.html#dbscan).

In [None]:
import numpy as np
import cupy as cp

import pandas as pd
import cudf as gd

from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

from sklearn.cluster import DBSCAN as skDBSCAN
from cuml.cluster import DBSCAN as cumlDBSCAN

# Select the GPU to run on. Index 3 assumes a machine with at least four GPUs;
# change it (for example to 0) to match the available hardware.
cp.cuda.Device(3).use()

## Define Parameters

In [None]:
n_samples = 100000
n_features = 128

eps = 3
min_samples = 2
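Before fitting, it can help to compare `eps` with the distances the data are likely to contain. The sketch below is a back-of-the-envelope check, not a result from running this notebook. It assumes each blob is an isotropic Gaussian with standard deviation 1.0 (the `make_blobs` default for `cluster_std`), in which case the root-mean-square distance between two points of the same blob is about sqrt(2 · n_features) · std. The small stdlib sample is a stand-in for one blob.

```python
import itertools
import math
import random
import statistics

n_features = 128   # same value as in the notebook
eps = 3            # same value as in the notebook
cluster_std = 1.0  # assumption: make_blobs default

rng = random.Random(0)
# 50 points drawn from a single isotropic Gaussian blob centred at the origin
blob = [[rng.gauss(0.0, cluster_std) for _ in range(n_features)]
        for _ in range(50)]

# Euclidean distance between every pair of points in the sample
distances = [math.dist(p, q) for p, q in itertools.combinations(blob, 2)]
mean_distance = statistics.mean(distances)
min_distance = min(distances)

theory = math.sqrt(2 * n_features) * cluster_std
print(f"theory: sqrt(2 * d) * std = {theory:.2f}")
print(f"sample: mean = {mean_distance:.2f}, min = {min_distance:.2f}, eps = {eps}")
```

If the typical within-blob distance is far larger than `eps`, few points will have any neighbor inside their `eps` radius, and DBSCAN may label most or all of them as noise. In that situation, a larger `eps` or fewer features would be worth trying.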

## Generate Data

### Host

In [None]:
host_data, host_labels = make_blobs(
    n_samples=n_samples, n_features=n_features, centers=5, random_state=7)

host_data = pd.DataFrame(host_data)
host_labels = pd.Series(host_labels)

### Device

In [None]:
device_data = gd.DataFrame.from_pandas(host_data)
device_labels = gd.Series(host_labels)

## Scikit-learn Model

In [None]:
%%time
clustering_sk = skDBSCAN(eps=eps,
                         min_samples=min_samples,
                         algorithm="brute",
                         n_jobs=-1)
clustering_sk.fit(host_data)

## cuML Model

In [None]:
%%time
clustering_cuml = cumlDBSCAN(eps=eps,
                             min_samples=min_samples)
clustering_cuml.fit(device_data)

## Evaluate Results

In [None]:
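%%time
# adjusted_rand_score runs on the host, so copy the cuML labels (expected to be
# a cuDF Series here, since the model was fitted on a cuDF DataFrame) to pandas.
cuml_score = adjusted_rand_score(host_labels, clustering_cuml.labels_.to_pandas())
sk_score = adjusted_rand_score(host_labels, clustering_sk.labels_)
print(f"cuML ARI: {cuml_score:.4f}, scikit-learn ARI: {sk_score:.4f}")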
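The adjusted Rand index (ARI) suits this comparison because it ignores the cluster names themselves: DBSCAN's cluster numbers need not match the blob indices returned by `make_blobs`. The pure-Python sketch below follows the standard pair-counting definition of the ARI. The function name `adjusted_rand_index` is hypothetical, and scikit-learn's implementation may differ in details.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_a, labels_b):
    """Pair-counting ARI between two labelings of the same points."""
    n = len(labels_a)
    if n < 2:
        return 1.0  # no pairs to disagree on
    # Pairs of points grouped together in both labelings, in a, and in b
    sum_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    sum_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    sum_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = sum_a * sum_b / comb(n, 2)   # agreement expected by chance
    maximum = (sum_a + sum_b) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial and identical
    return (sum_pairs - expected) / (maximum - expected)

truth = [0, 0, 0, 1, 1, 1]
renamed = [1, 1, 1, 0, 0, 0]   # same grouping, different cluster names
all_noise = [-1] * 6           # every point labelled as noise

print(adjusted_rand_index(truth, renamed))    # 1.0
print(adjusted_rand_index(truth, all_noise))  # 0.0
```

A score of 1.0 means the two labelings describe the same grouping, while a score near 0.0 means the agreement is no better than chance. Two models that both label every point as noise would therefore receive identical scores of 0.0, so equal scores alone do not show that meaningful clusters were found.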