### DBSCAN

* **D**ensity-**B**ased **S**patial **C**lustering of **A**pplications with **N**oise (DBSCAN)

* It is a density-based, non-parametric algorithm: given a set of points in some space, DBSCAN groups together the points that are closely packed and marks as outliers the points that lie alone in low-density regions.

* DBSCAN is one of the most commonly used clustering algorithms and among the most cited in the scientific literature.

<img src="https://media.geeksforgeeks.org/wp-content/uploads/PicsArt_11-17-08.07.10-300x300.jpg">

**Credits** - Image from Internet

### MinPts & Eps: Density

* `MinPts` and `Eps` are the two hyperparameters; together they are used to measure density.

* **Density at point `P`** → total number of points within a hypersphere (circle in 2D) of radius `Eps` around `P`.

* **Dense region** → a hypersphere (circle in 2D) of radius `Eps` that contains at least `MinPts` points.

* **Sparse region** → a hypersphere (circle in 2D) of radius `Eps` that contains fewer than `MinPts` points.
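
The density definitions above can be sketched in a few lines of plain Python. This is a minimal illustration, not a library API: the helper names `density` and `is_dense` are made up for this note, the distance is assumed to be Euclidean, and the point `P` itself is assumed to count towards its own neighborhood.

```python
import math

def density(points, p, eps):
    """Number of points within distance eps of p (p counts if it is in points)."""
    return sum(1 for q in points if math.dist(p, q) <= eps)

def is_dense(points, p, eps, min_pts):
    """True if the eps-ball around p holds at least min_pts points."""
    return density(points, p, eps) >= min_pts

# Toy data: three points close together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (5.0, 5.0)]

near = density(points, (0.0, 0.0), eps=1.0)   # counts the three nearby points
far = density(points, (5.0, 5.0), eps=1.0)    # counts only the point itself
```

With `MinPts = 3`, the ball around `(0.0, 0.0)` is a dense region and the ball around `(5.0, 5.0)` is a sparse one.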

### Core, Border, and Noise

https://cs.wmich.edu/alfuqaha/summer14/cs6530/lectures/ClusteringAnalysis.pdf

* Given a set of points $D = \{x_i\}$, `MinPts`, and `Eps`, we can determine whether each point is a core point, a border point, or a noise point.

* **Core point** → point `P` is said to be a core point if it has at least `MinPts` points within radius `Eps` around it.
    - A core point always belongs to a dense region.

* **Border point** → point `P` is said to be a border point iff
    - `P` itself is not a core point
    - `P` belongs to the neighborhood of some core point `Q`
    - that is, $\text{dist}(P, Q) \leq \text{Eps}$

* **Noise point** → any point that is neither a core point nor a border point

* **Density Edge**
    - an edge represents a connection
    - if `P` and `Q` are core points and dist(`P`, `Q`) is less than or equal to `Eps`, then we connect `P` and `Q` with an edge; this edge is called a density edge.

* **Density Connected Points**
    - if `P` and `Q` are core points and there exist other core points such as `P1`, `P2`, `P3` with a **density edge** between each consecutive pair `P-P1`, `P1-P2`, `P2-P3`, `P3-Q`, then `P` and `Q` are said to be density connected points.
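
A small sketch of the three point types, under the assumptions that distances are Euclidean and that a point counts as a member of its own `Eps`-neighborhood. The names `region_query` and `label_points` are illustrative, not taken from any library.

```python
import math

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def label_points(points, eps, min_pts):
    """Label every point as 'core', 'border', or 'noise'."""
    neighbors = [region_query(points, i, eps) for i in range(len(points))]
    core = {i for i, nb in enumerate(neighbors) if len(nb) >= min_pts}
    labels = []
    for i, nb in enumerate(neighbors):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nb):
            # Not core itself, but within eps of at least one core point.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# A unit square, one point hanging off its side, and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1), (8, 8)]
labels = label_points(points, eps=1.5, min_pts=4)
```

Here the four corners of the square come out as core points, `(2, 1)` as a border point (it has only three points in its neighborhood but lies within `Eps` of the core point `(1, 1)`), and `(8, 8)` as noise.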

### DBSCAN Algorithm

1. $\forall x_i \in D$, label the point as a core point, a border point, or a noise point.
    - This needs a Range Query (find all points within `Eps` of $x_i$), which is the most important operation in the algorithm.

2. Remove all the noise points from the data. Noise points belong to sparse regions.

3. For each core point `P` that is not yet assigned to any cluster (**repeat** until none remain):
    - create a new cluster with `P`
    - add all the points that are density connected to `P` into this new cluster

4. Take each border point and assign it to the cluster of its nearest core point.
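
The four steps above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions rather than a reference implementation: the Range Query is brute force, distances are Euclidean, a point counts towards its own neighborhood, and noise is represented by the label `-1` instead of being physically removed. The function name `dbscan` is chosen only for this note.

```python
import math
from collections import deque

def dbscan(points, eps, min_pts):
    """Return one cluster label per point; -1 marks noise."""
    n = len(points)

    # Step 1: range query for every point, then flag the core points.
    neighbors = [
        [j for j in range(n) if math.dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighbors]

    # Step 2: noise is handled implicitly; anything never assigned stays -1.
    labels = [-1] * n
    cluster = 0

    # Step 3: grow one cluster per unassigned core point by following
    # density edges (core-to-core links of length <= eps).
    for i in range(n):
        if not core[i] or labels[i] != -1:
            continue
        labels[i] = cluster
        queue = deque([i])
        while queue:
            p = queue.popleft()
            for q in neighbors[p]:
                if core[q] and labels[q] == -1:
                    labels[q] = cluster
                    queue.append(q)
        cluster += 1

    # Step 4: attach each border point to the cluster of its nearest core point.
    for i in range(n):
        if core[i]:
            continue
        core_nbs = [j for j in neighbors[i] if core[j]]
        if core_nbs:
            nearest = min(core_nbs, key=lambda j: math.dist(points[i], points[j]))
            labels[i] = labels[nearest]
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1), (2, 1),   # square plus a border point
          (8, 8),                                    # isolated point
          (10, 10), (10, 11), (11, 10), (11, 11)]    # second square
labels = dbscan(points, eps=1.5, min_pts=4)
```

On this toy data the first five points form one cluster, the last four form another, and `(8, 8)` keeps the noise label `-1`.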

### Hyperparameters

* **MinPts**
    - rule of thumb → `MinPts` $\geq d+1$; where $d$ is dimensionality
    - typically, `MinPts` should be roughly equal to $2d$
    - if the dataset is noisier, `MinPts` should be set to a larger value so that more noisy points are removed
    - `MinPts` is often chosen by a domain expert

* **Eps**
    - let's assume `MinPts` is 4
    - for each $x_i$, find $d_i$, the distance between $x_i$ and its fourth nearest point
    - sort the points by $d_i$ in increasing order and plot the sorted $d_i$ values
    - use the elbow / knee of this curve to determine `Eps`
    - if $d_i$ is high, then the chance for $x_i$ to be a noisy point is high
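
The sorted distances used for the elbow plot can be computed with a short brute-force sketch. The helper name `k_distances` is hypothetical, distances are assumed Euclidean, and the point itself is assumed not to count as one of its own nearest points; the plotting step is left out.

```python
import math

def k_distances(points, k):
    """Distance from each point to its k-th nearest other point, sorted ascending."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)

# Four tightly packed points and one far-away point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
kd = k_distances(points, k=2)
```

Plotting `kd` against its index gives the curve described above: the packed points share a small $d_i$, while the far-away point produces a sharp jump at the end, which marks it as a likely noise point and suggests choosing `Eps` below that jump.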

### Merits & De-Merits

https://en.wikipedia.org/wiki/DBSCAN#Advantages

**Merits**

* Resistant to noise.
* Can handle clusters of different sizes and shapes.
* Robust to outliers.

**De-Merits**

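* Struggles with clusters of varying densities, because a single `MinPts` / `Eps` pair is applied to the whole dataset.
* Works poorly on high-dimensional data, where distances between points become less informative.
* Sensitive to changes in the hyperparameters.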

### Time & Space Complexity

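* **Time Complexity** - $O(n \log(n))$ on average when the Range Query uses a spatial index (for example a k-d tree or an R-tree); $O(n^2)$ in the worst case or without an index
    - $\log(n)$ → to execute one indexed Range Query
    - $n$ → Range Query is executed `n` times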

* **Space Complexity** - $O(n)$
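
To see where the $\log(n)$ factor of an indexed Range Query comes from, here is a deliberately simplified one-dimensional sketch. It assumes the data are plain numbers kept in a sorted list, so binary search can play the role of the index; real data in higher dimensions would need a spatial index such as a k-d tree instead. The function name `range_query_1d` is illustrative only.

```python
from bisect import bisect_left, bisect_right

def range_query_1d(sorted_values, x, eps):
    """All values v with x - eps <= v <= x + eps, found by binary search."""
    lo = bisect_left(sorted_values, x - eps)    # first index with v >= x - eps
    hi = bisect_right(sorted_values, x + eps)   # first index with v > x + eps
    return sorted_values[lo:hi]

values = sorted([3.0, 0.5, 7.5, 0.0, 3.25, 1.0])
close = range_query_1d(values, x=0.5, eps=0.5)
alone = range_query_1d(values, x=7.5, eps=0.5)
```

Each query costs two binary searches plus the size of the answer, so running it once per point gives roughly $n \log(n)$ work overall, compared with $n^2$ distance checks for the brute-force version.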