# DBSCAN from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X` (whose `N` rows are samples and `d` columns are features)
- `eps`: the $\epsilon$ used for finding neighborhood
- `min_samples`: a sample is considered a core sample if its $\epsilon$-ball contains at least `min_samples` samples (including itself)

**Output:**  
- `y_new`: an array of shape `(N,)` that records the label of each sample, where `-1` stands for noise
- `core_indices`: an array of shape `(n_core_samples,)` that stores the indices of the core samples

**Steps:**
1. Build a dictionary `nbrhoods` that stores `i: array of indices of its neighbors` for each sample `i` .  
Here two points are neighbors if the distance between them is less than or equal to `eps` .  
Every point is considered a neighbor of itself.
2. If sample `i` has at least `min_samples` neighbors, then it is called a core sample.  
Store the indices of core samples in the array `core_indices` .
3. Set `label_num = 0`.  Label every sample with `-1`.  For each sample `i`, do the following [DFS](https://en.wikipedia.org/wiki/Depth-first_search):
    1. if sample `i` is a core sample labeled by `-1`, label it with `label_num`; otherwise, skip the following steps and move on to the next sample.
    2. let `stack = neighbors of sample i`
    3. take (and remove) the first element `j` in `stack`
    4. if `j` is labeled by `-1`, label it with `label_num`; otherwise, skip sub-step 5, so that no sample is expanded twice and the loop terminates  
    5. if `j` is a core, insert the neighbors of `j` at the beginning of `stack`  
    6. repeat sub-steps 3, 4, 5 until `stack` is empty
    7. `label_num += 1`
4. Return the labels as `y_new`, together with `core_indices` .
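
The steps above can be sketched in NumPy as follows. This is only one possible reading of the algorithm, not a reference solution: the function name `dbscan_sketch` and the helper array `is_core` are hypothetical, `nbrhoods` is kept as a dictionary, distances are assumed to be Euclidean, and a sample is expanded only at the moment it receives its label so that the loop terminates.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Illustrative DBSCAN following the steps above (hypothetical helper)."""
    X = np.asarray(X, dtype=float)
    N = X.shape[0]

    # Step 1: pairwise Euclidean distances via broadcasting, shape (N, N)
    diff = X[:, np.newaxis, :] - X[np.newaxis, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    # nbrhoods[i] holds the indices within eps of sample i (i itself included)
    nbrhoods = {i: np.where(dist[i] <= eps)[0] for i in range(N)}

    # Step 2: a core sample has at least min_samples neighbors
    is_core = np.array([len(nbrhoods[i]) >= min_samples for i in range(N)])
    core_indices = np.where(is_core)[0]

    # Step 3: depth-first search starting from each unlabeled core sample
    y_new = -np.ones(N, dtype=int)
    label_num = 0
    for i in range(N):
        if not is_core[i] or y_new[i] != -1:
            continue
        y_new[i] = label_num
        stack = list(nbrhoods[i])
        while stack:
            j = stack.pop(0)          # take (and remove) the first element
            if y_new[j] != -1:
                continue              # already labeled, do not expand again
            y_new[j] = label_num
            if is_core[j]:
                # insert the neighbors of j at the beginning of the stack
                stack = list(nbrhoods[j]) + stack
        label_num += 1

    # Step 4: return the labels and the core sample indices
    return y_new, core_indices


# Toy data (made up for illustration): two tight groups and one isolated point
X = np.array([[0.0, 0.0], [0.0, 0.5], [0.5, 0.0],
              [5.0, 5.0], [5.0, 5.5], [5.5, 5.0],
              [10.0, 0.0]])
y_new, core_indices = dbscan_sketch(X, eps=1.0, min_samples=3)
print(y_new)         # [ 0  0  0  1  1  1 -1]
print(core_indices)  # [0 1 2 3 4 5]
```

The full `(N, N)` distance matrix keeps the sketch short but uses memory quadratic in `N`, so it is only suitable for small data sets.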


## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts that you do not yet know how to implement.  

    1. 
    2. 
    3. ...

## Code

In [None]:
### your answer here

## Test
Take some sample data from [MDS-with-scikit-learn](MDS-with-scikit-learn.ipynb) and check whether your code produces outputs similar to those of the existing packages.

##### Name of the data
Description of the data.

In [None]:
### results with your code

In [None]:
### results with existing packages

## Comparison

##### Exercise 1
With the same `eps` and `min_samples`, the array `core_indices` generated by your function is supposed to be the same as `model.core_sample_indices_` .  
Check if this is true.

In [None]:
### your answer here

##### Exercise 2
Let `core_indices` be the array generated by your function.  
Then `X[core_indices]` is supposed to be the same as `model.components_` .  
Check if this is true.

In [None]:
### your answer here

##### Exercise 3
Let `y_my` be the output label of your function.  
Let `y_new` be the label given by `sklearn.cluster.DBSCAN` .

###### 3(a)
The noise masks `y_my == -1` and `y_new == -1` are supposed to be the same.  
Check if this is true.

In [None]:
### your answer here

###### 3(b)
Although `y_my` and `y_new` might be different, they are supposed to indicate the same clustering.  
That is, the partitions are the same, but a group may have different labels in `y_my` and `y_new` .  
Check if this is true.
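
One way to compare two labelings up to renaming is to compare their co-membership matrices, whose `(i, j)` entry records whether samples `i` and `j` share a label. The sketch below assumes this approach. The helper name `same_partition` and the toy label arrays are hypothetical, and the noise label `-1` is treated as one more group, so the noise samples must coincide as well.

```python
import numpy as np

def same_partition(y_a, y_b):
    """Return True if two label arrays group the samples identically."""
    y_a = np.asarray(y_a)
    y_b = np.asarray(y_b)
    if y_a.shape != y_b.shape:
        return False
    # entry (i, j) is True exactly when samples i and j share a label
    same_a = y_a[:, np.newaxis] == y_a[np.newaxis, :]
    same_b = y_b[:, np.newaxis] == y_b[np.newaxis, :]
    return bool(np.array_equal(same_a, same_b))


# Toy labels (made up): the same groups with the names 0 and 1 swapped
print(same_partition([0, 0, 1, 1, -1], [1, 1, 0, 0, -1]))  # True
# Here the second sample moves to the other group
print(same_partition([0, 0, 1, 1, -1], [0, 1, 1, 1, -1]))  # False
```

If the check fails on real data, note that a non-core sample within `eps` of core samples from two different clusters can be assigned to either cluster, depending on the order in which samples are visited.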

In [None]:
### your answer here