# DBSCAN from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X`: an array of shape `(N,d)` whose rows are samples and columns are features
- `eps`: the radius $\epsilon$ used to find the neighborhood of each sample
- `min_samples`: a sample is considered a core sample if its $\epsilon$-ball contains at least `min_samples` samples (including itself)
- `draw`: boolean, whether to return an illustrative figure

**Output:**  
A tuple `(y_new, core_indices, fig)` if `draw` is true, or `(y_new, core_indices)` otherwise.    
- `y_new`: an array of shape `(N,)` that records the label of each sample, where `-1` stands for noise
- `core_indices`: an array of shape `(n_core_samples,)` that stores the indices of the core samples
- `fig`: an illustrative figure showing the data points, DFS tree, core samples, and noises

**Steps:**
1. Build a list `nbrhoods` whose `i`-th element is the array of the indices of the neighbors of sample `i`.  
Here two points are neighbors if the distance between them is less than or equal to `eps` .  
A point is considered its own neighbor.
2. If sample `i` has at least `min_samples` neighbors, then it is called a core sample.  
Store the indices of core samples in the array `core_indices` .
3. Set `label_num = 0`.  Label every sample with `-1`.  For each sample `i`, do the following [DFS](https://en.wikipedia.org/wiki/Depth-first_search):
    1. if sample `i` is a core sample still labeled `-1`, continue with the sub-steps below (sample `i` receives its label in sub-step 4); otherwise, skip them and move on to the next sample.
    2. let `stack = [i]`
    3. take (and remove) the last element `j` in `stack`
    4. if `j` is labeled `-1`, label it with `label_num`; moreover, if `j` is a core sample, insert the neighbors of `j` at the end of `stack`  
    5. repeat sub-steps 3 and 4 until `stack` is empty
    6. `label_num += 1`
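
The steps above can be sketched compactly as follows. This is only a minimal illustration, not the reference solution: the function name `dbscan_sketch` and the toy data are made up for this example, the `draw` option is omitted, and the pairwise distances are computed with NumPy broadcasting, which needs memory proportional to `N**2`. The seed sample receives its label when it is popped from the stack, so that its neighbors are also pushed.

```python
import numpy as np

def dbscan_sketch(X, eps, min_samples):
    """Hypothetical helper following Steps 1-3; returns (y_new, core_indices)."""
    N = X.shape[0]
    # Step 1: pairwise distances by broadcasting; every point is its own neighbor
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nbrhoods = [np.flatnonzero(row <= eps) for row in dist]
    # Step 2: core samples have at least min_samples neighbors (including themselves)
    core_indices = np.array(
        [i for i, nbrs in enumerate(nbrhoods) if len(nbrs) >= min_samples],
        dtype=int,
    )
    is_core = np.zeros(N, dtype=bool)
    is_core[core_indices] = True
    # Step 3: depth-first search started from each unlabeled core sample
    y_new = np.full(N, -1)
    label_num = 0
    for i in range(N):
        if not is_core[i] or y_new[i] != -1:
            continue
        stack = [i]
        while stack:
            j = stack.pop()
            if y_new[j] == -1:
                y_new[j] = label_num
                if is_core[j]:
                    stack.extend(nbrhoods[j])
        label_num += 1
    return y_new, core_indices

# Toy data (an assumption for illustration): two tight groups and one isolated point
X_toy = np.array([[0, 0], [0, 0.1], [0.1, 0],
                  [5, 5], [5, 5.1], [5.1, 5],
                  [10, 10]])
y_toy, core_toy = dbscan_sketch(X_toy, eps=0.3, min_samples=3)
print(y_toy)     # [ 0  0  0  1  1  1 -1]
print(core_toy)  # [0 1 2 3 4 5]
```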

## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts that you do not yet know how to implement.  

    1. 
    2. 
    3. ...

## Code

In [None]:
from sklearn.cluster import DBSCAN
dbs = DBSCAN
del DBSCAN

In [None]:
### your answer here
import matplotlib.animation as animation
import matplotlib.cm as cm

def DBSCAN_animation(X, y_new, animation_array, interval=10, **kwargs):
    fig, ax = plt.subplots()
    ax.scatter(*X.T, c='k')
    colors = cm.rainbow(np.linspace(0, 1, (max(y_new) + 1) * 2))
    def animate(step):
        if step[0] == 'arrow':
            ax.arrow(*step[1][0], *(step[1][1] - step[1][0]), color='k', head_width=0.07, length_includes_head=True, **kwargs)
        elif step[0] == 'core':
            ax.scatter(*step[1], c=colors[step[2]], **kwargs)
        elif step[0] == 'not core':
            ax.scatter(*step[1], c=colors[-(step[2] + 1)], **kwargs)
        elif step[0] == 'outlier':
            ax.scatter(*step[1], c='k', s=100, marker='x', **kwargs)
    return animation.FuncAnimation(fig, animate, animation_array, interval=interval, repeat=False)

def DBSCAN(X, eps, min_samples, animation=False, **kwargs):
    nbrhoods = [[] for i in range(X.shape[0])]
    for i in range(X.shape[0] - 1):
        for j in range(i + 1, X.shape[0]):
            # neighbors are points within distance eps (inclusive), as stated in Step 1
            if np.linalg.norm(X[i] - X[j]) <= eps:
                nbrhoods[i].append(j)
                nbrhoods[j].append(i)
    is_core = [l >= min_samples - 1 for l in map(len, nbrhoods)]
    label_num = 0
    y_new = [-1] * X.shape[0]
    animation_array = []
    for i in range(X.shape[0]):
        if not is_core[i] or y_new[i] != -1:
            continue
        stack = [i]
        y_new[i] = label_num
        if animation:
            animation_array.append(('core', X[i], label_num))
        while stack:
            current_node = stack[-1]
            if not nbrhoods[current_node]:
                stack.pop()
                continue
            next_node = nbrhoods[current_node].pop()
            if y_new[next_node] != -1:
                continue
            y_new[next_node] = label_num
            if animation:
                animation_array.append(('arrow', (X[current_node], X[next_node]), label_num))
            if not is_core[next_node]:
                if animation:
                    animation_array.append(('not core', X[next_node], label_num))
                continue
            stack.append(next_node)
            if animation:
                animation_array.append(('core', X[next_node], label_num))
        label_num += 1
    if animation:
        # samples that were never reached keep the label -1 and are drawn as outliers
        for i, y in enumerate(y_new):
            if y == -1:
                animation_array.append(('outlier', X[i], -1))
        return DBSCAN_animation(X, y_new, animation_array, **kwargs)
    return y_new, [i for i, core in enumerate(is_core) if core]

## Test
Take some sample data from [DBSCAN-with-scikit-learn](DBSCAN-with-scikit-learn.ipynb) and check whether your code produces outputs similar to those of the existing packages.

##### Name of the data
Description of the data.

##### Jephian:
You are supposed to add some description of the data here.

In [None]:
# test data
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

In [None]:
# animation
%matplotlib notebook
DBSCAN(X, eps = 0.5, min_samples = 5, animation = True, interval = 5)

In [None]:
### results with your code
y_my, core = DBSCAN(X, eps = 0.5, min_samples = 5)

In [None]:
### results with existing packages
model = dbs(eps = 0.5, min_samples = 5)
y_new = model.fit_predict(X)

## Comparison

##### Exercise 1
With the same `eps` and `min_samples`, the array `core_indices` generated by your function is supposed to be the same as `model.core_sample_indices_` .  
Check if this is true.

In [None]:
### your answer here
# np.array_equal also handles the case where the two arrays differ in length
print(np.array_equal(core, model.core_sample_indices_))

##### Exercise 2
Let `core_indices` be the array generated by your function.  
Then `X[core_indices]` is supposed to be the same as `model.components_` .  
Check if this is true.

In [None]:
### your answer here
# np.array_equal returns False instead of failing when the shapes differ
print(np.array_equal(X[core], model.components_))

##### Exercise 3
Let `y_my` be the output label of your function.  
Let `y_new` be the label given by `sklearn.cluster.DBSCAN` .

###### 3(a)
The noise samples given by `y_my == -1` and `y_new == -1` are supposed to be the same.  
Check if this is true.

In [None]:
### your answer here
y_my = np.array(y_my)
print(((y_new == -1) == (y_my == -1)).all()) 

###### 3(b)
Although `y_my` and `y_new` might be different, they indicate the same clustering.  
That is, the partitions are the same, but a group may have different labels in `y_my` and `y_new` .  
Check if this is true.

In [None]:
### your answer here
%matplotlib inline

fig1 = plt.figure()
ax1 = fig1.add_subplot(1,1,1)
fig2 = plt.figure()
ax2 = fig2.add_subplot(1,1,1)

fig1.suptitle("My DBSCAN")
ax1.scatter(*X.T, c=y_my)
ax1.scatter(*X[y_my == -1].T, c='k', s=100, marker='x')

fig2.suptitle("sklearn DBSCAN")
ax2.scatter(*X.T, c=y_new)
ax2.scatter(*X[y_new == -1].T, c='k', s=100, marker='x')
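
Besides comparing the two plots by eye, the claim can also be checked programmatically. The sketch below is one possible approach, not part of the original exercise: the helper name `same_partition` is made up here, and it tests whether the correspondence between the two sets of labels is one-to-one, with noise matching noise. Note that a border sample within `eps` of core samples from two different clusters may be assigned to either cluster depending on the visiting order, so a `False` result does not necessarily mean that an implementation is wrong.

```python
import numpy as np

def same_partition(labels_a, labels_b):
    """Return True if the two label arrays group the samples in the same way."""
    labels_a = np.asarray(labels_a)
    labels_b = np.asarray(labels_b)
    if labels_a.shape != labels_b.shape:
        return False
    # noise samples (label -1) must coincide exactly
    if not np.array_equal(labels_a == -1, labels_b == -1):
        return False
    # collect every (label in a, label in b) pair that actually occurs
    pairs = set(zip(labels_a.tolist(), labels_b.tolist()))
    # the relabeling is one-to-one exactly when each label occurs in a single pair
    n_a = len(set(labels_a.tolist()))
    n_b = len(set(labels_b.tolist()))
    return len(pairs) == n_a == n_b

print(same_partition([0, 0, 1, -1], [1, 1, 0, -1]))  # True
print(same_partition([0, 0, 1, -1], [0, 1, 1, -1]))  # False
```

With the variables from this notebook, the call would be `same_partition(y_my, y_new)`.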

##### Jephian:
Well done.