# DBSCAN from scratch

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

## Algorithm
**Input:**  
- `X`: an array of shape `(N,d)` whose rows are samples and columns are features
- `eps`: the radius $\epsilon$ used to define neighborhoods
- `min_samples`: a sample is considered a core sample if its $\epsilon$-ball contains at least `min_samples` samples (including itself)
- `draw`: boolean, whether to return an illustrative figure

**Output:**  
A tuple `(y_new, core_indices, fig)` or `(y_new, core_indices)`, depending on whether `draw` is set.  
- `y_new`: an array of shape `(N,)` that records the label of each sample, where `-1` stands for noise
- `core_indices`: an array of shape `(n_core_samples,)` that stores the indices of the core samples
- `fig`: an illustrative figure showing the data points, DFS tree, core samples, and noise points

**Steps:**
1. Build a list `nbrhoods` whose `i`-th element is the array of indices of the neighbors of sample `i`.  
Here two points are neighbors if the distance between them is less than or equal to `eps`.  
A point is considered its own neighbor.
2. If sample `i` has at least `min_samples` neighbors, then it is called a core sample.  
Store the indices of core samples in the array `core_indices` .
3. Set `label_num = 0`.  Label every sample with `-1`.  For each sample `i`, do the following [DFS](https://en.wikipedia.org/wiki/Depth-first_search):
    1. if sample `i` is a core sample still labeled `-1`, continue with the steps below; otherwise, skip them and move on to the next sample.
    2. let `stack = [i]`
    3. take (and remove) the last element `j` in `stack`
    4. if `j` is labeled `-1`, label it with `label_num`; in that case, if `j` is also a core sample, append the neighbors of `j` to the end of `stack`  
    5. repeat Steps 3 and 4 until `stack` is empty
    6. `label_num += 1`

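Steps 1 and 2 can be sketched on a tiny example. The data `X_toy` and the values `eps_toy` and `min_samples_toy` below are hypothetical, chosen only so that the neighborhoods are easy to verify by hand; this is an illustration, not the implementation used later in the notebook.

```python
import numpy as np

# Hypothetical toy data: five points on a line, stored as an (N,1) array.
X_toy = np.array([[0.0], [1.0], [2.0], [3.0], [10.0]])
eps_toy, min_samples_toy = 1.0, 3

# Pairwise distances via broadcasting: (N,1) - (1,N) gives shape (N,N).
dist = np.abs(X_toy - X_toy.T)

# Step 1: neighbors are points within distance eps (a point is its own neighbor).
adj = dist <= eps_toy
nbrhoods = [np.where(adj[i])[0] for i in range(len(X_toy))]

# Step 2: core samples have at least min_samples neighbors.
core_indices = np.where(adj.sum(axis=1) >= min_samples_toy)[0]

print([nb.tolist() for nb in nbrhoods])  # [[0, 1], [0, 1, 2], [1, 2, 3], [2, 3], [4]]
print(core_indices.tolist())             # [1, 2]
```

Only the two middle points have three neighbors, so they are the only core samples; the isolated point at 10 has no neighbor other than itself.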
## Pseudocode
Translate the algorithm into pseudocode.  
This helps you identify the parts that you do not yet know how to implement.  

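    1. Find the neighbors of every point
    2. A point with enough neighbors is a core
    3. Label the cores until every core is labeled; the remaining points stay at -1 (outliers)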

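The labeling part (Step 3) is the least obvious, so here is a minimal sketch of the DFS on hand-written inputs. The list `nbrhoods_toy` and the set `cores_toy` are hypothetical; they correspond to seven points on a line at 0, 1, 2, 10, 11, 12, 20 with `eps = 1` and `min_samples = 3`.

```python
# Hypothetical neighborhoods (each point is its own neighbor) and core set.
nbrhoods_toy = [[0, 1], [0, 1, 2], [1, 2], [3, 4], [3, 4, 5], [4, 5], [6]]
cores_toy = {1, 4}

labels = [-1] * len(nbrhoods_toy)   # every sample starts as noise
label_num = 0
for i in range(len(nbrhoods_toy)):
    if labels[i] != -1 or i not in cores_toy:
        continue                     # only an unlabeled core starts a new cluster
    stack = [i]
    while stack:
        j = stack.pop()              # take and remove the last element
        if labels[j] == -1:
            labels[j] = label_num
            if j in cores_toy:       # only cores spread the label further
                stack.extend(nbrhoods_toy[j])
    label_num += 1

print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

Non-core points reached from a core receive its label but do not push their own neighbors, which is why the last point, with no core nearby, stays at `-1`.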
## Code

In [None]:
def dist_mtx(X, Y=None):
    """Return the distance matrix between rows of X and rows of Y
    
    Input:  
        X: an array of shape (N,d)
        Y: an array of shape (M,d)
            if None, Y = X
           
    Output:
        the matrix [d_ij] where d_ij is the distance between  
        the i-th row of X and the j-th row of Y
    """
    if isinstance(Y, np.ndarray):
        pass
    elif Y is None:
        Y = X.copy()
    else:
        raise TypeError("Y should be a NumPy array or None") 
    X_col = X[:, np.newaxis, :]
    Y_row = Y[np.newaxis, :, :]
    diff = X_col - Y_row
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    return dist

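To see what the broadcasting in `dist_mtx` does, the same three operations can be traced on a small array. `X_small` is a hypothetical input built from 3-4-5 right triangles so that the distances come out as whole numbers.

```python
import numpy as np

# Hypothetical input: three points in the plane, spaced 5 units apart on a line.
X_small = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])

# (N,1,d) - (1,N,d) broadcasts to (N,N,d): every pairwise difference vector.
diff = X_small[:, np.newaxis, :] - X_small[np.newaxis, :, :]
print(diff.shape)     # (3, 3, 2)

# Summing squares over the last axis and taking the root gives the (N,N) distances.
dist = np.sqrt(np.sum(diff**2, axis=-1))
print(dist.tolist())  # [[0.0, 5.0, 10.0], [5.0, 0.0, 5.0], [10.0, 5.0, 0.0]]
```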
In [None]:
def DBSCAN_(X, eps=0.5, min_samples=5):
    # Steps 1 and 2
    N, d = X.shape
    dist = dist_mtx(X)                        # (N,N) distance matrix
    adj = (dist <= eps)                       # (N,N) boolean matrix; True where the distance is at most eps
    core_mask = (adj.sum(axis=1) >= min_samples)        # a sample is a core if at least min_samples points (itself included) lie within eps
    core_indices = np.where(core_mask)[0]               # indices of the core samples
    # np.where returns a tuple of arrays; core_mask is 1-D, so the tuple has a single element, hence the [0]
    nbrhoods = [np.where(adj[i])[0] for i in range(N)]  # neighbor indices of every sample
    # Step 3
    y_new = -np.ones((N,), dtype=int)         # start with every label set to -1
    label_num = 0                             # start from label 0
    for i in range(N):                        # go through all samples
        if y_new[i] == -1 and core_mask[i]:   # an unlabeled core
            # DFS
            stack = [i]                       # stack holds the samples waiting to be visited
            while stack:                      # keep going until stack is empty
                j = stack.pop()               # take and remove the last element of stack
                if y_new[j] == -1:            # this sample has not been labeled yet
                    y_new[j] = label_num      # label it
                    if core_mask[j]:          # a core passes its label on to its neighbors
                        stack += list(nbrhoods[j])    # push its neighbors onto stack
            label_num += 1                    # DFS finished: this cluster is complete, move on to the next label
    return y_new, core_indices
            

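Step 1 defines neighbors with "less than or equal to `eps`", and the choice between `<` and `<=` only matters when a distance equals `eps` exactly. The sketch below uses hypothetical points at 0, 1, 2 with `eps = 1.0` to show how the neighbor counts differ between the two comparisons. It is worth keeping in mind when comparing against a library, since scikit-learn documents `eps` as the maximum distance between two samples.

```python
import numpy as np

# Hypothetical data: consecutive points are exactly eps apart.
X_line = np.array([[0.0], [1.0], [2.0]])
eps = 1.0
dist = np.abs(X_line - X_line.T)

counts_strict = (dist < eps).sum(axis=1)   # strict inequality: only the point itself
counts_closed = (dist <= eps).sum(axis=1)  # closed ball, as in Step 1

print(counts_strict.tolist())  # [1, 1, 1]
print(counts_closed.tolist())  # [2, 3, 2]
```

With `min_samples = 3`, the middle point is a core under the closed-ball rule and not under the strict one, so the two conventions can produce different clusterings on data with ties.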
## Test
Take some sample data from [DBSCAN-with-scikit-learn](DBSCAN-with-scikit-learn.ipynb) and check whether your code produces outputs similar to those of existing packages.

In [None]:
mu1 = np.array([2.5,0])                    # randomly generated data points
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

In [None]:
#from sklearn.datasets import load_iris     # the iris flower dataset
#iris = load_iris()
#X = iris.data

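##### Two Gaussian clusters
Two hundred points in the plane: 100 drawn from each of two bivariate normal distributions, centered at `(2.5, 0)` and `(-2.5, 0)`.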

In [None]:
y_my, core_indices = DBSCAN_(X, eps=0.5, min_samples=5)    # handwritten implementation

In [None]:
from sklearn.cluster import DBSCAN                     # library implementation
model = DBSCAN(eps=0.5,min_samples=5)
y_new = model.fit_predict(X)

## Comparison

##### Exercise 1
With the same `eps` and `min_samples`, the array `core_indices` generated by your function is supposed to be the same as `model.core_sample_indices_` .  
Check if this is true.

In [None]:
(model.core_sample_indices_==core_indices).all()       # the core indices are the same

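An elementwise `==` followed by `.all()` assumes that the two arrays have the same shape. If the two implementations found different numbers of core samples, that assumption fails. A hedged alternative is `np.array_equal`, which returns `False` on a shape mismatch; the arrays `a`, `b`, `c` below are hypothetical stand-ins for index arrays.

```python
import numpy as np

# Hypothetical index arrays standing in for two lists of core indices.
a = np.array([1, 2, 5])
b = np.array([1, 2, 5])
c = np.array([1, 2])

print(np.array_equal(a, b))  # True: same shape and same entries
print(np.array_equal(a, c))  # False: shapes differ, no elementwise comparison needed
```

Like the elementwise check, this compares the arrays position by position, so both index arrays need to be in the same order.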
##### Exercise 2
Let `core_indices` be the array generated by your function.  
Then `X[core_indices]` is supposed to be the same as `model.components_` .  
Check if this is true.

In [None]:
(model.components_ == X[core_indices]).all()           # the core sample values are all the same

##### Exercise 3
Let `y_my` be the output label of your function.  
Let `y_new` be the label given by `sklearn.cluster.DBSCAN` .

###### 3(a)
The noise points `y_my == -1` and `y_new == -1` are supposed to be the same.  
Check if this is true.

In [None]:
((y_my == -1)==(y_new == -1)).all()

###### 3(b)
Although `y_my` and `y_new` might be different, they should indicate the same clustering.  
That is, the partitions are the same, but a group may have different labels in `y_my` and `y_new`.  
Check if this is true.

In [None]:
(y_my == y_new).all()                 # here the labels happen to be identical

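Direct equality only succeeds when both labelings happen to number the clusters in the same way. A check that ignores the label names can be sketched as follows; `same_partition` is a hypothetical helper, not part of NumPy or scikit-learn. It tests whether the pairs `(a[i], b[i])` define a one-to-one correspondence between the labels of `a` and the labels of `b`. Calling `same_partition(y_my, y_new)` would apply it to the two outputs above; it treats `-1` like any other label, so the noise check in 3(a) is still needed separately.

```python
def same_partition(a, b):
    """Return True if label sequences a and b group the samples identically."""
    a, b = list(a), list(b)
    if len(a) != len(b):
        return False
    # Each label in a must pair with exactly one label in b, and vice versa.
    pairs = set(zip(a, b))
    return len(pairs) == len(set(a)) == len(set(b))

# Hypothetical labelings: same groups with swapped names, then a different grouping.
print(same_partition([0, 0, 1, 1, -1], [1, 1, 0, 0, -1]))  # True
print(same_partition([0, 0, 1, 1, -1], [0, 1, 1, 1, -1]))  # False
```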
In [None]:
fig = plt.figure(figsize=(8,4))       
axs = fig.subplots(1,2)
axs[0].scatter(X[:,0], X[:,1], c=y_my)
axs[0].set_title("myDBSCAN")
axs[1].scatter(X[:,0], X[:,1], c=y_new)
axs[1].set_title("DBSCAN")