# DBSCAN with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt

In [None]:
def dist_mtx(X, Y=None):
    """Return the distance matrix between rows of X and rows of Y
    
    Input:  
        X: an array of shape (N,d)
        Y: an array of shape (M,d)
            if None, Y = X
           
    Output:
        the matrix [d_ij] where d_ij is the distance between  
        the i-th row of X and the j-th row of Y
    """
    if isinstance(Y, np.ndarray):
        pass
    elif Y is None:
        Y = X.copy()
    else:
        raise TypeError("Y should be a NumPy array or None") 
    X_col = X[:, np.newaxis, :]
    Y_row = Y[np.newaxis, :, :]
    diff = X_col - Y_row
    dist = np.sqrt(np.sum(diff**2, axis=-1))
    return dist
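
As a quick sanity check of the broadcasting steps used in `dist_mtx`, here is a minimal sketch on three hand-picked points that form a 3-4-5 right triangle. The names `pts` and `dist_small` are made up for this illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# Hypothetical toy points: (0,0), (3,0), (3,4) form a 3-4-5 right triangle.
pts = np.array([[0.0, 0.0],
                [3.0, 0.0],
                [3.0, 4.0]])

# Same steps as in dist_mtx: (N,1,d) - (1,M,d) broadcasts to (N,M,d).
diff = pts[:, np.newaxis, :] - pts[np.newaxis, :, :]
dist_small = np.sqrt(np.sum(diff**2, axis=-1))

# The result is symmetric with zeros on the diagonal:
# rows are [0, 3, 5], [3, 0, 4], [5, 4, 0].
print(dist_small)
```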

## Code
```python
from sklearn.cluster import DBSCAN
model = DBSCAN(<parameters>)
y_new = model.fit_predict(X)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)

## Parameters
- `eps`: the $\epsilon$ used for finding neighborhood
- `min_samples`: a sample is considered as a core sample if its $\epsilon$-ball contains at least `min_samples` samples (including itself)
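
To see how the two parameters interact, the sketch below decides by hand which samples are core samples on a made-up toy data set. It assumes a closed $\epsilon$-ball (distance `<= eps`), which to my understanding is how scikit-learn's radius query behaves. The names `X_toy`, `n_in_ball`, and `is_core` are hypothetical.

```python
import numpy as np

# Hypothetical toy data: five tightly packed points and one far-away point.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                  [0.3, 0.0], [0.4, 0.0], [5.0, 0.0]])
eps, min_samples = 0.5, 5   # the default values of DBSCAN

# Pairwise distances via broadcasting.
d = np.linalg.norm(X_toy[:, None, :] - X_toy[None, :, :], axis=-1)

# Number of samples in each eps-ball; the center itself is counted
# because its distance to itself is 0.
n_in_ball = np.sum(d <= eps, axis=1)
is_core = n_in_ball >= min_samples

print(n_in_ball)  # counts are 5, 5, 5, 5, 5, 1
print(is_core)    # the five packed points are core; the far point is not
```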

## Attributes
- `core_sample_indices_`: an array of shape `(n_core_samples,)` that stores the indices of the core samples
- `components_`: an array of shape `(n_core_samples, n_features)` that stores the core samples as rows
- `labels_`: an array of shape `(n_samples,)` that stores the label of each sample, where `-1` stands for noise
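
The attributes can be inspected on a small made-up data set where the answer is easy to predict. This is only an illustrative sketch; `X_toy` and `model_toy` are hypothetical names.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: five tightly packed points and one far-away point.
X_toy = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0],
                  [0.3, 0.0], [0.4, 0.0], [5.0, 0.0]])

model_toy = DBSCAN(eps=0.5, min_samples=5).fit(X_toy)

# The five packed points form cluster 0; the far point is noise (-1).
print(model_toy.labels_)
# Indices of the core samples: 0 through 4.
print(model_toy.core_sample_indices_)
# The core samples themselves, one per row: shape (5, 2).
print(model_toy.components_.shape)
```

Indexing `X_toy` with `core_sample_indices_` recovers the rows stored in `components_`.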

## Sample data

##### Exercise 1
Let  
```python
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
```

###### 1(a)
Apply the DBSCAN algorithm to `X` with the default setting and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.core_sample_indices_.shape` .

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

from sklearn.cluster import DBSCAN
model = DBSCAN()
y_new = model.fit_predict(X)

plt.axis("equal")
plt.scatter(*X.T,c=y_new)
print(model.core_sample_indices_.shape)  # number of core samples

###### 1(b)
Apply the DBSCAN algorithm to `X` with `eps=1` and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.core_sample_indices_.shape` .

In [None]:
model = DBSCAN(eps=1)     # eps is too large, so most points end up in the same cluster
y_new = model.fit_predict(X)

plt.axis("equal")
plt.scatter(*X.T,c=y_new)
print(model.core_sample_indices_.shape)

###### 1(c)
Apply the DBSCAN algorithm to `X` with `min_samples=10` and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.core_sample_indices_.shape` .

In [None]:
model = DBSCAN(min_samples=10)
y_new = model.fit_predict(X)

plt.axis("equal")
plt.scatter(*X.T,c=y_new)
print(model.core_sample_indices_.shape)

###### 1(d)
Finding an appropriate `eps` is the main task for the DBSCAN algorithm.  
Let `dist` be the distance matrix between rows in `X` .  
The code below finds the average of the distances from each sample to its $k$-th nearest sample.
```python
k = 10
sort_dist = dist.copy()
sort_dist.partition(k)
sort_dist[:,:k+1].max(axis=1).mean()
```
This can be a reference for the choice of `eps` .
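
The role of `partition` may be easier to see next to a full sort. The sketch below, on made-up random data, checks that both approaches give the same $k$-th nearest distances. Column `0` of each sorted row is the distance `0` from a sample to itself, so index `k` corresponds to the $k$-th nearest other sample. The names `pts`, `dist_demo`, `kth_by_partition`, and `kth_by_sort` are hypothetical.

```python
import numpy as np

# Hypothetical random data with a fixed seed, for illustration only.
rng = np.random.default_rng(0)
pts = rng.normal(size=(50, 2))
dist_demo = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

k = 10

# partition works in place along the last axis: afterwards the entry at
# index k of each row is the one a full sort would put there, and every
# entry before it is no larger.
part = dist_demo.copy()
part.partition(k)
kth_by_partition = part[:, :k+1].max(axis=1)

# A full sort gives the same value, at a higher cost.
kth_by_sort = np.sort(dist_demo, axis=1)[:, k]

print(np.allclose(kth_by_partition, kth_by_sort))
print(kth_by_partition.mean())  # a candidate scale for eps
```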

In [None]:
dist = dist_mtx(X, X)  # distance matrix: entry (i, j) is the distance from sample i to sample j
print(dist)

k = 10
sort_dist = dist.copy()
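sort_dist.partition(k)  # in each row, move the k+1 smallest entries to the first k+1 positions (in no particular order)
sort_dist[:,:k+1].max(axis=1).mean()  # distance to the k-th nearest sample in each row (index 0 is the sample itself), then average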

###### 1(e)
Finding an appropriate `min_samples` is another task for the DBSCAN algorithm.  
Let `dist` be the distance matrix between rows in `X` .
The code below generates the histogram of the number of neighbors inside the $\epsilon$-balls centered at each sample.
```python
eps = 0.5
n_nbrs = np.sum(dist < eps, axis=1)
plt.hist(n_nbrs)
```
This can be a reference for the choice of `min_samples` .
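
One way to read such neighbor counts is to ask what fraction of the samples would qualify as core samples for a few candidate values of `min_samples`. The sketch below does this on made-up random data. The names `pts`, `dist_demo`, `n_nbrs_demo`, and `fracs` are hypothetical, and the candidate values are arbitrary. Note that `dist < eps` counts the center itself, since the diagonal of the distance matrix is `0`.

```python
import numpy as np

# Hypothetical random data with a fixed seed, for illustration only.
rng = np.random.default_rng(0)
pts = rng.normal(size=(200, 2))
dist_demo = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)

eps = 0.5
n_nbrs_demo = np.sum(dist_demo < eps, axis=1)  # each count includes the sample itself

# Fraction of samples whose eps-ball holds at least min_samples samples.
fracs = []
for min_samples in (3, 5, 10, 20):
    frac = np.mean(n_nbrs_demo >= min_samples)
    fracs.append(frac)
    print(min_samples, frac)
```

A larger `min_samples` can only shrink the set of core samples, so the printed fractions never increase.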

In [None]:
eps = 0.5
n_nbrs = np.sum(dist < eps, axis=1)  # for each sample, count the samples within distance eps (itself included)
plt.hist(n_nbrs)

In [None]:
n_nbrs

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y_iris = iris.target
```

###### 2(a)
Plot the points (rows) in `X` with `c=y_iris` .  
(Each row is in $\mathbb{R}^4$.  
Just pick arbitrary two coordinates to plot the points.)  

In [None]:
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y_iris = iris.target

plt.axis("equal")
plt.scatter(X[:,1], X[:,3], c=y_iris)

###### 2(b)
Apply the DBSCAN algorithm to `X` with the default setting and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
(Each row is in $\mathbb{R}^4$.  
Just pick arbitrary two coordinates to plot the points.)  
Try your best to find appropriate `eps` and `min_samples` so that the result is similar to `y_iris` .
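
Besides comparing the plots by eye, the similarity to `y_iris` can be quantified. The sketch below uses `adjusted_rand_score` from `sklearn.metrics`, which is not part of the original exercise, and scans a small, arbitrarily chosen grid of parameters. The names `X_iris` and `scores` are hypothetical. The score treats the noise label `-1` as just another cluster.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Arbitrary candidate values, for illustration only.
scores = {}
for eps in (0.3, 0.5, 0.8):
    for min_samples in (5, 11):
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_iris)
        # 1.0 means identical groupings; values near 0 mean chance-level agreement.
        scores[(eps, min_samples)] = adjusted_rand_score(y_iris, labels)

for params, score in scores.items():
    print(params, round(score, 3))
```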

In [None]:
dist = dist_mtx(X, X)
k = 11
sort_dist = dist.copy()
sort_dist.partition(k)
print(sort_dist[:,:k+1].max(axis=1).mean())

eps = 0.5
n_nbrs = np.sum(dist < eps, axis=1)
plt.figure(1)
plt.hist(n_nbrs)
print(n_nbrs.mean())

In [None]:
model = DBSCAN(eps=0.5, min_samples=11)
y_new = model.fit_predict(X)

plt.axis("equal")
plt.scatter(X[:,1], X[:,3], c=y_new)

##### Exercise 3
Let  
```python
from PIL import Image
img = Image.open('incrediville-side.jpg')
arr = np.array(img.resize((int(img.size[0]/30), int(img.size[1]/30))))
m,n,c = arr.shape
X = arr.reshape(-1,3)
```
Decide appropriate `eps` and `min_samples` so that  
```python
img = (y_new == 0).reshape(m, n)
plt.imshow(img, cmap='Greys')
```
gives good image segmentation.

In [None]:
from PIL import Image
img = Image.open('incrediville-side.jpg')
arr = np.array(img.resize((int(img.size[0]/30), int(img.size[1]/30))))  # shrink the image first, then convert it to an array

m,n,c = arr.shape       # height, width, and number of color channels
X = arr.reshape(-1,3)   # one row per pixel; the 3 columns are R, G, B

model = DBSCAN(eps=5)
y_new = model.fit_predict(X) 

img = (y_new == 0).reshape(m, n)  # reshape the labels back to the image grid
plt.imshow(img, cmap='Greys')

## Experiments

##### Exercise 4
Let  
```python
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

model = DBSCAN()
y_new = model.fit_predict(X)
```

###### 4(a)
Print `model.core_sample_indices_.shape` and `model.components_.shape` .  
Confirm that `X[model.core_sample_indices_]` and `model.components_` are the same.  
Can you tell how many points are neither noise nor core?

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

model = DBSCAN()
y_new = model.fit_predict(X)

plt.axis("equal")
plt.scatter(*X.T, c=y_new)
print(model.core_sample_indices_.shape)   
print(model.components_.shape)

print(sum(model.labels_==-1))  # number of noise samples; noise has label -1

print((X[model.core_sample_indices_] == model.components_).all())  # check that the core rows of X match components_

print("neither noise nor core = 200 - core - noise:", 200 - sum(model.labels_==-1) - len(model.core_sample_indices_))

###### 4(b)
Plot the points (rows) in `X` with `c=y_new` .  
Plot the core samples with `c='r'` and `s=10` .

In [None]:
plt.axis("equal")
plt.scatter(*X.T, c=y_new)
plt.scatter(X[model.core_sample_indices_,0], X[model.core_sample_indices_,1], c='r', s=10)  # core samples in red with size 10

###### 4(c)
Use `model.labels_` to find out the noise samples.  
Adding upon your previous figure, plot the noise samples with `c='k'`, `s=100`, and `marker='x'`.  

In [None]:
noise = X[model.labels_ == -1]   # noise samples have label -1
plt.axis("equal")
plt.scatter(*X.T, c=y_new)
plt.scatter(X[model.core_sample_indices_,0], X[model.core_sample_indices_,1], c='r', s=10)  # core samples, as in 4(b)
plt.scatter(*noise.T, c='k', s=100, marker='x')  # noise samples

##### Exercise 5
Let  
```python
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = X[band1 | band2, :]
```
Apply the DBSCAN algorithm to `X` with `eps=1`  
(or other settings that you like)  
and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Is it a good clustering?  
For this data, would you choose DBSCAN or KMeans?

In [None]:
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = X[band1 | band2, :]

from sklearn.cluster import KMeans,DBSCAN
model1 = DBSCAN(eps=1)
y_new1 = model1.fit_predict(X)

model2 = KMeans(2)
y_new2 = model2.fit_predict(X)

In [None]:
fig = plt.figure(figsize=(8,4))
axs = fig.subplots(1,2)
axs[0].scatter(*X.T, c=y_new1)
axs[0].set_title("DBSCAN")
axs[1].scatter(*X.T, c=y_new2)
axs[1].set_title("KMeans")

In [None]:
print("DBSCAN gives the better clustering.")