# DBSCAN with scikit-learn

![Creative Commons License](https://i.creativecommons.org/l/by/4.0/88x31.png)  
This work by Jephian Lin is licensed under a [Creative Commons Attribution 4.0 International License](http://creativecommons.org/licenses/by/4.0/).

In [None]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

In [None]:
def dist_mtx(X, Y):
    X_col = X[:, np.newaxis, :]
    Y_row = Y[np.newaxis, :, :]
    dist = np.linalg.norm(X_col - Y_row, axis=-1)
    return dist

## Code
```python
from sklearn.cluster import DBSCAN
model = DBSCAN(<parameters>)
y_new = model.fit_predict(X)
```

[Official Reference](https://scikit-learn.org/stable/modules/generated/sklearn.cluster.DBSCAN.html)

## Parameters
- `eps`: the radius $\epsilon$ used to define the neighborhood of each sample
- `min_samples`: a sample is considered a core sample if its $\epsilon$-ball contains at least `min_samples` samples (including itself)
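
The core-sample rule above can be checked by hand. The sketch below is only an illustration: the data, the seed, and the names `X_demo`, `eps_demo`, and `min_samples_demo` are arbitrary choices, not part of the exercises. It counts the samples inside each closed $\epsilon$-ball and compares the result with `core_sample_indices_`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data; the seed and sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
eps_demo, min_samples_demo = 0.3, 5

# Pairwise Euclidean distances via broadcasting.
diff = X_demo[:, np.newaxis, :] - X_demo[np.newaxis, :, :]
dist_demo = np.linalg.norm(diff, axis=-1)

# A sample is core if its closed eps-ball holds at least min_samples points,
# counting the sample itself (distance 0).
n_in_ball = (dist_demo <= eps_demo).sum(axis=1)
manual_core = np.flatnonzero(n_in_ball >= min_samples_demo)

model_demo = DBSCAN(eps=eps_demo, min_samples=min_samples_demo).fit(X_demo)
print(np.array_equal(manual_core, model_demo.core_sample_indices_))
```

Barring a distance that lands exactly on `eps_demo` up to rounding, the two index arrays should agree.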

## Attributes
- `core_sample_indices_`: an array of shape `(n_core_samples,)` that stores the indices of the core samples
- `components_`: an array of shape `(n_core_samples, n_features)` that stores the core samples as rows
- `labels_`: an array of shape `(n_samples,)` that stores the label of each sample, where `-1` stands for noise
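
As a minimal sketch of how these attributes are typically read, the example below fits a model on made-up data (two blobs plus scattered background points, all arbitrary assumptions) and derives the number of clusters and noise samples from `labels_`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Illustrative data: two tight blobs plus a few scattered points.
rng = np.random.default_rng(1)
X_demo = np.vstack([
    rng.normal(loc=(-3, 0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(3, 0), scale=0.3, size=(50, 2)),
    rng.uniform(-6, 6, size=(10, 2)),
])

model_demo = DBSCAN(eps=0.5, min_samples=5).fit(X_demo)
labels = model_demo.labels_

# Cluster labels are 0, 1, 2, ...; the label -1 marks noise.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))

print("core samples:", model_demo.core_sample_indices_.shape)
print("components_ :", model_demo.components_.shape)
print("clusters    :", n_clusters)
print("noise       :", n_noise)
```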

## Sample data

##### Exercise 1
Let  
```python
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
```

###### 1(a)
Apply the DBSCAN algorithm to `X` with the default setting and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.core_sample_indices_.shape` .

In [None]:
### your answer here

from sklearn.cluster import DBSCAN
model = DBSCAN()

mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
y_new = model.fit_predict(X)


plt.axis('equal')
plt.scatter(*X.T, c=y_new)
plt.scatter(*model.components_.T, c='r', s=10)

In [None]:
model = KMeans(2)
y_KMeans = model.fit_predict(X)
plt.axis('equal')
plt.scatter(*X.T, c=y_KMeans)

##### Veronica 

You forgot to answer this one.
```python
print(model.core_sample_indices_.shape)
```

###### 1(b)
Apply the DBSCAN algorithm to `X` with `eps=1` and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.core_sample_indices_.shape` .

In [None]:
### your answer here
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])








model = DBSCAN(eps=1)
y_DBSCAN = model.fit_predict(X)
plt.axis('equal')
plt.scatter(*X.T, c=y_DBSCAN)


##### Veronica 

You forgot to answer this one.
```python
print(model.core_sample_indices_.shape)
```

###### 1(c)
Apply the DBSCAN algorithm to `X` with `min_samples=10` and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Print `model.core_sample_indices_.shape` .

In [None]:
### your answer here
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

eps, min_samples = 1, 10

dist = dist_mtx(X, X)
adj = (dist <= eps) 
core_mask = (adj.sum(axis=1) >= min_samples) 
cores = X[core_mask]
noise_mask = np.all(dist_mtx(X, cores) > eps, axis=1) 
noises = X[noise_mask]



model = DBSCAN(eps=eps)  # made explicit: same setting as the model fitted in 1(b)
y_cores = model.fit_predict(cores)
plt.axis('equal')
plt.scatter(*X.T, c='y')
plt.scatter(*cores.T, c=y_cores)
plt.scatter(*noises.T, c='purple', s=10)

In [None]:
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])
model = DBSCAN(min_samples=10)
y = model.fit_predict(X)
plt.scatter(*X.T, c=y)
print(model.core_sample_indices_.shape)

###### 1(d)
Finding an appropriate `eps` is the main task for the DBSCAN algorithm.  
Let `dist` be the distance matrix between rows in `X` .  
The code below finds the average distance from each sample to its $k$-th nearest sample.
```python
k = 10
sort_dist = dist.copy()
sort_dist.partition(k)
sort_dist[:,:k+1].max(axis=1).mean()
```
This can be a reference for the choice of `eps` .
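
The same $k$-th nearest distance can also be obtained from `sklearn.neighbors.NearestNeighbors`, which avoids building the full distance matrix by hand. The sketch below uses arbitrary demo data (the seed and `X_demo` are assumptions) and checks that both routes agree.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Illustrative data; the seed and sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
k = 10

# Route 1: partition each row of the distance matrix.
# Column 0 after sorting would be the sample itself (distance 0),
# so index k holds the distance to the k-th nearest other sample.
diff = X_demo[:, np.newaxis, :] - X_demo[np.newaxis, :, :]
dist_demo = np.linalg.norm(diff, axis=-1)
kth_by_partition = np.partition(dist_demo, k, axis=1)[:, k]

# Route 2: ask for k + 1 neighbors, since each sample is its own nearest neighbor.
nn = NearestNeighbors(n_neighbors=k + 1).fit(X_demo)
nn_dist, _ = nn.kneighbors(X_demo)
kth_by_nn = nn_dist[:, k]

print(np.allclose(kth_by_partition, kth_by_nn))
print("candidate eps:", kth_by_nn.mean())
```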

In [None]:
### your answer here
k = 2
sort_dist = dist.copy()
sort_dist.partition(k)
sort_dist[:,:k+1].max(axis=1).mean()


##### Veronica 

From the question, `k` should be changed to 10.

###### 1(e)
Finding an appropriate `min_samples` is another task for the DBSCAN algorithm.  
Let `dist` be the distance matrix between rows in `X` .
The code below generates the histogram of the number of neighbors inside the $\epsilon$-ball centered at each sample.
```python
eps = 0.5
n_nbrs = np.sum(dist < eps, axis=1)
plt.hist(n_nbrs)
```
This can be a reference for the choice of `min_samples` .
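
One way to read the neighbor counts, sketched below on arbitrary demo data (the seed, `X_demo`, and the candidate values are assumptions), is to ask what fraction of samples would become core samples for several choices of `min_samples`. The count here uses `<=` and includes the sample itself, matching the definition of a core sample given under Parameters.

```python
import numpy as np

# Illustrative data; the seed and sizes are arbitrary assumptions.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 2))
eps_demo = 0.5

diff = X_demo[:, np.newaxis, :] - X_demo[np.newaxis, :, :]
dist_demo = np.linalg.norm(diff, axis=-1)

# Number of samples in each closed eps-ball, including the center itself.
n_nbrs_demo = np.sum(dist_demo <= eps_demo, axis=1)

candidates = (3, 5, 10, 20)
core_fraction = [float(np.mean(n_nbrs_demo >= m)) for m in candidates]
for m, frac in zip(candidates, core_fraction):
    print(f"min_samples={m:2d}: {frac:.2f} of the samples would be core")
```

A larger `min_samples` can only shrink the set of core samples, so the printed fractions never increase down the list.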

In [None]:
### your answer here
eps = 0.247
n_nbrs = np.sum(dist < eps, axis=1)
plt.hist(n_nbrs)

##### Veronica 

From the question, `eps` should be changed to 0.5 .

##### Exercise 2
Let  
```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y_iris = iris.target
```

###### 2(a)
Plot the points (rows) in `X` with `c=y_iris` .  
(Each row is in $\mathbb{R}^4$.  
Just pick arbitrary two coordinates to plot the points.)  

In [None]:
### your answer here
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y_iris = iris.target

A=np.zeros([2,150])
A[0]=X.T[0]
A[1]=X.T[1]
A=A.T




plt.axis('equal')
plt.scatter(*A.T, c=y_iris)  # columns 0 and 1 of X


##### Veronica 

You can just write the following code.

```python
from sklearn.datasets import load_iris
iris = load_iris()
X = iris.data
y_iris = iris.target

plt.axis("equal")
plt.scatter(X[:,1], X[:,2], c=y_iris)
```

###### 2(b)
Apply the DBSCAN algorithm to `X` with the default setting and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
(Each row is in $\mathbb{R}^4$.  
Just pick arbitrary two coordinates to plot the points.)  
Try your best to find appropriate `eps` and `min_samples` so that the result is similar to `y_iris` .
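
"Similar to `y_iris`" can be made quantitative. The sketch below is one possible approach, not the intended answer: it scans a small, arbitrarily chosen grid of `eps` and `min_samples` values and scores each labeling against `y_iris` with the adjusted Rand index from `sklearn.metrics`. Note that the score treats the noise label `-1` as just another group.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

iris = load_iris()
X_iris, y_iris = iris.data, iris.target

# Candidate values are arbitrary assumptions for illustration.
eps_grid = (0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0)
min_samples_grid = (3, 5, 10, 15)

results = []
for eps_try in eps_grid:
    for min_samples_try in min_samples_grid:
        labels_try = DBSCAN(eps=eps_try, min_samples=min_samples_try).fit_predict(X_iris)
        score = adjusted_rand_score(y_iris, labels_try)
        results.append((score, eps_try, min_samples_try))

# Tuples compare by their first entry, so max picks the best score.
best_score, best_eps, best_min_samples = max(results)
print(f"best ARI {best_score:.3f} at eps={best_eps}, min_samples={best_min_samples}")
```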

In [None]:
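dist = dist_mtx(X, X)  # recompute for the iris data; the earlier `dist` belongs to Exercise 1

k = 3
sort_dist = dist.copy()
sort_dist.partition(k)
eps = sort_dist[:,:k+1].max(axis=1).mean()
print(eps)
n_nbrs = np.sum(dist < eps, axis=1)
xr = plt.hist(n_nbrs)
print(xr)

min_samples = 3

# apply DBSCAN with the chosen parameters and plot two of the four coordinates
model = DBSCAN(eps=eps, min_samples=min_samples)
y_new = model.fit_predict(X)
plt.figure()
plt.scatter(X[:,0], X[:,1], c=y_new)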


##### Exercise 3
Let  
```python
from PIL import Image
img = Image.open('incrediville-side.jpg')
arr = np.array(img.resize((int(img.size[0]/30), int(img.size[1]/30))))
m,n,c = arr.shape
X = arr.reshape(-1,3)
```
Decide appropriate `eps` and `min_samples` so that  
```python
img = (y_new == 0).reshape(m, n)
plt.imshow(img, cmap='Greys')
```
gives good image segmentation.

In [None]:
### your answer here
from PIL import Image
img = Image.open('incrediville-side.jpg')
arr = np.array(img.resize((int(img.size[0]/30), int(img.size[1]/30))))
m,n,c = arr.shape
X = arr.reshape(-1,3)

In [None]:
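# cast to float first: subtracting uint8 pixel values would wrap around
dist = dist_mtx(X.astype(float), X.astype(float))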


k = 3
sort_dist = dist.copy()
sort_dist.partition(k)
eps=sort_dist[:,:k+1].max(axis=1).mean()


n_nbrs = np.sum(dist < eps, axis=1)
plt.hist(n_nbrs)

In [None]:
min_samples = 49

In [None]:
adj = (dist <= eps) 
core_mask = (adj.sum(axis=1) >= min_samples) 
cores = X[core_mask]
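# cast to float first: subtracting uint8 pixel values would wrap around
noise_mask = np.all(dist_mtx(X.astype(float), cores.astype(float)) > eps, axis=1) 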
noises = X[noise_mask]

In [None]:
model = DBSCAN(eps=eps, min_samples=min_samples)  # use the parameters chosen above
y_new = model.fit_predict(X)
img = (y_new == 0).reshape(m, n)
plt.imshow(img, cmap='Greys')

##### Exercise 4
Let  
```python
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

model = DBSCAN()
y_new = model.fit_predict(X)
```

###### 4(a)
Print `model.core_sample_indices_.shape` and `model.components_.shape` .  
Confirm that `X[model.core_sample_indices_]` and `model.components_` are the same.  
Can you tell how many points are neither noise nor core?
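
A minimal sketch of one way to do the confirmation is shown below. It regenerates data like that of this exercise under a fixed seed (the seed and the names ending in `_demo` are assumptions), compares the two arrays with `np.array_equal`, and builds boolean masks for core, noise, and the remaining (border) samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Same distribution as in the exercise, with an arbitrary fixed seed.
rng = np.random.default_rng(0)
mu1 = np.array([2.5, 0])
cov1 = np.array([[1.1, -1],
                 [-1, 1.1]])
mu2 = np.array([-2.5, 0])
cov2 = np.array([[1.1, 1],
                 [1, 1.1]])
X_demo = np.vstack([rng.multivariate_normal(mu1, cov1, 100),
                    rng.multivariate_normal(mu2, cov2, 100)])

model_demo = DBSCAN().fit(X_demo)

# components_ should hold exactly the rows of X indexed by core_sample_indices_.
same = np.array_equal(X_demo[model_demo.core_sample_indices_], model_demo.components_)

# Every sample is exactly one of: core, noise, or border.
core_mask = np.zeros(len(X_demo), dtype=bool)
core_mask[model_demo.core_sample_indices_] = True
noise_mask = model_demo.labels_ == -1
border_mask = ~core_mask & ~noise_mask

print(same)
print(core_mask.sum(), noise_mask.sum(), border_mask.sum())
```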

In [None]:
mu1 = np.array([2.5,0])
cov1 = np.array([[1.1,-1],
                [-1,1.1]])
mu2 = np.array([-2.5,0])
cov2 = np.array([[1.1,1],
                [1,1.1]])
X = np.vstack([np.random.multivariate_normal(mu1, cov1, 100), 
               np.random.multivariate_normal(mu2, cov2, 100)])

model = DBSCAN()
y_new = model.fit_predict(X)

In [None]:
model = DBSCAN()
y_new = model.fit_predict(X)

print(X.shape)
print(model.components_.shape)

plt.scatter(X[:,0], X[:,1], c=y_new)
print(model.core_sample_indices_.shape)

##### Jephian

The number of samples that are neither core nor noise can be obtained as follows.
```python
n_sample = X.shape[0]
n_core = model.core_sample_indices_.shape[0]
n_noise = np.sum(model.labels_ == -1)
n_sample - n_core - n_noise
```

###### 4(b)
Plot the points (rows) in `X` with `c=y_new` .  
Plot the core samples with `c='r'` and `s=10` .

In [None]:
plt.axis("equal")
plt.scatter(*X.T, c=y_new)
plt.scatter(X[model.core_sample_indices_,0], X[model.core_sample_indices_,1], c='r', s=10)

###### 4(c)
Use `model.labels_` to find the noise samples.  
Adding to your previous figure, plot the noise samples with `c='k'`, `s=100`, and `marker='x'`.  

In [None]:
noise = X[model.labels_ == -1]   #noise samples
plt.axis("equal")
plt.scatter(*X.T, c=y_new)
plt.scatter(X[model.core_sample_indices_,0], X[model.core_sample_indices_,1], c='k', s=100, marker='x')
plt.scatter(*noise.T)

##### Jephian

```python
plt.axis("equal")
plt.scatter(X[:,0], X[:,1], c=y_new)
plt.scatter(X[model.core_sample_indices_,0], X[model.core_sample_indices_,1], c='r', s=10)
plt.scatter(X[model.labels_==-1,0], X[model.labels_==-1,1], c='k', s=100, marker='x')
```

##### Exercise 5
Let  
```python
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = X[band1 | band2, :]
```
Apply the DBSCAN algorithm to `X` with `eps=1`  
(or other settings that you like)  
and get the prediction `y_new` .  
Plot the points (rows) in `X` with `c=y_new` .  
Is it a good clustering?  
For this data, would you choose DBSCAN or KMeans?
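
One observation helps with this question. Any point in the inner band has length less than 2 and any point in the outer band has length greater than 3, so by the reverse triangle inequality two points from different bands are more than 1 apart. With `eps=1`, DBSCAN therefore cannot link the two bands. The sketch below checks this on data generated with a fixed seed and compares both algorithms against the band membership using the adjusted Rand index; the seed, the `KMeans` settings, and the names `y_band`, `ari_dbscan`, and `ari_kmeans` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans
from sklearn.metrics import adjusted_rand_score

# Same construction as in the exercise, with an arbitrary fixed seed.
rng = np.random.default_rng(0)
X_demo = 5 * rng.standard_normal((1000, 2))
lengths = np.linalg.norm(X_demo, axis=1)
band1 = (lengths > 1) & (lengths < 2)
band2 = (lengths > 3) & (lengths < 4)
keep = band1 | band2
X_demo = X_demo[keep]
y_band = band2[keep].astype(int)  # 0 = inner band, 1 = outer band

y_dbscan = DBSCAN(eps=1).fit_predict(X_demo)
y_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_demo)

# Each DBSCAN cluster (ignoring noise) should sit inside a single band.
pure = all(len(np.unique(y_band[y_dbscan == lab])) == 1
           for lab in set(y_dbscan) if lab != -1)

ari_dbscan = adjusted_rand_score(y_band, y_dbscan)
ari_kmeans = adjusted_rand_score(y_band, y_kmeans)
print("every DBSCAN cluster lies in one band:", pure)
print("ARI DBSCAN:", ari_dbscan)
print("ARI KMeans:", ari_kmeans)
```

Whether DBSCAN recovers each band as a single cluster still depends on how densely the band is sampled, so the scores are worth inspecting rather than assuming.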

In [None]:
### your answer here

In [None]:
X = 5 * np.random.randn(1000,2)
lengths = np.linalg.norm(X, axis=1)
band1 = (lengths > 1) & (lengths <2)  
band2 = (lengths > 3) & (lengths <4)  
X = X[band1 | band2, :]



In [None]:
model = DBSCAN(eps=1)
y_DBSCAN = model.fit_predict(X)
plt.axis('equal')

plt.scatter(*X.T, c=y_DBSCAN)


In [None]:
from sklearn.cluster import KMeans
model = KMeans(2)
y_new = model.fit_predict(X)

X_test = np.random.rand(1000, 2) * 8 - np.array([4,4])
y_test_new = model.predict(X_test)

plt.axis('equal')
plt.scatter(*X_test.T, c=y_test_new, s=10, alpha=0.1)
plt.scatter(*X.T, c=y_new)
plt.scatter(*model.cluster_centers_.T, c='r')