## Density-based clustering

In density-based clustering the approach differs from distributed clustering, and we implement all functions from scratch. We use the same libraries as in the previous example, plus the `random` package, which is used to shuffle the objects in the data set.

In [1]:
import random
import numpy as np
import pandas as pd
from math import sqrt

DBSCAN is an example of a density-based clustering method. It groups elements by their $\epsilon$-neighborhood, which for an element $p$ is defined as:
\begin{equation}
    N_{\epsilon}(p) = \{q \mid d(p,q)\leq\epsilon\},
\end{equation}
where $p$ and $q$ are two elements of the training data set, $d(p,q)$ is the distance between them, and $\epsilon$ is the neighborhood distance. For the data set used before, with $\epsilon$ set to 0.25, we get regions like those in the figure below.

![density](./../images/density.png)
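
To make the neighborhood definition concrete, here is a minimal sketch that computes $N_{\epsilon}(p)$ for a few points. The points and the helper names `euclidean` and `epsilon_neighbourhood` are hypothetical and are not part of this notebook's data set or code.

```python
from math import sqrt

# Hypothetical 2-D points, not the data set used in this notebook.
points = [(0.0, 0.0), (0.1, 0.1), (0.2, 0.0), (1.0, 1.0)]
epsilon = 0.25


def euclidean(p, q):
    """Euclidean distance between two 2-D points."""
    return sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)


def epsilon_neighbourhood(points, p_index, epsilon):
    """Return the indices q with d(p, q) <= epsilon (p itself is always included)."""
    p = points[p_index]
    return [i for i, q in enumerate(points) if euclidean(p, q) <= epsilon]


print(epsilon_neighbourhood(points, 0, epsilon))  # [0, 1, 2]
print(epsilon_neighbourhood(points, 3, epsilon))  # [3]
```

Note that every point belongs to its own neighborhood, because $d(p,p)=0$.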

Let's set up the variables as in the previous examples. There are four new ones: ```distance_matrix```, ```max_distance```, ```number_of_cluster```, and ``min_points``. The first one is self-explanatory. The second is a tunable parameter, the neighborhood distance $\epsilon$, which controls how many neighboring elements we consider. The third is not the final number of clusters, but a counter that lets us number the clusters as they are found during clustering. The last is the number of points that must lie within a neighborhood for an object to be classified as a non-border object. Border points are the points that lie farthest from the cluster but are still not noise.

In [2]:
%store -r data_set

assignation = np.zeros(len(data_set))
distance_matrix = np.zeros((len(data_set), len(data_set)))
max_distance = 0.25
number_of_cluster = 0
min_points = 2
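
The distinction between non-border (core), border, and noise objects can be sketched as follows. This uses the usual DBSCAN terminology as an assumption: a core point has at least `min_points` elements in its neighborhood (itself included), a border point is not core but lies within `max_distance` of a core point, and everything else is noise. The points, the value `min_points = 3`, and the function `classify_points` are hypothetical.

```python
import numpy as np

# Hypothetical points: a tight group of three, one point on its edge, one outlier.
points = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.3, 0.0], [2.0, 2.0]])
max_distance = 0.25
min_points = 3  # chosen for this illustration only


def classify_points(points, max_distance, min_points):
    """Label every point as 'core', 'border' or 'noise'."""
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # neighbours[i, j] is True when j lies in the neighborhood of i (i included)
    neighbours = distances <= max_distance
    is_core = neighbours.sum(axis=1) >= min_points
    # a non-core point is a border point if any of its neighbours is a core point
    near_core = (neighbours & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))


print(classify_points(points, max_distance, min_points))
# ['core' 'core' 'core' 'border' 'noise']
```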

We need the distance function from the previous example to calculate the distance matrix:

In [3]:
def calculate_distance(x,v):
    return sqrt((x[0]-v[0])**2+(x[1]-v[1])**2)

To calculate the distance matrix we apply ``calculate_distance`` to every pair of elements in the data set:

In [4]:
def calculate_distance_matrix():
    distance_matrix = np.zeros((len(data_set),len(data_set)))
    for i in range(len(data_set)):
        for j in range(len(data_set)):
            distance_matrix[i, j] = calculate_distance(data_set[i], data_set[j])
    return distance_matrix
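
The double loop above is easy to follow but slow for larger data sets. As an alternative sketch, the same matrix can be computed with NumPy broadcasting. The function name `calculate_distance_matrix_vectorized` and the sample points are hypothetical, and the function takes the points as an argument instead of reading the global `data_set`.

```python
import numpy as np


def calculate_distance_matrix_vectorized(points):
    """Pairwise Euclidean distances between the rows of `points`."""
    points = np.asarray(points, dtype=float)
    # diffs[i, j] holds the coordinate-wise difference points[i] - points[j]
    diffs = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    return np.sqrt((diffs ** 2).sum(axis=-1))


# Hypothetical sample: three points on a line, 5 units apart.
sample = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]])
print(calculate_distance_matrix_vectorized(sample))
```

The result is symmetric with zeros on the diagonal, and it works for any number of feature columns, not only two.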

The next step is to get the elements closest to a given element in the feature space, that is, those within ``max_distance`` of it:

In [5]:
def get_closest_elements(distance_matrix, element_id):
    element_distances = distance_matrix[element_id]
    filtered = {}
    # keep every element within max_distance; this includes element_id itself
    for neighbour_id, distance in enumerate(element_distances):
        if distance <= max_distance:
            filtered[neighbour_id] = distance
    return filtered
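
A short usage sketch may help here. The variant below is an assumption made for illustration: it takes `max_distance` as an argument instead of reading the global variable, uses `<=` as in the neighborhood definition, and runs on a hypothetical 3x3 distance matrix.

```python
import numpy as np


def get_closest_elements(distance_matrix, element_id, max_distance):
    """Map neighbour index -> distance for all elements within max_distance."""
    row = distance_matrix[element_id]
    neighbour_ids = np.flatnonzero(row <= max_distance)
    return {int(i): float(row[i]) for i in neighbour_ids}


# Hypothetical distances between three elements.
distances = np.array([[0.0, 0.2, 0.9],
                      [0.2, 0.0, 0.8],
                      [0.9, 0.8, 0.0]])

print(get_closest_elements(distances, 0, 0.25))  # {0: 0.0, 1: 0.2}
print(get_closest_elements(distances, 2, 0.25))  # {2: 0.0}
```

Because an element is always its own neighbor, a result of size one means that no other element lies within `max_distance`.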

The last step before the cluster function is to define functions that mark the elements of our data set that are known to be noise or were already visited by our method.

In [6]:
def set_as_noise(assignation,element_id):
    assignation[element_id] = -1
    return assignation
    
def set_visited(elements, assignation, number_of_clusters):    
    for element_id in elements.keys():
        assignation[element_id] = number_of_clusters 
    return assignation

Combine it all together:

In [7]:
def cluster_density(assignation):
    number_of_cluster = 0
    distance_matrix = calculate_distance_matrix()
    element_ids = list(range(len(data_set)))
    # visit the elements in random order
    random.shuffle(element_ids)
    for i in element_ids:
        # skip elements already marked as noise or assigned to a cluster
        if assignation[i] != 0:
            continue
        closest = get_closest_elements(distance_matrix, i)
        if len(closest) < min_points:
            assignation = set_as_noise(assignation,i)
        else:
            # cluster ids start at 1, because 0 means "not visited yet"
            number_of_cluster = number_of_cluster + 1
            assignation = set_visited(closest, assignation, number_of_cluster)
    return assignation
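
The function above labels only the direct neighbors of each visited element. Textbook DBSCAN additionally grows a cluster through every core point it reaches. The sketch below illustrates that variant and is not the method used in this notebook: the function name `dbscan`, the constants, and the sample points are hypothetical, and elements are visited in index order rather than shuffled so that the result is reproducible.

```python
import numpy as np

NOISE = -1
UNVISITED = 0


def dbscan(points, max_distance, min_points):
    """Label points with cluster ids 1, 2, ... or NOISE, expanding through core points."""
    points = np.asarray(points, dtype=float)
    n = len(points)
    diffs = points[:, None, :] - points[None, :, :]
    distances = np.sqrt((diffs ** 2).sum(axis=-1))
    # neighbours[i] holds the indices within max_distance of i (i included)
    neighbours = [np.flatnonzero(distances[i] <= max_distance) for i in range(n)]

    labels = np.full(n, UNVISITED, dtype=int)
    cluster_id = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        if len(neighbours[i]) < min_points:
            labels[i] = NOISE  # may later be relabelled as a border point
            continue
        cluster_id += 1
        labels[i] = cluster_id
        queue = list(neighbours[i])
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] != UNVISITED:
                continue  # already assigned, do not expand again
            labels[j] = cluster_id
            if len(neighbours[j]) >= min_points:
                queue.extend(neighbours[j])  # j is a core point, keep growing
    return labels


# Hypothetical data: a chain of four points, a pair, and one outlier.
sample = [[0.0, 0.0], [0.2, 0.0], [0.4, 0.0], [0.6, 0.0],
          [3.0, 3.0], [3.2, 3.0], [10.0, 10.0]]
print(dbscan(sample, max_distance=0.25, min_points=2))
# [ 1  1  1  1  2  2 -1]
```

The whole chain ends up in one cluster even though its end points are more than `max_distance` apart, because each point is reachable through its neighbors.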

Ready to cluster:

In [8]:
new_assignation_density = cluster_density(assignation)

The number of clusters is the number of unique cluster ids in ``new_assignation_density``, minus one for the noise label.

In [9]:
print("Number of clusters: "+ str(len(np.unique(new_assignation_density))-1))

Number of clusters: 2
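
Subtracting one gives the right count only when at least one object is marked as noise. A slightly more defensive sketch, with the hypothetical helper `count_clusters`, excludes the noise label explicitly:

```python
import numpy as np


def count_clusters(assignation, noise_label=-1):
    """Count the distinct cluster ids, ignoring the noise label."""
    labels = np.unique(assignation)
    return int(np.sum(labels != noise_label))


# An assignment with noise and a hypothetical one without noise.
print(count_clusters(np.array([2., 2., 2., 2., 2., 2., 1., 1., -1., -1.])))  # 2
print(count_clusters(np.array([1., 1., 2.])))                                # 2
```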


The noise is marked with -1. The other objects have the cluster number assigned.

In [10]:
print(new_assignation_density)

[ 2.  2.  2.  2.  2.  2.  1.  1. -1. -1.]


In [11]:
%store new_assignation_density

Stored 'new_assignation_density' (ndarray)
