## DBSCAN Clustering
#### Lewis Sears

**Density-based spatial clustering of applications with noise**, commonly DBSCAN, is a popular unsupervised clustering technique that is fundamentally different from $k$-means clustering. Instead of iteratively updating centroids that estimate the centers of the desired clusters, DBSCAN creates an $\epsilon$-neighborhood around each data point and examines how these neighborhoods overlap. We label points as *core* or *boundary* points based on how many other data points lie in their neighborhood: a core point has at least a chosen minimum number of neighbors, while a boundary point falls short of that minimum but lies in the neighborhood of a core point. Points that are neither core nor boundary points are considered *noise* points. There are naturally two hyperparameters to the algorithm:
1. How large do we make $\epsilon$ to accurately capture the topology of clusters without capturing noise?
2. For a data point $x_{i}$ with a neighborhood $N_{\epsilon}(x_{i})$, how many other data points must lie in $N_{\epsilon}(x_{i})$ for $x_{i}$ not to be labeled as noise?
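
To make the three labels concrete, here is a minimal sketch of the labeling step using only NumPy. The function name `label_points` and its parameters are hypothetical and not part of the class developed below. It assumes Euclidean distance and counts neighbors excluding the point itself.

```python
import numpy as np

def label_points(X, radius, min_neighbors):
    """Label each row of X as 'core', 'border', or 'noise' (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances via broadcasting, shape (n, n)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    within = dist <= radius
    # Number of *other* points in each neighborhood (subtract the point itself)
    counts = within.sum(axis=1) - 1
    is_core = counts >= min_neighbors
    # A border point is not core but lies within `radius` of some core point
    near_core = (within & is_core[None, :]).any(axis=1)
    return np.where(is_core, "core", np.where(near_core, "border", "noise"))

# Three diagonal groups of five points plus one far-away point
pts = [[a, a - 5] for a in [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25]]
pts.append([100, 100])
print(label_points(pts, radius=3, min_neighbors=3))
```

With these settings, the two end points of each group should come out as border points, the three interior points as core points, and the isolated point as noise.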

Choosing good values for both hyperparameters typically takes some careful tuning.
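
One commonly used heuristic for picking $\epsilon$ is to sort every point's distance to its $k$-th nearest neighbor and look for a sharp bend in the resulting curve. The sketch below only computes the sorted distances; the helper name `k_distances` is hypothetical, and reading the bend off a plot is left to the reader.

```python
import numpy as np

def k_distances(X, k):
    """Return each point's distance to its k-th nearest neighbor, sorted ascending."""
    X = np.asarray(X, dtype=float)
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest neighbor.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)

pts = [[a, a - 5] for a in [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25]]
print(k_distances(pts, k=3))
```

A value of $\epsilon$ just above the flat part of this curve keeps most points dense, while the points on the steep tail are the candidates for noise.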

### The Algorithm

In [1]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform

class DBSCAN(object):
    '''This is a DBSCAN clustering algorithm. It uses Euclidean distance.'''

    def __init__(self, radius, noiseNumber):
        '''Initialize Hyperparameters:

        radius: The size of the neighborhoods around points to evaluate close points.
        noiseNumber: The minimum number of other points in a neighborhood; points with fewer are treated as noise.
        '''
        self.radius = radius
        self.n = noiseNumber

        
    def fit(self, DataFrame):
        '''Filters noise points out of a pandas DataFrame. Scale before fitting.'''

        # Create a pairwise distance matrix of the data
        dist_matrix = squareform(pdist(DataFrame, metric='euclidean'))

        # Count the other points within `radius` of each point (subtract 1 for the point itself)
        within_radius = dist_matrix <= self.radius
        neighborhood_density = np.count_nonzero(within_radius, axis=0) - 1

        # Split the data into dense points and noise points
        is_dense = neighborhood_density >= self.n
        self.dense_points = DataFrame[is_dense]
        dense_dist = dist_matrix[is_dense]
        self.noise_points = DataFrame[~is_dense]

        # This method stops after filtering: it returns the dense points'
        # rows of the distance matrix rather than cluster labels.
        return dense_dist
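
The `fit` method above stops after separating dense points from noise. A minimal sketch of the remaining step, growing clusters outward from core points, might look like the following. The function `expand_clusters` is a hypothetical standalone helper, not part of the class. Unlike the filter above, it follows the usual DBSCAN convention of attaching low-density points that lie within `radius` of a core point to that cluster as boundary points.

```python
import numpy as np
from collections import deque

def expand_clusters(dist_matrix, radius, min_neighbors):
    """Assign a cluster id to each point; -1 marks noise (illustrative helper)."""
    n = dist_matrix.shape[0]
    within = dist_matrix <= radius
    # Core points have at least `min_neighbors` other points within `radius`
    is_core = within.sum(axis=1) - 1 >= min_neighbors
    labels = np.full(n, -1, dtype=int)
    cluster_id = 0
    for start in range(n):
        # Only unvisited core points can seed a new cluster
        if not is_core[start] or labels[start] != -1:
            continue
        labels[start] = cluster_id
        queue = deque([start])
        while queue:
            i = queue.popleft()
            for j in np.flatnonzero(within[i]):
                if labels[j] == -1:
                    labels[j] = cluster_id
                    # Boundary points join the cluster but are not expanded further
                    if is_core[j]:
                        queue.append(j)
        cluster_id += 1
    return labels

pts = np.array([[a, a - 5] for a in [1, 2, 3, 4, 5, 11, 12, 13, 14, 15, 21, 22, 23, 24, 25]], dtype=float)
dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(axis=-1))
print(expand_clusters(dist, radius=3, min_neighbors=3))
```

The breadth-first queue is one simple way to follow chains of overlapping core neighborhoods; a recursive or union-find formulation would work as well.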

In [2]:
df = pd.DataFrame({'a': [1,2,3,4,5,11,12,13,14,15,21,22,23,24,25]})
df['b'] = df['a'] - 5

In [3]:
dbscan = DBSCAN(3, 3)

In [4]:
dbscan.fit(df)

array([[ 1.41421356,  0.        ,  1.41421356,  2.82842712,  4.24264069,
        12.72792206, 14.14213562, 15.55634919, 16.97056275, 18.38477631,
        26.87005769, 28.28427125, 29.69848481, 31.11269837, 32.52691193],
       [ 2.82842712,  1.41421356,  0.        ,  1.41421356,  2.82842712,
        11.3137085 , 12.72792206, 14.14213562, 15.55634919, 16.97056275,
        25.45584412, 26.87005769, 28.28427125, 29.69848481, 31.11269837],
       [ 4.24264069,  2.82842712,  1.41421356,  0.        ,  1.41421356,
         9.89949494, 11.3137085 , 12.72792206, 14.14213562, 15.55634919,
        24.04163056, 25.45584412, 26.87005769, 28.28427125, 29.69848481],
       [15.55634919, 14.14213562, 12.72792206, 11.3137085 ,  9.89949494,
         1.41421356,  0.        ,  1.41421356,  2.82842712,  4.24264069,
        12.72792206, 14.14213562, 15.55634919, 16.97056275, 18.38477631],
       [16.97056275, 15.55634919, 14.14213562, 12.72792206, 11.3137085 ,
         2.82842712,  1.41421356,  0.        , 

In [6]:
dbscan.dense

AttributeError: 'DBSCAN' object has no attribute 'dense'