## DBSCAN Clustering
#### Lewis Sears

**Density-based spatial clustering of applications with noise**, commonly abbreviated DBSCAN, is a popular unsupervised clustering technique that is fundamentally different from $k$-means clustering. Instead of iteratively updating centroids that estimate the centers of our desired clusters, DBSCAN creates an $\epsilon$-neighborhood around each data point and analyzes how these neighborhoods overlap. We label points as *core* or *boundary* points based on how many other data points are in their neighborhood: a core point has enough neighbors, while a boundary point has too few but lies in the neighborhood of a core point. Points that are neither core nor boundary points are considered *noise* points. There are naturally two hyperparameters to the algorithm:
1. How large do we make $\epsilon$ to accurately capture the topology of clusters without capturing noise?
2. For a data point $x_{i}$ with a neighborhood $N_{\epsilon}(x_{i})$, how many other data points must be in $N_{\epsilon}(x_{i})$ for $x_{i}$ not to be labeled as noise?

Both hyperparameters depend on the scale and density of the data, so this naturally requires some intensive tuning.
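
To make the core, boundary, and noise labels concrete, here is a minimal sketch on a toy data set. The function name `classify_points` and its parameters `eps` and `min_neighbors` are hypothetical and are not part of the class defined below. It assumes that a point is a core point when at least `min_neighbors` *other* points lie within distance `eps` of it.

```python
import numpy as np

def classify_points(X, eps, min_neighbors):
    """Label each row of X as 'core', 'border', or 'noise'.

    Assumption: a point is core when at least `min_neighbors` OTHER points
    lie within distance `eps` of it (the point itself is not counted).
    """
    X = np.asarray(X, dtype=float)
    # Pairwise Euclidean distances via broadcasting, shape (n, n)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    within = dist <= eps
    n_neighbors = within.sum(axis=1) - 1  # subtract the point itself
    is_core = n_neighbors >= min_neighbors
    # A border (boundary) point is not core but has a core point in its neighborhood
    near_core = (within & is_core[None, :]).any(axis=1)
    is_border = ~is_core & near_core
    labels = np.full(len(X), "noise", dtype=object)
    labels[is_border] = "border"
    labels[is_core] = "core"
    return labels

# Four points on a line plus one isolated point
points = [[0, 0], [1, 0], [2, 0], [3, 0], [10, 10]]
labels = classify_points(points, eps=1.5, min_neighbors=2)
print(list(labels))  # ['border', 'core', 'core', 'border', 'noise']
```

The two end points of the line have only one neighbor each, so they are not core points, but each sits inside the neighborhood of a core point. The isolated point has no neighbors at all and is labeled noise.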

### The Algorithm

In [50]:
import pandas as pd
import numpy as np
from scipy.spatial.distance import pdist, squareform

class DBSCAN(object):
    '''This is a DBSCAN clustering algorithm. It uses Euclidean distance.'''

    def __init__(self, radius, noiseNumber):
        '''Initialize hyperparameters:

        radius: The radius (epsilon) of the neighborhood around each point.
        noiseNumber: The minimum number of other points that must lie in a
            point's neighborhood for the point not to be filtered out as noise.
        '''
        self.radius = radius
        self.n = noiseNumber
        
    def fit(self, DataFrame):
        '''Filters low-density points out of a pandas DataFrame and returns the dense points. Scale before fitting.'''

        # Create a distance matrix of the data
        dist_matrix = squareform(pdist(DataFrame, metric='euclidean'))

        # Count the other points within `radius` of each point
        within_radius = dist_matrix <= self.radius
        neighborhood_density = np.count_nonzero(within_radius, axis=0) - 1  # exclude the point itself

        # Split the data into dense points and low-density (noise) points
        is_dense = neighborhood_density >= self.n
        df_dense_points = DataFrame[is_dense]
        dense_dist = dist_matrix[is_dense]  # distance-matrix rows of the dense points (currently unused)
        self.noise_points = DataFrame[~is_dense]

        return df_dense_points

In [51]:
df = pd.DataFrame({'a': [1,2,3,4,5,11,12,13,14,15,21,22,23,24,25]})
df['b'] = df['a'] - 5

In [52]:
dbscan = DBSCAN(3, 3)

In [53]:
dbscan.fit(df)

|    |  a |  b |
|---:|---:|---:|
|  1 |  2 | -3 |
|  2 |  3 | -2 |
|  3 |  4 | -1 |
|  6 | 12 |  7 |
|  7 | 13 |  8 |
|  8 | 14 |  9 |
| 11 | 22 | 17 |
| 12 | 23 | 18 |
| 13 | 24 | 19 |
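
The `fit` method above returns only the dense points and does not yet group them into clusters. One way that remaining step might be sketched is to treat two dense points as connected when they lie within the radius of each other, and to give every connected group its own label. The helper `label_dense_points` and its parameter `eps` are hypothetical names introduced for illustration. This is a sketch under those assumptions, not the finished algorithm: a full DBSCAN would also attach boundary points to the cluster of a neighboring core point.

```python
import numpy as np

def label_dense_points(X_dense, eps):
    """Give each dense point a cluster id by traversing the graph in which
    two points are adjacent when their Euclidean distance is at most `eps`."""
    X = np.asarray(X_dense, dtype=float)
    # Pairwise Euclidean distances via broadcasting, shape (n, n)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    adjacent = dist <= eps
    labels = np.full(len(X), -1)  # -1 means "not yet assigned"
    cluster_id = 0
    for start in range(len(X)):
        if labels[start] != -1:
            continue  # already reached from an earlier starting point
        labels[start] = cluster_id
        stack = [start]
        while stack:
            i = stack.pop()
            # Unassigned points adjacent to point i join the current cluster
            for j in np.flatnonzero(adjacent[i] & (labels == -1)):
                labels[j] = cluster_id
                stack.append(j)
        cluster_id += 1
    return labels

# The nine dense points returned by dbscan.fit(df) above, as (a, b) pairs
dense = np.array([[2, -3], [3, -2], [4, -1],
                  [12, 7], [13, 8], [14, 9],
                  [22, 17], [23, 18], [24, 19]])
cluster_labels = label_dense_points(dense, eps=3)
print(cluster_labels.tolist())  # [0, 0, 0, 1, 1, 1, 2, 2, 2]
```

With the same radius of 3 used in the example, the nine dense points fall into three groups of three, matching the three runs of values in the toy data.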
