# DSCI 6003 7.3 Introduction to Proximity Clustering

## By the End of This Lesson You Will
1. Be able to describe in your own words what proximity clustering is and what DBSCAN does.
2. Be able to describe verbally how the DBSCAN algorithm works.
3. Be able to identify basic circumstances where DBSCAN (or other proximity clustering algorithms) would be used.

## Proximity Clustering: A basic introduction

Proximity clustering is less of an algorithm and more of an entire field, worthy of a four-week survey. It's more important that you understand why it exists and how to reason about it. Since the publication of DBSCAN, something like 15 established variants and relatives of the original algorithm have appeared, e.g. AUTODBSCAN, VDBSCAN, PDBSCAN, DENCLUE, DBSCAN-DLP, GDBSCAN, OPTICS, and so on. Perhaps 1000+ other publications have attempted improvements. Research on these algorithms continues at a rapid pace even today.

### Intuition:

The proximity methods **deliberately eliminate any presumption of underlying distributions.** This makes them more of a relative of K-means than of probabilistic methods. However, unlike K-means, proximity methods do not operate under the assumption of clean decision boundaries between clusters. Because K-means partitions the space with a Voronoi tessellation, it prefers roughly spherical clusters, and nonlinear boundaries between clusters are difficult for it to capture.

Proximity clustering allows the **clustered data to describe the clusters**. It does this with two parameters, $\epsilon$ and minPoints. Unfortunately, it turns out that these end up being quite a restriction on the applicability of the method. 

### Hypothesis:



We can't make any assumptions about the clusters present, or their density. There could be no clusters, or many.
We describe the relationships in terms of *neighborhoods*.

![dbscan_neighborhood](./images/DBSCAN_neighborhood.png)


A neighborhood is defined as:

1. Having a minimum number of points, minPoints, within it.
2. Every point that belongs to that neighborhood lies no more than $\epsilon$ from the point at its center.
3. If a point is within another point's neighborhood and that neighborhood is in a cluster, the point is labeled as a member of that cluster.
4. Points that don't belong to a cluster are labeled as noise.
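The first two rules can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the function names `region_query` and `is_dense_neighborhood`, the toy points, and the use of Euclidean distance are all choices made for this example, not part of any particular library.

```python
import math

def region_query(points, center_index, eps):
    """Return the indices of all points within eps of the chosen center.

    The center itself is always included, because its distance to itself is 0.
    """
    center = points[center_index]
    return [i for i, p in enumerate(points) if math.dist(center, p) <= eps]

def is_dense_neighborhood(points, center_index, eps, min_points):
    """True if the eps-neighborhood of the center holds at least min_points points."""
    return len(region_query(points, center_index, eps)) >= min_points

# Hypothetical toy data: three points close together and one far away.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.5), (3.0, 3.0)]

neighbors_of_first = region_query(points, 0, eps=1.0)
dense_first = is_dense_neighborhood(points, 0, eps=1.0, min_points=3)
dense_last = is_dense_neighborhood(points, 3, eps=1.0, min_points=3)
```

Here the neighborhood of the first point contains the three nearby points, so it meets minPoints = 3, while the isolated point's neighborhood contains only itself.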

You will also see neighborhoods described in terms of their density. This is a reflection of the distance $\epsilon$ and the minPoints number of points that belong to the neighborhood. Any cluster's minimum density is defined in terms of these two parameters. Because the volume of a neighborhood grows as $\epsilon^d$ in $d$ dimensions:

$$\rho_{min} \propto \frac{minPoints}{\epsilon^{d}}$$

There are two types of relationships worth describing here: *density connectedness* and *density reachability*. If two points are density connected, they fall within each other's neighborhood. In the above diagram, the points *A* are density connected. The points *B* and *C* are density *reachable* but **not** density *connected* to points *A*. They will still be part of the *A* cluster, but these points will not become new centers of the same cluster themselves.
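The difference between points like *A* and points like *B* and *C* can be made concrete with a small sketch. The labels `"core"` and `"border"` are common DBSCAN terminology rather than terms used above: a core point plays the role of an *A* point (its own neighborhood is dense), and a border point plays the role of *B* or *C* (it sits inside a dense neighborhood but does not have one of its own). The function name and the toy data are assumptions made for this example.

```python
import math

def classify_points(points, eps, min_points):
    """Label each point as 'core', 'border', or 'noise'."""
    # Precompute every point's eps-neighborhood (each includes the point itself).
    neighborhoods = [
        [j for j, q in enumerate(points) if math.dist(p, q) <= eps]
        for p in points
    ]
    # Core points are those whose own neighborhood is dense enough.
    core = {i for i, nbrs in enumerate(neighborhoods) if len(nbrs) >= min_points}

    labels = []
    for i, nbrs in enumerate(neighborhoods):
        if i in core:
            labels.append("core")
        elif any(j in core for j in nbrs):
            # Reachable from a core point, but unable to extend the cluster.
            labels.append("border")
        else:
            labels.append("noise")
    return labels

# Hypothetical toy data: a tight square of four points, one point hanging
# off its edge, and one isolated point.
points = [(0.0, 0.0), (0.4, 0.0), (0.0, 0.4), (0.4, 0.4), (1.2, 0.4), (5.0, 5.0)]
labels = classify_points(points, eps=1.0, min_points=4)
```

With these values the four points of the square are core points, the point at (1.2, 0.4) is a border point because only three points (including itself) fall within $\epsilon$ of it, and the isolated point is noise.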

## QUIZ:  

What would happen if we added an additional group of points within $\epsilon$ of *B* but not of *A*?


### Cost:

There are no cost functions to minimize here. The algorithm proceeds until it can't change the labels on any of the points.

### Optimization:

There is no function to optimize, but we can think of the clustering process as reaching an optimum when it can't change the labels on points any further.

There are numerous ways to define neighborhoods and distances. The most sophisticated algorithms carefully divide the data into regions of similar density and provide multiple ways to define a cluster.
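As a small illustration of how the choice of distance changes a neighborhood, the sketch below compares Euclidean and Manhattan distance on the same points. The function names and the data are assumptions chosen for this example; nothing here is specific to any one proximity clustering algorithm.

```python
def euclidean(p, q):
    """Straight-line distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def manhattan(p, q):
    """Sum of absolute coordinate differences (city-block distance)."""
    return sum(abs(a - b) for a, b in zip(p, q))

def neighborhood(points, center, eps, distance):
    """All points whose distance to center is at most eps under the given metric."""
    return [p for p in points if distance(center, p) <= eps]

# Hypothetical toy data.
points = [(0.0, 0.0), (0.8, 0.0), (0.7, 0.7), (1.5, 0.0)]
center = (0.0, 0.0)

euclid_nbrs = neighborhood(points, center, 1.0, euclidean)
manhattan_nbrs = neighborhood(points, center, 1.0, manhattan)
```

The point (0.7, 0.7) is about 0.99 away under Euclidean distance but 1.4 away under Manhattan distance, so it belongs to the neighborhood in the first case and not in the second. The same $\epsilon$ can therefore produce different clusters depending on the metric.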

Really, that's it!

### Reasoning: 

We are looking for ways to define clusters of unknown shape and size. 

## QUIZ:
Is there an upper limit on cluster density? How do you think variable density will affect the clustering process?

# DBSCAN

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is the original version of the algorithm and the only one we expect you to learn in detail. It was designed to be fast and to perform clustering on large spatial databases. One of the reasons it's so fast is that it's so simple.


### Algorithm:

    For each unvisited point P in the dataset (picked in arbitrary order):
        Mark P as visited.
        Find the neighborhood of P: all points within eps of P, including P itself.
        If the neighborhood contains fewer than minPoints points:
            Label P as noise for now (a later cluster may still claim it).
        Otherwise:
            Start a new cluster containing P and every point in its neighborhood.
            Expand:
                For each point Q added to the cluster, mark Q as visited and find its eps neighborhood.
                If that neighborhood contains at least minPoints points, its points are also cluster members.

If a point has been visited by the algorithm and doesn't belong to a cluster when the algorithm finishes, it is marked as noise.
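The steps above can be sketched in plain Python. This is a minimal, unoptimized illustration rather than a reference implementation: it uses a brute-force neighbor search with Euclidean distance, a queue in place of recursion, and the convention (an assumption of this sketch) that noise is labeled `-1` and clusters are numbered from 0. The function name `dbscan` and the toy data are chosen for this example.

```python
import math
from collections import deque

NOISE = -1  # label used in this sketch for points that end up in no cluster

def dbscan(points, eps, min_points):
    """Return one label per point: 0, 1, ... for clusters, NOISE for noise."""
    n = len(points)
    labels = [None] * n  # None means "not yet visited"

    def region_query(i):
        # Brute-force eps-neighborhood; includes point i itself.
        return [j for j in range(n) if math.dist(points[i], points[j]) <= eps]

    cluster_id = -1
    for i in range(n):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(i)
        if len(neighbors) < min_points:
            labels[i] = NOISE  # may be relabeled later if a cluster reaches it
            continue

        # Point i has a dense neighborhood: start a new cluster and expand it.
        cluster_id += 1
        labels[i] = cluster_id
        queue = deque(neighbors)
        while queue:
            j = queue.popleft()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # previously noise, now a cluster member
            if labels[j] is not None:
                continue  # already assigned; do not expand from it again
            labels[j] = cluster_id
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_points:
                queue.extend(j_neighbors)  # j's neighborhood is dense: keep growing
    return labels

# Hypothetical toy data: two tight squares of points and one isolated point.
points = [
    (0.0, 0.0), (0.0, 0.5), (0.5, 0.0), (0.5, 0.5),
    (5.0, 5.0), (5.0, 5.5), (5.5, 5.0), (5.5, 5.5),
    (10.0, 0.0),
]
labels = dbscan(points, eps=1.0, min_points=3)
```

On this data the two squares become clusters 0 and 1 and the isolated point is labeled as noise. The brute-force neighbor search makes the sketch quadratic in the number of points; practical implementations typically use a spatial index to speed up the neighborhood queries.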



### Demonstration:

Here's a great demonstration page for both DBSCAN and K-means:

http://www.naftaliharris.com/blog/visualizing-dbscan-clustering/

I encourage you to work with this after class.

