<img src="http://imgur.com/1ZcRyrc.png" style="float: left; margin: 20px; height: 55px">

# DBSCAN

_Authors: Matt Brems (DC), Riley Dallas (AUS)_

In [None]:
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import silhouette_score
from sklearn.datasets import load_iris, load_wine
from sklearn.cluster import DBSCAN
import seaborn as sns
import matplotlib.pyplot as plt

%matplotlib inline

## Where `DBSCAN` shines
---

`DBSCAN` performs best when the clusters in your dataset are clearly separated. The iris dataset (`load_iris`) is a good example of this, because one of the species (setosa) is an island unto itself.

**In the cell below, load the iris dataset into a `pandas` DataFrame. Ignore the species.**
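One possible way to do this is sketched below. The names `iris` and `df` are our own choices, not something the lesson requires; the target (species) is simply left out of the DataFrame.

```python
import pandas as pd
from sklearn.datasets import load_iris

# load_iris() returns a Bunch with .data (features) and .target (species)
iris = load_iris()

# Keep only the features; the species labels are intentionally ignored
df = pd.DataFrame(iris.data, columns=iris.feature_names)

print(df.head())
```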

## Preprocessing: `StandardScaler`
---

Because clustering models are based on distance, we don't want the magnitude of our features to affect the algorithm. Therefore, when clustering **you should always scale your data**.

Create `X_scaled` using an instance of `StandardScaler` in the cell below.
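A minimal sketch of the scaling step, assuming the iris features live in a DataFrame called `df` (a name we picked for illustration):

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)

# fit_transform learns each column's mean and standard deviation,
# then rescales every column to mean 0 and unit variance
ss = StandardScaler()
X_scaled = ss.fit_transform(df)

print(X_scaled.mean(axis=0).round(3))
print(X_scaled.std(axis=0).round(3))
```

Note that `X_scaled` is a NumPy array rather than a DataFrame, so the column names are not carried over.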

## `DBSCAN`
---

Fit an instance of `DBSCAN` to `X_scaled`. Use the default parameters for now (we'll tune them in a minute).
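As a rough sketch, fitting with the defaults might look like the following. In `sklearn`, the defaults are `eps=0.5` and `min_samples=5`, and any point that `DBSCAN` treats as noise gets the label `-1`. The variable name `dbscan` matches the one used later in this notebook; `df` is our own assumption.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X_scaled = StandardScaler().fit_transform(df)

# Default hyperparameters: eps=0.5, min_samples=5
dbscan = DBSCAN()
dbscan.fit(X_scaled)

# One label per row; -1 marks points flagged as noise
print(set(dbscan.labels_))
```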

## Model Evaluation: Silhouette score
---

Recall the formula for Silhouette score:

### $s_i = \frac{b_i - a_i}{\max\{a_i, b_i\}}$

Where:
- $a_i$ = Cohesion: Average distance from point $x_i$ to all other points in its own cluster.
- $b_i$ = Separation: Average distance from point $x_i$ to all points in the nearest neighboring cluster.

In the cell below, use the `silhouette_score` function from `sklearn` to evaluate our `DBSCAN` model.
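Here is one way this could look, assuming the default `DBSCAN` fit on the scaled iris data. One caveat worth keeping in mind: `silhouette_score` does not treat the noise label `-1` specially, so noise points are scored as if they formed their own cluster.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_iris
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

iris = load_iris()
df = pd.DataFrame(iris.data, columns=iris.feature_names)
X_scaled = StandardScaler().fit_transform(df)

dbscan = DBSCAN()
dbscan.fit(X_scaled)

# Mean silhouette coefficient over all points; ranges from -1 to 1
score = silhouette_score(X_scaled, dbscan.labels_)
print(score)
```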

## EDA: `pairplot`
---

Now let's view our clusters using `seaborn`'s `pairplot` function.

1. First, you'll need to assign the clusters (`dbscan.labels_`) to your original DataFrame.
2. Then you'll create a `pairplot` using the `cluster` column as the hue.

In [None]:
# Create cluster column

In [None]:
# Pairplot

## Where `DBSCAN` does poorly
---

`DBSCAN` is dependent on two things:

1. Consistent density (one `eps` to rule them all)
2. Clear separation of the clusters within your dataset

The `load_wine` dataset is more or less clumped together, which makes it a great dataset for exposing one of the weaknesses of `DBSCAN`: no clear separation.

In the cell below, load the wine dataset into a `pandas` DataFrame. Ignore the target.
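Loading the wine data follows the same pattern as iris. A minimal sketch, with `wine` and `df` as assumed names:

```python
import pandas as pd
from sklearn.datasets import load_wine

wine = load_wine()

# Features only; the target (cultivar) is left out on purpose
df = pd.DataFrame(wine.data, columns=wine.feature_names)

print(df.shape)
```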

## Preprocessing: `StandardScaler`
---

Because clustering models are based on distance, we don't want the magnitude of our features to affect the algorithm. Therefore, when clustering **you should always scale your data**.

Create `X_scaled` using an instance of `StandardScaler` in the cell below.

## `DBSCAN`
---

Fit an instance of `DBSCAN` to `X_scaled`. Finding the right values for `eps` and `min_samples` can take a while, so to save time we'll use the following parameters:

- 2.3 for `eps`
- 4 for `min_samples`
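With those parameters, the fit might look like this sketch (again assuming the wine features are in a DataFrame we call `df`):

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
X_scaled = StandardScaler().fit_transform(df)

# Parameters suggested above: eps=2.3, min_samples=4
dbscan = DBSCAN(eps=2.3, min_samples=4)
dbscan.fit(X_scaled)

# Distinct labels found; -1 (if present) marks noise
print(sorted(set(dbscan.labels_)))
```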

## Model evaluation
---

Calculate the silhouette score for our instance of `DBSCAN` in the cell below.

## EDA
---

It's not practical to create a `pairplot` because the wine dataset has 13 features, which would produce a 13 × 13 grid of plots. We'll try some different techniques in a bit.

In the cell below, create a `cluster` column using `dbscan.labels_`.
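A possible version of this step, followed by a count of how many wines landed in each cluster. The names `df` and `counts` are assumptions for illustration.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
X_scaled = StandardScaler().fit_transform(df)
dbscan = DBSCAN(eps=2.3, min_samples=4).fit(X_scaled)

# Attach the labels to the unscaled DataFrame
df["cluster"] = dbscan.labels_

# Number of rows per cluster; -1 (if present) is the noise group
counts = df["cluster"].value_counts()
print(counts)
```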

In [None]:
# Create cluster column

In [None]:
# Value counts for each cluster

## Exploring each cluster
---

Clustering is sort of backwards: We fit a model, **then** we do EDA on each cluster. You can go one of two routes:

1. Break each cluster into its own DataFrame
2. Use `.groupby()` extensively

In the cell below, use `.groupby()` in conjunction with `.mean()` and see if you spot any defining characteristics.
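One way to approach the second route is sketched below, assuming the same `eps=2.3` and `min_samples=4` fit as before. The name `cluster_means` is hypothetical.

```python
import pandas as pd
from sklearn.cluster import DBSCAN
from sklearn.datasets import load_wine
from sklearn.preprocessing import StandardScaler

wine = load_wine()
df = pd.DataFrame(wine.data, columns=wine.feature_names)
X_scaled = StandardScaler().fit_transform(df)
dbscan = DBSCAN(eps=2.3, min_samples=4).fit(X_scaled)
df["cluster"] = dbscan.labels_

# One row per cluster, one column per feature (in original units)
cluster_means = df.groupby("cluster").mean()

# Transposing puts features on the rows, which is easier to scan
print(cluster_means.T)
```

Features whose means differ sharply between rows are the natural candidates for "defining characteristics" of a cluster.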