# Clustering 

## 1. DBSCAN
Using DBSCAN, iterate (with for-loops) over different values of `min_samples` (1 to 10) and `epsilon` (0.05 to 0.5, in steps of 0.01) to find clusters in the road data used in the Lesson, and calculate the Silhouette Coefficient for each combination of `min_samples` and `epsilon`. Draw **_one_** line plot containing a line for each `min_samples` value. Use a 2D array to store the Silhouette Coefficient values: one dimension represents `min_samples`, the other represents `epsilon`.

The expected result is a plot of `sil_score` against `epsilon`.

In [1]:
import pandas as pd
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn
# from mpl_toolkits.mplot3d import Axes3D
plt.rcParams['font.size'] = 14
from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.preprocessing import StandardScaler

In [2]:
# Read Data in
Q1X = pd.read_csv('../data/3D_spatial_network.txt.gz', header=None, names=['osm', 'lat','lon','alt'])
Q1X = Q1X.drop(['osm'], axis=1).sample(10000)
Q1X.head()

Unnamed: 0,lat,lon,alt
312987,10.312665,56.709469,1.201847
101829,9.597935,57.260385,7.257621
244312,9.879426,57.005578,6.167099
109890,10.042743,57.112234,21.007275
322594,8.758444,56.896737,28.646659


In [3]:
# Scale Data
scaler = StandardScaler()
Q1X_scaled = scaler.fit_transform(Q1X)

In [4]:
Q1X_scaled

array([[ 0.9215047 , -1.30152792, -1.13065087],
       [-0.21690984,  0.60179142, -0.80288404],
       [ 0.231447  , -0.27852313, -0.86190819],
       ...,
       [ 0.83583254, -1.56896783, -1.06423671],
       [ 0.56297863, -0.08720688, -1.14139442],
       [ 1.1831374 ,  1.35627252, -0.84007134]])

In [5]:
# convert scaled data to dataframe
Q1X_scaled = pd.DataFrame(Q1X_scaled, columns=['lat', 'lon', 'alt'])
Q1X_scaled

Unnamed: 0,lat,lon,alt
0,0.921505,-1.301528,-1.130651
1,-0.216910,0.601791,-0.802884
2,0.231447,-0.278523,-0.861908
3,0.491576,0.089955,-0.058688
4,-1.554043,-0.654551,0.354791
...,...,...,...
9995,2.224241,0.745839,-1.146461
9996,-1.489962,-0.303063,-1.102406
9997,0.835833,-1.568968,-1.064237
9998,0.562979,-0.087207,-1.141394


In [6]:
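min_samples = np.arange(1, 11)
# 46 evenly spaced values from 0.05 to 0.50 (step 0.01); linspace avoids float drift at the endpoint
epsilons = np.round(np.linspace(0.05, 0.50, 46), 2)
# Use the same column order for fitting and scoring
features = Q1X_scaled[['lat', 'lon', 'alt']]

# 2D array of silhouette scores: rows are min_samples, columns are epsilon
all_scores = np.zeros((len(min_samples), len(epsilons)))
for i, min_sample in enumerate(min_samples):
    for j, epsilon in enumerate(epsilons):
        dbscanQ1 = DBSCAN(eps=epsilon, min_samples=min_sample).fit(features)
        all_scores[i, j] = metrics.silhouette_score(features, dbscanQ1.labels_)

# Matching epsilon values for each row, so row i can be plotted as one line
all_eps_values = np.tile(epsilons, (len(min_samples), 1))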
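
One thing the grid search above does not handle: `silhouette_score` raises a `ValueError` when a labeling has fewer than 2 distinct labels (or as many labels as samples), which DBSCAN can produce for extreme parameter values. The sketch below is one possible way to guard against that and to read the best parameter pair out of the 2D array. The helper `safe_silhouette`, the synthetic `make_blobs` data, and the small parameter grids are all assumptions made for illustration; they are not part of the lesson's data or code.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def safe_silhouette(X, labels):
    """Hypothetical helper: silhouette score, or NaN where it is undefined."""
    n_labels = len(set(labels))
    # silhouette_score requires 2 <= n_labels <= n_samples - 1.
    if n_labels < 2 or n_labels > len(labels) - 1:
        return np.nan
    return silhouette_score(X, labels)


# Synthetic stand-in for the scaled road data (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=0)

min_samples_grid = np.arange(1, 4)
eps_grid = np.linspace(0.1, 0.5, 5)

# Rows index min_samples, columns index epsilon; NaN marks undefined scores.
sil_scores = np.full((len(min_samples_grid), len(eps_grid)), np.nan)
for i, m in enumerate(min_samples_grid):
    for j, eps in enumerate(eps_grid):
        labels = DBSCAN(eps=float(eps), min_samples=int(m)).fit_predict(X_demo)
        # Noise points (label -1) are treated by silhouette_score as one more group.
        sil_scores[i, j] = safe_silhouette(X_demo, labels)

# Locate the best-scoring pair while ignoring NaN entries.
best_i, best_j = np.unravel_index(np.nanargmax(sil_scores), sil_scores.shape)
best_min_samples = int(min_samples_grid[best_i])
best_eps = float(eps_grid[best_j])
```

Storing NaN rather than skipping the entry keeps the array rectangular, and matplotlib leaves a gap in the line at NaN values instead of failing.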

In [90]:
plt.figure()
plt.xlabel('Epsilon Value')
plt.ylabel('Silhouette Score')
plt.plot(all_eps_values[0], all_scores[0])
plt.plot(all_eps_values[1], all_scores[1])
plt.plot(all_eps_values[2], all_scores[2])
plt.plot(all_eps_values[3], all_scores[3])
plt.plot(all_eps_values[4], all_scores[4])
plt.plot(all_eps_values[5], all_scores[5])
plt.plot(all_eps_values[6], all_scores[6])
plt.plot(all_eps_values[7], all_scores[7])
plt.plot(all_eps_values[8], all_scores[8])
plt.plot(all_eps_values[9], all_scores[9])





In [89]:
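plt.figure()
plt.xlabel('Epsilon Value')
plt.ylabel('Silhouette Score')
# One line per min_samples value (all 10 rows of scores, not just the first 9)
for i in range(len(all_scores)):
    plt.plot(all_eps_values[i], all_scores[i], label=f'min_samples={min_samples[i]}')
plt.legend(fontsize=8)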


## 2. Clustering your own data
Using your own data, find relevant clusters/groups within your data (repeat the above). If your data is labeled with a class that you are attempting to predict, be sure not to use it in training and clustering.

You may use the labels to compare with predictions to show how well the clustering performed using one of the clustering metrics (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). 
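
As a rough illustration of comparing cluster assignments with known labels, the sketch below uses the adjusted Rand index (`adjusted_rand_score`), one of the metrics listed on that page. The `make_blobs` data and the DBSCAN parameters are assumptions chosen only to make the example run; substitute your own features and labels.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

# Synthetic labeled data standing in for your own dataset (assumption).
X, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=42)

# The labels are held out: only the features are scaled and clustered.
X_scaled = StandardScaler().fit_transform(X)
pred = DBSCAN(eps=0.3, min_samples=5).fit_predict(X_scaled)

# The adjusted Rand index ignores how clusters are numbered, so the
# cluster ids do not need to match the class ids.
ari = adjusted_rand_score(y_true, pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

A score of 1.0 means the clustering reproduces the labels exactly (up to renaming), while values near 0 indicate agreement no better than chance.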

If you don't have labels, use the silhouette coefficient to show performance. Find the optimal fit for your data, but you don't need to be as exhaustive as above.

Additionally, show the clusters in 2D or 3D plots. 
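
One way to draw such a plot in 2D is to scatter each cluster label separately so that every cluster gets its own color and legend entry. This is a minimal sketch: the helper `plot_clusters_2d`, the synthetic data, and the DBSCAN parameters are assumptions for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs


def plot_clusters_2d(ax, X, labels):
    """Hypothetical helper: scatter the first two columns of X, one color per label."""
    for label in np.unique(labels):
        mask = labels == label
        # DBSCAN marks noise points with the label -1.
        name = "noise" if label == -1 else f"cluster {label}"
        ax.scatter(X[mask, 0], X[mask, 1], s=10, label=name)
    ax.set_xlabel("feature 0")
    ax.set_ylabel("feature 1")
    ax.legend()
    return ax


# Synthetic 2D data standing in for your own features (assumption).
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.4, random_state=1)
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

fig, ax = plt.subplots()
plot_clusters_2d(ax, X_demo, labels)
```

For data with more than two columns, the same helper can be applied to any pair of columns, or to the first two components after a dimensionality reduction step.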

As a bonus, try using PCA first to condense your data from N columns to fewer than N.
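
A minimal sketch of the PCA step, assuming a 6-column numeric dataset reduced to 2 components: the random data and the choice of 2 components are placeholders, not a recommendation for your data.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Random 6-column data standing in for your own features (assumption).
rng = np.random.default_rng(0)
X_wide = rng.normal(size=(200, 6))

# Scale first so that no single column dominates the principal components.
X_scaled = StandardScaler().fit_transform(X_wide)

# Project from N = 6 columns down to 2.
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

# Fraction of the total variance kept by the retained components.
explained = pca.explained_variance_ratio_.sum()
print(X_reduced.shape, round(float(explained), 3))
```

The reduced array can then be passed to the clustering step in place of the original columns, and with 2 components it can be plotted directly.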

Two items are expected: 
- Metric Evaluation Plot (like in 1.)
- Plots of the clustered data