# Clustering 

## 1. DBSCAN
Using DBSCAN, iterate (with a for-loop) through different values of `min_samples` (1 to 10) and `epsilon` (0.05 to 0.5, in steps of 0.01) to find clusters in the road data used in the Lesson, and calculate the Silhouette Coefficient for each combination of `min_samples` and `epsilon`. Plot **_one_** line plot containing one line per `min_samples` value, drawn across the `epsilon` values. Use a 2D array to store the Silhouette Coefficient values: one dimension represents `min_samples`, the other represents `epsilon`.

Expecting a plot of `epsilon` vs `sil_score`.

In [1]:
# Loading packages 

import pandas as pd
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn
from mpl_toolkits.mplot3d import Axes3D
plt.rcParams['font.size'] = 14
# plt.rcParams['figure.figsize'] = (20.0, 10.0)
from sklearn.cluster import DBSCAN
from sklearn.metrics import silhouette_score

In [None]:
# Reading in road data 
X = pd.read_csv('3D_spatial_network.txt.gz', header=None, names=['osm', 'lat','lon','alt'])
X = X.drop(['osm'], axis=1).sample(10000)
X.head()

In [None]:
# Setting up ranges for min_samples and epsilon
min_samples = np.arange(1,11,1)
epsilons = np.arange(0.05, 0.51, 0.01)

In [None]:
all_scores = []
for min_sample in min_samples:
    scores = []
    for epsilon in epsilons:
        
        # Applying DBSCAN
        dbscan = DBSCAN(eps=epsilon, min_samples=min_sample)
        labels = dbscan.fit_predict(X)

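        # Calculate the silhouette score; it is only defined for
        # 2 to n_samples - 1 distinct labels, so fall back to -1 otherwise
        if 1 < len(set(labels)) < len(X):
            score = silhouette_score(X, labels)
        else:
            score = -1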
        
        scores.append(score)
        
    all_scores.append(scores)

In [None]:
# Setting up ranges for min_samples and epsilon
min_samples = np.arange(1, 11, 1)
epsilons = np.arange(0.05, 0.51, 0.01)

# Initializing a 2D array to store the silhouette scores
silhouette_scores = np.zeros((len(min_samples), len(epsilons)))

# Iterating through the values of min_samples and epsilon
for i, min_sample in enumerate(min_samples):
    for j, epsilon in enumerate(epsilons):
        # Applying DBSCAN
        dbscan = DBSCAN(eps=epsilon, min_samples=min_sample)
        labels = dbscan.fit_predict(X)
        
        # The silhouette score is only defined for 2 to n_samples - 1 distinct labels
        # (DBSCAN's noise label, -1, counts as a label here)
        n_labels = len(set(labels))
        if 1 < n_labels < len(X):
            # Calculating the Silhouette Coefficient
            score = silhouette_score(X, labels)
        else:
            score = -1  # Sentinel for an undefined score (the lowest possible silhouette value)
        
        # Storing the score in the array
        silhouette_scores[i, j] = score

# Plotting the results
plt.figure(figsize=(12, 8))
for i, min_sample in enumerate(min_samples):
    plt.plot(epsilons, silhouette_scores[i], label=f'min_samples={min_sample}')

plt.xlabel('Epsilon')
plt.ylabel('Silhouette Coefficient')
plt.title('Silhouette Coefficient for different values of min_samples and epsilon')
plt.legend()
plt.show()

## 2. Clustering your own data
Using your own data, find relevant clusters/groups within your data (repeat the above). If your data is labeled with a class that you are attempting to predict, be sure not to use that label in training or clustering.

You may use the labels to compare with predictions to show how well the clustering performed using one of the clustering metrics (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). 
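A minimal sketch of scoring cluster assignments against known labels, assuming synthetic data from `make_blobs` stands in for your own dataset. The names `X_demo`, `y_true`, and `pred` are illustrative, and the adjusted Rand index is only one of the metrics listed on the linked page. Note that `y_true` is used for scoring only and is never passed to the clustering step.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic stand-in for a labeled dataset (assumption, not the road data)
X_demo, y_true = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)

# Cluster on the features only; the true labels are never passed to fit
pred = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# Adjusted Rand index: 1.0 is a perfect match, values near 0 mean chance-level agreement
ari = adjusted_rand_score(y_true, pred)
print(f"Adjusted Rand index: {ari:.3f}")
```

The `eps` and `min_samples` values here are placeholders; they would need tuning for real data.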

If you don't have labels, use the silhouette coefficient to show performance. Find the optimal fit for your data, but you don't need to be as exhaustive as above.
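One way to read the best parameter pair off a score grid, sketched with a small made-up array in place of real silhouette scores. Rows correspond to `min_samples` and columns to `epsilon`, following the 2D layout requested in Part 1. The names `demo_scores`, `demo_min_samples`, and `demo_epsilons` are hypothetical.

```python
import numpy as np

demo_min_samples = np.array([3, 5, 7])
demo_epsilons = np.array([0.1, 0.2, 0.3, 0.4])

# Made-up scores for illustration only; -1 marks clusterings with no valid score
demo_scores = np.array([
    [-1.0, 0.10, 0.25, 0.20],
    [-1.0, 0.15, 0.40, 0.30],
    [-1.0, 0.05, 0.35, 0.28],
])

# argmax returns a flat index; unravel_index converts it to (row, column)
i, j = np.unravel_index(np.argmax(demo_scores), demo_scores.shape)
best_min_samples = demo_min_samples[i]
best_epsilon = demo_epsilons[j]
print(best_min_samples, best_epsilon, demo_scores[i, j])
```

A high silhouette score alone does not guarantee a useful clustering, so it is worth checking how many clusters and noise points the chosen pair produces.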

Additionally, show the clusters in 2D or 3D plots. 
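A possible 2D cluster plot, again assuming synthetic two-column data in place of your own. The names `X_demo` and `demo_labels` are illustrative. Points are colored by cluster label, with DBSCAN's noise points carrying the label -1.

```python
import matplotlib.pyplot as plt
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Synthetic two-column data standing in for your own dataset (assumption)
X_demo, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.5, random_state=0)
demo_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_demo)

# Scatter plot with one color per cluster label
fig, ax = plt.subplots(figsize=(8, 6))
points = ax.scatter(X_demo[:, 0], X_demo[:, 1], c=demo_labels, cmap='viridis', s=15)
ax.set_xlabel('Feature 1')
ax.set_ylabel('Feature 2')
ax.set_title('DBSCAN clusters (label -1 is noise)')
fig.colorbar(points, ax=ax, label='Cluster label')
```

For three columns, the same call works on an axis created with `fig.add_subplot(projection='3d')`, passing the third column as an extra positional argument to `scatter`.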

As a bonus, try using PCA first to condense your data from N columns to fewer than N.
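A short sketch of the PCA step, assuming a synthetic six-column dataset and a reduction to two components. Both choices are arbitrary, and the names `X_wide`, `X_scaled`, and `X_reduced` are hypothetical. Standardizing first is a common precaution so that no single column dominates the variance.

```python
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Synthetic 6-column data standing in for your own dataset (assumption)
X_wide, _ = make_blobs(n_samples=300, n_features=6, centers=3, random_state=0)

# Put every column on the same scale before projecting
X_scaled = StandardScaler().fit_transform(X_wide)

# Project from 6 columns down to 2 principal components
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X_scaled)

print(X_reduced.shape)
print(pca.explained_variance_ratio_.sum())
```

`X_reduced` can then be passed to the clustering step in place of the original columns; the printed variance ratio indicates how much information the projection keeps.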

Two items are expected: 
- Metric Evaluation Plot (like in 1.)
- Plots of the clustered data