# 9 Clustering 

## 1. DBSCAN
Using DBSCAN, iterate (with for-loops) through different values of `min_samples` (1 to 10) and `epsilon` (.05 to .5, in steps of .01) to find clusters in the road data used in the Lesson, and calculate the silhouette coefficient for each combination of `min_samples` and `epsilon`. Plot **_one_** line plot with the multiple lines generated from the `min_samples` and `epsilon` values (for example, one line per `min_samples` value plotted against `epsilon`). Use a 2D array to store the silhouette coefficients: one dimension represents `min_samples`, the other represents `epsilon`.
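A minimal sketch of the 2D-array bookkeeping described above. It runs on synthetic blobs rather than the road data, and the names `points`, `min_samples_values`, `eps_values` and `sil` are illustrative assumptions, as are the small parameter ranges. Combinations where DBSCAN returns fewer than two distinct labels are left as NaN, because the silhouette coefficient is undefined there.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in for the road data (assumption: three blobs in 3D)
points, _ = make_blobs(n_samples=300, centers=3, n_features=3,
                       cluster_std=0.3, random_state=0)

min_samples_values = [2, 3, 4]                 # rows of the 2D array
eps_values = np.linspace(0.3, 0.9, 4)          # columns of the 2D array

# sil[i, j] holds the score for min_samples_values[i] and eps_values[j];
# NaN marks combinations where the silhouette is undefined.
sil = np.full((len(min_samples_values), len(eps_values)), np.nan)

for i, m in enumerate(min_samples_values):
    for j, eps in enumerate(eps_values):
        labels = DBSCAN(eps=float(eps), min_samples=m).fit_predict(points)
        n_labels = len(set(labels))
        # silhouette_score requires 2 <= n_labels <= n_samples - 1
        if 2 <= n_labels <= len(points) - 1:
            sil[i, j] = silhouette_score(points, labels)
```

With this layout, plotting `sil[i]` against `eps_values` for each `i` would give one line per `min_samples` value on a single set of axes.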

In [185]:
import pandas as pd
import numpy as np
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn
from mpl_toolkits.mplot3d import Axes3D
plt.rcParams['font.size'] = 14

In [186]:
X = pd.read_csv('../data/3D_spatial_network.txt.gz', header=None, names=['osm', 'lat','lon','alt'])
X = X.drop(['osm'], axis=1).sample(1000)
X.head()

| | lat | lon | alt |
|---|---|---|---|
| 17966 | 8.466112 | 56.977628 | 19.730818 |
| 278962 | 9.024107 | 57.101089 | 2.533939 |
| 416065 | 9.728677 | 57.000271 | 21.773259 |
| 31644 | 9.930156 | 57.568115 | 26.90933 |
| 307009 | 10.145402 | 57.441053 | 33.791876 |


In [187]:
XX = X.copy()
XX['alt'] = (X.alt - X.alt.mean())/X.alt.std()
XX['lat'] = (X.lat - X.lat.mean())/X.lat.std()
XX['lon'] = (X.lon - X.lon.mean())/X.lon.std()

In [194]:
from sklearn.cluster import DBSCAN
from sklearn import metrics

epsilons = np.arange(.05,.5,.01)   # .05 up to, but not including, .5
samples = range(2,7)               # min_samples values 2 through 6
scores = [[0,0,0]]                 # placeholder row, deleted in the next cell

# Each row of scores is [epsilon, min_samples, silhouette coefficient]
for E in epsilons:
    for M in samples:
        dbscan = DBSCAN(eps=E, min_samples = M)
        labels = dbscan.fit_predict(XX[['lon', 'lat', 'alt']])
        sil_score = (metrics.silhouette_score(XX[['lon', 'lat', 'alt']], labels))
        new_row = [[E, M, sil_score]]
        scores = np.concatenate((scores, new_row))

In [189]:
scores = np.delete(scores, (0), axis=0)

In [190]:
len(scores)

225

In [191]:
plt.figure()
# One line per min_samples value: silhouette coefficient as a function of epsilon
for M in np.unique(scores[:,1]):
    rows = scores[scores[:,1] == M]
    plt.plot(rows[:,0], rows[:,2], label='min_samples = %d' % M)

plt.xlabel('epsilon')
plt.ylabel('silhouette coefficient')
plt.legend(loc='upper right')
plt.draw()


In [192]:
plt.figure()
line1, = plt.plot(scores[:,1],scores[:,2],'r.', label='# of samples')
plt.xlabel('min_samples')
plt.ylabel('silhouette coefficient')


## 2. Clustering your own data
Using your own data, find relevant clusters/groups within your data. If your data is already labeled with a class that you are attempting to predict, be sure not to use that class in fitting/training/predicting.

You may use the labels to compare with predictions to show how well the clustering performed using one of the clustering metrics (http://scikit-learn.org/stable/modules/clustering.html#clustering-performance-evaluation). 
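As a rough illustration of comparing held-back labels with cluster predictions, the sketch below uses the adjusted Rand index, one of the metrics described at the link above. The data is synthetic, and `features`, `true_labels` and `predicted` are hypothetical names standing in for your own dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

# Synthetic labeled data (assumption: stands in for your own labeled dataset)
features, true_labels = make_blobs(n_samples=200,
                                   centers=[[0, 0], [5, 5], [-5, 5]],
                                   cluster_std=0.5, random_state=0)

# Fit on the features only; the labels are held back for evaluation
predicted = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(features)

# 1.0 means identical groupings (cluster ids may be permuted);
# values near 0 mean agreement no better than chance
ari = adjusted_rand_score(true_labels, predicted)
print(round(ari, 3))
```

The adjusted Rand index ignores how the cluster ids are numbered, so there is no need to match cluster 0 to class 0 by hand.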

If you don't have labels, use the silhouette coefficient to show performance. Find the optimal fit for your data, but you don't need to be as exhaustive as above.

Additionally, show the clusters in 2D and 3D plots. 

For a bonus, try using PCA first to condense your data from N columns to fewer than N.
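One possible shape for the PCA step, sketched on made-up data: five columns, two of which are noisy copies of others, reduced to three components. The array `data` and the choice of three components are assumptions for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical data: 200 rows, 5 columns; the last two columns are
# noisy copies of the first two, so they carry little new information
base = rng.normal(size=(200, 3))
data = np.column_stack([base,
                        base[:, 0] + 0.1 * rng.normal(size=200),
                        base[:, 1] + 0.1 * rng.normal(size=200)])

# Standardize first so no column dominates because of its units
scaled = StandardScaler().fit_transform(data)

pca = PCA(n_components=3)
reduced = pca.fit_transform(scaled)              # N=5 columns condensed to 3

# Fraction of the total variance kept by the 3 components
explained = pca.explained_variance_ratio_.sum()
```

The `reduced` array can then be passed to a clustering estimator in place of the original columns, and its first two or three columns plotted directly.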

Two items are expected: 
- Metric Evaluation Plot
- Plots of the clustered data

In [None]:
import pandas as pd
%matplotlib notebook
import matplotlib.pyplot as plt
import seaborn
from mpl_toolkits.mplot3d import Axes3D
from sklearn import preprocessing
from sklearn import metrics
from sklearn.cluster import DBSCAN
from sklearn.cluster import KMeans
plt.rcParams['font.size'] = 14

In [None]:
kiva = pd.read_csv("../data/kiva_loans.csv")
kiva = kiva.drop(['use','country_code','region','posted_time','disbursed_time','funded_time','tags',
                  'borrower_genders','date'], axis=1)
kiva.head()

In [None]:
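features = ['loan_amount','term_in_months','lender_count']
# fit_transform returns a NumPy array; wrap it so kiva1 stays a DataFrame
kiva1 = pd.DataFrame(preprocessing.MinMaxScaler().fit_transform(kiva[features]),
                     columns=features, index=kiva.index)

In [None]:
# from sklearn.preprocessing import StandardScaler
#scaler = StandardScaler()
#short_scaled = scaler.fit_transform(x)

In [None]:
KM = KMeans(n_clusters=7, random_state=1)
# The cluster column is used by the value_counts and plotting cells below
kiva1['cluster'] = KM.fit_predict(kiva1[features])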

In [None]:
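# Silhouette coefficient for KMeans over a range of cluster counts.
# The silhouette needs at least 2 clusters and is O(n^2), so score a sample.
from sklearn.preprocessing import StandardScaler
kiva1_sample = kiva[['loan_amount','term_in_months','lender_count']].sample(10000, random_state=1)
kiva1_scaled = StandardScaler().fit_transform(kiva1_sample)

ks = []
scores = []
for k in range(2,10):
    KM_k = KMeans(n_clusters = k, random_state=1)
    clusters = KM_k.fit_predict(kiva1_scaled)
    sil_score = metrics.silhouette_score(kiva1_scaled, clusters)
    scores.append(sil_score)
    ks.append(k)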

In [None]:
kiva1['cluster'].value_counts()

In [None]:
colors = ['red','blue','green','yellow','purple','orange', 'black']
kiva1['color'] = kiva1.cluster.apply(lambda x: colors[x])

In [None]:
fig = plt.figure()
# Axes3D(fig, ...) does not attach itself to the figure in recent matplotlib; add_subplot does
ax = fig.add_subplot(111, projection='3d')
ax.view_init(elev=48, azim=140)

ax.scatter(kiva1['loan_amount'], kiva1['term_in_months'], kiva1['lender_count'], c=kiva1.color, s=5)

ax.set_xlabel('loan amount')
ax.set_ylabel('term in months')
ax.set_zlabel('lender count')
plt.show()

In [None]:
fig = plt.figure()
plt.scatter(kiva1.loan_amount, kiva1.term_in_months, c=kiva1.cluster, s=5, cmap='Paired')

plt.xlabel('loan amount')
plt.ylabel('term in months')
plt.show()

In [None]:
kiva1 = kiva[['loan_amount','term_in_months','lender_count']]
kiva1_sample = kiva1.sample(10000)

In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
kiva1_scaled = scaler.fit_transform(kiva1_sample)

episilons = []
scores = []
for x in range(1,10):
    dbscan = DBSCAN(eps=.1*x)
    clusters = dbscan.fit_predict(kiva1_scaled)
    sil_score = metrics.silhouette_score(kiva1_scaled, clusters)
    scores.append(sil_score)
    episilons.append(.1*x)

In [None]:
plt.scatter(episilons,scores)

In [None]:
sil_score

In [None]:
kiva1_scaled[:,2].shape

In [None]:
plt.scatter(kiva1_scaled[:,0], kiva1_scaled[:,1], s=5, c=clusters)

In [None]:
epsilons = [.1*x for x in range(1,10)]   # eps values .1, .2, ..., .9
master_scores = []                       # one list of scores per min_samples value
for samp in range(1,5):
    scores = []
    for eps in epsilons:
        dbscan = DBSCAN(eps=eps, min_samples = samp)
        clusters = dbscan.fit_predict(kiva1_scaled)
        # silhouette_score is reached through the imported metrics module
        score = metrics.silhouette_score(kiva1_scaled,clusters)
        print(score)
        scores.append(score)
    master_scores.append(scores)

In [None]:
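epsilons = np.arange(1,10) * .1   # same eps values as the loop above: .1, .2, ..., .9
for samp, scores in zip(range(1,5), master_scores):
    plt.plot(epsilons, scores, marker='o', label='min_samples = %d' % samp)
plt.xlabel('epsilon')
plt.ylabel('silhouette coefficient')
plt.legend()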

In [None]:
# `short` is not defined in this notebook; assuming `kiva`, which has the same lender_count column
_ = plt.hist(kiva[kiva.lender_count<500].lender_count, bins=100, log=False)

## Note
You may use any dataset for both parts 1 and 2; I only recommend using the data I used in the Lesson for part 1. I've included several new datasets in the `data/` folder, such as `beers.csv`, `snow_tweets.csv`, `data/USCensus1990.data.txt.gz`. You do not need to unzip or ungzip any data files. Pandas can open these files on its own.
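To illustrate the point about compressed files, here is a small self-contained sketch that writes a gzip-compressed CSV to a temporary location and reads it straight back with pandas. The file name `example.csv.gz` and its contents are made up for the example; the real files live in `data/`.

```python
import gzip
import os
import tempfile

import pandas as pd

# Write a tiny gzip-compressed CSV to a temporary directory
# (hypothetical stand-in for one of the files in data/)
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "example.csv.gz")
with gzip.open(path, "wt") as handle:
    handle.write("a,b\n1,2\n3,4\n")

# compression defaults to 'infer', so the .gz extension is enough
# for read_csv to decompress the file on the fly
frame = pd.read_csv(path)
print(frame)
```

If a file has an unusual extension, passing `compression='gzip'` explicitly to `pd.read_csv` should have the same effect.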