# PCA and DBSCAN codealong

---

In this detailed codealong/lab we are going to practice and look more extensively at PCA (primarily). If time permits, we will also look at a popular unsupervised learning clustering algorithm called "Density-Based Spatial Clustering of Applications with Noise" (DBSCAN).

PCA is one of the more difficult concepts/algorithms in this class to understand well in such a short amount of time, but considering how often people use it to simplify their data, reduce noise in their data, and find unmeasured "latent variables", it is important to spend the time to understand what's going on.

Hopefully this will help with that!

---

### How does DBSCAN work?

DBSCAN, in a nutshell, groups datapoints together based on "density", or in other words how close together they are. Nearby points get assigned to a common cluster, whereas outlier points are labeled as noise and are not assigned to any cluster. DBSCAN is attractive for its simplicity and minimal pre-specified conditions: in particular, you do not need to choose the number of clusters in advance. For these reasons it is one of the most popular clustering algorithms.

There are only two parameters that need to be specified for DBSCAN:

    eps : the maximum distance between two points for them to count
          as neighbors (a "connection")
    
    min_samples : minimum number of points that a point needs to have 
                  within eps (in scikit-learn, counting the point itself)
                  to define it as a "core sample"
    
**Core samples** are by design the points that lie internally within a cluster. Non-core samples do not meet the minimum required number of neighboring points, but those that fall within `eps` of a core sample are still assigned to that core sample's cluster. Hence these points lie on the edges of a cluster.

The DBSCAN algorithm proceeds iteratively through the points, determining via the distance measure and minimum samples specified whether points are core samples, edge samples, or outliers (which are not assigned to any cluster).
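
The core/edge/outlier distinction described above can be sketched in a few lines of plain Python. This is a minimal illustration, not scikit-learn's implementation: the function name `dbscan_sketch` and the toy `points` are made up for this example, and the brute-force neighbor search is only sensible for tiny inputs.

```python
from math import dist  # Euclidean distance, available in Python 3.8+

NOISE = -1


def dbscan_sketch(points, eps, min_samples):
    """Label each point with a cluster id (0, 1, ...) or NOISE (-1).

    Hypothetical helper for illustration. A point's neighborhood includes
    the point itself, matching how scikit-learn counts min_samples.
    """
    n = len(points)
    # Indices of all points within eps of each point (including itself).
    neighbors = [
        [j for j in range(n) if dist(points[i], points[j]) <= eps]
        for i in range(n)
    ]
    is_core = [len(nb) >= min_samples for nb in neighbors]

    labels = [NOISE] * n
    cluster_id = 0
    for i in range(n):
        if labels[i] != NOISE or not is_core[i]:
            continue
        # Grow a new cluster outward from this unlabeled core point.
        labels[i] = cluster_id
        frontier = [i]
        while frontier:
            current = frontier.pop()
            for j in neighbors[current]:
                if labels[j] == NOISE:
                    labels[j] = cluster_id
                    # Only core points keep expanding; edge points stop here.
                    if is_core[j]:
                        frontier.append(j)
        cluster_id += 1
    return labels, is_core


points = [
    (0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0),          # dense square
    (2.0, 1.0),                                              # edge of the first square
    (10.0, 10.0), (10.0, 11.0), (11.0, 10.0), (11.0, 11.0),  # second dense square
    (5.0, 5.0),                                              # isolated point
]
labels, is_core = dbscan_sketch(points, eps=1.5, min_samples=4)
print(labels)   # [0, 0, 0, 0, 0, 1, 1, 1, 1, -1]
print(is_core)  # [True, True, True, True, False, True, True, True, True, False]
```

The point at `(2.0, 1.0)` has only three points within `eps` (itself included), so it is not a core sample, but it is reachable from a core sample and therefore joins cluster 0. The isolated point has no core sample nearby and keeps the noise label `-1`.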

---

### Dataset

The dataset we are using for this lab is a subset of the [much more detailed speed dating dataset](https://www.kaggle.com/annavictoria/speed-dating-experiment). In particular, this contains no information on the actual speed dating itself (successes with or opinions of other individuals). It also contains no "follow-up" information where individuals are re-asked the same questions about themselves. All it contains are things that an individual enjoys doing, their ratings of themselves on how desirable they are, and how they think others rate them on desirability.

Specifically, the columns in the data are outlined below:

    subject_id                   :   unique individual identifier
    like_sports                  :   enjoyment of participating in sports
    like_tvsports                :   enjoyment of watching sports on tv
    like_exercise                :   enjoyment of exercise
    like_food                    :   enjoyment of food
    like_museums                 :   enjoyment of museums
    like_art                     :   enjoyment of art
    like_hiking                  :   enjoyment of hiking
    like_gaming                  :   enjoyment of playing games
    like_clubbing                :   enjoyment of going clubbing/partying
    like_reading                 :   enjoyment of reading
    like_tv                      :   enjoyment of tv in general
    like_theater                 :   enjoyment of the theater (plays, musicals, etc.)
    like_movies                  :   enjoyment of movies
    like_concerts                :   enjoyment of concerts
    like_music                   :   enjoyment of music
    like_shopping                :   enjoyment of shopping
    like_yoga                    :   enjoyment of yoga
    subjective_attractiveness    :   how attractive they rate themselves
    subjective_sincerity         :   how sincere they rate themselves
    subjective_intelligence      :   how intelligent they rate themselves
    subjective_fun               :   how fun they rate themselves
    subjective_ambition          :   how ambitious they rate themselves
    objective_attractiveness     :   perceived rating others would give them on how attractive they are
    objective_sincerity          :   perceived rating others would give them on how sincere they are
    objective_intelligence       :   perceived rating others would give them on how intelligent they are
    objective_fun                :   perceived rating others would give them on how fun they are
    objective_ambition           :   perceived rating others would give them on how ambitious they are
    
There are 551 subjects total.

---

In [2]:
import numpy as np
import pandas as pd

from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

In [24]:
df = pd.read_csv('../../datasets/speed_dating_user_attributes.csv')

In [25]:
df.head()

Unnamed: 0,subject_id,wave,like_sports,like_tvsports,like_exercise,like_food,like_museums,like_art,like_hiking,like_gaming,...,subjective_attractiveness,subjective_sincerity,subjective_intelligence,subjective_fun,subjective_ambition,objective_attractiveness,objective_sincerity,objective_intelligence,objective_fun,objective_ambition
0,1,1,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,...,6.0,8.0,8.0,8.0,7.0,,,,,
1,2,1,3.0,2.0,7.0,10.0,8.0,6.0,3.0,5.0,...,7.0,5.0,8.0,10.0,3.0,,,,,
2,3,1,3.0,8.0,7.0,8.0,5.0,5.0,8.0,4.0,...,8.0,9.0,9.0,8.0,8.0,,,,,
3,4,1,1.0,1.0,6.0,7.0,6.0,7.0,7.0,5.0,...,7.0,8.0,7.0,9.0,8.0,,,,,
4,5,1,7.0,4.0,7.0,7.0,6.0,8.0,6.0,6.0,...,6.0,3.0,10.0,6.0,8.0,,,,,


In [26]:
df.shape

(551, 29)

In [27]:
df['objective_attractiveness'].unique()

array([ nan,   9.,   6.,   8.,   7.,   3.,   5.,  10.,   4.,   2.])

In [28]:
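from sklearn.impute import SimpleImputer

In [29]:
# Changing all NaN values to medians
# (Imputer was removed from sklearn.preprocessing in scikit-learn 0.22;
#  SimpleImputer is its replacement)

col = df.columns
impute = SimpleImputer(strategy='median')
df = pd.DataFrame(impute.fit_transform(df), columns=col)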
df.head()

Unnamed: 0,subject_id,wave,like_sports,like_tvsports,like_exercise,like_food,like_museums,like_art,like_hiking,like_gaming,...,subjective_attractiveness,subjective_sincerity,subjective_intelligence,subjective_fun,subjective_ambition,objective_attractiveness,objective_sincerity,objective_intelligence,objective_fun,objective_ambition
0,1.0,1.0,9.0,2.0,8.0,9.0,1.0,1.0,5.0,1.0,...,6.0,8.0,8.0,8.0,7.0,7.0,8.0,8.0,8.0,8.0
1,2.0,1.0,3.0,2.0,7.0,10.0,8.0,6.0,3.0,5.0,...,7.0,5.0,8.0,10.0,3.0,7.0,8.0,8.0,8.0,8.0
2,3.0,1.0,3.0,8.0,7.0,8.0,5.0,5.0,8.0,4.0,...,8.0,9.0,9.0,8.0,8.0,7.0,8.0,8.0,8.0,8.0
3,4.0,1.0,1.0,1.0,6.0,7.0,6.0,7.0,7.0,5.0,...,7.0,8.0,7.0,9.0,8.0,7.0,8.0,8.0,8.0,8.0
4,5.0,1.0,7.0,4.0,7.0,7.0,6.0,8.0,6.0,6.0,...,6.0,3.0,10.0,6.0,8.0,7.0,8.0,8.0,8.0,8.0


In [66]:
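from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

# Standardize every column so that eps is measured in comparable units.
# Note: df still includes subject_id and wave here, so both feed into the distances.
X = StandardScaler().fit_transform(df)

db = DBSCAN(eps=0.9, min_samples=30)
db.fit(X)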

DBSCAN(algorithm='auto', eps=0.9, leaf_size=30, metric='euclidean',
    min_samples=30, n_jobs=1, p=None)
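
Because `eps` is a distance in standardized feature space, one common heuristic (not part of the original notebook) is to look at each point's distance to its k-th nearest neighbor before settling on a value. A point is a core sample exactly when its distance to its `(min_samples - 1)`-th nearest other point is at most `eps`. The sketch below uses NumPy only; `kth_neighbor_distances` and the random `X_demo` are hypothetical stand-ins for illustration, not the speed dating data.

```python
import numpy as np


def kth_neighbor_distances(X, k):
    """Return each row's Euclidean distance to its k-th nearest other row, sorted ascending.

    Hypothetical helper: builds the full pairwise distance matrix, which is
    fine for a few hundred rows but not for large datasets.
    """
    X = np.asarray(X, dtype=float)
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=2))
    # After sorting each row, column 0 is the point's distance to itself (0).
    dists.sort(axis=1)
    return np.sort(dists[:, k])


# Assumed demo data: 200 rows of standard-normal noise in 29 columns.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(200, 29))

# min_samples=30 counts the point itself, so look at the 29th other neighbor.
k_dist = kth_neighbor_distances(X_demo, k=29)
print(k_dist.min(), np.median(k_dist), k_dist.max())
```

If nearly all of these distances are larger than the chosen `eps`, few or no points can qualify as core samples. Plotting the sorted values and looking for a bend in the curve is one way people pick a starting `eps`.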

In [67]:
print(X)

[[-1.73070723 -1.67820677  0.99411778 ..., -0.14541839  0.25237277
   0.16351203]
 [-1.72443173 -1.67820677 -1.30256523 ..., -0.14541839  0.25237277
   0.16351203]
 [-1.71815622 -1.67820677 -1.30256523 ..., -0.14541839  0.25237277
   0.16351203]
 ..., 
 [ 1.71454581  1.65041789 -0.53700422 ..., -1.13462248 -0.46813187
  -0.55724899]
 [ 1.72082132  1.65041789 -0.15422372 ...,  0.84378571 -4.07065503
  -2.71953206]
 [ 1.72709682  1.65041789  0.61133728 ...,  1.8329898  -1.90914113
   0.88427305]]


In [68]:
core_samples = db.core_sample_indices_
labels = db.labels_  # cluster id per row; -1 marks points treated as noise
print(labels)

[-1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1 -1
 -1 -1 -1 -1

In [43]:
from sklearn.metrics import silhouette_score

print("Silhouette Coefficient: %0.3f"
      % silhouette_score(X, labels))

ValueError: Number of labels is 1. Valid values are 2 to n_samples - 1 (inclusive)
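
The error arises because every point was labeled `-1`, so there is only one distinct label, and `silhouette_score` needs at least two. A small guard like the one below can be run on `db.labels_` before scoring. The helper names `summarize_labels` and `can_score_silhouette` are made up for this sketch, and the label lists are toy inputs rather than output from the notebook.

```python
from collections import Counter


def summarize_labels(labels):
    """Return (n_clusters, n_noise) for DBSCAN-style labels, where -1 marks noise."""
    counts = Counter(int(label) for label in labels)
    n_noise = counts.pop(-1, 0)
    return len(counts), n_noise


def can_score_silhouette(labels):
    """Check the condition in the error message: 2 to n_samples - 1 distinct labels."""
    n_distinct = len(set(int(label) for label in labels))
    return 2 <= n_distinct <= len(labels) - 1


all_noise = [-1] * 10
print(summarize_labels(all_noise))      # (0, 10)
print(can_score_silhouette(all_noise))  # False

mixed = [0, 0, 0, 1, 1, 1, -1]
print(summarize_labels(mixed))          # (2, 1)
print(can_score_silhouette(mixed))      # True
```

Note that `silhouette_score` treats `-1` as just another label, so it may be worth deciding whether to drop the noise points before scoring.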

In [72]:
df.dtypes

subject_id                   float64
wave                         float64
like_sports                  float64
like_tvsports                float64
like_exercise                float64
like_food                    float64
like_museums                 float64
like_art                     float64
like_hiking                  float64
like_gaming                  float64
like_clubbing                float64
like_reading                 float64
like_tv                      float64
like_theater                 float64
like_movies                  float64
like_concerts                float64
like_music                   float64
like_shopping                float64
like_yoga                    float64
subjective_attractiveness    float64
subjective_sincerity         float64
subjective_intelligence      float64
subjective_fun               float64
subjective_ambition          float64
objective_attractiveness     float64
objective_sincerity          float64
objective_intelligence       float64
o

In [83]:
x = 'subject_id'
x[-3:]

'_id'

In [90]:
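# Group the columns by name. 'subject_id' also contains 'subj',
# so it is excluded explicitly via its '_id' suffix.
subj_list = [x for x in df.columns if (('subj' in x) and (x[-3:] != '_id'))]
obj_list = [x for x in df.columns if x[0] == 'o']
like_list = [x for x in df.columns if 'like' in x]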

In [98]:
sub_standard = StandardScaler().fit_transform(df[subj_list])

In [99]:
obj_standard = StandardScaler().fit_transform(df[obj_list])
like_standard = StandardScaler().fit_transform(df[like_list])

In [100]:
from sklearn.decomposition import PCA

pca = PCA()
pca.fit(sub_standard)

PCA(copy=True, iterated_power='auto', n_components=None, random_state=None,
  svd_solver='auto', tol=0.0, whiten=False)

In [101]:
print("The information (explained variance) contained in each principal component: ", pca.explained_variance_ratio_)
print(pca.components_)

('The information (explained variance) contained in each principal component: ', array([ 0.43093117,  0.17771361,  0.1472401 ,  0.14454555,  0.09956957]))
[[-0.50226223 -0.31548092 -0.46678014 -0.48386605 -0.44293836]
 [ 0.21520117 -0.90627866 -0.10583025  0.27853298  0.20872658]
 [-0.40233845 -0.14490503  0.42131786 -0.47995234  0.63973691]
 [ 0.34979236 -0.22449684  0.64816115 -0.3593041  -0.52728944]
 [-0.64590261 -0.08794591  0.41628674  0.57344514 -0.27007657]]
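
To see where numbers like these come from, the explained variance ratios can be reproduced by hand. The principal axes are the eigenvectors of the covariance matrix of the standardized data, and each ratio is an eigenvalue divided by the sum of all eigenvalues. The sketch below uses NumPy only; the random matrix `raw` and its standardized version `Z` are made-up stand-ins for a block such as `sub_standard`, so the printed values will not match the output above.

```python
import numpy as np

rng = np.random.RandomState(0)

# Assumed demo data standing in for a standardized (n_samples, 5) block.
raw = rng.normal(size=(300, 5))
raw[:, 1] += 0.8 * raw[:, 0]  # make two columns correlated
Z = (raw - raw.mean(axis=0)) / raw.std(axis=0)

# Covariance matrix of the standardized columns (rows are observations).
cov = np.cov(Z, rowvar=False)

# eigh is intended for symmetric matrices and returns eigenvalues in ascending order.
eigenvalues, eigenvectors = np.linalg.eigh(cov)
order = np.argsort(eigenvalues)[::-1]
eigenvalues = eigenvalues[order]
components = eigenvectors[:, order].T  # one principal axis per row

explained_variance_ratio = eigenvalues / eigenvalues.sum()
print(explained_variance_ratio)
print(np.cumsum(explained_variance_ratio))
```

The rows of `components` play the same role as the rows of `pca.components_`, although the sign of any individual row may be flipped, since an eigenvector and its negative describe the same axis. The cumulative sum shows how much of the total variance the first few components retain together.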


In [102]:
pca = PCA()
pca.fit(obj_standard)
print("The information (explained variance) contained in each principal component: ", pca.explained_variance_ratio_)
print(pca.components_)

('The information (explained variance) contained in each principal component: ', array([ 0.45677934,  0.18837747,  0.14350009,  0.12440376,  0.08693934]))
[[-0.44441374 -0.38535047 -0.48425962 -0.43061549 -0.48380199]
 [-0.44800261  0.622034    0.41206787 -0.48866426 -0.06143838]
 [-0.00264133 -0.57790748  0.36394753 -0.49396673  0.53810319]
 [-0.72433758 -0.00767019 -0.20758188  0.41517172  0.50972389]
 [-0.2777016  -0.36130711  0.64818566  0.39925789 -0.46128893]]


In [103]:
pca = PCA()
pca.fit(like_standard)
print("The information (explained variance) contained in each principal component: ", pca.explained_variance_ratio_)
print(pca.components_)

('The information (explained variance) contained in each principal component: ', array([ 0.23222315,  0.12227591,  0.09897686,  0.07410494,  0.06508082,
        0.059612  ,  0.05665896,  0.04718946,  0.04397325,  0.04171317,
        0.03349372,  0.02985333,  0.0259053 ,  0.02508021,  0.01898246,
        0.01758949,  0.00728696]))
[[ 0.11119041  0.043285    0.01374125 -0.2716203  -0.39588287 -0.39456449
  -0.08724977 -0.00852673 -0.10694796 -0.17137261 -0.13126497 -0.38427033
  -0.30834179 -0.32919044 -0.26995016 -0.24607328 -0.22167613]
 [ 0.47337083  0.50720676  0.33593843  0.08280048 -0.11916853 -0.11090991
   0.11819994  0.3710932   0.21431877 -0.21055602  0.24502036 -0.07150713
   0.0582877   0.11130953  0.12328159  0.17896408  0.04836226]
 [-0.32616646 -0.01586391 -0.25219266  0.07062075 -0.21749909 -0.21976452
  -0.47763116  0.07737794  0.09791859 -0.13267064  0.47682628  0.05947868
   0.22550643 -0.08108141 -0.01127378  0.36349851 -0.21743507]
 [ 0.04570789 -0.06713625  0.403547