# Star_Galaxy_HDBSCAN_v1
**Author:** Kalea Sebesta<br>
**Date:** 10 October 2018<br>

**Purpose:** The purpose of this notebook is to explore the application of HDBSCAN to the Sloan Digital Sky Survey dataset from Kaggle: https://www.kaggle.com/lucidlenn/sloan-digital-sky-survey

**Description of Algorithm:** HDBSCAN stands for Hierarchical Density-Based Spatial Clustering of Applications with Noise. It builds a hierarchy of density-based clusters and then extracts a flat clustering made up of the clusters that are considered most stable. According to the HDBSCAN documentation at https://hdbscan.readthedocs.io/en/latest/how_hdbscan_works.html "We can break it out into a series of steps:
1) Transform the space according to the density/sparsity.
2) Build the minimum spanning tree of the distance weighted graph.
3) Construct a cluster hierarchy of connected components.
4) Condense the cluster hierarchy based on minimum cluster size.
5) Extract the stable clusters from the condensed tree."

Let's explain these steps in greater detail for a better understanding. Transforming the space is essentially a way to "find the islands of higher density amongst the sea of sparser noise". This makes the algorithm robust to noise, which is extremely useful. To accomplish this, two distance measures are used: **core distance** and **mutual reachability**. The core distance of a point is the distance from that point to its k-th nearest neighbor, so it depends on the chosen k: points in dense regions have small core distances and points in sparse regions have large ones. The mutual reachability distance between two points a and b is the largest of the following three values:
- the core distance of the first point, core(a)
- the core distance of the second point, core(b)
- the distance between point a and point b
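
To make the two definitions concrete, here is a minimal sketch on toy data. It assumes Euclidean distance and does not count a point as its own neighbor; the function names are illustrative and the library itself may count neighbors differently.

```python
import math

def core_distance(points, i, k):
    """Distance from points[i] to its k-th nearest neighbor (excluding itself)."""
    dists = sorted(math.dist(points[i], q) for j, q in enumerate(points) if j != i)
    return dists[k - 1]

def mutual_reachability(points, a, b, k):
    """Largest of core(a), core(b) and the plain distance between a and b."""
    return max(core_distance(points, a, k),
               core_distance(points, b, k),
               math.dist(points[a], points[b]))

# Toy data: three points close together and one far away.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 0.0)]

print(core_distance(points, 0, k=2))           # 1.0
print(mutual_reachability(points, 0, 1, k=2))  # sqrt(2), the core distance of point 1
print(mutual_reachability(points, 0, 3, k=2))  # 10.0, the isolated point stays far away
```

Points in the dense group keep small mutual reachability distances, while the isolated point is pushed at least as far away as its own core distance.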

The mutual reachability distances give the information needed to build the minimum spanning tree. Treat the data as a weighted graph in which every point is a vertex and the edge between any two points is weighted by their mutual reachability distance. The minimum spanning tree is the set of edges with the smallest total weight that still connects all points. To see the cluster structure, consider a threshold value that starts high and is steadily lowered: any edge with a weight above the threshold is dropped, so the graph gradually breaks apart into connected components.
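
As a rough illustration, the sketch below builds a minimum spanning tree with Prim's algorithm from a small, made-up matrix of mutual reachability distances and then applies a threshold to it. Prim's algorithm is chosen here only for brevity; the function name and the numbers are assumptions for the example.

```python
def minimum_spanning_tree(weights):
    """Prim's algorithm on a dense symmetric weight matrix.

    Returns a list of (i, j, weight) edges that connect all nodes.
    """
    n = len(weights)
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # pick the cheapest edge that leaves the current tree
        w, i, j = min((weights[i][j], i, j)
                      for i in in_tree for j in range(n) if j not in in_tree)
        edges.append((i, j, w))
        in_tree.add(j)
    return edges

# Hypothetical mutual reachability distances between four points.
mreach = [
    [0.0, 1.0, 4.0, 9.0],
    [1.0, 0.0, 2.0, 8.0],
    [4.0, 2.0, 0.0, 3.0],
    [9.0, 8.0, 3.0, 0.0],
]

tree = minimum_spanning_tree(mreach)
print(tree)  # [(0, 1, 1.0), (1, 2, 2.0), (2, 3, 3.0)]

# Lowering the threshold drops the heaviest edges first and splits the tree.
threshold = 2.5
print([e for e in tree if e[2] <= threshold])  # [(0, 1, 1.0), (1, 2, 2.0)]
```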

Next, the cluster hierarchy is built from the minimum spanning tree. The hierarchy is essentially the sequence of connected components of that tree. It can be built by sorting the edges of the tree in increasing order of distance and then iterating through them, merging the two clusters that each edge joins.
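
The sort-and-merge procedure can be sketched with a small union-find structure. This is an illustrative version on made-up edges, not the library's implementation.

```python
def single_linkage_merges(n_points, edges):
    """Process tree edges by increasing weight and record every merge.

    Returns a list of (distance, size_of_merged_cluster) tuples.
    """
    parent = list(range(n_points))
    size = [1] * n_points

    def find(x):
        # follow parent pointers to the representative of x's cluster
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    merges = []
    for i, j, w in sorted(edges, key=lambda e: e[2]):
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri
            size[ri] += size[rj]
            merges.append((w, size[ri]))
    return merges

# Made-up minimum spanning tree edges as (point, point, distance).
mst_edges = [(2, 3, 3.0), (0, 1, 1.0), (1, 2, 2.0)]
print(single_linkage_merges(4, mst_edges))
# [(1.0, 2), (2.0, 3), (3.0, 4)]
```

Reading the merge list from the end backwards gives the top-down view of the hierarchy: the last merge is the first split as the distance threshold is lowered.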

From there, the HDBSCAN algorithm condenses the cluster tree. This is accomplished by using the parameter **minimum cluster size** as the "fall out point". At each split in the hierarchy, if a new cluster created by the split has fewer points than the minimum cluster size, those points are considered to have fallen out of the parent cluster (noise) rather than to form a cluster of their own. If both clusters from the split have at least the minimum cluster size, then they are both considered true clusters.
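
The split rule can be written down as a small decision function. This is only a sketch of the rule as described above; the function name and the returned strings are made up for illustration.

```python
def classify_split(left_size, right_size, min_cluster_size):
    """Decide what a split in the hierarchy means for the condensed tree."""
    left_ok = left_size >= min_cluster_size
    right_ok = right_size >= min_cluster_size
    if left_ok and right_ok:
        return 'true split: two new clusters'
    if left_ok or right_ok:
        return 'parent persists: the small side falls out as noise'
    return 'parent ends: both sides fall out as noise'

print(classify_split(40, 3, min_cluster_size=5))
# parent persists: the small side falls out as noise
print(classify_split(40, 12, min_cluster_size=5))
# true split: two new clusters
```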

Lastly, the flat clusters are extracted from the condensed tree, preferring clusters that have a longer lifetime, that is, clusters that persist over a wide range of distance thresholds.
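
The HDBSCAN documentation formalizes "lifetime" as cluster stability, measured in terms of lambda = 1 / distance: each point contributes the difference between the lambda at which it leaves the cluster and the lambda at which the cluster was born. A minimal sketch with made-up lambda values:

```python
def stability(lambda_birth, lambda_points):
    """Sum over points of (lambda when the point leaves - lambda at cluster birth)."""
    return sum(lam - lambda_birth for lam in lambda_points)

# Hypothetical parent cluster born at lambda = 0.5 that splits at lambda = 1.0.
parent = stability(0.5, [1.0] * 10)    # all 10 points stay until the split
child_a = stability(1.0, [1.5] * 6)    # 6 points leave at lambda = 1.5
child_b = stability(1.0, [1.25] * 4)   # 4 points leave at lambda = 1.25

# Keep the children only if together they are more stable than the parent.
selected = 'children' if child_a + child_b > parent else 'parent'
print(parent, child_a + child_b, selected)  # 5.0 4.0 parent
```

In this toy case the parent lives longer than its children combined, so the parent is kept as one cluster.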

It is important to note that HDBSCAN has 15 parameters to tune, which makes the algorithm not only robust to noise but also applicable to many different kinds of data.

For the purpose of this notebook, minimum cluster size and metric are the two parameters that I will initially investigate and tune. All other parameters are left at their HDBSCAN defaults.

### Data Description
According to the Sloan Digital Sky Survey / SkyServer Glossary, the features of the dataset are described as follows:
- objid: The long object identification, which is a bit-encoded integer of run,rerun, camcol, field, object. When the data is reprocessed (rerun), this number will change! IMPORTANT NOTE: For spectroscopic objects, there are two possible choices for the matching photometric measurement: TargetObjID is the photometric object identification number of the corresponding photometric object when targeting was run, and BestObjID, which points to the best imaging and processing of the photometry.
- ra: The SDSS has two sets of coordinates it uses which are specially designed for the survey geometry. spherical coordinates (corresponding to **RA** and Dec)
- dec: The SDSS has two sets of coordinates it uses which are specially designed for the survey geometry. spherical coordinates (corresponding to RA and **Dec**)
- u: The SDSS uses five filters: u, g, r, i, z. Ultraviolet (u), 3543 Angstroms
- g: The SDSS uses five filters: u, g, r, i, z. Green (g), 4770 Angstroms
- r: The SDSS uses five filters: u, g, r, i, z. Red (r), 6231 Angstroms
- i: The SDSS uses five filters: u, g, r, i, z. Near Infrared (i), 7625 Angstroms
- z: The SDSS uses five filters: u, g, r, i, z. Infrared (z), 9134 Angstroms
- run: A Run is just a length of a strip observed in a single contiguous observing pass scan, bounded by lines of mu and nu. A strip covers a great circle region from pole to pole; this cannot be observed in one pass. The fraction of a strip observed at one time (limited by observing conditions) is a Run. Runs can (and usually do) overlap at the ends. Like strips, it takes a pair of runs to fill in a length of a stripe. This is why you may read about data taken from "Runs 752/756" or some similar terminology. However, each individual run does contain 6 camcols spanning the same range of nu, but not delimited by eta. These run pairs might not have the same starting and ending nu coordinates.
- rerun: A reprocessing of an imaging run. The underlying imaging data is the same, just the software version and calibration may have changed.
- camcol: A Camcol is the output of one camera column as part of a Run. Therefore, 1 Camcol = 1/6 of a Run. It is also a portion of a scanline.
- field: A field is a part of a camcol that is processed by the Photo pipeline at one time. Fields are 2048x1489 pixels; a field consists of the frames in the 5 filters for the same part of the sky. Fields overlap each other by 128 rows; primaries are decided when Chunks are resolved (using objects between rows 64 and 1425 as primaries). A field at the edge of a Chunk may in fact be included in 2 (or more) Chunks.
- specobjid: A unique bit-encoded 64-bit ID used for spectroscopic objects. It is generated from plateid, mjd, and fiberid. Completely independent of any photometric enumeration system.
- class: Outcome variable (STAR, GALAXY, QSO)
- redshift:
- plate: SDSS has the largest multi-fiber spectrograph in operation in the world, observing 640 objects simultaneously. This is done by drilling holes in round aluminum plates at the positions of objects of interest, and plugging optical fibers into each hole.
- mjd: Part of the specobjid
- fiberid: The SDSS spectrograph uses optical fibers to direct the light from individual objects to the slithead. Each object is assigned a corresponding fiberID. The fibers are 3 arcsec in diameter in the source plane. Each fiber is surrounded by a large sheath which prevents any pair of fibers from being placed closer than 55 arcsec on the same plate.

### Import Packages

In [205]:
import pandas as pd
import hdbscan 
import numpy as np
import pandas_profiling as pp
from sklearn import preprocessing
from scipy.stats import trim_mean, kurtosis
from scipy.stats.mstats import mode, gmean, hmean
from functools import reduce

### Read in File & Exploratory Analysis
After the file is read into a pandas dataframe, I check for missing values. In this particular dataset there are no missing values. From there I look at the descriptive statistics for the numerical variables. After looking at the descriptive statistics, I utilize the pandas profiling package to run a report on the dataframe. This gives histograms for the distributions of the variables, Spearman and Pearson correlations, and descriptive statistics. From this it is seen that objid and rerun are constants and should be dropped from any analysis that is performed. Also, specobjid is a unique identifier composed from plateid, mjd, and fiberid, and thus it should also be dropped prior to analysis. In addition, the following variables are rejected because they are highly correlated with another variable:
- i, which is highly correlated with r (ρ = 0.97767)
- mjd, which is highly correlated with plate (ρ = 0.96688)
- plate, which is highly correlated with specobjid (ρ = 1)
- r, which is highly correlated with g (ρ = 0.95811)
- z, which is highly correlated with i (ρ = 0.98151)
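
The correlation-based rejections can also be checked by hand. The sketch below flags pairs of numeric columns whose absolute Pearson correlation exceeds a threshold; the helper name, the 0.9 threshold, and the toy values are assumptions for illustration and are not taken from the profiling report.

```python
import pandas as pd

def highly_correlated_pairs(frame, threshold=0.9):
    """Return (col_a, col_b, rho) for numeric column pairs with |Pearson rho| > threshold."""
    corr = frame.select_dtypes(include='number').corr()
    cols = list(corr.columns)
    pairs = []
    for a_idx, a in enumerate(cols):
        # only look at each unordered pair once
        for b in cols[a_idx + 1:]:
            rho = float(corr.loc[a, b])
            if abs(rho) > threshold:
                pairs.append((a, b, round(rho, 3)))
    return pairs

# Toy frame with made-up values: 'r' moves almost in step with 'g'.
toy = pd.DataFrame({
    'g': [17.0, 18.2, 16.6, 17.2, 18.8],
    'r': [15.9, 17.5, 16.2, 16.4, 18.1],
    'ra': [183.5, 10.2, 240.1, 95.7, 150.3],
})
print(highly_correlated_pairs(toy))
```

On the real dataframe the call would be `highly_correlated_pairs(df)`; only one variable from each flagged pair needs to be dropped.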

In [219]:
#read in file
df = pd.read_csv('/Users/kaleasebesta/Downloads/Skyserver_SQL2_27_2018 6_51_39 PM.csv')

In [221]:
df.isnull().sum()

objid        0
ra           0
dec          0
u            0
g            0
r            0
i            0
z            0
run          0
rerun        0
camcol       0
field        0
specobjid    0
class        0
redshift     0
plate        0
mjd          0
fiberid      0
dtype: int64

In [222]:
df.describe()

statistic,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,redshift,plate,mjd,fiberid
count,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0,10000.0
mean,1.23765e+18,175.529987,14.836148,18.619355,17.371931,16.840963,16.583579,16.422833,981.0348,301.0,3.6487,302.3801,1.645022e+18,0.143726,1460.9864,52943.5333,353.0694
std,157703.9,47.783439,25.212207,0.828656,0.945457,1.067764,1.141805,1.203188,273.305024,0.0,1.666183,162.577763,2.013998e+18,0.388774,1788.778371,1511.150651,206.298149
min,1.23765e+18,8.2351,-5.382632,12.98897,12.79955,12.4316,11.94721,11.61041,308.0,301.0,1.0,11.0,2.99578e+17,-0.004136,266.0,51578.0,1.0
25%,1.23765e+18,157.370946,-0.539035,18.178035,16.8151,16.173333,15.853705,15.618285,752.0,301.0,2.0,184.0,3.389248e+17,8.1e-05,301.0,51900.0,186.75
50%,1.23765e+18,180.394514,0.404166,18.853095,17.495135,16.85877,16.554985,16.389945,756.0,301.0,4.0,299.0,4.96658e+17,0.042591,441.0,51997.0,351.0
75%,1.23765e+18,201.547279,35.649397,19.259232,18.010145,17.512675,17.25855,17.141447,1331.0,301.0,5.0,414.0,2.8813e+18,0.092579,2559.0,54468.0,510.0
max,1.23765e+18,260.884382,68.542265,19.5999,19.91897,24.80204,28.17963,22.83306,1412.0,301.0,6.0,768.0,9.46883e+18,5.353854,8410.0,57481.0,1000.0


In [16]:
pp.ProfileReport(df)

**Dataset info**

0,1
Number of variables,18
Number of observations,10000
Total Missing (%),0.0%
Total size in memory,1.4 MiB
Average record size in memory,144.0 B

**Variable types**

0,1
Numeric,10
Categorical,1
Boolean,0
Date,0
Text (Unique),0
Rejected,7
Unsupported,0

**Variable: camcol** (numeric)

0,1
Distinct count,6
Unique (%),0.1%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,3.6487
Minimum,1
Maximum,6
Zeros (%),0.0%

0,1
Minimum,1
5-th percentile,1
Q1,2
Median,4
Q3,5
95-th percentile,6
Maximum,6
Range,5
Interquartile range,3

0,1
Standard deviation,1.6662
Coef of variation,0.45665
Kurtosis,-1.222
Mean,3.6487
MAD,1.4545
Skewness,-0.10022
Sum,36487
Variance,2.7762
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
4,1834,18.3%,
5,1827,18.3%,
6,1769,17.7%,
2,1712,17.1%,
3,1560,15.6%,
1,1298,13.0%,

Value,Count,Frequency (%),Unnamed: 3
1,1298,13.0%,
2,1712,17.1%,
3,1560,15.6%,
4,1834,18.3%,
5,1827,18.3%,

Value,Count,Frequency (%),Unnamed: 3
2,1712,17.1%,
3,1560,15.6%,
4,1834,18.3%,
5,1827,18.3%,
6,1769,17.7%,

**Variable: class** (categorical)

0,1
Distinct count,3
Unique (%),0.0%
Missing (%),0.0%
Missing (n),0

0,1
GALAXY,4998
STAR,4152
QSO,850

Value,Count,Frequency (%),Unnamed: 3
GALAXY,4998,50.0%,
STAR,4152,41.5%,
QSO,850,8.5%,

**Variable: dec** (numeric)

0,1
Distinct count,10000
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,14.836
Minimum,-5.3826
Maximum,68.542
Zeros (%),0.0%

0,1
Minimum,-5.3826
5-th percentile,-1.7731
Q1,-0.53904
Median,0.40417
Q3,35.649
95-th percentile,66.176
Maximum,68.542
Range,73.925
Interquartile range,36.188

0,1
Standard deviation,25.212
Coef of variation,1.6994
Kurtosis,-0.40615
Mean,14.836
MAD,21.439
Skewness,1.1915
Sum,148360
Variance,635.66
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
0.701145,1,0.0%,
0.03114837,1,0.0%,
0.322301336,1,0.0%,
-0.728211449,1,0.0%,
60.96031635,1,0.0%,
68.27531163,1,0.0%,
-0.82763499,1,0.0%,
-0.037103909,1,0.0%,
0.271520578,1,0.0%,
59.67595514,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-5.382632499,1,0.0%,
-5.378793694,1,0.0%,
-5.371988496,1,0.0%,
-5.35457198,1,0.0%,
-5.3496208,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
68.48016107,1,0.0%,
68.48323595,1,0.0%,
68.53200681,1,0.0%,
68.54056693,1,0.0%,
68.54226541,1,0.0%,

**Variable: fiberid** (numeric)

0,1
Distinct count,892
Unique (%),8.9%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,353.07
Minimum,1
Maximum,1000
Zeros (%),0.0%

0,1
Minimum,1.0
5-th percentile,36.0
Q1,186.75
Median,351.0
Q3,510.0
95-th percentile,636.0
Maximum,1000.0
Range,999.0
Interquartile range,323.25

0,1
Standard deviation,206.3
Coef of variation,0.5843
Kurtosis,-0.30854
Mean,353.07
MAD,172.03
Skewness,0.30805
Sum,3530694
Variance,42559
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
155,29,0.3%,
138,28,0.3%,
454,27,0.3%,
249,26,0.3%,
11,26,0.3%,
287,25,0.2%,
300,25,0.2%,
568,25,0.2%,
291,25,0.2%,
506,25,0.2%,

Value,Count,Frequency (%),Unnamed: 3
1,12,0.1%,
2,11,0.1%,
3,13,0.1%,
4,15,0.1%,
5,19,0.2%,

Value,Count,Frequency (%),Unnamed: 3
996,2,0.0%,
997,2,0.0%,
998,1,0.0%,
999,1,0.0%,
1000,1,0.0%,

**Variable: field** (numeric)

0,1
Distinct count,703
Unique (%),7.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,302.38
Minimum,11
Maximum,768
Zeros (%),0.0%

0,1
Minimum,11.0
5-th percentile,32.95
Q1,184.0
Median,299.0
Q3,414.0
95-th percentile,582.0
Maximum,768.0
Range,757.0
Interquartile range,230.0

0,1
Standard deviation,162.58
Coef of variation,0.53766
Kurtosis,-0.47805
Mean,302.38
MAD,129.33
Skewness,0.2498
Sum,3023801
Variance,26432
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
301,62,0.6%,
302,58,0.6%,
304,56,0.6%,
305,55,0.5%,
309,54,0.5%,
307,54,0.5%,
312,51,0.5%,
311,51,0.5%,
300,50,0.5%,
310,48,0.5%,

Value,Count,Frequency (%),Unnamed: 3
11,24,0.2%,
12,21,0.2%,
13,27,0.3%,
14,25,0.2%,
15,25,0.2%,

Value,Count,Frequency (%),Unnamed: 3
764,5,0.1%,
765,4,0.0%,
766,3,0.0%,
767,5,0.1%,
768,3,0.0%,

**Variable: g** (numeric)

0,1
Distinct count,9817
Unique (%),98.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,17.372
Minimum,12.8
Maximum,19.919
Zeros (%),0.0%

0,1
Minimum,12.8
5-th percentile,15.62
Q1,16.815
Median,17.495
Q3,18.01
95-th percentile,18.822
Maximum,19.919
Range,7.1194
Interquartile range,1.195

0,1
Standard deviation,0.94546
Coef of variation,0.054424
Kurtosis,0.44398
Mean,17.372
MAD,0.73972
Skewness,-0.53629
Sum,173720
Variance,0.89389
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
17.55623,3,0.0%,
17.75478,3,0.0%,
17.60766,3,0.0%,
18.3191,3,0.0%,
17.53612,2,0.0%,
18.17183,2,0.0%,
17.998920000000002,2,0.0%,
17.49261,2,0.0%,
17.73917,2,0.0%,
18.24233,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
12.79955,1,0.0%,
13.08055,1,0.0%,
13.20555,1,0.0%,
13.25014,1,0.0%,
13.43728,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
19.65993,1,0.0%,
19.67224,1,0.0%,
19.68232,1,0.0%,
19.73869,1,0.0%,
19.91897,1,0.0%,

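**Variable: i** (rejected, highly correlated with r)

0,1
Correlation,0.97767

**Variable: mjd** (rejected, highly correlated with plate)

0,1
Correlation,0.96688

**Variable: objid** (rejected, constant)

0,1
Constant value,1.2376e+18

**Variable: plate** (rejected, highly correlated with specobjid)

0,1
Correlation,1

**Variable: r** (rejected, highly correlated with g)

0,1
Correlation,0.95811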

**Variable: ra** (numeric)

0,1
Distinct count,10000
Unique (%),100.0%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,175.53
Minimum,8.2351
Maximum,260.88
Zeros (%),0.0%

0,1
Minimum,8.2351
5-th percentile,62.12
Q1,157.37
Median,180.39
Q3,201.55
95-th percentile,243.82
Maximum,260.88
Range,252.65
Interquartile range,44.176

0,1
Standard deviation,47.783
Coef of variation,0.27222
Kurtosis,2.6636
Mean,175.53
MAD,33.802
Skewness,-1.2274
Sum,1755300
Variance,2283.3
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
226.511352,1,0.0%,
186.8641983,1,0.0%,
162.41278280000003,1,0.0%,
242.41659750000002,1,0.0%,
25.76549491,1,0.0%,
190.18234230000002,1,0.0%,
141.3838027,1,0.0%,
178.8316406,1,0.0%,
177.3189666,1,0.0%,
189.1913606,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
8.235100497000001,1,0.0%,
8.245963351,1,0.0%,
8.29136717,1,0.0%,
8.386869414,1,0.0%,
8.487886222,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
260.7587632,1,0.0%,
260.7601998,1,0.0%,
260.8111888,1,0.0%,
260.85089750000003,1,0.0%,
260.8843818,1,0.0%,

**Variable: redshift** (numeric)

0,1
Distinct count,9637
Unique (%),96.4%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,0.14373
Minimum,-0.0041361
Maximum,5.3539
Zeros (%),0.2%

0,1
Minimum,-0.0041361
5-th percentile,-0.00031907
Q1,8.09e-05
Median,0.042591
Q3,0.092579
95-th percentile,1.0518
Maximum,5.3539
Range,5.358
Interquartile range,0.092498

0,1
Standard deviation,0.38877
Coef of variation,2.705
Kurtosis,20.55
Mean,0.14373
MAD,0.18531
Skewness,4.2657
Sum,1437.3
Variance,0.15115
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
0.0,19,0.2%,
-1.98e-05,6,0.1%,
-6.159999999999999e-05,5,0.1%,
-1.04e-05,4,0.0%,
-3.27e-05,4,0.0%,
6.31e-05,4,0.0%,
-8.73e-05,4,0.0%,
-7.32e-05,4,0.0%,
-4.8200000000000006e-05,4,0.0%,
6.18e-05,4,0.0%,

Value,Count,Frequency (%),Unnamed: 3
-0.004136078,2,0.0%,
-0.003327649,1,0.0%,
-0.002965176,1,0.0%,
-0.002054598,1,0.0%,
-0.0014432179999999,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
3.014294,1,0.0%,
3.829476,1,0.0%,
3.896586,1,0.0%,
4.106183000000001,1,0.0%,
5.353854,1,0.0%,

**Variable: rerun** (rejected, constant)

0,1
Constant value,301

**Variable: run** (numeric)

0,1
Distinct count,23
Unique (%),0.2%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,981.03
Minimum,308
Maximum,1412
Zeros (%),0.0%

0,1
Minimum,308
5-th percentile,752
Q1,752
Median,756
Q3,1331
95-th percentile,1402
Maximum,1412
Range,1104
Interquartile range,579

0,1
Standard deviation,273.31
Coef of variation,0.27859
Kurtosis,-1.5589
Mean,981.03
MAD,259.04
Skewness,0.41255
Sum,9810348
Variance,74696
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
756,3060,30.6%,
752,2086,20.9%,
1345,915,9.2%,
1350,540,5.4%,
1140,527,5.3%,
745,453,4.5%,
1035,396,4.0%,
1412,347,3.5%,
1302,246,2.5%,
1331,245,2.5%,

Value,Count,Frequency (%),Unnamed: 3
308,31,0.3%,
727,4,0.0%,
745,453,4.5%,
752,2086,20.9%,
756,3060,30.6%,

Value,Count,Frequency (%),Unnamed: 3
1356,4,0.0%,
1402,49,0.5%,
1404,137,1.4%,
1411,10,0.1%,
1412,347,3.5%,

**Variable: specobjid** (numeric)

0,1
Distinct count,6349
Unique (%),63.5%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,1.645e+18
Minimum,2.9958e+17
Maximum,9.4688e+18
Zeros (%),0.0%

0,1
Minimum,2.9958e+17
5-th percentile,3.0748e+17
Q1,3.3892e+17
Median,4.9666e+17
Q3,2.8813e+18
95-th percentile,6.4493e+18
Maximum,9.4688e+18
Range,9.1693e+18
Interquartile range,2.5424e+18

0,1
Standard deviation,2.014e+18
Coef of variation,1.2243
Kurtosis,2.9654
Mean,1.645e+18
MAD,1.6212e+18
Skewness,1.7946
Sum,1.645e+22
Variance,4.0562e+36
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
2.88127e+18,18,0.2%,
2.88122e+18,18,0.2%,
3.72223e+18,18,0.2%,
3.22241e+18,17,0.2%,
2.88012e+18,17,0.2%,
2.88008e+18,16,0.2%,
3.22237e+18,16,0.2%,
3.21118e+18,16,0.2%,
2.88011e+18,16,0.2%,
2.88006e+18,16,0.2%,

Value,Count,Frequency (%),Unnamed: 3
2.99578e+17,1,0.0%,
2.99582e+17,1,0.0%,
2.99583e+17,1,0.0%,
2.99585e+17,1,0.0%,
2.99588e+17,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
9.31932e+18,2,0.0%,
9.32039e+18,1,0.0%,
9.32043e+18,1,0.0%,
9.33501e+18,1,0.0%,
9.46883e+18,1,0.0%,

**Variable: u** (numeric)

0,1
Distinct count,9730
Unique (%),97.3%
Missing (%),0.0%
Missing (n),0
Infinite (%),0.0%
Infinite (n),0

0,1
Mean,18.619
Minimum,12.989
Maximum,19.6
Zeros (%),0.0%

0,1
Minimum,12.989
5-th percentile,16.971
Q1,18.178
Median,18.853
Q3,19.259
95-th percentile,19.534
Maximum,19.6
Range,6.6109
Interquartile range,1.0812

0,1
Standard deviation,0.82866
Coef of variation,0.044505
Kurtosis,1.4325
Mean,18.619
MAD,0.65708
Skewness,-1.2198
Sum,186190
Variance,0.68667
Memory size,78.2 KiB

Value,Count,Frequency (%),Unnamed: 3
18.90212,3,0.0%,
18.99697,3,0.0%,
18.984,3,0.0%,
19.53507,3,0.0%,
19.5635,3,0.0%,
19.2575,3,0.0%,
19.49994,3,0.0%,
19.435660000000002,2,0.0%,
19.48646,2,0.0%,
19.31125,2,0.0%,

Value,Count,Frequency (%),Unnamed: 3
12.98897,1,0.0%,
13.55178,1,0.0%,
13.99371,1,0.0%,
14.45856,1,0.0%,
14.72825,1,0.0%,

Value,Count,Frequency (%),Unnamed: 3
19.59934,1,0.0%,
19.5996,1,0.0%,
19.59971,1,0.0%,
19.59975,1,0.0%,
19.5999,1,0.0%,

**Variable: z** (rejected, highly correlated with i)

0,1
Correlation,0.98151

**Sample**

index,objid,ra,dec,u,g,r,i,z,run,rerun,camcol,field,specobjid,class,redshift,plate,mjd,fiberid
0,1.23765e+18,183.531326,0.089693,19.47406,17.0424,15.94699,15.50342,15.22531,752,301,4,267,3.72236e+18,STAR,-9e-06,3306,54922,491
1,1.23765e+18,183.598371,0.135285,18.6628,17.21449,16.67637,16.48922,16.3915,752,301,4,267,3.63814e+17,STAR,-5.5e-05,323,51615,541
2,1.23765e+18,183.680207,0.126185,19.38298,18.19169,17.47428,17.08732,16.80125,752,301,4,268,3.23274e+17,GALAXY,0.123111,287,52023,513
3,1.23765e+18,183.870529,0.049911,17.76536,16.60272,16.16116,15.98233,15.90438,752,301,4,269,3.72237e+18,STAR,-0.000111,3306,54922,510
4,1.23765e+18,183.883288,0.102557,17.55025,16.26342,16.43869,16.55492,16.61326,752,301,4,269,3.72237e+18,STAR,0.00059,3306,54922,512


### Cleaning 
To clean this dataset I will drop the constant variables as well as the variables that were found to be highly correlated during the exploratory phase of analysis. Also, the 'class' variable is dropped in order to separate out the features from the target variable (class).

In [224]:
#cleaning
features = df.drop(['class', 'i', 'z', 'mjd', 'objid', 'plate', 'r', 'rerun', 'specobjid'], axis = 1)

### Create Function to Test the Different Metric Parameters of HDBSCAN
This function runs the clustering algorithm with a given distance metric. The cluster label that HDBSCAN assigns to each observation is stored in a new column named 'label'. The function then reports the number of unique cluster labels found in each of the three classes (star, QSO, and galaxy) and the number of labels that are shared by all three classes. Finally, it prints the noise points of each class and the points of each class that fall in shared cluster labels, both expressed as a fraction of all observations in the dataset.

In [225]:
def hdbscan_clusters(metric_type):
    # create the clusterer for the specific metric and fit it to the features of the data
    clusterer = hdbscan.HDBSCAN(metric=metric_type)
    clusterer.fit(features)

    # copy the original dataframe (so that df itself is not modified) and add a
    # new column 'label' which holds the label found by the clustering algorithm
    df_new = df.copy()
    df_new['label'] = clusterer.labels_

    # (display name, value in the 'class' column)
    classes = [('Star', 'STAR'), ('QSO', 'QSO'), ('Galaxy', 'GALAXY')]

    # cluster labels (including the noise label -1) that occur in each class
    labels = {name: df_new.loc[df_new['class'] == value, 'label'].unique()
              for name, value in classes}
    # labels that occur in all three classes
    shared = reduce(np.intersect1d, [labels[name] for name, _ in classes])

    # display information to the user
    print('Metric: {}'.format(metric_type))
    for name, _ in classes:
        print('Unique Labels in {} Group: {}'.format(name, len(labels[name])))
    print('Unique Labels that are in Star, QSO, and Galaxy Group: {}'.format(len(shared)))

    # NOTE: both fractions below are relative to the total number of rows,
    # not to the number of rows in the individual class
    n_total = float(len(df_new))
    for name, value in classes:
        noise = ((df_new['class'] == value) & (df_new['label'] == -1)).sum()
        print('Percentage of noise label in {} Group: {}'.format(name, noise / n_total))
    for name, value in classes:
        overlap = ((df_new['class'] == value) & df_new['label'].isin(shared)).sum()
        print('Percentage of cluster overlap in {} Group: {}'.format(name, overlap / n_total))
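
Note that the function above divides by the number of rows in the whole dataframe, so the printed values are shares of the full dataset rather than shares of each class. The difference between the two denominators is illustrated below on made-up labels; the helper name is hypothetical.

```python
def noise_fractions(classes, labels, target):
    """Return (share of all rows, share of the target class) that are noise in `target`."""
    n_total = len(classes)
    n_class = sum(1 for c in classes if c == target)
    n_noise = sum(1 for c, lab in zip(classes, labels) if c == target and lab == -1)
    return n_noise / n_total, n_noise / n_class

# Made-up labels for ten objects; -1 marks noise.
classes = ['STAR'] * 4 + ['GALAXY'] * 5 + ['QSO']
labels = [0, 0, -1, 1, 2, 2, -1, -1, 3, -1]

print(noise_fractions(classes, labels, 'STAR'))  # (0.1, 0.25)
print(noise_fractions(classes, labels, 'QSO'))   # (0.1, 1.0)
```

For a small class such as QSO, the within-class share can be much larger than the share of the full dataset, which matters when comparing classes of very different sizes.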

### Apply Function 
Loop through the candidate values for the HDBSCAN `metric` parameter and apply the clustering function created above, in order to compare how the results change with the metric.

In [226]:
#create list of metrics to loop through
metric_list=['braycurtis','canberra','chebyshev','cityblock','dice','euclidean','hamming',
             'infinity','jaccard','kulsinski','l1','l2','manhattan',
             'matching','p','rogerstanimoto','russellrao',
             'sokalmichener','sokalsneath']

In [276]:
#loop through the metric list and apply the clustering function that was created to compare results
for metric in metric_list:
    hdbscan_clusters(metric)

Metric: braycurtis
Unique Labels in Star Group: 516
Unique Labels in QSO Group: 294
Unique Labels in Galaxy Group: 561
Unique Labels that are in Star, QSO, and Galaxy Group: 227
Percentage of noise label in Star Group: 0.0937
Percentage of noise label in QSO Group: 0.0358
Percentage of noise label in Galaxy Group: 0.163
Percentage of cluster overlap in Star Group: 0.2255
Percentage of cluster overlap in QSO Group: 0.0749
Percentage of cluster overlap in Galaxy Group: 0.3165
Metric: canberra
Unique Labels in Star Group: 41
Unique Labels in QSO Group: 38
Unique Labels in Galaxy Group: 102
Unique Labels that are in Star, QSO, and Galaxy Group: 2
Percentage of noise label in Star Group: 0.0584
Percentage of noise label in QSO Group: 0.0401
Percentage of noise label in Galaxy Group: 0.1341
Percentage of cluster overlap in Star Group: 0.137
Percentage of cluster overlap in QSO Group: 0.0402
Percentage of cluster overlap in Galaxy Group: 0.1342
Metric: chebyshev
Unique Labels in Star Group: 4

Metric: sokalsneath
Unique Labels in Star Group: 1
Unique Labels in QSO Group: 1
Unique Labels in Galaxy Group: 2
Unique Labels that are in Star, QSO, and Galaxy Group: 1
Percentage of noise label in Star Group: 0.0
Percentage of noise label in QSO Group: 0.0
Percentage of noise label in Galaxy Group: 0.0
Percentage of cluster overlap in Star Group: 0.4152
Percentage of cluster overlap in QSO Group: 0.085
Percentage of cluster overlap in Galaxy Group: 0.4979


### Interpret Results
A few metrics produce no noise and very few clusters for the various classes. To investigate these clusters, a new function was created that displays the cluster labels. A metric subset was identified containing those metrics for which star, QSO, and galaxy each had only 1 or 2 labels. When the new function was looped over this subset, it was seen that all stars and QSOs were labeled 0 and galaxies were labeled 0 or 1. Therefore, with any metric from the subset, a data point labeled 1 is certain to be a galaxy, at least within this dataset. These metrics would not be useful if a star or QSO needed to be identified with high accuracy. All of the metrics in this subset are dissimilarity measures intended for boolean vectors, which likely explains why they barely separate the continuous features used here.

In [218]:
#create function to return the cluster labels for the hdbscan algorithm
def cluster_labels_output(metric):
    #create clusterer algorithm for the specific metric and fit it to the features of the data
    clusterer = hdbscan.HDBSCAN(metric = metric)
    clusterer.fit(features)
    
    #create new dataframe that will contain the original dataframe and a new column 'label' 
    #which holds the label found from the clustering alg.    
    df_new = df.copy()
    df_new['label'] = clusterer.labels_
    
    #display information
    print('Metric type: {}'.format(metric))
    print('Array of cluster arrays for Star:{}'.format(df_new[(df_new['class']=='STAR')].label.unique()))
    print('Array of cluster arrays for QSO:{}'.format(df_new[(df_new['class']=='QSO')].label.unique()))
    print('Array of cluster arrays for Galaxy:{}'.format(df_new[(df_new['class']=='GALAXY')].label.unique()))
    

In [235]:
#loop through those metric that only had 1 or 2 clusters
metric_subset = ('dice', 'jaccard', 'kulsinski', 'matching', 'rogerstanimoto', 
                'sokalmichener', 'sokalsneath')

In [236]:
#loop through the metric subset list and apply the clustering function that was created to compare results
for metric in metric_subset:
    cluster_labels_output(metric)

Metric type: dice
Array of cluster arrays for Star:[0]
Array of cluster arrays for QSO:[0]
Array of cluster arrays for Galaxy:[0 1]
Metric type: jaccard
Array of cluster arrays for Star:[0]
Array of cluster arrays for QSO:[0]
Array of cluster arrays for Galaxy:[0 1]
Metric type: kulsinski
Array of cluster arrays for Star:[0]
Array of cluster arrays for QSO:[0]
Array of cluster arrays for Galaxy:[0 1]
Metric type: matching
Array of cluster arrays for Star:[0]
Array of cluster arrays for QSO:[0]
Array of cluster arrays for Galaxy:[0 1]
Metric type: rogerstanimoto
Array of cluster arrays for Star:[0]
Array of cluster arrays for QSO:[0]
Array of cluster arrays for Galaxy:[0 1]
Metric type: sokalmichener
Array of cluster arrays for Star:[0]
Array of cluster arrays for QSO:[0]
Array of cluster arrays for Galaxy:[0 1]
Metric type: sokalsneath
Array of cluster arrays for Star:[0]
Array of cluster arrays for QSO:[0]
Array of cluster arrays for Galaxy:[0 1]
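
To see how pure each cluster label is, the class counts can be tabulated per label. The sketch below uses made-up data that mimics the pattern above (only galaxies receive label 1); the function name is illustrative.

```python
from collections import Counter, defaultdict

def class_counts_per_label(classes, labels):
    """Map each cluster label to a Counter of the classes it contains."""
    table = defaultdict(Counter)
    for cls, lab in zip(classes, labels):
        table[lab][cls] += 1
    return dict(table)

# Made-up example: stars and QSOs only get label 0, galaxies get 0 or 1.
classes = ['STAR', 'STAR', 'QSO', 'GALAXY', 'GALAXY', 'GALAXY']
labels = [0, 0, 0, 0, 1, 1]

for lab, counts in sorted(class_counts_per_label(classes, labels).items()):
    print(lab, dict(counts))
# 0 {'STAR': 2, 'QSO': 1, 'GALAXY': 1}
# 1 {'GALAXY': 2}
```

A label whose counter contains a single class, such as label 1 here, identifies that class unambiguously, whereas label 0 says nothing about the class.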


### Create Function to Find Labels that are Unique across the Class
Since there is overlap in the cluster labels, I created a function that finds the cluster labels that exist in only one class. This allows for better insight when new data is read in without a class label: the cluster label could be mapped back to determine the appropriate class label.

In [274]:
#create function to see which cluster labels are ONLY in each class given the specific metric
def clusters_specific_class(metric_type):
    #create clusterer algorithm for the specific metric and fit it to the features of the data
    clusterer = hdbscan.HDBSCAN(metric = metric_type)
    clusterer.fit(features)
    
    #create new dataframe that will contain the original dataframe and a new column 'label' 
    #which holds the label found from the clustering alg.
    df_new = df.copy()
    df_new['label'] = clusterer.labels_
    print('Metric: {}'.format(metric_type))
    
    #find which labels are in the different classes and those shared by all three
    star = df_new[(df_new['class']=='STAR')].label.unique()
    qso = df_new[(df_new['class']=='QSO')].label.unique()
    galaxy = df_new[(df_new['class']=='GALAXY')].label.unique()
    
    un_star = np.setdiff1d(star,qso)   
    un_star = np.setdiff1d(un_star, galaxy)
    
    un_qso = np.setdiff1d(qso, star)   
    un_qso = np.setdiff1d(un_qso, galaxy)
    
    un_gal = np.setdiff1d(galaxy,star)   
    un_gal = np.setdiff1d(un_gal, qso)
    
    #display the labels that are unique only to each class, thus not existing in the other two classes
    print('Unique cluster labels in Star: {}'.format(un_star))
    print('Unique cluster labels in QSO: {}'.format(un_qso))
    print('Unique cluster labels in Galaxy: {}'.format(un_gal))
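
The same set-difference logic can be sketched without pandas or NumPy, which may help when checking the function by hand. The label values below are made up and the helper name is hypothetical.

```python
def exclusive_labels(labels_by_class):
    """For each class, return the labels that occur in no other class."""
    result = {}
    for name, own in labels_by_class.items():
        # collect every label seen in any of the other classes
        others = set()
        for other_name, other in labels_by_class.items():
            if other_name != name:
                others |= set(other)
        result[name] = sorted(set(own) - others)
    return result

# Made-up cluster labels per class; -1 (noise) and 2 appear in every class.
labels_by_class = {
    'STAR': [-1, 0, 1, 2],
    'QSO': [-1, 2, 4],
    'GALAXY': [-1, 2, 5, 6],
}
print(exclusive_labels(labels_by_class))
# {'STAR': [0, 1], 'QSO': [4], 'GALAXY': [5, 6]}
```

A new observation that lands in label 4 would be mapped to QSO, while one that lands in label 2 or in noise (-1) could not be assigned to a class this way.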
    

### Interpret Results
Applying the function that I created, 'canberra' is the only metric that yields cluster labels exclusive to each of the three class types. From initial inspection, the canberra metric seems to be a good starting place. Next steps should include tuning `min_cluster_size` to reduce the noise within the cluster labels as well as the overlap in the labels.<br>

Metric: canberra
- 37 cluster labels unique to ONLY Star
- 20 cluster labels unique to ONLY QSO
- 82 cluster labels unique to ONLY galaxy
- Unique Labels in Star Group: 41
- Unique Labels in QSO Group: 38
- Unique Labels in Galaxy Group: 102
- Unique Labels that are in Star, QSO, and Galaxy Group: 2
- Percentage of noise label in Star Group: 0.0584
- Percentage of noise label in QSO Group: 0.0401
- Percentage of noise label in Galaxy Group: 0.1341
- Percentage of cluster overlap in Star Group: 0.137
- Percentage of cluster overlap in QSO Group: 0.0402
- Percentage of cluster overlap in Galaxy Group: 0.1342
- Unique cluster labels in Star: [ 0  1  3  8  9 12 13 14 16 17 18 19 20 21 22 23 24 25 26 27 28 30 32 33
 41 42 46 47 48 49 68 69 70 85 86 91 92]
- Unique cluster labels in QSO: [  4  29  31  34  44  57  58  63  71  72  73  77  79  80  88  93  94  95
  96 100]
- Unique cluster labels in Galaxy: [  5   6  10  36  38  40  43  45  50  51  52  53  54  55  56  59  60  62
  65  66  67  74  75  76  81  82  83  84  89  97  98  99 103 104 105 106
 107 108 110 111 112 114 115 116 117 118 119 120 121 122 123 124 125 126
 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 146
 147 148 149 150 151 152 153 154 156 157]

In [275]:
#apply function to all metrics to identify the cluster labels that are unique to each class.
for metric in metric_list:
    clusters_specific_class(metric)

Metric: braycurtis
Unique cluster labels in Star: [  1   4   5  11  16  17  24  25  26  58  67  68 102 103 188 207 222 224
 258 266 268 269 275 282 345 355 371 382 427 429 435 437 448 458 469 474
 488 489 501 503 528 562 571]
Unique cluster labels in QSO: []
Unique cluster labels in Galaxy: [ 30  44  70  76  77 112 126 130 131 132 133 160 168 180 183 196 198 210
 219 227 236 240 261 290 314 338 342 347 348 351 389 401 415 434 451 467
 495 510 521 527 529 533 543 560 563 574 582 588 590]
Metric: canberra
Unique cluster labels in Star: [ 0  1  3  8  9 12 13 14 16 17 18 19 20 21 22 23 24 25 26 27 28 30 32 33
 41 42 46 47 48 49 68 69 70 85 86 91 92]
Unique cluster labels in QSO: [  4  29  31  34  44  57  58  63  71  72  73  77  79  80  88  93  94  95
  96 100]
Unique cluster labels in Galaxy: [  5   6  10  36  38  40  43  45  50  51  52  53  54  55  56  59  60  62
  65  66  67  74  75  76  81  82  83  84  89  97  98  99 103 104 105 106
 107 108 110 111 112 114 115 116 117 118 119 120 121 1