# Classification of stars, galaxies, and quasars

Using a query, you should try to obtain three data files, each containing 10000 observations of stars, galaxies, and quasars, respectively, which are distinctly different objects.

As this data results from a (raw) query, it contains several missing/9999 values and other flaws that must be dealt with before any further data analysis. You need to fix this!
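
A minimal sketch of how such placeholder values could be made visible to pandas, assuming that `9999` is the only sentinel in use (an assumption; check the actual files, since other placeholders may occur). The toy table and the name `SENTINEL` are purely illustrative.

```python
import numpy as np
import pandas as pd

# Toy table standing in for one of the query results (values are made up).
toy = pd.DataFrame({
    "gmag": [18.3, 9999.0, 21.0],
    "W1":   [15.2, 14.8, np.nan],
})

SENTINEL = 9999  # assumed placeholder for "no measurement"; verify in your own files

# Turn the placeholder into NaN so that pandas treats it as missing.
cleaned = toy.replace(SENTINEL, np.nan)

# Count missing entries per column before deciding whether to cut or impute.
n_missing = cleaned.isna().sum()
print(n_missing)
```

Once the placeholders are NaN, `dropna` or an imputation step can act on them consistently.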

Thus, this exercise first consists of **Inspecting and Cleaning the data**.

### Features:
The data features / input variables (X) are:
* ID:   unique object ID in the database
* ra:   right ascension (coordinate)
* dec:  declination (coordinate)
* istar:        log-likelihood that the object is point-like, given by the pipeline run on the images
* gmag:  magnitude in g-band
* rmag: magnitude in r-band
* imag: magnitude in i-band
* zmag: magnitude in z-band
* W1:   magnitude in W1-band (from AllWISE)
* W2:   magnitude in W2-band (from AllWISE)
* psfgmag:      PSF magnitude in g-band (i.e. the best-fit magnitude of a point-like object fit to the pixel data)
* psfrmag:      PSF magnitude in r-band (i.e. the best-fit magnitude of a point-like object fit to the pixel data)
* psfimag:      PSF magnitude in i-band (i.e. the best-fit magnitude of a point-like object fit to the pixel data)
* W3:   magnitude in W3-band (from AllWISE)
* W3err:        uncertainty on magnitude in W3-band (from AllWISE)
* J:    magnitude in J-band (from 2MASS, in AllWISE)
* Jerr: uncertainty on J
* H:    magnitude in H-band (from 2MASS, in AllWISE)
* Herr: uncertainty on H
* K:    magnitude in K-band (from 2MASS, in AllWISE)
* Kerr: uncertainty on K
* umag: magnitude in u-band
* zs: "true" redshift

Make sure that you briefly think about (and discuss) which of these features should be included if you want to identify which type of object each observation is.
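
As one possible starting point for that discussion (an assumption, not a prescribed answer): the ID and the sky coordinates are bookkeeping rather than physical properties, and the "true" redshift may give much of the answer away. Differences between magnitudes are often more informative than the magnitudes themselves. The sketch below uses made-up values, and the derived column names `g_minus_r` and `psf_minus_r` are hypothetical.

```python
import pandas as pd

# Toy rows using column names from the feature list (values are made up).
toy = pd.DataFrame({
    "ID": [1, 2], "ra": [9.25, 1.79], "dec": [13.8, 14.4],
    "gmag": [22.8, 18.3], "rmag": [20.9, 17.2],
    "psfrmag": [21.5, 17.2], "zs": [0.45, 0.0],
})

# Assumed choice: drop bookkeeping columns and the redshift.
drop_cols = ["ID", "ra", "dec", "zs"]
X = toy.drop(columns=drop_cols)

# Derived features (hypothetical names): a colour and a point-likeness measure.
X["g_minus_r"] = X["gmag"] - X["rmag"]
X["psf_minus_r"] = X["psfrmag"] - X["rmag"]

print(X.columns.tolist())
```

Whether to keep `zs` depends on what question you want the analysis to answer, so it is worth trying both variants.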

Also, this time there is no target value (Y) given in the data. However, given the query selection (by other means) to be stars, galaxies, and quasars, you can consider the file type to be the target. But you need to put these three files together and add a column with the target (i.e. file origin) value.


### Task:
Thus, the task before you is to:<br>
1) Make three queries, which produce three files of data containing stars, galaxies, and quasars.<br>
2) Combine the three data files into one, which has a target value corresponding to the file type.<br>
3) Read and inspect this data, and make sure that you understand what it (roughly) looks like.<br>
4) Clean/cut (or impute) the data, such that different (unsupervised) analysis techniques will work.<br>
5) Run a (k)PCA (and later other techniques) on it, and see what the resulting distributions look like.<br>

Do you in the end manage to get, e.g., three well-separated classes out?<br>
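
For the kernel variant mentioned in step 5, a rough sketch using scikit-learn's `KernelPCA` could look as follows. It runs on synthetic data rather than on the query files, and it works on a random subsample because the kernel matrix grows with the square of the number of samples. The RBF kernel, the subsample size, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import KernelPCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: three groups of 200 points in 5 dimensions.
centers = np.array([[0.0] * 5, [4.0] * 5, [-4.0] * 5])
X = np.vstack([c + rng.normal(size=(200, 5)) for c in centers])
y = np.repeat(["STAR", "GALAXY", "QSO"], 200)

# Random subsample to keep the (n x n) kernel matrix small.
idx = rng.choice(len(X), size=300, replace=False)
X_sub = StandardScaler().fit_transform(X[idx])
y_sub = y[idx]  # labels kept only for colouring a later plot

# Kernel PCA with an RBF kernel (assumed choice), two components.
kpca = KernelPCA(n_components=2, kernel="rbf")
Z = kpca.fit_transform(X_sub)
print(Z.shape)
```

Plotting the two columns of `Z` coloured by `y_sub` gives a first impression of how well the groups separate.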

***

* Author: Troels C. Petersen (NBI)
* Email:  petersen@nbi.dk
* Date:   3rd of May 2021

In [None]:
from __future__ import print_function, division   # Ensures Python3 printing & division standard
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA, IncrementalPCA

In [None]:
allfiles = ['Data_Galaxies.txt','Data_Quasars.txt','Data_Stars.txt']

index_col = 'ID'

dfs = [pd.read_csv(filename, index_col=index_col) for filename in allfiles]
# Tag each frame with its file of origin; this column serves as the target.
for df, filename in zip(dfs, allfiles):
    df['source'] = filename

merged_df_2 = pd.concat(dfs)
print(merged_df_2)



                             ra        dec        istar      gmag      rmag  \
ID                                                                            
1237649918426415808    9.254510  13.818784   -28.583860  22.80513  20.86707   
1237652942639202897    1.786889  14.392791   -28.201030  22.61204  20.91495   
1237652942639267859    1.908622  14.497953 -6275.146000  18.34751  17.21329   
1237652942639333782    2.068091  14.467087   -19.421730  22.34108  20.65322   
1237652944249028825  359.916063  15.607598 -2156.052000  18.68034  17.69513   
...                         ...        ...          ...       ...       ...   
1237663784751792318   40.044881   0.609199    -2.785142  19.04584  17.62281   
1237663785276801114   12.901164   0.980016   -46.915640  18.00692  17.33035   
1237663785276932271   13.237769   1.012060   -18.991050  21.02905  19.46825   
1237663785277194303   13.849441   0.903742  -210.847900  15.61581  15.09167   
1237663785277259952   14.057410   0.914362   -73.580

In [None]:
gal = pd.read_csv('GALAXY.csv')
star = pd.read_csv('STAR.csv')
qso = pd.read_csv('QSO.csv')

allfiles = ['GALAXY.csv','STAR.csv','QSO.csv']
index_col = 'ID'
dfs = [pd.read_csv(filename, index_col=index_col) for filename in allfiles]
# Tag each frame with its file of origin; this column serves as the target.
for df, filename in zip(dfs, allfiles):
    df['source'] = filename
merged_df = pd.concat(dfs)
#print(merged_df)

In [None]:
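# Numeric table without a label column, used as PCA input below
fun = pd.concat([gal, star, qso])
# Label each sample by its class, then build the labelled table
gal['class'] = 'GALAXY'
star['class'] = 'STAR'
qso['class'] = 'QSO'
total = pd.concat([gal, star, qso])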

In [None]:
pca = PCA(n_components = 2)
pca.fit(fun)   # NB: the input is unscaled here
print(pca.explained_variance_ratio_)

[1.00000000e+00 1.44530517e-18]
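
A first component that carries essentially all of the variance is what one would expect when a single column has a far larger numerical range than the rest, which is plausibly the case here if an ID-like column is part of the unscaled input. The sketch below reproduces the effect on synthetic data; the column layout and value ranges are assumptions for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)
n = 1000

# Two magnitude-like columns plus one ID-like column with a huge numerical range.
X = np.column_stack([
    rng.normal(20.0, 1.0, n),
    rng.normal(18.0, 1.0, n),
    rng.uniform(1e15, 2e15, n),
])

# PCA on the raw data: the large column dominates the variance.
raw_ratio = PCA(n_components=2).fit(X).explained_variance_ratio_

# PCA after standardising each column to zero mean and unit variance.
X_std = StandardScaler().fit_transform(X)
scaled_ratio = PCA(n_components=2).fit(X_std).explained_variance_ratio_

print("raw:   ", raw_ratio)
print("scaled:", scaled_ratio)
```

Standardising puts all columns on the same footing, although an identifier column arguably should be dropped rather than scaled.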


In [None]:
from sklearn.preprocessing import RobustScaler


In [None]:
from sklearn.preprocessing import StandardScaler

features = fun.columns


x = fun.loc[:,features].values

x_scaled = StandardScaler().fit_transform(x)

#x_scaled = pd.DataFrame(x_scaled, columns = features)

print(x_scaled)




[[-1.68205440e+00  2.23197265e+00 -1.53229683e+00 ... -1.46469554e+00
  -2.83589312e-03 -6.38896900e-01]
 [-1.68205440e+00  2.23201008e+00 -1.52991483e+00 ... -1.46467729e+00
   5.50334646e-04 -5.44640858e-01]
 [-1.68205440e+00  2.23269976e+00 -1.53154256e+00 ... -1.46468588e+00
   1.58572317e-02 -4.20424383e-01]
 ...
 [-1.68500988e-01 -4.02526416e-01 -6.01761719e-01 ...  6.82738854e-01
   1.20524999e-02  8.61278581e-01]
 [-1.68382431e-01 -4.00090345e-01 -5.31520202e-01 ...  6.82738854e-01
   2.81801670e-03  3.13545362e-01]
 [-1.68382424e-01 -3.98943289e-01 -5.23694934e-01 ...  6.82738854e-01
   7.67803905e-03  8.36690296e-01]]


In [None]:
pca_scaled = PCA(n_components = 23)
pca_scaled.fit(x_scaled)
print(pca_scaled.explained_variance_ratio_)
#print(pca_scaled.singular_values_)


[3.16094904e-01 2.94614346e-01 6.99075515e-02 5.08840947e-02
 4.72114895e-02 4.48069207e-02 4.32201497e-02 3.58002330e-02
 3.24312024e-02 2.97032190e-02 2.41265162e-02 5.76723791e-03
 2.93056798e-03 2.04193027e-03 4.48206230e-04 7.94129950e-06
 2.12584218e-06 8.78700473e-07 4.29500430e-07 4.46090369e-08
 1.03895206e-08 4.61951205e-11 1.51152057e-11]


In [None]:
print(x_scaled.shape)

(30000, 23)
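
To judge whether the classes separate, the fitted components still need to be applied to the data and compared per class. A minimal sketch of that step on synthetic data is given below; the class offsets and the names `proj` and `summary` are assumptions, not part of the notebook above.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in: 100 samples per class, 4 features, class-dependent offset.
labels = np.repeat(["GALAXY", "STAR", "QSO"], 100)
offsets = {"GALAXY": 0.0, "STAR": 3.0, "QSO": -3.0}
X = np.vstack([rng.normal(offsets[c], 1.0, size=4) for c in labels])

# Scale, then project onto the first two principal components.
Z = PCA(n_components=2).fit_transform(StandardScaler().fit_transform(X))

# Attach the labels and compare the class means in the projected space.
proj = pd.DataFrame(Z, columns=["PC1", "PC2"])
proj["class"] = labels
summary = proj.groupby("class")[["PC1", "PC2"]].mean()
print(summary)
```

A scatter plot of `PC1` against `PC2`, coloured by class, is the natural visual counterpart to this table.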

