# Dataset Preparation

### Description

The dataset that will be used is the <b>Stars Dataset</b>.

In astronomy, stellar classification is the classification of stars based on their spectral characteristics. Classifying objects as galaxies, quasars, or stars is one of the most fundamental tasks in astronomy. Early catalogs of stars and their distribution in the sky led to the understanding that they make up our own galaxy. Once Andromeda was recognized as a galaxy separate from our own, numerous other galaxies began to be surveyed as more powerful telescopes were built.

The data consists of <b> 100,000 observations </b> of space taken by the SDSS (Sloan Digital Sky Survey). 

Every observation is described by <b>18 columns</b>: 17 feature columns and the class label.

• **obj_ID** – Object Identifier, the unique value that identifies the object in the image catalog used by the CAS

• **alpha** – Right Ascension angle (at J2000 epoch)

• **delta** – Declination angle (at J2000 epoch)

• **u** – Ultraviolet filter in the photometric system

• **g** – Green filter in the photometric system

• **r** – Red filter in the photometric system

• **i** – Near Infrared filter in the photometric system

• **z** – Infrared filter in the photometric system

• **run_ID** – Run Number used to identify the specific scan

• **rerun_ID** – Rerun Number to specify how the image was processed

• **cam_col** – Camera column to identify the scanline within the run

• **field_ID** – Field number to identify each field

• **spec_obj_ID** – Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)

• **class** – Object class (galaxy, star, or quasar object)

• **redshift** – Redshift value based on the increase in wavelength

• **plate** – Plate ID, identifies each plate in SDSS

• **MJD** – Modified Julian Date, used to indicate when a given piece of SDSS data was taken

• **fiber_ID** – Identifies the fiber that pointed the light at the focal plane in each observation


### Importing of Libraries

In [11]:
import numpy as np
import pandas as pd

### Importing the Stars Dataset

In [12]:
stars = pd.read_csv('data/original/stars.csv')
stars.head()

Unnamed: 0,obj_ID,alpha,delta,u,g,r,i,z,run_ID,rerun_ID,cam_col,field_ID,spec_obj_ID,class,redshift,plate,MJD,fiber_ID
0,1.237661e+18,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,3606,301,2,79,6.543777e+18,GALAXY,0.634794,5812,56354,171
1,1.237665e+18,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,4518,301,5,119,1.176014e+19,GALAXY,0.779136,10445,58158,427
2,1.237661e+18,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,3606,301,2,120,5.1522e+18,GALAXY,0.644195,4576,55592,299
3,1.237663e+18,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,4192,301,3,214,1.030107e+19,GALAXY,0.932346,9149,58039,775
4,1.23768e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,8102,301,3,137,6.891865e+18,GALAXY,0.116123,6121,56187,842


### Data preprocessing and Cleaning

Show all columns in the dataset

In [13]:
stars.columns

Index(['obj_ID', 'alpha', 'delta', 'u', 'g', 'r', 'i', 'z', 'run_ID',
       'rerun_ID', 'cam_col', 'field_ID', 'spec_obj_ID', 'class', 'redshift',
       'plate', 'MJD', 'fiber_ID'],
      dtype='object')

Count the unique values in each column

In [14]:
# Get the number of unique values for each column
unique_counts = stars.nunique()
unique_counts

obj_ID          78053
alpha           99999
delta           99999
u               93748
g               92651
r               91901
i               92019
z               92007
run_ID            430
rerun_ID            1
cam_col             6
field_ID          856
spec_obj_ID    100000
class               3
redshift        99295
plate            6284
MJD              2180
fiber_ID         1000
dtype: int64
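
The counts above show that `rerun_ID` has a single unique value, so it cannot help distinguish between classes. A minimal sketch of how such constant columns could be flagged automatically is given here; `toy` and its values are made up for illustration and are not taken from the dataset.

```python
import pandas as pd

# Hypothetical toy frame standing in for `stars`; values are illustrative only.
toy = pd.DataFrame({
    "rerun_ID": [301, 301, 301],
    "cam_col": [2, 5, 2],
    "class": ["GALAXY", "STAR", "QSO"],
})

# Columns with exactly one unique value carry no information for a model.
unique_counts = toy.nunique()
constant_cols = unique_counts[unique_counts == 1].index.tolist()
print(constant_cols)
```

Applied to the real frame, the same filter on `stars.nunique()` should list `rerun_ID`.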

Find null values

In [15]:
null_counts = stars.isnull().sum()

# Display the number of null values for all columns
print(null_counts)

obj_ID         0
alpha          0
delta          0
u              0
g              0
r              0
i              0
z              0
run_ID         0
rerun_ID       0
cam_col        0
field_ID       0
spec_obj_ID    0
class          0
redshift       0
plate          0
MJD            0
fiber_ID       0
dtype: int64


We need an ID column that uniquely identifies each row so that it can be used as the index.

In [16]:
print(stars[['obj_ID']].value_counts().shape)
print(stars[['spec_obj_ID']].value_counts().shape)

(78053,)
(100000,)


Since spec_obj_ID has a distinct value for each of the 100,000 rows, while obj_ID has only 78,053 distinct values, we will use spec_obj_ID for indexing.
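
One way to confirm that a column is safe to use as a row key is to check it directly with `Series.is_unique`. The sketch below assumes a small made-up frame (`toy` and its ID values are hypothetical) and then sets the index with `verify_integrity=True`, which raises an error if duplicates are present.

```python
import pandas as pd

# Hypothetical toy frame; the IDs below are made up for illustration.
toy = pd.DataFrame({
    "obj_ID": [10, 10, 11],          # repeated value, so not a valid key
    "spec_obj_ID": [100, 101, 102],  # one distinct value per row
    "class": ["GALAXY", "GALAXY", "STAR"],
})

# A column can serve as a row key only if every value is distinct.
print(toy["obj_ID"].is_unique)       # False
print(toy["spec_obj_ID"].is_unique)  # True

# set_index with verify_integrity=True raises ValueError on duplicate keys.
indexed = toy.set_index("spec_obj_ID", verify_integrity=True)
```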

Check the class distribution

In [17]:
stars['class'].value_counts()

class
GALAXY    59445
STAR      21594
QSO       18961
Name: count, dtype: int64
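
The counts show an imbalance: galaxies make up roughly 59% of the observations. A short sketch that converts the counts above into proportions follows; on the full frame, `stars['class'].value_counts(normalize=True)` should give the same figures.

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
counts = pd.Series({"GALAXY": 59445, "STAR": 21594, "QSO": 18961}, name="count")

# Share of each class relative to the total number of observations.
proportions = counts / counts.sum()
print(proportions.round(3))
```

An imbalance of this size may be worth keeping in mind when splitting the data or choosing evaluation metrics.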

### Getting X and y data

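Several columns are identifiers or survey bookkeeping values (obj_ID, run_ID, rerun_ID, cam_col, field_ID, plate, MJD, fiber_ID) that do not provide any insight for the model, so we will remove those. We keep spec_obj_ID only as the row identifier.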

In [18]:
cleaned = stars[['spec_obj_ID', 'alpha', 'delta', 'u', 'g', 'r', 'i', 'z', 'redshift', 'class']]
cleaned.head()

Unnamed: 0,spec_obj_ID,alpha,delta,u,g,r,i,z,redshift,class
0,6.543777e+18,135.689107,32.494632,23.87882,22.2753,20.39501,19.16573,18.79371,0.634794,GALAXY
1,1.176014e+19,144.826101,31.274185,24.77759,22.83188,22.58444,21.16812,21.61427,0.779136,GALAXY
2,5.1522e+18,142.18879,35.582444,25.26307,22.66389,20.60976,19.34857,18.94827,0.644195,GALAXY
3,1.030107e+19,338.741038,-0.402828,22.13682,23.77656,21.61162,20.50454,19.2501,0.932346,GALAXY
4,6.891865e+18,345.282593,21.183866,19.43718,17.58028,16.49747,15.97711,15.54461,0.116123,GALAXY
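
The cells above stop at column selection. A minimal sketch of one way the split into X and y could be done is given here, assuming `class` is the target and `spec_obj_ID` serves as the index. The names `toy`, `X`, and `y` are illustrative, and the rows use a subset of the columns from the preview above.

```python
import pandas as pd

# Toy stand-in for `cleaned`, using three rows and a few columns from the preview.
toy = pd.DataFrame({
    "spec_obj_ID": [6.543777e+18, 1.176014e+19, 5.1522e+18],
    "u": [23.87882, 24.77759, 25.26307],
    "redshift": [0.634794, 0.779136, 0.644195],
    "class": ["GALAXY", "GALAXY", "GALAXY"],
})

# Assumption: spec_obj_ID is the row key and `class` is the prediction target.
toy = toy.set_index("spec_obj_ID")
X = toy.drop(columns="class")  # feature matrix
y = toy["class"]               # target labels
print(X.shape, y.shape)
```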


### Exporting the cleaned dataset to a csv file

In [19]:
import os

output_dir = 'data/cleaned'
os.makedirs(output_dir, exist_ok=True)
cleaned.to_csv(f'{output_dir}/stars_cleaned.csv', index=False)
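
A quick sanity check after exporting is to read the file back and compare it with the frame that was written. The sketch below assumes a small made-up frame (`toy`) and writes to a temporary directory rather than `data/cleaned`, so it leaves no files behind.

```python
import os
import tempfile

import pandas as pd

# Hypothetical toy frame standing in for `cleaned`.
toy = pd.DataFrame({
    "alpha": [135.689107, 144.826101],
    "redshift": [0.634794, 0.779136],
    "class": ["GALAXY", "GALAXY"],
})

# Write to a temporary directory so the sketch leaves no files behind.
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "stars_cleaned.csv")
    toy.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes, or values differ beyond float tolerance.
pd.testing.assert_frame_equal(toy, reloaded)
```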
