# **Stellar Object Classification**
_John Andrew Dixon_

---

#### **Data Dictionary**

|**Column Name**|**Description**| 
|-|-|
|obj_ID|Object Identifier, the unique value that identifies the object in the image catalog used by the CAS|
|alpha|Right Ascension angle (at J2000 epoch)|
|delta|Declination angle (at J2000 epoch)|
|u|Ultraviolet filter in the photometric system|
|g|Green filter in the photometric system|
|r|Red filter in the photometric system|
|i|Near Infrared filter in the photometric system|
|z|Infrared filter in the photometric system|
|run_ID|Run Number used to identify the specific scan|
|rerun_ID|Rerun Number to specify how the image was processed|
|cam_col|Camera column to identify the scanline within the run|
|field_ID|Field number to identify each field|
|spec_obj_ID|Unique ID used for optical spectroscopic objects (this means that 2 different observations with the same spec_obj_ID must share the output class)|
|class|Object class (galaxy, star or quasar object)|
|redshift|Redshift value based on the increase in wavelength|
|plate|Plate ID, identifies each plate in SDSS|
|MJD|Modified Julian Date, used to indicate when a given piece of SDSS data was taken|
|fiber_ID|Fiber ID that identifies the fiber that pointed the light at the focal plane in each observation|

#### **Setup**

In [1]:
import pandas as pd

In [2]:
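# Published Google Sheets CSV export of the dataset
url = "https://docs.google.com/spreadsheets/d/e/2PACX-1vRviGqwoVRVF_HY9LcyDLvVEDpdqZKvk1mL6K9xiWpEFh_i2QF4AX13P96L9F6nIAsD0iF0JZmcJ69A/pub?output=csv"
df = pd.read_csv(url)
# Work on a copy so the raw data in df stays untouched
stellar_df = df.copy()
df.sample(5)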

|Unnamed: 0|obj_ID|alpha|delta|u|g|r|i|z|run_ID|rerun_ID|cam_col|field_ID|spec_obj_ID|class|redshift|plate|MJD|fiber_ID|
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|47373|1.237665e+18|219.358378|25.172065|19.03422|17.68312|17.08695|16.74858|16.54623|4649|301|3|231|2.403818e+18|GALAXY|0.121961|2135|53827|80|
|40856|1.23766e+18|146.205067|40.078568|21.6047|21.22174|20.61363|20.39042|20.51013|3462|301|3|242|9.945308e+18|GALAXY|0.52876|8833|57779|853|
|32910|1.237665e+18|196.346631|29.217491|22.30363|20.89544|19.11858|18.48914|18.06431|4649|301|3|92|7.299447e+18|GALAXY|0.387419|6483|56341|866|
|93298|1.237658e+18|169.996413|58.982203|24.69456|24.20848|23.06297|20.87551|19.70769|2987|301|3|101|8.002966e+18|STAR|-0.000231|7108|56686|252|
|92229|1.237664e+18|194.006641|14.227486|23.07723|22.56971|20.62846|19.58605|19.09302|4381|301|2|153|6.098075e+18|GALAXY|0.518973|5416|56002|733|
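
The sample output includes an `Unnamed: 0` column, which pandas typically creates when a CSV was saved together with its row index. A minimal sketch of one way to avoid it, assuming that column really is just a saved index: pass `index_col=0` to `pd.read_csv`. The two-row CSV text below is a made-up stand-in, not the real data.

```python
import io

import pandas as pd

# Hypothetical stand-in for the real file: the header starts with an empty
# field, which is what a CSV written with its index looks like.
csv_text = (
    ",obj_ID,class\n"
    "0,101,GALAXY\n"
    "1,102,STAR\n"
)

# Default read: the unnamed first field becomes a regular column.
raw = pd.read_csv(io.StringIO(csv_text))
print(list(raw.columns))

# Reading with index_col=0 uses the first field as the row index instead.
indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)
print(list(indexed.columns))
```

If the column turns out to carry real information, it can simply be kept as is.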


#### **Helper Functions**

---
## **Explanatory Analysis**

#### **Cleaning**

Check for fully duplicated rows; the count below is 0, so nothing is removed:

In [3]:
stellar_df.duplicated().sum()

0
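
One caveat worth checking: if `Unnamed: 0` is a per-row counter (an assumption based on the sample output), every row is unique by construction and a full-row check cannot find repeats. A minimal sketch of the idea on made-up rows, comparing the full-row check with one that ignores the counter column:

```python
import pandas as pd

# Made-up rows for illustration only; the third repeats the first
# in every column except the counter.
toy = pd.DataFrame({
    "Unnamed: 0": [0, 1, 2],
    "spec_obj_ID": [11, 12, 11],
    "class": ["GALAXY", "STAR", "GALAXY"],
})

# Full-row check: the counter makes every row look unique.
full_row_dupes = toy.duplicated().sum()

# Same check with the counter column left out.
data_cols = [c for c in toy.columns if c != "Unnamed: 0"]
data_dupes = toy.duplicated(subset=data_cols).sum()

# Keep the first occurrence of each repeated observation.
deduped = toy.drop_duplicates(subset=data_cols, keep="first")

print(full_row_dupes, data_dupes, len(deduped))
```

On the real data, `data_cols` could be narrowed further, for example to the identifier columns from the data dictionary.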

Remaining steps:

- Select or engineer features
- Build simple imputers based on each feature's distribution
- Check class balance and correct it if the classes are imbalanced
- Use a stratified train-test split so the test set keeps the same class balance
- See the Project 2 rubric
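
The class-balance check and stratified split from the notes above might be sketched as follows. This is a pandas-only illustration on made-up labels (the counts and the `QSO` label spelling are assumptions, not taken from the dataset); scikit-learn's `train_test_split` with its `stratify` argument is the more common route for the split itself.

```python
import pandas as pd

# Made-up, deliberately imbalanced labels; not the real class counts.
toy = pd.DataFrame({
    "redshift": [i / 100 for i in range(20)],
    "class": ["GALAXY"] * 12 + ["STAR"] * 4 + ["QSO"] * 4,
})

# Share of each class in the full data.
class_share = toy["class"].value_counts(normalize=True)

# Stratified hold-out: draw the same fraction from every class,
# then use everything that was not drawn as the training set.
test = toy.groupby("class").sample(frac=0.25, random_state=42)
train = toy.drop(test.index)

print(class_share)
print(test["class"].value_counts(normalize=True))
```

Because the same fraction is drawn from each class, the test set keeps the class proportions of the full data, which is the point of stratifying.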