---
title: "Stellar Classification"
format:
  html:
    toc: true
author: "Jun Ryu"
date: "2023-12-08"
categories: [ML, python, project]
---

## 1. Abstract

In this report, we deal with stellar classification, the task of determining the type of an astronomical object based on its spectral characteristics. In particular, we focus on building a classification algorithm that identifies whether an input object is a star, a galaxy, or a quasar. We rely mainly on the values given by the photometric system to classify these objects.

The machine learning models used in this report include **support vector machine (SVM), adaptive boosting with decision trees (AdaBoost), and artificial neural network (ANN)**. 


## 2. Introduction

Stellar classification is one of the most fundamental problems in astronomy: the distinctions between different types of objects form the building blocks of many astronomical studies. To preface this report, we first define each object category:

*1. Star*: 

We are using the [Stellar Classification Dataset](https://www.kaggle.com/datasets/fedesoriano/stellar-classification-dataset-sdss17/data), provided by SDSS17. This dataset houses 100,000 observations of space taken by the Sloan Digital Sky Survey (SDSS), what do they do... The original dataset contains 18 columns (the full data dictionary can be accessed at the Kaggle site); however, since we have already chosen the values from the photometric system as our features, we will only be working with 6 columns (5 features and 1 response). The 5 feature columns are `u`, `g`, `r`, `i`, and `z`, which hold the values obtained from the 5 different filters in the photometric system. Nearly all of these values fall between 0 and 100. The response column, `class`, reports the object class; it is one of "STAR", "GALAXY", or "QSO", the last of which stands for quasi-stellar object (quasar).

The photometric system, often used in astronomy, is a set of filters that work together to determine an object's brightness. The idea is that light from an object passes through several filters, each of which transmits only a specific range of wavelengths. Through this process, an astronomer can measure the intensity that passes through each filter, and these measurements can be combined to determine the overall brightness. Here, we deal with a specific set of filters: the `ugriz` filters, which the Sloan Digital Sky Survey most notably employs for its observations. The `ugriz` system is made up of 5 filters (`u`, `g`, `r`, `i`, and `z`), which extract...


## 3. Data Preprocessing

Preprocessing was straightforward for this dataset and required fewer than 5 lines of code. There were no `NA` values, and only one extreme outlier was found across all 100,000 observations, so we removed that observation. We then kept the 6 columns described above and sampled 10,000 observations (10% of the original). We scaled down the dataset because of computational cost: with the available computational power, we were not able to run some models on the full dataset. We also encoded the labels of the response variable as integers ($0,1,2$) so that we can track which label is which when plotting the confusion matrix.
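The steps above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for this report: the helper name `preprocess`, the outlier rule (dropping rows where any filter value is negative), the random seed, and the particular label-to-integer mapping are all assumptions, and a small synthetic DataFrame stands in for the Kaggle CSV.

```python
import numpy as np
import pandas as pd

FEATURES = ["u", "g", "r", "i", "z"]
# Assumed mapping; the report only states that labels become 0, 1, 2.
LABELS = {"GALAXY": 0, "QSO": 1, "STAR": 2}


def preprocess(df, n_samples, seed=0):
    """Keep the 5 filter columns plus `class`, drop outliers, sample, encode."""
    out = df[FEATURES + ["class"]].copy()
    # Assumed outlier rule: any filter value below 0 is treated as invalid.
    out = out[(out[FEATURES] >= 0).all(axis=1)]
    # Random subsample without replacement (the report samples 10,000 rows).
    out = out.sample(n=n_samples, random_state=seed)
    # Integer-encode the response so confusion-matrix rows can be tracked.
    out["class"] = out["class"].map(LABELS)
    return out.reset_index(drop=True)


# Synthetic stand-in for the real data: 50 rows, one with an extreme value.
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.uniform(10, 30, size=(50, 5)), columns=FEATURES)
demo["class"] = rng.choice(list(LABELS), size=50)
demo["obj_ID"] = np.arange(50)  # an extra column that gets dropped
demo.loc[7, "u"] = -9999.0      # artificial extreme outlier

clean = preprocess(demo, n_samples=10)
print(clean.shape)  # (10, 6)
```

With the real data, `demo` would be replaced by the DataFrame loaded from the Kaggle file and `n_samples` set to 10,000.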

After preprocessing, we split the data into 80% training and 20% testing. We then separated each of these into the predictors ($X$) and the response ($y$), resulting in four sets: `X_train`, `y_train`, `X_test`, and `y_test`.
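One way to obtain these four sets is scikit-learn's `train_test_split`, shown here as a sketch on synthetic data. The report does not say which tool or seed was used, so the `random_state` value and the synthetic arrays are assumptions; this version also separates $X$ and $y$ before splitting, which yields the same four sets as splitting first.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 100 rows of 5 features and integer labels 0, 1, 2.
rng = np.random.default_rng(0)
X = rng.uniform(10, 30, size=(100, 5))
y = rng.integers(0, 3, size=100)

# 80% training, 20% testing; the seed is an arbitrary choice for repeatability.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_test.shape)  # (80, 5) (20, 5)
```

Passing `stratify=y` is an option worth considering if the class proportions should be preserved in both splits.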


Note that we did not scale the features at this stage, although we probably should have.
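On the scaling question: SVMs and neural networks are generally sensitive to feature scale, while tree-based AdaBoost is largely unaffected by it. If scaling were added, one common option is standardization with scikit-learn's `StandardScaler`, fitted on the training set only so that no information from the test set leaks into it. The sketch below uses synthetic arrays in place of the real `X_train` and `X_test`, and is a suggestion rather than part of the pipeline described above.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the real training and testing predictors.
rng = np.random.default_rng(0)
X_train = rng.uniform(10, 30, size=(80, 5))
X_test = rng.uniform(10, 30, size=(20, 5))

scaler = StandardScaler()
# Learn each column's mean and standard deviation from the training data only.
X_train_scaled = scaler.fit_transform(X_train)
# Reuse those training statistics to transform the test data.
X_test_scaled = scaler.transform(X_test)
```

After this step each training column has mean 0 and unit variance; the test columns will be close to that but not exact, which is expected.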