## DSCI 100 Project Proposal: Classifying star category using temperature, luminosity, radius, and absolute magnitude as predictors.

### Introduction

&emsp; Stars are fundamental objects in astronomy, each characterized by distinct spectral and physical attributes. Whereas ancient classification relied on constellations and positions in the sky, modern astrophysics seeks a unified understanding based on measurable properties. Because stars are such complex bodies, the boundaries between star types can be ambiguous, so classifying them demands rigorous analysis of several quantifiable characteristics.

**Question:** Can we successfully predict the star type from its temperature, luminosity, radius, and absolute magnitude?

**Dataset Description**

&emsp; For this project we will use the Star Classification dataset provided by the YBI Foundation on [Kaggle](https://www.kaggle.com/code/ybifoundation/stars-classification). The data contains the following variables: absolute temperature (in K), relative luminosity (L/Lo), relative radius (R/Ro), absolute magnitude (Mv), star type, star category, color, and spectral class. Not all of these variables are useful as predictors, since several are themselves classifications; the **Methods** section explains our variable selection.


### Preliminary exploratory data analysis

**Setting Up Libraries and Parameters**

In [25]:
# Run this first.
library(tidyverse)
library(tidymodels)

# Importing data
dataset_url <- "https://raw.githubusercontent.com/YBIFoundation/Dataset/main/Stars.csv"

**Loading and tidying data**

In [43]:
# loading and tidying
star_raw_data <- read_csv(dataset_url) 

star_data <- star_raw_data |>
    rename(temperature = "Temperature (K)",
           luminosity = "Luminosity (L/Lo)",
           radius = "Radius (R/Ro)",
           absolute_magnitude = "Absolute magnitude (Mv)",
           star_type = "Star type",
           star_category = "Star category",
           star_colour = "Star color",
           spectral_class = "Spectral Class") |>
    select(temperature:absolute_magnitude,star_category)
    

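# splitting into training and testing data
set.seed(2023) # arbitrary seed (our assumption) so the split is reproducible
star_data_split <- star_data |>
    initial_split(prop = 0.75, strata = star_category)

star_training_data <- training(star_data_split)
star_testing_data <- testing(star_data_split)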

# head and summary of our data
head(star_data)
summary(star_data)

Rows: 240 Columns: 8
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (3): Star category, Star color, Spectral Class
dbl (5): Temperature (K), Luminosity (L/Lo), Radius (R/Ro), Absolute magnitu...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


| temperature | luminosity | radius | absolute_magnitude | star_category |
|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<chr>` |
| 3068 | 0.0024 | 0.17 | 16.12 | Brown Dwarf |
| 3042 | 0.0005 | 0.1542 | 16.6 | Brown Dwarf |
| 2600 | 0.0003 | 0.102 | 18.7 | Brown Dwarf |
| 2800 | 0.0002 | 0.16 | 16.65 | Brown Dwarf |
| 1939 | 0.000138 | 0.103 | 20.06 | Brown Dwarf |
| 2840 | 0.00065 | 0.11 | 16.98 | Brown Dwarf |


  temperature      luminosity           radius          absolute_magnitude
 Min.   : 1939   Min.   :     0.0   Min.   :   0.0084   Min.   :-11.920   
 1st Qu.: 3344   1st Qu.:     0.0   1st Qu.:   0.1027   1st Qu.: -6.232   
 Median : 5776   Median :     0.1   Median :   0.7625   Median :  8.313   
 Mean   :10497   Mean   :107188.4   Mean   : 237.1578   Mean   :  4.382   
 3rd Qu.:15056   3rd Qu.:198050.0   3rd Qu.:  42.7500   3rd Qu.: 13.697   
 Max.   :40000   Max.   :849420.0   Max.   :1948.5000   Max.   : 20.060   
 star_category     
 Length:240        
 Class :character  
 Mode  :character  
                   
                   
                   

**Preliminary Data Visualization**
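
As a possible starting point for this section, the sketch below counts the observations in each star category and plots temperature against absolute magnitude, coloured by category. It reloads the data so the cell runs on its own. The names `star_plot_data`, `star_counts`, and `star_plot`, the choice of these two predictors, the log-scaled temperature axis, and the reversed magnitude axis are all our assumptions rather than settled choices.

```r
library(tidyverse)

# Reload the data so this cell runs on its own (same URL as the setup cell).
dataset_url <- "https://raw.githubusercontent.com/YBIFoundation/Dataset/main/Stars.csv"

star_plot_data <- read_csv(dataset_url, show_col_types = FALSE) |>
    select(temperature = "Temperature (K)",
           absolute_magnitude = "Absolute magnitude (Mv)",
           star_category = "Star category")

# Number of observations per class, to check class balance.
star_counts <- star_plot_data |>
    count(star_category)
star_counts

# Temperature vs. absolute magnitude, coloured by star category.
# Assumptions: log-scaled temperature and a reversed magnitude axis
# (brighter stars have lower magnitudes).
star_plot <- star_plot_data |>
    ggplot(aes(x = temperature, y = absolute_magnitude, colour = star_category)) +
    geom_point(alpha = 0.6) +
    scale_x_log10() +
    scale_y_reverse() +
    labs(x = "Temperature (K)",
         y = "Absolute magnitude (Mv)",
         colour = "Star category")
star_plot
```

The same pattern could be repeated for luminosity and radius to see which pairs of predictors separate the categories most clearly.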

### Methods

&emsp; We will use the tidied Star Classification dataset, processed with the tidyverse and tidymodels libraries, to predict the star category. The following variables serve as potential predictors:
*  **Temperature**
*  **Luminosity**
*  **Radius**
*  **Absolute Magnitude** 

Star type (the `star_category` column) is chosen as our class variable because it is the label we want to predict from the measured physical characteristics listed above. Star colour and spectral class are themselves classifications based on observation rather than measured quantities, so we do not use them as predictors.


We chose K-nearest neighbours (KNN) classification as our predictive model because the star type is categorical and KNN can capture the nonlinear boundaries between classes in the star dataset, which makes it a simple and interpretable approach. We exclude spectral class and star colour as predictors because they are categorical, and KNN relies on numeric distance computations. Because the predictors span very different scales (luminosity reaches 849,420 while the median radius is below 1), we will standardize them before computing distances.
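
The modelling plan above can be sketched with tidymodels as follows. This is a minimal sketch, not our final analysis: the seed, the 5 folds, the candidate range K = 1 to 10, and all object names (`star_knn_data`, `star_recipe`, `knn_tune_spec`, and so on) are assumptions made for illustration. The cell reloads the data so it runs on its own, and it assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)  # the "kknn" engine also needs the kknn package installed

set.seed(2023)  # arbitrary seed, chosen only for reproducibility

dataset_url <- "https://raw.githubusercontent.com/YBIFoundation/Dataset/main/Stars.csv"

# Reload and tidy so this cell runs on its own; the class must be a factor.
star_knn_data <- read_csv(dataset_url, show_col_types = FALSE) |>
    select(temperature = "Temperature (K)",
           luminosity = "Luminosity (L/Lo)",
           radius = "Radius (R/Ro)",
           absolute_magnitude = "Absolute magnitude (Mv)",
           star_category = "Star category") |>
    mutate(star_category = as_factor(star_category))

star_knn_split <- initial_split(star_knn_data, prop = 0.75, strata = star_category)
star_knn_train <- training(star_knn_split)

# Standardize predictors so no single variable dominates the distance.
star_recipe <- recipe(star_category ~ ., data = star_knn_train) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

# KNN specification with the number of neighbours left to be tuned.
knn_tune_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

# 5-fold cross-validation over K = 1..10 (both values are assumptions).
star_vfold <- vfold_cv(star_knn_train, v = 5, strata = star_category)

knn_results <- workflow() |>
    add_recipe(star_recipe) |>
    add_model(knn_tune_spec) |>
    tune_grid(resamples = star_vfold, grid = tibble(neighbors = 1:10)) |>
    collect_metrics()

# Cross-validated accuracy for each candidate K, best first.
knn_accuracies <- knn_results |>
    filter(.metric == "accuracy") |>
    arrange(desc(mean))
knn_accuracies
```

The K with the highest cross-validated accuracy would then be used to refit the model on the full training set before a single evaluation on the testing set.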


### Expected Outcomes and Significance

We expect that KNN classification on our tidied dataset will successfully predict star types from the chosen predictors. This would give a clearer picture of the characteristics that distinguish the various star types. Predicting star type from physical measurements could help astronomers identify distinct patterns and deepen their understanding of the universe.

Future Questions:
1. Can we use this algorithm to successfully predict the star type of the Sun?
2. To what extent is KNN classification better suited to this data than other algorithms?
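
The first question could be explored along the following lines. This is only a sketch under stated assumptions: K = 5 is a placeholder rather than a tuned value, the model is fitted on the full dataset for simplicity, and the object names are hypothetical. The Sun's values are approximate, expressed in the dataset's units (luminosity and radius are 1 by definition of L/Lo and R/Ro). The cell assumes the `kknn` package is installed.

```r
library(tidyverse)
library(tidymodels)  # the "kknn" engine also needs the kknn package installed

dataset_url <- "https://raw.githubusercontent.com/YBIFoundation/Dataset/main/Stars.csv"

# Reload and tidy so this cell runs on its own.
star_all <- read_csv(dataset_url, show_col_types = FALSE) |>
    select(temperature = "Temperature (K)",
           luminosity = "Luminosity (L/Lo)",
           radius = "Radius (R/Ro)",
           absolute_magnitude = "Absolute magnitude (Mv)",
           star_category = "Star category") |>
    mutate(star_category = as_factor(star_category))

# Placeholder K = 5 (an assumption; a tuned value should replace it).
sun_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 5) |>
    set_engine("kknn") |>
    set_mode("classification")

# Standardize predictors before computing distances.
sun_recipe <- recipe(star_category ~ ., data = star_all) |>
    step_scale(all_predictors()) |>
    step_center(all_predictors())

sun_fit <- workflow() |>
    add_recipe(sun_recipe) |>
    add_model(sun_spec) |>
    fit(data = star_all)

# Approximate solar values in the dataset's units.
sun <- tibble(temperature = 5772,
              luminosity = 1,
              radius = 1,
              absolute_magnitude = 4.83)

sun_prediction <- predict(sun_fit, sun)
sun_prediction
```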


### References

YBI Foundation. “Stars Classification.” *Kaggle*, 9 Mar. 2023, https://www.kaggle.com/code/ybifoundation/stars-classification. Accessed 25 Oct. 2023.