## Predicting Pulsars: Mean, Standard Deviation and Kurtosis Analysis

**Logan Chan, Sam Donato, Navneet Bedi, Ahadjon Sultonov**

### Introduction

Pulsars were first discovered in 1967 by astronomers Jocelyn Bell Burnell and Antony Hewish (American Physical Society, 2006). 

A pulsar is a rotating neutron star that is highly magnetized, causing them to emit beams of radiation. These stars provided the first indirect evidence for the existence of gravitational waves. Pulsar stars also have the potential to reveal extreme phenomena in neutron star astrophysics (Zhang et al., 2020. 

Thus, in effect, pulsars can be thought of as 'cosmic lighthouses.'

These beams can appear to pulse as the star rotates, but other astronomical phenomena in space can mimic these pulsar signals, which we call spurious signals. The spurious signals can be challenging to identify and seperate from pulsar signals (Gaskill, 2020). 

The goal of this project will be to use variables from the HTRU2 UC Irvine Machine Learning Repository Pulsar Star Dataset to classify whether a star is pulsar or not.

HTRU2 is a data set which describes a sample of pulsar candidates collected during the High Time Resolution Universe Survey (South). The data set shared contains 16,259 spurious examples caused by RFI/noise, and 1,639 real pulsar examples (Lyon, 2017).  In the data set, legitimate pulsar examples are a minority positive class, and spurious examples are the majority negative class. The class labels used are 0 (negative) and 1 (positive). Due to the number of predictor variables, we only chose a few potentially important predictors to focus on for the preliminary analysis.

The question we will be addressing is: **Given the mean, standard deviation, and excess kurtosis of the integrated profile, can we predict if a star is pulsar or if it is a spurious signal?**

### Methods and Results

In [3]:
library(tidyverse)
library(repr)
library(tidymodels)


pulsar_data <- read_csv("data/pulsar_data.csv") 


names(pulsar_data) <- c("mean_ip", "std_dev_ip", "kurtosis_ip", "skew_ip", "mean_curve", "std_dev_curve", "kurtosis_curve", "skew_curve", "type")

pulsar_data_selected <- pulsar_data |> 
select("mean_ip", "std_dev_ip", "kurtosis_ip", "type") 
pulsar_data_selected


pulsar_split <- initial_split(pulsar_data_selected, prop = 0.75, strata = type)
pulsar_training <- training(pulsar_split)
pulsar_testing  <- testing(pulsar_split)


pulsar_recipe<- recipe(type ~ mean_ip + std_dev_ip + kurtosis_ip, data = pulsar_data) |>
  step_scale(all_predictors()) |>
  step_center(all_predictors())
pulsar_recipe


pulsar_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tuning()) |>
  set_engine("kknn") |>
  set_mode("classification")


pulsar_fit <- workflow() |>
  add_recipe(pulsar_recipe) |>
  add_model(pulsar_spec) |>
  fit(data = pulsar_data)

[1mRows: [22m[34m17897[39m [1mColumns: [22m[34m9[39m
[36m──[39m [1mColumn specification[22m [36m────────────────────────────────────────────────────────[39m
[1mDelimiter:[22m ","
[32mdbl[39m (9): 140.5625, 55.68378214, -0.234571412, -0.699648398, 3.199832776, 19....

[36mℹ[39m Use `spec()` to retrieve the full column specification for this data.
[36mℹ[39m Specify the column types or set `show_col_types = FALSE` to quiet this message.


mean_ip,std_dev_ip,kurtosis_ip,type
<dbl>,<dbl>,<dbl>,<dbl>
102.50781,58.88243,0.465318154,0
103.01562,39.34165,0.323328365,0
136.75000,57.17845,-0.068414638,0
88.72656,40.67223,0.600866079,0
93.57031,46.69811,0.531904850,0
119.48438,48.76506,0.031460220,0
130.38281,39.84406,-0.158322759,0
107.25000,52.62708,0.452688025,0
107.25781,39.49649,0.465881961,0
142.07812,45.28807,-0.320328426,0




[36m──[39m [1mRecipe[22m [36m──────────────────────────────────────────────────────────────────────[39m



── Inputs 

Number of variables by role

outcome:   1
predictor: 3



── Operations 

[36m•[39m Scaling for: [34mall_predictors()[39m

[36m•[39m Centering for: [34mall_predictors()[39m



ERROR: [1m[33mError[39m in `check_outcome()`:[22m
[33m![39m For a classification model, the outcome should be a `factor`, not a `numeric`.


some explanations

### Discussion

asdfghj

### References

Cheng Jun Zhang, Zhen Hong Shang, Wan Min Chen, Liu Xie, Xiang Hua Miao, A Review of Research on Pulsar Candidate Recognition Based on Machine Learning, Procedia Computer Science, Volume 166, 2020, Pages 534-538, ISSN 1877-0509, https://doi.org/10.1016/j.procs.2020.02.050.

Chodos, Alan. (February 2006). American Physical Society. APS News, Volume 15, Number 2). https://www.aps.org/publications/apsnews/200602/history.cfm

Gaskill, Melissa. (June 22, 2020). Phys.org https://phys.org/news/2020-06-future-space-cosmic-lighthouses.html

Lyon, Robert. (2017). HTRU2. UCI Machine Learning Repository. https://doi.org/10.24432/C5DK6R