**Introduction**:

This project analyzes data from the High Time Resolution Universe Pulsar Survey to better understand how to label pulsar candidates. Pulsars are a type of rapidly rotating neutron star.

Studying pulsars allows researchers to explore “space-time, the interstellar medium, and states of matter.” However, the radio signals used to identify pulsars are contaminated by radio frequency interference (RFI) and noise, which makes genuine pulsar signals hard to distinguish. As a result, there is demand for machine learning tools that can label pulsar candidates automatically.

If a pulsar's signal is folded (averaged) over its rotational period, the result is an integrated profile that is characteristic of that pulsar. A DM-SNR curve plots the signal-to-noise ratio (SNR) of a candidate as a function of the trial dispersion measure (DM).
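
To make the idea of an integrated profile concrete, here is a minimal sketch in base R. It is not the survey's actual processing pipeline. It folds a synthetic noisy signal at a known period by averaging all samples that share the same rotational phase, then summarizes the folded profile with four statistics of the kind stored in the dataset. The period, pulse shape, noise level, and all variable names are assumptions chosen for illustration, and the exact estimators used by the dataset authors may differ.

```r
# Synthetic example only: every value below is made up for illustration.
set.seed(1)
period <- 50    # samples per rotation (assumed)
n_rot  <- 400   # number of rotations observed (assumed)

# Phase bin (1..period) of every sample in the time series
phase <- rep(seq_len(period), times = n_rot)

# A Gaussian-shaped pulse centred on phase bin 25, plus unit-variance noise
pulse  <- exp(-0.5 * ((phase - 25) / 2)^2)
signal <- pulse + rnorm(length(phase))

# Fold: average all samples that fall in the same phase bin
profile  <- as.numeric(tapply(signal, phase, mean))
peak_bin <- which.max(profile)

# Summary statistics of the folded profile (simple moment estimators)
z <- (profile - mean(profile)) / sd(profile)
profile_stats <- c(mean     = mean(profile),
                   sd       = sd(profile),
                   excess   = mean(z^4) - 3,  # excess kurtosis
                   skewness = mean(z^3))
profile_stats
```

Averaging over many rotations shrinks the noise while the pulse stays in the same phase bins, which is why the pulse stands out in the folded profile even though it is hard to see in a single rotation.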

The question that this project will answer is **“Which group of variables, those from the integrated profile or those from the DM-SNR curve, is a better predictor of class?”**

The dataset used for this project describes 17,898 candidates: 16,259 are examples of RFI/noise and 1,639 are examples of real pulsars. Eight variables, four describing the integrated profile and four describing the DM-SNR curve of each candidate, will be analyzed to answer the research question.
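
The class counts above imply a strong imbalance, which is worth quantifying before any model is fit. The short calculation below uses only the counts quoted in this introduction. The variable names are illustrative, and the "always predict noise" baseline is added here as a reference point rather than being part of the original analysis.

```r
# Class counts quoted in the introduction
n_total  <- 17898
n_noise  <- 16259
n_pulsar <- 1639

# The two classes should account for every candidate
stopifnot(n_noise + n_pulsar == n_total)

# Share of real pulsars, and the accuracy of a classifier that
# labels every candidate as RFI/noise (majority-class baseline)
prop_pulsar       <- n_pulsar / n_total
baseline_accuracy <- n_noise / n_total

round(100 * prop_pulsar, 1)        # 9.2 (percent)
round(100 * baseline_accuracy, 1)  # 90.8 (percent)
```

Because roughly nine in ten candidates are noise, a classifier's accuracy should be read against this baseline, and stratifying by class when splitting the data keeps the pulsar share similar across subsets.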


In [None]:
library(tidyverse)
library(repr)
library(RColorBrewer)
library(tidymodels)

In [None]:
set.seed(19)
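# load the HTRU2 data from the group project's GitHub repository;
# the file has no header row, so the columns are named manually
# (ip = integrated profile, dm = DM-SNR curve)

pulsar_data <- read_csv("https://github.com/kseniak1/DSCI100-Group-Project/raw/main/HTRU_2%5B1%5D.csv", col_names = FALSE) %>%
                rename(mean_ip = X1,
                       dev_ip = X2,
                       excess_ip = X3,
                       skewness_ip = X4,
                       mean_dm = X5,
                       dev_dm = X6,
                       excess_dm = X7,
                       skewness_dm = X8,
                       class = X9) %>%
                mutate(class = as.factor(class)) %>%  # class: 1 = pulsar, 0 = RFI/noise
                sample_n(500)                         # random subsample of 500 candidates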
pulsar_data

In [None]:
set.seed(19)
# split the data into training (75%) and testing (25%) sets, stratified by class

pulsar_split <- initial_split(pulsar_data, prop = 0.75, strata = class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)

pulsar_test

In [None]:
set.seed(19)
# 5-fold cross-validation on the training set, stratified by class
pulsar_vfold <- vfold_cv(pulsar_train, v = 5, strata = class)
pulsar_vfold

In [29]:
set.seed(19)
# create the K-nearest neighbors model specification;
# neighbors = tune() leaves K to be chosen by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
knn_spec

K-Nearest Neighbor Model Specification (classification)

Main Arguments:
  neighbors = tune()
  weight_func = rectangular

Computational engine: kknn 
