**Introduction**:

This project analyzes data from the High Time Resolution Universe (HTRU) Pulsar Survey to better understand how to label candidates for pulsars, a type of neutron star.

Studying pulsars allows researchers to explore “space-time, the interstellar medium, and states of matter.” However, the radio signals used to identify pulsars are often obscured by radio frequency interference (RFI) and noise. As a result, there is demand for machine learning tools that can label pulsar candidates automatically.

If a pulsar’s signal is folded (averaged) over its rotational period, a characteristic integrated profile is obtained for that pulsar. A DM-SNR curve plots the signal-to-noise ratio (SNR) of a candidate as a function of the trial dispersion measure (DM).
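To make the idea of an integrated profile concrete, here is a minimal sketch in base R. It simulates a noisy periodic signal and averages it by rotational phase. The period, pulse shape, and noise level are all made-up values for illustration and are not taken from the survey data.

```r
# Minimal sketch: fold a noisy periodic signal at its period to build an
# integrated profile. Every value below is hypothetical.
set.seed(1)
period <- 50    # samples per rotation (assumed known)
n_rot  <- 200   # number of rotations observed
phase  <- rep(seq_len(period), times = n_rot)

# One narrow pulse per rotation, centred on phase bin 25, buried in noise
pulse  <- exp(-0.5 * ((phase - 25) / 2)^2)
signal <- pulse + rnorm(length(phase), sd = 1)

# Folding: average all samples that share the same rotational phase
integrated_profile <- tapply(signal, phase, mean)

# The pulse emerges near phase bin 25 once the noise averages down
which.max(integrated_profile)
```

A single rotation is dominated by noise in this toy setup, but averaging 200 rotations shrinks the noise by a factor of about 14, which is why the folded profile shows a clear peak.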

The question that this project will answer is **“Which variable, between integrated profile and DM-SNR, is a better predictor for class?”**

The dataset used for this project contains 17,898 candidates: 16,259 are examples of RFI/noise and 1,639 are examples of real pulsars. Eight variables, which summarize the integrated profile and the DM-SNR curve of each candidate, will be analyzed to answer the research question.
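The class counts above imply a strongly imbalanced dataset. The short check below, which uses only the counts quoted in the text, confirms that the two classes add up to the total and computes each class's share. The variable names are illustrative.

```r
# Candidate counts as reported for the dataset
n_total  <- 17898
n_noise  <- 16259   # RFI/noise examples
n_pulsar <- 1639    # real pulsar examples

# The two classes should account for every candidate
stopifnot(n_noise + n_pulsar == n_total)

# Share of each class, in percent, rounded to one decimal place
class_share <- round(100 * c(noise = n_noise, pulsar = n_pulsar) / n_total, 1)
class_share
```

Because only about 9% of candidates are real pulsars, a classifier that always predicts "noise" would already be right roughly 91% of the time, so accuracy alone should be read with that baseline in mind.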


1. Load the libraries tidyverse, repr, RColorBrewer and tidymodels, and read the data from the web using the read_csv() function.

In [1]:
library(tidyverse)
library(repr)
library(RColorBrewer)
library(tidymodels)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.0 ──

✔ ggplot2 3.3.2     ✔ purrr   0.3.4
✔ tibble  3.0.3     ✔ dplyr   1.0.2
✔ tidyr   1.1.2     ✔ stringr 1.4.0
✔ readr   1.3.1     ✔ forcats 0.5.0

“package ‘ggplot2’ was built under R version 4.0.1”
“package ‘tibble’ was built under R version 4.0.2”
“package ‘tidyr’ was built under R version 4.0.2”
“package ‘dplyr’ was built under R version 4.0.2”
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

“package ‘tidymodels’ was built under R version 4.0.2”
── Attaching packages ────────────────────────────────────── tidymodels 0.1.1 ──

In [2]:
set.seed(19)
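# Read the HTRU_2 CSV from the group's GitHub repository (the file has no header row),
# give the nine columns descriptive names (_ip = integrated profile, _dm = DM-SNR curve),
# convert the class label to a factor, and draw a random sample of 500 rows.

pulsar_data <- read_csv("https://github.com/kseniak1/DSCI100-Group-Project/raw/main/HTRU_2%5B1%5D.csv", col_names = FALSE) %>%
  rename(mean_ip = X1,
         dev_ip = X2,
         excess_ip = X3,
         skewness_ip = X4,
         mean_dm = X5,
         dev_dm = X6,
         excess_dm = X7,
         skewness_dm = X8,
         class = X9) %>%
  mutate(class = as.factor(class)) %>%
  sample_n(500)
pulsar_data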

Parsed with column specification:
cols(
  X1 = col_double(),
  X2 = col_double(),
  X3 = col_double(),
  X4 = col_double(),
  X5 = col_double(),
  X6 = col_double(),
  X7 = col_double(),
  X8 = col_double(),
  X9 = col_double()
)



| mean_ip | dev_ip | excess_ip | skewness_ip | mean_dm | dev_dm | excess_dm | skewness_dm | class |
|---|---|---|---|---|---|---|---|---|
| &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;dbl&gt; | &lt;fct&gt; |
| 107.91406 | 37.32941 | 0.30272446 | 1.58089787 | 3.9272575 | 24.924345 | 6.711611 | 46.2225904 | 0 |
| 14.70312 | 29.81594 | 7.21863076 | 52.86635443 | 81.6028428 | 57.256627 | 1.058996 | 1.2584176 | 1 |
| 99.52344 | 54.36896 | 0.63675112 | -0.22504476 | 2.2784281 | 15.332172 | 10.197710 | 127.5418937 | 0 |
| 106.06250 | 51.40354 | 0.31323378 | -0.33572020 | 4.6889632 | 26.324866 | 6.360699 | 42.3073497 | 0 |
| 128.97656 | 55.67406 | -0.08570242 | -0.59446104 | 5.8586957 | 30.994971 | 5.618643 | 32.0538478 | 0 |
| 128.62500 | 43.70212 | 0.15793456 | 0.46496956 | 49.1906355 | 78.242754 | 1.043050 | -0.7616284 | 0 |
| 95.39844 | 48.16286 | 0.51860972 | 0.49351065 | 2.2784281 | 18.554590 | 10.275586 | 114.6954798 | 0 |
| 123.57031 | 50.13352 | 0.13839104 | -0.19433412 | 4.2433110 | 20.507895 | 6.616925 | 52.5052344 | 0 |
| 113.39844 | 51.77843 | 0.26218746 | 0.20984763 | 1.8545151 | 18.093558 | 10.431586 | 113.0406108 | 0 |
| 152.21875 | 54.58464 | -0.16691626 | -0.43796620 | 0.6856187 | 9.458402 | 20.296619 | 480.7530292 | 0 |


In [None]:
set.seed(19)
# split the data into training (75%) and testing (25%) sets, stratified by class

pulsar_split <- initial_split(pulsar_data, prop = 0.75, strata = class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)

pulsar_test
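The split above is stratified by class, which matters because pulsars are the minority class. As a rough illustration of what stratification does conceptually, here is a minimal base-R sketch on made-up labels. It is not how rsample is implemented, and the `toy` data frame and its 10% positive rate are assumptions chosen for the example.

```r
# Minimal sketch of a stratified 75/25 split in base R, on made-up labels
set.seed(19)
toy <- data.frame(id = 1:200,
                  class = factor(rep(c("0", "1"), times = c(180, 20))))

# Sample 75% of the row ids within each class separately
train_idx <- unlist(lapply(split(toy$id, toy$class), function(ids) {
  sample(ids, size = round(0.75 * length(ids)))
}))

toy_train <- toy[toy$id %in% train_idx, ]
toy_test  <- toy[!toy$id %in% train_idx, ]

# Both subsets keep the original 10% share of class 1
prop.table(table(toy_train$class))
prop.table(table(toy_test$class))
```

Sampling within each class guarantees that the rare class appears in both subsets in the same proportion, which a purely random split of a small sample would not.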

In [None]:
set.seed(19)
# set up 5-fold cross-validation on the training set, stratified by class
pulsar_vfold <- vfold_cv(pulsar_train, v = 5, strata = class)
pulsar_vfold

In [30]:
set.seed(19)
# create the K-nearest neighbours model specification; neighbors = tune() leaves K to be chosen by cross-validation
knn_spec <- nearest_neighbor(weight_func = "rectangular", 
                             neighbors = tune()) %>%
  set_engine("kknn") %>%
  set_mode("classification")
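In the specification above, `weight_func = "rectangular"` requests an unweighted vote, in which each of the K nearest neighbours counts equally. The base-R sketch below illustrates that idea on a tiny made-up dataset. The function `knn_vote` and the toy points are hypothetical and are not part of kknn or tidymodels.

```r
# Minimal base-R sketch of an unweighted ("rectangular") k-nearest-neighbour
# vote. The function name and the toy data are hypothetical.
knn_vote <- function(train_x, train_y, new_x, k) {
  # Euclidean distance from the new point to every training row
  dists <- sqrt(rowSums(sweep(train_x, 2, new_x)^2))
  # Labels of the k closest rows; each neighbour counts equally
  nearest <- train_y[order(dists)[seq_len(k)]]
  # Majority class among those neighbours
  names(which.max(table(nearest)))
}

# Two well-separated clusters, one per class
train_x <- matrix(c(0, 0,
                    0, 1,
                    1, 0,
                    5, 5,
                    5, 6,
                    6, 5), ncol = 2, byrow = TRUE)
train_y <- factor(c("0", "0", "0", "1", "1", "1"))

knn_vote(train_x, train_y, new_x = c(0.5, 0.5), k = 3)
knn_vote(train_x, train_y, new_x = c(5.5, 5.5), k = 3)
```

The first query point sits inside the class 0 cluster and the second inside the class 1 cluster, so the votes are unanimous. With real predictors on different scales, the distance would be dominated by the largest-valued variable unless the predictors are standardized first.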