**DSCI 100 09 Group 04 Project Proposal - Pulsar Star Dataset**

The HTRU2 dataset consists of pulsar candidates from the High Time Resolution Universe Survey (South). Pulsars are rare, rapidly rotating Neutron stars that emit radio signals, with their unique patterns holding scientific significance. Identifying genuine pulsars amidst noise and radio frequency interference is a complex task. Machine learning techniques, particularly binary classification, are being employed to automatically categorize these candidates. Legitimate pulsar signals represent a minority class.

In [1]:
library(tidyverse)
library(repr)
library(tidymodels)

── [1mAttaching core tidyverse packages[22m ──────────────────────── tidyverse 2.0.0 ──
[32m✔[39m [34mdplyr    [39m 1.1.2     [32m✔[39m [34mreadr    [39m 2.1.4
[32m✔[39m [34mforcats  [39m 1.0.0     [32m✔[39m [34mstringr  [39m 1.5.0
[32m✔[39m [34mggplot2  [39m 3.4.3     [32m✔[39m [34mtibble   [39m 3.2.1
[32m✔[39m [34mlubridate[39m 1.9.2     [32m✔[39m [34mtidyr    [39m 1.3.0
[32m✔[39m [34mpurrr    [39m 1.0.2     
── [1mConflicts[22m ────────────────────────────────────────── tidyverse_conflicts() ──
[31m✖[39m [34mdplyr[39m::[32mfilter()[39m masks [34mstats[39m::filter()
[31m✖[39m [34mdplyr[39m::[32mlag()[39m    masks [34mstats[39m::lag()
[36mℹ[39m Use the conflicted package ([3m[34m<http://conflicted.r-lib.org/>[39m[23m) to force all conflicts to become errors
── [1mAttaching packages[22m ────────────────────────────────────── tidymodels 1.1.1 ──

[32m✔[39m [34mbroom       [39m 1.0.5     [32m✔[39m [34mrsample     [39

The total set of variables per image in this data set is:

Mean_IP: Mean of the integrated profile.SD_IP:. Standard deviation of the integrated profile.Kurtosis_IP:3. Excess kurtosis of the integrated profileSkewness_IP:
4. Skewness of the integrated profilMean_DM_SNR:
5. Mean of the DM-SNR curSD_DM_SNR:.
6. Standard deviation of the DM-SNR cuKurtosis_DM_SNR:e.
7. Excess kurtosis of the DM-SNR cSkeweness_DM-SNR:ve.
8. Skewness of the DM-SNR Class:rThe diagnosis (0 = spurious examples(non-real), 1 = real pulsar examples)9. Class


In [17]:
 pulsar_data <- read_csv("HTRU_2.csv", col_names = FALSE) |>   #reading data into rand giving the column readable names
                            rename(Mean_IP = X1,
                                   SD_IP = X2,
                                   Kurtosis_IP = X3,
                                   Skewness_IP = X4,
                                   Mean_DM_SNR = X5,
                                   SD_DM_SNR = X6,
                                   Kurtosis_DM_SNR = X7,
                                   Skeweness_DM_SNR = X8,
                                   Class = X9) |>
                            mutate(Class = as_factor(Class)) |>
                            mutate(Class = fct_recode(Class, "Spurious" = "0", "Real Pulsar" = "1"))
pulsar_data

ERROR: Error in read_csv("HTRU_2.csv", col_names = FALSE, na.rm = TRUE): unused argument (na.rm = TRUE)


In [34]:




glimpse(pulsar_data)  #seeing how many rows exist in the dataset
pulsar_data |>
distinct(Class)  #checking the distinct classes in the dataset

#Finding out how many observations belong to each class as well as the percentage of each class

obs <- nrow(pulsar_data)
pulsar_data |>
  group_by(Class) |>
  summarize(
    count = n(),
    percentage = n() / obs * 100
                 
  ) 
obs

#splitting data into training and testing sets

pulsar_split <- initial_split(pulsar_data, prop = 0.75, strata = Class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split) 
#glimpse(pulsar_train)
#glimpse(pulsar_test)

#Finding out how many observations belong to each class as well as the percentage of each class for training set


# dataset has no missing values although it is always good to check 

obs_train <- nrow(pulsar_train)
pulsar_data |>
  group_by(Class) |>
  summarize(
    count = n(),
    percentage = n() / obs * 100)

obs_train


    







Rows: 17,898
Columns: 9
$ Mean_IP          [3m[90m<dbl>[39m[23m 140.56250, 102.50781, 103.01562, 136.75000, 88.72656,…
$ SD_IP            [3m[90m<dbl>[39m[23m 55.68378, 58.88243, 39.34165, 57.17845, 40.67223, 46.…
$ Kurtosis_IP      [3m[90m<dbl>[39m[23m -0.23457141, 0.46531815, 0.32332837, -0.06841464, 0.6…
$ Skewness_IP      [3m[90m<dbl>[39m[23m -0.69964840, -0.51508791, 1.05116443, -0.63623837, 1.…
$ Mean_DM_SNR      [3m[90m<dbl>[39m[23m 3.1998328, 1.6772575, 3.1212375, 3.6429766, 1.1789298…
$ SD_DM_SNR        [3m[90m<dbl>[39m[23m 19.110426, 14.860146, 21.744669, 20.959280, 11.468720…
$ Kurtosis_DM_SNR  [3m[90m<dbl>[39m[23m 7.975532, 10.576487, 7.735822, 6.896499, 14.269573, 1…
$ Skeweness_DM_SNR [3m[90m<dbl>[39m[23m 74.24222, 127.39358, 63.17191, 53.59366, 252.56731, 1…
$ Class            [3m[90m<fct>[39m[23m Spurious, Spurious, Spurious, Spurious, Spurious, Spu…


Class
<fct>
Spurious
Real Pulsar


Class,count,percentage,na.rm
<fct>,<int>,<dbl>,<lgl>
Spurious,16259,90.842552,True
Real Pulsar,1639,9.157448,True


Class,count,percentage
<fct>,<int>,<dbl>
Spurious,16259,90.842552
Real Pulsar,1639,9.157448


count
<dbl>
68.10162


══ Workflow [trained] ══════════════════════════════════════════════════════════
[3mPreprocessor:[23m Recipe
[3mModel:[23m nearest_neighbor()

── Preprocessor ────────────────────────────────────────────────────────────────
2 Recipe Steps

• step_scale()
• step_center()

── Model ───────────────────────────────────────────────────────────────────────

Call:
kknn::train.kknn(formula = ..y ~ ., data = data, ks = min_rows(5,     data, 5), kernel = ~"rectangular")

Type of response variable: nominal
Minimal misclassification: 0.0224987
Best kernel: rectangular
Best k: 5

In [None]:
#CLEAN AND WRANGLE DATA




Kurtosis_IP_Skewness_IP <- pulsar_data |>
  ggplot(aes(x = Kurtosis_IP, y = Skewness_IP, color = Class)) +
  geom_point(alpha = 0.5) +
  labs(color = "Diagnosis") +
  theme(text = element_text(size = 12)) + 

Kurtosis_IP_Skewness_IP  
