# Project Proposal: Exploring Pulsar Star Data

Introduction
----

<b>Relevant Background</b>
    
Pulsar stars are a type of neutron star that produces radio emission detectable on Earth. They are significant because they are used as probes of space-time, the interstellar medium, and states of matter (1).
    

The HTRU2 dataset, taken from (cite kaggle), will inform this project. It classifies candidate stars as either pulsar or non-pulsar stars (1). Light emitted by a pulsar carries information about the physics of neutron stars, which are among the densest objects in the universe. The precise blink, or pulse, of a pulsar star can also indicate a possible event happening in space. Because pulsar stars emit light with regular periodicity, they can be used to calculate cosmic distances, and they have been used to test parts of the theory of relativity (2).
 
    
    

<b>Research Question</b>


<b>About the Dataset</b>
    
This dataset uses the mean, standard deviation, excess kurtosis, and skewness of each star's integrated profile and DM-SNR (dispersion measure versus signal-to-noise ratio) curve to classify whether or not a star can be identified as a pulsar. DM-SNR curves measure the radio waves released by pulsar stars once they reach Earth, after they have traveled long distances through space filled with free electrons. The integrated profile is used to identify a pulsar because each profile is unique; however, individual pulse profiles vary slightly from period to period because the signals are non-uniform and unstable. Averaging over many thousands of rotations makes the profiles stable. A high excess kurtosis means that the distribution has many outliers, leading to fat tails on the bell-shaped distribution curve. The observations under the Class variable are binary, indicating whether a star is a pulsar or a non-pulsar. Because these observations are discrete, the Class variable has been converted to a factor. The rest of the variables are continuous (1).
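To make the four summary statistics concrete, the sketch below computes them for a simulated vector in base R. The helper name `profile_stats` and the toy data are hypothetical, and the population (divide by n) form of the moments is an assumption; the exact convention used to build HTRU2 is not stated here.

```r
# Hypothetical helper: the four statistics used as features in this dataset.
# Assumes population moments (divide by n), which may differ from HTRU2's convention.
profile_stats <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))   # population standard deviation
  z <- (x - m) / s             # standardized values
  c(mean = m,
    std_dev = s,
    excess_kurtosis = mean(z^4) - 3,  # 0 for a normal distribution
    skewness = mean(z^3))             # 0 for a symmetric distribution
}

set.seed(1)
toy_profile <- rnorm(10000)  # simulated values, not taken from HTRU2
round(profile_stats(toy_profile), 2)
```

For normally distributed values such as these, both the excess kurtosis and the skewness should come out close to zero; a fat-tailed distribution would give a clearly positive excess kurtosis.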

Exploratory Data Analysis
----

In [1]:
#1. Load libraries

library(tidyverse)
library(repr)
library(tidymodels)
set.seed(1)

── Attaching packages ─────────────────────────────────────── tidyverse 1.3.1 ──

✔ ggplot2 3.3.6     ✔ purrr   0.3.4
✔ tibble  3.1.7     ✔ dplyr   1.0.9
✔ tidyr   1.2.0     ✔ stringr 1.4.0
✔ readr   2.1.2     ✔ forcats 0.5.1

── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

── Attaching packages ────────────────────────────────────── tidymodels 1.0.0 ──

✔ broom        1.0.0     ✔ rsample      1.0.0
✔ dials        1.0.0     ✔ tune         1.0.0
✔ infer        1.0.2     ✔ workflows    1.0.0

In [2]:
#2. Read in the pulsar star data (the file has no header row)

pulsar_star_data <- read_csv("https://raw.githubusercontent.com/madisongill/dsci-100-2023s-group-39-section-002/main/HTRU_2.csv", col_names = FALSE)

Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): X1, X2, X3, X4, X5, X6, X7, X8, X9

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


In [3]:
#3. Tidy data (add column names, condense if needed)

pulsar <- pulsar_star_data |>
    rename(mean = X1,
           std_dev = X2,
           kurt = X3,
           skew = X4,
           mean_dmsnr = X5,
           std_dev_dmsnr = X6,
           kurt_dmsnr = X7,
           skew_dmsnr = X8,
           class = X9) |>
    mutate(class = as_factor(class)) #class is a discrete label, so store it as a factor
       

In [4]:
#4. Split data into training and test sets

pulsar_split <- initial_split(pulsar, prop = 0.75, strata = class)
pulsar_train <- training(pulsar_split)
pulsar_test <- testing(pulsar_split)

head(pulsar_train)

| mean | std_dev | kurt | skew | mean_dmsnr | std_dev_dmsnr | kurt_dmsnr | skew_dmsnr | class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 102.50781 | 58.88243 | 0.46531815 | -0.5150879 | 1.6772575 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 103.01562 | 39.34165 | 0.32332837 | 1.0511644 | 3.1212375 | 21.744669 | 7.735822 | 63.17191 | 0 |
| 136.75 | 57.17845 | -0.06841464 | -0.6362384 | 3.6429766 | 20.95928 | 6.896499 | 53.59366 | 0 |
| 93.57031 | 46.69811 | 0.53190485 | 0.4167211 | 1.6362876 | 14.545074 | 10.621748 | 131.394 | 0 |
| 119.48438 | 48.76506 | 0.03146022 | -0.1121676 | 0.9991639 | 9.279612 | 19.20623 | 479.75657 | 0 |
| 130.38281 | 39.84406 | -0.15832276 | 0.3895404 | 1.2207358 | 14.378941 | 13.539456 | 198.23646 | 0 |
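After a split that uses `strata`, one quick sanity check is to compare the class proportions in the training and test sets. The sketch below does this on hypothetical toy data (the `toy` data frame and its 80/20 class mix are assumptions, not the HTRU2 data) so that it runs on its own.

```r
library(rsample)  # provides initial_split(), training(), testing()

set.seed(1)
# Hypothetical toy data: 1000 rows, 20% labelled "1" (not the HTRU2 data)
toy <- data.frame(
  x = rnorm(1000),
  class = factor(rep(c("0", "1"), times = c(800, 200)))
)

toy_split <- initial_split(toy, prop = 0.75, strata = class)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Proportion of each class in the two sets; the rows should be nearly identical
train_props <- prop.table(table(toy_train$class))
test_props <- prop.table(table(toy_test$class))
rbind(train = train_props, test = test_props)
```

The same check could be run on this project's data with `prop.table(table(pulsar_train$class))` and `prop.table(table(pulsar_test$class))`.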


<b>Summarizing and Visualizing the Data</b>

The next code cell reduces the training data to the integrated profile columns and summarizes their values. It shows a count of each class, the mean of each remaining predictor column, and a count of missing values.

In [47]:
reduced_train <- pulsar_train |>
    select(-mean_dmsnr:-skew_dmsnr) #selecting for only integrated profile values

class_count <- reduced_train |> #summarize counts of class labels
    group_by(class) |>
    summarize(count = n()) 

summary_train <- reduced_train |> #the average of all the predictor variables
    select(-class) |>
    map_df(mean)

check_na <- as_tibble(sum(is.na(reduced_train))) |> #checking for na values
    rename(missing_vals = value)

total_summary <- cbind(summary_train, check_na) #combine summary_train and check_na into one df


total_summary
class_count

| mean | std_dev | kurt | skew | missing_vals |
|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<int>` |
| 111.1363 | 46.56378 | 0.4749385 | 1.757917 | 0 |


| class | count |
|---|---|
| `<fct>` | `<int>` |
| 0 | 12200 |
| 1 | 1223 |
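The class counts show that the training set is heavily imbalanced. As a rough sketch, the counts can be turned into proportions and into the accuracy of a trivial classifier that always predicts the larger class. The names `class_counts`, `class_props`, and `baseline_accuracy` are introduced here for illustration only.

```r
# Class counts copied from the training-set summary (0 = non-pulsar, 1 = pulsar)
class_counts <- c(non_pulsar = 12200, pulsar = 1223)

# Share of the training set in each class
class_props <- class_counts / sum(class_counts)
round(class_props, 3)

# Accuracy of a trivial classifier that always predicts the larger class
baseline_accuracy <- max(class_props)
baseline_accuracy
```

Non-pulsars make up roughly 91% of the training rows, so a classifier would need to exceed that accuracy to improve on always guessing "non-pulsar".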


The following plot is an example of one way the research question can be answered. Using information from the stars' integrated profiles, and excluding that of the DM-SNR curves, a relationship can be seen between the excess kurtosis and the mean. It appears to be a strong negative linear relationship, with higher values of the mean and lower values of excess kurtosis corresponding to non-pulsar stars.




In [None]:
options(repr.plot.width = 9, repr.plot.height = 7)

pulsar_star_graph <- pulsar_train |>
    ggplot(aes(x = mean, y = kurt, color = class)) +
    geom_point(alpha = 0.4) +
    labs(x = "Mean of the Integrated Profile", y = "Excess kurtosis of the integrated profile", color = "Class") +
    scale_color_manual(labels = c("Non-pulsar", "Pulsar"),
                       values = c("orange", "steel blue")) +
    ggtitle("Determining Class from Mean and Excess\n Kurtosis of Integrated Profile") +
    theme(plot.title = element_text(hjust = 0.5)) +
    theme(text = element_text(size = 20))

pulsar_star_graph
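The visual impression of a negative linear relationship between the mean and the excess kurtosis could also be checked numerically with a Pearson correlation. The sketch below uses hypothetical simulated values (`toy_mean` and `toy_kurt` are assumptions that only stand in for the two columns) so that it runs without the dataset.

```r
set.seed(1)
# Hypothetical toy data standing in for two integrated-profile columns
toy_mean <- rnorm(500, mean = 110, sd = 25)
toy_kurt <- 5 - 0.04 * toy_mean + rnorm(500, sd = 0.5)

# Pearson correlation: values near -1 indicate a strong negative linear trend
r <- cor(toy_mean, toy_kurt)
r
```

On the project data, the equivalent call would be `cor(pulsar_train$mean, pulsar_train$kurt)`. Correlation only captures the linear part of a relationship, so it complements the scatter plot and does not replace it.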