# Applying Frequentist Inference to Pulsars Data

## Introduction

Here we carry on from the conclusion reached in the main document and obtain the values we will use to manipulate the data.

### What is Frequentist Inference?
- This is when you have a data set with two variables (one that varies continuously and one that takes only two values, such as mean_ip and class for us) and you manipulate the data set so that the values of the two groups sit further apart from each other, which makes later manipulation and training easier.
- You try to separate them enough that when you pick a random value, it is clearly closer to one group than the other, a pulsar rather than a non-pulsar in this case.
- To do that, we first need to decide on the percentage of False Positives (values we decide are pulsars but are not in reality) and the percentage of True Positives (values we decide are pulsars that really are pulsars). I am aiming for a 5% False Positive rate and an 80% True Positive rate for each variable in the dataset.
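To make the two rates concrete, here is a minimal sketch in base R. It assumes a simple rule that is not part of the original analysis: an observation is called a pulsar when `mean_ip` falls below a cutoff. The data frame `toy`, the helper `rates_at`, and the cutoff of 70 are all hypothetical and only illustrate how the two percentages are counted.

```r
# Toy values only (hypothetical, not taken from HTRU_2).
toy <- data.frame(
  mean_ip = c(20, 35, 50, 60, 95, 100, 110, 120, 130, 140),
  class   = factor(c(1, 1, 1, 0, 1, 0, 0, 0, 0, 0))
)

# Hypothetical decision rule: call an observation a pulsar
# when its mean_ip is below the cutoff.
rates_at <- function(data, cutoff) {
  predicted_pulsar <- data$mean_ip < cutoff
  is_pulsar <- data$class == 1
  c(
    # share of real non-pulsars that we wrongly call pulsars
    false_positive = mean(predicted_pulsar[!is_pulsar]),
    # share of real pulsars that we correctly call pulsars
    true_positive  = mean(predicted_pulsar[is_pulsar])
  )
}

rates_at(toy, cutoff = 70)
```

Moving the cutoff trades one rate against the other, which is why the two targets (5% and 80%) have to be chosen together.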



In [1]:
library(tidyverse)
library(repr)

“running command 'timedatectl' had status 1”
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.0     ✔ readr     2.1.4
✔ forcats   1.0.0     ✔ stringr   1.5.0
✔ ggplot2   3.4.1     ✔ tibble    3.2.1
✔ lubridate 1.9.2     ✔ tidyr     1.3.0
✔ purrr     1.0.1
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors


First, we load the dataset as we did before so that we can manipulate it here.

In [2]:
# Adding the column names to the dataset because they originally did not exist
columns <- c("mean_ip", "std_ip", "ek_ip", "sk_ip", "mean_dmsnr", "std_dmsnr", "ek_dmsnr", "sk_dmsnr", "class")

# reading the data, and converting class to factor type
pulsar_data <- read_csv("https://raw.githubusercontent.com/originalajitest/Pulsars_R/main/data/HTRU_2.csv", col_names = columns) |>
                    mutate(class = as.factor(class))

print("Table 1")
head(pulsar_data, 10)

Rows: 17898 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
dbl (9): mean_ip, std_ip, ek_ip, sk_ip, mean_dmsnr, std_dmsnr, ek_dmsnr, sk_...

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.


[1] "Table 1"


| mean_ip | std_ip | ek_ip | sk_ip | mean_dmsnr | std_dmsnr | ek_dmsnr | sk_dmsnr | class |
|---|---|---|---|---|---|---|---|---|
| `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<dbl>` | `<fct>` |
| 140.5625 | 55.68378 | -0.23457141 | -0.6996484 | 3.1998328 | 19.110426 | 7.975532 | 74.24222 | 0 |
| 102.50781 | 58.88243 | 0.46531815 | -0.5150879 | 1.6772575 | 14.860146 | 10.576487 | 127.39358 | 0 |
| 103.01562 | 39.34165 | 0.32332837 | 1.0511644 | 3.1212375 | 21.744669 | 7.735822 | 63.17191 | 0 |
| 136.75 | 57.17845 | -0.06841464 | -0.6362384 | 3.6429766 | 20.95928 | 6.896499 | 53.59366 | 0 |
| 88.72656 | 40.67223 | 0.60086608 | 1.1234917 | 1.1789298 | 11.46872 | 14.269573 | 252.56731 | 0 |
| 93.57031 | 46.69811 | 0.53190485 | 0.4167211 | 1.6362876 | 14.545074 | 10.621748 | 131.394 | 0 |
| 119.48438 | 48.76506 | 0.03146022 | -0.1121676 | 0.9991639 | 9.279612 | 19.20623 | 479.75657 | 0 |
| 130.38281 | 39.84406 | -0.15832276 | 0.3895404 | 1.2207358 | 14.378941 | 13.539456 | 198.23646 | 0 |
| 107.25 | 52.62708 | 0.45268802 | 0.1703474 | 2.3319398 | 14.486853 | 9.001004 | 107.97251 | 0 |
| 107.25781 | 39.49649 | 0.46588196 | 1.1628771 | 4.0794314 | 24.980418 | 7.39708 | 57.78474 | 0 |


### Each variable will go through at least two tests (hopefully the first two) to best determine the constants to use.
#### Tests:
<ol>
    <li>We will first put each variable through ${var}^{x}$ where $x\geq1$, aiming for a whole $x$.</li>
    <li>The second test uses $x * var$ where $x \geq 1$, aiming for any $x$.</li>
    <li>If the first test shows no sign of reaching a conclusion within the first few runs, we will run $var ^ {-x}$ where $x \geq 1$, aiming for a whole $x$. This is most applicable to variables where $|var|\leq1$.</li>
    <li>If the third test does not converge, we will try $var^{\frac{1}{x}}$ where $1\leq|x|\leq100$, aiming for a whole $x$.</li>
    <li>If the second test shows no sign of converging to a value within a few runs, we will try $var/x$ where $x\geq1$, aiming for any $x$. This has a low chance of working because it brings the values closer together, so it should be a last option if none of the tests above work.</li>
    <li>On the chance that none of these work, we will centre and scale the data and then repeat the tests to find the best fit for the data. As a last resort, we will ignore that variable and experiment with it later using the model generator.</li>
</ol>
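The five transformation families in the list above can be written down as plain R functions. This is only a sketch: the list name `transforms`, the toy vector `v`, and the choice $x = 2$ are assumptions made for illustration, not values from the analysis.

```r
# Hypothetical helper list: one function per test described above.
# `v` is the variable and `x` is the constant being searched for.
transforms <- list(
  power    = function(v, x) v^x,        # test 1: var^x
  multiply = function(v, x) x * v,      # test 2: x * var
  inverse  = function(v, x) v^(-x),     # test 3: var^(-x)
  root     = function(v, x) v^(1 / x),  # test 4: var^(1/x)
  divide   = function(v, x) v / x       # test 5: var / x
)

# Toy values only (not taken from HTRU_2).
v <- c(0.5, 2, 4)

# Apply every transformation with x = 2; one column per test.
sapply(transforms, function(f) f(v, 2))
```

One thing to keep in mind when running the fourth test: in R, raising a negative number to a fractional power returns `NaN`, and Table 1 shows that columns such as `ek_ip` and `sk_ip` contain negative values.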

Now we shall separate the data into one set per variable to help with experimenting.

In [3]:
mean_ip_set <- select(pulsar_data, mean_ip, class)
std_ip_set <- select(pulsar_data, std_ip, class)
ek_ip_set <- select(pulsar_data, ek_ip, class)
sk_ip_set <- select(pulsar_data, sk_ip, class)
mean_dmsnr_set <- select(pulsar_data, mean_dmsnr, class)
std_dmsnr_set <- select(pulsar_data, std_dmsnr, class)
ek_dmsnr_set <- select(pulsar_data, ek_dmsnr, class)
sk_dmsnr_set <- select(pulsar_data, sk_dmsnr, class)

In [None]:
str(mean_ip_set)

In [4]:
mean_ip_2 <- arrange(mean_ip_set,mean_ip)
mean_ip_non <- filter(mean_ip_2, class == 0)
mean_ip_pul <- filter(mean_ip_2, class == 1)
str(mean_ip_non)
str(mean_ip_pul)

tibble [16,259 × 2] (S3: tbl_df/tbl/data.frame)
 $ mean_ip: num [1:16259] 17.2 27.6 33.2 33.4 34.6 ...
 $ class  : Factor w/ 2 levels "0","1": 1 1 1 1 1 1 1 1 1 1 ...
tibble [1,639 × 2] (S3: tbl_df/tbl/data.frame)
 $ mean_ip: num [1:1639] 5.81 6.18 6.19 6.19 6.27 ...
 $ class  : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...


In [12]:
# for (x in 1:20) {
#     temp_frame = mutate(mean_ip_non, mean_ip_var = mean_ip^x)
#     print(slice(temp_frame, 813))
#     print(slice(temp_frame, 15446))
# }
# print("")
# print("Pulsars")
# for (x in 1:20) {
#     temp_frame = mutate(mean_ip_pul, mean_ip_var = mean_ip^x)
#     print(slice(temp_frame, 327))
#     print(slice(temp_frame, 1311))
# }

# This way works but I need to figure out how to better arrange this data to make this more presentable.
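One possible way to make the output of the commented-out loops more presentable is to collect the results in a single data frame instead of printing row by row. The sketch below is an assumption, not the author's method. The row indices used above (813 and 15446 of 16,259 rows, 327 and 1311 of 1,639 rows) appear to sit near the 5%/95% and 20%/80% positions of the sorted data, so the sketch swaps in `quantile()` for `slice()`. The helper `power_summary` and the vector `toy_values` are hypothetical.

```r
# Hypothetical helper: for each power x, report the lower and upper
# quantile of values^x as one row of a data frame.
power_summary <- function(values, powers, probs) {
  rows <- lapply(powers, function(x) {
    q <- quantile(values^x, probs = probs, names = FALSE)
    data.frame(power = x, lower = q[1], upper = q[2])
  })
  do.call(rbind, rows)
}

# Toy values only (not taken from HTRU_2).
toy_values <- c(10, 20, 30, 40, 50)

power_summary(toy_values, powers = 1:3, probs = c(0.05, 0.95))
```

With the real data this could be called as `power_summary(mean_ip_non$mean_ip, 1:20, c(0.05, 0.95))` and `power_summary(mean_ip_pul$mean_ip, 1:20, c(0.20, 0.80))`. Note that `quantile()` interpolates between neighbouring rows by default, so the numbers may differ slightly from the sliced rows, and the two approaches only agree in spirit when the values are positive, as the `mean_ip` values shown above are.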