# DSCI 100 09 Group 04 Project Report

### Introduction

"The HTRU2 dataset consists of pulsar candidates from the High Time Resolution Universe Survey (South)." Pulsars are a type of neutron star that rotates rapidly and emits radio signals. Their distinctive radio emission patterns are scientifically significant, so each pulsar candidate must be labeled individually as one of two classes: 0 (Negative/Spurious) or 1 (Positive/Real Pulsar).

With this dataset, we want to examine the relationships between the variables that describe the radio signals of pulsar candidates, visualize the strongest of these relationships, and ultimately build a KNN classification model that classifies new incoming observations, using the current dataset as the training set. More specifically, we want to visualize and observe how **skewness** and **kurtosis** can determine the class of a pulsar candidate.

### Methods

In our data analysis, we want to visualize and investigate the relationship between the *skewness* and *kurtosis* of the integrated profiles of pulsar candidates. Based on this investigation, we will build a classification model that classifies pulsar candidates from the given observations.
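
To make the two statistics concrete, here is a minimal sketch of moment-based skewness and excess kurtosis in base R. The helper names `moment_skewness()` and `excess_kurtosis()` and the two small vectors are illustrative assumptions, not part of the project code or the HTRU2 data, and the dataset authors may have used a slightly different estimator.

```r
# Moment-based skewness and excess kurtosis for a numeric vector (base R only).
# Helper names and example vectors are hypothetical, for illustration.
moment_skewness <- function(x) {
  m <- mean(x)
  s <- sqrt(mean((x - m)^2))    # population standard deviation
  mean((x - m)^3) / s^3         # third standardized moment
}

excess_kurtosis <- function(x) {
  m <- mean(x)
  s2 <- mean((x - m)^2)         # population variance
  mean((x - m)^4) / s2^2 - 3    # subtract 3 so a normal distribution scores 0
}

symmetric <- c(1, 2, 3, 4, 5)       # evenly spread around its mean
right_tailed <- c(1, 1, 1, 2, 10)   # one large value pulls the tail right

moment_skewness(symmetric)      # zero (up to rounding) for symmetric data
moment_skewness(right_tailed)   # positive for a long right tail
excess_kurtosis(symmetric)      # negative: flatter than a normal distribution
```

Skewness measures how asymmetric a distribution is, and excess kurtosis measures how heavy its tails are relative to a normal distribution.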

We start by loading and wrangling the data so that the observations we are dealing with are quick and easy to read and understand.

In [None]:
library(tidyverse) 
library(repr)
library(tidymodels)
set.seed(1000904)

#### Step 1: Loading Data

We load the data set from the web. In this case, we downloaded the data set, added it to a GitHub repository, and loaded the *.csv* file from there. Earlier, in our proposal, we had already wrangled the data into a tidier form. We also use the *glimpse()* function to see how many rows the data set contains.

In [None]:
pulsar_data <- read_csv("https://raw.githubusercontent.com/jesquachi/dsci-100-09-group-04-project/main/HTRU_2.csv", 
                         col_names = FALSE) |>   # read the data into R and give the columns readable names
                            rename(Mean_IP = X1,
                                   SD_IP = X2,
                                   Kurtosis_IP = X3,
                                   Skewness_IP = X4,
                                   Mean_DM_SNR = X5,
                                   SD_DM_SNR = X6,
                                   Kurtosis_DM_SNR = X7,
                                   Skewness_DM_SNR = X8,
                                   Class = X9) |>
                            mutate(Class = as_factor(Class)) |>
                            mutate(Class = fct_recode(Class, "Spurious" = "0", "Real Pulsar" = "1"))
glimpse(pulsar_data)

#### Step 2: Summarizing Data and Creating Training Set Summaries

First, we make a table of the distinct classes and the percentage of observations in each class. Then we drop any missing values and split the data set into training and testing sets. With the new training set, we make another table showing its classes and the percentage of observations in each class.

In [None]:
obs <- nrow(pulsar_data)
pulsar_data |>
  group_by(Class) |>
  summarize(
    count = n(),
    percentage = n() / obs * 100) 
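
As a rough illustration of the split-then-summarize step, the sketch below uses a small made-up tibble in place of `pulsar_data`. The 75/25 proportion, the toy values, the seed, and the `toy_*` names are all assumptions for illustration; only `initial_split()`, `training()`, and `testing()` are tidymodels functions. Stratifying on `Class` is one way to keep the class percentages in the training set close to those in the full data.

```r
library(tidyverse)
library(tidymodels)
set.seed(1)  # arbitrary seed for this toy example

# Toy data standing in for pulsar_data (values are made up)
toy_data <- tibble(
  Kurtosis_IP = rnorm(200),
  Class = factor(rep(c("Spurious", "Real Pulsar"), times = c(160, 40)))
)

# Split into training and testing sets, stratified by class
# (prop = 0.75 is an assumed proportion, not taken from the report)
toy_split <- initial_split(toy_data, prop = 0.75, strata = Class)
toy_train <- training(toy_split)
toy_test <- testing(toy_split)

# Class counts and percentages within the training set only
train_obs <- nrow(toy_train)
toy_train |>
  group_by(Class) |>
  summarize(
    count = n(),
    percentage = n() / train_obs * 100)
```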