# Classification of Pulsars: Distinguishing between Real and Candidate Pulsars Using Emission Patterns # 


# Introduction:
Provide some relevant background information on the topic so that someone unfamiliar with it will be prepared to understand the rest of your proposal.
Clearly state the question you will try to answer with your project.
Identify and describe the dataset that will be used to answer the question. 


- Pulsars are a rare type of neutron star that produce radio emission detectable on Earth. They are of scientific interest as probes of space-time, the interstellar medium, and states of matter. As a pulsar rotates, its emission beam sweeps across the sky, and each time the beam crosses Earth's line of sight it produces a detectable pattern of radio emission that repeats with the rotation. Each pulsar produces a slightly different emission pattern, so the statistics of a detected signal can help indicate whether it comes from a real pulsar or is a fake (spurious) detection.
- Our question is: “Is a candidate signal a real pulsar or a fake one as its emission beam crosses Earth’s line of sight?”
- To answer this question we will use the HTRU2 dataset from the UCI Machine Learning Repository, which contains pulsar candidates labelled as either real pulsars or fake (spurious) examples. Following the approach of the researchers who compiled the data, we treat the task as a binary classification problem.

# Preliminary exploratory data analysis: 
Demonstrate that the dataset can be read from the web into R

In [None]:
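library(tidyverse)   # separate(), mutate(), ggplot()
library(tidymodels)  # recipe(), prep(), bake()

# Download the HTRU2 archive to a temporary file
temp <- tempfile()
download.file("https://archive.ics.uci.edu/ml/machine-learning-databases/00372/HTRU2.zip", temp)

# Skip the first 11 lines (the ARFF header); each remaining line is read as one
# comma-separated string in column V1 and then split into named columns
data <- read.table(unz(temp, "HTRU_2.arff"), skip = 11) |>
        separate(col = "V1",
                 into = c("mean", "st_deviation", "excess_kurtosis", "skewness", "mean_curve", "st_deviation_curve", "excess_kurtosis_curve", "skewness_curve", "class"),
                 sep = ',',
                 convert = TRUE) |>
        mutate(class = as_factor(class))  # the class label is a category, not a number
unlink(temp)

data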


# Clean and wrangle your data into a tidy format
Using only training data, summarize the data in at least one table (this is exploratory data analysis). An example of a useful table could be one that reports the number of observations in each class, the means of the predictor variables you plan to use in your analysis and how many rows have missing data. 
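
The summaries that follow use a training set named `df_train`, which this section never creates. A minimal sketch of one way to create it is shown here. The helper name `split_pulsars` and the 75/25 proportion are assumptions for illustration, not something stated in the proposal; the split is stratified on `class` so that both sets keep roughly the same share of real pulsars.

```r
library(tidymodels)

# Hypothetical helper: stratified train/test split on the `class` column.
# `prop = 0.75` is an assumed value, not taken from the proposal.
split_pulsars <- function(df, prop = 0.75) {
  df_split <- initial_split(df, prop = prop, strata = class)
  list(train = training(df_split), test = testing(df_split))
}

# Possible use in the notebook (set a seed first so the split is reproducible):
# set.seed(1)
# splits   <- split_pulsars(data)
# df_train <- splits$train
# df_test  <- splits$test
```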

In [None]:
# Number and percentage of observations per class in the training data
pulsar_proportions <- df_train |>
                     group_by(class) |>
                     summarize(n = n()) |>
                     mutate(percent = 100 * n / nrow(df_train))
pulsar_proportions
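
The table above reports only class counts. The prompt also suggests predictor means and the number of rows with missing data; a sketch of how those could be computed follows. The helper name `summarize_predictors` is hypothetical, and the toy rows are made-up values (not HTRU2 data) included only so the block runs on its own. In the notebook the helper would be applied to `df_train`.

```r
library(dplyr)

# Hypothetical helper: per-class counts, means of the two planned predictors,
# and the number of rows that contain at least one missing value.
summarize_predictors <- function(df) {
  df |>
    mutate(has_na = if_any(everything(), is.na)) |>
    group_by(class) |>
    summarize(
      n = n(),
      mean_of_mean = mean(mean, na.rm = TRUE),
      mean_of_st_deviation = mean(st_deviation, na.rm = TRUE),
      rows_with_na = sum(has_na)
    )
}

# Toy rows with made-up values, standing in for df_train
toy_train <- tibble(
  mean = c(120, 110, NA, 30, 40),
  st_deviation = c(45, 50, 48, 30, 35),
  class = factor(c(0, 0, 0, 1, 1))
)

summarize_predictors(toy_train)
```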


# Using only training data, visualize the data with at least one plot relevant to the analysis you plan to do (this is exploratory data analysis). 
An example of a useful visualization could be one that compares the distributions of each of the predictor variables you plan to use in your analysis.

In [None]:
# Mean vs. standard deviation of the integrated profile,
# with colour indicating whether the candidate is a pulsar or not
distribution_plot <- df_train |>
                     ggplot(aes(x = mean, y = st_deviation)) +
                         geom_point(aes(color = class)) +
                         xlab("Mean of the integrated profile") +
                         ylab("Standard deviation of the integrated profile") +
                         labs(color = "Class")
distribution_plot
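# The same plot after centering and scaling the two predictors.
# Assumption: df_recipe is not defined elsewhere in this section, so one
# plausible definition (center and scale both predictors) is given here.
df_recipe <- recipe(class ~ mean + st_deviation, data = df_train) |>
                     step_center(all_predictors()) |>
                     step_scale(all_predictors())
standardized_train <- df_recipe |> prep() |> bake(df_train)

st_distribution_plot <- standardized_train |>
                     ggplot(aes(x = mean, y = st_deviation)) +
                         geom_point(aes(color = class)) +
                         xlab("Standardized mean of the integrated profile") +
                         ylab("Standardized standard deviation of the integrated profile") +
                         labs(color = "Class")
st_distribution_plot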


# Methods:
Explain how you will conduct your data analysis and which variables/columns you will use. Note - you do not need to use all variables/columns that exist in the raw data set. In fact, that's often not a good idea. For each variable think: is this a useful variable for prediction?

Describe at least one way that you will visualize the results


- Because our question is a binary classification problem, we will use the K-nearest neighbours classification algorithm to analyze the data. The data set has eight quantitative predictor variables and one class label. The predictors fall into two groups of four: statistics of the integrated pulse profile and statistics of the DM-SNR curve. We will use the mean and standard deviation of the integrated profile to predict whether a candidate is a real pulsar or a fake one. Because individual pulses vary slightly, the integrated profile averages the signal over many rotations, which gives each candidate a stable profile that can be used to separate real pulsars from fake candidates.

- We will use a scatter plot and a line graph to visualize the results. The scatter plot will show how the mean and standard deviation of the integrated profile differ between real pulsars and spurious candidates caused by radio frequency interference (RFI) or noise.
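
The analysis described above could be sketched with tidymodels as follows. This is an outline under stated assumptions, not the final analysis: the helper names `tune_knn` and `plot_accuracy`, the grid of K values, and the number of cross-validation folds are all hypothetical choices. Plotting cross-validated accuracy against K is one possible form the line graph could take. The code assumes the `kknn` package is installed and that `class` is stored as a factor.

```r
library(tidymodels)

# Hypothetical helper: tune K for a KNN classifier on the two chosen predictors.
# Assumes `train` has numeric columns `mean` and `st_deviation` and a factor `class`.
# The default K grid and fold count are assumed values.
tune_knn <- function(train, k_values = seq(1, 15, by = 2), folds = 5) {
  # Center and scale so both predictors contribute equally to distances
  knn_recipe <- recipe(class ~ mean + st_deviation, data = train) |>
    step_center(all_predictors()) |>
    step_scale(all_predictors())

  knn_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
    set_engine("kknn") |>
    set_mode("classification")

  workflow() |>
    add_recipe(knn_recipe) |>
    add_model(knn_spec) |>
    tune_grid(
      resamples = vfold_cv(train, v = folds, strata = class),
      grid = tibble(neighbors = k_values)
    ) |>
    collect_metrics() |>
    filter(.metric == "accuracy")
}

# Line graph of cross-validated accuracy against K
plot_accuracy <- function(knn_results) {
  ggplot(knn_results, aes(x = neighbors, y = mean)) +
    geom_point() +
    geom_line() +
    xlab("Number of neighbours (K)") +
    ylab("Cross-validated accuracy")
}

# Possible use in the notebook:
# set.seed(1)
# knn_results <- tune_knn(df_train)
# plot_accuracy(knn_results)
```

Since most HTRU2 candidates are spurious, accuracy alone may look flattering, so it may be worth also checking how many real pulsars the classifier recovers.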


# Expected outcomes and significance:

What do you expect to find?

- We expect to find patterns or features in the dataset that distinguish real pulsars from fake ones. These could include specific emission patterns, rotation periods, or other characteristics that are unique to real pulsars.

What impact could such findings have?

- A reliable classification method could improve the efficiency and accuracy of pulsar surveys, enabling astronomers to identify new pulsars more quickly and with greater confidence.
- It could also advance our understanding of these objects and the astrophysical processes that give rise to their emission.

What future questions could this lead to?

- This work could raise further research questions in the field of pulsar astrophysics.
- How can we use classification methods to better understand the pulsar population in the Milky Way and other galaxies?
- Can the classification method be extended to other types of pulsar data?
- Overall, successful studies of pulsar classification can open up new avenues of research in pulsar astrophysics and related fields, leading to discoveries and insights into the nature of the universe.


