# **Pulsar Star Classification**



<img src="https://media3.giphy.com/media/l3dj5M4YLaFww31V6/giphy.gif" width = "600"/>



In [47]:
### Run this cell before continuing

import pandas as pd
import altair as alt
from sklearn.model_selection import train_test_split
import statistics

## **Introduction**



**Background Information:**

Pulsars are rotating neutron stars that emit pulses of electromagnetic radiation \(radio waves\) at regular intervals. 

Investigating pulsar stars is important to the scientific community because they provide researchers with:

- Information about the physics of neutron stars \(via the light they emit\)
- Methods to accurately calculate cosmic distances \(via the emission of electromagnetic \[EM\] radiation at systematic intervals\)
- A “cosmic clock”; the intervals at which pulsar stars emit EM radiation can be used to help precisely tell time
- An instrument to detect gravitational waves



**Research Question:**

Given basic statistical features obtained from the integrated pulse profile and the DM\-SNR curve, which stars can be classified as pulsar stars?



**Dataset Description:**

- The dataset we will be using is the HTRU2 \(High Time Resolution Universe\) dataset from the University of California, Irvine \(UCI\) Machine Learning Repository.

**Statistical measurements of the integrated profile:**

- In pulsar stars, this profile will show regular periodicity and a very narrow pulse width on the integrated profile graph
- These two characteristics of the integrated profile can be used to distinguish between pulsar and non\-pulsar stars

**Statistical measurements of the DM\-SNR curve:**

- The DM\-SNR curve relates the dispersion measure and the signal\-to\-noise ratio
- The DM\-SNR curve can also be used to:
  - Estimate how far away a given pulsar is
  - Classify a given star as either pulsar or non\-pulsar

Both of these curves can be summarized using basic statistics \(mean, SD, excess kurtosis, and skewness\), which essentially “create” the eight parameters used to classify stars.
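To make the four summary statistics concrete, the sketch below computes them for two synthetic curves. The curves, the function name `curve_features`, and the use of the population SD are assumptions made for illustration; this is not the procedure used to build HTRU2.

```python
import numpy as np


def curve_features(values):
    """Summarize a 1-D curve by its mean, SD, excess kurtosis, and skewness."""
    x = np.asarray(values, dtype=float)
    mean = x.mean()
    sd = x.std()  # population SD (ddof=0); an assumption, not taken from HTRU2
    z = (x - mean) / sd  # standardized values
    return {
        "mean": float(mean),
        "sd": float(sd),
        "excess_kurtosis": float(np.mean(z**4) - 3.0),
        "skewness": float(np.mean(z**3)),
    }


rng = np.random.default_rng(0)
phase = np.linspace(0.0, 1.0, 256)

# Hypothetical pulsar-like profile: a narrow pulse on a flat, slightly noisy baseline.
pulse_like = np.exp(-0.5 * ((phase - 0.5) / 0.02) ** 2) + rng.normal(0.0, 0.01, phase.size)
# Hypothetical non-pulsar profile: noise only.
noise_like = rng.normal(0.0, 1.0, phase.size)

pulse_features = curve_features(pulse_like)
noise_features = curve_features(noise_like)
print(pulse_features)
print(noise_features)
```

In this toy setting the narrow pulse produces a much larger skewness and excess kurtosis than pure noise, which is the kind of contrast a classifier can use.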



## **Preliminary exploratory data analysis**



In [48]:
# Demonstrate that the dataset can be read into Python (here from a local copy, HTRU_2.csv):

pulsar_data = pd.read_csv("HTRU_2.csv", names=[
"Mean_of_Integrated_Profile",
"SD_of_Integrated_Profile",
"Excess_kurtosis_of_the_integrated_profile",
"skewness_of_integrated_profile",
"Mean_of_the_DM-SNR_curve",
"SD_of_the_DM-SNR_curve",
"Excess_kurtosis_of_the_DM-SNR_curve",
"Skewness_of_the_DM-SNR_curve",
"class"])

# Split dataset into training (75%) and testing (25%):

pulsar_training, pulsar_testing = train_test_split(
     pulsar_data, test_size=0.25, random_state=2000)

pulsar_data

| | Mean_of_Integrated_Profile | SD_of_Integrated_Profile | Excess_kurtosis_of_the_integrated_profile | skewness_of_integrated_profile | Mean_of_the_DM-SNR_curve | SD_of_the_DM-SNR_curve | Excess_kurtosis_of_the_DM-SNR_curve | Skewness_of_the_DM-SNR_curve | class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.562500 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.882430 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.393580 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.750000 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.959280 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.178930 | 11.468720 | 14.269573 | 252.567306 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17893 | 136.429688 | 59.847421 | -0.187846 | -0.738123 | 1.296823 | 12.166062 | 15.450260 | 285.931022 | 0 |
| 17894 | 122.554688 | 49.485605 | 0.127978 | 0.323061 | 16.409699 | 44.626893 | 2.945244 | 8.297092 | 0 |
| 17895 | 119.335938 | 59.935939 | 0.159363 | -0.743025 | 21.430602 | 58.872000 | 2.499517 | 4.595173 | 0 |
| 17896 | 114.507812 | 53.902400 | 0.201161 | -0.024789 | 1.946488 | 13.381731 | 10.007967 | 134.238910 | 0 |


In [49]:
# Summarize dataset in a table:

# NOTE: Because the variables are on very different scales,
# it will be necessary to use a standard scaler to give each variable appropriate weight in the classification model.
# We will split the summary into two separate tables for readability.

pulsar_training_class_0 = pulsar_training[pulsar_training["class"] == 0]
pulsar_training_class_1 = pulsar_training[pulsar_training["class"] == 1]

median_Mean_of_Integrated_Profile_0 = statistics.median(pulsar_training_class_0["Mean_of_Integrated_Profile"])
median_Mean_of_Integrated_Profile_1 = statistics.median(pulsar_training_class_1["Mean_of_Integrated_Profile"])
median_SD_of_Integrated_Profile_0 = statistics.median(pulsar_training_class_0["SD_of_Integrated_Profile"])
median_SD_of_Integrated_Profile_1 = statistics.median(pulsar_training_class_1["SD_of_Integrated_Profile"])
median_kurtosis_of_Integrated_Profile_0 = statistics.median(pulsar_training_class_0["Excess_kurtosis_of_the_integrated_profile"])
median_kurtosis_of_Integrated_Profile_1 = statistics.median(pulsar_training_class_1["Excess_kurtosis_of_the_integrated_profile"])
median_skewness_of_Integrated_Profile_0 = statistics.median(pulsar_training_class_0["skewness_of_integrated_profile"])
median_skewness_of_Integrated_Profile_1 = statistics.median(pulsar_training_class_1["skewness_of_integrated_profile"])

median_Mean_of_DM_SNR_0 = statistics.median(pulsar_training_class_0["Mean_of_the_DM-SNR_curve"])
median_Mean_of_DM_SNR_1 = statistics.median(pulsar_training_class_1["Mean_of_the_DM-SNR_curve"])
median_SD_of_DM_SNR_0 = statistics.median(pulsar_training_class_0["SD_of_the_DM-SNR_curve"])
median_SD_of_DM_SNR_1 = statistics.median(pulsar_training_class_1["SD_of_the_DM-SNR_curve"])
median_kurtosis_of_DM_SNR_0 = statistics.median(pulsar_training_class_0["Excess_kurtosis_of_the_DM-SNR_curve"])
median_kurtosis_of_DM_SNR_1 = statistics.median(pulsar_training_class_1["Excess_kurtosis_of_the_DM-SNR_curve"])
median_skewness_of_DM_SNR_0 = statistics.median(pulsar_training_class_0["Skewness_of_the_DM-SNR_curve"])
median_skewness_of_DM_SNR_1 = statistics.median(pulsar_training_class_1["Skewness_of_the_DM-SNR_curve"])

# Count the training observations in each class (0 = non-pulsar, 1 = pulsar).
# sort_index() keeps the rows in the order [0, 1], matching the median lists below.
class_counts = pulsar_training["class"].value_counts().sort_index()

class_dataframe = pd.DataFrame({"Class": class_counts.index, "Counts": class_counts.to_numpy()})
class_dataframe["Median of the Mean of Integrated Profile"] = [median_Mean_of_Integrated_Profile_0, median_Mean_of_Integrated_Profile_1]
class_dataframe["Median of the SD of Integrated Profile"] = [median_SD_of_Integrated_Profile_0, median_SD_of_Integrated_Profile_1]
class_dataframe["Median of the Excess Kurtosis of Integrated Profile"] = [median_kurtosis_of_Integrated_Profile_0, median_kurtosis_of_Integrated_Profile_1]
class_dataframe["Median of the Skewness of Integrated Profile"] = [median_skewness_of_Integrated_Profile_0, median_skewness_of_Integrated_Profile_1]

class_dataframe_1 = pd.DataFrame({"Class": class_counts.index, "Counts": class_counts.to_numpy()})
class_dataframe_1["Median of the Mean of DM Curve"] = [median_Mean_of_DM_SNR_0, median_Mean_of_DM_SNR_1]
class_dataframe_1["Median of the SD of DM Curve"] = [median_SD_of_DM_SNR_0, median_SD_of_DM_SNR_1]
class_dataframe_1["Median of the Excess Kurtosis of DM Curve"] = [median_kurtosis_of_DM_SNR_0, median_kurtosis_of_DM_SNR_1]
class_dataframe_1["Median of the Skewness of DM Curve"] = [median_skewness_of_DM_SNR_0, median_skewness_of_DM_SNR_1]

In [50]:
class_dataframe

| | Class | Counts | Median of the Mean of Integrated Profile | Median of the SD of Integrated Profile | Median of the Excess Kurtosis of Integrated Profile | Median of the Skewness of Integrated Profile |
|---|---|---|---|---|---|---|
| 0 | 0 | 12173 | 117.257812 | 47.583274 | 0.185586 | 0.118464 |
| 1 | 1 | 1250 | 53.15625 | 37.113536 | 3.003763 | 11.969828 |


In [51]:
class_dataframe_1

| | Class | Counts | Median of the Mean of DM Curve | Median of the SD of DM Curve | Median of the Excess Kurtosis of DM Curve | Median of the Skewness of DM Curve |
|---|---|---|---|---|---|---|
| 0 | 0 | 12173 | 2.633779 | 17.644442 | 8.764093 | 90.852031 |
| 1 | 1 | 1250 | 33.41806 | 59.248839 | 1.916913 | 2.61144 |
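The note in the summary cell says a standard scaler is needed because the variables sit on very different scales. As a small illustration, the sketch below standardizes two columns using the first five rows shown in the dataset preview. The variable names `sample` and `scaled` are illustrative, and five rows are used only to keep the example short.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# First five preview rows of two predictors that live on very different scales.
sample = pd.DataFrame({
    "Mean_of_Integrated_Profile": [140.562500, 102.507812, 103.015625, 136.750000, 88.726562],
    "Skewness_of_the_DM-SNR_curve": [74.242225, 127.393580, 63.171909, 53.593661, 252.567306],
})

# StandardScaler subtracts each column's mean and divides by its (population) SD.
scaled = pd.DataFrame(StandardScaler().fit_transform(sample), columns=sample.columns)

print(sample.describe().loc[["mean", "std"]])
print(scaled.round(3))
```

After scaling, each column has mean 0 and unit SD, so no single predictor dominates the distance calculation in a K\-nearest neighbors model simply because of its units.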


In [52]:
# Summarize dataset in a plot:

pulsar_chart_mean_integrated = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("Mean_of_Integrated_Profile", title = "Mean of Integrated Profile"),
        )
).properties(
    title='Distribution of Mean of Integrated Profile'
)
pulsar_chart_SD_integrated = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("SD_of_Integrated_Profile", title = "SD of Integrated Profile"),
        )
).properties(
    title='Distribution of SD of Integrated Profile'
)
pulsar_chart_kurtosis_integrated = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("Excess_kurtosis_of_the_integrated_profile", title = "Excess kurtosis of Integrated Profile"),
        )
).properties(
    title='Distribution of Excess Kurtosis of Integrated Profile'
)
pulsar_chart_skewness_integrated = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("skewness_of_integrated_profile", title = "Skewness of Integrated Profile"),
        )
).properties(
    title='Distribution of Skewness of Integrated Profile'
)
pulsar_chart_integrated = alt.hconcat(pulsar_chart_mean_integrated, pulsar_chart_SD_integrated, pulsar_chart_kurtosis_integrated, pulsar_chart_skewness_integrated)

In [53]:
pulsar_chart_mean_DM_SNR = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("Mean_of_the_DM-SNR_curve", title = "Mean of DM-SNR curve"),
        )
).properties(
    title='Distribution of Mean of DM-SNR Curve'
)
pulsar_chart_SD_DM_SNR = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("SD_of_the_DM-SNR_curve", title = "SD of DM-SNR curve"),
        )
).properties(
    title='Distribution of SD of DM-SNR Curve'
)
pulsar_chart_kurtosis_DM_SNR = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("Excess_kurtosis_of_the_DM-SNR_curve", title = "Excess kurtosis of DM-SNR curve"),
        )
).properties(
    title='Distribution of Excess Kurtosis of DM-SNR Curve'
)
pulsar_chart_skewness_DM_SNR = (alt.Chart(pulsar_training[0:5000])
                .mark_boxplot()
                .encode(
        x=alt.X("class", title ="Class of Star", axis = alt.Axis(format='d', tickCount=1)),
        y=alt.Y("Skewness_of_the_DM-SNR_curve", title = "Skewness of DM-SNR curve"),
        )
).properties(
    title='Distribution of Skewness of DM-SNR Curve'
)
pulsar_chart_DM_SNR = alt.hconcat(pulsar_chart_mean_DM_SNR, pulsar_chart_SD_DM_SNR, pulsar_chart_kurtosis_DM_SNR, pulsar_chart_skewness_DM_SNR)
pulsar_chart = alt.vconcat(pulsar_chart_integrated, pulsar_chart_DM_SNR)
pulsar_chart.properties(title = alt.TitleParams(text = 'Distribution of Each Feature by Class',
                                            fontSize = 40,
                                            anchor='middle')
                        )



The box plots above show a clear difference between pulsar \(1\) and non\-pulsar \(0\) stars for each predictor variable.


## **Methods**



**Variables:**

1. Mean of the integrated profile.
2. Standard deviation of the integrated profile.
3. Excess kurtosis of the integrated profile.
4. Skewness of the integrated profile.
5. Mean of the DM\-SNR curve.
6. Standard deviation of the DM\-SNR curve.
7. Excess kurtosis of the DM\-SNR curve.
8. Skewness of the DM\-SNR curve.
9. Class \(Pulsar = 1, Non\-Pulsar = 0\) 



**Data will be analyzed using the process outlined below:** 

1. Split the data into training \(75%\) and testing \(25%\) datasets 
2. Create a preprocessor that centers and scales the data \(for all variables\)
3. Create a KNeighborsClassifier\(\) object without fixing the number of neighbors \(e.g., "pulsar\_spec"\)
4. Put both the preprocessor and the KNeighborsClassifier\(\) object into a pipeline
5. Make a parameter grid dictionary that outlines the range of K values to be tested
6. Create two objects: one holding the first 8 variables as the predictors and one holding the last variable, “class”, as the target
7. Perform 5\-fold cross\-validation using both the GridSearchCV and RandomizedSearchCV methods
8. Create two data frames that display the results \(e.g., mean test score\) of each cross\-validation method
9. Construct a visualization of each method’s results by plotting the mean test score \(y\-axis\) vs the number of neighbors \(x\-axis\)
10. Choose the K value \(from the plot\) with the highest mean test score
11. Run the model with the chosen K value to answer the research question
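The steps above can be sketched end to end as follows. This is a minimal sketch on synthetic data generated with `make_classification`, not on HTRU2; the sample size, the range of K values, `n_iter`, and names such as `pulsar_pipe` are assumptions chosen so the example runs quickly.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for HTRU2: 8 numeric predictors and an imbalanced binary class.
X, y = make_classification(
    n_samples=600, n_features=8, n_informative=5, n_redundant=1,
    weights=[0.9, 0.1], random_state=2000,
)

# Step 1: 75% training / 25% testing split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=2000
)

# Steps 2-4: scaler and classifier (K left unspecified) combined in a pipeline.
pulsar_pipe = make_pipeline(StandardScaler(), KNeighborsClassifier())

# Step 5: candidate K values (an illustrative range).
k_values = list(range(1, 31, 2))
param_grid = {"kneighborsclassifier__n_neighbors": k_values}

# Step 7: 5-fold cross-validation with both search strategies.
grid_search = GridSearchCV(pulsar_pipe, param_grid, cv=5).fit(X_train, y_train)
random_search = RandomizedSearchCV(
    pulsar_pipe, param_grid, n_iter=8, cv=5, random_state=2000
).fit(X_train, y_train)

# Step 8: one results table per method.
result_columns = ["param_kneighborsclassifier__n_neighbors", "mean_test_score"]
grid_results = pd.DataFrame(grid_search.cv_results_)[result_columns]
random_results = pd.DataFrame(random_search.cv_results_)[result_columns]

# Steps 10-11: pick the best K and score the refit model on the held-out test set.
best_k = grid_search.best_params_["kneighborsclassifier__n_neighbors"]
test_accuracy = grid_search.score(X_test, y_test)
print(best_k, round(test_accuracy, 3))
```

Here the best K is read from `best_params_` rather than from a plot; plotting `mean_test_score` against the number of neighbors from `grid_results` and `random_results` would give the visual comparison described in step 9.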

**Note:** Because we will be using all 8 variables in our classification model, a K\-Nearest Neighbors plot cannot be visualized effectively, as it would require 8 dimensions. Instead, we use 8 box plots, each showing the distribution of one variable within each of the two classes of star. This also allows us to make generalizations about the effect each variable will have on the classification of an unknown star.



<span style='font-size:x-large'>**Methods FAQ**</span>



**Why did we choose to build a K\-nearest neighbors model?**

We chose to build a K\-nearest neighbors model because it is a well\-regarded and frequently used algorithm for answering classification\-based questions.

**Why & how are we going to do hyperparameter optimization?**

Hyperparameter optimization is necessary for choosing the K value that gives the best cross\-validated accuracy. To carry out hyperparameter optimization, we will use both the GridSearchCV and RandomizedSearchCV methods. We will then compare the graphical output of each method to choose the best value of K.
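The two search methods differ in which candidate K values they try. The sketch below uses scikit-learn's `ParameterGrid` and `ParameterSampler` to list the candidates each strategy would evaluate; the range of K values and `n_iter=5` are assumptions for illustration.

```python
from sklearn.model_selection import ParameterGrid, ParameterSampler

# Illustrative candidate K values.
k_values = list(range(1, 31, 2))
param_grid = {"n_neighbors": k_values}

# Grid search evaluates every candidate in the grid.
grid_candidates = [params["n_neighbors"] for params in ParameterGrid(param_grid)]

# Randomized search evaluates only n_iter candidates drawn from the same grid.
random_candidates = [
    params["n_neighbors"]
    for params in ParameterSampler(param_grid, n_iter=5, random_state=2000)
]

print(grid_candidates)
print(random_candidates)
```

Grid search is exhaustive but costs one cross\-validation run per candidate, while randomized search trades coverage for speed, which is why comparing both outputs can be informative.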

**Which predictor variables are we planning on using and why?**

We plan to use all of the variables given in the dataset because each is useful for classifying a star, even though they are all statistical measurements of the same raw instrumental data. Additionally, the dataset source suggests using all 8 variables in a classification model.



## **Expected Outcomes and Significance**



It is expected that the classification model will be able to identify with reasonable accuracy the class of newly found stars using basic statistical measurements of the star’s DM\-SNR and integrated profile curves, making it significantly easier for astronomers to find pulsar stars amidst the noise of space. Because pulsars are so useful in their cosmic applications, a model like this would have wide\-ranging impacts due to its ability to classify new star observations. This could lead to new questions about the reason pulsars form in certain regions of the universe, what that says about the conditions in those regions, and any scientific implications those findings would present.
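One way to judge what "reasonable accuracy" means here is to compare against a trivial baseline. The sketch below uses the training\-set class counts from the summary tables to compute the accuracy of always predicting the majority class; the variable names are illustrative.

```python
# Training-set class counts taken from the summary tables above.
non_pulsar_count = 12173
pulsar_count = 1250

total = non_pulsar_count + pulsar_count
pulsar_share = pulsar_count / total
majority_baseline = non_pulsar_count / total

print(f"Pulsar share of training set: {pulsar_share:.3f}")
print(f"Accuracy of always predicting class 0: {majority_baseline:.3f}")
```

Because non\-pulsars make up most of the training set, a useful model should clearly exceed this baseline rather than merely achieve a high raw accuracy.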


<span style='font-size:x-large'>**Works Cited**</span>



Lea, R. \(2016, April 22\). What are pulsars? Space.com. Retrieved March 10, 2023, from https://www.space.com/32661\-pulsars.html 

Lyon, R. J., Stappers, B. W., Cooper, S., Brooke, J. M., & Knowles, J. D. \(2016\). Fifty years of pulsar candidate selection: from simple filters to a new principled real\-time classification approach. MNRAS, 459, 1104–1123. https://doi.org/10.1093/mnras/stw656 

Lyon, R. J. \(2016\). Why Are Pulsars Hard To Find? \(thesis\). 

Nm, N. \(2012, February 20\). Pulsars: The Universe's gift to physics. Astronomy.com. Retrieved March 10, 2023, from https://astronomy.com/news/2012/02/pulsars\-\-\-the\-universes\-gift\-to\-physics 

UCI Machine Learning Repository: HTRU2 data set. \(n.d.\). Retrieved March 10, 2023, from https://archive.ics.uci.edu/ml/datasets/HTRU2 

