In [1]:
import numpy as np
import pandas as pd

# Exploratory Data Analysis: Identifying Pulsar Stars

Name: Trevor Kling

Course: CPSC 392 - Introduction to Data Science

Last Date Modified: 12/03/2019

## Introduction
A "pulsar" is a rare type of neutron star that emits electromagnetic radiation which is perceptible from Earth.  However, like any radiation from space, this is accompanied by a large degree of cosmic background radiation, which can make it difficult to perceive where and when pulsars occur.  The following data set documents a variety of stars which met the conditions to be pulsars, along with labels indicating whether they truly were pulsars.  Pulsars offer the ability to document a variety of large-scale cosmic phenomena like gravitational waves or the curvature of spacetime.  Thus, correctly identifying where pulsars exist is an important problem for astronomers.

## Importing the Data Set

The dataset used for this analysis was obtained from the paper "Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach" by R. J. Lyon et al., originally published in Monthly Notices of the Royal Astronomical Society 459 [[1](https://arxiv.org/pdf/1603.05166.pdf)].  The actual files were retrieved from the UCI Machine Learning Repository at https://archive.ics.uci.edu/ml/datasets/HTRU2 on 12/3/2019.

In [2]:
data = pd.read_csv("pulsar_stars.csv", header=0)

In [3]:
data

| | Mean of the integrated profile | Standard deviation of the integrated profile | Excess kurtosis of the integrated profile | Skewness of the integrated profile | Mean of the DM-SNR curve | Standard deviation of the DM-SNR curve | Excess kurtosis of the DM-SNR curve | Skewness of the DM-SNR curve | target_class |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 140.562500 | 55.683782 | -0.234571 | -0.699648 | 3.199833 | 19.110426 | 7.975532 | 74.242225 | 0 |
| 1 | 102.507812 | 58.882430 | 0.465318 | -0.515088 | 1.677258 | 14.860146 | 10.576487 | 127.393580 | 0 |
| 2 | 103.015625 | 39.341649 | 0.323328 | 1.051164 | 3.121237 | 21.744669 | 7.735822 | 63.171909 | 0 |
| 3 | 136.750000 | 57.178449 | -0.068415 | -0.636238 | 3.642977 | 20.959280 | 6.896499 | 53.593661 | 0 |
| 4 | 88.726562 | 40.672225 | 0.600866 | 1.123492 | 1.178930 | 11.468720 | 14.269573 | 252.567306 | 0 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 17893 | 136.429688 | 59.847421 | -0.187846 | -0.738123 | 1.296823 | 12.166062 | 15.450260 | 285.931022 | 0 |
| 17894 | 122.554688 | 49.485605 | 0.127978 | 0.323061 | 16.409699 | 44.626893 | 2.945244 | 8.297092 | 0 |
| 17895 | 119.335938 | 59.935939 | 0.159363 | -0.743025 | 21.430602 | 58.872000 | 2.499517 | 4.595173 | 0 |
| 17896 | 114.507812 | 53.902400 | 0.201161 | -0.024789 | 1.946488 | 13.381731 | 10.007967 | 134.238910 | 0 |


Each row of the dataset describes one candidate using summary statistics (the mean, the standard deviation, the excess kurtosis, and the skewness) of two signals, along with a `target_class` label.  The primary goal of the dataset is to allow for the production of a model that identifies pulsars based on these statistics.

## Analysis: Basic Trends

The first questions of interest simply relate to understanding the data set as a whole.

### Data Set Size and Types

In [4]:
data.shape

(17898, 9)

The pandas `shape` attribute gives us the dimensions of our data matrix.  From this, we can clearly see that there are 17898 rows (or entries) and 9 columns (or attributes).
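
Since this subsection is also about types, the `dtypes` attribute is a natural companion to `shape`.  The following is a minimal sketch rather than a cell from the original analysis: `sample` is a stand-in built from the first two rows displayed above, restricted to three columns, so that the snippet runs on its own.  In the notebook itself the same attributes could be read from `data` directly.

```python
import pandas as pd

# Stand-in for `data`: the first two rows shown above, three columns only.
sample = pd.DataFrame({
    "Mean of the integrated profile": [140.562500, 102.507812],
    "Skewness of the DM-SNR curve": [74.242225, 127.393580],
    "target_class": [0, 0],
})

print(sample.shape)   # (rows, columns)
print(sample.dtypes)  # one dtype per column
```

On this stand-in, the two statistics are floating-point columns and `target_class` is an integer column.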

### Breaking Down the Attributes

The attributes in this data matrix can be broadly classified into two types: those relating to the integrated profile, and those relating to the DM-SNR curve.  The remaining column, `target_class`, is the label.

#### The Integrated Profile

The *integrated profile*, or more formally the *integrated pulse profile*, can be thought of as a pulsar's cosmic fingerprint [[2](http://ipta.phys.wvu.edu/files/student-week-2017/IPTA2017_KuoLiu_pulsartiming.pdf)].  A pulsar generates periodic pulsation signals as it rotates, but each individual signal is often too weak to detect and can vary greatly in shape.  Instead, we integrate over a large number of individual pulse periods to gain an overall profile of what the emissions from the pulsar look like over a substantial period of time [[3](https://www.cv.nrao.edu/course/astr534/PulsarTiming.html)].  This allows us to determine an average pulse, which is relatively stable over time.  These profiles can be simple or complex, and allow us to identify pulsars by their radiation alone.

#### DM-SNR Curve

The *Dispersion Measure-Signal to Noise Ratio Curve*, or *DM-SNR Curve* for short, is another way of characterizing a pulsar candidate.  As a pulsar's electromagnetic radiation propagates through the interstellar medium (a diffuse plasma made up of free electrons and ions), dispersion by the free electrons causes a frequency-dependent delay in the radiation.  This causes pulsar emissions to become temporally distorted, by an amount proportional to a quantity called the Dispersion Measure.  This is the density of free electrons integrated along the straight line between our observatory and the pulsar.  Importantly, for a pulsar, correcting for the right Dispersion Measure maximizes the Signal to Noise ratio of the received radiation, and the DM-SNR curve records how that ratio changes across trial Dispersion Measure values.
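
To make the frequency dependence concrete, here is a small sketch of the standard cold-plasma dispersion relation, in which the delay scales with the Dispersion Measure and with the inverse square of the observing frequency.  It is not part of the original analysis: the function name and the example values (a Dispersion Measure of 100 pc cm^-3 observed between 1.2 and 1.5 GHz) are assumptions chosen purely for illustration.

```python
# Dispersion constant in ms * GHz^2 per (pc cm^-3), the usual cold-plasma value.
K_DM_MS = 4.148808


def dispersion_delay_ms(dm, freq_lo_ghz, freq_hi_ghz):
    """Delay (ms) of the lower frequency's arrival relative to the higher one."""
    return K_DM_MS * dm * (freq_lo_ghz ** -2 - freq_hi_ghz ** -2)


# Hypothetical example: DM = 100 pc cm^-3 across a band from 1.2 to 1.5 GHz.
delay = dispersion_delay_ms(100.0, 1.2, 1.5)
print(f"{delay:.2f} ms")
```

With these illustrative values the low-frequency edge of the band arrives roughly 104 ms after the high-frequency edge, which is why an uncorrected pulse appears smeared in time.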

#### The Actual Attributes

In the data set, both the Integrated Profile and the DM-SNR curve have 4 attributes: *Mean*, *Standard Deviation*, *Excess Kurtosis*, and *Skewness*.  The mean and standard deviation are exactly what you would expect; they're the arithmetic mean and standard deviation of the observed values.  The kurtosis and skewness may be less familiar; these are what are known as *shape* summary statistics.  Both of these variables describe the shape of a probability distribution for the given instance; a high kurtosis indicates heavy tails and therefore more extreme outliers, and a large skewness indicates a longer tail on one side of the distribution.
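
The two shape statistics are easier to read with a reference point.  The sketch below computes them with pandas on synthetic samples (not the pulsar data): a symmetric normal sample and a right-skewed exponential sample.  The sample sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Synthetic samples, not the pulsar data: one symmetric, one right-skewed.
symmetric = pd.Series(rng.normal(size=10_000))
right_skewed = pd.Series(rng.exponential(size=10_000))

for name, values in [("normal", symmetric), ("exponential", right_skewed)]:
    # Series.skew() is the sample skewness; Series.kurt() is the excess
    # kurtosis, so a normal distribution scores close to 0 on both.
    print(f"{name}: skew={values.skew():.2f}, excess kurtosis={values.kurt():.2f}")
```

The normal sample should land near zero on both statistics, while the exponential sample shows a clearly positive skewness and excess kurtosis, matching its long right tail.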

![Example of Skew](https://upload.wikimedia.org/wikipedia/commons/f/f8/Negative_and_positive_skew_diagrams_%28English%29.svg)

**Figure 1: An Example of Skew [[4](https://en.wikipedia.org/wiki/Skewness)]**

### Checking for Missing Values

In [5]:
nulls = data[data.isnull().any(axis=1)]
print(nulls)

Empty DataFrame
Columns: [ Mean of the integrated profile,  Standard deviation of the integrated profile,  Excess kurtosis of the integrated profile,  Skewness of the integrated profile,  Mean of the DM-SNR curve,  Standard deviation of the DM-SNR curve,  Excess kurtosis of the DM-SNR curve,  Skewness of the DM-SNR curve, target_class]
Index: []


There are no entries in this data set which contain null values.  This is unsurprising, as this data set was produced for scientific study based on averages and has likely already undergone basic analysis.
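
An alternative check, sketched here on a stand-in frame rather than the real data, is to count nulls per column with `isnull().sum()`.  The stand-in deliberately contains one missing value so the count is visible.  It also strips whitespace from the column names, since the printed output above shows a leading space in front of most of them; whether to rename the real columns this way is a judgment call.

```python
import numpy as np
import pandas as pd

# Stand-in frame with one deliberately missing value (the real data has none).
# The leading space in the first column name mirrors the output above.
sample = pd.DataFrame({
    " Mean of the integrated profile": [140.562500, np.nan, 103.015625],
    "target_class": [0, 0, 0],
})

sample.columns = sample.columns.str.strip()  # drop stray whitespace in names
missing_per_column = sample.isnull().sum()   # count of nulls in each column
print(missing_per_column)
```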

## Extended Analysis: Preferred Indicator

As the data set presents us with two groups of similar attributes, a natural question is which group provides a better indicator that a particular instance is a pulsar.  One method of determining this is to compare the summary statistics within the set of instances that *are* pulsars against those of the instances that *are not* pulsars.  For a good discriminator, the expectation would be observably different mean values with low standard deviations; in other words, the distributions of each should have little overlap.
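
One way this comparison might be set up is with a `groupby` on the label.  The sketch below uses a synthetic stand-in, not the pulsar data: `toy`, its single `feature` column, the class means, and the separation score are all made up for illustration.  The score (gap between class means divided by the average within-class standard deviation) is one simple choice among many, not an established method from the source.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Synthetic stand-in, NOT the pulsar data: one made-up feature whose mean
# differs by class, plus a binary label shaped like `target_class`.
labels = np.repeat([0, 1], 500)
feature = np.where(labels == 0,
                   rng.normal(110, 15, 1000),
                   rng.normal(60, 15, 1000))
toy = pd.DataFrame({"feature": feature, "target_class": labels})

# Mean and standard deviation of the feature within each class.
summary = toy.groupby("target_class")["feature"].agg(["mean", "std"])
print(summary)

# Hypothetical separation score: gap between the class means in units of the
# average within-class standard deviation (larger means less overlap).
gap = abs(summary.loc[1, "mean"] - summary.loc[0, "mean"])
score = gap / summary["std"].mean()
print(f"separation score: {score:.2f}")
```

Applied to the real data, the same `groupby` could be run over every attribute at once, and the scores for the integrated-profile columns compared against those for the DM-SNR columns.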

## Sources
[1] 
Paper: R. J. Lyon, B. W. Stappers, S. Cooper, J. M. Brooke, J. D. Knowles, Fifty Years of Pulsar Candidate Selection: From simple filters to a new principled real-time classification approach, Monthly Notices of the Royal Astronomical Society 459 (1), 1104-1123, DOI: 10.1093/mnras/stw656 https://arxiv.org/pdf/1603.05166.pdf

Data Set: R. J. Lyon, HTRU2, DOI: 10.6084/m9.figshare.3080389.v1.

[2] K. Liu, IPTA 2017 Student Workshop, http://ipta.phys.wvu.edu/files/student-week-2017/IPTA2017_KuoLiu_pulsartiming.pdf

[3] National Radio Astronomy Observatory, 2012 https://www.cv.nrao.edu/course/astr534/PulsarTiming.html

[4] Wikimedia Commons, Accessed 12/4/2019 https://en.wikipedia.org/wiki/File:Negative_and_positive_skew_diagrams_(English).svg