# Planetary Data Analysis Notebook

The first part of this notebook focuses solely on the k-nearest neighbors (kNN) supervised learning technique, applied to NASA exoplanet data.

In [32]:
# Import necessary libraries
from random import seed
from random import randrange
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
import statistics
import seaborn as sns
from sklearn.preprocessing import RobustScaler
from scipy import stats

For resampling, I will use a train-test split (tts) and bundle its logic into a function.

In [2]:
seed(1235)
def tts(data, split = 0.80):
    train = list()
    train_size = split*len(data)
    data_copy = list(data)
    while len(train) < train_size :
        index = randrange(len(data_copy))
        train.append(data_copy.pop(index))
    return np.array(train), np.array(data_copy)
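
As a quick check of how `tts` behaves, the sketch below repeats the function and applies it to a small array. The `toy` array is a hypothetical stand-in for the real data, not part of the original notebook.

```python
from random import seed, randrange

import numpy as np


def tts(data, split=0.80):
    """Randomly move rows into the training set until it holds `split` of the data."""
    train = []
    train_size = split * len(data)
    data_copy = list(data)
    while len(train) < train_size:
        index = randrange(len(data_copy))
        train.append(data_copy.pop(index))  # sampled without replacement
    return np.array(train), np.array(data_copy)


seed(1235)
toy = np.arange(20).reshape(10, 2)  # hypothetical toy data: 10 rows, 2 columns
train, test = tts(toy, split=0.80)
print(train.shape, test.shape)  # (8, 2) (2, 2)
```

Because rows are popped from the copy, every row ends up in exactly one of the two sets.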

In [3]:
# Assign the path of the data to a variable.
# Initialize the first data frame with the raw data.
path = 'J:\\datasets\\PS_2022.02.27_18.46.12.csv'
raw_data = pd.read_csv(path)

The first fourteen rows need to be removed. These rows contain the archive's notes and remarks.

In [4]:
raw_data = raw_data.drop(raw_data.index[range(14)])
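
An alternative, sketched here under assumptions, is to let `pd.read_csv` skip the note lines at load time. This assumes the note lines start with `#`, as archive exports typically do. The `sample` text below is a hypothetical miniature, not the actual file.

```python
import io

import pandas as pd

# Hypothetical miniature of an archive export: '#' note lines, then a header row.
sample = io.StringIO(
    "# This file was produced by the NASA Exoplanet Archive\n"
    "# COLUMN pl_name: Planet Name\n"
    "pl_name,pl_orbper,pl_orbeccen\n"
    "11 Com b,326.03,0.231\n"
    "11 UMi b,516.21997,0.08\n"
)

# comment="#" makes the parser ignore fully commented lines,
# so the first remaining line becomes the header.
demo = pd.read_csv(sample, comment="#")
print(demo.columns.tolist())
print(demo.dtypes)
```

With the notes skipped at read time, the numeric columns are parsed as numbers directly, so no later renaming or type conversion is needed.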

In [5]:
raw_data

Unnamed: 0,# This file was produced by the NASA Exoplanet Archive http://exoplanetarchive.ipac.caltech.edu,Unnamed: 1,Unnamed: 2,Unnamed: 3,Unnamed: 4,Unnamed: 5,Unnamed: 6,Unnamed: 7,Unnamed: 8
14,pl_name,pl_orbper,pl_orbpererr1,pl_orbpererr2,pl_orbperlim,pl_orbeccen,pl_orbeccenerr1,pl_orbeccenerr2,pl_orbeccenlim
15,11 Com b,326.03,0.32,-0.32,0,0.231,0.005,-0.005,0
16,11 Com b,,,,,,,,
17,11 UMi b,,,,,,,,
18,11 UMi b,516.21997,3.2,-3.2,0,0.08,0.03,-0.03,0
...,...,...,...,...,...,...,...,...,...
32126,ups And d,1282.41,0.93,-0.94,0,0.294,0.011,-0.012,0
32127,ups And d,1281.507,1.055,-1.055,0,0.316,0.006,-0.006,0
32128,ups Leo b,385.2,2.8,-1.3,0,0.32,0.134,-0.218,0
32129,xi Aql b,,,,,,,,


In [6]:
raw_data.columns

Index(['# This file was produced by the NASA Exoplanet Archive  http://exoplanetarchive.ipac.caltech.edu',
       'Unnamed: 1', 'Unnamed: 2', 'Unnamed: 3', 'Unnamed: 4', 'Unnamed: 5',
       'Unnamed: 6', 'Unnamed: 7', 'Unnamed: 8'],
      dtype='object')

Notice that the columns are not properly named. Next, create the first data frame from the columns of interest.

In [7]:
df1 = raw_data.iloc[:, [0, 1, 5]]
df1.reset_index(drop=True, inplace=True)

The columns of interest need to be renamed to make the analysis clearer, and the row that holds the original column names needs to be removed.

In [8]:
df1 = df1.rename(columns={'# This file was produced by the NASA Exoplanet Archive  http://exoplanetarchive.ipac.caltech.edu':'planetname', 'Unnamed: 1':'orbitperiod', 'Unnamed: 5':'eccentricity'})
df1 = df1.drop([0])
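
The same rename-and-drop step can also be done by promoting the first row to the header. The sketch below is a minimal illustration on a hypothetical three-row frame named `demo`; its column labels are made up for the example.

```python
import pandas as pd

# Hypothetical frame whose first row holds the real column names.
demo = pd.DataFrame(
    [
        ["pl_name", "pl_orbper", "pl_orbeccen"],
        ["11 Com b", "326.03", "0.231"],
        ["11 UMi b", "516.21997", "0.08"],
    ],
    columns=["note", "Unnamed: 1", "Unnamed: 2"],
)

demo.columns = demo.iloc[0].tolist()          # use the first row as the header
demo = demo.iloc[1:].reset_index(drop=True)   # drop that row from the data
print(demo)
```

The values are still strings at this point, exactly as in the notebook, so a numeric conversion is still required afterwards.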

In [9]:
df1

Unnamed: 0,planetname,orbitperiod,eccentricity
1,11 Com b,326.03,0.231
2,11 Com b,,
3,11 UMi b,,
4,11 UMi b,516.21997,0.08
5,11 UMi b,516.22,0.08
...,...,...,...
32112,ups And d,1282.41,0.294
32113,ups And d,1281.507,0.316
32114,ups Leo b,385.2,0.32
32115,xi Aql b,,


Data frame 2 drops every row of the first data frame that contains an NA value.

In [10]:
df2 = df1.dropna()

In [11]:
df2

Unnamed: 0,planetname,orbitperiod,eccentricity
1,11 Com b,326.03,0.231
4,11 UMi b,516.21997,0.08
5,11 UMi b,516.22,0.08
7,14 And b,185.84,0
8,14 Her b,1766.41,0.3674
...,...,...,...
32111,ups And d,1274.6,0.242
32112,ups And d,1282.41,0.294
32113,ups And d,1281.507,0.316
32114,ups Leo b,385.2,0.32


The following checks the data types of the column entries in the latest data frame.

In [12]:
print(type(df2.iloc[0, 1])) # <- Notice the entries are strings and not numeric.
print(type(df2.iloc[0, 2]))

<class 'str'>
<class 'str'>


Now convert the 'orbitperiod' and 'eccentricity' columns from strings to numeric data types.

In [13]:
#df2.loc[:, 'orbitperiod'] = df2.loc[:, 'orbitperiod'].apply(pd.to_numeric)
#df2.loc[:, 'eccentricity'] = df2.loc[:, 'eccentricity'].apply(pd.to_numeric)

In [14]:
x1 = df2.iloc[:, 1].apply(pd.to_numeric)
x2 = df2.iloc[:, 2].apply(pd.to_numeric)
x3 = df2.iloc[:, 0] # <- this is the column with the planet names
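
A minimal sketch of the same conversion done in place on a copy, so that no separate series need to be concatenated afterwards. The `demo` frame and its "n/a" entry are hypothetical. The option `errors="coerce"` is an addition not used in the notebook; it turns unparseable strings into NaN instead of raising an error.

```python
import pandas as pd

# Hypothetical string-typed data, including one value that is not a number.
demo = pd.DataFrame({
    "planetname": ["11 Com b", "14 And b", "bad row"],
    "orbitperiod": ["326.03", "185.84", "n/a"],
    "eccentricity": ["0.231", "0", "0.1"],
})

numeric_cols = ["orbitperiod", "eccentricity"]
converted = demo.copy()  # work on a copy to avoid modifying a slice of another frame
converted[numeric_cols] = converted[numeric_cols].apply(pd.to_numeric, errors="coerce")
print(converted.dtypes)
```

Working on an explicit copy avoids the `SettingWithCopyWarning` that assigning into a filtered frame such as `df2` can trigger.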

In [15]:
x3

1         11 Com b
4         11 UMi b
5         11 UMi b
7         14 And b
8         14 Her b
           ...    
32111    ups And d
32112    ups And d
32113    ups And d
32114    ups Leo b
32116     xi Aql b
Name: planetname, Length: 16511, dtype: object

In [16]:
df2_1 = pd.concat([x3,x1,x2], axis = 1)

In [17]:
df2_1

Unnamed: 0,planetname,orbitperiod,eccentricity
1,11 Com b,326.03000,0.2310
4,11 UMi b,516.21997,0.0800
5,11 UMi b,516.22000,0.0800
7,14 And b,185.84000,0.0000
8,14 Her b,1766.41000,0.3674
...,...,...,...
32111,ups And d,1274.60000,0.2420
32112,ups And d,1282.41000,0.2940
32113,ups And d,1281.50700,0.3160
32114,ups Leo b,385.20000,0.3200


In [18]:
df2_1.columns

Index(['planetname', 'orbitperiod', 'eccentricity'], dtype='object')

In [19]:
df2

Unnamed: 0,planetname,orbitperiod,eccentricity
1,11 Com b,326.03,0.231
4,11 UMi b,516.21997,0.08
5,11 UMi b,516.22,0.08
7,14 And b,185.84,0
8,14 Her b,1766.41,0.3674
...,...,...,...
32111,ups And d,1274.6,0.242
32112,ups And d,1282.41,0.294
32113,ups And d,1281.507,0.316
32114,ups Leo b,385.2,0.32


In [20]:
print(df2.dtypes['eccentricity'])

object


In [21]:
print(type(df2))

<class 'pandas.core.frame.DataFrame'>


Data frame 3 averages the multiple entries for each planet. The following lines take the arithmetic mean of the two columns of interest and concatenate the results into the third data frame.

First, create intermediate variables to store the resulting averages.

In [22]:
x4 = df2_1.groupby('planetname')['eccentricity'].mean()
x5 = df2_1.groupby('planetname')['orbitperiod'].mean()
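
The two group means can also be computed in a single call. The sketch below shows this on a hypothetical three-row frame named `demo`. `reset_index()` turns the group key back into an ordinary 'planetname' column.

```python
import pandas as pd

# Hypothetical data with a repeated planet entry.
demo = pd.DataFrame({
    "planetname": ["11 UMi b", "11 UMi b", "14 And b"],
    "orbitperiod": [516.0, 517.0, 185.84],
    "eccentricity": [0.08, 0.10, 0.0],
})

averaged = (
    demo.groupby("planetname")[["eccentricity", "orbitperiod"]]
    .mean()          # arithmetic mean per planet, for both columns at once
    .reset_index()   # move 'planetname' from the index back into a column
)
print(averaged)
```

This yields one row per planet with both averages side by side, which is the same shape that the separate `groupby` calls followed by `pd.concat` produce.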

In [23]:
x4
x5

planetname
11 Com b       326.030000
11 UMi b       516.219985
14 And b       185.840000
14 Her b      1766.601670
16 Cyg B b     799.375000
                 ...     
ups And b        4.616229
ups And c      240.728533
ups And d     1285.346167
ups Leo b      385.200000
xi Aql b       136.750000
Name: orbitperiod, Length: 4350, dtype: float64

In [24]:
type(x4)

pandas.core.series.Series

In [25]:
type(x5)

pandas.core.series.Series

In [26]:
df3 = pd.concat([x4, x5], axis = 1)
df3

Unnamed: 0_level_0,eccentricity,orbitperiod
planetname,Unnamed: 1_level_1,Unnamed: 2_level_1
11 Com b,0.231000,326.030000
11 UMi b,0.080000,516.219985
14 And b,0.000000,185.840000
14 Her b,0.362233,1766.601670
16 Cyg B b,0.676033,799.375000
...,...,...
ups And b,0.030200,4.616229
ups And c,0.238933,240.728533
ups And d,0.281117,1285.346167
ups Leo b,0.320000,385.200000


In [29]:
df3['planetname'] = df3.index
df3

Unnamed: 0_level_0,eccentricity,orbitperiod,planetname
planetname,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
11 Com b,0.231000,326.030000,11 Com b
11 UMi b,0.080000,516.219985,11 UMi b
14 And b,0.000000,185.840000,14 And b
14 Her b,0.362233,1766.601670,14 Her b
16 Cyg B b,0.676033,799.375000,16 Cyg B b
...,...,...,...
ups And b,0.030200,4.616229,ups And b
ups And c,0.238933,240.728533,ups And c
ups And d,0.281117,1285.346167,ups And d
ups Leo b,0.320000,385.200000,ups Leo b


df3 has the final version of the 'raw' data frame.

In [30]:
df3.columns

Index(['eccentricity', 'orbitperiod', 'planetname'], dtype='object')

This part explicitly computes the median and IQR of the two columns that serve as the variables.

In [36]:
ecce = df3['eccentricity']
orbper = df3['orbitperiod']
eccen_med = statistics.median(list(ecce))
orbper_med = statistics.median(list(orbper))
print('\nThe median of the orbital period: %.2f' % orbper_med)
print('\nThe median of the eccentricity: %.2f' % eccen_med)
iqr_ecce = stats.iqr(ecce, interpolation = 'midpoint')
iqr_op = stats.iqr(orbper, interpolation = 'midpoint')
print('\nThe interquartile range of the eccentricity: %.2f' % iqr_ecce)
print('\nThe interquartile range of the orbital period: %.2f' % iqr_op)


The median of the orbital period: 12.24

The median of the eccentricity: 0.00

The interquartile range of the eccentricity: 0.07

The interquartile range of the orbital period: 39.74
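
The median and IQR are the statistics that scikit-learn's `RobustScaler` uses by default to center and scale data. Whether the notebook goes on to use them that way is an assumption based only on the import at the top. The sketch below works through the arithmetic by hand on a hypothetical array with one outlier. It uses NumPy's default linear percentile interpolation rather than the 'midpoint' option used above; for this array both give the same quartiles.

```python
import numpy as np

values = np.array([1.0, 2.0, 3.0, 4.0, 100.0])  # hypothetical data with one outlier

median = np.median(values)                # middle value of the sorted data
q1, q3 = np.percentile(values, [25, 75])  # first and third quartiles
iqr = q3 - q1                             # spread of the middle half of the data

# Robust scaling: center on the median, divide by the IQR.
scaled = (values - median) / iqr
print(median, iqr)
print(scaled)
```

The outlier of 100 barely affects the median and IQR, so the four ordinary values stay within one unit of zero after scaling.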
