
#### **This section is borrowed from Kaggle for reading purposes only**

# The Search for New Earths


The data describe the change in flux (light intensity) of several thousand stars. Each star has a binary label of 2 or 1. A label of 2 indicates that the star is confirmed to have at least one exoplanet in orbit; some observations are in fact multi-planet systems.

As you can imagine, planets themselves do not emit light, but the stars that they orbit do. If said star is watched over several months or years, there may be a regular 'dimming' of the flux (the light intensity). This is evidence that there may be an orbiting body around the star; such a star could be considered to be a 'candidate' system. Further study of our candidate system, for example by a satellite that captures light at a different wavelength, could solidify the belief that the candidate can in fact be 'confirmed'.

Flux Diagram

In the above diagram, a star is orbited by a blue planet. At t = 1, the starlight intensity drops because it is partially obscured by the planet, given our position. The starlight rises back to its original value at t = 2. The graph in each box shows the measured flux (light intensity) at each time interval.
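To make the dimming pattern concrete, here is a minimal sketch that generates a synthetic, noiseless flux series with a box-shaped dip at regular intervals. The function name, the parameter values, and the box shape are all illustrative assumptions; real Kepler light curves are noisier and their transits are not perfectly rectangular.

```python
import numpy as np

def synthetic_transit_flux(n_points=200, period=50, duration=5, depth=0.01, baseline=1.0):
    """Return (t, flux) where flux drops by `depth * baseline` for `duration`
    samples at the start of every `period` samples. Hypothetical helper."""
    t = np.arange(n_points)
    flux = np.full(n_points, baseline, dtype=float)
    in_transit = (t % period) < duration   # True while the planet crosses the star
    flux[in_transit] -= depth * baseline   # partial obscuration lowers the flux
    return t, flux

t, flux = synthetic_transit_flux()
print("samples in transit:", int((flux < 1.0).sum()))
print("min flux:", flux.min(), "max flux:", flux.max())
```

With the default values the series contains four transits of five samples each, which is the kind of regular dip a classifier would need to pick out.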

#### Description
##### Trainset:
- 5087 rows or observations.
- 3198 columns or features.
- Column 1 is the label vector. Columns 2 - 3198 are the flux values over time.
- 37 confirmed exoplanet-stars and 5050 non-exoplanet-stars.

##### Testset:
- 570 rows or observations.
- 3198 columns or features.
- Column 1 is the label vector. Columns 2 - 3198 are the flux values over time.
- 5 confirmed exoplanet-stars and 565 non-exoplanet-stars.
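The counts above imply a heavily imbalanced problem. As a quick arithmetic check, the sketch below computes the share of exoplanet-stars in each split from the figures quoted in the description; the dictionary layout and function name are illustrative only.

```python
# Class counts copied from the dataset description above.
counts = {
    "train": {"exoplanet": 37, "non_exoplanet": 5050},
    "test": {"exoplanet": 5, "non_exoplanet": 565},
}

def positive_fraction(split):
    """Fraction of rows in a split that are confirmed exoplanet-stars."""
    total = split["exoplanet"] + split["non_exoplanet"]
    return split["exoplanet"] / total

for name, split in counts.items():
    print(f"{name}: {positive_fraction(split):.2%} exoplanet-stars")
```

Both fractions come out below 1% (about 0.73% for the trainset and 0.88% for the testset), so a model that always predicts "non-exoplanet" would already score above 99% accuracy.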

#### Acknowledgements
The data presented here are cleaned and are derived from observations made by the NASA Kepler space telescope. The Mission is ongoing - for instance data from Campaign 12 was released on 8th March 2017. Over 99% of this dataset originates from Campaign 3. To boost the number of exoplanet-stars in the dataset, confirmed exoplanets from other campaigns were also included.

To be clear, all observations from Campaign 3 are included. And in addition to this, confirmed exoplanet-stars from other campaigns are also included.

The datasets were prepared late-summer 2016.

Campaign 3 was used because 'it was felt' that this Campaign is unlikely to contain any undiscovered (i.e. wrongly labelled) exoplanets.

NASA open-sources the original Kepler Mission data and it is hosted at the Mikulski Archive. After the data are beamed down to Earth, NASA applies de-noising algorithms to remove artefacts generated by the telescope. The data, in the .fits format, are stored online. With the help of a seasoned astrophysicist, anyone with an internet connection can embark on a search to find and retrieve the datafiles from the Archive.

The cover image is copyright © 2011 by Dan Lessmann

In [1]:
import pandas as pd
import numpy as np

https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data

#### Features
- LABEL -> 2 is an exoplanet-star and 1 is a non-exoplanet-star.
- FLUX.1-FLUX.3197 -> the light intensity recorded for each star, each column at a different point in time.

In [2]:
# Data Source: https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data

#url = 'https://drive.google.com/file/d/1SCBlL68q-kAEPIcj_4-U11tKhqHyAn2n/view?usp=sharing' # Train dataset
url = 'https://drive.google.com/file/d/1uVh-NN7wdK5GuxDrXcWyVg9nYdlrLmUv/view?usp=sharing'  # Test dataset
path = 'https://drive.google.com/uc?export=download&id='+url.split('/')[-2]
#df_exoTrain = pd.read_csv(path)
df_exoTest = pd.read_csv(path)
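The cell above rewrites a Google Drive share link into a direct-download link by pulling out the file id with `url.split('/')[-2]`. A small sketch of the same idea as a reusable function is shown below; `drive_download_url` is a hypothetical helper name, and it assumes the share link has the `.../file/d/<id>/view...` shape used in this notebook.

```python
def drive_download_url(share_url):
    """Turn a '.../file/d/<id>/view?...' share link into a direct-download URL.

    Hypothetical helper: it looks for the path segment after 'd' instead of
    relying on the id being the second-to-last segment.
    """
    parts = share_url.split("/")
    file_id = parts[parts.index("d") + 1]
    return "https://drive.google.com/uc?export=download&id=" + file_id

test_url = "https://drive.google.com/file/d/1uVh-NN7wdK5GuxDrXcWyVg9nYdlrLmUv/view?usp=sharing"
print(drive_download_url(test_url))
```

The returned string can be passed to `pd.read_csv` in the same way as `path` in the cell above.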

In [3]:
df_exoTest.head(5)

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,119.88,100.21,86.46,48.68,46.12,39.39,18.57,6.98,6.63,...,14.52,19.29,14.44,-1.62,13.33,45.5,31.93,35.78,269.43,57.72
1,2,5736.59,5699.98,5717.16,5692.73,5663.83,5631.16,5626.39,5569.47,5550.44,...,-581.91,-984.09,-1230.89,-1600.45,-1824.53,-2061.17,-2265.98,-2366.19,-2294.86,-2034.72
2,2,844.48,817.49,770.07,675.01,605.52,499.45,440.77,362.95,207.27,...,17.82,-51.66,-48.29,-59.99,-82.1,-174.54,-95.23,-162.68,-36.79,30.63
3,2,-826.0,-827.31,-846.12,-836.03,-745.5,-784.69,-791.22,-746.5,-709.53,...,122.34,93.03,93.03,68.81,9.81,20.75,20.25,-120.81,-257.56,-215.41
4,2,-39.57,-15.88,-9.16,-6.37,-16.13,-24.05,-0.9,-45.2,-5.04,...,-37.87,-61.85,-27.15,-21.18,-33.76,-85.34,-81.46,-61.98,-69.34,-17.84


In [4]:
df_exoTrain = pd.read_csv('exoTrain.csv')
df_exoTrain.head(5)

Unnamed: 0,LABEL,FLUX.1,FLUX.2,FLUX.3,FLUX.4,FLUX.5,FLUX.6,FLUX.7,FLUX.8,FLUX.9,...,FLUX.3188,FLUX.3189,FLUX.3190,FLUX.3191,FLUX.3192,FLUX.3193,FLUX.3194,FLUX.3195,FLUX.3196,FLUX.3197
0,2,93.85,83.81,20.1,-26.98,-39.56,-124.71,-135.18,-96.27,-79.89,...,-78.07,-102.15,-102.15,25.13,48.57,92.54,39.32,61.42,5.08,-39.54
1,2,-38.88,-33.83,-58.54,-40.09,-79.31,-72.81,-86.55,-85.33,-83.97,...,-3.28,-32.21,-32.21,-24.89,-4.86,0.76,-11.7,6.46,16.0,19.93
2,2,532.64,535.92,513.73,496.92,456.45,466.0,464.5,486.39,436.56,...,-71.69,13.31,13.31,-29.89,-20.88,5.06,-11.8,-28.91,-70.02,-96.67
3,2,326.52,347.39,302.35,298.13,317.74,312.7,322.33,311.31,312.42,...,5.71,-3.73,-3.73,30.05,20.03,-12.67,-8.77,-17.31,-17.35,13.98
4,2,-1107.21,-1112.59,-1118.95,-1095.1,-1057.55,-1034.48,-998.34,-1022.71,-989.57,...,-594.37,-401.66,-401.66,-357.24,-443.76,-438.54,-399.71,-384.65,-411.79,-510.54


In [5]:
df_exoTrain.shape

(5087, 3198)

In [6]:
df_exoTest.shape

(570, 3198)

In [7]:
feature_list = list(df_exoTrain.columns.values)
len(feature_list)

3198

In [8]:
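target=feature_list[0] # --> the first column, 'LABEL', is the target
# Note: this slice stops one short of the end, so the last column (FLUX.3197) is left out of the features
features=list(feature_list[1:(len(feature_list)-1)]) # --> FLUX.1 to FLUX.3196 as features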
print(target)

LABEL


In [9]:
len(features)

3196

In [10]:
print(features[0],features[(len(features)-1)])

FLUX.1 FLUX.3196
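The feature list ends at FLUX.3196 because the slice in cell 8 stops at `len(feature_list)-1`, which drops the final column. The sketch below reproduces this with a stand-in list of column names (built here for illustration, not read from the CSV) and compares it with an open-ended slice that would keep all 3197 flux columns. Whether the last column was dropped on purpose is not stated in the notebook.

```python
# Stand-in for list(df_exoTrain.columns.values): LABEL followed by FLUX.1 .. FLUX.3197.
feature_list = ["LABEL"] + [f"FLUX.{i}" for i in range(1, 3198)]

# Slice as written in the notebook: the end index excludes the last column.
as_in_notebook = feature_list[1:(len(feature_list) - 1)]

# Open-ended slice: every column after the label.
all_flux = feature_list[1:]

print(len(as_in_notebook), as_in_notebook[-1])
print(len(all_flux), all_flux[-1])
```

The results reported in the remaining cells were produced with the 3196-column list, so switching to the open-ended slice would require re-running them.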


In [50]:
import datetime
from sklearn.ensemble import RandomForestClassifier
start_time1= datetime.datetime.now()
df_exoTrain_RF = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0)
df_exoTrain_RF.fit(df_exoTrain[features],df_exoTrain[target])
end_time1 = datetime.datetime.now()
print("Started at:", start_time1, "\n  Ended at:", end_time1, "\nTime taken: ", end_time1-start_time1)

Started at: 2020-07-07 20:46:26.964631 
  Ended at: 2020-07-07 20:46:34.144627 
Time taken:  0:00:07.179996


In [51]:
start_time2= datetime.datetime.now()
pred_rf=df_exoTrain_RF.predict(df_exoTest[features])
end_time2 = datetime.datetime.now()
#print("Prediction RF:",pred_rf)
actual_rf= df_exoTest['LABEL'].values
#print("Actual RF:", actual_rf)
print("Started at:", start_time2, "\n  Ended at:", end_time2, "\nTime taken: ", end_time2-start_time2)

Started at: 2020-07-07 20:46:37.107102 
  Ended at: 2020-07-07 20:46:37.273100 
Time taken:  0:00:00.165998


In [52]:
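from sklearn.metrics import accuracy_score, confusion_matrix
# Rows are actual labels, columns are predicted labels.
# Index 0 corresponds to LABEL 1 (non-exoplanet), index 1 to LABEL 2 (exoplanet).
pd.DataFrame(confusion_matrix(actual_rf,pred_rf))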

Unnamed: 0,0,1
0,565,0
1,5,0
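Reading the matrix above with rows as actual labels and columns as predictions, the model assigned every test star to the non-exoplanet class. The following sketch derives accuracy, recall, and precision for the exoplanet class directly from the four printed counts. It treats LABEL 2 as the positive class, which is an assumption about how the result should be interpreted, not something the notebook states.

```python
# Confusion matrix copied from the output above.
# Row 0 = actual LABEL 1 (non-exoplanet), row 1 = actual LABEL 2 (exoplanet).
cm = [[565, 0],
      [5, 0]]

tn, fp = cm[0]   # actual non-exoplanet: predicted non-exoplanet / predicted exoplanet
fn, tp = cm[1]   # actual exoplanet:     predicted non-exoplanet / predicted exoplanet

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn) if (tp + fn) else float("nan")
precision = tp / (tp + fp) if (tp + fp) else float("nan")  # undefined when nothing is predicted positive

print(f"accuracy:  {accuracy:.4f}")
print(f"recall:    {recall:.4f}")
print(f"precision: {precision}")
```

Accuracy comes out at roughly 99.1% even though none of the 5 exoplanet-stars is found (recall 0), and precision is undefined because no star was predicted as an exoplanet. On a dataset this imbalanced, accuracy alone says little about how well the classifier detects exoplanets.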
