# Supervised learning

For this investigation I selected a dataset from the NASA Kepler space telescope that records the change in light intensity of several observed stars over time. The label indicates whether a star has an orbiting exoplanet (in the code below, label 2 marks stars with an exoplanet and label 1 marks stars without one), while the features are the light intensity at different points in time.

## Importing and requirements

First, we will import our main libraries. The dataset is already split into training and test portions, so the `train_test_split` method is not necessary. The data is quite large and ships as a compressed archive, so we extract it before loading and delete the extracted files afterwards.

In [66]:
import pandas as pd
import numpy as np
from subprocess import call

call(['tar', 'xf', 'kepler-labelled-time-series-data.tar.gz'])
# pandas opens the files itself, so no explicit file handles are needed
df_train = pd.read_csv('./kepler-labelled-time-series-data/exoTrain.csv')
df_test = pd.read_csv('./kepler-labelled-time-series-data/exoTest.csv')

call(['rm', '-rf', 'kepler-labelled-time-series-data/'])

0
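
As an alternative to shelling out to `tar` and `rm`, the CSV files can be read straight from the archive with the standard-library `tarfile` module. The following is a minimal sketch, not the notebook's method: `read_csv_from_tar` is a hypothetical helper, and the demo builds a tiny throwaway archive so the example runs on its own. For the real archive, the member name (for example `kepler-labelled-time-series-data/exoTrain.csv`) is an assumption and should be checked with `archive.getnames()`.

```python
import io
import os
import tarfile
import tempfile

import pandas as pd


def read_csv_from_tar(archive_path, member_name):
    """Read one CSV member of a tar archive into a DataFrame without extracting it to disk."""
    with tarfile.open(archive_path, "r:*") as archive:
        member = archive.extractfile(member_name)  # raises KeyError if the name is unknown
        if member is None:
            raise FileNotFoundError(f"{member_name} is not a regular file in {archive_path}")
        return pd.read_csv(member)


# Demo on a hypothetical two-row archive created in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    archive_path = os.path.join(tmp, "demo.tar.gz")
    payload = b"LABEL,FLUX.1,FLUX.2\n2,93.85,83.81\n1,-38.88,-33.83\n"
    with tarfile.open(archive_path, "w:gz") as archive:
        info = tarfile.TarInfo(name="demo/exoTrain.csv")
        info.size = len(payload)
        archive.addfile(info, io.BytesIO(payload))
    demo_df = read_csv_from_tar(archive_path, "demo/exoTrain.csv")

print(demo_df.shape)
```

Reading from the archive this way leaves nothing on disk to clean up afterwards.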

In [84]:
exoplanets = df_train.loc[df_train['LABEL'] == 2]
noexoplanets = df_train.loc[df_train['LABEL'] == 1].sample(n=len(exoplanets)*10)
print(len(exoplanets))
print(len(noexoplanets))

# Combine both groups into a single undersampled frame.
# Note: shape is an attribute, so it is read as newdf.shape (calling it raises a TypeError).
newdf = pd.concat([exoplanets, noexoplanets])

37
370

In [21]:
print(df_train.shape)
print(df_test.shape)
print(df_train.columns[df_train.isna().any()].tolist())
print(df_test.columns[df_test.isna().any()].tolist())

df_train_described = df_train.describe()
print(df_train_described.loc['min',:].min())
print(df_train_described.loc['max',:].max())
print(df_train_described)
print(df_test.describe())

(5087, 3198)
(570, 3198)
[]
[]
-2385019.12
4299288.0
             LABEL        FLUX.1        FLUX.2        FLUX.3        FLUX.4  \
count  5087.000000  5.087000e+03  5.087000e+03  5.087000e+03  5.087000e+03   
mean      1.007273  1.445054e+02  1.285778e+02  1.471348e+02  1.561512e+02   
std       0.084982  2.150669e+04  2.179717e+04  2.191309e+04  2.223366e+04   
min       1.000000 -2.278563e+05 -3.154408e+05 -2.840018e+05 -2.340069e+05   
25%       1.000000 -4.234000e+01 -3.952000e+01 -3.850500e+01 -3.505000e+01   
50%       1.000000 -7.100000e-01 -8.900000e-01 -7.400000e-01 -4.000000e-01   
75%       1.000000  4.825500e+01  4.428500e+01  4.232500e+01  3.976500e+01   
max       2.000000  1.439240e+06  1.453319e+06  1.468429e+06  1.495750e+06   

             FLUX.5        FLUX.6        FLUX.7        FLUX.8        FLUX.9  \
count  5.087000e+03  5.087000e+03  5.087000e+03  5.087000e+03  5.087000e+03   
mean   1.561477e+02  1.469646e+02  1.168380e+02  1.144983e+02  1.228639e+02   
std    

We can see that we have data for 5087 stars in the training set and 570 in the test set, each with 3197 light intensity measurements (the 3198 columns include the label).

No NaN values at all.

Flux values in the training set range from about -2.4e6 to 4.3e6.
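
The same checks (size, missing values, flux range) can also be computed directly instead of being read off the `describe()` output. This is a minimal sketch on a hypothetical toy frame that only mimics the column layout (`LABEL` plus `FLUX.n` columns); `summarize` is an illustrative helper and is not part of the notebook.

```python
import numpy as np
import pandas as pd


def summarize(df, label_col="LABEL"):
    """Return basic sanity-check facts for a labelled flux table."""
    flux = df.drop(columns=[label_col])
    values = flux.to_numpy(dtype=float)
    return {
        "n_stars": len(df),
        "n_flux_columns": flux.shape[1],
        "has_nan": bool(np.isnan(values).any()),
        "flux_min": float(values.min()),
        "flux_max": float(values.max()),
        "class_counts": df[label_col].value_counts().to_dict(),
    }


# Hypothetical toy data: 4 stars, 3 flux readings each, one star with label 2.
toy = pd.DataFrame(
    {
        "LABEL": [1, 1, 2, 1],
        "FLUX.1": [10.5, -3.0, 120.0, 0.0],
        "FLUX.2": [11.0, -2.5, -80.0, 0.5],
        "FLUX.3": [9.5, -3.5, 95.0, -0.5],
    }
)
summary = summarize(toy)
print(summary)
```

Calling `summarize(df_train)` in the notebook would report the class counts alongside the other figures, which makes the imbalance between the two labels visible early on.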

Now we can begin trying out different learning methods.

## Decision tree classifier


In [18]:
from sklearn.tree import DecisionTreeClassifier

# Features are every column except LABEL; the target is the LABEL column
dtc_x_train = df_train.drop(columns='LABEL')
dtc_y_train = df_train['LABEL']

# Limit the tree to a depth of 3
dtc = DecisionTreeClassifier(max_depth=3)
dtc.fit(dtc_x_train, dtc_y_train)

DecisionTreeClassifier(class_weight=None, criterion='gini', max_depth=3,
                       max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort=False,
                       random_state=None, splitter='best')

In [20]:
from sklearn.metrics import accuracy_score

# Split the test set the same way as the training set
dtc_x_test = df_test.drop(columns='LABEL')
dtc_y_test = df_test['LABEL']

dtc_y_pred = dtc.predict(dtc_x_test)
accuracy_score(dtc_y_test, dtc_y_pred)

0.9912280701754386
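
An accuracy of 0.9912 on 570 test stars corresponds to 5 misclassified stars (565/570). Because the classes are heavily imbalanced (only 37 of the 5087 training stars carry label 2), accuracy alone says little about how well exoplanets are detected. The sketch below uses hypothetical labels, not the notebook's predictions, to show that a baseline which always predicts label 1 reaches the same accuracy while finding no exoplanets at all; the split of 565 negatives and 5 positives is an assumption chosen for illustration.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix, recall_score

# Hypothetical labels: 565 stars without an exoplanet (1) and 5 with one (2).
y_true = np.array([1] * 565 + [2] * 5)
# A trivial baseline that always predicts the majority class.
y_pred = np.ones_like(y_true)

accuracy = accuracy_score(y_true, y_pred)
# Recall for the exoplanet class: the fraction of label-2 stars that were found.
recall = recall_score(y_true, y_pred, pos_label=2)
# Rows are true labels, columns are predicted labels, both in the order [1, 2].
matrix = confusion_matrix(y_true, y_pred, labels=[1, 2])

print(accuracy)
print(recall)
print(matrix)
```

Applying the same two calls to `dtc_y_test` and `dtc_y_pred` would show whether the decision tree actually identifies any of the exoplanet stars.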