# Machine Learning Engineer Nanodegree

## Capstone project: Exoplanet search (from Kaggle)

Student Name: Graciano Patino

Kaggle reference: https://www.kaggle.com/keplersmachines/kepler-labelled-time-series-data?/

The mission, as stated in the GitHub repository (https://github.com/winterdelta/KeplerAI), is to build a classification algorithm that identifies whether a given time series input includes an exoplanet or not. The repository also mentions that a number of methods were tested: a 1-D CNN in Torch7, XGBoost in R, and PCA in Python. However, none of these methods provided strong results according to the Kaggle and GitHub references.

For this project, I would evaluate deep learning algorithms. According to the reference paper (see below), these algorithms appear to provide better results than the ones already tried, as mentioned above.

1)	Initially I would evaluate 1-D CNN using Keras instead of Torch7. 
2)	Based on the reference paper, I would try different numbers of layers and filters in combination with other CNN parameters. Details would be included in the project report.
3)	The output of the CNNs would be the input to one or more dense layers.
4)	Performance of each model to be measured as per evaluation metrics section.
5)	Per the Kaggle source, the test set is confirmed to have 5 exoplanets. This will also be useful for checking the performance of the algorithms: if an algorithm is unable to identify exoplanets in the test set, then the model might not be good.

Please note that the list of models above is not meant to be exhaustive. Other deep learning algorithms might be considered later should the ones proposed above fail to identify any exoplanets as expected.


### Project design

The workflow for solving this problem would have the following order:

    1) Exploring the datasets (check dimensions of data, labels, etc.)
    2) Data preprocessing: 
    2.1) The datasets provided by Kaggle are supposed to be clean (for the most part).
    2.2) It is noted that the data is not normalized. (Some normalization is required.)
    2.3) The GitHub reference (above) mentions that techniques like data augmentation could help, since we are dealing with time series. Perhaps systematically shifting rows and adding noise could generate additional realistic (albeit synthetic) trends. This might be explored depending on results from the different algorithms tested for solving the problem.
    3) Evaluate machine learning algorithms: This involves building the models and selecting the best model by using evaluation metrics and comparing to benchmark model(s).
    4) Model tuning to optimize results: This involves evaluating the performance of the model and fine-tuning hyperparameters until a satisfactory model is identified, using the evaluation metrics to measure performance.
    5) Final conclusions.
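
The augmentation idea in step 2.3 (shifting rows and adding noise) could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name `augment_light_curves`, its parameters, and their defaults are hypothetical and are not taken from the Kaggle or GitHub references.

```python
import numpy as np

def augment_light_curves(X, y, n_copies=2, max_shift=100, noise_scale=0.01, seed=0):
    """Return X and y extended with n_copies shifted, noisy variants of every row.

    All parameter names and default values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    # Per-row standard deviation, used to scale the added noise.
    row_std = X.std(axis=1, keepdims=True)
    augmented = [X]
    for _ in range(n_copies):
        # Circularly shift each row by its own random offset along the time axis.
        shifts = rng.integers(-max_shift, max_shift + 1, size=X.shape[0])
        shifted = np.stack([np.roll(row, s) for row, s in zip(X, shifts)])
        # Add Gaussian noise proportional to each row's own spread.
        noise = rng.normal(0.0, 1.0, size=X.shape) * noise_scale * row_std
        augmented.append(shifted + noise)
    X_aug = np.concatenate(augmented, axis=0)
    # Labels are unchanged by shifting or noise, so they are simply repeated.
    y_aug = np.concatenate([y] * (n_copies + 1), axis=0)
    return X_aug, y_aug

# Toy stand-in for a few flux series (4 stars, 50 time steps).
X_toy = np.random.default_rng(1).normal(size=(4, 50))
y_toy = np.array([[1], [0], [0], [1]])
X_aug, y_aug = augment_light_curves(X_toy, y_toy, n_copies=2, max_shift=5)
print(X_aug.shape, y_aug.shape)  # (12, 50) (12, 1)
```

If something like this were used, it would normally be applied to the training set only, so that the test set stays untouched.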


In [1]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import math
import itertools
from IPython.display import display # Allows the use of display() for DataFrames
from get_results import plot_roc_auc, confusion_matrix_com

# Pretty display for notebooks
%matplotlib inline

# Some Sklearn libraries are required
from sklearn.metrics import roc_auc_score
from sklearn.metrics import precision_recall_curve
from inspect import signature # Standard-library equivalent of the older sklearn.utils.fixes.signature
from sklearn.metrics import average_precision_score

In [2]:
# Load the training data from Exoplanet dataset
train_data = pd.read_csv('kepler/exoTrain.csv')

In [3]:
# Find dimensions of the train data
train_data.shape

(5087, 3198)

In [4]:
# Check whether train_data has any null fields
# isnull() flags each null cell; summing twice counts them over the whole DataFrame
train_data.isnull().sum().sum() # If the result is zero, there are no fields with a "null" value

0

In [5]:
# Getting X_train and y_train
# Using iloc to select data using position instead of label and converting to numpy array using values
X_train = train_data.iloc[:,1:].values 
y_train = train_data.iloc[:,0:1].values 

In [6]:
# Find dimensions of the X_train data
X_train.shape

(5087, 3197)

In [7]:
# Find dimensions of the labels (y_train) data
y_train.shape

(5087, 1)

In [8]:
# y_train: Label is 2 for exoplanet and 1 for non-exoplanet
y_train[:5]

array([[2],
       [2],
       [2],
       [2],
       [2]])

In [9]:
y_train -= 1 # Changing labels to: 1 for exoplanet and 0 for non-exoplanet

In [10]:
y_train[:5]

array([[1],
       [1],
       [1],
       [1],
       [1]])

In [11]:
# Load the testing data from Exoplanet dataset
test_data = pd.read_csv('kepler/exoTest.csv')

In [12]:
# Find dimensions of the test data
test_data.shape

(570, 3198)

In [13]:
# Check whether test_data has any null fields
# isnull() flags each null cell; summing twice counts them over the whole DataFrame
test_data.isnull().sum().sum() # If the result is zero, there are no fields with a "null" value

0

In [14]:
# Getting X_test and y_test
# Using iloc to select data using position instead of label and converting to numpy array (using values)
X_test = test_data.iloc[:,1:].values
y_test = test_data.iloc[:,0:1].values

In [15]:
# Find dimensions of the X_test data
X_test.shape

(570, 3197)

In [16]:
# Find dimensions of the labels (y_test) data
y_test.shape

(570, 1)

In [17]:
# y_test: Label is 2 for exoplanet and 1 for non-exoplanet
y_test[:6]

array([[2],
       [2],
       [2],
       [2],
       [2],
       [1]])

In [18]:
y_test -= 1 # Changing labels to: 1 for exoplanet and 0 for non-exoplanet

In [19]:
y_test[:6]

array([[1],
       [1],
       [1],
       [1],
       [1],
       [0]])
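
Since the Kaggle source states that the test set contains 5 exoplanets, a count of the remapped labels is a cheap sanity check. The sketch below uses a small toy array as a stand-in for `y_test`; the toy values are assumptions chosen for illustration, not the real labels.

```python
import numpy as np

# Toy labels mimicking the raw encoding: 2 = exoplanet, 1 = non-exoplanet.
y_raw = np.array([[2], [2], [2], [2], [2], [1], [1], [1]])

y = y_raw - 1                                  # 1 = exoplanet, 0 = non-exoplanet
n_positive = int(y.sum())                      # number of exoplanet rows
counts = np.bincount(y.ravel(), minlength=2)   # [non-exoplanets, exoplanets]

print(n_positive)  # 5
print(counts)      # [3 5]
```

Running the same two lines on the real `y_test` and `y_train` would show how many positive examples each set actually holds.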

In [20]:
# Normalizing the data since it is not normalized according to Kaggle/GitHub
from sklearn.preprocessing import StandardScaler

In [21]:
# Checking X_train data
X_train

array([[  93.85,   83.81,   20.1 , ...,   61.42,    5.08,  -39.54],
       [ -38.88,  -33.83,  -58.54, ...,    6.46,   16.  ,   19.93],
       [ 532.64,  535.92,  513.73, ...,  -28.91,  -70.02,  -96.67],
       ..., 
       [ 273.39,  278.  ,  261.73, ...,   88.42,   79.07,   79.43],
       [   3.82,    2.09,   -3.29, ...,  -14.55,   -6.41,   -2.55],
       [ 323.28,  306.36,  293.16, ...,  -16.72,  -14.09,   27.82]])

In [22]:
# Viewing the transpose of X_train (transpose() returns a new view; X_train itself is unchanged)
X_train.transpose()

array([[  93.85,  -38.88,  532.64, ...,  273.39,    3.82,  323.28],
       [  83.81,  -33.83,  535.92, ...,  278.  ,    2.09,  306.36],
       [  20.1 ,  -58.54,  513.73, ...,  261.73,   -3.29,  293.16],
       ..., 
       [  61.42,    6.46,  -28.91, ...,   88.42,  -14.55,  -16.72],
       [   5.08,   16.  ,  -70.02, ...,   79.07,   -6.41,  -14.09],
       [ -39.54,   19.93,  -96.67, ...,   79.43,   -2.55,   27.82]])

In [23]:
# Checking X_test data
X_test

array([[  1.19880000e+02,   1.00210000e+02,   8.64600000e+01, ...,
          3.57800000e+01,   2.69430000e+02,   5.77200000e+01],
       [  5.73659000e+03,   5.69998000e+03,   5.71716000e+03, ...,
         -2.36619000e+03,  -2.29486000e+03,  -2.03472000e+03],
       [  8.44480000e+02,   8.17490000e+02,   7.70070000e+02, ...,
         -1.62680000e+02,  -3.67900000e+01,   3.06300000e+01],
       ..., 
       [ -5.40100000e+01,  -4.41300000e+01,  -4.12300000e+01, ...,
          5.47000000e+00,   1.44600000e+01,   1.87000000e+01],
       [  9.13600000e+01,   8.56000000e+01,   4.88100000e+01, ...,
         -8.43000000e+00,  -6.48000000e+00,   1.76000000e+01],
       [  3.07119000e+03,   2.78253000e+03,   2.60869000e+03, ...,
         -2.77220000e+02,  -6.96300000e+01,   1.21560000e+02]])

In [24]:
# Viewing the transpose of X_test (transpose() returns a new view; X_test itself is unchanged)
X_test.transpose()

array([[  1.19880000e+02,   5.73659000e+03,   8.44480000e+02, ...,
         -5.40100000e+01,   9.13600000e+01,   3.07119000e+03],
       [  1.00210000e+02,   5.69998000e+03,   8.17490000e+02, ...,
         -4.41300000e+01,   8.56000000e+01,   2.78253000e+03],
       [  8.64600000e+01,   5.71716000e+03,   7.70070000e+02, ...,
         -4.12300000e+01,   4.88100000e+01,   2.60869000e+03],
       ..., 
       [  3.57800000e+01,  -2.36619000e+03,  -1.62680000e+02, ...,
          5.47000000e+00,  -8.43000000e+00,  -2.77220000e+02],
       [  2.69430000e+02,  -2.29486000e+03,  -3.67900000e+01, ...,
          1.44600000e+01,  -6.48000000e+00,  -6.96300000e+01],
       [  5.77200000e+01,  -2.03472000e+03,   3.06300000e+01, ...,
          1.87000000e+01,   1.76000000e+01,   1.21560000e+02]])

In [25]:
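# Standardize features by removing the mean and scaling to unit variance (sklearn.preprocessing)
# StandardScaler works column-wise, so each time step is standardized across all stars
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train) # Output is a numpy.ndarray
# Note: fit_transform refits the scaler on the test set; scaler.transform(X_test) would reuse the training statistics
X_test = scaler.fit_transform(X_test) # Output is a numpy.ndarray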

In [26]:
X_train.transpose() # Display the transpose of the scaled data (X_train keeps its shape of (5087, 3197))

array([[-0.00235557, -0.00852774,  0.01804893, ...,  0.00599336,
        -0.00654212,  0.00831333],
       [-0.00205404, -0.0074516 ,  0.01868969, ...,  0.00685579,
        -0.00580352,  0.00815701],
       [-0.00579778, -0.00938685,  0.01673115, ...,  0.00523005,
        -0.00686528,  0.00666449],
       ..., 
       [ 0.0341983 ,  0.03109682,  0.02910084, ...,  0.03572195,
         0.0299112 ,  0.02978874],
       [ 0.02736753,  0.02803863,  0.02275218, ...,  0.03191466,
         0.0266614 ,  0.02618942],
       [ 0.01805157,  0.02216476,  0.01410023, ...,  0.02628002,
         0.02060995,  0.02271046]])

In [27]:
X_test.transpose() # Display the transpose of the scaled data (X_test keeps its shape of (570, 3197))

array([[-0.03143654,  0.41497541,  0.02615413, ..., -0.04525719,
        -0.03370329,  0.20313139],
       [-0.05057432,  0.39343385,  0.00629912, ..., -0.0620191 ,
        -0.05173276,  0.16210798],
       [-0.03559448,  0.41363759,  0.01894572, ..., -0.04578193,
        -0.0385983 ,  0.16563567],
       ..., 
       [-0.00907555, -0.2624662 , -0.03001166, ..., -0.01227304,
        -0.01373939, -0.04209481],
       [ 0.00470934, -0.26591026, -0.02760726, ..., -0.02219864,
        -0.02440852, -0.03107299],
       [-0.00786554, -0.22375419, -0.01066057, ..., -0.01189145,
        -0.01200494, -0.00127881]])

In [28]:
# Fix random seed for reproducibility
seed = 10
np.random.seed(seed)

In [29]:
# Importing Keras libraries

from keras.models import Sequential, Model
from keras.layers import Conv1D, MaxPool1D, Dense, Dropout, Flatten
from keras.layers import BatchNormalization, Input, concatenate, Activation
from keras.optimizers import Adam, SGD, RMSprop, Adagrad, Adadelta, Adamax, Nadam
from keras.callbacks import ModelCheckpoint 

Using TensorFlow backend.


In [30]:
# Convert data into a 3-D tensor of shape (samples, time steps, 1).
# Conv1D expects ndim=3; with 2-D input it raises "Input 0 is incompatible with layer conv1d_1: expected ndim=3, found ndim=2"
X_train = np.reshape(X_train,(X_train.shape[0],X_train.shape[1],1))
X_test = np.reshape(X_test,(X_test.shape[0],X_test.shape[1],1))

In [31]:
# Checking shape of X_test tensor
X_test.shape

(570, 3197, 1)

In [32]:
# Checking shape of X_train tensor
X_train.shape

(5087, 3197, 1)
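
The reshape above only adds a trailing channel axis of length 1. As a small sketch on a toy array (the array below is an assumption for illustration), the same result can be obtained with `np.newaxis` or `np.expand_dims`, which avoid spelling out the shape by hand:

```python
import numpy as np

# Toy 2-D array: 4 samples, 3 time steps.
X_2d = np.arange(12, dtype=float).reshape(4, 3)

via_reshape = np.reshape(X_2d, (X_2d.shape[0], X_2d.shape[1], 1))
via_newaxis = X_2d[..., np.newaxis]         # append one axis at the end
via_expand = np.expand_dims(X_2d, axis=-1)  # same thing, as a function call

print(via_reshape.shape)  # (4, 3, 1)
```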

#### Data exploration and preparation ended (above)

# Ending Data Preparation

# GRID SEARCH 

## Grid Search Implementation

Reference on using GridSearch for tuning Hyperparameters for DL models:

https://machinelearningmastery.com/grid-search-hyperparameters-deep-learning-models-python-keras/

# GRID SEARCH 3cnn, 2dnn

In [33]:
# Selecting a subset of the training set
X_GS1000 = X_train[0:1000,:]
y_GS1000 = y_train[0:1000,:]

In [34]:
# Use scikit-learn to grid search the optimizer
from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier

# Function to create model, required for KerasClassifier
def create_model(optimizer='adam'):
    # create model
    model = Sequential()
    # Defining network architecture
    model.add(Conv1D(filters=8, kernel_size=8, activation='relu', input_shape=(3197,1)))
    model.add(MaxPool1D(strides=4))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Conv1D(filters=16, kernel_size=8, activation='relu'))
    model.add(MaxPool1D(strides=4))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Conv1D(filters=32, kernel_size=8, activation='relu'))
    model.add(MaxPool1D(strides=4))
    model.add(Dropout(0.3))
    model.add(BatchNormalization())
    model.add(Flatten())
    model.add(Dense(32, activation='relu'))
    model.add(Dropout(0.3))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(optimizer=optimizer, loss="binary_crossentropy", metrics=["accuracy"])
    return model
# Fix random seed for reproducibility (done above on previous cells)

# Dataset was created in previous section
# We can use the normalized X_train and y_train from previous cells for Grid Search 
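
To see how the sequence length shrinks through `create_model`, the arithmetic can be worked out by hand. The sketch below assumes the Keras defaults for the arguments the model leaves unspecified (`padding='valid'` and stride 1 for `Conv1D`, `pool_size=2` and `padding='valid'` for `MaxPool1D`). The helper names are hypothetical. Dropout and BatchNormalization do not change the shape, so they are omitted.

```python
def conv1d_len(length, kernel_size, stride=1):
    """Output length of a 'valid' 1-D convolution."""
    return (length - kernel_size) // stride + 1

def pool1d_len(length, pool_size=2, stride=4):
    """Output length of a 'valid' 1-D max pooling layer."""
    return (length - pool_size) // stride + 1

length = 3197            # time steps per light curve
filters = [8, 16, 32]    # filters of the three Conv1D layers
for n_filters in filters:
    after_conv = conv1d_len(length, kernel_size=8)
    length = pool1d_len(after_conv)
    print(n_filters, after_conv, length)
# 8 3190 798
# 16 791 198
# 32 191 48

flattened = length * filters[-1]
print(flattened)  # 1536 inputs feed the Dense(32) layer
```

Under these assumptions, calling `model.summary()` on the model returned by `create_model()` should report the same output shapes.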

In [35]:
# Create model
model = KerasClassifier(build_fn=create_model, epochs=20, batch_size=20, verbose=2)
# define the grid search parameters
optimizer = ['SGD', 'RMSprop', 'Adagrad', 'Adadelta', 'Adam', 'Adamax', 'Nadam']
param_grid = dict(optimizer=optimizer)
grid = GridSearchCV(estimator=model, param_grid=param_grid, n_jobs=-1)
grid_result = grid.fit(X_train, y_train)


Epoch 1/20
Epoch 1/20
Epoch 1/20
Epoch 1/20
21s - loss: 0.0773 - acc: 0.9693
Epoch 2/20
21s - loss: 0.1630 - acc: 0.9499
Epoch 2/20
22s - loss: 0.1567 - acc: 0.9531
Epoch 2/20
21s - loss: 0.0416 - acc: 0.9858
Epoch 2/20
21s - loss: 0.0159 - acc: 0.9953
Epoch 3/20
21s - loss: 0.1045 - acc: 0.9767
Epoch 3/20
21s - loss: 0.0018 - acc: 0.9997
Epoch 3/20
21s - loss: 0.1063 - acc: 0.9767
Epoch 3/20
21s - loss: 0.0170 - acc: 0.9935
Epoch 4/20
21s - loss: 0.0960 - acc: 0.9799
Epoch 4/20
21s - loss: 3.7209e-04 - acc: 1.0000
Epoch 4/20
21s - loss: 0.0912 - acc: 0.9823
Epoch 4/20
21s - loss: 0.0113 - acc: 0.9965
Epoch 5/20
21s - loss: 0.0824 - acc: 0.9850
Epoch 5/20
21s - loss: 3.5792e-04 - acc: 0.9997
Epoch 5/20
21s - loss: 0.0875 - acc: 0.9844
Epoch 5/20
21s - loss: 0.0102 - acc: 0.9976
Epoch 6/20
21s - loss: 0.0757 - acc: 0.9876
Epoch 6/20
21s - loss: 1.9656e-05 - acc: 1.0000
Epoch 6/20
21s - loss: 0.0849 - acc: 0.9847
Epoch 6/20
21s - loss: 0.0076 - acc: 0.9976
Epoch 7/20
21s - loss: 0.0734 -

Epoch 7/20
22s - loss: 0.0780 - acc: 0.9888
Epoch 6/20
21s - loss: 0.0015 - acc: 0.9994
Epoch 7/20
21s - loss: 0.0821 - acc: 0.9882
Epoch 7/20
21s - loss: 0.0616 - acc: 0.9891
Epoch 8/20
22s - loss: 0.0824 - acc: 0.9882
Epoch 7/20
21s - loss: 7.4817e-04 - acc: 0.9997
Epoch 8/20
22s - loss: 0.0835 - acc: 0.9882
Epoch 8/20
21s - loss: 0.0633 - acc: 0.9885
Epoch 9/20
22s - loss: 0.0815 - acc: 0.9888
Epoch 8/20
22s - loss: 5.8495e-04 - acc: 0.9997
Epoch 9/20
22s - loss: 0.0748 - acc: 0.9891
Epoch 9/20
21s - loss: 0.0647 - acc: 0.9891
Epoch 10/20
22s - loss: 0.0785 - acc: 0.9885
Epoch 9/20
22s - loss: 1.7670e-04 - acc: 1.0000
Epoch 10/20
21s - loss: 0.0811 - acc: 0.9876
Epoch 10/20
21s - loss: 0.0627 - acc: 0.9888
Epoch 11/20
22s - loss: 0.0853 - acc: 0.9885
Epoch 10/20
22s - loss: 1.9957e-04 - acc: 1.0000
Epoch 11/20
21s - loss: 0.0783 - acc: 0.9891
Epoch 11/20
21s - loss: 0.0542 - acc: 0.9894
Epoch 12/20
22s - loss: 0.0792 - acc: 0.9888
Epoch 11/20
22s - loss: 3.5950e-05 - acc: 1.0000
Epo

Epoch 12/20
21s - loss: 0.0722 - acc: 0.9888
Epoch 12/20
21s - loss: 1.5096e-05 - acc: 1.0000
Epoch 12/20
22s - loss: 0.0625 - acc: 0.9891
Epoch 11/20
21s - loss: 0.0654 - acc: 0.9891
Epoch 13/20
21s - loss: 0.0645 - acc: 0.9885
Epoch 13/20
21s - loss: 4.9084e-06 - acc: 1.0000
Epoch 13/20
22s - loss: 0.0634 - acc: 0.9888
Epoch 12/20
21s - loss: 0.0542 - acc: 0.9891
Epoch 14/20
22s - loss: 0.0615 - acc: 0.9894
Epoch 14/20
22s - loss: 1.8085e-05 - acc: 1.0000
Epoch 14/20
22s - loss: 0.0629 - acc: 0.9888
Epoch 15/20
22s - loss: 0.0665 - acc: 0.9894
Epoch 13/20
21s - loss: 0.0591 - acc: 0.9891
Epoch 15/20
21s - loss: 4.5566e-06 - acc: 1.0000
Epoch 15/20
21s - loss: 0.0558 - acc: 0.9891
Epoch 16/20
22s - loss: 0.0555 - acc: 0.9888
Epoch 14/20
21s - loss: 0.0621 - acc: 0.9894
Epoch 16/20
21s - loss: 7.4063e-06 - acc: 1.0000
Epoch 16/20
21s - loss: 0.0570 - acc: 0.9894
Epoch 17/20
22s - loss: 0.0603 - acc: 0.9897
Epoch 15/20
21s - loss: 0.0629 - acc: 0.9891
Epoch 17/20
22s - loss: 1.0271e-04 

In [37]:
# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

Best: 0.992727 using {'optimizer': 'Adam'}
0.991547 (0.009560) with: {'optimizer': 'SGD'}
0.991154 (0.011281) with: {'optimizer': 'RMSprop'}
0.992333 (0.010018) with: {'optimizer': 'Adagrad'}
0.674268 (0.444826) with: {'optimizer': 'Adadelta'}
0.992727 (0.010285) with: {'optimizer': 'Adam'}
0.992727 (0.010285) with: {'optimizer': 'Adamax'}
0.992727 (0.010285) with: {'optimizer': 'Nadam'}


### Run # 1: Best: 0.992727 using {'optimizer': 'Adam'}
    0.991547 (0.009560) with: {'optimizer': 'SGD'}
    0.991154 (0.011281) with: {'optimizer': 'RMSprop'}
    0.992333 (0.010018) with: {'optimizer': 'Adagrad'}
    0.674268 (0.444826) with: {'optimizer': 'Adadelta'}
    0.992727 (0.010285) with: {'optimizer': 'Adam'}
    0.992727 (0.010285) with: {'optimizer': 'Adamax'}
    0.992727 (0.010285) with: {'optimizer': 'Nadam'}
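
The scores above can be ranked with a few lines of plain Python to make ties explicit. The dictionary below copies the mean and standard deviation of the test score from Run # 1. The variable names are illustrative.

```python
# (mean test score, std test score) per optimizer, copied from Run # 1.
results = {
    'SGD':      (0.991547, 0.009560),
    'RMSprop':  (0.991154, 0.011281),
    'Adagrad':  (0.992333, 0.010018),
    'Adadelta': (0.674268, 0.444826),
    'Adam':     (0.992727, 0.010285),
    'Adamax':   (0.992727, 0.010285),
    'Nadam':    (0.992727, 0.010285),
}

best_mean = max(mean for mean, _ in results.values())
tied = [name for name, (mean, _) in results.items() if mean == best_mean]
ranked = sorted(results.items(), key=lambda item: item[1][0], reverse=True)

print(best_mean)      # 0.992727
print(tied)           # ['Adam', 'Adamax', 'Nadam']
print(ranked[-1][0])  # Adadelta
```

Three optimizers share the top mean score, so on this run the grid search does not separate Adam from Adamax or Nadam. Adadelta stands out with both the lowest mean and by far the largest spread.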
