## Model Training
The modeling phase of the machine learning workflow consists in defining the models to be trained, train the created models, and subsequently test each model's accuracy.

I am using Amazon Sagemaker for the training, testing, and deployment of the defined models. 

In [49]:
import os
from os.path import isfile, join
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import statistics
from statistics import mean, mode, median, stdev
from keras.optimizers import Adam
from keras.callbacks import ModelCheckpoint
from keras.models import Model
from keras.layers import Dense, Input, Dropout, Activation

In [68]:
# Constants to use for this notebook


time_words = {1:'day', 5:'week', 10:'two_weeks', 20:'month', 90:'four_months', 270:'year'}


# directory containing training and testing datasets
data_dir = join('data_1/') 
final_data_dir = join(data_dir + 'final/')
top_results_file = 'results/top_accuracy.txt'
model_directory = 'models_1/'

with open(top_results_file) as f:
    text = f.read()
    top_5 = [line for line in text.split('\n')]
    top_5 = [(p.split()[0], p.split()[1]) for p in top_5]
    

### Model Definition
Under the ``source/`` directory is the file ``model.py``, which contains the definition for a class named ``BinaryClassifier``. This class defines our base ANN model for this project which hast the following structure:
1. Three parameters need to be passed to the model:
    * ``input_features``: the number of neurons to create for input (11 in this case)
    * ``hidden_dim``: a parameter used to define the ANN hidden layers.
    * ``output_dim``: the number of neurons in the final layer of the ANN. For a binary classifier this is 1, and the result ranges from [0,1].
2. The number of neurons in the 4 hidden layers of the model are defined as:
    * ``hidden_dim``
    * ``2 * hidden_dim``
    * ``3 * hidden_dim``
    * ``hidden_dim``
3. The forward pass of the model
    * Input layer -> Linear transform to the first hidden layer
    * Passed into Rectifier Linear Unit function
    * Dropout layer (for training only)
    * Repeat the above steps until the final hidden layer...
    * Last hidden layer -> Linear transform to the output layer
    * Sigmoid Activation Function -> Result
    

In [47]:
# Define this model's hidden layer nodes and parameters
input_dim = 11
d1 = 500
d2 = 2*d1
d3 = 3*d1
d4 = d1
activation = 'relu'
dropout = 0.2
epochs = 10
lr = 0.001
batch = 10

In [81]:
models = []
for m, _ in top_5:
    input_layer = Input(name='the_input', shape=(input_dim,), batch_shape=(None, input_dim))
    # Add dense layers
    dense_1 = Dense(d1, activation=activation)(input_layer)
    drop_1 = Dropout(dropout)(dense_1)
    dense_2 = Dense(d2, activation=activation)(drop_1)
    drop_2 = Dropout(dropout)(dense_2)
    dense_3 = Dense(d3, activation=activation)(drop_2)
    drop_3 = Dropout(dropout)(dense_3)
    dense_4 = Dense(d4, activation=activation)(drop_3)
    drop_4 = Dropout(dropout)(dense_4)
    dense_5 = Dense(1, activation=activation)(drop_4)

    # Add sigmoid activation layer
    y_pred = Activation('sigmoid', name='sigmoid')(dense_5)
    
    # Specify the model
    model = Model(inputs=input_layer, outputs=y_pred, name=m)
    model.output_length = lambda x: x
    model.compile(optimizer=Adam(learning_rate=lr), loss='binary_crossentropy', metrics=['accuracy'])
    models.append(model)
    
print(models[0].summary())

Model: "270-90-10"
_________________________________________________________________
Layer (type)                 Output Shape              Param #   
the_input (InputLayer)       (None, 11)                0         
_________________________________________________________________
dense_215 (Dense)            (None, 500)               6000      
_________________________________________________________________
dropout_164 (Dropout)        (None, 500)               0         
_________________________________________________________________
dense_216 (Dense)            (None, 1000)              501000    
_________________________________________________________________
dropout_165 (Dropout)        (None, 1000)              0         
_________________________________________________________________
dense_217 (Dense)            (None, 1500)              1501500   
_________________________________________________________________
dropout_166 (Dropout)        (None, 1500)              0 

### Model Training
The model training will be performed by Amazon Sagemaker. Training jobs will be created for each training dataset in the ``final/`` directory. Under ``source/`` there is a file named ``train.py``, which contains the structure for a PyTorch entry point. This is necesssary for creating estimators through Sagemaker.

In [82]:
# Read training and testing data
data = {}
for mod, _ in top_5:
    data[mod] = {}
    m = mod.split('-')
    train_file = final_data_dir+time_words[int(m[0])]+'/'+m[1]+'_'+m[2]+'/train.csv'
    test_file = final_data_dir+time_words[int(m[0])]+'/'+m[1]+'_'+m[2]+'/test.csv'

    df_train = pd.read_csv(train_file, header=None)
    df_test = pd.read_csv(test_file, header=None)

    y_train = df_train[0]
    X_train = df_train.drop(labels=0, axis=1)
    y_test = df_test[0]
    X_test = df_test.drop(labels=0, axis=1)

    data[mod]['y_train'] = y_train
    data[mod]['X_train'] = X_train
    data[mod]['y_test'] = y_test
    data[mod]['X_test'] = X_test
    

In [83]:
for model in models:
    name = model.name
    model.fit(x=X_train, 
              y=y_train,
              batch_size=batch,
              epochs=epochs)
    
    model.save(model_directory+name+'.h5')
    print('Finished training model: '+name+'\n\n')

Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Finished training model: 270-90-10


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Finished training model: 270-90-270


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Finished training model: 90-270-10


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Finished training model: 90-270-90


Epoch 1/10
Epoch 2/10
Epoch 3/10
Epoch 4/10
Epoch 5/10
Epoch 6/10
Epoch 7/10
Epoch 8/10
Epoch 9/10
Epoch 10/10
Finished training model: 20-10-270




### Model Evaluation
For evaluation I am deploying each training job created. For each estimator, a predictor endpoint is created briefly to be sent data to make predictions. For each predictor, their respective test datasets are passed. The endpoint for the predictor is then deleted. Then, accuracy calculations are made against the labeled test datasets, they are printed and stored into ``.txt`` files under the ``results/`` directory.


In [93]:
for model in models:
    print(model.metrics_names)
    print(model.evaluate(data[model.name]['X_test'], data[model.name]['y_test'], verbose=1))

['loss', 'accuracy']
[0.6931471824645996, 0.05593542382121086]
['loss', 'accuracy']
[0.6931471824645996, 0.4661848247051239]
['loss', 'accuracy']
[0.6931471822462676, 0.38840049505233765]
['loss', 'accuracy']
[0.6931471822462676, 0.38840049505233765]
['loss', 'accuracy']
[0.6931471824645996, 0.40163570642471313]
