# Test S3 for training

Let's run an end-to-end Keras training script that reads its data from an HDF5 file stored in our S3 bucket. This test will give us an idea of the speed and cost of training directly against S3.

In [1]:
import keras
import numpy as np
import h5py
import os

Using TensorFlow backend.


## Load HDF5 on the S3 bucket for training in Keras

This assumes you have [goofys](https://github.com/kahing/goofys) set up on your local machine.
You will probably need to download and install the [AWS CLI](https://aws.amazon.com/cli/) first. If the AWS CLI is properly installed, you should be able to run this command from your local Linux machine:

`aws s3 ls s3://dse-cohort3-group5`

If that works, then you can create a local directory with the command:

`mkdir -p s3bucket`

If that works, then you can use goofys to mount the S3 bucket on the local directory:

`./goofys dse-cohort3-group5 s3bucket`

Once that is done, you can access the `s3bucket` directory as if it were a local folder on your Linux machine.
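
Before opening any files, it can help to confirm that the mount actually succeeded, because an unmounted `s3bucket` directory simply looks like an empty local folder. The following is a minimal sketch, assuming the mount point is the `s3bucket` directory created above. The helper name `bucket_is_mounted` is hypothetical, and the only check it performs is the standard library's `os.path.ismount`.

```python
import os


def bucket_is_mounted(mount_dir):
    """Return True if mount_dir is an active mount point (hypothetical helper).

    os.path.ismount returns False for plain directories and for paths
    that do not exist, so an unmounted s3bucket folder reports False.
    """
    return os.path.ismount(mount_dir)


if __name__ == "__main__":
    # "s3bucket" is the directory created with mkdir above.
    print("s3bucket mounted:", bucket_is_mounted("s3bucket"))
```

If this prints `False`, the goofys command probably needs to be run again before the cells below will find the HDF5 file.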


In [2]:
s3bucket_path = '/home/tony/Documents/Capstone/ucsd-dse-capstone/s3bucket/' # remote S3 via goofys
#s3bucket_path = '/home/tony/Documents/Capstone/ucsd-dse-capstone/' # Local storage (for sanity test)
path_to_hdf5 = s3bucket_path + 'LUNA16/hdf5-files/32dim_patches.hdf5'
hdf5_file = h5py.File(path_to_hdf5, 'r') # open in read-only mode

In [3]:
print("Valid hdf5 file in 'read' mode: " + str(hdf5_file))
file_size = os.path.getsize(path_to_hdf5)
print('Size of hdf5 file: {:.3f} GB'.format(file_size/2.0**30))

Valid hdf5 file in 'read' mode: <HDF5 file "32dim_patches.hdf5" (mode r)>
Size of hdf5 file: 0.012 GB
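
Since the goal is to gauge speed, one way to compare the goofys mount with local storage is to time the same read under both values of `s3bucket_path`. The sketch below is an assumption about how that comparison could be done and was not part of the original run. `time_reads` is a hypothetical helper that times any zero-argument callable with `time.perf_counter`.

```python
import time


def time_reads(read_fn, repeats=5):
    """Call read_fn() `repeats` times; return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        read_fn()
        durations.append(time.perf_counter() - start)
    return durations


# Possible usage with the file opened above (kept as a comment so the
# sketch runs standalone; "patches" is the dataset read by the loader):
# durations = time_reads(lambda: hdf5_file["patches"][:50])
# print("mean seconds per read:", sum(durations) / len(durations))
```

The first call may be slower than later ones if the operating system or goofys caches data, so it is worth looking at the individual durations rather than only the mean.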


## Custom HDF5 dataloader

This is the first pass at our custom HDF5 data loader.
We'll need to add data augmentation and class balancing to this.

In [4]:
def generate_data(hdf5_file, batch_size=50, num_rows=96, input_shape=(32, 32, 32, 1)):
    """Replaces Keras' native ImageDataGenerator.

    Randomly selects batch_size rows from the HDF5 file on every iteration.
    """
    # Prepend the batch dimension, e.g. (50, 32, 32, 32, 1)
    batch_shape = (batch_size,) + tuple(input_shape)
    while True:
        # h5py fancy indexing needs indices in increasing order, hence the sort
        random_idx = np.sort(np.random.choice(num_rows, batch_size, replace=False))
        imgs = hdf5_file["patches"][random_idx, :]
        imgs = imgs.reshape(batch_shape)
        classes = hdf5_file["classes"][random_idx, 0]
        yield imgs, classes
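
The class balancing and data augmentation mentioned above could be added along the following lines. This is only a sketch under stated assumptions: labels are taken to be 0/1 (as the first column of `classes` suggests), random flips are assumed to preserve the label, and the helper names `balanced_indices`, `read_rows`, and `random_flip` are hypothetical. It uses only NumPy, so it can be tried on in-memory arrays before being wired into `generate_data`.

```python
import numpy as np


def balanced_indices(labels, batch_size, rng):
    """Pick batch_size row indices, half per class, sorted ascending."""
    labels = np.asarray(labels).ravel()
    pos = np.flatnonzero(labels == 1)
    neg = np.flatnonzero(labels == 0)
    n_pos = batch_size // 2
    n_neg = batch_size - n_pos
    # Sample with replacement only when a class has too few rows.
    pick_pos = rng.choice(pos, n_pos, replace=len(pos) < n_pos)
    pick_neg = rng.choice(neg, n_neg, replace=len(neg) < n_neg)
    return np.sort(np.concatenate([pick_pos, pick_neg]))


def read_rows(dataset, idx):
    """Read rows that may contain duplicate indices.

    h5py fancy indexing expects increasing indices without repeats, so
    read the unique rows once and expand them again in memory.
    """
    unique_idx, inverse = np.unique(idx, return_inverse=True)
    return dataset[unique_idx][inverse]


def random_flip(batch, rng):
    """Flip the whole batch along each spatial axis with probability 0.5."""
    for axis in (1, 2, 3):  # axes of a (batch, x, y, z, channel) array
        if rng.random() < 0.5:
            batch = np.flip(batch, axis=axis)
    return batch


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    demo_labels = np.array([0] * 90 + [1] * 6)  # made-up, imbalanced labels
    idx = balanced_indices(demo_labels, 50, rng)
    print("positives in batch:", int((demo_labels[idx] == 1).sum()))
```

Inside the generator, `balanced_indices` would take the place of the `np.random.choice` call, with the labels read once from `hdf5_file["classes"]` before the loop.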

## 3D CNN

This is a very simple 3D CNN just to test the pipeline.

In [None]:
from keras.layers import Conv3D, Dense, Dropout, Flatten, Input, MaxPooling3D
from keras.models import Model

input_shape = (32,32,32,1)
inputs = Input(input_shape, name='Images')

conv1 = Conv3D(filters=96, kernel_size=(3, 3, 3), activation='relu', padding='valid',
              kernel_initializer='glorot_uniform')(inputs)

max2 = MaxPooling3D(pool_size=(2,2,2))(conv1)

layer6 = Flatten()(max2)

layer7 = Dense(32, activation='relu')(layer6)

layer8 = Dropout(0.5)(layer7)

layer9 = Dense(4, activation='relu')(layer8)

layer10 = Dense(1, activation='sigmoid')(layer9)

model = Model(inputs=[inputs], outputs=[layer10])
model.compile(loss='binary_crossentropy', optimizer='rmsprop', metrics=['accuracy'])
model.summary()

_________________________________________________________________
Layer (type)                 Output Shape              Param #   
Images (InputLayer)          (None, 32, 32, 32, 1)     0         
_________________________________________________________________
conv3d_1 (Conv3D)            (None, 30, 30, 30, 96)    2688      
_________________________________________________________________
max_pooling3d_1 (MaxPooling3 (None, 15, 15, 15, 96)    0         
_________________________________________________________________
flatten_1 (Flatten)          (None, 324000)            0         
_________________________________________________________________
dense_1 (Dense)              (None, 32)                10368032  
_________________________________________________________________
dropout_1 (Dropout)          (None, 32)                0         
_________________________________________________________________
dense_2 (Dense)              (None, 4)                 132       
__________
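
The shapes and parameter counts in the summary can be checked by hand. The sketch below reproduces the arithmetic for the layers listed above; the helper names `conv3d_params` and `dense_params` are hypothetical, and the formulas are the usual weights-plus-biases counts for these layer types.

```python
def conv3d_params(kernel_size, in_channels, filters):
    """Kernel weights plus one bias per filter."""
    k_d, k_h, k_w = kernel_size
    return k_d * k_h * k_w * in_channels * filters + filters


def dense_params(n_inputs, n_units):
    """Weight matrix plus one bias per unit."""
    return n_inputs * n_units + n_units


# 'valid' padding with a 3x3x3 kernel shrinks each side from 32 to 30.
side_after_conv = 32 - 3 + 1
# 2x2x2 max pooling halves each side: 30 -> 15.
side_after_pool = side_after_conv // 2
# Flatten: 15 * 15 * 15 voxels times 96 channels.
flat_size = side_after_pool ** 3 * 96

conv1_count = conv3d_params((3, 3, 3), 1, 96)
dense1_count = dense_params(flat_size, 32)
dense2_count = dense_params(32, 4)

print(flat_size)      # 324000
print(conv1_count)    # 2688
print(dense1_count)   # 10368032
print(dense2_count)   # 132
```

Almost all of the parameters sit in the first dense layer, because it is fed the 324,000-element flattened output directly.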

## Train with fit_generator


In [None]:
batch_size = 50
history = model.fit_generator(generate_data(hdf5_file, batch_size, input_shape=(32, 32, 32, 1)),
                              steps_per_epoch=10000, epochs=10)

Epoch 1/10
 2207/10000 [=====>........................] - ETA: 38:42 - loss: 0.0028 - acc: 0.9998
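
The progress line gives enough information for a rough estimate of total training time. This is a back-of-the-envelope sketch, assuming the step time stays constant for the rest of the run and that the ETA reflects the average step time so far. The function name `seconds_per_step` is hypothetical, and `hourly_rate` is a placeholder to be replaced with the actual price of the machine, which is not given here.

```python
def seconds_per_step(eta_seconds, steps_done, steps_total):
    """Average step time implied by the ETA for the remaining steps."""
    return eta_seconds / (steps_total - steps_done)


eta_seconds = 38 * 60 + 42   # ETA 38:42 from the progress bar
steps_done = 2207
steps_per_epoch = 10000
epochs = 10

step_time = seconds_per_step(eta_seconds, steps_done, steps_per_epoch)
epoch_minutes = step_time * steps_per_epoch / 60
total_hours = step_time * steps_per_epoch * epochs / 3600

hourly_rate = 1.0  # placeholder value, NOT a real price
estimated_cost = total_hours * hourly_rate

print("seconds per step: {:.3f}".format(step_time))
print("minutes per epoch: {:.1f}".format(epoch_minutes))
print("hours for all epochs: {:.1f}".format(total_hours))
```

Under these assumptions each step takes roughly 0.3 seconds, so one epoch takes about 50 minutes and all 10 epochs a little over 8 hours.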