This is the analysis for ConvX, a Udacity Capstone Project. Before starting, be sure to follow the instructions in the project's README on GitHub (https://github.com/justiniann/ConvX).

This code was originally run on a personal desktop computer with the following specs:

CPU: Intel i5 6600K
GPU: Nvidia 1060 GTX
RAM: 16GB

I highly recommend that anyone attempting to run this code on the full dataset use hardware that is comparable or better.

The first step is getting the data set up correctly. We start by running the preprocess function. This turns the given Data_Entry_2017.csv file into a version we can use and splits the images into train, test, and validation sets. The code for this is not shown here, but you can find all of it in the convx_utils.py file.

In [None]:
from convx_utils import *

preprocess()
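
To give a rough idea of what splitting images into train, validation, and test sets involves, here is a minimal sketch. It is not the code in convx_utils.py: the helper name split_filenames, the 70/15/15 fractions, and the file names are all assumptions made for illustration.

```python
import random


def split_filenames(filenames, train_frac=0.7, val_frac=0.15, seed=0):
    """Shuffle file names and split them into train/validation/test lists.

    Hypothetical helper: the fractions are illustrative assumptions,
    not the values used by preprocess().
    """
    shuffled = list(filenames)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n_train = round(len(shuffled) * train_frac)
    n_val = round(len(shuffled) * val_frac)
    return {
        "train": shuffled[:n_train],
        "validation": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],  # whatever is left over
    }


# Made-up file names, only to exercise the helper
example_files = ["img_{:03d}.png".format(i) for i in range(100)]
splits = split_filenames(example_files)
print({name: len(files) for name, files in splits.items()})
```

Each image lands in exactly one of the three lists, so no image used for training is later used for validation or testing.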

Now we can start training the model. First, we need to load our base model. We will also define a few variables that we will need later.

In [None]:
from keras.applications import VGG16

# This is our base model. We reuse the structure and ImageNet weights that are already known to work well in
#   other domains and adjust them to fit our current problem
base_model = VGG16(include_top=False, weights='imagenet', input_shape=(250, 250, 3))

target_image_size = (250, 250)
batch_size = 32
transfer_learning_epochs = 200
fine_tuning_epochs = 5

We'll also define some paths that we will need later for saving various results as we run through our analysis.

In [None]:
import os

# The following are directories used for reading/saving data
RES_PATH = "..{0}..{0}resources{0}".format(os.path.sep)
IMG_PATH = "..{0}..{0}images{0}".format(os.path.sep)
BOTTLENECK_PATH = "..{0}..{0}bottleneck{0}".format(os.path.sep)
SAVE_PATH = "..{0}..{0}saved_models{0}".format(os.path.sep)
TRAIN_PATH = os.path.join(IMG_PATH, "train")
VAL_PATH = os.path.join(IMG_PATH, "validation")
TEST_PATH = os.path.join(IMG_PATH, "test")

model_name = "convx_model"  # The directory all data will be saved in will be named whatever this value is.
models_save_directory = os.path.join(SAVE_PATH, model_name)
build_dir_path(models_save_directory)  # build the directory structure we need for saving results

We loaded the base model with include_top=False because we are going to build and train the top layers ourselves. This is known as transfer learning, and it is the first major step in training our model.

Before starting that, however, we need to compute a few values that we will need during processing.

In [None]:
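import numpy as np

def count_files(root_dir):
    # Count every file under root_dir, including files in subdirectories
    return sum(len(files) for _, _, files in os.walk(root_dir))

def get_iterations_per_epoch(total_images, batch_size):
    # Return an integer step count; the final batch may be partial
    return int(np.ceil(total_images / batch_size))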

healthy_train_images = count_files(os.path.join(TRAIN_PATH, "healthy"))
unhealthy_train_images = count_files(os.path.join(TRAIN_PATH, "unhealthy"))
healthy_validation_images = count_files(os.path.join(VAL_PATH, "healthy"))
unhealthy_validation_images = count_files(os.path.join(VAL_PATH, "unhealthy"))

num_training_steps = get_iterations_per_epoch((healthy_train_images + unhealthy_train_images), batch_size)
num_validation_steps = get_iterations_per_epoch((healthy_validation_images + unhealthy_validation_images), batch_size)
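
As a quick check of the arithmetic behind get_iterations_per_epoch, the sketch below uses only the standard library and made-up image counts (not the real dataset sizes). It assumes Python 3, where `/` is true division. The function name steps_per_epoch is introduced here for illustration.

```python
import math


def steps_per_epoch(total_images, batch_size):
    """Batches needed to see every image once; the last batch may be partial."""
    return int(math.ceil(total_images / batch_size))


# Illustrative counts only
for total in (64, 65, 1000):
    print(total, "images ->", steps_per_epoch(total, 32), "steps")
```

Rounding up matters: with 65 images and a batch size of 32, rounding down would silently skip the last image in every epoch.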

For efficiency, we are going to compute and save the bottleneck features for this model before we start the transfer learning. By computing and saving these once, we avoid having to run every image through the entire network during every epoch.

In [None]:
from keras.preprocessing.image import ImageDataGenerator

bottleneck_dir = os.path.join(BOTTLENECK_PATH, model_name)
train_bottleneck_file = os.path.join(bottleneck_dir, "train.npy")
validation_bottleneck_file = os.path.join(bottleneck_dir, "validation.npy")
build_dir_path(bottleneck_dir)  # make sure the output directory exists before saving

data_generator = ImageDataGenerator(rescale=1. / 255)

# shuffle=False keeps the images in directory order so the labels can be rebuilt later
train_generator = data_generator.flow_from_directory(
    TRAIN_PATH,
    target_size=target_image_size,
    batch_size=batch_size,
    class_mode=None,
    shuffle=False
)
bottleneck_features_train = base_model.predict_generator(train_generator, num_training_steps)
np.save(train_bottleneck_file, bottleneck_features_train)

validation_path_generator = data_generator.flow_from_directory(
    VAL_PATH,
    target_size=target_image_size,
    batch_size=batch_size,
    class_mode=None,
    shuffle=False
)
bottleneck_features_validation = base_model.predict_generator(validation_path_generator, num_validation_steps)
np.save(validation_bottleneck_file, bottleneck_features_validation)
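
The saving from caching bottleneck features can be illustrated with a toy stand-in for the frozen base model. Nothing below is Keras code: FrozenBase, its predict method, and the tiny "images" are hypothetical and exist only to count how many forward passes each approach needs.

```python
class FrozenBase:
    """Stand-in for the frozen convolutional base; counts forward passes."""

    def __init__(self):
        self.calls = 0

    def predict(self, image):
        self.calls += 1
        return [pixel * 2 for pixel in image]  # placeholder "features"


images = [[1, 2], [3, 4], [5, 6]]  # made-up data
epochs = 10

# Without caching: every epoch pushes every image through the base
naive = FrozenBase()
for _ in range(epochs):
    for image in images:
        naive.predict(image)

# With caching: one pass over the data, then the stored features are reused
cached = FrozenBase()
bottleneck_features = [cached.predict(image) for image in images]
for _ in range(epochs):
    for features in bottleneck_features:
        pass  # the top layers would train on `features` here

print("forward passes without caching:", naive.calls)
print("forward passes with caching:", cached.calls)
```

This shortcut is only valid while the base model's weights stay fixed and the inputs are not randomly augmented, since either change would make the stored features stale.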

With the bottleneck features established, we can start transfer learning.

In [None]:
from keras import optimizers
from keras.layers import Dense, Dropout, Flatten
from keras.models import Sequential


def build_fully_connected_top_layer(connecting_shape):
    # Small classifier head: one hidden layer, dropout, and a sigmoid output for binary classification
    top_layers = Sequential()
    top_layers.add(Flatten(input_shape=connecting_shape))
    top_layers.add(Dense(256, activation='relu'))
    top_layers.add(Dropout(0.5))
    top_layers.add(Dense(1, activation='sigmoid'))
    return top_layers


# load training data (features are stored class by class: healthy first, then unhealthy)
train_data = np.load(train_bottleneck_file)
train_labels = np.array([0] * healthy_train_images + [1] * unhealthy_train_images)

# load validation data
validation_data = np.load(validation_bottleneck_file)
validation_labels = np.array([0] * healthy_validation_images + [1] * unhealthy_validation_images)

top_layer = build_fully_connected_top_layer(train_data.shape[1:])

top_layer.compile(loss='binary_crossentropy',
                  optimizer=optimizers.SGD(momentum=0.95),
                  metrics=['accuracy'])

top_layer.fit(train_data, train_labels,
              epochs=transfer_learning_epochs,
              batch_size=batch_size,
              validation_data=(validation_data, validation_labels),
              verbose=0)

# Save the trained weights so the next step can load them
top_layers_weights_path = os.path.join(models_save_directory, "transfer_learning_weights.h5")
top_layer.save_weights(top_layers_weights_path)
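
The label arrays above are rebuilt from file counts, which only works if the bottleneck features are stored class by class with "healthy" first. That should hold here because the generators were created with shuffle=False and, to my knowledge, flow_from_directory orders classes alphanumerically by subdirectory name. The sketch below shows the same label construction in plain Python. The helper name labels_for_ordered_classes and the counts are made up for illustration.

```python
def labels_for_ordered_classes(class_counts):
    """Build labels for files read class by class in sorted class-name order."""
    labels = []
    for index, class_name in enumerate(sorted(class_counts)):
        labels.extend([index] * class_counts[class_name])
    return labels


# Illustrative counts only, not the real dataset sizes
example_counts = {"unhealthy": 2, "healthy": 3}
example_labels = labels_for_ordered_classes(example_counts)
print(example_labels)
```

If the generators were shuffled, or if the class directories sorted in a different order, these labels would no longer line up with the saved features.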

Transfer learning has been completed and the results have been saved! We can now combine the base model with our newly trained top layer and analyze the results.

In [None]:
from keras.models import Model

# Rebuild the top layers on the base model's output shape and load the weights trained above
top_layer = build_fully_connected_top_layer(base_model.output_shape[1:])
top_layer.load_weights(top_layers_weights_path)

convx_model = Model(inputs=base_model.input, outputs=top_layer(base_model.output))
convx_model.compile(loss='binary_crossentropy',
                    optimizer=optimizers.SGD(momentum=0.95),
                    metrics=['accuracy'])

# TODO: analyze the results
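
The analysis itself is still to be written. As one possible starting point, the sketch below computes accuracy, precision, and recall from sigmoid outputs in plain Python. The helper name binary_metrics, the 0.5 threshold, and the toy labels and probabilities are all assumptions for illustration; none of the numbers are results from this project.

```python
def binary_metrics(y_true, y_prob, threshold=0.5):
    """Confusion-matrix counts and summary metrics for a binary classifier."""
    y_pred = [1 if p >= threshold else 0 for p in y_prob]
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return {
        "tp": tp, "fp": fp, "tn": tn, "fn": fn,
        "accuracy": (tp + tn) / len(pairs),
        # Guard against division by zero when a class is never predicted or never present
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


# Toy values only (0 = healthy, 1 = unhealthy); not real model output
toy_labels = [0, 0, 1, 1, 1]
toy_probabilities = [0.1, 0.6, 0.8, 0.4, 0.9]
metrics = binary_metrics(toy_labels, toy_probabilities)
print(metrics)
```

Looking at precision and recall alongside accuracy is useful if the two classes are not evenly represented, since accuracy alone can hide poor performance on the smaller class.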

We can try to improve our results further by fine-tuning the model. Starting from the transfer learning model that we have already trained, we 'unfreeze' a few of the last layers of the base model. This allows them to be trained as well, which can give a better fit on the data.

In [None]:
convx_model = Model(inputs=base_model.input, outputs=top_layer(base_model.output))

# NOTE: layers_to_train was never defined in the original notebook. The value below is a
#   placeholder assumption; adjust it for your own run.
layers_to_train = 5

# Freeze everything except the last `layers_to_train` layers
for layer in convx_model.layers[:len(convx_model.layers) - layers_to_train]:
    layer.trainable = False

convx_model.compile(loss='binary_crossentropy',
                    optimizer=optimizers.SGD(momentum=0.95),
                    metrics=['accuracy'])

train_generator = data_generator.flow_from_directory(
    TRAIN_PATH,
    target_size=target_image_size,
    batch_size=batch_size,
    class_mode='binary')

validation_generator = data_generator.flow_from_directory(
    VAL_PATH,
    target_size=target_image_size,
    batch_size=batch_size,
    class_mode='binary')

convx_model.fit_generator(
    train_generator,
    steps_per_epoch=num_training_steps,
    epochs=fine_tuning_epochs,
    validation_data=validation_generator,
    validation_steps=num_validation_steps,
    verbose=0
)
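
To make the freezing slice easier to follow, here is a small sketch with dummy layer objects instead of a real Keras model. DummyLayer, freeze_all_but_last, and the layer names are hypothetical; only the slicing expression mirrors the cell above.

```python
class DummyLayer:
    """Minimal stand-in for a Keras layer: just a name and a trainable flag."""

    def __init__(self, name):
        self.name = name
        self.trainable = True


def freeze_all_but_last(layers, layers_to_train):
    """Freeze every layer except the last `layers_to_train`; return the trainable names."""
    for layer in layers[:len(layers) - layers_to_train]:
        layer.trainable = False
    return [layer.name for layer in layers if layer.trainable]


names = ["conv1", "conv2", "conv3", "conv4", "top"]  # made-up layer names
still_trainable = freeze_all_but_last([DummyLayer(n) for n in names], 2)
print(still_trainable)

# With layers_to_train=0 the slice covers the whole list, so everything is frozen
none_trainable = freeze_all_but_last([DummyLayer(n) for n in names], 0)
print(none_trainable)
```

Slicing with `len(layers) - layers_to_train` rather than `[:-layers_to_train]` handles the zero case correctly: `layers[:-0]` is an empty slice, which would freeze nothing.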

Our model has now been fine-tuned and the training process is complete! Let's evaluate the results.

In [None]:
# TODO: analyze the results