## Homework #2: Convolutional Neural Networks
Due Monday, April 29th by 11:59pm

Name:


### Question 1
(a) You have an input volume that is 15x15x8 and pad it using *p = 2*. What is the dimension of the resulting volume?

Your answer: 


(b) You have an input volume that is 32x32x16 and apply max pooling with a stride of 2 and a filter size of 2x2. What is the output volume?

Your answer:


(c) You have an input volume that is 63x63x16 and convolve it with 32 filters that are each 7x7, using a stride of 1. You want to use a "same" convolution. What is the padding *p*?

Your answer:


(d) You have an input volume that is 63x63x16 and convolve it with 32 filters that are each 7x7, using a stride of 2 and no padding. What is the output volume?

Your answer:
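
For reference, the spatial sizes asked about in Question 1 all follow from the standard output-size relation floor((n + 2p - f) / s) + 1 for an n x n input, an f x f filter (or pooling window), padding p, and stride s. The sketch below is a minimal helper for checking your arithmetic; the function name `conv_output_size` and the 28x28 example are illustrative and not part of the assignment.

```python
def conv_output_size(n, f, stride=1, padding=0):
    """Spatial output size for an n x n input and an f x f filter or pooling window."""
    return (n + 2 * padding - f) // stride + 1

# Illustrative numbers only (not the homework's):
print(conv_output_size(28, 5))             # convolution with no padding
print(conv_output_size(28, 5, padding=2))  # padding chosen so the size is preserved
print(conv_output_size(28, 2, stride=2))   # 2x2 pooling with a stride of 2
```

The output depth equals the number of filters for a convolution and is unchanged by padding or pooling.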


### Question 2: Classification of Chest X-rays
Chest X-ray exams are one of the most frequent and cost-effective medical imaging examinations available. However, clinical diagnosis of a chest X-ray can be challenging and sometimes more difficult than diagnosis via chest CT imaging. The lack of large publicly available datasets with annotations means it is still very difficult, if not impossible, to achieve clinically relevant computer-aided detection and diagnosis (CAD) in real-world medical sites with chest X-rays. One major hurdle in creating large X-ray image datasets is the lack of resources for labeling so many images. Prior to the release of this dataset, Openi was the largest publicly available source of chest X-ray images, with 4,143 images available.

This NIH Chest X-ray Dataset comprises 112,120 X-ray images with disease labels from 30,805 unique patients. To create these labels, the authors used Natural Language Processing to text-mine disease classifications from the associated radiological reports. The labels are expected to be >90% accurate and suitable for weakly-supervised learning. The original radiology reports are not publicly available, but you can find more details on the labeling process in this Open Access paper: "ChestX-ray8: Hospital-scale Chest X-ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases" (Wang et al.).

Here we will use a subset of the data, rather than all 112,120 images. The images have been split into training, validation and test sets, each with folders containing the images for a particular diagnosis (class). The full dataset contains images with single labels and images with multiple labels, with a total of 15 unique diagnoses. Our subsample contains only single-label images with a total of 7 diagnoses: atelectasis, effusion, infiltration, mass, nodule, none (no finding), and pneumothorax. Your task is to classify the images correctly by building multiple CNNs and comparing their performance.

Here is what a few of the X-rays look like:

<img src="chest_xrays.jpg" width="500">

In [1]:
import sys
if not sys.warnoptions:
    import warnings
    warnings.simplefilter("ignore")
import keras
keras.__version__
import numpy as np
import os, shutil
#import cv

Using TensorFlow backend.


### Load the data

The data are available on Dropbox and can be accessed [here](https://www.dropbox.com/sh/k23tp1a0um8u2xv/AACDoV9K9BJdU5ObjBw0mmwSa?dl=0). Be sure to unzip the folders before running the code below.

Load the data and print the number of training, validation and test set examples there are of each class. Be sure to change the directory path provided below to your own data path.

In [2]:
base_dir = '/Users/heathermattie/Dropbox/Summer Teaching/Summer Project 2018/Multiple Classes Data' # change this path

train_dir = os.path.join(base_dir, 'train_dir')
validation_dir = os.path.join(base_dir, 'validation_dir')
test_dir = os.path.join(base_dir, 'test_dir')

# Training Data
train_atelectasis  = os.path.join(train_dir, 'atelectasis')
train_effusion     = os.path.join(train_dir, 'effusion')
train_infiltration = os.path.join(train_dir, 'infiltration')
train_mass         = os.path.join(train_dir, 'mass')
train_nodule       = os.path.join(train_dir, 'nodule')
train_none         = os.path.join(train_dir, 'none')
train_pneumothorax = os.path.join(train_dir, 'pneumothorax')

# Validation Data
val_atelectasis  = os.path.join(validation_dir, 'atelectasis')
val_effusion     = os.path.join(validation_dir, 'effusion')
val_infiltration = os.path.join(validation_dir, 'infiltration')
val_mass         = os.path.join(validation_dir, 'mass')
val_nodule       = os.path.join(validation_dir, 'nodule')
val_none         = os.path.join(validation_dir, 'none')
val_pneumothorax = os.path.join(validation_dir, 'pneumothorax')

# Test Data
test_atelectasis  = os.path.join(test_dir, 'atelectasis')
test_effusion     = os.path.join(test_dir, 'effusion')
test_infiltration = os.path.join(test_dir, 'infiltration')
test_mass         = os.path.join(test_dir, 'mass')
test_nodule       = os.path.join(test_dir, 'nodule')
test_none         = os.path.join(test_dir, 'none')
test_pneumothorax = os.path.join(test_dir, 'pneumothorax')

In [3]:
print('Total training atelectasis images:', len(os.listdir(train_atelectasis)))
print('Total training effusion images:', len(os.listdir(train_effusion)))
print('Total training infiltration images:', len(os.listdir(train_infiltration)))
print('Total training mass images:', len(os.listdir(train_mass)))
print('Total training nodule images:', len(os.listdir(train_nodule)))
print('Total training no finding images:', len(os.listdir(train_none)))
print('Total training pneumothorax images:', len(os.listdir(train_pneumothorax)))

Total training atelectasis images: 400
Total training effusion images: 400
Total training infiltration images: 400
Total training mass images: 200
Total training nodule images: 300
Total training no finding images: 400
Total training pneumothorax images: 300


In [4]:
print('Total validation atelectasis images:', len(os.listdir(val_atelectasis)))
print('Total validation effusion images:', len(os.listdir(val_effusion)))
print('Total validation infiltration images:', len(os.listdir(val_infiltration)))
print('Total validation mass images:', len(os.listdir(val_mass)))
print('Total validation nodule images:', len(os.listdir(val_nodule)))
print('Total validation no finding images:', len(os.listdir(val_none)))
print('Total validation pneumothorax images:', len(os.listdir(val_pneumothorax)))

Total validation atelectasis images: 100
Total validation effusion images: 100
Total validation infiltration images: 100
Total validation mass images: 55
Total validation nodule images: 70
Total validation no finding images: 100
Total validation pneumothorax images: 70


In [5]:
print('Total test atelectasis images:', len(os.listdir(test_atelectasis)))
print('Total test effusion images:', len(os.listdir(test_effusion)))
print('Total test infiltration images:', len(os.listdir(test_infiltration)))
print('Total test mass images:', len(os.listdir(test_mass)))
print('Total test nodule images:', len(os.listdir(test_nodule)))
print('Total test no finding images:', len(os.listdir(test_none)))
print('Total test pneumothorax images:', len(os.listdir(test_pneumothorax)))

Total test atelectasis images: 101
Total test effusion images: 100
Total test infiltration images: 100
Total test mass images: 55
Total test nodule images: 70
Total test no finding images: 100
Total test pneumothorax images: 70
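
The per-class counts above can also be produced with a loop rather than one variable per folder. The following is a minimal sketch that assumes the same folder layout (`train_dir`, `validation_dir`, `test_dir`, each with one subfolder per class); the helper names `count_images` and `print_counts` are hypothetical.

```python
import os

CLASSES = ['atelectasis', 'effusion', 'infiltration', 'mass',
           'nodule', 'none', 'pneumothorax']

def count_images(split_dir, classes=CLASSES):
    """Return {class_name: number of directory entries} for one split directory."""
    return {c: len(os.listdir(os.path.join(split_dir, c))) for c in classes}

def print_counts(base_dir, splits=('train_dir', 'validation_dir', 'test_dir')):
    """Print per-class and total counts for each split under base_dir."""
    for split in splits:
        counts = count_images(os.path.join(base_dir, split))
        for name, n in counts.items():
            print(f'{split} {name}: {n}')
        print(f'{split} total: {sum(counts.values())}')

# Usage (with your own data path): print_counts(base_dir)
```

Note that `os.listdir` counts every entry in a folder, including hidden or non-image files, so it is worth checking any count that looks off by one.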


### A. Build a CNN from scratch
Build a shallow (2-4 convolution layers) CNN. You are free to choose any values you wish for the filter size(s), pooling window size(s), and activation function(s). Please use an input shape of `(150, 150, 3)`. Include a dense layer on top along with an appropriate output layer (number of neurons and activation function). Be sure to also include the `model.compile` function with an appropriate choice of loss function and performance metric.

In [None]:
# Your code here

#### Define the image generator
Using the `ImageDataGenerator` class, create `train_datagen` and `test_datagen` generators that rescale the images appropriately. Then use their `.flow_from_directory` method to define a training set generator (from `train_datagen`) and a validation set generator (from `test_datagen`). Specify the `target_size` (it should match the input size above), set the `batch_size` to 20, and choose an appropriate `class_mode`.

In [7]:
# Your code here

Found 2400 images belonging to 7 classes.
Found 595 images belonging to 7 classes.


Use this code chunk to view the shapes of one batch of your training images and labels.

In [15]:
for data_batch, labels_batch in train_generator:
    print('data batch shape:', data_batch.shape)
    print('labels batch shape:', labels_batch.shape)
    break

data batch shape: (20, 150, 150, 3)
labels batch shape: (20, 7)
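
The label shape `(20, 7)` reflects one-hot encoding: each of the 20 images in a batch gets a row with a single 1 in the column of its class. The NumPy sketch below illustrates the idea with made-up labels. It assumes the class columns follow the alphanumeric order of the subfolder names, which is how Keras assigns indices in `flow_from_directory`; the `one_hot` helper is hypothetical.

```python
import numpy as np

# Assumed column order: class subfolder names sorted alphanumerically.
CLASS_NAMES = ['atelectasis', 'effusion', 'infiltration', 'mass',
               'nodule', 'none', 'pneumothorax']

def one_hot(indices, num_classes):
    """Turn integer class indices into one-hot rows."""
    return np.eye(num_classes, dtype='float32')[indices]

rng = np.random.default_rng(0)
# Made-up integer labels for one batch of 20 images.
fake_indices = rng.integers(0, len(CLASS_NAMES), size=20)
labels_batch = one_hot(fake_indices, len(CLASS_NAMES))
print('labels batch shape:', labels_batch.shape)
```

With a real generator, `train_generator.class_indices` shows the actual mapping from class name to column.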


#### Fit your model
Be sure to choose appropriate values for the `steps_per_epoch` and `validation_steps` parameters. If the number of images in a set is not a multiple of the batch size, round the number of steps up to the nearest integer. Run this model for 30 epochs. Be sure to save your trained model.
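
The rounding rule can be sketched as a ceiling division of the number of images by the batch size. The helper name `steps_for` is hypothetical; the image counts are the ones reported by the generators above.

```python
import math

def steps_for(num_images, batch_size):
    """Number of batches needed to see every image once, rounding up."""
    return math.ceil(num_images / batch_size)

batch_size = 20
print('steps_per_epoch:', steps_for(2400, batch_size))   # training images
print('validation_steps:', steps_for(595, batch_size))   # validation images
```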

In [None]:
# Your code here

In [18]:
# Save your model
model.save('chest_1.h5')

#### Plot training and validation loss
Plot the training and validation loss. Does the model seem to be overfitting?
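
When reading the curves, one common sign of overfitting is validation loss that bottoms out and then rises while training loss keeps falling. The sketch below shows one simple way to quantify that from the per-epoch losses (in Keras these are available as `history.history['loss']` and `history.history['val_loss']`). The helper names and the loss values are made up for illustration and are not results from this assignment.

```python
def best_epoch(val_loss):
    """1-based epoch at which validation loss is lowest."""
    return min(range(len(val_loss)), key=val_loss.__getitem__) + 1

def generalization_gap(train_loss, val_loss):
    """Per-epoch difference between validation and training loss."""
    return [v - t for t, v in zip(train_loss, val_loss)]

# Made-up losses for illustration only.
train_loss = [1.90, 1.70, 1.50, 1.30, 1.10, 0.90]
val_loss = [1.92, 1.80, 1.75, 1.78, 1.85, 1.95]

print('best epoch:', best_epoch(val_loss))
print('final gap:', round(generalization_gap(train_loss, val_loss)[-1], 2))
```

A best epoch well before the last one, together with a widening gap, suggests the model is overfitting.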

In [None]:
# Your code here

#### Test accuracy
Calculate and report the test set accuracy using the code below.

In [None]:
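# Stream the test images in batches of 20, rescaled by test_datagen
test_generator = test_datagen.flow_from_directory(
                 test_dir,
                 target_size=(150, 150),
                 batch_size=20,
                 class_mode='categorical')

# 596 test images / 20 per batch = 29.8, rounded up to 30 steps
# Note: newer Keras releases deprecate evaluate_generator in favor of model.evaluate
test_loss, test_acc = model.evaluate_generator(test_generator, steps=30)
print('test acc:', test_acc)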

### B. Using data augmentation
Using the same architecture above, fit the CNN using data augmentation. You are free to choose the type of alterations made to the training images and the batch size (you should increase this for data augmentation). Be sure to include a `Dropout` layer before the first dense layer. Run this model for 30 epochs. Be sure to save your trained model.
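
To make concrete what an augmentation does to a single image, here is a NumPy-only sketch of a random horizontal flip followed by a random horizontal shift. It is an illustration, not the assignment's method: in the homework itself these alterations would be requested through `ImageDataGenerator` arguments such as `horizontal_flip` and `width_shift_range`. The function name `random_flip_and_shift` is hypothetical, and whether a given alteration is sensible for chest X-rays is left for you to judge.

```python
import numpy as np

def random_flip_and_shift(image, rng, max_shift=0.1):
    """Randomly mirror an (H, W, C) image and shift it left or right, zero-filling the gap."""
    _, w, _ = image.shape
    out = image
    if rng.random() < 0.5:
        out = out[:, ::-1, :]               # mirror left-right
    limit = int(max_shift * w)              # largest shift in pixels
    shift = int(rng.integers(-limit, limit + 1))
    shifted = np.zeros_like(out)
    if shift > 0:
        shifted[:, shift:, :] = out[:, :w - shift, :]
    elif shift < 0:
        shifted[:, :w + shift, :] = out[:, -shift:, :]
    else:
        shifted = out.copy()
    return shifted

rng = np.random.default_rng(0)
image = rng.random((150, 150, 3))           # stand-in for one rescaled X-ray
augmented = random_flip_and_shift(image, rng)
print(augmented.shape)
```

Because each epoch sees a differently altered copy of every training image, the network is less able to memorize individual images.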

In [11]:
# Your code here

In [None]:
# Save your model
model.save('chest_2.h5')

#### Plot training and validation loss
Plot the training and validation loss. Does the model seem to be overfitting?

In [None]:
# Your code here

#### Test accuracy
Calculate and report the test set accuracy.

In [None]:
# Your code here

### C. Using a pre-trained CNN without data augmentation

Use one of the pre-trained models in Keras that has been trained on the ImageNet dataset as a convolutional base. Extract features by running your training set through the base.
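
Before extracting features, it helps to know the shape the base will produce, since that fixes the input size of your classifier. The sketch below assumes VGG16 without its top layers as the base (the assignment does not prescribe a model): its convolutions preserve spatial size, so only its five 2x2 max-pooling layers shrink the image, each halving height and width with rounding down, and the final block has 512 channels. The helper name `vgg16_feature_shape` is hypothetical.

```python
def vgg16_feature_shape(height, width, num_pools=5, channels=512):
    """Feature-map shape after VGG16's convolutional base (assumed, no top layers)."""
    for _ in range(num_pools):
        height //= 2    # 2x2 max pooling with a stride of 2, rounding down
        width //= 2
    return (height, width, channels)

shape = vgg16_feature_shape(150, 150)
flat_size = shape[0] * shape[1] * shape[2]
print(shape, flat_size)
```

With any base, `conv_base.summary()` reports the actual output shape and is the authoritative check.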

In [None]:
# Your code here

Take this output and train a classifier. You may use the classifier from previous parts of this question.

In [None]:
# Your code here

In [None]:
# Save your model
model.save('chest_3.h5')

#### Plot training and validation loss
Plot the training and validation loss. Does the model seem to be overfitting?

In [None]:
# Your code here

#### Test accuracy
Calculate and report the test set accuracy.

### D. Summarize results
Summarize the results from the 3 models you built. Which model would you choose to make future predictions?

### E. **Optional**: Data augmentation with pre-trained CNN
If you would like to try cloud computing, you can build a CNN and use data augmentation along with a pre-trained network for classification. **You will need to use Google Cloud Platform to train this model**. Please follow the instructions on how to set up access to GPUs and run code from a Jupyter notebook. These instructions can be found on Canvas and in the course GitHub repository. Be sure to save your trained model.

If you choose to do this, you will get 20 extra credit homework points. Homework #3 is worth 25 points, so you wouldn't have to do much work on that assignment to get all of your homework credits.

In [None]:
# Your code here

In [None]:
# Save your model
model.save('chest_4.h5')

#### Plot training and validation loss
Plot the training and validation loss. Does the model seem to be overfitting?

#### Test accuracy
Calculate and report the test set accuracy.