# Data Augmentation with cGANs for Improved Prostate Segmentation
## Abstract 

Segmentation of medical images has many diverse applications, ranging from surgical planning to disease diagnosis. Machine learning approaches to segmentation using deep convolutional neural networks (CNNs) have demonstrated state-of-the-art results compared to the current gold standard of manual segmentation. However, CNNs require large amounts of labeled training data, and it is often impractical or impossible to obtain enough medical images to successfully develop segmentation models. We present a novel method of data augmentation that increases the size of the labeled training set by using a conditional generative adversarial network (cGAN) to generate labeled synthetic data. First, a UNet is trained on the segmentation task using the available data. Then, segmentations corresponding to an anatomical atlas are fed back through a new UNet to generate synthetic data. This backward-fed UNet is used as the generator of the cGAN, which has in effect been pre-trained on the reverse segmentation task. The addition of the discriminator fine-tunes the output to encourage realistic synthetic data that corresponds to the conditioned segmentation mask (i.e., the result is a labeled training example). Our model is demonstrated on the Medical Segmentation Decathlon Prostate dataset, consisting of 32 T2-weighted MRI volumes, and shows improved segmentation performance compared to the non-augmented dataset.


## Step 1: Establishing a Baseline UNet

The first step is to establish a baseline against which the experimental model can be compared. This is
done using three components in combination: a Data Generator, a Trainer, and a Predictor.

### Data Generator 
Requirements: 

    data_dir: directory of where training data is saved. MRI data should be in the format of nii.gz
    target_dir: corresponding segmentation files for each example in the data_dir. File format: nii.gz
    batch_size: batch size
    shuffle: whether the data should be reshuffled between epochs. Boolean
    num_channels: number of channels for the input shape
    num_classes: number of class labels for the output shape
    input_size: size of the image input
    regular: direction that the network is being trained, True means that the input is an image and the output is a segmentation, False means the reverse
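
The `regular` flag can be pictured as a switch that decides which array is the network input and which is the target. The following is a minimal sketch only; `select_pair` is a hypothetical helper introduced for illustration and is not taken from the repository, whose Data Generator may organise this differently.

```python
def select_pair(image, mask, regular=True):
    """Return (network_input, network_target) for one example.

    Hypothetical helper for illustration only.
    """
    if regular:
        # Segmentation task: image in, mask out.
        return image, mask
    # Reverse task (used to pre-train the generator): mask in, image out.
    return mask, image


x, y = select_pair("image.nii.gz", "mask.nii.gz", regular=False)
print(x, y)
```

With `regular=False` the same data folders can be reused unchanged; only the roles of the two arrays are exchanged.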
   
### Trainer
Requirements: 

    batch_folder_train: directory of where training data is saved. MRI data should be in the format of nii.gz
    target_folder_train: corresponding segmentation files for each example in the data_dir. File format: nii.gz
    ofolder: output folder to save the model weights and json files in 
    samples_per_card=None: Used for multi-gpu training
    epochs=50: number of epochs during training
    batch_size=50: batch size given to the Data Generator
    gpus_used=1: number of GPUs available for training
    training_direction=True: Direction of training. See 'regular' in Data Generator
    num_classes=1: number of class labels for the output shape
    data_aug=False: whether to use a generator to augment the data
    aug_folder=None: the folder that holds the weights and json file for the generator used to create synthetic data
    batch_folder_val=None: folder of testing/validation set. File Format: nii.gz
    target_folder_val=None: corresponding target values for the testing/validation set. File Format: nii.gz
    num_syn_data=None: Choice of how many synthetic data examples are used. 
    
### Predictor 
Requirements: 

    model_folder: Folder that contains the model weights and json file you are looking to test
    data_folder: Folder of the testing/validation data you are testing on. File Format: nii.gz 
    target_folder: Folder of the corresponding targets for the testing/validation data. File Format: nii.gz
    ofolder: Output folder where results are saved. Results are a numpy array of the segmentation probability map and a csv of the DSC calculated for the probability map thresholded above 0.5.
    opt: Optimizer that was used for training.
    testing_direction=True: direction of testing. See 'regular' in Data Generator.
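
The Dice similarity coefficient (DSC) reported by the Predictor can be illustrated with a short, dependency-free sketch. The function name `dice_coefficient` and the handling of two empty masks are assumptions made here for illustration; the repository's own implementation may differ.

```python
def dice_coefficient(prob_map, target, threshold=0.5):
    """DSC between a thresholded probability map and a binary target.

    Both inputs are flat sequences of equal length (for a NumPy array,
    pass array.ravel()).
    """
    pred = [1 if p > threshold else 0 for p in prob_map]
    truth = [1 if t > 0 else 0 for t in target]
    if len(pred) != len(truth):
        raise ValueError("prob_map and target must have the same length")
    intersection = sum(p * t for p, t in zip(pred, truth))
    total = sum(pred) + sum(truth)
    if total == 0:
        # Both masks empty: treated as perfect agreement (assumed convention).
        return 1.0
    return 2.0 * intersection / total


# Three voxels predicted as prostate, two in the target, two in common.
print(dice_coefficient([0.9, 0.8, 0.2, 0.6], [1, 1, 0, 0]))  # 0.8
```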

### Data Set 
The data set can be downloaded from this URL: https://drive.google.com/open?id=1Ff7c21UksxyT4JfETjaarmuKEjdqe1-a
Then split the training data into training and testing folders (in this project, 75% of the volumes are used for training and 25% for testing).
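
One possible way to perform this split is sketched below. `split_dataset` is a hypothetical helper, not part of the repository. It assumes that each label file has the same name as its image file and that the output goes to a new directory; the folder names `imagesTr`, `labelsTr`, `imagesTs` and `labelsTs` follow the paths used in the cells of this notebook.

```python
import random
import shutil
import tempfile
from pathlib import Path


def split_dataset(images_dir, labels_dir, out_dir, test_fraction=0.25, seed=1):
    """Copy image/label pairs into train (Tr) and test (Ts) folders.

    Returns (train_names, test_names).
    """
    images_dir, labels_dir, out_dir = Path(images_dir), Path(labels_dir), Path(out_dir)
    # Ignore hidden files that sometimes accompany downloaded archives.
    names = sorted(p.name for p in images_dir.glob("*.nii.gz")
                   if not p.name.startswith("."))
    random.Random(seed).shuffle(names)
    n_test = round(len(names) * test_fraction)
    test_names, train_names = names[:n_test], names[n_test:]

    for suffix, subset in (("Tr", train_names), ("Ts", test_names)):
        for kind, src in (("images", images_dir), ("labels", labels_dir)):
            dst = out_dir / (kind + suffix)
            dst.mkdir(parents=True, exist_ok=True)
            for name in subset:
                shutil.copy2(src / name, dst / name)
    return train_names, test_names


# Demonstration on empty placeholder files in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for kind in ("images", "labels"):
        (root / "raw" / kind).mkdir(parents=True)
        for i in range(8):
            (root / "raw" / kind / f"case_{i:02d}.nii.gz").touch()
    train, test = split_dataset(root / "raw" / "images",
                                root / "raw" / "labels",
                                root / "split")
    print(len(train), len(test))  # 6 2
```

Splitting by volume rather than by slice keeps all slices of one subject on the same side of the split.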

In [None]:
from Trainer import Trainer
from Predictor import Predictor
import os
import keras as K

'''This script trains a basic UNet for prostate segmentation in MRI and
serves as the baseline for the project. It is trained on 75% of the
volumes, resampled to an isotropic voxel size of 2.0, and then tested on the
remaining 25% of the volumes, resampled the same way, resulting in 278
testing examples.
'''

### Train the Model

# Change these folders to your desired directories.
data_dir_train = '/prostate_data/Task05_Prostate/imagesTr/'

target_dir_train = '/prostate_data/Task05_Prostate/labelsTr/'

data_dir_val = '/prostate_data/Task05_Prostate/imagesTs/'

target_dir_val = '/prostate_data/Task05_Prostate/labelsTs/'

ofolder = 'ModelOutputs/UNet_regular_rev2'

if 'CUDA_VISIBLE_DEVICES' in os.environ.keys():
    CUDA_VISIBLE_DEVICES = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
else:
    CUDA_VISIBLE_DEVICES = ['None']

a = Trainer(data_dir_train, target_dir_train, ofolder, samples_per_card=None,
            epochs=50, gpus_used=len(CUDA_VISIBLE_DEVICES), num_classes=1,
            batch_size=16, training_direction=True,
            batch_folder_val=data_dir_val, target_folder_val=target_dir_val
            )

a.train_the_model(t_opt=K.optimizers.adam(lr=1e-5))

### Test the Model
# Change these folders to your desired directories.
data_dir_val = '/prostate_data/Task05_Prostate/imagesTs/'

target_dir_val = '/prostate_data/Task05_Prostate/labelsTs/'

model_folder = 'ModelOutputs/UNet_regular_rev2'

ofolder = 'ModelOutputs/UNet_regular_rev2/test_results'

a = Predictor(model_folder=model_folder,
              data_folder=data_dir_val,
              target_folder=target_dir_val,
              ofolder=ofolder,
              opt='ADAM',
              testing_direction=True)
a.predict_and_evaluate()
print('done')

## Step 2: Pre-Training the UNet Generator 
For the next step we run the training in reverse, pre-training weights that generate an image from a ground-truth segmentation. This only requires setting training_direction to False and running the Trainer again. 

In [None]:
from Trainer import Trainer
import os
import keras as K

import tensorflow as tf
import numpy as np

tf.set_random_seed(1)
np.random.seed(1)

# Train the Augmentation Model

data_dir = '/prostate_data/Task05_Prostate/imagesTr/'

target_dir = '/prostate_data/Task05_Prostate/labelsTr/'

ofolder = 'ModelOutputs/UNetAugmentor_rev3/'


if 'CUDA_VISIBLE_DEVICES' in os.environ.keys():
    CUDA_VISIBLE_DEVICES = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
else:
    CUDA_VISIBLE_DEVICES = ['None']


a = Trainer(data_dir, target_dir, ofolder, samples_per_card=None,
            epochs=100, gpus_used=len(CUDA_VISIBLE_DEVICES),
            batch_size=16, training_direction=False)

a.train_the_model(t_opt=K.optimizers.adam(lr=1e-4),
                  loss=K.losses.mae,
                  t_depth=4,
                  t_dropout=0.5)

print('done')

## Step 3: Training and Testing a UNet using the UNet Generator 
This is an optional step. It can be run to check whether the pre-trained weights have learned anything helpful for augmenting the data. The process is the same as Step 1, except that data_aug=True is set and an aug_folder is given for the weights and json file of the generator. 
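
Conceptually, augmenting with the generator means producing a synthetic image for a chosen mask and adding the resulting (image, mask) pair to the real training pairs. The sketch below shows that idea only. `build_training_set` and `generate_image` are hypothetical names: `generate_image` stands in for the pre-trained reverse UNet, and the way the repository's Trainer actually mixes real and synthetic data is not shown here.

```python
import random


def build_training_set(real_pairs, masks, generate_image, num_syn_data, seed=1):
    """Return the real (image, mask) pairs plus num_syn_data synthetic pairs.

    masks: candidate segmentation masks (for example, derived from an atlas)
    generate_image: callable mapping a mask to a synthetic image
    """
    rng = random.Random(seed)
    chosen = [rng.choice(masks) for _ in range(num_syn_data)]
    # Each synthetic image is labeled by the mask it was generated from.
    synthetic = [(generate_image(m), m) for m in chosen]
    return list(real_pairs) + synthetic


# Toy stand-ins: strings instead of arrays, a lambda instead of a network.
pairs = build_training_set([("img0", "mask0")], ["maskA", "maskB"],
                           lambda m: "synthetic_from_" + m, num_syn_data=3)
print(len(pairs))  # 4
```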

In [None]:
from Trainer import Trainer
from Predictor import Predictor
import os
import keras as K

# Train the Model

data_dir = '/prostate_data/Task05_Prostate' \
           '/imagesTr/'
target_dir = '/prostate_data/Task05_Prostate' \
             '/labelsTr/'
ofolder = '/2019-03-30-CSC2541Project/UNet_regularWAug/'

aug_folder = '/2019-03-30-CSC2541Project/UNetAugmentor/'

if 'CUDA_VISIBLE_DEVICES' in os.environ.keys():
    CUDA_VISIBLE_DEVICES = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
else:
    CUDA_VISIBLE_DEVICES = ['None']

a = Trainer(data_dir, target_dir, ofolder, samples_per_card=10,
            epochs=50, gpus_used=len(CUDA_VISIBLE_DEVICES),
            batch_size=None, training_direction=True,
            data_aug=True,
            aug_folder=aug_folder)

a.train_the_model(t_opt=K.optimizers.adam(lr=1e-5))

# Test the Model

data_dir = '/prostate_data/Task05_Prostate' \
           '/imagesTs/'
target_dir = '/prostate_data/Task05_Prostate' \
             '/labelsTs/'
model_folder = '/2019-03-30-CSC2541Project/UNet_regularWAug/'

ofolder = '/2019-03-30-CSC2541Project/UNet_regularWAug' \
          '/test_results'

a = Predictor(model_folder=model_folder,
              data_folder=data_dir,
              target_folder=target_dir,
              ofolder=ofolder,
              opt='ADAM',
              testing_direction=True)
a.predict_and_evaluate()
print('done')

## Step 4: Fine-Tuning the UNet Generator in a GAN Training Framework 
This script uses an adaptation of the pix2pix.py code published at https://github.com/eriklindernoren/Keras-GAN 

### Pix2Pix 
Requirements: 

    data_dir: directory where the training data is located. File Format nii.gz
    target_dir: directory where the corresponding target masks are found for the training data. File Format: nii.gz 
    batch_size: batch size to be used in training
    pretrained_folder: folder where the pre-trained model is saved 
    ofolder: Output folder. Will create png files for each epoch, save weights and a json file 
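
In a pix2pix-style setup the generator is typically trained on a weighted sum of an adversarial term and an L1 reconstruction term. The sketch below illustrates that combination in plain Python. The least-squares adversarial term and the weight of 100 are assumptions based on common pix2pix settings; the adapted script used in this project may use different losses or weights, and `generator_objective` is a hypothetical name.

```python
def generator_objective(disc_scores, fake_image, real_image, l1_weight=100.0):
    """Adversarial term plus weighted L1 term for one generated image.

    disc_scores: discriminator outputs for the generated image
    fake_image, real_image: flat sequences of pixel intensities
    """
    # Least-squares adversarial term: the generator wants every score to be 1.
    adversarial = sum((s - 1.0) ** 2 for s in disc_scores) / len(disc_scores)
    # L1 term keeps the generated image close to the real image for this mask.
    l1 = sum(abs(f - r) for f, r in zip(fake_image, real_image)) / len(real_image)
    return adversarial + l1_weight * l1


print(generator_objective([0.5, 0.5], [0.0, 1.0], [0.0, 0.5]))  # 25.25
```

The large L1 weight keeps the fine-tuned generator close to its pre-trained behaviour, while the adversarial term pushes the output towards realistic texture.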



In [None]:
from Pix2Pix import Pix2Pix
import keras as K

import tensorflow as tf
import numpy as np



# Train the Augmentation Model using a cGAN

data_dir = '/prostate_data/Task05_Prostate' \
           '/imagesTr/'
target_dir = '/prostate_data/Task05_Prostate' \
             '/labelsTr/'
pretrained_model = 'ModelOutputs/UNetAugmentor_rev3/'
ofolder = 'ModelOutputs/cGANUnetAugmentor_rev2/'

tf.set_random_seed(1)
np.random.seed(1)

cgan = Pix2Pix(pretrained_folder=pretrained_model,
               data_dir=data_dir,
               target_dir=target_dir,
               ofolder=ofolder,
               batch_size=32)

cgan.train(epochs=2000, sample_interval=1000)

print('done')

## Step 5: Training and Testing using the cGAN trained UNet Generator 
Again, this is a repeat of Steps 1 and 3, with data_aug=True set and an aug_folder given for the UNet generator. In this experiment we also ran training and testing with different amounts of synthetic data in the training set, as per the results shown in the final report. 
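
The sweep in this step uses `np.linspace(100, 2000, 20)`, which gives 20 evenly spaced values from 100 to 2000, so one model is trained for every additional 100 synthetic examples. A dependency-free sketch of the same sequence follows; `synthetic_counts` is a hypothetical helper used only for illustration.

```python
def synthetic_counts(start=100, stop=2000, num=20):
    """Evenly spaced integer counts, mirroring np.linspace(start, stop, num)."""
    step = (stop - start) / (num - 1)
    return [int(round(start + i * step)) for i in range(num)]


counts = synthetic_counts()
print(counts[:3], counts[-1], len(counts))  # [100, 200, 300] 2000 20
```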

In [None]:
from Trainer import Trainer
from Predictor import Predictor
import os
import keras as K
import numpy as np


# Train the Model

data_dir_train = '/prostate_data/Task05_Prostate' \
           '/imagesTr/'
target_dir_train = '/prostate_data/Task05_Prostate' \
             '/labelsTr/'

data_dir_val = '/prostate_data/Task05_Prostate' \
           '/imagesTs/'
target_dir_val = '/prostate_data/Task05_Prostate' \
             '/labelsTs/'

aug_folder = 'ModelOutputs/cGANUnetAugmentor_rev2/'

num_syn_data_vec = np.linspace(100, 2000, 20)

for num_syn_data in num_syn_data_vec:

    ofolder = 'ModelOutputs/UNet_reuglarWAugcGAN_rev2/{}_syn_samples/'.format(int(num_syn_data))

    if not os.path.exists(ofolder):
        os.makedirs(ofolder)

    if 'CUDA_VISIBLE_DEVICES' in os.environ.keys():
        CUDA_VISIBLE_DEVICES = os.environ['CUDA_VISIBLE_DEVICES'].split(',')
    else:
        CUDA_VISIBLE_DEVICES = ['None']

    a = Trainer(data_dir_train, target_dir_train, ofolder, samples_per_card=None,
                epochs=50, gpus_used=len(CUDA_VISIBLE_DEVICES), num_classes=1,
                batch_size=16, training_direction=True,
                aug_folder=aug_folder,
                data_aug=True, num_syn_data=int(num_syn_data),
                batch_folder_val=data_dir_val, target_folder_val=target_dir_val
                )

    a.train_the_model(t_opt=K.optimizers.adam(lr=1e-5))

    # Test the Model

    # Reuse the validation folders defined at the top of this cell.
    data_dir = data_dir_val
    target_dir = target_dir_val
    model_folder = ofolder

    ofolder_pred = ofolder + 'test_results'

    a = Predictor(model_folder=model_folder,
                  data_folder=data_dir,
                  target_folder=target_dir,
                  ofolder=ofolder_pred,
                  opt='ADAM',
                  testing_direction=True)
    a.predict_and_evaluate()
print('done')

## Visualization 

For the final results, visualization was done using two scripts in the repository. 

### plot_images.py 
This script needs manual editing: set the directory of the model you want to build visual examples for, as well as the testing folders for data and targets. 

### plot_histogram
This script also requires manual editing. Given a csv file, it plots a histogram of the DSC for each test subject in the csv. 
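
The binning step behind such a histogram can be sketched with the standard library alone. The column name `DSC`, the csv layout, and the helper `dsc_histogram` are assumptions made for this example; the actual csv written by the Predictor may be organised differently.

```python
import csv
import io


def dsc_histogram(csv_text, column="DSC", num_bins=10):
    """Count DSC values per equal-width bin on [0, 1]."""
    counts = [0] * num_bins
    for row in csv.DictReader(io.StringIO(csv_text)):
        value = float(row[column])
        # Clamp so that a DSC of exactly 1.0 falls in the last bin.
        index = min(int(value * num_bins), num_bins - 1)
        counts[index] += 1
    return counts


example = "subject,DSC\nprostate_01,0.91\nprostate_02,0.85\nprostate_03,0.42\n"
print(dsc_histogram(example))  # [0, 0, 0, 0, 1, 0, 0, 0, 1, 1]
```

The resulting counts can then be passed to any plotting library to draw the bars.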