# Overview

In this tutorial, we will explore several strategies for optimizing Jupyter notebook code and hyperparameter search.

# Environment

### Jarvis library

In this notebook we will use Jarvis, a custom Python package that facilitates data science and deep learning for healthcare. Among other things, this library will be used for low-level data management, stratification and visualization of high-dimensional medical data.

In [None]:
# --- Install jarvis (only in Google Colab or local runtime)
%pip install jarvis-md

### Imports

Use the following lines to import any additional needed libraries:

In [None]:
import os, numpy as np, pandas as pd
import tensorflow as tf
from tensorflow import losses, optimizers
from tensorflow.keras import Input, Model, layers
from jarvis.train import datasets, params
from jarvis.train.client import Client

# Data

The data used in this tutorial will consist of (frontal projection) chest radiographs from a subset of the RSNA / Kaggle pneumonia challenge (https://www.kaggle.com/c/rsna-pneumonia-detection-challenge). From the complete cohort, a random subset of 1,000 exams will be used for training and evaluation.

### Download

The custom `datasets.download(...)` method can be used to download a local copy of the dataset. By default the dataset will be archived at `/data/raw/xr_pna`; as needed an alternate location may be specified using `datasets.download(name=..., path=...)`. 

In [None]:
# --- Download dataset
paths = datasets.download(name='xr/pna-512')

While the location of downloaded data is not required for baseline functionality, manipulation of the underlying data structures and configuration files may be needed for implementing customized training. As a result it may be useful to note this location for future reference. All code and key configuration files are located relative to a project `code` root directory. This directory can be retrieved via the `paths` variable returned from the `datasets.download(...)` call, which is a Python dictionary with two entries:

* `paths['code']`: root directory of all Python code and configuration files
* `paths['data']`: root directory of all raw data

Note that by default, these two directories are identical and placed in `/data/raw/[dataset_name]`.

# Hyperparameters

While interactive coding (encouraged by the Jupyter notebook design) is well-suited for initial algorithm debugging and testing, training a model across a grid search of hyperparameters can be very challenging without a strategy that helps automate and keep track of experiments. In this tutorial, we will explore one potential mechanism using the prebuilt `jarvis.train.params` module.

To begin, let us create a `*.csv` file where each *column* represents a potential hyperparameter to test and each *row* represents one possible experimental configuration. Let us look at one possible such file now:

```
output_dir | fold | batch_size | LR       | alpha | iterations
-----------------------------------------------------------------
./exp01-0  | 0    | 8          | 0.0002   | 1     | 20000
./exp01-1  | 1    | 8          | 0.0002   | 1     | 20000
./exp01-2  | 2    | 8          | 0.0002   | 1     | 20000
./exp01-3  | 3    | 8          | 0.0002   | 1     | 20000
./exp01-4  | 4    | 8          | 0.0002   | 1     | 20000
./exp02-0  | 0    | 8          | 0.0002   | 2     | 20000
```

In this example, we will consider the following hyperparameters:

* **output_dir**: location to save outputs
* **fold**: cross-validation fold
* **batch_size**: training batch size
* **LR**: learning rate
* **alpha**: multiplication factor for number of channels
* **iterations**: number of training iterations

**Important**: we will create the following `*.csv` file here programmatically for demonstration purposes; *however*, in practice you should create the file independently outside of Jupyter.

In [None]:
def create_hyper_csv(fname='./hyper.csv', overwrite=False):
    
    if os.path.exists(fname) and not overwrite:
        return
    
    df = {'output_dir': [], 'fold': [], 'batch_size': [], 'LR': [], 'alpha': [], 'iterations': []}
    
    # --- Create exp01
    for fold in range(5):
        df['output_dir'].append('./exp01-{}'.format(fold))
        df['fold'].append(fold)
        df['batch_size'].append(8)
        df['LR'].append(0.0002)
        df['alpha'].append(1)
        df['iterations'].append(20000)
        
    # --- Create exp02
    fold = 0
    df['output_dir'].append('./exp02-{}'.format(fold))
    df['fold'].append(fold)
    df['batch_size'].append(8)
    df['LR'].append(0.0002)
    df['alpha'].append(2)
    df['iterations'].append(20000)
    
    # --- Save *.csv file
    df = pd.DataFrame(df)
    df.to_csv(fname, index=False)
    
    print('Created {} successfully'.format(fname))

In [None]:
# --- Create 
create_hyper_csv()
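
For larger searches, writing each row by hand becomes tedious. The sketch below is one way to expand a grid of candidate values into rows with `itertools.product`. The helper names `build_grid_rows` and `write_rows`, the `exp{:02d}-{}` naming scheme, and the example values are assumptions for illustration and are not part of `jarvis`.

```python
import csv
import itertools

def build_grid_rows(grid, folds=5, prefix='./exp'):
    """Expand {hyperparameter: [candidate values]} into one row per (configuration, fold)."""
    names = list(grid)
    rows = []
    combos = itertools.product(*(grid[n] for n in names))
    for i, values in enumerate(combos, start=1):
        for fold in range(folds):
            row = {'output_dir': '{}{:02d}-{}'.format(prefix, i, fold), 'fold': fold}
            row.update(zip(names, values))
            rows.append(row)
    return rows

def write_rows(rows, fname):
    """Write rows to a *.csv file with one column per hyperparameter."""
    with open(fname, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=list(rows[0]))
        writer.writeheader()
        writer.writerows(rows)

# --- Example: 2 alpha values x 5 folds = 10 rows
rows = build_grid_rows({
    'batch_size': [8],
    'LR': [0.0002],
    'alpha': [1, 2],
    'iterations': [20000]})
print(len(rows), rows[0]['output_dir'], rows[-1]['output_dir'])
```

Calling `write_rows(rows, './hyper.csv')` would then produce a file with the same columns as the one created above, with every configuration repeated across all folds.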

To use this `*.csv` file, simply use the `params.load(...)` function in `jarvis.train.params`:

In [None]:
# --- Load params
p = params.load(csv='./hyper.csv', row=0)

Now all the parameters for **row 0** are available in the dictionary `p`.

Note that if certain environment variables are set before running this notebook, the `params.load(...)` function will *ignore* the kwargs passed in this code and instead use the `csv` and `row` values designated by those environment variables:

```
$ export JARVIS_PARAMS_CSV=hyper.csv
$ export JARVIS_PARAMS_ROW=0 
$ python [training_script].py
```
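
The precedence described above can be illustrated with a small stand-in. This is only a sketch of the behavior, not the `jarvis` implementation: the function `load_row` is hypothetical, and it assumes that `JARVIS_PARAMS_CSV` and `JARVIS_PARAMS_ROW` override the keyword arguments whenever they are set.

```python
import csv
import os

def load_row(csv_path=None, row=0, environ=os.environ):
    """Hypothetical stand-in for params.load(): environment variables win over kwargs."""
    csv_path = environ.get('JARVIS_PARAMS_CSV', csv_path)
    row = int(environ.get('JARVIS_PARAMS_ROW', row))
    with open(csv_path, newline='') as f:
        rows = list(csv.DictReader(f))
    return rows[row]   # values are left as strings in this sketch
```

Because of this precedence, the same training script can be launched once per row by changing only `JARVIS_PARAMS_ROW`, without editing any code.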

# Client

The `Client()` object provides a simple interface for loading data efficiently for model training. By default, one (or several) `Client()` `*.yml` files are provided as templates with baseline configurations (e.g. data matrix shape, normalization, preprocessing, etc). By passing a template `*.yml` file into the `Client()` constructor, one can create a default object (and Python generator) to use for training:

```python
CLIENT_TEMPLATE = '{}/data/ymls/client.yml'.format(paths['code'])
client = Client(CLIENT_TEMPLATE)
```

During the course of training, the `Client()` object will aggregate additional specific information regarding the current state of data (e.g. number of passes through each data sample, epoch-specific randomization, etc). At any given time during training, the client state may be saved using:

```python
# --- Save client state
CLIENT_TRAINING = '/path/to/save/client.yml'
client.to_yml(CLIENT_TRAINING)
```

Upon resuming a training session, this new *experiment specific* `Client()` `*.yml` file can be directly loaded instead of the default template `*.yml` file to ensure that the training resumes with the exact same data sample and randomization as before. To account for this flexibility, use the following modification to the `Client()` initialization code:

In [None]:
# --- Initialize client
CLIENT_TEMPLATE = '{}/data/ymls/client-cls-512.yml'.format(paths['code'])
CLIENT_TRAINING = '{}/client.yml'.format(p['output_dir'])

client = Client(CLIENT_TRAINING if os.path.exists(CLIENT_TRAINING) else CLIENT_TEMPLATE, configs={
    'batch': {
        'size': p['batch_size'],
        'fold': p['fold']}})

Notice additionally that several hyperparameters from the `p` dictionary are referenced during `Client()` initialization here.

### Generators

In [None]:
# --- Create gen_train, gen_valid
gen_train, gen_valid = client.create_generators()

### Inputs

In [None]:
# --- Create model inputs
inputs = client.get_inputs(Input)

# Model

Let us first define a standard classifier network architecture:

In [None]:
# --- Define kwargs dictionary
kwargs = {
    'kernel_size': (1, 3, 3),
    'padding': 'same'}

# --- Define lambda functions
conv = lambda x, filters, strides : layers.Conv3D(filters=filters, strides=strides, **kwargs)(x)
norm = lambda x : layers.BatchNormalization()(x)
relu = lambda x : layers.ReLU()(x)

# --- Define stride-1, stride-2 blocks
conv1 = lambda filters, x : relu(norm(conv(x, filters, strides=1)))
conv2 = lambda filters, x : relu(norm(conv(x, filters, strides=2)))

# --- Define concatenation
concat = lambda a, b : layers.Concatenate()([a, b])

For the classifier backbone, we will reference the `p['alpha']` hyperparameter, which scales the number of filter channels at every level of the network:

In [None]:
# --- Extract alpha value
a = p['alpha']

# --- Define contracting layers
l1 = conv1(int(8  * a), inputs['dat'])
l2 = conv1(int(16 * a), conv2(int(16 * a), l1))
l3 = conv1(int(24 * a), conv2(int(24 * a), l2))
l4 = conv1(int(32 * a), conv2(int(32 * a), l3))
l5 = conv1(int(40 * a), conv2(int(40 * a), l4))
l6 = conv1(int(48 * a), conv2(int(48 * a), l5))
l7 = conv1(int(56 * a), conv2(int(56 * a), l6))
l8 = conv1(int(64 * a), conv2(int(64 * a), l7))

# --- Flatten
f0 = layers.Flatten()(l8)
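
To see what `alpha` does to the network size, the arithmetic can be worked through without building the model. The sketch below assumes an input of shape `(1, 512, 512)` plus a channel axis (suggested by the `xr/pna-512` dataset name, but not confirmed here). It relies on the fact that a stride-2 convolution with `'same'` padding produces `ceil(n / 2)` outputs along each axis.

```python
import math

BASE_FILTERS = [8, 16, 24, 32, 40, 48, 56, 64]   # multipliers used for l1 ... l8 above

def filters_per_level(alpha):
    """Filter count at each level after scaling by alpha."""
    return [int(f * alpha) for f in BASE_FILTERS]

def flattened_size(alpha, shape=(1, 512, 512)):
    """Number of features after Flatten(), given seven stride-2 blocks (l2 ... l8)."""
    for _ in range(len(BASE_FILTERS) - 1):
        shape = tuple(math.ceil(n / 2) for n in shape)
    return math.prod(shape) * filters_per_level(alpha)[-1]

for alpha in (1, 2):
    print(alpha, filters_per_level(alpha), flattened_size(alpha))
```

Under these assumptions the final feature map is `1 x 4 x 4`, so doubling `alpha` doubles the number of flattened features fed to the dense layer (1024 versus 2048).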

Create the logits:

In [None]:
# --- Create logits
logits = {}
logits['pna'] = layers.Dense(2, name='pna')(f0)

Now the model is ready:

In [None]:
model = Model(inputs=inputs, outputs=logits)

# Compile

Putting everything together, use the following cell to create and compile the convolutional neural network:

In [None]:
# --- Create model
model = Model(inputs=inputs, outputs=logits)

# --- Compile model
model.compile(
    optimizer=optimizers.Adam(learning_rate=p['LR']), 
    loss={'pna': losses.SparseCategoricalCrossentropy(from_logits=True)}, 
    metrics={'pna': 'sparse_categorical_accuracy'})
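
Because the `Dense(2)` layer has no activation, the model outputs raw logits, which is why the loss is constructed with `from_logits=True`. As a small worked example in plain Python (independent of TensorFlow, and using a hypothetical helper name), the per-sample loss is the negative log of the softmax probability assigned to the true class:

```python
import math

def sparse_ce_from_logits(logits, label):
    """Cross-entropy for one sample: -log(softmax(logits)[label])."""
    m = max(logits)   # subtract the max for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

# --- Equal logits give a loss of log(2); a confident correct prediction approaches 0
print(sparse_ce_from_logits([0.0, 0.0], 1))
print(sparse_ce_from_logits([10.0, -10.0], 0))
```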

In [None]:
# --- Load data into memory for faster training
client.load_data_in_memory()

# Train

Before we start training, let us load any existing model file found in the `output_dir`:

In [None]:
# --- Create output_dir
os.makedirs(p['output_dir'], exist_ok=True)

# --- Load existing model if present
MODEL_NAME = '{}/model.hdf5'.format(p['output_dir'])
if os.path.exists(MODEL_NAME):
    print('Loading existing model weights: {}'.format(MODEL_NAME))
    model.load_weights(MODEL_NAME)

To set up a training loop, consider the following configurations (these may also be set programmatically via `params`, as above, if desired):

In [None]:
# --- Assume a 1000:250 ratio of train:valid
steps_per_epoch = 250
validation_freq = 4

# --- Determine total loop iterations needed
N = int(p['iterations'] / 1000)
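
The bookkeeping here is easy to get wrong, so a quick check may help. The sketch below assumes each loop iteration is intended to cover 1,000 training steps, i.e. four epochs of `steps_per_epoch = 250` (the epoch count is an assumption). It uses the Keras rule that an integer `validation_freq` triggers validation whenever the 1-based epoch number is a multiple of it. The helper `plan_training` is hypothetical.

```python
def plan_training(iterations, steps_per_epoch=250, epochs_per_loop=4, validation_freq=4):
    """Return (number of loops, steps per loop, epochs in each fit() call that validate)."""
    steps_per_loop = steps_per_epoch * epochs_per_loop
    n_loops = iterations // steps_per_loop
    valid_epochs = [e for e in range(1, epochs_per_loop + 1) if e % validation_freq == 0]
    return n_loops, steps_per_loop, valid_epochs

print(plan_training(20000))
print(plan_training(20000, epochs_per_loop=1))
```

With a single epoch per `fit()` call and `validation_freq = 4`, the list of validating epochs is empty, so validation would never run.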

Using this information, we will set up a training loop that saves the `client.yml` and `model.hdf5` intermediates after each loop iteration:

In [None]:
for n in range(N):
    
    # --- Train
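    # --- (4 epochs x 250 steps = 1000 iterations; validation runs on the 4th epoch)
    model.fit(
        x=gen_train,
        epochs=validation_freq,
        steps_per_epoch=steps_per_epoch,
        validation_data=gen_valid,
        validation_steps=steps_per_epoch,
        validation_freq=validation_freq)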
    
    # --- Save model
    model.save(MODEL_NAME)
    
    # --- Save client
    client.to_yml(CLIENT_TRAINING)
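
One caveat: as written, the loop always runs `N` times, even when an earlier session already completed part of the schedule. A possible refinement is to record the number of completed loop iterations next to the model and skip them on restart. It is sketched here with a hypothetical `progress.json` file and a generic `train_one_loop` callback, neither of which is part of `jarvis`.

```python
import json
import os

def run_resumable(n_loops, train_one_loop, progress_file):
    """Run the remaining loop iterations, persisting the count after each one."""
    done = 0
    if os.path.exists(progress_file):
        with open(progress_file) as f:
            done = json.load(f)['completed']
    for n in range(done, n_loops):
        train_one_loop(n)   # e.g. model.fit(...), model.save(...), client.to_yml(...)
        with open(progress_file, 'w') as f:
            json.dump({'completed': n + 1}, f)
    return max(n_loops - done, 0)   # iterations actually run in this session
```

In this notebook, `train_one_loop` would wrap the body of the loop above, and `progress_file` could live in `p['output_dir']` alongside `model.hdf5` and `client.yml`.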