# Overview of Client

The `Client()` object, part of the `dl_train.client` module, provides a simple yet powerful interface for loading and preprocessing data for neural network training. Given a properly formatted `*.csv` file that references the data location, a `Client()` object can be initialized with nothing more than a `*.yml` configuration file. This notebook explores the following functionality:

**Configuration**

* `*.csv` file format
* `*.yml` file settings 
* batch
* specs

**Loading Data**

* via `self.get(...)` method
* via `self.generator(...)` method
* via `self.create_generators(...)` method

**Testing**

* via `self.test(...)` method

# Set Up

Let us first set up the required imports and dependencies:

In [None]:
from dl_train.client import Client
from dl_utils.general import *

# Configuration

A fully functional `Client()` object can be instantiated with a `*.yml` file containing the necessary configurations:

In [None]:
# --- Instantiate a new Client
client = Client('./client.yml')

Using the instantiated client, a single call to the `client.create_generators(...)` method returns Python generators for training and validation data that conform to the TensorFlow/Keras API, i.e. they can be passed directly to `model.fit_generator(...)`.

The following shows the contents of the `./client.yml` file in the current directory:

```yml
_db: ../../../../dl_utils/data/bet/ymls/db.yml
batch:
  fold: -1
  sampling:
    fg: 0.5
    bg: 0.5 
specs:
  xs:
    dat:
      dtype: float32
      loads: dat
      norms:
        clip: 
          min: 0
          max: 256
        shift: 64
        scale: 64
      shape:
      - 1
      - 512
      - 512
      - 1
  ys:
    bet:
      dtype: uint8
      loads: bet 
      norms: null
      shape:
      - 1
      - 512
      - 512
      - 1
```

Let us walk through this configuration file in detail.

## Database

```yml
_db: ../../../../dl_utils/data/bet/ymls/db.yml
```

The prepared database file path is set as `_db` in the first line of the configuration file; it may reference either a `*.csv` or a `*.yml` file. See the tutorial on generic database objects for additional information.

For the `Client()` object, a special database is required whereby **each row** in the `*.csv` file represents a unique signature for every training **example** in the dataset. For 3D datasets, this may yield a new row for every slice or slab in the entire volume (if the network is designed to train on a slice- or slab- basis). 

This schematic shows a representative `*.csv` file:

```
             fname-dat       fname-lbl       coord    bg      fg        mu      sd      valid
             
patient_00   /path/to/file   /path/to/file   0.0      True    False     100.0   20.0    3
patient_00   /path/to/file   /path/to/file   0.1      True    False     100.0   20.0    3
patient_00   /path/to/file   /path/to/file   0.2      True    False     100.0   20.0    3
patient_00   /path/to/file   /path/to/file   0.3      False   True      100.0   20.0    3
patient_00   /path/to/file   /path/to/file   0.4      False   True      100.0   20.0    3
patient_00   /path/to/file   /path/to/file   0.5      False   True      100.0   20.0    3
...
patient_00   /path/to/file   /path/to/file   1.0      True    False     100.0   20.0    3
patient_01   /path/to/file   /path/to/file   0.0      True    False     150.0   25.0    0
...

```

This format is consistent with the standard database object format (e.g. all columns containing filenames must be prefixed with `fname-`, the row indices represent patient studyids, etc). In addition, the following four types of header columns should be considered as needed:

1. *(required)* The `valid` column contains a random integer in the range `[0, n)`, where `n` is the total number of cross-validation folds for the specific experiment. For a standard 5-fold cross-validation, the integers range from 0 to 4. Note that **all rows for a given patient must be in the same cross-validation fold**.

2. If you are using a 3D individual slice- or slab-based network, you will need to provide a column named `coord` that contains the **normalized** coordinate between `[0, 1]` for that slice or slab of data.

3. If you are using any stratified sampling technique, you will need to create columns, each containing a boolean vector that corresponds to an individual cohort. In the example above, we have two cohorts (`bg` and `fg`, which correspond to slices containing background and foreground mask values, respectively) from which we plan to sample at a balanced 50/50 distribution. Keep in mind that the defined cohorts **do not** need to be either mutually exclusive or inclusive of the entire dataset; they simply need to correspond to cohorts for which you plan to implement stratified sampling.

4. If you are using parameters for image preprocessing that cannot be dynamically inferred during data loading (e.g. the mean of the entire 3D volume when loading data in a slice-by-slice manner), you will need to create column(s) that contain the required information. Commonly used metrics include the mean and standard deviation of an entire 3D volume used to perform z-score normalization for MRI exams. See `specs` below for more details.
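
As a rough illustration of requirements 1 and 2, the sketch below assigns folds per patient (so that every row of a patient lands in the same fold) and computes a normalized slice coordinate. The helper names `assign_folds` and `normalized_coords` are hypothetical and not part of `dl_train`; the `i / (n - 1)` coordinate convention is an assumption inferred from the schematic above, where `coord` runs from 0.0 to 1.0.

```python
import random

def assign_folds(patient_ids, n_folds=5, seed=0):
    """Map each unique patient ID to a single fold in [0, n_folds)."""
    unique_ids = sorted(set(patient_ids))
    random.Random(seed).shuffle(unique_ids)
    # Round-robin over the shuffled IDs keeps fold sizes roughly equal
    return {pid: i % n_folds for i, pid in enumerate(unique_ids)}

def normalized_coords(n_slices):
    """Return one coordinate in [0, 1] per slice (assumed i / (n - 1))."""
    if n_slices == 1:
        return [0.0]
    return [i / (n_slices - 1) for i in range(n_slices)]

# One entry per row (slice) in the *.csv file
row_ids = ['patient_00', 'patient_00', 'patient_00', 'patient_01', 'patient_01']
folds = assign_folds(row_ids, n_folds=5)
valid = [folds[pid] for pid in row_ids]   # values for the `valid` column
coord = normalized_coords(3)              # [0.0, 0.5, 1.0]
```

Assigning the fold per patient first, and only then expanding to rows, is what guarantees that no patient is split across training and validation.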

## Batch Composition

```yml
batch:
  fold: -1
  sampling:
    fg: 0.5
    bg: 0.5 
```

The composition (source) of each data batch used for network training derives from two primary factors:

* cross validation fold for current experiment
* stratified sampling (if any)

### Validation fold

For a given experiment, the specified `fold` represents the data split to use for validation. For example, if `fold` is set to 0, then all rows with a value of 0 in the `valid` column will be used for validation. Set `fold` to -1 to use **all data** for both training and validation (i.e. no held-out validation set).
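
The rule described here can be sketched in a few lines. This is an illustration of the documented behavior only, not the `dl_train` implementation; `split_rows` is a hypothetical helper name.

```python
def split_rows(valid_column, fold):
    """Return (train_indices, valid_indices) for the given fold."""
    indices = list(range(len(valid_column)))
    if fold == -1:
        # All rows are used for both training and validation
        return indices, indices
    train = [i for i, v in enumerate(valid_column) if v != fold]
    valid = [i for i, v in enumerate(valid_column) if v == fold]
    return train, valid

valid_column = [0, 1, 2, 0, 1]
train_idx, valid_idx = split_rows(valid_column, fold=0)
# train_idx -> [1, 2, 4], valid_idx -> [0, 3]
```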

### Stratified sampling

For certain experiments, it is critical for specific rows to be used during training at a higher or lower frequency than others. This technique is known as stratified sampling. A common example is to oversample from the images containing a positive disease finding (e.g. pathology tends to be rare) so that the distribution of positive and negative training examples is balanced.

The syntax to define stratified sampling is a series of key-value pairs such that:

* key: name of header column, containing a boolean vector, that defines rows which are part of a specific cohort
* value: rate between `[0, 1]` for which to sample from this specific cohort

The values must sum to 1.0.

In the example above, the `fg` column contains a value of True or False for each row depending on whether or not a positive mask value is present in the corresponding label. The `bg` column contains the opposite information. 
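
To make the effect of the sampling rates concrete, here is a minimal sketch of one way stratified sampling could work: first pick a cohort according to its rate, then pick a random row from that cohort. The function `sample_row` and the toy cohort vectors are assumptions for illustration; the actual sampling logic inside `Client()` may differ.

```python
import random

def sample_row(cohorts, rates, rng):
    """Pick a cohort according to `rates`, then a random row within it."""
    names = list(rates)
    name = rng.choices(names, weights=[rates[n] for n in names], k=1)[0]
    members = [i for i, flag in enumerate(cohorts[name]) if flag]
    return name, rng.choice(members)

# Boolean header columns: 2 foreground rows, 8 background rows
cohorts = {
    'fg': [True, True] + [False] * 8,
    'bg': [False, False] + [True] * 8,
}
rates = {'fg': 0.5, 'bg': 0.5}

rng = random.Random(0)
draws = [sample_row(cohorts, rates, rng) for _ in range(1000)]
fg_fraction = sum(name == 'fg' for name, _ in draws) / len(draws)
```

Although only 20% of the rows belong to `fg` in this toy dataset, roughly half of the draws come from it, which is the oversampling effect described above.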

### Optional `batch` parameters

* `size`: batch size; if not specified here, pass the batch size as an argument into `client.create_generators(...)`

## Data Specifications

```yml
specs:
  xs:
    dat:
      dtype: float32
      loads: dat
      norms:
        clip: 
          min: 0
          max: 256
        shift: 64
        scale: 64
      shape:
      - 1
      - 512
      - 512
      - 1
  ...
```

The `specs` entry in the configuration file defines the type, shape and origin of data to be loaded, as well as any necessary normalization operations to perform on the loaded data. The two primary entries in `specs` are `xs` and `ys` which represent training input data and output label(s), respectively. For each entry `xs` and `ys`, one or many individual volumes may be defined depending on architecture requirements.

For each individual volume, the following parameters must be defined:

* `dtype`: str representing the data type (often `float32` for input data, `uint8` for output label)
* `shape`: 4D shape of input (Z x Y x X x channel); note channel often == 1; for 2D data, Z == 1
* `loads`: column (if any) to load data from (see below for more details)
* `norms`: normalization parameter (see below for more details); use `null` keyword to indicate no normalization

### Data loading

The `loads` entry determines the column name from which to populate data for this input or output variable. If the column is part of the `fnames` DataFrame then the corresponding file will be loaded (at the slice location specified by `coord` if the data is 3D or 4D). If the column is part of the `header` DataFrame then the raw value will be converted to a NumPy array.

If no corresponding data should be loaded, use the keyword `null`. In this scenario, the corresponding array should be dynamically defined in an overloaded `client.preprocess(...)` method.

### Normalization parameters

A total of up to three parameters can be used to define a data normalization strategy:

* clip (includes min and/or max)
* shift
* scale

These parameters are implemented in the following normalization formula:

```
arr = (arr.clip(min=..., max=...) - shift) / scale
```
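
As a worked example of this formula, the sketch below applies the constant values from the sample configuration (`clip` to `[0, 256]`, `shift` of 64, `scale` of 64) to a small array. The `normalize` helper is hypothetical and only mirrors the formula; it is not the library's code.

```python
import numpy as np

def normalize(arr, clip_min=None, clip_max=None, shift=0.0, scale=1.0):
    """Apply clip, then shift, then scale, as in the formula above."""
    arr = np.asarray(arr, dtype='float32')
    if clip_min is not None or clip_max is not None:
        arr = arr.clip(min=clip_min, max=clip_max)
    return (arr - shift) / scale

raw = np.array([-100, 0, 64, 128, 1000])
out = normalize(raw, clip_min=0, clip_max=256, shift=64, scale=64)
# out -> [-1., -1., 0., 1., 3.]
```

With these settings the clipped input range `[0, 256]` maps onto `[-1, 3]`.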

There are three ways to define these normalization parameters:

1. **Constant value (integer or float)**. Use a literal value if you know that the required parameter value is constant for all data. This is most commonly used for CT imaging data (voxel values are precalibrated as Hounsfield Units).

```yml
norms:
  clip: 
    min: 0
    max: 256
  shift: 64
  scale: 64
```

2. **NumPy function (@ keyword)**. Use a keyword prefixed by `@` to name a valid NumPy function (e.g. `np.(...)`) to be applied dynamically to each input image upon load. Common usage here includes the `@mean` and `@std` functions to implement a simple z-score transformation. *Important*: these methods should be used only if the input image is guaranteed to produce a valid result; as a counterexample, the standard deviation of a uniform 2D image (seen at the top and bottom of 3D volumes) is zero, which leaves the scale division undefined. Thus this option should rarely be used for 3D or 4D volumes (see option 3 below instead).

```yml
norms: 
  shift: @mean
  scale: @std
```

3. **Column name**. Use a regular str (no @ prefix) to indicate a column name containing any custom normalization parameters. For 3D volumes, a common strategy is to normalize each slice by the mean and standard deviation of the entire volume. However, since loading the entire 3D volume for each slice during training is inefficient, volume statistics can instead be stored in each row of the `*.csv` file and simply referenced for preprocessing. A z-score normalization implemented using this technique is the recommended approach for MR imaging data.

```yml
norms:
  shift: mu
  scale: sd
```
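
The three options can be summarized as a small dispatch on the type of the configured value. The sketch below is one plausible reading of the rules above, not the `dl_train` implementation; `resolve_param` and the `row` dictionary are hypothetical.

```python
import numpy as np

def resolve_param(value, arr, row):
    """Resolve one normalization parameter to a float.

    value: a number, an '@'-prefixed NumPy function name, or a column name
    arr:   the loaded image array
    row:   dict of header values for the current *.csv row
    """
    if isinstance(value, (int, float)):
        return float(value)                        # 1. constant value
    if value.startswith('@'):
        return float(getattr(np, value[1:])(arr))  # 2. NumPy function
    return float(row[value])                       # 3. column name

arr = np.array([[90.0, 110.0], [80.0, 120.0]])
row = {'mu': 100.0, 'sd': 20.0}

shift = resolve_param('mu', arr, row)   # volume mean stored in the row
scale = resolve_param('sd', arr, row)   # volume SD stored in the row
zscore = (arr - shift) / scale          # [[-0.5, 0.5], [-1.0, 1.0]]

# A uniform slice has zero standard deviation, so '@std' is unsafe as a scale
flat_std = resolve_param('@std', np.zeros((4, 4)), row)
```

The last line shows why option 2 is risky for slice-based loading: dividing by `flat_std` would divide by zero.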

### Optional `specs` parameters

* `tiles`: a 4-element boolean list corresponding to axes that may be expanded during inference
* `infos`: custom `infos` dictionary values to pass to data loader

# Loading Data

After creating a `Client()` object, data may be retrieved using one of the following three methods:

* via `self.get(...)` method
* via `self.generator(...)` method
* via `self.create_generators(...)` method

Note, for the purposes of algorithm training, the `self.create_generators(...)` method will be most commonly utilized.

### Single Load - `self.get(...)`

The `self.get(...)` method can be used to manually load a single row of data. The return is a single Python dictionary with a structure that matches the `specs` attribute above:

```python
arrays = {
    'xs': {'dat': ...},
    'ys': {'bet': ...}}
```


In [None]:
# --- Load a single train or valid case
arrays = client.get(split='train')
arrays = client.get(split='valid')

# --- Load a single fg or bg case
arrays = client.get(cohort='fg')
arrays = client.get(cohort='bg')

# --- Load a single case at row
arrays = client.get(row=0)

# --- Load a single random case
arrays = client.get()

### Batch Load - `self.generator(...)`

The `self.generator(...)` method can be used to create a Python generator that yields a batch of data from the specified split. If the batch size was not previously defined in the `batch` section of the configuration file, it must be passed as an argument here. Each item yielded by the generator is a tuple `(xs, ys)` that represents the stacked `xs` and `ys` dictionaries of arrays above. This output conforms to the TensorFlow / Keras `model.fit_generator(...)` API.
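
To make the `(xs, ys)` structure concrete, here is a toy stand-in that does not use `Client()` at all. It assumes that "stacked" means a new leading batch axis; `load_one` and `toy_generator` are hypothetical names, and the arrays are random 8 x 8 placeholders rather than real 512 x 512 images.

```python
import numpy as np

def load_one(rng):
    """Hypothetical stand-in for loading a single example."""
    return {
        'xs': {'dat': rng.random((1, 8, 8, 1)).astype('float32')},
        'ys': {'bet': rng.integers(0, 2, size=(1, 8, 8, 1)).astype('uint8')},
    }

def toy_generator(batch_size, seed=0):
    """Yield (xs, ys) tuples of stacked arrays indefinitely."""
    rng = np.random.default_rng(seed)
    while True:
        examples = [load_one(rng) for _ in range(batch_size)]
        # Stack each named array along a new leading batch axis
        xs = {k: np.stack([e['xs'][k] for e in examples]) for k in examples[0]['xs']}
        ys = {k: np.stack([e['ys'][k] for e in examples]) for k in examples[0]['ys']}
        yield xs, ys

xs, ys = next(toy_generator(batch_size=4))
# xs['dat'].shape == (4, 1, 8, 8, 1); ys['bet'].shape == (4, 1, 8, 8, 1)
```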

In [None]:
# --- Create train / valid generators
gen_train = client.generator(split='train', batch_size=16)
gen_valid = client.generator(split='valid', batch_size=16)

# --- Iterate
N = 10
for i in range(N):
    printp('Loading batch', (i + 1) / N)
    train_batch = next(gen_train)
    valid_batch = next(gen_valid)

As a convenience, the `client.create_generators(...)` method combines the creation of the train and valid generators into a single call:

In [None]:
# --- Create train / valid generators
gen_train, gen_valid = client.create_generators(batch_size=16)

# Testing

The final section here describes some useful tips for testing and debugging the `Client()` object.

## Checklist

Answer **each question** in this checklist explicitly to ensure that the `client.yml` file is created properly:

`_db`

What is the full path to your database file (`*.yml` or `*.csv`)?

`batch`

What is your current experiment validation fold?

What are the different cohorts (if any) that you plan to stratify sampling from?

* Is each cohort defined by a corresponding header column?
* Is each cohort column composed entirely of boolean (True/False) values?

Do your sampling rates add to 1.0?

`specs`

What are your data input(s) (including masks)?

What are your data output(s)?

What is the data type (most commonly `float32` for `xs` and `uint8` for `ys`)?

Which column does the data derive from?

* Is the column in the `fnames` DataFrame or `header` DataFrame?
* If the column is in `fnames`, does the referenced file exist?
* If the column is in `header`, is it a numeric value?

What is the data shape?

* Is the shape defined by a full four-element list?
* Is the data 2D (z == 1) or 3D (z > 1)?
* Is the data 4D (channels > 1)?
* Are you certain that the serialized data matches this shape?

What are the normalization parameters?

* For XR (or other 2D), consider:

```yml
norms:
  shift: @mean
  scale: @std
  
```

* For CT, consider:

```yml
norms:
  clip:
    min: 0
    max: 256
  shift: 64
  scale: 64
  
```

* For MR, consider manual custom columns for mean and SD:

```yml
norms:
  shift: mu
  scale: sd
  
```
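
Several of the checklist items can be verified mechanically. The sketch below checks a configuration that has already been parsed into a Python dictionary (for example by a YAML loader); `check_config` is a hypothetical helper, not part of `dl_train`, and it covers only the sampling-rate, cohort-column and shape questions.

```python
import math

def check_config(config, header_columns):
    """Return a list of problems found in a parsed client configuration."""
    problems = []
    sampling = config.get('batch', {}).get('sampling') or {}
    if sampling and not math.isclose(sum(sampling.values()), 1.0):
        problems.append('sampling rates do not sum to 1.0')
    for cohort in sampling:
        if cohort not in header_columns:
            problems.append(f'no header column for cohort {cohort!r}')
    for group in ('xs', 'ys'):
        for name, spec in config.get('specs', {}).get(group, {}).items():
            if len(spec.get('shape') or []) != 4:
                problems.append(f'{group}/{name}: shape must have four elements')
    return problems

config = {
    'batch': {'fold': -1, 'sampling': {'fg': 0.5, 'bg': 0.5}},
    'specs': {
        'xs': {'dat': {'dtype': 'float32', 'loads': 'dat', 'shape': [1, 512, 512, 1]}},
        'ys': {'bet': {'dtype': 'uint8', 'loads': 'bet', 'shape': [1, 512, 512, 1]}},
    },
}
header_columns = {'coord', 'bg', 'fg', 'mu', 'sd', 'valid'}
problems = check_config(config, header_columns)   # empty list for this config
```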

# Tear Down