# Overview of Client

The `Client()` object in the `dl_train.client` module provides a simple
interface for loading and preprocessing data for neural network training.
Given a properly formatted `*.csv` file that references the data locations,
a `Client()` object can be initialized with nothing more than a `*.yml`
configuration file. This notebook covers the following functionality:

**Configuration**

* `*.csv` file format
* `*.yml` file settings
  * rates
  * specs
  * split

**Loading Data**

* via `self.get(...)` method
* via `self.generator(...)` method
* via `self.create_generators(...)` method

**Testing**

* via `self.test(...)` method

# Set Up

Let us first set up the required imports and
dependencies:

In [None]:
from dl_train.client import Client

# Configuration

A fully functional `Client()` object can be instantiated with a
`*.yml` file containing the necessary configurations:

In [None]:
# --- Instantiate a new Client
client = Client('./client.yml')

Using the instantiated client, a single call to the `client.create_generators(...)` method returns Python generators for training and validation data that conform to the TensorFlow/Keras API, i.e. they can be passed directly to `model.fit_generator(...)`.
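To make "conform to the TensorFlow/Keras API" concrete, the sketch below shows the general shape of a generator that Keras can consume: an endless iterator yielding `(inputs, targets)` pairs. It is a toy stand-in and not the `Client()` implementation; the function name `toy_generator`, the use of dictionaries keyed by `dat` and `bet`, and the random contents are all assumptions made for illustration.

```python
import numpy as np

def toy_generator(batch=2, seed=0):
    """Hypothetical stand-in: yield (inputs, targets) batches forever."""
    rng = np.random.default_rng(seed)
    shape = (batch, 1, 512, 512, 1)  # batch x Z x Y x X x channel
    while True:
        # Inputs and targets as dicts of named arrays (an assumption here)
        xs = {'dat': rng.random(shape, dtype=np.float32)}
        ys = {'bet': rng.integers(0, 2, size=shape, dtype=np.uint8)}
        yield xs, ys

gen = toy_generator()
xs, ys = next(gen)  # one training batch
```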

The following shows the contents of the `./client.yml` file in the current directory:

```yml
_db: ../../../../dl_utils/data/bet/ymls/db.yml
rates:
  sampling:
    fg: 0.5
    bg: 0.5 
  training:
    train: 0.8
    valid: 0.2
specs:
  xs:
    dat:
      dtype: float32
      loads: dat
      norms:
        clip: 
          min: 0
          max: 256
        shift: 64
        scale: 64
      shape:
      - 1
      - 512
      - 512
      - 1
  ys:
    bet:
      dtype: uint8
      loads: bet 
      norms: null
      shape:
      - 1
      - 512
      - 512
      - 1
split: 
  fold: -1
  cohorts:
    fg: fg
    bg: bg 
```

Let us walk through this configuration file in detail.

## Database

```yml
_db: ../../../../dl_utils/data/bet/ymls/db.yml
```

The path to the prepared database file is set as `_db` in the first line of the configuration file; it may reference either a `*.csv` or a `*.yml` file. See the tutorial on generic database objects for additional information.

For the `Client()` object, a special database is required whereby **each row** in the `*.csv` file uniquely identifies one training **example** in the dataset. For 3D datasets, this may mean a separate row for every slice or slab in the volume (if the network is designed to train on a slice or slab basis).

This schematic shows a representative `*.csv` file:

```
             fname-dat       fname-lbl       coord    bg      fg        mu      sd
             
patient_00   /path/to/file   /path/to/file   0.0      True    False     100.0   20.0
patient_00   /path/to/file   /path/to/file   0.1      True    False     100.0   20.0
patient_00   /path/to/file   /path/to/file   0.2      True    False     100.0   20.0
patient_00   /path/to/file   /path/to/file   0.3      False   True      100.0   20.0
patient_00   /path/to/file   /path/to/file   0.4      False   True      100.0   20.0
patient_00   /path/to/file   /path/to/file   0.5      False   True      100.0   20.0
...
patient_00   /path/to/file   /path/to/file   1.0      True    False     100.0   20.0
patient_01   /path/to/file   /path/to/file   0.0      True    False     150.0   25.0
...

```

This format is consistent with the standard database object format (e.g. all columns containing filenames must be prefixed with `fname-`, the row indices represent patient studyids, etc). In addition, the following three types of header columns should be considered as needed:

1. If you are using a network that trains on individual slices or slabs of a 3D volume, you will need to provide a column named `coord` that contains the **normalized** coordinate in `[0, 1]` for that slice or slab of data. 

2. If you are using any stratified sampling technique, you will need to create columns containing a boolean vector that corresponds to each individual cohort. In the example above, we have two cohorts (`bg` and `fg`, which correspond to slices containing background and foreground mask values, respectively) from which we plan to sample at a balanced 50/50 distribution. Keep in mind that the defined cohorts **do not** need to be mutually exclusive, nor do they need to cover the entire dataset; they simply need to correspond to the cohorts for which you plan to implement stratified sampling.  

3. If you are using parameters for image preprocessing that cannot be dynamically inferred during data loading (e.g. the mean of the entire 3D volume when loading data in a slice-by-slice manner), you will need to create column(s) that contain the required information. Commonly used metrics include the mean and standard deviation of an entire 3D volume used to perform z-score normalization for MRI exams. See `specs` below for more details.
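
The three column types above can be sketched with the standard library alone. The example below builds one row per slice for a single volume and then draws rows with a 50/50 cohort balance. The helpers `make_rows` and `sample_row`, the `sid` key, and the `fg_slices` input are hypothetical names for this illustration; this is not how `Client()` itself builds or samples its database.

```python
import random

def make_rows(sid, fname_dat, fname_lbl, fg_slices, mu, sd):
    """Build one database row per slice of a single 3D volume.

    fg_slices holds one boolean per slice, True where the label mask
    contains foreground (a hypothetical input for this sketch).
    """
    n = len(fg_slices)
    rows = []
    for z, is_fg in enumerate(fg_slices):
        rows.append({
            'sid': sid,
            'fname-dat': fname_dat,
            'fname-lbl': fname_lbl,
            # Normalized slice position in [0, 1]
            'coord': z / (n - 1) if n > 1 else 0.0,
            # Cohort membership used for stratified sampling
            'bg': not is_fg,
            'fg': is_fg,
            # Whole-volume statistics, repeated on every row
            'mu': mu,
            'sd': sd,
        })
    return rows

def sample_row(rows, rates, rng):
    """Pick a cohort according to `rates`, then a random row within it."""
    cohorts = list(rates)
    cohort = rng.choices(cohorts, weights=[rates[c] for c in cohorts])[0]
    return rng.choice([r for r in rows if r[cohort]])

rows = make_rows('patient_00', '/path/to/file', '/path/to/file',
                 fg_slices=[False, False, True, True, False],
                 mu=100.0, sd=20.0)

rng = random.Random(0)
picks = [sample_row(rows, {'fg': 0.5, 'bg': 0.5}, rng) for _ in range(1000)]
fg_fraction = sum(r['fg'] for r in picks) / len(picks)
```

Although only two of the five slices contain foreground, `fg_fraction` comes out near 0.5 because the cohort is chosen before the row.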

## Data Specifications

```yml
specs:
  batch: 16
  xs:
    dat:
      dtype: float32
      loads: dat
      norms:
        clip: 
          min: 0
          max: 256
        shift: 64
        scale: 64
      shape:
      - 1
      - 512
      - 512
      - 1
  ...
```

The `specs` entry in the configuration file defines the type, shape and origin of data to be loaded, as well as any necessary normalization operations to perform on the loaded data. The two primary entries in `specs` are `xs` and `ys` which represent training input data and output label(s), respectively. For each entry `xs` and `ys`, one or many individual volumes may be defined depending on architecture requirements.

For each individual volume, the following parameters must be defined:

* `dtype`: str representing the data type (often `float32` for input data, `uint8` for output label)
* `shape`: 4D shape of input (Z x Y x X x channel); note channel often == 1; for 2D data, Z == 1
* `loads`: column (if any) to load data from (see below for more details)
* `norms`: normalization parameter (see below for more details); use `null` keyword to indicate no normalization
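
As a rough illustration of what `dtype` and `shape` imply for a loaded array, the sketch below casts and reshapes a 2D slice into the 4D layout described above. The `to_spec` helper is a hypothetical name, and the dictionary simply mirrors the `xs.dat` and `ys.bet` entries of the example configuration; the actual `Client()` code may handle this differently.

```python
import numpy as np

# Mirrors the `xs.dat` and `ys.bet` entries of the example configuration
specs = {
    'xs': {'dat': {'dtype': 'float32', 'shape': [1, 512, 512, 1]}},
    'ys': {'bet': {'dtype': 'uint8', 'shape': [1, 512, 512, 1]}},
}

def to_spec(arr, spec):
    """Cast and reshape an array to the dtype and 4D shape in `spec`."""
    return np.asarray(arr).astype(spec['dtype']).reshape(spec['shape'])

# A single 2D slice (Y x X) becomes Z x Y x X x channel with Z == channel == 1
slice_2d = np.zeros((512, 512))
dat = to_spec(slice_2d, specs['xs']['dat'])
bet = to_spec(slice_2d, specs['ys']['bet'])
```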

### Data loading

The `loads` entry determines the column name from which to populate data for this input or output variable. If the column is part of the `fnames` DataFrame, then the corresponding file will be loaded (at the slice location specified by `coord` if the data is 3D or 4D). If the column is part of the `header` DataFrame, then the raw value will be converted to a NumPy array.

If no corresponding data should be loaded, use the keyword `null`. In this scenario, the corresponding array should be dynamically defined in an overloaded `client.preprocess(...)` method.
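
The three cases for `loads` can be summarized in a short dispatch function. This is a sketch under stated assumptions: one database row is represented here by two plain dictionaries (`fnames` and `header`), and both `load_entry` and the `load_file` callable are hypothetical names standing in for whatever the library does internally.

```python
import numpy as np

def load_entry(loads, fnames, header, load_file):
    """Return the array for one `loads` entry, or None if it is null.

    load_file(path, coord) is a hypothetical callable that reads a file
    at the given normalized slice position and returns an array.
    """
    if loads is None:
        # Nothing to load; the array would be built during preprocessing
        return None
    if loads in fnames:
        # File-backed column: load the file at the row's slice location
        return load_file(fnames[loads], header.get('coord'))
    if loads in header:
        # Header column: convert the raw value to a NumPy array
        return np.array(header[loads])
    raise KeyError(f'unknown column: {loads}')

fnames = {'dat': '/path/to/file', 'bet': '/path/to/file'}
header = {'coord': 0.3, 'bg': False, 'fg': True, 'mu': 100.0, 'sd': 20.0}

def fake_load(path, coord):
    # Stand-in for real file I/O in this sketch
    return np.zeros((1, 4, 4, 1))

dat = load_entry('dat', fnames, header, fake_load)
mu = load_entry('mu', fnames, header, fake_load)
nothing = load_entry(None, fnames, header, fake_load)
```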

### Normalization parameters

A total of up to three parameters can be used to define a data normalization strategy:

* clip (includes min and/or max)
* shift
* scale

These parameters are implemented in the following normalization formula:

```
arr = (arr.clip(min=..., max=...) - shift) / scale
```
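
A minimal NumPy sketch of this formula is shown below, using the constant values from the example configuration (clip to `[0, 256]`, shift 64, scale 64). The function name `normalize` and its keyword arguments are illustrative assumptions, not part of the `Client()` API.

```python
import numpy as np

def normalize(arr, clip_min=None, clip_max=None, shift=0.0, scale=1.0):
    """Apply (arr.clip(min, max) - shift) / scale to a copy of `arr`."""
    arr = np.asarray(arr, dtype='float32')
    # Clip only when at least one bound is given
    if clip_min is not None or clip_max is not None:
        arr = arr.clip(min=clip_min, max=clip_max)
    return (arr - shift) / scale

raw = np.array([-100, 0, 64, 256, 1000])
out = normalize(raw, clip_min=0, clip_max=256, shift=64, scale=64)
# out is [-1., -1., 0., 3., 3.]
```

Values below 0 and above 256 are clipped first, so the output range for this configuration is `[-1, 3]`.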

There are three ways to define these normalization parameters:

1. **Constant value (integer or float)**. Use a literal value if you know that the required parameter value is constant for all data. This is most commonly used for CT imaging data (voxel values are precalibrated as Hounsfield Units).

```yml
norms:
  clip: 
    min: 0
    max: 256
  shift: 64
  scale: 64
```

2. **NumPy function (@ keyword)**. Use a keyword prefixed by `@` to name a valid NumPy function (e.g. `@mean` for `np.mean(...)`) that is applied dynamically to each input image upon load. A common use is the `@mean` and `@std` pair, which implements a simple z-score transformation. Quote these values in the `*.yml` file, because a plain YAML scalar cannot begin with `@`. *Important*: use these functions only if every input image is guaranteed to yield a valid result; as a counterexample, the standard deviation of a uniform 2D image (seen at the top and bottom of 3D volumes) is zero, so dividing by it is undefined. This option should therefore rarely be used for 3D or 4D volumes (see option 3 below instead).

```yml
norms: 
  shift: '@mean'
  scale: '@std'
```
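
The counterexample is easy to reproduce. The sketch below shows that a uniform slice has a standard deviation of exactly zero, and adds one possible guard that falls back to a scale of 1. The guard, the `zscore` name, and the `eps` threshold are assumptions for illustration only; the source does not say that `Client()` applies any such fallback.

```python
import numpy as np

# A uniform slice, as found at the top or bottom of many 3D volumes
blank = np.zeros((512, 512), dtype='float32')
blank_std = float(blank.std())  # 0.0, so (arr - mean) / std is undefined

def zscore(arr, eps=1e-6):
    """Per-image z-score with a hypothetical guard for near-zero std."""
    arr = np.asarray(arr, dtype='float32')
    mean, std = arr.mean(), arr.std()
    # Fall back to a scale of 1 when the image is (nearly) uniform
    return (arr - mean) / (std if std > eps else 1.0)

safe = zscore(blank)                  # all zeros, no NaN or inf
normal = zscore(np.array([1.0, 3.0]))  # mean 2, std 1
```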

3. **Column name**. Use a regular str (no @ prefix) to indicate a column name containing any custom normalization parameters. For 3D volumes, a common strategy is to normalize each slice by the mean and standard deviation of the entire volume. However, since loading the entire 3D volume for each slice during training is inefficient, volume statistics can instead be stored in each row of the `*.csv` file and simply referenced for preprocessing. A z-score normalization implemented using this technique is the recommended approach for MR imaging data.

```yml
norms:
  shift: mu
  scale: sd
```
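
Taken together, the three ways of defining a parameter amount to a small lookup step before the formula is applied. The sketch below is one possible reading of that behavior, not the library's code: `resolve` and `apply_norms` are hypothetical names, the row is a plain dictionary, and computing `@` functions on the already clipped array is an assumption.

```python
import numpy as np

def resolve(param, arr, row):
    """Turn one normalization parameter into a number."""
    if isinstance(param, str):
        if param.startswith('@'):
            # '@mean' -> np.mean(arr), '@std' -> np.std(arr), ...
            return getattr(np, param[1:])(arr)
        # Any other string names a column in the database row
        return row[param]
    # Literal integer or float
    return param

def apply_norms(arr, norms, row):
    """Apply (arr.clip(...) - shift) / scale using a `norms` dictionary."""
    arr = np.asarray(arr, dtype='float32')
    if norms is None:
        return arr
    clip = norms.get('clip') or {}
    if clip:
        arr = arr.clip(min=clip.get('min'), max=clip.get('max'))
    shift = resolve(norms.get('shift', 0), arr, row)
    scale = resolve(norms.get('scale', 1), arr, row)
    return (arr - shift) / scale

row = {'mu': 100.0, 'sd': 20.0}
arr = np.array([80.0, 100.0, 140.0])

by_constant = apply_norms(
    arr, {'clip': {'min': 0, 'max': 256}, 'shift': 64, 'scale': 64}, row)
by_function = apply_norms(arr, {'shift': '@mean', 'scale': '@std'}, row)
by_column = apply_norms(arr, {'shift': 'mu', 'scale': 'sd'}, row)
# by_column is [-1., 0., 2.]
```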

### Other `specs` parameters

* **`batch`**: (optional) training batch size
* **`tiles`**: (optional) a 4-element boolean list corresponding to axes that may be expanded during inference
* **`infos`**: (optional) custom `infos` dictionary values to pass to data loader

# Tear Down