# Overview of Client

The pipeline for loading data for machine learning training is a critical process that requires thoughtful and deliberate planning. Key considerations that need to be accounted for include:

* keeping track of all data and labels for each individual training case
* consistent stratified sampling between training and validation splits
* consistent stratified sampling between different data cohorts (e.g. from different hospitals or sources)
* randomization of data loading order between epochs
* any "real time" data preprocessing (e.g. normalization)

Other advanced functionality may include:

* in-memory loading of all data before training starts (if dataset is small)
* asynchronous loading of data (if dataset is large)

In this tutorial, we will cover the basics of creating a `client` for loading data, covering much of the key functionality described above. For each individual project, you must create your **own individualized** client for loading data. To help you get started, a fully functional template `client` is available in this repository at `/dl_core/clients/client.py`. Example usage of this module is available at the end of this tutorial.

## Key Concepts

All terms in **bold** will reference specific concepts that are reused throughout this tutorial. Please review these terms before proceeding further.

Machine learning algorithms require data to be **split** into *training* and *validation* sets. All training paradigms *require* this baseline division of data. For the purposes of this tutorial, a **split** represents the current usage of any particular example as *training* or *validation* data. Note that in the setting of cross-validation, all data will be used as *both* training and validation cases at different points during algorithm development.

For certain problems, it is necessary to further subdivide data into **cohorts**. For the purposes of this tutorial, a **cohort** represents *any arbitrary* division of data into user-defined subgroups. Why are **cohorts** important? It turns out that stratified sampling (e.g. selecting data at fixed rates from specific cohorts) oftentimes improves training dynamics for heterogeneous datasets, including those commonly seen in medical problems. For example, given the low prevalence of most disease states, it is often beneficial to load data at an even 50-50% distribution between positive and negative examples (e.g. oversample from the positive category).

Given the above, two different sampling **rates** are defined:

* **training_rate**: represents the rate of randomly selecting training / validation cases
* **sampling_rate**: represents the rate at which each individual **cohort** is sampled from
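
To make the **sampling_rate** idea concrete, the sketch below draws examples so that each cohort is selected at a fixed rate regardless of its prevalence. The function name `sample_by_cohort` and the layout of the `cohorts` dictionary are assumptions made for this example; the template `client` may implement this differently.

```python
import numpy as np

def sample_by_cohort(cohorts, rates, n, seed=0):
    """
    Draw n indices by choosing a cohort at its sampling rate and then
    a random member of that cohort (hypothetical helper)

    :params

      (dict) cohorts : {name: array of indices belonging to the cohort}
      (dict) rates   : {name: sampling rate}, rates must sum to 1

    """
    assert abs(sum(rates.values()) - 1) < 1e-6

    rng = np.random.default_rng(seed)
    names = sorted(cohorts.keys())
    probs = [rates[k] for k in names]

    # --- Choose a cohort for each draw, then a random member of it
    picks = rng.choice(len(names), size=n, p=probs)

    return np.array([rng.choice(cohorts[names[p]]) for p in picks])

# --- 5 positive vs. 95 negative examples, sampled at 50-50%
cohorts = {'pos': np.arange(5), 'neg': np.arange(5, 100)}
indices = sample_by_cohort(cohorts, rates={'pos': 0.5, 'neg': 0.5}, n=1000)
print('Fraction positive: %.2f' % np.mean(indices < 5))
```

Although only 5% of the examples are positive, roughly half of the draws come from the positive cohort.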

# Creating a Data Summary

Given the need to carefully subselect splits or cohorts of data, it is often valuable to extract key characteristics of the data *first*, as an independent step prior to loading any data. In this tutorial, the implementation will include the following steps:

* find all data and labels for each individual training case
* extract summary information about each training case
* aggregate all summarized data
* determine training / validation splits
* serialize summary as a Pickle file

## Finding Data

This portion of the code will be the most variable, depending on the directory hierarchy for the data in your project. In general, the goal will be to create a list of dictionaries, one per training case, with each key-value pair representing a single full file path. For example:

```
d = {
    'dat_0': '/full/path/to/dat/0',
    'lbl_0': '/full/path/to/lbl/0',
    'lbl_1': '/full/path/to/lbl/1', ... }
```

Note that the keys you choose can be arbitrary, as long as you remember what is what!

In our example, the data is currently organized as follows:

```
/hdfs/[ID-...0]/dat.hdf5
/hdfs/[ID-...0]/bet.hdf5
/hdfs/[ID-...1]/dat.hdf5
/hdfs/[ID-...1]/bet.hdf5
...
```

Assuming that your data is organized in a similar way, here is a simple method to generate such a list of dictionaries:

In [None]:
import os, glob

def find_data(query):
    """
    Method to find all data files and return a list of dictionaries

    :params

      (dict) query : {

        'root': '/path/to/root/dir',
        [key_00]: [query_00],
        [key_01]: [query_01], ...

      }
      
    """
    assert 'root' in query
    assert len(query) > 1

    # --- Copy query so that the caller's dictionary is not modified
    query = dict(query)
    root = query.pop('root')
    keys = sorted(query.keys())

    q = query.pop(keys[0])
    matches = glob.glob('%s/**/%s' % (root, q), recursive=True)

    DATA = []
    
    for n, m in enumerate(matches):

        print('CREATING SUMMARY (%07i/%07i): %s' % (n + 1, len(matches), m))

        d = {keys[0]: m}
        b = os.path.dirname(m)

        # --- Find other matches
        for key in keys[1:]:
            ms = glob.glob('%s/%s' % (b, query[key]))

            if len(ms) == 1: 
                d[key] = ms[0]
                
        if len(d) == len(keys):
            DATA.append(d)
    
    return DATA

The following code demonstrates example usage:

In [None]:
# --- Set query
query = {
    'root': '../../data/hdfs',
    'dat': 'dat.hdf5',
    'bet': 'bet.hdf5'}

# --- Find data
data = find_data(query)
print(data)

## Extracting Slice Location

Next, we need to identify information about **each slice** of data which will be used for algorithm training. To do so, we will first create a system to reference each individual slice of data using an **index** and a **coord** variable:

* **coord**: a *normalized* coordinate `[0, 1]` that represents the z-position of the slice
* **index**: a value from `[0, n - 1]` identifying which of the `n` volumes in the dataset the slice belongs to

After all the data has been loaded and summarized, we should have two vectors, `coord` and `index`, *equal in size* to the total number of slices across all data. For example, if we had five volumes, each with 10 slices, then:

```
index = [0,   0,   0, ..., 1,   1,   1, ...,   4,   4, 4]
coord = [0, 1/9, 2/9, ..., 0, 1/9, 2/9, ..., 7/9, 8/9, 1]
```

Assuming `data` contains a 4D NumPy volume, the following snippet of pseudocode will accomplish this:

```
index = []
coord = []

for INDEX in range(len(EXAMS)):

    [... load data ...]
    
    index.append(np.ones(data.shape[0], dtype='int') * INDEX)
    coord.append(np.arange(data.shape[0]) / (data.shape[0] - 1))
```
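
For a version that can be run directly, the sketch below replaces the loading step with synthetic volumes. The array shapes are assumptions chosen for illustration, and the `max(n - 1, 1)` guard avoids dividing by zero if a volume contains only a single slice.

```python
import numpy as np

# --- Synthetic stand-ins for loaded data: 3 volumes of shape (Z, H, W, C)
volumes = [np.zeros((z, 4, 4, 1)) for z in (10, 5, 2)]

index = []
coord = []

for INDEX, data in enumerate(volumes):

    n = data.shape[0]

    # --- Every slice in this volume shares the same index
    index.append(np.ones(n, dtype='int') * INDEX)

    # --- Normalize z-position to [0, 1]
    coord.append(np.arange(n) / max(n - 1, 1))

# --- Concatenate into one vector per variable
index = np.concatenate(index)
coord = np.concatenate(coord)

print(index.shape, coord.shape)
print(index)
```

Both vectors have one entry per slice (10 + 5 + 2 = 17 here), so a single position in either vector identifies one slice across the entire dataset.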

See below for a rough implementation of the above concepts. Keep in mind we need an additional variable, `LABELED`, which references the *key* in query from which to load data and use for calculations (in our case, `bet`). To load `*.hdf5` files we will use the `dl_core.io.hdf5` library. See the dedicated notebook for more information.

In [None]:
import os, glob
import numpy as np

import sys
PATH = '../../'
if PATH not in sys.path:
    sys.path.append(PATH)
from dl_core.io import hdf5

In [None]:
def make_summary(query, LABELED):
    """
    Method to read all data and make summary dictionary 

    :params

      (dict) query : {

        'root': '/path/to/root/dir',
        [key_00]: [query_00],
        [key_01]: [query_01], ...

      }

    """
    assert 'root' in query
    assert len(query) > 1
    assert LABELED in query

    # --- Copy query so that the caller's dictionary is not modified
    query = dict(query)
    root = query.pop('root')
    keys = sorted(query.keys())

    q = query.pop(keys[0])
    matches = glob.glob('%s/**/%s' % (root, q), recursive=True)

    DATA = []
    META = {}
    META['index'] = []
    META['coord'] = []

    for n, m in enumerate(matches):

        d = {keys[0]: m}
        b = os.path.dirname(m)

        # --- Find other matches
        for key in keys[1:]:
            ms = glob.glob('%s/%s' % (b, query[key]))

            if len(ms) == 1: 
                d[key] = ms[0]

        # --- Calculate summary meta information from LABELED
        if len(d) == len(keys):

            data, _ = hdf5.load(d[LABELED])

            # --- Aggregate information
            META['index'].append(np.ones(data.shape[0], dtype='int') * len(DATA))
            # --- Guard against division by zero for single-slice volumes
            META['coord'].append(np.arange(data.shape[0]) / max(data.shape[0] - 1, 1))
            DATA.append(d)

    # --- Concatenate all vectors
    META = {k: np.concatenate(v) for k, v in META.items()}
    
    return DATA, META

The following code demonstrates example usage:

In [None]:
# --- Set query
query = {
    'root': '../../data/hdfs',
    'dat': 'dat.hdf5',
    'bet': 'bet.hdf5'}

# --- Find data
data, meta = make_summary(query, LABELED='bet')
print(data)
print(meta)

## Extracting Slice Data

Finally, in addition to `index` and `coord`, we will want to read the provided label and record which values are present in each slice. This is important if we later want to select slices that contain a certain abnormality or finding. In our example, the labels contain brain masks (1 == background, 2 == brain). Thus, we will now create additional vectors (one per label value), each equal in size to the total number of slices of all data (same as `index` and `coord`), that contain a binary True or False indicating whether that slice contains the given value.

The following snippet of pseudocode will accomplish this for an arbitrary number of `CLASSES`:

```
META = {c: [] for c in range(CLASSES + 1)}
for INDEX in range(len(EXAMS)):

    [... load data ...]
    
    for c in range(CLASSES + 1):
        s = np.sum(data == c, axis=(1, 2, 3)) > 0
        META[c].append(s)
```
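
The same logic can be checked on small synthetic label volumes. In the sketch below, the arrays `lbl_a` and `lbl_b` are made-up stand-ins for loaded labels of shape `(Z, H, W, C)`; they are assumptions for illustration only.

```python
import numpy as np

CLASSES = 2

# --- Synthetic stand-ins for loaded labels: shape (Z, H, W, C)
lbl_a = np.zeros((3, 4, 4, 1), dtype='int')
lbl_a[0, :2] = 1    # slice 0 contains values 0 and 1
lbl_a[1, 2:] = 2    # slice 1 contains values 0 and 2
lbl_b = np.ones((2, 4, 4, 1), dtype='int')    # both slices contain only 1

META = {c: [] for c in range(CLASSES + 1)}

for data in (lbl_a, lbl_b):
    for c in range(CLASSES + 1):

        # --- True for each slice with at least one voxel equal to c
        s = np.sum(data == c, axis=(1, 2, 3)) > 0
        META[c].append(s)

# --- Concatenate into one vector per class
META = {k: np.concatenate(v) for k, v in META.items()}

for c, v in META.items():
    print(c, v)
```

Each resulting vector has one entry per slice (five here), so `META[c]` can be used directly as a boolean mask to select all slices containing class `c`.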

See below for a rough implementation of the above concepts. 

In [None]:
import os, glob
import numpy as np

import sys
PATH = '../../'
if PATH not in sys.path:
    sys.path.append(PATH)
from dl_core.io import hdf5

In [None]:
def make_summary(query, LABELED, CLASSES=2):
    """
    Method to read all data and make summary dictionary 

    :params

      (dict) query : {

        'root': '/path/to/root/dir',
        [key_00]: [query_00],
        [key_01]: [query_01], ...

      }

    """
    assert 'root' in query
    assert len(query) > 1
    assert LABELED in query

    # --- Copy query so that the caller's dictionary is not modified
    query = dict(query)
    root = query.pop('root')
    keys = sorted(query.keys())

    q = query.pop(keys[0])
    matches = glob.glob('%s/**/%s' % (root, q), recursive=True)

    DATA = []
    META = {c: [] for c in range(CLASSES + 1)}
    META['index'] = []
    META['coord'] = []

    for n, m in enumerate(matches):

        d = {keys[0]: m}
        b = os.path.dirname(m)

        # --- Find other matches
        for key in keys[1:]:
            ms = glob.glob('%s/%s' % (b, query[key]))

            if len(ms) == 1: 
                d[key] = ms[0]

        # --- Calculate summary meta information from LABELED
        if len(d) == len(keys):

            # --- Aggregate slice-by-slice label information
            data, _ = hdf5.load(d[LABELED])

            for c in range(CLASSES + 1):
                s = np.sum(data == c, axis=(1, 2, 3)) > 0
                META[c].append(s)

            # --- Aggregate information
            META['index'].append(np.ones(data.shape[0], dtype='int') * len(DATA))
            # --- Guard against division by zero for single-slice volumes
            META['coord'].append(np.arange(data.shape[0]) / max(data.shape[0] - 1, 1))
            DATA.append(d)

    # --- Concatenate all vectors
    META = {k: np.concatenate(v) for k, v in META.items()}
    
    return DATA, META

The following code demonstrates example usage:

In [None]:
# --- Set query
query = {
    'root': '../../data/hdfs',
    'dat': 'dat.hdf5',
    'bet': 'bet.hdf5'}

# --- Find data
data, meta = make_summary(query, LABELED='bet', CLASSES=2)
print(data)
print(meta)
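
The last two steps listed at the start of this section, determining training / validation splits and serializing the summary as a Pickle file, can be sketched as follows. This is a minimal illustration on a synthetic summary: the helper `add_folds`, the `'fold'` key, and the file names are all assumptions, and the template `client` may organize its splits differently.

```python
import os, pickle, tempfile
import numpy as np

def add_folds(META, n_exams, n_folds=5, seed=0):
    """
    Randomly assign each exam to a fold, then expand to a per-slice
    vector so all slices of an exam share a fold (hypothetical helper)
    """
    rng = np.random.default_rng(seed)
    folds = np.arange(n_exams) % n_folds
    rng.shuffle(folds)
    META['fold'] = folds[META['index']]

    return META

# --- Synthetic summary: 4 exams with 3 slices each
DATA = [{'dat': 'dat-%i.hdf5' % i, 'bet': 'bet-%i.hdf5' % i} for i in range(4)]
META = {'index': np.repeat(np.arange(4), 3)}
META = add_folds(META, n_exams=len(DATA), n_folds=2)

# --- Slices in fold 0 are validation, all others are training
valid = META['fold'] == 0

# --- Serialize summary as a Pickle file
fname = os.path.join(tempfile.mkdtemp(), 'summary.pkl')
with open(fname, 'wb') as f:
    pickle.dump({'data': DATA, 'meta': META}, f)

# --- Load summary
with open(fname, 'rb') as f:
    summary = pickle.load(f)

print(summary['meta']['fold'])
```

Assigning folds per exam rather than per slice keeps all slices from one volume in the same split, which avoids leaking near-identical neighboring slices between training and validation.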

# Preparing the Client

# Loading Data

# Usage

In this section, we will explore example usage of the template `client` provided in this repository.

**IMPORTANT**: This `client` has been written to load 1 x 512 x 512 (e.g. single slice) arrays from the provided example head CT. You *will need to modify* this code for your own individual projects.

## Import dl_core

To use the `dl_core` library, you need to ensure that the repository path has been set. If you are using the Python interpreter directly (e.g. from the command line), you will need to add the repository path to the `$PYTHONPATH` environment variable. If you are using an IPython interface (e.g. Jupyter), you will need to set the path using the `sys` module:

In [None]:
# --- Set PATH to dl_core library path
PATH = '../../' 

# --- Use sys module to add PATH to the module search path
import sys
if PATH not in sys.path:
    sys.path.append(PATH)

## Import client

In [None]:
from dl_core.clients import Client

In [None]:
# --- Set default path locations
SUMMARY_PATH = '../../data/pkls/summary.pkl'
HDFS_PATH = '../../data/hdfs'

## Creating a summary of the data

Recall that in order to properly handle stratified sampling requirements, we need to know more information about the underlying data (e.g. which slice(s) are positive, etc). This information will then be used to randomize and organize cohorts for future data loading pipelines. See section above for more information.

In [None]:
client = Client(SUMMARY_PATH=SUMMARY_PATH)

client.make_summary(
    query={
        'root': HDFS_PATH,
        'dat': 'dat.hdf5',
        'bet': 'bet.hdf5'},
    LABELED='bet',
    CLASSES=2
)

## Preparing a client

Prior to loading data, we need to prepare the client with specifications regarding the desired cohort and sampling rates. See section above for more information. 

In [None]:
client = Client(SUMMARY_PATH=SUMMARY_PATH)
client.load_summary()

client.prepare_cohorts(fold=0)
client.set_sampling_rates(rates={
    1: 0.5,
    2: 0.5
})

## Loading data

At last, we are ready to use the client to load data.

In [None]:
for i in range(10):
    arrays = client.get()
    print(arrays['dat'].shape)