# Overview of Client

The data loading pipeline for machine learning training requires deliberate planning. Key considerations include:

* keeping track of all data and labels for each individual training case
* consistent stratified sampling between training and validation splits
* consistent stratified sampling between different data cohorts (e.g. from different hospitals or sources)
* randomization of data loading order between epochs
* any "real time" data preprocessing (e.g. normalization)

Other advanced functionality may include:

* in-memory loading of all data before training starts (if dataset is small)
* asynchronous loading of data (if dataset is large)
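
As a rough illustration of the asynchronous loading mentioned above, the sketch below prefetches examples on a background thread using only the Python standard library. The names `load_case` and `prefetch` are hypothetical and are not part of `dl_core`; the template client may implement this differently.

```python
import queue
import threading

def load_case(index):
    """Hypothetical stand-in for an expensive disk read."""
    return {'index': index, 'dat': [index] * 4}

def prefetch(indices, buffer_size=4):
    """Yield loaded cases while a background thread keeps a small buffer full."""
    buffer = queue.Queue(maxsize=buffer_size)
    sentinel = object()  # marks the end of the stream

    def worker():
        for index in indices:
            buffer.put(load_case(index))  # blocks while the buffer is full
        buffer.put(sentinel)

    threading.Thread(target=worker, daemon=True).start()

    while True:
        item = buffer.get()
        if item is sentinel:
            return
        yield item

# --- Consume cases as they become available
loaded = [case['index'] for case in prefetch(range(10))]
print(loaded)
```

The bounded queue keeps memory use fixed: the worker pauses whenever `buffer_size` cases are waiting, so loading runs ahead of training by only a few examples.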

In this tutorial, we will cover the basics of creating a `client` for loading data, including much of the key functionality described above. For each individual project, you must create your **own individualized** client for loading data. To help you get started, a fully functional template `client` is available in this repository at `/dl_core/clients/client.py`. Example usage of this module is available at the end of this tutorial.

## Key Concepts

All terms in **bold** will reference specific concepts that are reused throughout this tutorial. Please review these terms before proceeding further.

Machine learning algorithms require data to be **split** into *training* and *validation* sets. All training paradigms *require* this baseline division of data. For the purposes of this tutorial, a **split** represents the current usage of any particular example as *training* or *validation* data. Note that in the setting of cross-validation, all data will be used as *both* training and validation cases at different points during algorithm development.

For certain problems, it is necessary to further subdivide data into **cohorts**. For the purposes of this tutorial, a **cohort** represents *any arbitrary* division of data into user-defined subgroups. Why are **cohorts** important? It turns out that stratified sampling (e.g. selecting data at fixed rates from specific cohorts) often improves training dynamics for heterogeneous datasets, including those commonly seen in medical problems. For example, given the low prevalence of most disease states, it is often beneficial to load data at an even 50-50% distribution between positive and negative examples (e.g. oversample the positive category).
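
To make the effect of oversampling concrete, the sketch below compares uniform sampling against drawing evenly from two cohorts. The 5% prevalence, the `cohorts` dictionary, and `sample_balanced` are assumptions made for illustration and are not part of `dl_core`.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- Hypothetical dataset: 1000 cases, roughly 5% positive
labels = (rng.random(1000) < 0.05).astype(int)

# --- Group case indices by cohort (here: negative = 0, positive = 1)
cohorts = {c: np.flatnonzero(labels == c) for c in (0, 1)}

def sample_balanced(n):
    """Pick a cohort with equal probability, then a random case within it."""
    picks = rng.integers(0, 2, size=n)
    return np.array([rng.choice(cohorts[c]) for c in picks])

uniform = rng.integers(0, labels.size, size=2000)
balanced = sample_balanced(2000)

print('uniform  positive fraction: %.2f' % labels[uniform].mean())
print('balanced positive fraction: %.2f' % labels[balanced].mean())
```

With a low prevalence, uniform sampling yields few positive examples per batch, while the cohort-based draw should land near an even split.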

Given the above, two different sampling **rates** are defined:

* **training_rate**: represents the rate of randomly selecting training / validation cases
* **sampling_rate**: represents the rate at which each individual **cohort** is sampled from
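
One possible reading of how the two rates interact is sketched below: each draw first chooses a split using `training_rate`, then a cohort using `sampling_rate`, then a case within that cohort. This is an interpretation for illustration only; the `cases` layout, the `draw` function, and the value 0.8 are hypothetical, and the cohort keys simply mirror the `rates` dictionary used in the Usage section.

```python
import random

# --- Hypothetical case IDs organized by split, then by cohort
cases = {
    'train': {1: ['t-neg-0', 't-neg-1', 't-neg-2'], 2: ['t-pos-0']},
    'valid': {1: ['v-neg-0', 'v-neg-1'], 2: ['v-pos-0']},
}

training_rate = 0.8               # assumed: probability of drawing from the training split
sampling_rate = {1: 0.5, 2: 0.5}  # probability of drawing from each cohort

def draw(rng):
    """Choose a split, then a cohort, then a case within that cohort."""
    split = 'train' if rng.random() < training_rate else 'valid'
    keys = list(sampling_rate)
    cohort = rng.choices(keys, weights=[sampling_rate[k] for k in keys])[0]
    return split, cohort, rng.choice(cases[split][cohort])

rng = random.Random(0)
draws = [draw(rng) for _ in range(1000)]

train_fraction = sum(split == 'train' for split, _, _ in draws) / len(draws)
cohort_2_fraction = sum(cohort == 2 for _, cohort, _ in draws) / len(draws)
print(train_fraction, cohort_2_fraction)
```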

# Creating a Data Summary

Given the need to carefully subselect splits or cohorts of data, it is often valuable to extract key characteristics of the data as an independent step *before* loading any data. In this tutorial, the implementation will include the following steps:

* find all data and labels for each individual training case
* extract summary information about each training case
* aggregate all summarized data
* determine training / validation splits
* serialize summary as a Pickle file
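
The steps above can be sketched end to end on a toy dataset. This is a minimal illustration, not the template client's `make_summary` implementation: it uses `.npy` files in place of HDF5 to avoid extra dependencies, and the file names, the `positive` field, and the round-robin fold assignment are all assumptions.

```python
import glob
import os
import pickle
import tempfile

import numpy as np

rng = np.random.default_rng(0)
root = tempfile.mkdtemp()

# --- Create a toy dataset: one folder per case with a data and a label array
for i in range(6):
    case_dir = os.path.join(root, 'case-%02d' % i)
    os.makedirs(case_dir)
    np.save(os.path.join(case_dir, 'dat.npy'), rng.random((4, 8, 8)))
    np.save(os.path.join(case_dir, 'lbl.npy'), rng.integers(0, 2, size=(4, 8, 8)))

# --- 1. Find data and labels for each case
label_paths = sorted(glob.glob(os.path.join(root, '*', 'lbl.npy')))

# --- 2-3. Extract per-slice summary information and aggregate across cases
summary = []
for lbl_path in label_paths:
    lbl = np.load(lbl_path)
    for z in range(lbl.shape[0]):
        summary.append({
            'dat': os.path.join(os.path.dirname(lbl_path), 'dat.npy'),
            'lbl': lbl_path,
            'slice': z,
            'positive': bool(lbl[z].any())})

# --- 4. Assign each case (not each slice) to a cross-validation fold
FOLDS = 3
fold_of = {path: i % FOLDS for i, path in enumerate(label_paths)}
for row in summary:
    row['fold'] = fold_of[row['lbl']]

# --- 5. Serialize the summary as a Pickle file
summary_path = os.path.join(root, 'summary.pkl')
with open(summary_path, 'wb') as f:
    pickle.dump(summary, f)
```

Assigning folds per case rather than per slice keeps all slices from one case in the same split, which avoids leaking near-identical slices between training and validation.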

# Preparing the Client

# Loading Data

# Usage

In this section, we will explore example usage of the template `client` provided in this repository.

**IMPORTANT**: This `client` has been written to load 1 x 512 x 512 (i.e. single slice) arrays from the provided example head CT. You *will need to modify* this code for your own individual projects.

## Import dl_core

To use the `dl_core` library, you need to ensure that the repository path has been set. If you are using the Python interpreter directly (e.g. from the command line), you will need to add the repository path to the `$PYTHONPATH` environment variable. If you are using an IPython interface (e.g. Jupyter), you will need to set the path using the `sys` module:

In [None]:
# --- Set PATH to dl_core library path
PATH = '../../' 

# --- Use sys module to set $PYTHONPATH
import sys
if PATH not in sys.path:
    sys.path.append(PATH)

## Import client

In [None]:
from dl_core.clients import Client

In [None]:
# --- Set default path locations
SUMMARY_PATH = '../../data/pkls/summary.pkl'
HDFS_PATH = '../../data/hdfs'

## Creating a summary of the data

Recall that in order to properly handle stratified sampling requirements, we need to know more about the underlying data (e.g. which slices are positive). This information will then be used to randomize and organize cohorts for future data loading pipelines. See the Creating a Data Summary section above for more information.

In [None]:
client = Client(SUMMARY_PATH=SUMMARY_PATH)

client.make_summary(
    query={
        'root': HDFS_PATH,
        'dat': 'dat.hdf5',
        'bet': 'bet.hdf5'},
    LABELED='bet',
    CLASSES=2
)

## Preparing a client

Prior to loading data, we need to prepare the client with specifications regarding the desired cohort and sampling rates. See section above for more information. 

In [None]:
client = Client(SUMMARY_PATH=SUMMARY_PATH)
client.load_summary()

client.prepare_cohorts(fold=0)
client.set_sampling_rates(rates={
    1: 0.5,
    2: 0.5
})

## Loading data

At last, we are ready to use the client to load data.

In [None]:
for i in range(10):
    arrays = client.get()
    print(arrays['dat'].shape)