**Table of contents:**
- [Concepts](#Concepts)
  - [Wrangler](#Wrangler)
    - [Members](#Wrangler%20members)
  - [Learner](#Learner)
    - [Members](#Learner%20members)
- [Examples](#Examples)
  - [Generic learner](#Generic_learner)
  - [Example wrangler step-by-step](#Example_wrangler_step-by-step)
  - [Example learner step-by-step](#Example_learner_step-by-step)
- [Requirements](#Requirements)

# Concepts<a name="Concepts"></a>

The two central concepts of this package are the _wrangler_ and the _learner_. In a nutshell,
- the [_wrangler_](#Wrangler) extracts the raw data so as to explore it and prepare it for ingestion by
- the [_learner_](#Learner) which then analyses it using some predictive model that it learns.

Both concepts are encapsulated as Python classes which can be inherited with few constraints other than the fact that they are <a href="https://en.wikipedia.org/wiki/Duck_typing">duck-typed</a> as per the descriptions below.

## Wrangler<a name="Wrangler"></a>

The wrangler parent class is defined in [`wrangler.py`](./wrangler.py). It takes parameters that specify where the raw data is to be found (`data_source`) and how it is to be read and processed (`**kwargs`). Upon initialization, the raw data is acquired by a call to `self.acquire()`, stored as `self.dataset`, and validated with `self.validate()`. After initialization, further wrangling of the raw data can be performed with a call to `self.wrangle()`.

Optionally, snapshots of the successive transformations of the data can be stored in a dictionary `self.datasets`, of which `self.dataset` is typically the latest element to be inserted.
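
The lifecycle described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the code in `wrangler.py`: the class name `SketchWrangler`, the CSV-based `acquire()`, the label check in `validate()`, and the `'wrangled'` snapshot key are all hypothetical.

```python
import csv
import os
import tempfile


class SketchWrangler:
    """Hypothetical stand-in for the wrangler parent class (not the package's code)."""

    def __init__(self, data_source, **kwargs):
        self.data_source = data_source
        self.kwargs = kwargs
        self.datasets = {}                    # optional snapshots, keyed by stage
        self.dataset = self.acquire()         # the raw data becomes the "head" dataset
        self.datasets['raw'] = self.dataset
        self.validate()

    def acquire(self):
        # Read every row of the CSV file as a dict of strings.
        with open(self.data_source, newline='') as handle:
            return list(csv.DictReader(handle))

    def validate(self):
        # Assumed check: the label column must exist in the header.
        label = self.kwargs.get('label_name', 'target')
        if any(label not in row for row in self.dataset):
            raise ValueError(f'missing label column {label!r}')

    def wrangle(self):
        # Cast every field to float to obtain an all-numerical representation.
        self.dataset = [{key: float(value) for key, value in row.items()}
                        for row in self.dataset]
        self.datasets['wrangled'] = self.dataset


# Usage: write a tiny CSV file, initialize the wrangler, then wrangle.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'tiny.csv')
    with open(path, 'w', newline='') as handle:
        handle.write('age,target\n63,0\n67,1\n')
    wrangler = SketchWrangler(path, label_name='target')
    wrangler.wrangle()
```

Because `wrangle()` builds a new list instead of mutating the old one, the `'raw'` snapshot still holds the strings that were read from the file.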

### Members<a name="Wrangler members"></a>

- `self.data_source` [str, ...]: Path to the data source (e.g., CSV file, connection to a database, etc).
- `self.dataset` [tf.data.Dataset, dict, ...]: Dataset (potentially batched). Note that this becomes a [dict] of datasets upon a call to `self.split()`.
- `self.datasets` [dict]: Dictionary of snapshots of the data at various stages in the processing. Due to memory limitations, this may only be feasible when the data is computed lazily.
- `self.acquire()`: Acquire the raw data.
- `self.validate()`: Validate the raw data.
- `self.wrangle()`: Wrangle the data into a more "high-grade" representation, typically an all-numerical one that's readily ingestible by a machine learning algorithm.
- `self.view()`: View the data, typically in batches.
- `self.explore()`: Explore the data by producing preliminary, descriptive statistics (e.g., correlation matrices, means, etc).
- `self.split()`: Split the data, typically into training, testing, and serving sets. Note that this affects both the "head" dataset `self.dataset` and all of its previous snapshots in `self.datasets`, if any.
- `self.shuffle()`: Shuffle the data.
- `self.stratify()`: Stratify the dataset.
- `self.normalize()`: Normalize the dataset.
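
To illustrate how `self.split()` can affect both the head dataset and its snapshots, here is a minimal sketch that works on plain lists. The helper `split_rows` and the proportional, order-preserving splitting rule are assumptions made for illustration; the package may split differently (for example on `tf.data.Dataset` objects).

```python
def split_rows(rows, split_sizes):
    """Split a list into named, consecutive chunks proportional to split_sizes."""
    total = sum(split_sizes.values())
    parts, start, cumulative = {}, 0, 0
    for name, size in split_sizes.items():
        cumulative += size
        stop = len(rows) * cumulative // total   # integer boundary of this chunk
        parts[name] = rows[start:stop]
        start = stop
    return parts


# Hypothetical snapshots of 20 rows each, before splitting.
datasets = {'raw': list(range(20)),
            'wrangled': [float(i) for i in range(20)]}
split_sizes = {'train': 8, 'validate': 1, 'test': 1}

# Every snapshot is split in the same way, so each becomes a dict of subsets.
datasets = {stage: split_rows(rows, split_sizes) for stage, rows in datasets.items()}
dataset = datasets['wrangled']   # the "head" dataset is now a dict as well
```

After the split, a subset of an earlier snapshot is addressed as `datasets['raw']['train']`, while the head dataset is addressed as `dataset['train']`.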

## Learner<a name="Learner"></a>

The learner parent class is defined in `learner.py`. Its core components are the data `self.data` and the model `self.model`, which is trained, tested, and served on that data. The data and the model are respectively parametrized by
- the data wrangling parameters (`data_params`), and
- the model hyperparameters per se (`hyperparams`).

Both `data_params` and `hyperparams` are potentially subject to optimization in their respective search spaces, `data_params_space` and `hyperparams_space`. Additionally, the results of the learner are stored in `metrics`, a dictionary keyed by learning stage (train, test, etc). All these results, as well as the learner itself, can be saved under the path specified by `learner_dir`.
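
As a rough sketch of what optimizing over a search space can look like, the snippet below enumerates a small grid and keeps the best candidate. The layout of `hyperparams_space` (a dict of lists), the hyperparameter names, and the scoring function `toy_score` are all assumptions; a real learner would train and validate a model for each candidate.

```python
import itertools

# Hypothetical search space: every key maps to the values to try.
hyperparams_space = {'units': [8, 16], 'learning_rate': [0.01, 0.1]}


def candidates(space):
    """Yield one hyperparameter dict per point of the grid."""
    keys = list(space)
    for values in itertools.product(*(space[key] for key in keys)):
        yield dict(zip(keys, values))


def toy_score(hyperparams):
    # Placeholder for "train, then evaluate on the validation set".
    return -abs(hyperparams['units'] - 16) - abs(hyperparams['learning_rate'] - 0.01)


metrics = {'select': []}
for candidate in candidates(hyperparams_space):
    metrics['select'].append({'hyperparams': candidate, 'score': toy_score(candidate)})

# Keep the best-scoring candidate as the selected hyperparameters.
best = max(metrics['select'], key=lambda record: record['score'])
hyperparams = best['hyperparams']
```

The same pattern would extend to `data_params_space` by enumerating its grid as well, at the cost of re-wrangling the data for every candidate.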

A [mock run](#Generic_learner) of a learner is shown below for the most generic case.

### Members<a name="Learner members"></a>

- `self.model` [tf.keras]: Actual model used for classification, regression, reinforcement learning, etc.
- `self.data` [Wrangler()]: Wrangler object used by the learner.
- `self.learner_dir` [str, list]: Directory where all the files of the model are stored (including metrics, statistical summaries, plots, etc).
- `self.data_params` [dict]: Dictionary which parametrizes the data wrangling.
- `self.hyperparams` [dict]: Dictionary which parametrizes the model.
- `self.metrics` [dict]: Metrics, statistics, health-checks, runtimes, or data samples for the various stages of the learner. Namely:
   - `explore` [dict]: mean, standard deviation, counts of the features, etc.
   - `select` [dict]: F1-score, contingency matrices, RMS, runtime for the various combinations of hyperparameters.
   - `train` [dict]: F1-score, contingency matrices, RMS, runtime, etc, both for the training set proper and for the validation set.
   - `test` [dict]: predicted and actual targets, F1-score, contingency matrices, RMS, runtime, etc.
   - `serve` [dict]: predicted and actual targets, runtime.
- `self.wrangle()`: Assemble the data into a child object of `Wrangler()` using `self.data_params`, apply any splitting, stratification, or further wrangling as necessary, and save the object as `self.data`.
- `self.design()`: Design the model as per the hyperparameters `self.hyperparams` and save it as `self.model`.
- `self.explore()`: Explore the data and save the results under `self.metrics['explore']`. This can also produce graphical summaries.
- `self.select()`: Select the model by optimizing over the hyperparameter space `self.hyperparams_space`. The best model is to be overwritten into `self.model` and the metrics of the hyperparameter search into `self.metrics['select']`. Optionally, the model selection can be expanded to also cover a search over the data wrangling space `self.data_params_space`.
- `self.select_report()`: Graphical and textual report on the model selection.
- `self.train()`: Train the model (by default on `self.data.dataset['train']`).
- `self.train_report()`: Graphical and textual report on training the model.
- `self.test()`: Test the model (by default on `self.data.dataset['test']`).
- `self.test_report()`: Graphical and textual report on testing the model.
- `self.serve()`: Serve the model (by default on `self.data.dataset['serve']`).
- `self.serve_report()`: Graphical and textual report on serving the model.
- `self.save()`: Save the learner under `self.learner_dir`.
- `self.run()`: Helper function which runs a whole sequence of learning stages (typically exploration, selection, training, testing, and serving).
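
The duck-typed interface above can be illustrated with a toy learner whose `run()` chains the stages and fills `self.metrics`. This is a sketch under assumptions: `SketchLearner`, its one-parameter "model", the `initial_slope` hyperparameter, and the choice to call `wrangle()` and `design()` from `__init__` are hypothetical, `self.data` is a plain dict rather than a wrangler object, and model selection is left out for brevity.

```python
class SketchLearner:
    """Hypothetical learner that follows the member names listed above."""

    def __init__(self, learner_dir, data_params=None, hyperparams=None):
        self.learner_dir = learner_dir
        self.data_params = data_params or {}
        self.hyperparams = hyperparams or {}
        self.metrics = {}
        self.wrangle()
        self.design()

    def wrangle(self):
        # Stand-in for a wrangler: (feature, target) pairs that are already split.
        self.data = {'train': [(1, 2.0), (2, 4.0)],
                     'test': [(3, 6.0)],
                     'serve': [(4, None)]}

    def design(self):
        # Toy "model": a single slope, so that y = slope * x.
        self.model = {'slope': self.hyperparams.get('initial_slope', 0.0)}

    def explore(self):
        targets = [y for _, y in self.data['train']]
        self.metrics['explore'] = {'mean_target': sum(targets) / len(targets)}

    def train(self):
        # Least-squares slope of a line through the origin.
        pairs = self.data['train']
        self.model['slope'] = (sum(x * y for x, y in pairs)
                               / sum(x * x for x, _ in pairs))
        self.metrics['train'] = {'slope': self.model['slope']}

    def test(self):
        errors = [(self.model['slope'] * x - y) ** 2 for x, y in self.data['test']]
        self.metrics['test'] = {'rms': (sum(errors) / len(errors)) ** 0.5}

    def serve(self):
        predicted = [self.model['slope'] * x for x, _ in self.data['serve']]
        self.metrics['serve'] = {'predicted': predicted}

    def run(self, stages=('explore', 'train', 'test', 'serve')):
        # Call each stage by name, in order.
        for stage in stages:
            getattr(self, stage)()
        return self.metrics


learner = SketchLearner(learner_dir='sketch')
metrics = learner.run()
```

Since `run()` only looks the stages up by name, a child class can override any single stage without touching the rest of the sequence.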

# Examples<a name="Examples"></a>

## Generic learner<a name="Generic_learner"></a>

In [1]:
```python
!python main.py
```

- Perform any necessary pre-processing or validation of the arguments.

- Acquire and wrangle the data -> `self.data`, or
  alternatively, invoke the environment (for reinforcement learning) -> `self.env`.
- Split the data set.

- Design the model -> `self.model`.

- Explore the d

## Example wrangler step-by-step<a name="Example_wrangler_step-by-step"></a>

In [2]:
```python
import os

# This requires that the learners package is pip-installed.
import learners.utilities as util
import learners.wrangler as wra

data_params = dict(
    label_name='target',
    data_source=os.path.join('data', 'heart.csv'),
    categories=dict(thal=['fixed', 'normal', 'reversible']),
    categorical_features=['thal'],
    numeric_features=['age', 'sex', 'cp', 'trestbps', 'chol',
                      'fbs', 'restecg', 'thalach', 'exang',
                      'oldpeak', 'slope', 'ca'],
    split_sizes={'train': 8, 'validate': 1, 'test': 1},
    batch_size=5,
    shuffle=False)

wrangler_object = wra.FromFileTF(**data_params)
wrangler_object.view(return_list=True, print2screen=False)[0]
```

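|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | b'fixed' | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | b'normal' | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | b'reversible' | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | b'normal' | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | b'normal' | 0 |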


In [3]:
```python
wrangler_object.wrangle()
wrangler_object.view(return_list=True, print2screen=False)[0]
```

|   | thal | numeric | target |
|---|---|---|---|
| 0 | b'fixed' | `(tf.Tensor(63.0, shape=(), dtype=float32), tf....` | 0 |
| 1 | b'normal' | `(tf.Tensor(67.0, shape=(), dtype=float32), tf....` | 1 |
| 2 | b'reversible' | `(tf.Tensor(67.0, shape=(), dtype=float32), tf....` | 0 |
| 3 | b'normal' | `(tf.Tensor(37.0, shape=(), dtype=float32), tf....` | 0 |
| 4 | b'normal' | `(tf.Tensor(41.0, shape=(), dtype=float32), tf....` | 0 |


In [4]:
```python
wrangler_object.view(wrangler_object.datasets['raw'], return_list=True, print2screen=False)[0].describe()
```

|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 | 5.0 |
| mean | 55.0 | 0.8 | 2.8 | 137.0 | 240.4 | 0.2 | 1.6 | 149.2 | 0.4 | 2.26 | 2.2 | 1.0 | 0.2 |
| std | 14.764823 | 0.447214 | 1.30384 | 15.652476 | 30.336447 | 0.447214 | 0.894427 | 31.838656 | 0.547723 | 0.861974 | 0.83666 | 1.414214 | 0.447214 |
| min | 37.0 | 0.0 | 1.0 | 120.0 | 204.0 | 0.0 | 0.0 | 108.0 | 0.0 | 1.4 | 1.0 | 0.0 | 0.0 |
| 25% | 41.0 | 1.0 | 2.0 | 130.0 | 229.0 | 0.0 | 2.0 | 129.0 | 0.0 | 1.5 | 2.0 | 0.0 | 0.0 |
| 50% | 63.0 | 1.0 | 3.0 | 130.0 | 233.0 | 0.0 | 2.0 | 150.0 | 0.0 | 2.3 | 2.0 | 0.0 | 0.0 |
| 75% | 67.0 | 1.0 | 4.0 | 145.0 | 250.0 | 0.0 | 2.0 | 172.0 | 1.0 | 2.6 | 3.0 | 2.0 | 0.0 |
| max | 67.0 | 1.0 | 4.0 | 160.0 | 286.0 | 1.0 | 2.0 | 187.0 | 1.0 | 3.5 | 3.0 | 3.0 | 1.0 |


## Example learner step-by-step<a name="Example_learner_step-by-step"></a>

In [5]:
```python
import learners.learner as lea

learner_object = lea.Heart(learner_dir=os.path.join('learners', 'heart'))
```

In [6]:
```python
learner_object.data.view(learner_object.data.datasets['raw']['train'], return_list=True, print2screen=False)[0]
```

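|   | age | sex | cp | trestbps | chol | fbs | restecg | thalach | exang | oldpeak | slope | ca | thal | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 63 | 1 | 1 | 145 | 233 | 1 | 2 | 150 | 0 | 2.3 | 3 | 0 | b'fixed' | 0 |
| 1 | 67 | 1 | 4 | 160 | 286 | 0 | 2 | 108 | 1 | 1.5 | 2 | 3 | b'normal' | 1 |
| 2 | 67 | 1 | 4 | 120 | 229 | 0 | 2 | 129 | 1 | 2.6 | 2 | 2 | b'reversible' | 0 |
| 3 | 37 | 1 | 3 | 130 | 250 | 0 | 0 | 187 | 0 | 3.5 | 3 | 0 | b'normal' | 0 |
| 4 | 41 | 0 | 2 | 130 | 204 | 0 | 2 | 172 | 0 | 1.4 | 1 | 0 | b'normal' | 0 |
| 5 | 56 | 1 | 2 | 120 | 236 | 0 | 0 | 178 | 0 | 0.8 | 1 | 0 | b'normal' | 0 |
| 6 | 62 | 0 | 4 | 140 | 268 | 0 | 2 | 160 | 0 | 3.6 | 3 | 2 | b'normal' | 1 |
| 7 | 57 | 0 | 4 | 120 | 354 | 0 | 0 | 163 | 1 | 0.6 | 1 | 0 | b'normal' | 0 |
| 8 | 63 | 1 | 4 | 130 | 254 | 0 | 2 | 147 | 0 | 1.4 | 2 | 1 | b'reversible' | 1 |
| 9 | 53 | 1 | 4 | 140 | 203 | 1 | 2 | 155 | 1 | 3.1 | 3 | 0 | b'reversible' | 0 |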


In [7]:
```python
learner_object.model
```

```
<tensorflow.python.keras.engine.sequential.Sequential at 0x7f6bc5732790>
```

# Requirements<a name="Requirements"></a>

- Do not rename any existing method or member, since the duck typing relies on these names.
- Use TensorFlow's Keras for the models whenever it can do the job.
- Use TensorFlow's `tf.data.Dataset` for the data whenever it can do the job.
- All learners and wranglers should be self-contained. That is, any helper function should be placed under the child classes, not scattered across different modules, standalone functions, or global variables.