In [1]:
# Load noTeXbook theme extension
%load_ext notexbook

In [2]:
# Enable noTeXbook theme
%texify

# Beautiful ML Data: Patterns and Best Practice with PyTorch

**Thank you** very much for getting here 🤗

In this tutorial I will present **patterns** and **best practices** for working with data in your `ML` applications in what I think is the most efficient and effective way. We will focus on *good* OOP abstractions, while aiming at integration and *not reinventing the wheel*.
Our main framework of reference will be `PyTorch`, demonstrating its ability to integrate seamlessly with the **Python Data Model**, as well as with other ML libraries (e.g. `sklearn`).


I hope you will find the material useful, and most of all that you'll enjoy the journey 😊

For any question, suggestion, or **new pattern** proposal, feel free to open an [issue](https://github.com/leriomaggio/pytorch-beautiful-ml-data/issues) or a [PR](https://github.com/leriomaggio/pytorch-beautiful-ml-data/pulls) on [GitHub](https://github.com/leriomaggio/pytorch-beautiful-ml-data).


**Table of Contents**

- [Motivations](#Motivations)
    - [Description](#Longer-Description)
- [Zen of Data Abstractions](#The-Zen-of-Data-Abstractions)
- [Outline](#Outline)
- [Patterns Catalogue](#Patterns-Catalogue)

## Patterns Catalogue

_for quick and easy access to examples and case studies_

- `Dataset` and data encapsulation
    - from `bunch` to `Dataset` $\Rightarrow$ [link](2_torch_dataset/dataset_and_torch.ipynb#From-Bunch-to-Dataset)
    - iteration, collation, and batches $\Rightarrow$ [link](2_torch_dataset/dataset_and_torch.ipynb#iteration-time)
    - data representation $\Rightarrow$ [link](1_prelude/ml_data_model.ipynb#Data-Representation-for-Machine-Learning)
    
- Dataset and Transform 
    - `jit-scriptable` preprocessing $\Rightarrow$ [link](3_transformer_samplers/transformers_and_samplers.ipynb#Dataset-and-transform)
    - transform, random, and seed $\Rightarrow$ [link](3_transformer_samplers/transformers_and_samplers.ipynb#Transformers,-Random,-and-seeds)

- Imbalanced dataset $\Rightarrow$ [link](3_transformer_samplers/transformers_and_samplers.ipynb#DataLoaders-and-Imbalanced-data)
    - Weighted Sampler $\Rightarrow$ [link](3_transformer_samplers/transformers_and_samplers.ipynb#Sampler)
    
- Data Partitioning
    - Split `PyTorch` Dataset $\Rightarrow$ [link](4_data_partitioning/data_partitioning.ipynb#Split-a-PyTorch-Dataset)
    - Combining `PyTorch` Dataset $\Rightarrow$ [link](4_data_partitioning/data_partitioning.ipynb#Combining-multiple-PyTorch-Dataset)
    - Train/Test split and `Subset` $\Rightarrow$ [link](4_data_partitioning/data_partitioning.ipynb#Train-Test-Split)
    - K-Fold CV with `PyTorch` $\Rightarrow$ [link](4_data_partitioning/data_partitioning.ipynb#KFold-and-torch.Dataset)
        - `ProxyDataset` pattern $\rightarrow$ [link](4_data_partitioning/data_partitioning.ipynb#As-for-1.A:)
        - `FolderSampler` pattern $\rightarrow$ [link](4_data_partitioning/data_partitioning.ipynb#Changing-Data-Loading-strategy,-not-dataset)
        
- Structured Labels and Data Abstractions
    - `Region` abstraction for segmentation contours $\Rightarrow$ [link](5_case_study/case_study_digital_pathology.ipynb#Data-Abstractions)
    - Label Transformation: from Annotation to Bounding-boxes $\Rightarrow$ [link](5_case_study/case_study_digital_pathology.ipynb#Tiles)
    - Label Rescaling $\Rightarrow$ [link](5_case_study/case_study_digital_pathology.ipynb#Identify-Tiles-for-Annotations)
    - Tiles Transform operator $\Rightarrow$ [link](5_case_study/case_study_digital_pathology.ipynb#Tiling-as-a-Transformer)

#### The Zen of Data Abstractions

The general _mantra_ that we will try to support in this tutorial is:

- Data **never** comes *already* pre-processed (🚫 `MNIST`)
    - either in `features` or in `partitions`

- A dataset for DL is (and requires) more than `numpy.ndarray` $\mapsto$ `torch.Tensor`[$^{\star}$](#fnstar2)
    - same applies to `pandas.DataFrame` $\mapsto$ `torch.Tensor`

- Data Science 💙 OOP

- *good* OOP abstractions make your life a lot easier
    - and are also a lot of fun to use!
    
- FYI: **Python Data Model** rocks 🚀

I called this the *Zen of Data Abstractions* (for _deep learning_).

*Note*: I came up with this *name* while working on this tutorial, so I presume the list is still _incomplete_ and would benefit from some revisions. 
*If only* the **Global PyData Community** would be gathered into a unique conference...

<span id="fnstar2"><i>[$^{\star}$]: </i>Although there exists an instance of `torch.utils.data.TensorDataset` 😊</span>
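
To make the footnote concrete, here is a minimal sketch of the plain `numpy.ndarray` $\mapsto$ `torch.Tensor` route using `torch.utils.data.TensorDataset`. The toy arrays and variable names are assumptions made up for illustration; they are not part of the tutorial material.

```python
import numpy as np
import torch
from torch.utils.data import TensorDataset

# Hypothetical toy data: 6 samples with 4 features each, plus binary labels.
features = np.arange(24, dtype=np.float32).reshape(6, 4)
labels = np.array([0, 1, 0, 1, 0, 1], dtype=np.int64)

# numpy.ndarray -> torch.Tensor (from_numpy shares memory with the array).
X = torch.from_numpy(features)
y = torch.from_numpy(labels)

# TensorDataset indexes all the wrapped tensors along their first dimension.
dataset = TensorDataset(X, y)

# The Python Data Model at work: len() and [] rely on __len__ and __getitem__.
n_samples = len(dataset)
first_features, first_label = dataset[0]
```

This works, but the dataset only knows about tensors: anything else (label names, preprocessing, partitions) has to live somewhere outside of it, which is where richer abstractions start to pay off.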

### Motivations

Data is essential in Machine Learning, and PyTorch offers a very Pythonic solution to load complex and heterogeneous datasets. However, data loading is merely the first step, and it is usually followed by *preprocessing* | *batching* | *sampling* | *partitioning* | *augmenting*. 
`ML` data requires _good_ (`OOP`, *ed.*) abstractions to deal with different, complex, and evolving data pipelines, so `numpy.ndarray` (as well as `torch.Tensor`) is *wonderful*, but simply _not enough_.     

This tutorial explores the internals of `torch.utils.data`, and describes patterns and best practices for elegant data solutions in Machine and Deep learning with PyTorch.

#### Longer Description

Data processing is at the heart of every Machine Learning (ML) model _training & evaluation_ loop, and PyTorch has revolutionised the way in which data is managed.
The very Pythonic `Dataset` and `DataLoader` classes replace (_nested_) `list`s of NumPy `ndarray`s.

However, data `loading` is merely the first step. Data *preprocessing* | *batching* | *sampling* | *partitioning* are fundamental operations that are usually required in a complete ML pipeline.

If not properly managed, these operations could ultimately lead to lots of _boilerplate_ code and to _re-inventing the wheel_ ™.
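
As a rough sketch of how these operations map onto `torch.utils.data` (and not the tutorial's own implementation), the example below uses a hypothetical `ToyDataset` with made-up random data: the `transform` argument handles preprocessing, `random_split` handles partitioning, `WeightedRandomSampler` handles sampling, and `DataLoader` handles batching. The class name, sizes, and seeds are all assumptions chosen for illustration.

```python
import torch
from torch.utils.data import (
    DataLoader,
    Dataset,
    WeightedRandomSampler,
    random_split,
)


class ToyDataset(Dataset):
    """Hypothetical map-style dataset: samples and labels live together."""

    def __init__(self, n_samples=20, n_features=3, transform=None):
        generator = torch.Generator().manual_seed(0)
        self.samples = torch.randn(n_samples, n_features, generator=generator)
        # Imbalanced labels: only one sample in five is positive.
        self.labels = (torch.arange(n_samples) % 5 == 0).long()
        self.transform = transform

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, index):
        sample = self.samples[index]
        if self.transform is not None:
            sample = self.transform(sample)  # preprocessing step
        return sample, self.labels[index]


dataset = ToyDataset(transform=lambda x: x * 2.0)

# Partitioning: random_split returns two Subset objects over the same dataset.
train_set, test_set = random_split(
    dataset, [16, 4], generator=torch.Generator().manual_seed(42)
)

# Sampling: weight each training sample by the inverse frequency of its class.
train_labels = torch.stack([train_set[i][1] for i in range(len(train_set))])
class_counts = torch.bincount(train_labels, minlength=2).float().clamp(min=1)
sample_weights = (1.0 / class_counts)[train_labels]
sampler = WeightedRandomSampler(
    sample_weights, num_samples=len(train_set), replacement=True
)

# Batching: the default collate function stacks samples into batch tensors.
train_loader = DataLoader(train_set, batch_size=4, sampler=sampler)
batch_samples, batch_labels = next(iter(train_loader))
```

Note that none of these steps required touching the underlying tensors directly: each concern is delegated to an existing abstraction, which is the kind of integration the rest of the tutorial builds on.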

This tutorial will dig into the internals of `torch.utils.data` to present patterns and best practices for loading heterogeneous and custom datasets in the most elegant and Pythonic way.

The tutorial is organised in five parts, each focusing on specific patterns for ML data and scenarios (see the general [outline](#Outline) below).

### Outline

1. Prelude: Data Representation for Machine Learning
    - Representation Model of Machine Learning data
        - Example: Images
    - Exercise: Transforming textual features from `source code`
    - Extra: `numpy` internals and performance exploration

2. Introduction to `Dataset` and `DataLoader`
    * `torch.utils.data.Dataset` at a glance
        * Case Study: FER Dataset
    * `torch.utils.data.DataLoader` at a glance
        - Customising the `collate_fn` function

3. Data Transformation and Sampling
    * Preprocessing samples with `torchvision` transforms
        - `FERVision` $\Longleftarrow$  `FER`
        * Case Study: Random transformers and `LocalRandomApply`
        * Transformer pipelines with `torchvision.transforms.Compose`
    * Data Sampling and Data Loader
        * Case Study: Handling imbalanced samples in FER data

4. Data Partitioning: the PyTorch way
    * One _Dataset_ is One `Dataset`
    * Subset and `random_split` meet `sklearn`
        - `train_test_split_dataset`
    * `Dataset` and Cross-Validation
        * `torch.utils.data.Dataset` $\rightleftharpoons$ `sklearn.model_selection.KFold` 
        * `ProxySubset` vs `FoldSampler`

5. Data Abstractions for Image Segmentation
    * `dataclass` and Python Data Model
    * Case Study for Digital Pathology
    * Working with tiles and Patches
        * Patches in Batches for Spleen Segmentation