
leriomaggio/pytorch-beautiful-ml-data

Beautiful Data for Machine Learning

Patterns & Best Practice for effective Data solutions with PyTorch

This tutorial will be presented at the PyData Global 2020 conference.

Abstract

Data is essential in Machine Learning, and PyTorch offers a very Pythonic solution to load complex and heterogeneous datasets. However, data loading is merely the first step: preprocessing, batching, sampling, partitioning, and augmenting the data usually follow.

This tutorial explores the internals of torch.utils.data, and describes patterns and best practices for elegant data solutions in Machine and Deep learning with PyTorch.
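To give a flavour of the starting point, here is a minimal, hypothetical map-style Dataset sketch (the ToyDataset class and the random tensors are illustrative placeholders, not taken from the tutorial notebooks):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class ToyDataset(Dataset):
    """A tiny map-style dataset wrapping a tensor of features and labels."""
    def __init__(self, features, labels):
        self.features = features
        self.labels = labels

    def __len__(self):
        return len(self.features)

    def __getitem__(self, idx):
        return self.features[idx], self.labels[idx]

# 100 samples, 8 features each, binary labels (purely illustrative data)
dataset = ToyDataset(torch.randn(100, 8), torch.randint(0, 2, (100,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)
for batch_x, batch_y in loader:
    ...  # training / evaluation step goes here
```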

Get started

If you want to start digging into examples and patterns, there is a Cover notebook to get you started.

Outline

  1. Part 1 (Prelude)

    • Data Representation for Machine Learning
  2. Part 2 Intro to Dataset and DataLoader

    • torch.utils.data.Dataset at a glance
    • Case Study: FER Dataset
  3. Part 3 Data Transformation and Sampling

    • torchvision transforms
    • Case Study: Custom (Random) transforms
      • Transform pipelines with torchvision.transforms.Compose
    • Data Sampling and Data Loader
      • Handling imbalanced samples in FER data
  4. Part 4 Data Partitioning (training / validation / test): the PyTorch way

    • One Dataset is One Dataset
    • Subset and random_split
    • Case Study: Dataset and Cross-Validation
      • How to combine torch.utils.data.Dataset and sklearn.model_selection.KFold (without using skorch); a minimal sketch follows this outline
      • Combining Data Partitioning and Transforms
  5. Part 5 Data Abstractions for Image Segmentation

    • dataclass and Python Data Model
    • Case Study for Digital Pathology
    • Working with tiles and Patches
      • Patches in Batches for Spleen Segmentation
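As a preview of the cross-validation pattern in Part 4, a minimal sketch of combining sklearn.model_selection.KFold with torch.utils.data.Subset might look as follows (the `dataset` variable, number of splits, and batch sizes are placeholders):

```python
from sklearn.model_selection import KFold
from torch.utils.data import DataLoader, Subset

kfold = KFold(n_splits=5, shuffle=True, random_state=42)

# `dataset` is any map-style torch.utils.data.Dataset instance
for fold, (train_idx, val_idx) in enumerate(kfold.split(range(len(dataset)))):
    train_loader = DataLoader(Subset(dataset, train_idx), batch_size=32, shuffle=True)
    val_loader = DataLoader(Subset(dataset, val_idx), batch_size=32)
    # ... train on train_loader and evaluate on val_loader for this fold
```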

Description

Data processing is at the heart of every Machine Learning (ML) model training & evaluation loop, and PyTorch has revolutionised the way in which data is managed: very Pythonic Dataset and DataLoader classes replace (nested) lists of NumPy ndarrays.

However, data loading is merely the first step. Data preprocessing, sampling, batching, and partitioning are fundamental operations that are usually required in a complete ML pipeline.

If not properly managed, these operations can ultimately lead to lots of boilerplate code and to re-inventing the wheel ™. This tutorial digs into the internals of torch.utils.data to present patterns and best practices for loading heterogeneous and custom datasets in the most elegant and Pythonic way.
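For instance, partitioning and class re-balancing can often be expressed directly with torch.utils.data utilities instead of hand-rolled index bookkeeping. A hedged sketch (the `dataset` variable, split sizes, and the assumption of integer class labels are illustrative):

```python
import torch
from torch.utils.data import DataLoader, WeightedRandomSampler, random_split

# Partition a (hypothetical) 100-sample dataset into training and validation subsets
train_set, val_set = random_split(dataset, [80, 20],
                                  generator=torch.Generator().manual_seed(0))

# Re-balance classes at loading time: weight each sample by the
# inverse frequency of its (integer) class label
labels = torch.tensor([int(train_set[i][1]) for i in range(len(train_set))])
class_counts = torch.bincount(labels)
sample_weights = 1.0 / class_counts[labels].float()
sampler = WeightedRandomSampler(sample_weights, num_samples=len(train_set),
                                replacement=True)

train_loader = DataLoader(train_set, batch_size=16, sampler=sampler)
val_loader = DataLoader(val_set, batch_size=16)
```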

The tutorial is organised in a prelude and four main parts, each focusing on specific patterns and scenarios for ML data. The parts share the same internal structure: (I) a general introduction; (II) a case study.

The first section provides a technical introduction to the problem and a description of the torch internals. Case studies then deliver concrete examples and applications, engage the audience, and foster discussion. Off-the-shelf and/or custom heterogeneous datasets are used to cover the broadest possible range of audience interests (e.g. images, text, mixed-type datasets).

Pre-requisites

Basic concepts of Machine/Deep Learning data processing are required to attend this tutorial, along with proficiency with the Python language and the Python object model. Basic knowledge of PyTorch's main features is preferable.

Setting up the Python Environment

You can create the Python virtual environment needed to run all the notebooks in this repository either using conda (for the Anaconda Python distribution) or using pyenv and pip.

To setup the Anaconda environment:

$ conda env create -f torch_beautiful_data.yml

This will create a new virtual environment called torch-beautiful-data.

$ conda activate torch-beautiful-data

to activate the environment.

At this stage, you're all set and ready to start playing with the notebooks. Run a Jupyter notebook server on your local machine with the following command in your terminal:

$ jupyter notebook

Have fun! 🎉

Note: Alternatively, if you prefer installing the required packages using pip, just run the following command:

$ pip install -r requirements.txt

Acknowledgments

Public shout out to all PyData Global organisers, and to Matthijs in particular for his wonderful support during the preparation of this tutorial!