Improve documentation #28

Merged 1 commit on Dec 3, 2021
52 changes: 28 additions & 24 deletions README.md
@@ -1,41 +1,45 @@
# DataLoaders.jl

[Documentation (latest)](https://lorenzoh.github.io/DataLoaders.jl/dev)

A Julia package implementing performant data loading for deep learning on out-of-memory datasets. Works like PyTorch's `DataLoader`.

### What does it do?

- Uses multi-threading to load data in parallel while keeping the primary thread free for the training loop
- Handles batching and [collating](docs/collate.md)
- Is simple to [extend](docs/interface.md) for custom datasets
- Integrates well with other packages in the [ecosystem](docs/ecosystem.md)
- Allows for [inplace loading](docs/inplaceloading.md) to reduce memory load

### When should you use it?

- You have a dataset that does not fit into memory
- You want to reduce the time your training loop is waiting for the next batch of data

### How do you use it?

Install like any other Julia package using the package manager (see [setup](docs/setup.md)):

```julia-repl
]add DataLoaders
```

After installation, import it, create a `DataLoader` from a dataset and batch size, and iterate over it:

```julia
using DataLoaders
# 10,000 observations of inputs with 128 features and one target feature
data = (rand(128, 10000), rand(1, 10000))
dataloader = DataLoader(data, 16)

for (xs, ys) in dataloader
    @assert size(xs) == (128, 16)
    @assert size(ys) == (1, 16)
end
```

### Next, you may want to read

- [What datasets you can use it with](docs/datacontainers.md)
- [How it compares to PyTorch's data loader](docs/quickstartpytorch.md)
47 changes: 47 additions & 0 deletions docs/collate.md
@@ -0,0 +1,47 @@
# Collating

Collating refers to combining a batch of observations so that the arrays in individual observations are stacked together. As an example, consider a dataset with 100 observations, each with 2 features.

{cell=main result=false}
```julia
data = rand(2, 100) # observation dimension is last by default
```

We can collect a batch of 4 observations into a vector as follows:
{cell=main}
```julia
using DataLoaders
batch = [getobs(data, i) for i in 1:4]
```

Many machine learning models, however, expect an input in a _collated_ format: instead of a nested vector of vectors, we need a single ND-array. DataLoaders.jl provides the [`collate`](#) function for this:

{cell=main}
```julia
DataLoaders.collate(batch)
```

As you can see, the batch dimension is the last one by default.


## Nested observations

The above case only shows how to collate observations that each consist of a single array. In practice, however, observations will often consist of multiple variables like input features and a target variable. For example, we could have an integer indicating the class of an input sample.

{cell=main}
```julia
inputs = rand(2, 100)
targets = rand(1:10, 100)
data = (inputs, targets)
batch = [getobs(data, i) for i in 1:4]
```

Collating also works here: the tuple structure is kept and each element is collated separately:


{cell=main}
```julia
DataLoaders.collate(batch)
```

This is also implemented for `NamedTuple`s and `Dict`s. You can also collate nested structures, e.g. a `Tuple` of `Dict`s, and the structure is preserved. This also works when using [inplace loading](inplaceloading.md).
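
As an illustration (a minimal sketch, not taken from the package docs; the exact output types and shapes may differ), collating a batch of `NamedTuple` observations keeps the keys and collates each field separately:

```julia
using DataLoaders

# A batch of 4 hypothetical observations, each a NamedTuple of an input vector and a target
batch = [(x = rand(2), y = rand(1:10)) for _ in 1:4]

DataLoaders.collate(batch)
# Expected to be a NamedTuple with the batch dimension last, roughly
# (x = <2×4 array>, y = <4-element vector>)
```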
97 changes: 64 additions & 33 deletions docs/datacontainers.md
@@ -1,60 +1,91 @@
# Data containers

{.subtitle}
Introduction to data containers, giving an overview of the kinds of datasets you can use

DataLoaders.jl is built to integrate with the wider ecosystem and relies on a common interface for datasets. We call such a dataset a **data container**, and it needs to support the following operations:

- `getobs(data, i)` loads the `i`-th observation from a dataset
- `nobs(data)` gives the number of observations in a dataset

## Basic data containers

The simplest data container is a vector of values:

{cell=main}
```julia
using DataLoaders
@show v = rand(1:10, 10)
@show nobs(v)
getobs(v, 1)
```

Multi-dimensional arrays also work, with the last dimension treated as the observation dimension:

{cell=main}
```julia
a = rand(50, 50, 10)
summary(getobs(a, 1))
```

You can also group multiple data containers with the same length together by putting them into a `Tuple`:

{cell=main}
```julia
data = (v, a)
getobs(data, 1)
```

You can pass any data container to [`DataLoader`](#) to create an iterator over batches:

```julia
for batch in DataLoader(v, 2)
    @assert size(batch) == (2,)
end

for batch in DataLoader(a, 2)
    @assert size(batch) == (50, 50, 2)
end

for (vs, as) in DataLoader((v, a), 2)
    @assert size(vs) == (2,)
    @assert size(as) == (50, 50, 2)
end
```

## Out-of-memory data containers

Arrays, of course, are kept in memory, so we (1) cannot use them to store larger-than-memory datasets and (2) don't need multithreading, since loading an observation just means indexing an array, which is generally fast.

Image datasets are one way to quickly end up with data that is too large to fit into memory. So instead of loading every image of a dataset into an array, we'll implement a data container that stores only the file name of each image and loads the image itself only when `getobs` is called. To do that, we'll define a `struct` that stores a vector of file names and implement `getobs` and `nobs` for that type.

```julia
import DataLoaders.LearnBase: getobs, nobs
using Images

struct ImageDataset
    files::Vector{String}
end
# `join = true` makes `readdir` return full paths so the images can be loaded from any working directory
ImageDataset(folder::String) = ImageDataset(readdir(folder, join = true))

nobs(data::ImageDataset) = length(data.files)
getobs(data::ImageDataset, i::Int) = Images.load(data.files[i])
```
Now, if we have a folder full of images, we can create a data container and load them quickly into batches as follows:

```julia
data = ImageDataset("path/to/my/images")
for images in DataLoader(data, 16, collate = false)
    # Do something
end
```

!!! note "Preprocessing"

    Above we pass the `collate = false` argument because images may be of different sizes that cannot be collated. See [`collate`](#). In practice, it is common to apply some cropping and resizing to images so that they all have the same size.
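
For instance (a sketch, assuming the `imresize` function from Images.jl; the target size is arbitrary), resizing inside `getobs` makes all observations the same size, so the default collating can be used:

```julia
# Resize every image to 128×128 when it is loaded, so batches can be collated
getobs(data::ImageDataset, i::Int) = imresize(Images.load(data.files[i]), (128, 128))
```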


!!! warning "Threads"

    To use `DataLoaders`' multi-threading, you need to start Julia with multiple threads.
12 changes: 12 additions & 0 deletions docs/ecosystem.md
@@ -0,0 +1,12 @@
# Ecosystem

{.subtitle}
Overview of packages that DataLoaders.jl builds on or that use it.

This package is part of an ecosystem of packages providing useful tools for machine learning in Julia. These compose nicely due to shared interface packages like [LearnBase.jl](https://github.com/JuliaML/LearnBase.jl) and the natural extensibility that Julia's multiple dispatch provides. DataLoaders.jl works with any package that implements the [data container interface](interface.md). This means you can easily drop it into an existing workflow or use the functionality of other packages to work with DataLoaders.jl more effectively.

The most important package for manipulating data containers is [**MLDataPattern.jl**](https://github.com/JuliaML/MLDataPattern.jl) which provides a large set of tools for transforming and composing data containers. Some examples are given here: [Shuffling, subsetting, splitting](shuffling.md)

[**MLDatasets.jl**](https://github.com/JuliaML/MLDatasets.jl) makes it easy to load common benchmark datasets as data containers.

A package that makes heavy use of DataLoaders.jl to train large deep learning models is [**FastAI.jl**](https://github.com/FluxML/FastAI.jl). It also provides many easy-to-load data containers for larger computer vision, tabular, and NLP datasets.
Empty file added docs/howto/workingwith.md
Empty file.
8 changes: 8 additions & 0 deletions docs/inplaceloading.md
@@ -0,0 +1,8 @@
# Inplace loading

{.subtitle}
Background on inplace loading of data


When loading an observation of a [data container](datacontainers.md) requires allocating a lot of memory, it is sometimes possible to reuse a previous observation as a buffer to load into. To do so, the data container you're using must [implement `getobs!`](interface.md). To use buffered loading with this package, pass `buffered = true` to [`DataLoader`](#). This also works for collated batches.
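
For illustration, here is a minimal sketch (the container type and sizes are made up, not taken from the package docs) of a `getobs!` method that loads an observation into a preallocated buffer:

```julia
using DataLoaders
import DataLoaders.LearnBase: getobs, nobs, getobs!

# A hypothetical container; in practice this would wrap an out-of-memory data source
struct BigMatrixDataset
    data::Matrix{Float64}
end

nobs(d::BigMatrixDataset) = size(d.data, 2)
getobs(d::BigMatrixDataset, i::Int) = d.data[:, i]
# Buffered version: copy the observation into `buf` instead of allocating a new array
getobs!(buf, d::BigMatrixDataset, i::Int) = copyto!(buf, view(d.data, :, i))

data = BigMatrixDataset(rand(128, 10_000))
for xs in DataLoader(data, 16; buffered = true)
    # `xs` is a collated batch whose memory is reused between iterations
end
```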

24 changes: 24 additions & 0 deletions docs/interface.md
@@ -0,0 +1,24 @@
# Data container interface

{.subtitle}
Reference for implementing the data container interface. See [data containers](datacontainers.md) for an introduction.

To implement the data container interface for a custom type `T`, you must implement two functions:

- `LearnBase.getobs(data::T, i::Int)` loads the `i`-th observation
- `LearnBase.nobs(data::T)::Int` gives the number of observations in a data container

You can _optionally_ also implement:

- `LearnBase.getobs!(buf, data::T, i::Int)`: loads the `i`-th observation into the preallocated buffer `buf`.


See [the MLDataPattern.jl documentation](https://mldatapatternjl.readthedocs.io/en/latest/documentation/container.html) for a comprehensive discussion of and reference for data containers.

!!! note "Extending functions"

    To define a method for the above functions, you need to import the functions explicitly. You can do this without installing `LearnBase` by running:

    ```julia
    import DataLoaders.LearnBase: getobs, nobs, getobs!
    ```
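
As an example, a minimal data container over text files could look like this (a sketch; the type name and file format are purely illustrative):

```julia
import DataLoaders.LearnBase: getobs, nobs

# A data container that lazily reads one text file per observation
struct TextFiles
    paths::Vector{String}
end

nobs(data::TextFiles) = length(data.paths)
getobs(data::TextFiles, i::Int) = read(data.paths[i], String)
```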
17 changes: 0 additions & 17 deletions docs/motivation.md

This file was deleted.

47 changes: 16 additions & 31 deletions docs/quickstartpytorch.md
@@ -1,39 +1,24 @@
# Comparison to PyTorch

This package is inspired by PyTorch's [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) and works a lot like it. The basic usage for both is `DataLoader(dataset, batchsize)`, but for other use cases there are some differences.

The most important things are:

- DataLoaders.jl supports only map-style datasets at the moment
- It uses thread-based parallelism instead of process-based parallelism

## Detailed comparison

Let's go through every argument to `torch.utils.data.DataLoader` and have a look at similarities and differences. See [`DataLoader`](#) for a full list of its arguments. A rough translation of a typical call is sketched after the list below.

- `dataset`: This package currently only supports map-style datasets, which work similarly to PyTorch's, but instead of implementing `__getitem__` and `__len__`, you'd implement [`LearnBase.getobs`](#) and [`nobs`](#). [More info here](datacontainers.md).
- `batch_size = 1`: If not specified otherwise, the default batch size is 1 for both packages. In DataLoaders.jl, you can additionally pass in `nothing` to turn off batching.
- `shuffle = false`: This package's `DataLoader` does **not** support this argument. Shuffling should be applied to the dataset beforehand. See [working with data containers](howto/workingwith.md).
- `collate_fn`: DataLoaders.jl collates batches by default unless `collate = false` is passed. A custom collate function is not supported, but you can extend [`DataLoaders.collate`](#) for custom data types for the same effect.
- `drop_last = False`: DataLoaders.jl also returns a partial last batch by default, but the corresponding keyword argument is `partial`, with `partial = !drop_last`.
- `prefetch_factor`: This cannot be customized currently. The default behavior for DataLoaders.jl is for every thread to preload one batch.
- `pin_memory`: DataLoaders.jl does not interact with the GPU, but you can do this in your data container.
- `num_workers`, `persistent_workers`, `worker_init_fn`, `timeout`: Unlike PyTorch, this package uses multithreading rather than multiprocessing (which is impractical in Python due to the GIL), so these arguments do not apply. Currently, DataLoaders.jl uses either all threads except the primary one (the default, `useprimary = false`) or all threads (`useprimary = true`).
- `sampler`, `batch_sampler`, `generator`: This package does not currently support these arguments for customizing the randomness.
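
To make the mapping concrete, here is a rough sketch of how a typical PyTorch call might translate (assuming MLDataPattern.jl is available; this is an illustration, not a drop-in recipe):

```julia
# PyTorch, for reference:
#   DataLoader(dataset, batch_size=16, shuffle=True, num_workers=8, drop_last=True)

using DataLoaders
using MLDataPattern: shuffleobs

data = (rand(128, 10000), rand(1, 10000))

# Shuffling happens on the data container itself; the number of threads is set
# when starting Julia (e.g. `julia -t 8`) rather than per loader.
dataloader = DataLoader(shuffleobs(data), 16; partial = false)  # partial = !drop_last
```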

24 changes: 24 additions & 0 deletions docs/setup.md
@@ -0,0 +1,24 @@
# Installation

### Julia

DataLoaders.jl is a package for the [Julia Programming Language](https://julialang.org/). To use the package you need to install Julia, which you can download [here](https://julialang.org/downloads/).

### DataLoaders.jl

Julia has a built-in package manager which is used to install packages. Running the installed `julia` command launches an interactive session. To install DataLoaders.jl, run the following command:

```julia-repl
using Pkg; Pkg.add("DataLoaders")
```

### Enabling multi-threading

To make use of multi-threaded data loading, you need to start Julia with multiple threads. If starting the `julia` executable yourself, you can pass a `-t <nthreads>` argument or set the environment variable `JULIA_NUM_THREADS` beforehand. To check that you have multiple threads available to you, run:

```julia-repl
julia> Threads.nthreads()
12
```

If you're running Julia in a Jupyter notebook, see [IJulia.jl's documentation](https://julialang.github.io/IJulia.jl/dev/manual/installation/#Installing-additional-Julia-kernels).
2 changes: 1 addition & 1 deletion docs/shuffling.md
@@ -1,5 +1,5 @@

# Shuffling, subsetting, splitting

Shuffling your training data every epoch and splitting a dataset into training and validation splits are common practices.
While `DataLoaders` itself only provides tools to load your data effectively, using the underlying `MLDataPattern` package makes these things easy.
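
For example, a quick sketch (see MLDataPattern.jl's documentation for the full API; the sizes here are arbitrary) of shuffling a data container and splitting it into training and validation subsets before creating loaders:

```julia
using DataLoaders
using MLDataPattern: shuffleobs, splitobs

data = (rand(128, 1000), rand(1, 1000))

# Shuffle the observations, then split them into training and validation subsets
train, val = splitobs(shuffleobs(data); at = 0.8)

trainloader = DataLoader(train, 16)
valloader = DataLoader(val, 16)
```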