Improve documentation #28

Merged 1 commit on Dec 3, 2021
52 changes: 28 additions & 24 deletions README.md
@@ -1,41 +1,45 @@
# DataLoaders.jl

[Documentation (latest)](https://lorenzoh.github.io/DataLoaders.jl/dev)

A Julia package implementing performant data loading for deep learning on out-of-memory datasets. Works like PyTorch's `DataLoader`.

### What does it do?

- Uses multi-threading to load data in parallel while keeping the primary thread free for the training loop
- Handles batching and [collating](docs/collate.md)
- Is simple to [extend](docs/interface.md) for custom datasets
- Integrates well with other packages in the [ecosystem](docs/ecosystem.md)
- Allows for [inplace loading](docs/inplaceloading.md) to reduce memory load

### When should you use it?

- You have a dataset that does not fit into memory
- You want to reduce the time your training loop is waiting for the next batch of data

### How do you use it?

Install like any other Julia package using the package manager (see [setup](docs/setup.md)):

```julia-repl
]add DataLoaders
```

After installation, import it, create a `DataLoader` from a dataset and batch size, and iterate over it:

```julia
using DataLoaders
# 10,000 observations of inputs with 128 features and one target feature
data = (rand(128, 10000), rand(1, 10000))
dataloader = DataLoader(data, 16)

for (xs, ys) in dataloader
    @assert size(xs) == (128, 16)
    @assert size(ys) == (1, 16)
end
```

### Next, you may want to read

- [What datasets you can use it with](docs/datacontainers.md)
- [How it compares to PyTorch's data loader](docs/quickstartpytorch.md)
47 changes: 47 additions & 0 deletions docs/collate.md
@@ -0,0 +1,47 @@
# Collating

Collating refers to combining a batch of observations so that the arrays in individual observations are stacked together. As an example, consider a dataset with 100 observations, each with 2 features.

{cell=main result=false}
```julia
data = rand(2, 100) # observation dimension is last by default
```

We can collect a batch of 4 observations into a vector as follows:
{cell=main}
```julia
using DataLoaders
batch = [getobs(data, i) for i in 1:4]
```

Many machine learning models, however, expect an input in a _collated_ format: instead of a nested vector of vectors, we need a single ND-array. DataLoaders.jl provides the [`collate`](#) function for this:

{cell=main}
```julia
DataLoaders.collate(batch)
```

As you can see, the batch dimension is the last one by default.


## Nested observations

The above case only shows how to collate observations that each consist of a single array. In practice, however, observations will often consist of multiple variables like input features and a target variable. For example, we could have an integer indicating the class of an input sample.

{cell=main}
```julia
inputs = rand(2, 100)
targets = rand(1:10, 100)
data = (inputs, targets)
batch = [getobs(data, i) for i in 1:4]
```

Collating also works here: the tuple structure is kept and each element is collated separately:


{cell=main}
```julia
DataLoaders.collate(batch)
```

This is also implemented for `NamedTuple`s and `Dict`s. You can also collate nested structures, e.g. a `Tuple` of `Dict`s, and the structure is preserved. This also works when using [inplace loading](inplaceloading.md).
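
As an illustration (a minimal sketch, not taken from the package docs; the exact output types and shapes may differ), collating a batch of `NamedTuple` observations keeps the keys and collates each field separately:

```julia
using DataLoaders

# A batch of 4 hypothetical observations, each a NamedTuple of an input vector and a target
batch = [(x = rand(2), y = rand(1:10)) for _ in 1:4]

DataLoaders.collate(batch)
# Expected to be a NamedTuple with the batch dimension last, roughly
# (x = <2×4 array>, y = <4-element vector>)
```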
97 changes: 64 additions & 33 deletions docs/datacontainers.md
@@ -1,60 +1,91 @@
# Data containers

{.subtitle}
Introduction to data containers, giving an overview of the kinds of datasets you can use

DataLoaders.jl is built to integrate with the wider ecosystem and relies on a common interface for datasets. We call such a dataset a **data container**, and it needs to support the following operations:

- `getobs(data, i)` loads the `i`-th observation from a dataset
- `nobs(data)` gives the number of observations in a dataset

## Basic data containers

The simplest data container is a vector of values:

{cell=main}
```julia
using DataLoaders
@show v = rand(1:10, 10)
@show nobs(v)
getobs(v, 1)
```

Multi-dimensional arrays also work, with the last dimension treated as the observation dimension:

{cell=main}
```julia
a = rand(50, 50, 10)
summary(getobs(a, 1))
```

You can also group multiple data containers with the same length together by putting them into a `Tuple`:

{cell=main}
```julia
data = (v, a)
getobs(data, 1)
```

You can pass any data container to [`DataLoader`](#) to create an iterator over batches:

```julia
for batch in DataLoader(v, 2)
    @assert size(batch) == (2,)
end

for batch in DataLoader(a, 2)
    @assert size(batch) == (50, 50, 2)
end

for (vs, as) in DataLoader((v, a), 2)
    @assert size(vs) == (2,)
    @assert size(as) == (50, 50, 2)
end
```

## Out-of-memory data containers

Arrays, of course, are kept in memory, so we (1) cannot use them to store larger-than-memory datasets and (2) don't need multithreading, since loading an observation just means indexing an array, which is generally fast.

Image datasets are one way to quickly end up with data that is too large to fit into memory. So instead of loading every image of a dataset into an array, we'll implement a data container that stores only the file name of each image and loads the image itself only when `getobs` is called. To do that, we'll define a `struct` that stores a vector of file names and implement `getobs` and `nobs` for that type.

```julia
import DataLoaders.LearnBase: getobs, nobs
using Images

struct ImageDataset
    files::Vector{String}
end
# `join = true` makes `readdir` return full paths so the images can be loaded from any working directory
ImageDataset(folder::String) = ImageDataset(readdir(folder, join = true))

nobs(data::ImageDataset) = length(data.files)
getobs(data::ImageDataset, i::Int) = Images.load(data.files[i])
```
Now, if we have a folder full of images, we can create a data container and load them quickly into batches as follows:

```julia
data = ImageDataset("path/to/my/images")
for images in DataLoader(data, 16, collate = false)
    # Do something
end
```

!!! note "Preprocessing"

    Above we pass the `collate = false` argument because images may be of different sizes that cannot be collated. See [`collate`](#). In practice, it is common to apply some cropping and resizing to images so that they all have the same size.
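
For instance (a sketch, assuming the `imresize` function from Images.jl; the target size is arbitrary), resizing inside `getobs` makes all observations the same size, so the default collating can be used:

```julia
# Resize every image to 128×128 when it is loaded, so batches can be collated
getobs(data::ImageDataset, i::Int) = imresize(Images.load(data.files[i]), (128, 128))
```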


!!! warning "Threads"

    To use `DataLoaders`' multi-threading, you need to start Julia with multiple threads.
12 changes: 12 additions & 0 deletions docs/ecosystem.md
@@ -0,0 +1,12 @@
# Ecosystem

{.subtitle}
Overview of packages that DataLoaders.jl builds on or that use it.

This package is part of an ecosystem of packages providing useful tools for machine learning in Julia. These compose nicely due to shared interface packages like [LearnBase.jl](https://github.com/JuliaML/LearnBase.jl) and the natural extensibility that Julia's multiple dispatch provides. DataLoaders.jl works with any package that implements the [data container interface](interface.md). This means you can easily drop it into an existing workflow or use the functionality of other packages to work with DataLoaders.jl more effectively.

The most important package for manipulating data containers is [**MLDataPattern.jl**](https://github.com/JuliaML/MLDataPattern.jl) which provides a large set of tools for transforming and composing data containers. Some examples are given here: [Shuffling, subsetting, splitting](shuffling.md)

[**MLDatasets.jl**](https://github.com/JuliaML/MLDatasets.jl) makes it easy to load common benchmark datasets as data containers.

A package that makes heavy use of DataLoaders.jl to train large deep learning models is [**FastAI.jl**](https://github.com/FluxML/FastAI.jl). It also provides many easy-to-load data containers for larger computer vision, tabular, and NLP datasets.
Empty file added docs/howto/workingwith.md
Empty file.
8 changes: 8 additions & 0 deletions docs/inplaceloading.md
@@ -0,0 +1,8 @@
# Inplace loading

{.subtitle}
Background on inplace loading of data


When loading an observation of a [data container](datacontainers.md) requires allocating a lot of memory, it is sometimes possible to reuse a previous observation as a buffer to load into. To do so, the data container you're using must [implement `getobs!`](interface.md). To use buffered loading with this package, pass `buffered = true` to [`DataLoader`](#). This also works for collated batches.
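
For illustration, here is a minimal sketch (the container type and sizes are made up, not taken from the package docs) of a `getobs!` method that loads an observation into a preallocated buffer:

```julia
using DataLoaders
import DataLoaders.LearnBase: getobs, nobs, getobs!

# A hypothetical container; in practice this would wrap an out-of-memory data source
struct BigMatrixDataset
    data::Matrix{Float64}
end

nobs(d::BigMatrixDataset) = size(d.data, 2)
getobs(d::BigMatrixDataset, i::Int) = d.data[:, i]
# Buffered version: copy the observation into `buf` instead of allocating a new array
getobs!(buf, d::BigMatrixDataset, i::Int) = copyto!(buf, view(d.data, :, i))

data = BigMatrixDataset(rand(128, 10_000))
for xs in DataLoader(data, 16; buffered = true)
    # `xs` is a collated batch whose memory is reused between iterations
end
```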

24 changes: 24 additions & 0 deletions docs/interface.md
@@ -0,0 +1,24 @@
# Data container interface

{.subtitle}
Reference for implementing the data container interface. See [data containers](datacontainers.md) for an introduction.

To implement the data container interface for a custom type `T`, you must implement two functions:

- `LearnBase.getobs(data::T, i::Int)` loads the `i`-th observation
- `LearnBase.nobs(data::T)::Int` gives the number of observations in a data container

You can _optionally_ also implement:

- `LearnBase.getobs!(buf, data::T, i::Int)`: loads the `i`-th observation into the preallocated buffer `buf`.


See [the MLDataPattern.jl documentation](https://mldatapatternjl.readthedocs.io/en/latest/documentation/container.html) for a comprehensive discussion of and reference for data containers.

!!! note "Extending functions"

    To define a method for the above functions, you need to import the functions explicitly. You can do this without installing `LearnBase` by running:

    ```julia
    import DataLoaders.LearnBase: getobs, nobs, getobs!
    ```
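
As an example, a minimal data container over text files could look like this (a sketch; the type name and file format are purely illustrative):

```julia
import DataLoaders.LearnBase: getobs, nobs

# A data container that lazily reads one text file per observation
struct TextFiles
    paths::Vector{String}
end

nobs(data::TextFiles) = length(data.paths)
getobs(data::TextFiles, i::Int) = read(data.paths[i], String)
```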
17 changes: 0 additions & 17 deletions docs/motivation.md

This file was deleted.

47 changes: 16 additions & 31 deletions docs/quickstartpytorch.md
@@ -1,39 +1,24 @@
# Comparison to PyTorch

This package is inspired by PyTorch's [`torch.utils.data.DataLoader`](https://pytorch.org/docs/stable/data.html#torch.utils.data.DataLoader) and works a lot like it. The basic usage for both is `DataLoader(dataset, batchsize)`, but for other use cases there are some differences.

The most important things are:

- DataLoaders.jl supports only map-style datasets at the moment
- It uses thread-based parallelism instead of process-based parallelism

## Detailed comparison

Let's go through every argument to `torch.utils.data.DataLoader` and have a look at similarities and differences. See [`DataLoader`](#) for a full list of its arguments. A rough translation of a typical call is sketched after the list below.

- `dataset`: This package currently only supports map-style datasets, which work similarly to PyTorch's, but instead of implementing `__getitem__` and `__len__`, you'd implement [`LearnBase.getobs`](#) and [`nobs`](#). [More info here](datacontainers.md).
- `batch_size = 1`: If not specified otherwise, the default batch size is 1 for both packages. In DataLoaders.jl, you can additionally pass in `nothing` to turn off batching.
- `shuffle = false`: This package's `DataLoader` does **not** support this argument. Shuffling should be applied to the dataset beforehand. See [working with data containers](howto/workingwith.md).
- `collate_fn`: DataLoaders.jl collates batches by default unless `collate = false` is passed. A custom collate function is not supported, but you can extend [`DataLoaders.collate`](#) for custom data types for the same effect.
- `drop_last = False`: DataLoaders.jl also returns a partial last batch by default, but the corresponding keyword argument is `partial`, with `partial = !drop_last`.
- `prefetch_factor`: This cannot be customized currently. The default behavior for DataLoaders.jl is for every thread to preload one batch.
- `pin_memory`: DataLoaders.jl does not interact with the GPU, but you can do this in your data container.
- `num_workers`, `persistent_workers`, `worker_init_fn`, `timeout`: Unlike PyTorch, this package uses multithreading rather than multiprocessing (which is impractical in Python due to the GIL), so these arguments do not apply. Currently, DataLoaders.jl uses either all threads except the primary one (the default, `useprimary = false`) or all threads (`useprimary = true`).
- `sampler`, `batch_sampler`, `generator`: This package does not currently support these arguments for customizing the randomness.
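
To make the mapping concrete, here is a rough sketch of how a typical PyTorch call might translate (assuming MLDataPattern.jl is available; this is an illustration, not a drop-in recipe):

```julia
# PyTorch, for reference:
#   DataLoader(dataset, batch_size=16, shuffle=True, num_workers=8, drop_last=True)

using DataLoaders
using MLDataPattern: shuffleobs

data = (rand(128, 10000), rand(1, 10000))

# Shuffling happens on the data container itself; the number of threads is set
# when starting Julia (e.g. `julia -t 8`) rather than per loader.
dataloader = DataLoader(shuffleobs(data), 16; partial = false)  # partial = !drop_last
```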

24 changes: 24 additions & 0 deletions docs/setup.md
@@ -0,0 +1,24 @@
# Installation

### Julia

DataLoaders.jl is a package for the [Julia Programming Language](https://julialang.org/). To use the package you need to install Julia, which you can download [here](https://julialang.org/downloads/).

### DataLoaders.jl

Julia has a built-in package manager which is used to install packages. Running the installed `julia` command launches an interactive session. To install DataLoaders.jl, run the following command:

```julia-repl
using Pkg; Pkg.add("DataLoaders")
```

### Enabling multi-threading

To make use of multi-threaded data loading, you need to start Julia with multiple threads. If starting the `julia` executable yourself, you can pass a `-t <nthreads>` argument or set the environment variable `JULIA_NUM_THREADS` beforehand. To check that you have multiple threads available to you, run:

```julia-repl
julia> Threads.nthreads()
12
```

If you're running Julia in a Jupyter notebook, see [IJulia.jl's documentation](https://julialang.github.io/IJulia.jl/dev/manual/installation/#Installing-additional-Julia-kernels).
2 changes: 1 addition & 1 deletion docs/shuffling.md
@@ -1,5 +1,5 @@

# Shuffling, subsetting, splitting

Shuffling your training data every epoch and splitting a dataset into training and validation splits are common practices.
While `DataLoaders` itself only provides tools to load your data effectively, using the underlying `MLDataPattern` package makes these things easy.
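
For example, a quick sketch (see MLDataPattern.jl's documentation for the full API; the sizes here are arbitrary) of shuffling a data container and splitting it into training and validation subsets before creating loaders:

```julia
using DataLoaders
using MLDataPattern: shuffleobs, splitobs

data = (rand(128, 1000), rand(1, 1000))

# Shuffle the observations, then split them into training and validation subsets
train, val = splitobs(shuffleobs(data); at = 0.8)

trainloader = DataLoader(train, 16)
valloader = DataLoader(val, 16)
```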