
Dataset & DataLoader #118

Merged
merged 9 commits into from Dec 14, 2020

Conversation

@alexander-g (Contributor) commented Dec 4, 2020

Dataset and parallel DataLoader API similar to PyTorch. Can be used with Model.fit()

import numpy as np
import elegy

class MyDataset(elegy.data.Dataset):
    def __len__(self):
        return 128

    def __getitem__(self, i):
        # dummy data
        return np.random.random([224, 224, 3]), np.random.randint(10)

ds = MyDataset()
loader = elegy.data.DataLoader(ds, batch_size=8, n_workers=8, worker_type='thread', shuffle=True)

batch = next(iter(loader))
assert batch[0].shape == (8, 224, 224, 3)
assert batch[1].shape == (8,)
assert len(loader) == 16   # ceil(128 / 8) batches per epoch

model.fit(loader, epochs=10)  # model: an elegy.Model defined elsewhere

@codecov-io commented Dec 4, 2020

Codecov Report

Merging #118 (42f1330) into master (5549de5) will increase coverage by 0.94%.
The diff coverage is 96.92%.


@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
+ Coverage   77.51%   78.45%   +0.94%     
==========================================
  Files         106      108       +2     
  Lines        4856     5050     +194     
==========================================
+ Hits         3764     3962     +198     
+ Misses       1092     1088       -4     
Impacted Files Coverage Δ
elegy/data/dataset.py 94.89% <94.89%> (ø)
elegy/data/dataset_test.py 98.92% <98.92%> (ø)
elegy/data/__init__.py 100.00% <100.00%> (ø)
elegy/data/data_handler.py 73.33% <100.00%> (+1.71%) ⬆️
elegy/data/array_adapter.py 90.32% <0.00%> (+1.61%) ⬆️
elegy/callbacks/progbar_logger.py 77.95% <0.00%> (+1.61%) ⬆️
elegy/data/utils.py 72.91% <0.00%> (+5.20%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5549de5...42f1330.

@alexander-g (Contributor, Author)

Requesting review @cgarciae

@cgarciae (Collaborator) commented Dec 13, 2020

Hey @alexander-g, this is a very nice addition, thanks!

Some discussion:

  1. While we are already compatible with tf.data and PyTorch's DataLoader, since we accept generators of numpy arrays, the bonus of having this is that users don't need those additional dependencies, which can be quite heavy.
  2. Is this a direct port / copy of the PyTorch code (should we give them credit to respect the license), or is it a fresh rewrite?
  3. Maybe not relevant right away, but in case the implementation is not the same: does PyTorch do anything special to improve performance? Multiprocessing sadly has to serialize data when communicating between processes, but you can try to be clever by using things like memmap. This can be addressed later.
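The memmap idea in point 3 can be sketched standalone: the producer writes the samples to a file once, and workers pass only the path, shape, and an index across the process boundary instead of pickling array data. This is a minimal sketch, not elegy's or PyTorch's implementation; the file layout and the `load_sample` helper are hypothetical.

```python
import os
import tempfile
import numpy as np

# Producer side: materialize the dataset on disk once.
path = os.path.join(tempfile.mkdtemp(), "samples.dat")
shape = (16, 8, 8, 3)

data = np.random.random(shape).astype(np.float32)
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
mm[:] = data
mm.flush()

# Worker side: only (path, shape, i) would cross the process
# boundary -- each worker re-opens the file and reads one sample.
def load_sample(path, shape, i):
    view = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
    return np.array(view[i])  # copy just this one sample into RAM

sample = load_sample(path, shape, 5)
assert sample.shape == (8, 8, 3)
assert np.allclose(sample, data[5])
```

The point of the pattern is that serialization cost becomes independent of sample size: the pickled payload is just a filename and an index.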

@cgarciae (Collaborator)

I just tested this branch and created examples/mnist_dataloader.py to play around with the API; it works just as expected.
LGTM! @alexander-g want me to merge this now?

@alexander-g (Contributor, Author)

to 1: Yes, that was the idea behind it: to have a native solution without other heavy libraries. One more small advantage over plain generators is that you don't have to specify steps_per_epoch in .fit(); it's computed automatically by the DataLoader.
to 2: This is my own rewrite; only the API is similar.
to 3: I did take a look at the PyTorch source code. It's much more complicated than mine, so I guess there are many optimizations. I've set threads as the default worker type; with threads there should be no serialization between the workers. Threads are of course subject to the global interpreter lock, but so far I have not experienced a performance disadvantage. I guess that's because the GIL doesn't matter for IO, and most functions like PIL.Image.resize are implemented in C and release the GIL anyway.
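The thread-worker argument can be illustrated with a standalone sketch (not elegy's actual code; `ToyDataset` and `load_batch` are hypothetical stand-ins): each thread calls `__getitem__` and the results are stacked into a batch, with no cross-process serialization involved.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a dataset whose __getitem__ does IO or
# C-level work (e.g. PIL.Image.resize) that releases the GIL.
class ToyDataset:
    def __len__(self):
        return 32

    def __getitem__(self, i):
        return np.full((4, 4), i, dtype=np.float32)

def load_batch(ds, indices, n_workers=4):
    # Each thread calls __getitem__ directly -- the samples never
    # cross a process boundary, so nothing is pickled.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        samples = list(pool.map(ds.__getitem__, indices))
    return np.stack(samples)

batch = load_batch(ToyDataset(), range(8))
assert batch.shape == (8, 4, 4)
assert batch[3, 0, 0] == 3.0  # pool.map preserves index order
```

With process workers, every sample returned by `__getitem__` would have to be pickled and sent over a pipe; with threads the arrays are shared in-process.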

I think the example for MNIST is not very useful. This module is meant more for large datasets that don't fit into memory and have to be loaded on the fly. I will add something later.
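For that on-the-fly use case, `__getitem__` can load each sample from disk on demand, so only the current batch is ever resident in memory. A minimal sketch using the same `__len__`/`__getitem__` interface as elegy.data.Dataset (the `OnDiskDataset` class and the .npy file layout are my own illustration, not part of this PR):

```python
import glob
import os
import tempfile
import numpy as np

# Stand-in for a large directory of samples that would not fit in RAM.
root = tempfile.mkdtemp()
for i in range(10):
    np.save(os.path.join(root, f"{i:04d}.npy"), np.random.random((8, 8, 3)))

class OnDiskDataset:
    """Same __len__/__getitem__ interface as elegy.data.Dataset."""

    def __init__(self, root):
        self.files = sorted(glob.glob(os.path.join(root, "*.npy")))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        x = np.load(self.files[i])  # loaded from disk on demand
        y = i % 10                  # dummy label
        return x, y

ds = OnDiskDataset(root)
x, y = ds[3]
assert x.shape == (8, 8, 3)
assert len(ds) == 10
```

Wrapped in a DataLoader with several workers, the per-sample disk reads would overlap with training on the previous batch.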

Yes, I think this can be merged; it's already usable. I will add more functionality later.

@cgarciae (Collaborator)

  • Yeah, the MNIST example was just a quick test of the basic API. We could add data augmentation to make it a bit more useful, but it would probably make more sense to use a different dataset.
  • I think threads as default is good, the documentation about the worker types is clear.

I will go ahead and merge!

@cgarciae cgarciae merged commit 2165991 into poets-ai:master Dec 14, 2020
@alexander-g alexander-g deleted the dataset branch December 22, 2020 13:06