
Dataset & DataLoader #118

Merged
merged 9 commits into from Dec 14, 2020

Conversation

@alexander-g (Contributor) commented Dec 4, 2020

Dataset and parallel DataLoader API similar to PyTorch. Can be used with Model.fit()

import numpy as np
import elegy

class MyDataset(elegy.data.Dataset):
    def __len__(self):
        return 128

    def __getitem__(self, i):
        # dummy data
        return np.random.random([224, 224, 3]), np.random.randint(10)

ds = MyDataset()
loader = elegy.data.DataLoader(ds, batch_size=8, n_workers=8, worker_type='thread', shuffle=True)

batch = next(iter(loader))
assert batch[0].shape == (8, 224, 224, 3)
assert batch[1].shape == (8,)
assert len(loader) == 16   # ceil(128 / 8) batches per epoch

model.fit(loader, epochs=10)  # model: an elegy.Model defined elsewhere

@codecov-io commented Dec 4, 2020

Codecov Report

Merging #118 (42f1330) into master (5549de5) will increase coverage by 0.94%.
The diff coverage is 96.92%.


@@            Coverage Diff             @@
##           master     #118      +/-   ##
==========================================
+ Coverage   77.51%   78.45%   +0.94%     
==========================================
  Files         106      108       +2     
  Lines        4856     5050     +194     
==========================================
+ Hits         3764     3962     +198     
+ Misses       1092     1088       -4     
Impacted Files Coverage Δ
elegy/data/dataset.py 94.89% <94.89%> (ø)
elegy/data/dataset_test.py 98.92% <98.92%> (ø)
elegy/data/__init__.py 100.00% <100.00%> (ø)
elegy/data/data_handler.py 73.33% <100.00%> (+1.71%) ⬆️
elegy/data/array_adapter.py 90.32% <0.00%> (+1.61%) ⬆️
elegy/callbacks/progbar_logger.py 77.95% <0.00%> (+1.61%) ⬆️
elegy/data/utils.py 72.91% <0.00%> (+5.20%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 5549de5...42f1330.

@alexander-g (Contributor, Author)

Requesting review @cgarciae

@cgarciae (Collaborator) commented Dec 13, 2020

Hey @alexander-g, this is a very nice addition, thanks!

Some discussion:

  1. While we are already compatible with tf.data and PyTorch's DataLoader, since we accept generators of numpy arrays, the bonus of having this is that users don't need those additional dependencies, which can be quite heavy.
  2. Is this a direct port / copy of the PyTorch code (should we give them credit to respect the license), or is it a fresh rewrite?
  3. Maybe not relevant right away, but in case the implementation is not the same: does PyTorch do anything special to improve performance? Multiprocessing sadly has to serialize data when communicating between processes, but you can try to be clever by using things like memmap. This can be addressed later.
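The memmap idea in point 3 can be sketched standalone: the producer writes the samples to a file once, and workers pass only the path, shape, and an index across the process boundary instead of pickling array data. This is a minimal sketch, not elegy's or PyTorch's implementation; the file layout and the `load_sample` helper are hypothetical.

```python
import os
import tempfile
import numpy as np

# Producer side: materialize the dataset on disk once.
path = os.path.join(tempfile.mkdtemp(), "samples.dat")
shape = (16, 8, 8, 3)

data = np.random.random(shape).astype(np.float32)
mm = np.memmap(path, dtype=np.float32, mode="w+", shape=shape)
mm[:] = data
mm.flush()

# Worker side: only (path, shape, i) would cross the process
# boundary -- each worker re-opens the file and reads one sample.
def load_sample(path, shape, i):
    view = np.memmap(path, dtype=np.float32, mode="r", shape=shape)
    return np.array(view[i])  # copy just this one sample into RAM

sample = load_sample(path, shape, 5)
assert sample.shape == (8, 8, 3)
assert np.allclose(sample, data[5])
```

The point of the pattern is that serialization cost becomes independent of sample size: the pickled payload is just a filename and an index.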

@cgarciae (Collaborator)

I just tested this branch and created examples/mnist_dataloader.py to play around with the API; it works just as expected.
LGTM! @alexander-g want me to merge this now?

@alexander-g (Contributor, Author)

to 1: Yes, that was the idea behind it: to have a native solution without other heavy libraries. One more small advantage over plain generators is that you don't have to specify steps_per_epoch in .fit(); it's computed automatically by the DataLoader.
to 2: This is my own rewrite; only the API is similar.
to 3: I did take a look at the PyTorch source code. It's much more complicated than mine, so I guess there are many optimizations. I've set threads as the default worker type; with threads there should be no serialization between the workers. Threads are of course subject to the global interpreter lock, but so far I have not experienced a performance disadvantage. I guess that's because the GIL doesn't matter for IO, and most functions like PIL.Image.resize are implemented in C and release the GIL anyway.
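The thread-worker argument can be illustrated with a standalone sketch (not elegy's actual code; `ToyDataset` and `load_batch` are hypothetical stand-ins): each thread calls `__getitem__` and the results are stacked into a batch, with no cross-process serialization involved.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a dataset whose __getitem__ does IO or
# C-level work (e.g. PIL.Image.resize) that releases the GIL.
class ToyDataset:
    def __len__(self):
        return 32

    def __getitem__(self, i):
        return np.full((4, 4), i, dtype=np.float32)

def load_batch(ds, indices, n_workers=4):
    # Each thread calls __getitem__ directly -- the samples never
    # cross a process boundary, so nothing is pickled.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        samples = list(pool.map(ds.__getitem__, indices))
    return np.stack(samples)

batch = load_batch(ToyDataset(), range(8))
assert batch.shape == (8, 4, 4)
assert batch[3, 0, 0] == 3.0  # pool.map preserves index order
```

With process workers, every sample returned by `__getitem__` would have to be pickled and sent over a pipe; with threads the arrays are shared in-process.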

I think the example for MNIST is not very useful. This module is meant more for large datasets that don't fit into memory and have to be loaded on the fly. I will add something later.
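For that on-the-fly use case, `__getitem__` can load each sample from disk on demand, so only the current batch is ever resident in memory. A minimal sketch using the same `__len__`/`__getitem__` interface as elegy.data.Dataset (the `OnDiskDataset` class and the .npy file layout are my own illustration, not part of this PR):

```python
import glob
import os
import tempfile
import numpy as np

# Stand-in for a large directory of samples that would not fit in RAM.
root = tempfile.mkdtemp()
for i in range(10):
    np.save(os.path.join(root, f"{i:04d}.npy"), np.random.random((8, 8, 3)))

class OnDiskDataset:
    """Same __len__/__getitem__ interface as elegy.data.Dataset."""

    def __init__(self, root):
        self.files = sorted(glob.glob(os.path.join(root, "*.npy")))

    def __len__(self):
        return len(self.files)

    def __getitem__(self, i):
        x = np.load(self.files[i])  # loaded from disk on demand
        y = i % 10                  # dummy label
        return x, y

ds = OnDiskDataset(root)
x, y = ds[3]
assert x.shape == (8, 8, 3)
assert len(ds) == 10
```

Wrapped in a DataLoader with several workers, the per-sample disk reads would overlap with training on the previous batch.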

Yes, I think this can be merged; it's already usable. I will add more functionality later.

@cgarciae (Collaborator)

  • Yeah, the MNIST example was just a quick test of the basic API. We could add data augmentation to make it a bit more useful, but it would probably make more sense to use a different dataset.
  • I think threads as default is good, the documentation about the worker types is clear.

I will go ahead and merge!

@cgarciae cgarciae merged commit 2165991 into poets-ai:master Dec 14, 2020
@alexander-g alexander-g deleted the dataset branch December 22, 2020 13:06