Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

changed DataLoader to BatchDataLoader #1879

Merged
merged 8 commits into from
May 7, 2020
Merged

changed DataLoader to BatchDataLoader #1879

merged 8 commits into from
May 7, 2020

Conversation

abditag2
Copy link
Collaborator

@abditag2 abditag2 commented Apr 15, 2020

Has to be merged after this PR is merged and released in Petastorm: uber/petastorm#540

@abditag2 abditag2 changed the title Petastorm changed DataLoader to BatchDataLoader Apr 15, 2020
@abditag2 abditag2 requested a review from tgaddair April 15, 2020 23:22
@tgaddair
Copy link
Collaborator

@abditag2 Petastorm v0.9.0 has been released. Is this ready to go?

Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
@@ -69,6 +69,9 @@ RUN if [[ ${MPI_KIND} != "None" ]]; then \
pip install mpi4py; \
fi

# Install PetaStorm
RUN pip install petastorm
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not necessary, we already install petastorm.

setup.py Outdated
@@ -1579,7 +1579,7 @@ def build_extensions(self):
keras_require_list = ['keras>=2.0.8,!=2.0.9,!=2.1.0,!=2.1.1']
pytorch_require_list = ['torch']
mxnet_require_list = ['mxnet>=1.4.1']
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm==0.8.2', 'pyarrow>=0.15.0', 'pyspark>=2.3.2'] # Petastorm 0.7.7 is not compatible with pyarrow<0.15.0
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm==0.9.0', 'pyarrow>=0.15.0', 'pyspark>=2.3.2'] # Petastorm 0.7.7 is not compatible with pyarrow<0.15.0
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's no need to pin to a specific version anymore (that was only necessary when there was a pyarrow incompatibility. You can just change this to: petastorm>=0.9.0.

Also, we can remove this comment about Petastorm 0.7.7

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removed the comment and petastorm version.

train_loader = DataLoader(train_reader,
batch_size=batch_size,
shuffling_queue_capacity=shuffle_buffer_size)
train_loader = BatchedDataLoader(train_reader,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is BatchedDataLoader strictly better than DataLoader in all cases? Are there scenarios where DataLoader will perform better than BatchedDataLoader?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my experiments it is strictly better, the only reason we still have DataLoader is that certain data types (string/decimals) are not supported by BatchedDataLoader.

Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
setup.py Outdated
@@ -1579,7 +1579,7 @@ def build_extensions(self):
keras_require_list = ['keras>=2.0.8,!=2.0.9,!=2.1.0,!=2.1.1']
pytorch_require_list = ['torch']
mxnet_require_list = ['mxnet>=1.4.1']
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm==0.8.2', 'pyarrow>=0.15.0', 'pyspark>=2.3.2'] # Petastorm 0.7.7 is not compatible with pyarrow<0.15.0
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm', 'pyarrow>=0.15.0', 'pyspark>=2.3.2']
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to specify petastorm>=0.9.0 (lower bound). Otherwise, the user could have an older version already installed, pip won't update it, and then BatchedDataLoader will fail to import.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fair point. Added.

Signed-off-by: fardin abdi <fardin@uber.com>
Copy link
Collaborator

@tgaddair tgaddair left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@tgaddair tgaddair merged commit e9b1e22 into master May 7, 2020
@tgaddair tgaddair deleted the petastorm branch May 7, 2020 02:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

3 participants