-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
changed DataLoader to BatchDataLoader #1879
Conversation
@abditag2 Petastorm v0.9.0 has been released. Is this ready to go? |
Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
Dockerfile.test.gpu
Outdated
@@ -69,6 +69,9 @@ RUN if [[ ${MPI_KIND} != "None" ]]; then \ | |||
pip install mpi4py; \ | |||
fi | |||
|
|||
# Install PetaStorm | |||
RUN pip install petastorm |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not necessary, we already install petastorm.
setup.py
Outdated
@@ -1579,7 +1579,7 @@ def build_extensions(self): | |||
keras_require_list = ['keras>=2.0.8,!=2.0.9,!=2.1.0,!=2.1.1'] | |||
pytorch_require_list = ['torch'] | |||
mxnet_require_list = ['mxnet>=1.4.1'] | |||
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm==0.8.2', 'pyarrow>=0.15.0', 'pyspark>=2.3.2'] # Petastorm 0.7.7 is not compatible with pyarrow<0.15.0 | |||
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm==0.9.0', 'pyarrow>=0.15.0', 'pyspark>=2.3.2'] # Petastorm 0.7.7 is not compatible with pyarrow<0.15.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no need to pin to a specific version anymore (that was only necessary when there was a pyarrow
incompatibility. You can just change this to: petastorm>=0.9.0
.
Also, we can remove this comment about Petastorm 0.7.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed the comment and petastorm version.
train_loader = DataLoader(train_reader, | ||
batch_size=batch_size, | ||
shuffling_queue_capacity=shuffle_buffer_size) | ||
train_loader = BatchedDataLoader(train_reader, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is BatchedDataLoader strictly better than DataLoader in all cases? Are there scenarios where DataLoader will perform better than BatchedDataLoader?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my experiments it is strictly better, the only reason we still have DataLoader is that certain data types (string/decimals) are not supported by BatchedDataLoader.
Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
Signed-off-by: fardin abdi <fardin@uber.com>
setup.py
Outdated
@@ -1579,7 +1579,7 @@ def build_extensions(self): | |||
keras_require_list = ['keras>=2.0.8,!=2.0.9,!=2.1.0,!=2.1.1'] | |||
pytorch_require_list = ['torch'] | |||
mxnet_require_list = ['mxnet>=1.4.1'] | |||
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm==0.8.2', 'pyarrow>=0.15.0', 'pyspark>=2.3.2'] # Petastorm 0.7.7 is not compatible with pyarrow<0.15.0 | |||
spark_require_list = ['h5py>=2.9', 'numpy', 'petastorm', 'pyarrow>=0.15.0', 'pyspark>=2.3.2'] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to specify petastorm>=0.9.0
(lower bound). Otherwise, the user could have an older version already installed, pip won't update it, and then BatchedDataLoader
will fail to import.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fair point. Added.
Signed-off-by: fardin abdi <fardin@uber.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Has to be merged after this PR is merged and released in Petastorm: uber/petastorm#540