Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add custom data loading (e.g. NVTabular) in KerasEstimator #3603

Merged
merged 24 commits into from Sep 1, 2022

Conversation

leewyang
Copy link
Contributor

Checklist before submitting

  • Did you read the contributor guide?
  • Did you update the docs?
  • Did you write any tests to validate this change?
  • Did you update the CHANGELOG, if this change affects users?

Description

Fixes #3602.

This introduces a DataModule base class for Spark integrations like KerasEstimator, along with a default PetastormDataModule and a new NVTabularDataModule that leverages NVTabular for GPU-accelerated data loading of tabular datasets. It is loosely based on the existing DataModule for PyTorch Lightning's TorchEstimator, which is unfortunately tied directly to the pl.LightningDataModule. Ideally, this base class could also be used for the plain PyTorch TorchEstimator as well.

Review process to land

  1. All tests and other checks must succeed.
  2. At least one member of the technical steering committee must review and approve.
  3. If any member of the technical steering committee requests changes, they must be addressed.

@leewyang
Copy link
Contributor Author

@tgaddair Here is the NVTabular PR with a higher-level API to abstract away the Readers and a reduced image size (18.1GB vs. 37GB) for Dockerfile.test.gpu.

@github-actions
Copy link

github-actions bot commented Jul 19, 2022

Unit Test Results

  1 049 files   -   38    1 049 suites   - 38   11h 12m 38s ⏱️ + 5m 44s
     813 tests +    2       755 ✔️ ±    0       58 💤 +    2  0 ±0 
20 592 runs   - 800  14 536 ✔️  - 506  6 056 💤  - 294  0 ±0 

Results for commit 6e53391. ± Comparison against base commit 001260a.

♻️ This comment has been updated with latest results.

@github-actions
Copy link

github-actions bot commented Jul 19, 2022

Unit Test Results (with flaky tests)

  1 271 files  +     76    1 271 suites  +76   12h 4m 6s ⏱️ + 18m 0s
     813 tests +       2       755 ✔️ ±    0       58 💤 +    2  0 ±0 
25 070 runs  +1 454  17 244 ✔️ +940  7 826 💤 +514  0 ±0 

Results for commit 6e53391. ± Comparison against base commit 001260a.

♻️ This comment has been updated with latest results.

@leewyang leewyang force-pushed the leewyang_nvt branch 2 times, most recently from 8139a5b to bffd528 Compare August 5, 2022 23:05
Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean with reduced image size (18.1GB vs. 37GB) for Dockerfile.test.gpu.? Does adding NVtabular inflate the image to 37GB? It is currently 15.3GB: https://github.com/horovod/horovod/runs/7841398530?check_suite_focus=true#step:231:16

I'd prefer to split this into to PRs, one for reducing the image size and one for the custom data loading.

Dockerfile.test.gpu Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
@leewyang
Copy link
Contributor Author

@EnricoMi Thanks for the review!

What do you mean with reduced image size (18.1GB vs. 37GB) for Dockerfile.test.gpu.?

This was just a reference to an earlier version of the dockerfile, which was 37GB due to the use of a different base image. By going back to the nvidia/cuda base image, this image was reduced down to 18.1GB. So, this was not a separate action to try to reduce the existing image, but just to produce a reasonable-size image for this PR.

horovod/spark/keras/remote.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
docs/spark.rst Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Show resolved Hide resolved
horovod/spark/keras/util.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
Dockerfile.test.gpu Outdated Show resolved Hide resolved
horovod/spark/keras/estimator.py Outdated Show resolved Hide resolved
docs/spark.rst Outdated Show resolved Hide resolved
@EnricoMi EnricoMi changed the title add support for custom data loading (e.g. NVTabular) in KerasEstimator Add custom data loading (e.g. NVTabular) in KerasEstimator Aug 26, 2022
@chongxiaoc
Copy link
Collaborator

It looks this PR needs to be rebased after #3665 is merged.
make_dataset_fn() in keras is changed there.

horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
CHANGELOG.md Show resolved Hide resolved
docker/horovod-nvtabular/Dockerfile Show resolved Hide resolved
docs/spark.rst Outdated Show resolved Hide resolved
leewyang and others added 12 commits August 29, 2022 15:36
Signed-off-by: Lee Yang <leey@nvidia.com>
Signed-off-by: Lee Yang <leey@nvidia.com>
Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
leewyang and others added 9 commits August 29, 2022 15:36
Signed-off-by: Lee Yang <leewyang@gmail.com>

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
…ar & conda from test dockerfiles; remove cupy dependency

Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>

Co-authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Lee Yang <leewyang@gmail.com>
Signed-off-by: Lee Yang <leewyang@gmail.com>
docs/spark.rst Outdated Show resolved Hide resolved
docs/spark.rst Outdated Show resolved Hide resolved
Signed-off-by: Lee Yang <leewyang@gmail.com>
@leewyang leewyang requested review from chongxiaoc and EnricoMi and removed request for chongxiaoc and EnricoMi August 30, 2022 15:24
Copy link
Collaborator

@EnricoMi EnricoMi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Excellent work!

@EnricoMi
Copy link
Collaborator

@chongxiaoc are you happy with this?

@chongxiaoc
Copy link
Collaborator

@leewyang @EnricoMi I will take a look asap. Thanks.

Copy link
Collaborator

@chongxiaoc chongxiaoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just catch a few inconsistencies.

horovod/spark/common/params.py Outdated Show resolved Hide resolved
horovod/spark/common/params.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/util.py Show resolved Hide resolved
Signed-off-by: Lee Yang <leewyang@gmail.com>
Copy link
Collaborator

@chongxiaoc chongxiaoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

New round of checks. Just a few nits. For the NVTabular part, feel free to comment it since I'm not sure.

horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/common/params.py Outdated Show resolved Hide resolved
horovod/spark/keras/datamodule.py Show resolved Hide resolved
horovod/spark/keras/datamodule.py Outdated Show resolved Hide resolved
horovod/spark/keras/estimator.py Show resolved Hide resolved
horovod/spark/keras/remote.py Outdated Show resolved Hide resolved
horovod/spark/keras/remote.py Outdated Show resolved Hide resolved
horovod/spark/keras/remote.py Outdated Show resolved Hide resolved
horovod/spark/keras/remote.py Show resolved Hide resolved
Signed-off-by: Lee Yang <leewyang@gmail.com>
Copy link
Collaborator

@chongxiaoc chongxiaoc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work. Wait for CI completes and merge.

@chongxiaoc chongxiaoc merged commit a658d0f into horovod:master Sep 1, 2022
@leewyang leewyang deleted the leewyang_nvt branch September 1, 2022 15:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

Support custom data loaders in KerasEstimator
3 participants