Add custom data loading (e.g. NVTabular) in KerasEstimator #3603

leewyang · 2022-07-13T23:29:02Z

Checklist before submitting

Did you read the contributor guide?
Did you update the docs?
Did you write any tests to validate this change?
Did you update the CHANGELOG, if this change affects users?

Description

This introduces a DataModule base class for Spark integrations like KerasEstimator, along with a default PetastormDataModule and a new NVTabularDataModule that leverages NVTabular for GPU-accelerated data loading of tabular datasets. It is loosely based on the existing DataModule for PyTorch Lightning's TorchEstimator, which is unfortunately tied directly to pl.LightningDataModule. Ideally, this base class could also be used for the plain PyTorch TorchEstimator.
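A minimal sketch of the pattern described above, not the code from this PR: an abstract data module that a training function can drive without knowing how the data is read. All names here (the constructor parameters, `ListDataModule`, `run_epoch`) are hypothetical and chosen for illustration only; a Petastorm- or NVTabular-backed module would open and close its own readers where the toy class just flips a flag.

```python
from abc import ABC, abstractmethod
from typing import Iterator, List


class DataModule(ABC):
    """Hypothetical base class: owns reader setup/teardown and yields batches."""

    def __init__(self, train_dir: str, val_dir: str, batch_size: int = 32):
        self.train_dir = train_dir
        self.val_dir = val_dir
        self.batch_size = batch_size

    @abstractmethod
    def __enter__(self) -> "DataModule":
        """Open any underlying readers."""

    @abstractmethod
    def __exit__(self, exc_type, exc_value, traceback) -> None:
        """Close any underlying readers."""

    @abstractmethod
    def train_data(self) -> Iterator[List[int]]:
        """Return an iterator over training batches."""

    @abstractmethod
    def val_data(self) -> Iterator[List[int]]:
        """Return an iterator over validation batches."""


class ListDataModule(DataModule):
    """Toy implementation backed by in-memory lists instead of files."""

    def __init__(self, train_rows, val_rows, batch_size=2):
        super().__init__("<memory>", "<memory>", batch_size=batch_size)
        self._train_rows = list(train_rows)
        self._val_rows = list(val_rows)
        self.is_open = False

    def __enter__(self):
        self.is_open = True  # a real module would create its readers here
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        self.is_open = False  # a real module would release its readers here

    def _batches(self, rows):
        # Slice the rows into consecutive chunks of at most batch_size items.
        for start in range(0, len(rows), self.batch_size):
            yield rows[start:start + self.batch_size]

    def train_data(self):
        return self._batches(self._train_rows)

    def val_data(self):
        return self._batches(self._val_rows)


def run_epoch(data_module_cls, **kwargs):
    """Stand-in for a training function that only sees the DataModule interface."""
    with data_module_cls(**kwargs) as dm:
        return list(dm.train_data()), list(dm.val_data())
```

Because the training function receives only the class and its keyword arguments, swapping the data-loading backend would mean passing a different `DataModule` subclass rather than editing the training loop.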

Review process to land

All tests and other checks must succeed.
At least one member of the technical steering committee must review and approve.
If any member of the technical steering committee requests changes, they must be addressed.

leewyang · 2022-07-13T23:34:43Z

@tgaddair Here is the NVTabular PR with a higher-level API to abstract away the Readers and a reduced image size (18.1GB vs. 37GB) for Dockerfile.test.gpu.

github-actions · 2022-07-19T08:59:00Z

Unit Test Results

1 049 files (-38), 1 049 suites (-38), 11h 12m 38s ⏱️ (+5m 44s)
813 tests (+2): 755 ✔️ (±0), 58 💤 (+2), 0 ❌ (±0)
20 592 runs (-800): 14 536 ✔️ (-506), 6 056 💤 (-294), 0 ❌ (±0)

Results for commit 6e53391. ± Comparison against base commit 001260a.

♻️ This comment has been updated with latest results.

github-actions · 2022-07-19T08:59:28Z

Unit Test Results (with flaky tests)

1 271 files (+76), 1 271 suites (+76), 12h 4m 6s ⏱️ (+18m 0s)
813 tests (+2): 755 ✔️ (±0), 58 💤 (+2), 0 ❌ (±0)
25 070 runs (+1 454): 17 244 ✔️ (+940), 7 826 💤 (+514), 0 ❌ (±0)

Results for commit 6e53391. ± Comparison against base commit 001260a.

♻️ This comment has been updated with latest results.

EnricoMi

What do you mean by "reduced image size (18.1GB vs. 37GB) for Dockerfile.test.gpu"? Does adding NVTabular inflate the image to 37GB? It is currently 15.3GB: https://github.com/horovod/horovod/runs/7841398530?check_suite_focus=true#step:231:16

I'd prefer to split this into two PRs, one for reducing the image size and one for the custom data loading.


leewyang · 2022-08-15T20:44:11Z

@EnricoMi Thanks for the review!

What do you mean by "reduced image size (18.1GB vs. 37GB) for Dockerfile.test.gpu"?

This was just a reference to an earlier version of the Dockerfile, which was 37GB because it used a different base image. Going back to the nvidia/cuda base image reduced it to 18.1GB. So this was not a separate effort to shrink the existing image, just a way to produce a reasonably sized image for this PR.


chongxiaoc · 2022-08-26T20:19:25Z

It looks like this PR needs to be rebased after #3665 is merged.
make_dataset_fn() in keras is changed there.


EnricoMi

Excellent work!

EnricoMi · 2022-08-30T15:27:16Z

@chongxiaoc are you happy with this?

chongxiaoc · 2022-08-30T17:29:11Z

@leewyang @EnricoMi I will take a look asap. Thanks.

chongxiaoc

Just caught a few inconsistencies.


chongxiaoc

New round of checks. Just a few nits. For the NVTabular part, feel free to comment on it, since I'm not sure about it.


chongxiaoc

Great work. Wait for CI to complete and merge.

leewyang force-pushed the leewyang_nvt branch from e143d2a to e835225 Compare July 13, 2022 23:30

leewyang force-pushed the leewyang_nvt branch 2 times, most recently from 8139a5b to bffd528 Compare August 5, 2022 23:05

EnricoMi reviewed Aug 15, 2022


leewyang force-pushed the leewyang_nvt branch from 7e28c59 to b06d0a9 Compare August 15, 2022 21:09

EnricoMi reviewed Aug 16, 2022


EnricoMi requested changes Aug 19, 2022


EnricoMi reviewed Aug 26, 2022


EnricoMi changed the title from "add support for custom data loading (e.g. NVTabular) in KerasEstimator" to "Add custom data loading (e.g. NVTabular) in KerasEstimator" Aug 26, 2022

chongxiaoc requested changes Aug 26, 2022


chongxiaoc reviewed Aug 29, 2022


EnricoMi reviewed Aug 29, 2022


leewyang and others added 12 commits August 29, 2022 15:36

add support for custom data loading (e.g. NVTabular) in KerasEstimator

ba4038a

Signed-off-by: Lee Yang <leey@nvidia.com>

use parameter substring instead of new Dockerfile arg

b1c5734

Signed-off-by: Lee Yang <leey@nvidia.com>

move import inside datamodule

c62535d

Signed-off-by: Lee Yang <leey@nvidia.com>

Update Dockerfile.test.gpu

ed514e0

Co-authored-by: Enrico Minack <github@enrico.minack.dev> Signed-off-by: Lee Yang <leewyang@gmail.com>

address review comments

2c5ba07

Signed-off-by: Lee Yang <leewyang@gmail.com>

address some more comments

9381a30

Signed-off-by: Lee Yang <leewyang@gmail.com>

move register into DataModule class

18f83d1

Signed-off-by: Lee Yang <leewyang@gmail.com>

Apply suggestions from code review

862736d

Signed-off-by: Lee Yang <leewyang@gmail.com> Co-authored-by: Enrico Minack <github@enrico.minack.dev>

add/update license headers

c9e046b

Signed-off-by: Lee Yang <leewyang@gmail.com>

remove datamodule registry

4548c02

Signed-off-by: Lee Yang <leewyang@gmail.com>

broadcast seed from rank zero

8f245d7

Signed-off-by: Lee Yang <leewyang@gmail.com>

Update horovod/spark/keras/estimator.py

b0bb819

Signed-off-by: Lee Yang <leewyang@gmail.com> Co-authored-by: Enrico Minack <github@enrico.minack.dev>

leewyang and others added 9 commits August 29, 2022 15:36

Update horovod/spark/keras/util.py

53623b8

Signed-off-by: Lee Yang <leewyang@gmail.com> Co-authored-by: Enrico Minack <github@enrico.minack.dev>

remove extra whitespace

7ed82a6

Signed-off-by: Lee Yang <leewyang@gmail.com>

Update docs/spark.rst

65a1f44

Signed-off-by: Lee Yang <leewyang@gmail.com>

fix whitespace

1349567

Signed-off-by: Lee Yang <leewyang@gmail.com>

move NVTabular dockerfile to docker/horovod-nvtabular; remove nvtabular & conda from test dockerfiles; remove cupy dependency

68bd0e5

Signed-off-by: Lee Yang <leewyang@gmail.com>

update ci.yaml

07c0956

Signed-off-by: Lee Yang <leewyang@gmail.com>

Apply suggestions from code review

f614d94

Signed-off-by: Lee Yang <leewyang@gmail.com> Co-authored-by: Enrico Minack <github@enrico.minack.dev>

add link to docker/horovod-nvtabular/Dockerfile

11773d7

Signed-off-by: Lee Yang <leewyang@gmail.com>

rebase branch to latest master

2c4459e

Signed-off-by: Lee Yang <leewyang@gmail.com>

leewyang force-pushed the leewyang_nvt branch from 8288ada to 2c4459e Compare August 30, 2022 00:11

EnricoMi reviewed Aug 30, 2022


update docs/spark.rst

21fba94

Signed-off-by: Lee Yang <leewyang@gmail.com>

leewyang requested review from chongxiaoc and EnricoMi and removed request for chongxiaoc and EnricoMi August 30, 2022 15:24

EnricoMi approved these changes Aug 30, 2022


chongxiaoc requested changes Aug 30, 2022


move data_module param to KerasEstimator; fix PetaStormDataModule

e7bdeb0

Signed-off-by: Lee Yang <leewyang@gmail.com>

chongxiaoc requested changes Aug 31, 2022


remove unused args; move data_module getter/setter; fix shuffle arg

6e53391

Signed-off-by: Lee Yang <leewyang@gmail.com>

chongxiaoc approved these changes Aug 31, 2022


chongxiaoc merged commit a658d0f into horovod:master Sep 1, 2022

leewyang deleted the leewyang_nvt branch September 1, 2022 15:55

EnricoMi mentioned this pull request Sep 9, 2022

NVTabular Docker image does not build #3691

Closed

leewyang mentioned this pull request Dec 2, 2022

support custom data loaders in TorchEstimator #3787

Merged

4 tasks
