
Added support for Spark 3 TaskContext to fetch GPU resources per task #1584

Merged 7 commits on Jan 9, 2020

Conversation

tgaddair
Collaborator
@EnricoMi left a comment


nice wiring

test/test_spark.py — two review comments (outdated, resolved)
@@ -0,0 +1,567 @@
# Copyright 2017 onwards, fast.ai, Inc.
Collaborator

This 567-line script deviates from examples/keras_spark_rossmann.py in only 11 lines. The differences should be Spark 3 specific only, but I suspect the two files will diverge quickly as other changes land.

I suggest putting a GitHub Action in place that compares both scripts and flags unexpected deviations, as is already done for README.rst and docs/summary.rst. I am happy to create a PR for this once this is merged into master.
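A minimal sketch of such a check using Python's difflib; the file paths and the allowed-deviation threshold here are placeholder arguments, not what an actual workflow would hard-code:

```python
import difflib
from pathlib import Path


def count_diff_lines(text_a: str, text_b: str) -> int:
    """Count lines that differ between two scripts (added, removed, or changed)."""
    diff = difflib.unified_diff(text_a.splitlines(), text_b.splitlines(), lineterm="")
    # Count only real content changes, skipping the ---/+++ file headers.
    return sum(
        1
        for line in diff
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))
    )


def check_scripts(path_a: str, path_b: str, max_diff_lines: int) -> None:
    """Fail (for CI) if the two example scripts drift apart more than expected."""
    n = count_diff_lines(Path(path_a).read_text(), Path(path_b).read_text())
    if n > max_diff_lines:
        raise SystemExit(f"{path_a} and {path_b} differ in {n} lines (max {max_diff_lines})")
```

A workflow step would then call something like `check_scripts("examples/keras_spark_rossmann.py", "examples/keras_spark3_rossmann.py", 11)` and fail the build on unexpected drift.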

Collaborator Author

Good idea. At some point we may want to further consolidate things so that we don't have so many nearly identical scripts. Once Spark 3 is out, we may just replace the old examples or add special handling for older versions of Spark.
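That special handling for older Spark versions could hinge on a small version check; a minimal sketch, where the helper name is hypothetical and real code would pass in `pyspark.__version__`:

```python
def is_spark_3_or_newer(version: str) -> bool:
    """Return True for Spark version strings like '3.0.0' and newer."""
    major = int(version.split(".", 1)[0])
    return major >= 3


# Example: choose the code path based on the running Spark version.
# In real code the string would come from pyspark.__version__.
if is_spark_3_or_newer("3.0.0"):
    pass  # use the Spark 3 TaskContext to fetch the task's GPU resources
else:
    pass  # fall back to the pre-Spark-3 device assignment
```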

@abditag2
Copy link
Collaborator

After @EnricoMi's comments, LGTM.

@tgravescs
Copy link
Contributor

The changes look good for local-cluster and standalone deployments.

If you want to support YARN or k8s, the configs for GPU scheduling will be slightly different.

Note that your default, local-cluster[2,1,1024], uses 2 workers and thus relies on the host having at least 2 GPUs. If you run on a host with fewer than 2 GPUs, it will fail with an error saying there are not enough GPU addresses available.
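For standalone and local-cluster mode, the GPU scheduling settings look roughly like this (a spark-defaults.conf sketch; the discovery-script path is a placeholder):

```
# spark-defaults.conf sketch for standalone / local-cluster GPU scheduling
spark.task.resource.gpu.amount        1
spark.executor.resource.gpu.amount    1
spark.worker.resource.gpu.amount      1
# Placeholder path; the script must report the worker's GPU addresses
spark.worker.resource.gpu.discoveryScript  /path/to/getGpus.sh
```

Inside a task, the Spark 3 API `TaskContext.get().resources()["gpu"].addresses` then returns the address strings assigned to that task. On YARN and k8s, the `spark.worker.*` settings are replaced by the resource-manager-specific equivalents.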

Signed-off-by: Travis Addair <taddair@uber.com>
4 participants