
Added support for Spark 3 TaskContext to fetch GPU resources per task #1584

Merged 7 commits on Jan 9, 2020

Conversation

tgaddair
Collaborator
@EnricoMi left a comment


nice wiring

test/test_spark.py — two review comments (outdated, resolved)
@@ -0,0 +1,567 @@
# Copyright 2017 onwards, fast.ai, Inc.
Collaborator

This 567-line script deviates from examples/keras_spark_rossmann.py in only 11 lines. The differences should be Spark 3 specific only, but I suspect the two files will diverge quickly as other changes land.

I suggest putting a GitHub Action in place that compares both scripts and flags unexpected deviations, as is already done for README.rst and docs/summary.rst. I am happy to create a PR for this once this is merged into master.
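A minimal sketch of such a check using Python's difflib; the file paths and the allowed-deviation threshold here are placeholder arguments, not what an actual workflow would hard-code:

```python
import difflib
from pathlib import Path


def count_diff_lines(text_a: str, text_b: str) -> int:
    """Count lines that differ between two scripts (added, removed, or changed)."""
    diff = difflib.unified_diff(text_a.splitlines(), text_b.splitlines(), lineterm="")
    # Count only real content changes, skipping the ---/+++ file headers.
    return sum(
        1
        for line in diff
        if (line.startswith("+") or line.startswith("-"))
        and not line.startswith(("+++", "---"))
    )


def check_scripts(path_a: str, path_b: str, max_diff_lines: int) -> None:
    """Fail (for CI) if the two example scripts drift apart more than expected."""
    n = count_diff_lines(Path(path_a).read_text(), Path(path_b).read_text())
    if n > max_diff_lines:
        raise SystemExit(f"{path_a} and {path_b} differ in {n} lines (max {max_diff_lines})")
```

A workflow step would then call something like `check_scripts("examples/keras_spark_rossmann.py", "examples/keras_spark3_rossmann.py", 11)` and fail the build on unexpected drift.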

Collaborator Author

Good idea. At some point we may want to further consolidate things so that we don't have so many nearly identical scripts. Once Spark 3 is out, we may just replace the old examples or add special handling for older versions of Spark.
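That special handling for older Spark versions could hinge on a small version check; a minimal sketch, where the helper name is hypothetical and real code would pass in `pyspark.__version__`:

```python
def is_spark_3_or_newer(version: str) -> bool:
    """Return True for Spark version strings like '3.0.0' and newer."""
    major = int(version.split(".", 1)[0])
    return major >= 3


# Example: choose the code path based on the running Spark version.
# In real code the string would come from pyspark.__version__.
if is_spark_3_or_newer("3.0.0"):
    pass  # use the Spark 3 TaskContext to fetch the task's GPU resources
else:
    pass  # fall back to the pre-Spark-3 device assignment
```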

@abditag2
Copy link
Collaborator

After @EnricoMi's comments, LGTM.

@tgravescs
Copy link
Contributor

The changes look good for local-cluster and standalone deployments.

If you want to support YARN or k8s, the configs for GPU scheduling will be slightly different.

Note that your default, local-cluster[2,1,1024], uses 2 workers and thus relies on the host having at least 2 GPUs. If you run on a host with fewer than 2 GPUs, it will fail with an error saying there are not enough GPU addresses available.
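For standalone and local-cluster mode, the GPU scheduling settings look roughly like this (a spark-defaults.conf sketch; the discovery-script path is a placeholder):

```
# spark-defaults.conf sketch for standalone / local-cluster GPU scheduling
spark.task.resource.gpu.amount        1
spark.executor.resource.gpu.amount    1
spark.worker.resource.gpu.amount      1
# Placeholder path; the script must report the worker's GPU addresses
spark.worker.resource.gpu.discoveryScript  /path/to/getGpus.sh
```

Inside a task, the Spark 3 API `TaskContext.get().resources()["gpu"].addresses` then returns the address strings assigned to that task. On YARN and k8s, the `spark.worker.*` settings are replaced by the resource-manager-specific equivalents.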

Signed-off-by: Travis Addair <taddair@uber.com>
4 participants