Initial bare-metal implementation of elastic mode for fault tolerance and auto-scaling #1849

Merged
merged 116 commits into from May 15, 2020
Changes from 65 commits
Commits
1b87e1f
Initial commit of Elastic Horovod
tgaddair Apr 1, 2020
2ad2107
Fixed unit tests
tgaddair Apr 2, 2020
b1e559f
Removed elastic from run
tgaddair Apr 2, 2020
3b3f960
Fixed Buildkite tests
tgaddair Apr 2, 2020
c6cd306
Fixed blacklist check
tgaddair Apr 3, 2020
3d3f864
Added fault tolerance without scaling
tgaddair Apr 4, 2020
8bc8f73
Refactored registration
tgaddair Apr 4, 2020
6fa884d
Fixed unit tests
tgaddair Apr 4, 2020
e40f034
Added fault tolerance unit test
tgaddair Apr 6, 2020
e6a080c
Fixed file paths
tgaddair Apr 6, 2020
c4a39e7
Added modules
tgaddair Apr 6, 2020
0c12de9
Fixed imports
tgaddair Apr 6, 2020
4a4d128
More tests
tgaddair Apr 6, 2020
a0126ef
Reverting more tests
tgaddair Apr 6, 2020
eb6e0f5
Fixed slots
tgaddair Apr 6, 2020
a139280
Fixed keras state
tgaddair Apr 6, 2020
ac3963a
Removed requirement for host group
tgaddair Apr 6, 2020
cda7f8d
Fixed interactive tests
tgaddair Apr 6, 2020
077c33d
Fixed keras tests
tgaddair Apr 6, 2020
65e8e01
Fixed imports
tgaddair Apr 6, 2020
8269341
Only test elastic
tgaddair Apr 6, 2020
376b6ce
Fixed Spark tests
tgaddair Apr 6, 2020
489d880
Fixed broadcast_object args
tgaddair Apr 6, 2020
c7edff4
Fixed spark test
tgaddair Apr 6, 2020
0288423
Test remove test
tgaddair Apr 7, 2020
a4c80f3
TensorFlow 1.15
tgaddair Apr 7, 2020
37a3d05
TensorFlow CPU mode
tgaddair Apr 7, 2020
185d4b1
Fixed unit tests
tgaddair Apr 7, 2020
3017726
Back to 1.14
tgaddair Apr 7, 2020
75c7cdd
Renable all test
tgaddair Apr 7, 2020
f5ee955
Fixed unit tests
tgaddair Apr 7, 2020
a3446e4
Drop support for very old versions of frameworks
tgaddair Apr 7, 2020
b30838f
Merge branch 'master' into elastic
tgaddair Apr 7, 2020
990240e
Added ncclCommAbort checks to ensure safe clean-up of GPU memory
tgaddair Apr 7, 2020
6eab532
Fixed compilation
tgaddair Apr 7, 2020
6a418db
Adasum Error Check
tgaddair Apr 7, 2020
9307c66
Updated docs
tgaddair Apr 10, 2020
9465fda
Merge branch 'master' into elastic
EnricoMi Apr 12, 2020
2fd8719
Fixed wait_for_available_hosts
tgaddair Apr 13, 2020
4cc79c7
Updated Buildkite tests
tgaddair Apr 13, 2020
130c7b2
Merged master
tgaddair Apr 13, 2020
484d271
Addressed comments
tgaddair Apr 13, 2020
3c22e30
Added Keras example to doc
tgaddair Apr 14, 2020
c1dc521
Renamed sigterm_received -> signal_received
tgaddair Apr 14, 2020
526b290
Added emphasis
tgaddair Apr 14, 2020
2ac644e
Added more emphasis
tgaddair Apr 14, 2020
30f99f9
Fixed host assignment to spawn processes for all new slots
tgaddair Apr 14, 2020
9d4f7c3
Fixed pending slots
tgaddair Apr 14, 2020
26e9e81
Fixed behavior of discovery background thread to fail if the first up…
tgaddair Apr 14, 2020
bdfb993
Added comments explaining barrier reset
tgaddair Apr 14, 2020
8ee9fb9
Ensure that at least one previously active host is still assigned whe…
tgaddair Apr 14, 2020
c2a41e2
Skip notifying workers when host changes would not result in changes …
tgaddair Apr 14, 2020
7164aa3
Only manage notifications on coordinator
tgaddair Apr 14, 2020
f6f6c7b
Fixed notification tests
tgaddair Apr 15, 2020
5219909
Only check host message on rank 0 in integration tests
tgaddair Apr 15, 2020
98409fc
Renamed assigned_hosts -> ordered_available_hosts
tgaddair Apr 15, 2020
9458b0b
Directly compare host assignments with proposed next assignments
tgaddair Apr 15, 2020
338d42d
Added rank assignments and removed iteration over worker clients
tgaddair Apr 15, 2020
7789efd
Removed unused functions
tgaddair Apr 15, 2020
6558a49
Fixed setting rank_assignments
tgaddair Apr 15, 2020
c1e10f0
Merge branch 'master' into elastic
EnricoMi Apr 16, 2020
ae6ddcc
Fix previous merge master
EnricoMi Apr 16, 2020
1bd317b
Fixed flakiness in testing forward_stream by joining in all cases exc…
tgaddair Apr 16, 2020
d6ec90a
Try-except http requests
tgaddair Apr 16, 2020
4875450
Revert "Try-except http requests"
tgaddair Apr 16, 2020
9503b2b
Merge branch 'master' into elastic
EnricoMi Apr 23, 2020
582801e
Added logging
tgaddair Apr 24, 2020
61181b4
Experimental safe_shell_exec execute without fork
tgaddair Apr 24, 2020
2429563
Remove joing_streams, set stop signal for background threads when pro…
tgaddair Apr 24, 2020
c8a6c16
Close streams
tgaddair Apr 24, 2020
7a37356
Removed forking safe_shell_exec
tgaddair Apr 24, 2020
767ecbc
Remove Python 3 code
tgaddair Apr 24, 2020
4efa0e1
Fixed process termination in safe_shell_exec
tgaddair Apr 24, 2020
69ae18d
Updated barrier reset comments
tgaddair Apr 24, 2020
563b84e
Renamed slot_info -> coordinator_slot_info
tgaddair Apr 24, 2020
9eb6e4f
Added comment about stability
tgaddair Apr 24, 2020
cdf7ded
Fixed oneCCL config
tgaddair Apr 25, 2020
ecf164e
Restored middleman to safe_shell_exec
tgaddair Apr 26, 2020
000441a
Fix host updates check to avoid checking rank information explicitly
tgaddair Apr 29, 2020
c352d2d
Merged master
tgaddair Apr 30, 2020
88544d1
Gen pipeline fix
tgaddair Apr 30, 2020
e9b2133
Merge
tgaddair Apr 30, 2020
b65b749
Refactored host management to ensure that host updates do not conflic…
tgaddair May 1, 2020
46693a4
Removed redundant hosts variables and unified with host_slots
tgaddair May 1, 2020
8f2bcd3
Added additional checks for auto-scaling jobs to detect common interf…
tgaddair May 1, 2020
6717d54
Mock start hosts
tgaddair May 1, 2020
0a9c531
Fixed updating current hosts with latest blacklist information
tgaddair May 1, 2020
f5f884d
Fixed previous timestamp
tgaddair May 1, 2020
0b143cd
Test on CPU
tgaddair May 1, 2020
6b3f62e
Broadcast tests on CPU
tgaddair May 1, 2020
3ae4cb6
Added unit test coverage for size == 1 at start
tgaddair May 10, 2020
a0f0ec9
Fixed for TensorFlow
tgaddair May 11, 2020
8eee5fe
Update gradient average divisor on world reset
tgaddair May 11, 2020
5f204ea
Local size
tgaddair May 11, 2020
ab546b7
Added more robust exception handling to controller
tgaddair May 11, 2020
a8176ad
Fixed raw pointer accesses
tgaddair May 11, 2020
d8150c8
Added tests for killing process in addition to raising exceptions
tgaddair May 11, 2020
d842e2a
Add elastic_timeout to ElasticSettings
EnricoMi May 12, 2020
d628073
Fix earlier commit
EnricoMi May 12, 2020
267fbb2
Guard against more psutil.NoSuchProcess
EnricoMi May 12, 2020
b8c7f62
Added tests
tgaddair May 12, 2020
a81e681
Force shutdown when initial host discovery fails and added test
tgaddair May 12, 2020
8804eab
Added test for min hosts
tgaddair May 12, 2020
220a868
Changed min hosts check condition
tgaddair May 12, 2020
617091c
Added additional checks around Torch Horovod calls to raise HorovodIn…
tgaddair May 12, 2020
4506a1d
Renaming for consistency
tgaddair May 12, 2020
f4d7c0b
Do not call _get_host_assignments when we have insufficient slots
tgaddair May 13, 2020
959ef05
Removed caching
tgaddair May 13, 2020
1bcd240
Removed size variable updating, will do in separate PR in C++
tgaddair May 13, 2020
c0869b0
Merge remote-tracking branch 'upstream/master' into elastic
EnricoMi May 13, 2020
b894f9c
Make rsh handle interrupt events
EnricoMi Apr 26, 2020
090b8b5
Removed obsolete test image
EnricoMi May 13, 2020
d0fb6f4
Fix gloo test excludes for python 2
EnricoMi May 14, 2020
08f136e
Upgrade to TensorFlow 2.2, fix skip tests for TensorFlow < 1.15
tgaddair May 14, 2020
dbe703f
Updated Buildkite
tgaddair May 14, 2020
4fb7cb6
Fixed tests for TensorFlow 2.2
tgaddair May 14, 2020
62 changes: 41 additions & 21 deletions .buildkite/gen-pipeline.sh
@@ -8,22 +8,20 @@ repository=823773083436.dkr.ecr.us-east-1.amazonaws.com/buildkite

# list of all the tests
tests=( \
test-cpu-openmpi-py2_7-tf1_1_0-keras2_0_0-torch0_4_0-mxnet1_4_1-pyspark2_3_2 \
test-cpu-openmpi-py3_6-tf1_1_0-keras2_0_0-torch0_4_0-mxnet1_4_1-pyspark2_3_2 \
test-cpu-openmpi-py2_7-tf1_6_0-keras2_1_2-torch0_4_1-mxnet1_4_1-pyspark2_3_2 \
test-cpu-openmpi-py3_6-tf1_6_0-keras2_1_2-torch0_4_1-mxnet1_4_1-pyspark2_3_2 \
test-cpu-openmpi-py2_7-tf1_14_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-py3_6-tf1_14_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-gloo-py2_7-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-gloo-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-gloo-py2_7-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-gloo-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-py2_7-tf1_15_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-py3_6-tf1_15_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-gloo-py2_7-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-gloo-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-gloo-py2_7-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-gloo-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-cpu-openmpi-py2_7-tf2_0_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-cpu-openmpi-py3_6-tf2_0_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-cpu-openmpi-py3_6-tfhead-kerashead-torchhead-mxnethead-pyspark2_4_0 \
test-cpu-mpich-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-cpu-oneccl-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-cpu-oneccl-ofi-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-cpu-mpich-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-cpu-oneccl-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-cpu-oneccl-ofi-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0 \
test-gpu-openmpi-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-gpu-gloo-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
test-gpu-openmpi-gloo-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0 \
@@ -98,23 +96,28 @@ run_mpi_pytest() {
local oneccl_env=${3:-}
oneccl_env=$(echo ${oneccl_env//:/ })

local exclude_keras_if_needed=""
local exclude_keras=""
if [[ ${test} == *"tf2_"* ]] || [[ ${test} == *"tfhead"* ]]; then
# TODO: support for Keras + TF 2.0 and TF-Keras 2.0
exclude_keras_if_needed="| sed 's/test_keras.py//g' | sed 's/test_tensorflow_keras.py//g'"
exclude_keras="| sed 's/test_keras.py//g' | sed 's/test_tensorflow_keras.py//g'"
else
exclude_keras_if_needed="| sed 's/[a-z_]*tensorflow2[a-z_.]*//g'"
exclude_keras="| sed 's/[a-z_]*tensorflow2[a-z_.]*//g'"
fi

local exclude_interactiverun="| sed 's/test_interactiverun.py//g' | sed 's/test_spark_keras.py//g' | sed 's/test_spark_torch.py//g'"
local exclude_elastic=""
if [[ ${test} == *"py2_"* ]]; then
Review comment (Collaborator):
Are we focusing only on py3 for the elastic feature? That makes sense to me.

Reply (Collaborator Author):
Yes, the Barrier feature used in the driver does not exist in Python 2. Rather than finding a less elegant solution, I thought it would be better to just drop Python 2 support :).

exclude_elastic="| sed 's/test_elastic[a-z_.]*//g'"
fi

local excluded_tests="| sed 's/test_interactiverun.py//g' | sed 's/test_spark_keras.py//g' | sed 's/test_spark_torch.py//g'"

# Spark test does not need to be executed with horovodrun, but we still run it below.
local exclude_spark_test="| sed 's/test_spark.py//g'"

# pytests have 4x GPU use cases and require a separate queue
run_test "${test}" "${queue}" \
":pytest: Run PyTests (${test})" \
"bash -c \"${oneccl_env} cd /horovod/test && (echo test_*.py ${exclude_keras_if_needed} ${exclude_interactiverun} ${exclude_spark_test} | xargs -n 1 \\\$(cat /mpirun_command) pytest -v --capture=no) && pytest --forked -v --capture=no test_spark.py\""
"bash -c \"${oneccl_env} cd /horovod/test && (echo test_*.py ${exclude_keras} ${exclude_elastic} ${excluded_tests} ${exclude_spark_test} | xargs -n 1 \\\$(cat /mpirun_command) pytest -v --capture=no) && pytest --forked -v --capture=no test_spark.py\""
}
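The exclusion variables in `run_mpi_pytest` are `sed` filters spliced into the pipeline that lists `test_*.py`. A minimal sketch of that filtering, run directly rather than through `bash -c`, might look like the following. The function name `filter_tests`, the shortened image names, and the sample file names are hypothetical and not part of this PR; only the `sed` pattern is taken from the diff. The sketch uses `case` instead of `[[ ... ]]` so that it also runs under a plain POSIX shell.

```shell
# Hypothetical helper (not in the PR): print the test files that would be run
# for a given image name, one per line.
filter_tests() {
    test_name=$1
    shift
    case "${test_name}" in
        # Python 2 images: drop every file whose name starts with test_elastic
        # (the sed pattern is the one used for exclude_elastic in the diff).
        *py2_*) echo "$*" | sed 's/test_elastic[a-z_.]*//g' | xargs -n 1 echo ;;
        # All other images: keep the list unchanged.
        *)      echo "$*" | xargs -n 1 echo ;;
    esac
}

# Sample file names below are made up for illustration.
filter_tests test-cpu-gloo-py2_7 test_elastic_driver.py test_torch.py test_elastic_torch.py
# prints: test_torch.py

filter_tests test-cpu-gloo-py3_6 test_elastic_driver.py test_torch.py
# prints: test_elastic_driver.py and test_torch.py, on separate lines
```

Removing names from the list with `sed` leaves stray whitespace behind, which is harmless here because `xargs -n 1` splits on whitespace before invoking the command once per remaining file.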

run_mpi_integration() {
@@ -149,7 +152,7 @@ run_mpi_integration() {
fi

run_test "${test}" "${queue}" \
":python: Test PyTorch MNIST (${test})" \
":fire: Test PyTorch MNIST (${test})" \
tgaddair marked this conversation as resolved.
"bash -c \"${oneccl_env} \\\$(cat /mpirun_command) python /horovod/examples/pytorch_mnist.py\""

run_test "${test}" "${queue}" \
@@ -158,7 +161,7 @@ run_mpi_integration() {

# tests that should be executed only with the latest release since they don't test
# a framework-specific functionality
if [[ ${test} == *"tf1_14_0"* ]]; then
if [[ ${test} == *"tf1_15_0"* ]]; then
run_test "${test}" "${queue}" \
":muscle: Test Stall (${test})" \
"bash -c \"${oneccl_env} \\\$(cat /mpirun_command) python /horovod/test/test_stall.py\""
@@ -199,6 +202,11 @@ run_gloo_pytest() {
local test=$1
local queue=$2

local exclude_elastic=""
if [[ ${test} == *"py2_"* ]]; then
exclude_elastic="| sed 's/test_elastic[a-z_.]*//g'"
fi

# These tests are covered in MPI, and testing them in Gloo does not cover any new code paths
local excluded_tests="| sed 's/test_interactiverun.py//g' | sed 's/test_spark_keras.py//g' | sed 's/test_spark_torch.py//g' | sed 's/[a-z_]*tensorflow2[a-z_.]*//g'"

@@ -207,7 +215,7 @@ run_gloo_pytest() {

run_test "${test}" "${queue}" \
":pytest: Run PyTests (${test})" \
"bash -c \"cd /horovod/test && (echo test_*.py ${excluded_tests} ${exclude_spark_test} | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo pytest -v --capture=no) && pytest --forked -v --capture=no test_spark.py\""
"bash -c \"cd /horovod/test && (echo test_*.py ${exclude_elastic} ${excluded_tests} ${exclude_spark_test} | xargs -n 1 horovodrun -np 2 -H localhost:2 --gloo pytest -v --capture=no) && pytest --forked -v --capture=no test_spark.py\""
}

run_gloo_integration() {
@@ -219,12 +227,24 @@ run_gloo_integration() {
"horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/keras_mnist_advanced.py"

run_test "${test}" "${queue}" \
":python: Test PyTorch MNIST (${test})" \
":fire: Test PyTorch MNIST (${test})" \
"horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/pytorch_mnist.py"

run_test "${test}" "${queue}" \
":muscle: Test MXNet MNIST (${test})" \
"horovodrun -np 2 -H localhost:2 --gloo python /horovod/examples/mxnet_mnist.py"

# Elastic
if [[ ${test} == *"py3_"* ]]; then
local elastic_tensorflow="test_elastic_tensorflow.py"
if [[ ${test} == *"tf2_"* ]] || [[ ${test} == *"tfhead"* ]]; then
elastic_tensorflow="test_elastic_tensorflow2.py"
fi

run_test "${test}" "${queue}" \
":factory: Elastic Tests (${test})" \
"bash -c \"cd /horovod/test/integration && pytest -v --log-cli-level 10 --capture=no test_elastic_torch.py ${elastic_tensorflow}\""
fi
}
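The new "Elastic" step in `run_gloo_integration` runs only on Python 3 images and picks the TensorFlow test file from the image name. The selection logic can be sketched in isolation as below. `select_elastic_tests` and the shortened image names are hypothetical; the file names and the `py3_`, `tf2_`, and `tfhead` patterns come from the diff. As an assumption for portability, it uses `case` in place of the `[[ ... ]]` tests in the original script.

```shell
# Hypothetical helper (not in the PR): print the elastic integration test files
# that would be passed to pytest for a given image name, or nothing at all.
select_elastic_tests() {
    test_name=$1

    # Elastic tests are only scheduled for Python 3 images.
    case "${test_name}" in
        *py3_*) ;;
        *) return 0 ;;
    esac

    # TensorFlow 2.x and nightly ("tfhead") images use the TF2 variant.
    elastic_tensorflow="test_elastic_tensorflow.py"
    case "${test_name}" in
        *tf2_*|*tfhead*) elastic_tensorflow="test_elastic_tensorflow2.py" ;;
    esac

    echo "test_elastic_torch.py ${elastic_tensorflow}"
}

select_elastic_tests test-cpu-gloo-py3_6-tf1_15_0
# prints: test_elastic_torch.py test_elastic_tensorflow.py

select_elastic_tests test-cpu-gloo-py3_6-tf2_0_0
# prints: test_elastic_torch.py test_elastic_tensorflow2.py

select_elastic_tests test-cpu-gloo-py2_7-tf1_15_0
# prints nothing
```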

run_gloo() {
@@ -281,7 +301,7 @@ run_single_integration() {
fi

run_test "${test}" "${queue}" \
":python: Single PyTorch MNIST (${test})" \
":fire: Single PyTorch MNIST (${test})" \
"bash -c \"${oneccl_env} python /horovod/examples/pytorch_mnist.py --epochs 3\""

run_test "${test}" "${queue}" \
57 changes: 16 additions & 41 deletions docker-compose.test.yml
@@ -6,31 +6,6 @@ services:
dockerfile: Dockerfile.test.cpu
privileged: true
shm_size: 8gb
test-cpu-openmpi-py2_7-tf1_1_0-keras2_0_0-torch0_4_0-mxnet1_4_1-pyspark2_3_2:
extends: test-cpu-base
build:
args:
MPI_KIND: OpenMPI
PYTHON_VERSION: 2.7
TENSORFLOW_PACKAGE: tensorflow==1.1.0
KERAS_PACKAGE: keras==2.0.0
PYTORCH_PACKAGE: torch==0.4.0
TORCHVISION_PACKAGE: torchvision==0.2.2.post3
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.3.2
test-cpu-openmpi-py3_6-tf1_1_0-keras2_0_0-torch0_4_0-mxnet1_4_1-pyspark2_3_2:
extends: test-cpu-base
build:
args:
UBUNTU_VERSION: 18.04
MPI_KIND: OpenMPI
PYTHON_VERSION: 3.6
TENSORFLOW_PACKAGE: tensorflow==1.1.0
KERAS_PACKAGE: keras==2.0.0
PYTORCH_PACKAGE: torch==0.4.0
TORCHVISION_PACKAGE: torchvision==0.2.2.post3
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.3.2
test-cpu-openmpi-py2_7-tf1_6_0-keras2_1_2-torch0_4_1-mxnet1_4_1-pyspark2_3_2:
extends: test-cpu-base
build:
@@ -56,76 +31,76 @@ services:
TORCHVISION_PACKAGE: torchvision==0.2.2.post3
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.3.2
test-cpu-openmpi-py2_7-tf1_14_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0:
test-cpu-openmpi-py2_7-tf1_15_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0:
extends: test-cpu-base
build:
args:
MPI_KIND: OpenMPI
PYTHON_VERSION: 2.7
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.2.4
PYTORCH_PACKAGE: torch==1.2.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.4.0
test-cpu-openmpi-py3_6-tf1_14_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0:
test-cpu-openmpi-py3_6-tf1_15_0-keras2_2_4-torch1_2_0-mxnet1_4_1-pyspark2_4_0:
extends: test-cpu-base
build:
args:
UBUNTU_VERSION: 18.04
MPI_KIND: OpenMPI
PYTHON_VERSION: 3.6
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.2.4
PYTORCH_PACKAGE: torch==1.2.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.4.0
test-cpu-openmpi-gloo-py2_7-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
test-cpu-openmpi-gloo-py2_7-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
extends: test-cpu-base
build:
args:
MPI_KIND: OpenMPI
PYTHON_VERSION: 2.7
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.3.1
PYTORCH_PACKAGE: torch==1.3.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.4.0
test-cpu-openmpi-gloo-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
test-cpu-openmpi-gloo-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
extends: test-cpu-base
build:
args:
UBUNTU_VERSION: 18.04
MPI_KIND: OpenMPI
PYTHON_VERSION: 3.6
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.3.1
PYTORCH_PACKAGE: torch==1.3.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.4.0
test-cpu-gloo-py2_7-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
test-cpu-gloo-py2_7-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
extends: test-cpu-base
build:
args:
MPI_KIND: None
PYTHON_VERSION: 2.7
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.3.1
PYTORCH_PACKAGE: torch==1.3.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu
MXNET_PACKAGE: mxnet==1.4.1
PYSPARK_PACKAGE: pyspark==2.4.0
test-cpu-gloo-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
test-cpu-gloo-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_4_1-pyspark2_4_0:
extends: test-cpu-base
build:
args:
UBUNTU_VERSION: 18.04
MPI_KIND: None
PYTHON_VERSION: 3.6
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.3.1
PYTORCH_PACKAGE: torch==1.3.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu
@@ -169,27 +144,27 @@ services:
TORCHVISION_PACKAGE: torchvision==0.6.0.dev20200413
MXNET_PACKAGE: mxnet-nightly
PYSPARK_PACKAGE: pyspark==2.4.0
test-cpu-mpich-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0:
test-cpu-mpich-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0:
extends: test-cpu-base
build:
args:
UBUNTU_VERSION: 18.04
MPI_KIND: MPICH
PYTHON_VERSION: 3.6
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.3.1
PYTORCH_PACKAGE: torch==1.3.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu
MXNET_PACKAGE: mxnet==1.5.0
PYSPARK_PACKAGE: pyspark==2.4.0
test-cpu-oneccl-py3_6-tf1_14_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0:
test-cpu-oneccl-py3_6-tf1_15_0-keras2_3_1-torch1_3_0-mxnet1_5_0-pyspark2_4_0:
extends: test-cpu-base
build:
args:
UBUNTU_VERSION: 18.04
MPI_KIND: ONECCL
PYTHON_VERSION: 3.6
TENSORFLOW_PACKAGE: tensorflow==1.14.0
TENSORFLOW_PACKAGE: tensorflow-cpu==1.15.0
KERAS_PACKAGE: keras==2.3.1
PYTORCH_PACKAGE: torch==1.3.0+cpu
TORCHVISION_PACKAGE: torchvision==0.4.1+cpu