Update Dockerfile & tests to TF 1.14.0 + RDMA #1159

Merged
alsrgv merged 9 commits into master from update_docker on Jun 28, 2019

Conversation

3 participants
@alsrgv
Member

commented Jun 20, 2019

  • Upgrade Dockerfile to TensorFlow 1.14.0 & CUDA 10
  • Add Ubuntu 18.04 bundled RDMA drivers

@alsrgv alsrgv requested review from tgaddair and abditag2 Jun 20, 2019

@alsrgv alsrgv self-assigned this Jun 20, 2019

Update Dockerfile & tests to TF 1.14.0 + RDMA
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

@alsrgv alsrgv force-pushed the update_docker branch from 3b7d4d8 to 614d88f Jun 20, 2019

Fix CPU Ubuntu 18.04 builds
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

alsrgv added some commits Jun 20, 2019

Install python3.6-distutils
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Switch to master torchvision for nightly builds
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Install future/typing before torchvision@master
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Change default to CUDA 10 and fix MXNet packages
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Bugfix, switch to Python 3.6 for single-Python tests
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Remove redundant labels & environment variables
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Fix PyTorch CU9/CU10
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
@byronyi

Contributor

commented Jun 20, 2019

That's great. Congrats! By the way, did you find a workaround for the hwloc issue? Your PR doesn't ship with 1.14, which is quite a pity.

@alsrgv

Member Author

commented Jun 20, 2019

@byronyi, can you try to LD_PRELOAD=/path/to/system/libhwloc? That may help with symbol discovery.
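
As an illustration of the suggestion above, here is a minimal sketch of applying `LD_PRELOAD` when launching a process from Python. The helper names `env_with_preload` and `run_with_preload` are hypothetical and not part of Horovod or TensorFlow, and the library path is whatever the system's libhwloc shared object happens to be; the thread leaves it as a placeholder, so it is passed in as a parameter here.

```python
import os
import subprocess


def env_with_preload(lib_path, base_env=None):
    """Return a copy of the environment with lib_path prepended to LD_PRELOAD.

    Hypothetical helper for illustration only. lib_path is assumed to be the
    path to the system libhwloc shared object (left unspecified in the thread).
    """
    env = dict(os.environ if base_env is None else base_env)
    existing = env.get("LD_PRELOAD", "")
    # The dynamic linker accepts a colon-separated list and loads entries in
    # order, so the new library is placed first and existing entries are kept.
    env["LD_PRELOAD"] = f"{lib_path}:{existing}" if existing else lib_path
    return env


def run_with_preload(cmd, lib_path):
    """Run cmd (a list of arguments) with lib_path preloaded; raise on failure."""
    return subprocess.run(cmd, env=env_with_preload(lib_path), check=True)
```

Something like `run_with_preload(["python", "train.py"], lib_path)` would then start the script with the library preloaded, where `train.py` is a stand-in for the actual training entry point. Building a fresh environment dictionary rather than mutating `os.environ` keeps the preload scoped to the child process.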

@alsrgv

Member Author

commented Jun 20, 2019

I did not see any hwloc-related issues in the Dockerfiles or integration tests - maybe because they're using a more recent Open MPI 4.0.0.

@byronyi

Contributor

commented Jun 20, 2019

> @byronyi, can you try to LD_PRELOAD=/path/to/system/libhwloc? That may help with symbol discovery.

Yep, that works, although we cherry-picked your PR into our in-house build of TF :)

> I did not see any hwloc-related issues in the Dockerfiles or integration tests - maybe because they're using a more recent Open MPI 4.0.0.

I kind of expected that. Good for you!

@alsrgv alsrgv merged commit fda07a6 into master Jun 28, 2019

3 checks passed

DCO
License Compliance: All checks passed.
buildkite/horovod/pr: Build #564 passed (56 minutes, 1 second)

@alsrgv alsrgv deleted the update_docker branch Jun 28, 2019

alsrgv added a commit that referenced this pull request Jun 28, 2019

Update docs as per #1159
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

alsrgv added a commit that referenced this pull request Jun 28, 2019

Add RDMA instructions to Dockerfile (#1150)
* Add MOFED installation instructions to Dockerfile

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Update docs as per #1159

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

* Review feedback

Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>