Horovod auto-selection of g++ version #1199

Merged: 13 commits into master from gcc_doc, Jul 9, 2019

5 participants
@alsrgv
Member

commented Jul 9, 2019

g++ version selection has been a long-standing pain point. TensorFlow from PyPI requires g++ 4.8.5 or 4.9, PyTorch from PyPI requires g++ 4.9+, and Conda versions require g++ 5.4.0+.

This change lets Horovod use a different compiler per framework and gives users meaningful error messages when a given compiler/framework combination is known not to work.

Additionally, this change allows us to remove the cherry-picked g++-4.9 and use a combination of g++-7 and g++-4.8.5 instead.
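
As a rough illustration of per-framework selection, here is a minimal sketch in Python. It is not the code from this PR's setup.py: the `REQUIREMENTS` table, the key names, and `pick_compiler` are hypothetical, and the version ranges are only one reading of the requirements listed in the description (in particular, "4.8.5 or 4.9" is modelled as "at least 4.8.5 and below 5.0").

```python
# Hypothetical table; ranges are an interpretation of the PR description.
# Each entry is (inclusive lower bound, exclusive upper bound or None).
REQUIREMENTS = {
    "tensorflow-pypi": ((4, 8, 5), (5, 0, 0)),  # g++ 4.8.5 or 4.9
    "pytorch-pypi": ((4, 9, 0), None),          # g++ 4.9+
    "conda": ((5, 4, 0), None),                 # g++ 5.4.0+
}


def format_version(version):
    """Render a version tuple such as (4, 8, 5) as '4.8.5'."""
    return ".".join(str(part) for part in version)


def pick_compiler(available, framework):
    """Return the name of the highest compatible compiler.

    `available` maps a compiler name to its version tuple.
    Raises RuntimeError with a descriptive message if nothing fits.
    """
    low, high = REQUIREMENTS[framework]
    compatible = [
        (version, name)
        for name, version in available.items()
        if version >= low and (high is None or version < high)
    ]
    if not compatible:
        found = ", ".join(
            "%s (%s)" % (name, format_version(version))
            for name, version in sorted(available.items())
        ) or "none"
        wanted = ">= " + format_version(low)
        if high is not None:
            wanted += " and < " + format_version(high)
        raise RuntimeError(
            "No compatible g++ for %s: need %s, found %s"
            % (framework, wanted, found)
        )
    # Tuples compare element by element, so max() picks the newest version.
    return max(compatible)[1]
```

With `{"g++-4.8": (4, 8, 5), "g++-7": (7, 4, 0)}` installed, this sketch would pick `g++-4.8` for the TensorFlow PyPI entry and `g++-7` for the other two, which matches the g++-7 and g++-4.8.5 combination mentioned above.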

Fixes #1197
Fixes #1131

Horovod auto-selection of g++ version
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

@alsrgv alsrgv requested review from tgaddair, abditag2 and sblotner Jul 9, 2019

alsrgv added some commits Jul 9, 2019

Get rid of g++-4.9, use the highest compatible compiler version
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Bugfixes
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Patch compiler per plugin
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
@alsrgv

Member Author

commented Jul 9, 2019

cc @zsh-thu, this breaks Gloo since Gloo is compiled separately. Unfortunately, since each plugin can be compiled with its own compiler (and with different CXX ABI settings), we need to build Gloo together with each plugin. Can you make a PR with that change?
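
To make the ABI concern concrete: libstdc++ selects its ABI through the `_GLIBCXX_USE_CXX11_ABI` define, and objects built with different values generally cannot be linked together safely. Below is a minimal sketch, assuming each framework's compile flags are available as a list of strings (for TensorFlow, `tf.sysconfig.get_compile_flags()` returns such a list). The helper name and the sample flag lists are hypothetical and not taken from this PR.

```python
def cxx11_abi_from_flags(flags):
    """Return the integer value of -D_GLIBCXX_USE_CXX11_ABI=<n>, or None.

    `flags` is a list of compiler flags as strings.
    """
    prefix = "-D_GLIBCXX_USE_CXX11_ABI="
    for flag in flags:
        if flag.startswith(prefix):
            return int(flag[len(prefix):])
    return None  # the define is absent, so the compiler default applies


# Hypothetical flag lists for two plugins built against different frameworks.
plugin_a_flags = ["-I/path/to/include", "-D_GLIBCXX_USE_CXX11_ABI=0"]
plugin_b_flags = ["-I/path/to/include", "-D_GLIBCXX_USE_CXX11_ABI=1"]

# A shared library compiled once can match at most one of these settings,
# which is why it would need to be rebuilt alongside each plugin.
same_abi = (
    cxx11_abi_from_flags(plugin_a_flags)
    == cxx11_abi_from_flags(plugin_b_flags)
)
```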

Properly extract version from g++-4.8
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
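
For readers wondering why version extraction needs care: when the compiler is invoked as `g++-4.8`, the program name itself contains digits, so a naive "first number in the output" search can pick up `4.8` from the name instead of the real version. The sketch below shows one way to avoid that. It is an assumption about the kind of problem involved, not the fix made in this commit; `parse_gxx_version` is a hypothetical helper and the sample strings are illustrative.

```python
import re


def parse_gxx_version(version_output):
    """Parse the output of `g++ --version` into a tuple of three ints.

    Returns None when no version token is found.
    """
    lines = version_output.splitlines()
    if not lines:
        return None
    # Drop the parenthesised distribution string, e.g. "(Ubuntu 4.8.5-4ubuntu8)".
    without_parens = re.sub(r"\([^)]*\)", " ", lines[0])
    # Skip the first token: it is the program name and may contain digits.
    for token in without_parens.split()[1:]:
        match = re.fullmatch(r"(\d+)\.(\d+)(?:\.(\d+))?", token)
        if match:
            # A missing patch component is treated as 0.
            return tuple(int(part or 0) for part in match.groups())
    return None
```

The resulting tuples compare naturally, so `parse_gxx_version(a) >= (4, 9, 0)` can express a minimum-version check.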
@alsrgv

Member Author

commented Jul 9, 2019

cc @IgorWilbert for docs

@alsrgv alsrgv self-assigned this Jul 9, 2019

alsrgv added some commits Jul 9, 2019

Add a hack to remove offensive compiler options
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Bugfixes
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Python 2.7 support
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Silly bugfix
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>

@alsrgv alsrgv requested a review from tgaddair Jul 9, 2019

@alsrgv

Member Author

commented Jul 9, 2019

@tgaddair, there have been a lot of changes between the wishful-thinking initial version and the current working version. Could you review again?

alsrgv added some commits Jul 9, 2019

Fix indentation in gpus.rst
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
Mention starting version for tf.version
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
More comments
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
@sblotner
Collaborator

left a comment

Commented on a minor grammar issue (multiple instances), but otherwise this looks good, especially with the new spacing.

Resolved review comments (now outdated) on: README.rst, docs/gpus.rst, docs/index.rst (3), docs/summary.rst, setup.py
Review comments
Signed-off-by: Alex Sergeev <alsrgv@users.noreply.github.com>
@tgaddair
Collaborator

left a comment

LGTM!

@alsrgv alsrgv merged commit d902152 into master Jul 9, 2019

3 checks passed

DCO
License Compliance: All checks passed.
buildkite/horovod/pr: Build #690 passed (1 hour, 16 minutes, 15 seconds)

@alsrgv alsrgv deleted the gcc_doc branch Jul 9, 2019

@nskool


commented Jul 18, 2019

Hi, when will Horovod have a release with this fix, i.e., when is the next release planned? I'm encountering #1197 (segmentation fault) consistently on Ubuntu and Amazon Linux in conda environments. Will another release fix this?

@alsrgv

Member Author

commented Jul 19, 2019

@nskool, can you try installing Horovod from master to see if it resolves your problem?

$ HOROVOD_GPU_ALLREDUCE=NCCL pip install --no-cache-dir git+https://github.com/horovod/horovod