Add keepalive support to the TCP OOB component #477
Refer to this link for build results (access rights to CI server needed):
orte/mca/oob/tcp/oob_tcp_component.c
Why have mca_oob_tcp_component.tcp_proto and this call to getprotobyname() instead of just using the IPPROTO_TCP constant everywhere? Is there some portability issue being dealt with here, or some sort of IPoIB compatibility?
…se the right define
orte/mca/oob/tcp/oob_tcp_component.c
Nit: the => looks odd next to the <=; it might be better to replace it with --> or "means".
Great minds think alike on the proto lookup - I changed it literally as you sent your comment :-) Will fix the other comment.
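For readers following the proto-lookup exchange, here is a minimal sketch of the two approaches being compared. It assumes a POSIX system, and `tcp_proto_number` is a hypothetical helper written for illustration, not a function from the Open MPI tree. It performs the run-time lookup and falls back to the compile-time constant, which on typical systems gives the same value either way.

```c
#include <netdb.h>
#include <netinet/in.h>
#include <stddef.h>

/* Hypothetical helper for illustration; not part of the Open MPI source. */
static int tcp_proto_number(void)
{
    /* Run-time lookup in the protocols database (e.g. /etc/protocols). */
    struct protoent *pe = getprotobyname("tcp");

    /* The lookup can return NULL if the database is unavailable;
     * the IPPROTO_TCP constant is always available at compile time. */
    return (pe != NULL) ? pe->p_proto : IPPROTO_TCP;
}
```

Because the result should match IPPROTO_TCP on any system with a standard protocols database, passing the constant directly to setsockopt() avoids both the extra component field and the failure path.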
orte/mca/oob/tcp/oob_tcp_component.c
This is a pretty frequent keepalive, compared to what I think most people would expect. I'd back off to something at least in the 30-60 second range.
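To put the suggested range in context, a rough estimate of how long a dead peer can go unnoticed is the idle time plus the probe interval multiplied by the number of probes. The sketch below assumes Linux-style keepalive semantics; `keepalive_detect_secs` is a hypothetical helper, and the numbers in the note after it are illustrative, not the values used in this PR.

```c
/* Hypothetical helper for illustration; not part of the Open MPI source.
 * Returns the approximate number of seconds before an unresponsive peer
 * is declared dead, or -1 if keepalive is disabled (time_s <= 0). */
static long keepalive_detect_secs(int time_s, int intvl_s, int probes)
{
    if (time_s <= 0) {
        return -1; /* keepalive disabled */
    }
    /* Idle period before the first probe, then one interval per missed probe. */
    return (long)time_s + (long)intvl_s * (long)probes;
}
```

For example, an idle time of 30 seconds with 3 probes sent 5 seconds apart would give an estimate of 45 seconds.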
… keepalive options
@goodell I've captured all your comments - see what you think
I think this should be > instead of <.
argh, nevermind
👍
Add keepalive support to the TCP OOB component
Refs open-mpi#627. Fix support for multi-threads with CUDA 7.0
Add keepalive support for the TCP inter-node connections so we detect when the cluster loses a node. This adds three new MCA params:
oob_tcp_keepalive_time: Idle time in seconds before starting to send keepalives (a value <= 0 disables keepalive)
oob_tcp_keepalive_intvl: Time between keepalives, in seconds
oob_tcp_keepalive_probes: Number of keepalives that can be missed before declaring error
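
As a rough illustration of how three parameters like these typically map onto socket options, here is a minimal sketch. It assumes a Linux-like system where TCP_KEEPIDLE, TCP_KEEPINTVL, and TCP_KEEPCNT are available (other platforms use different names or lack them, hence the guards). `set_keepalive` is a hypothetical helper, not the code from this PR.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

/* Hypothetical helper for illustration; not the code from this PR.
 * Returns 0 on success, -1 if any setsockopt() call fails. */
static int set_keepalive(int sd, int time_s, int intvl_s, int probes)
{
    /* time_s <= 0 disables keepalive, mirroring oob_tcp_keepalive_time. */
    int on = (time_s > 0) ? 1 : 0;

    if (setsockopt(sd, SOL_SOCKET, SO_KEEPALIVE, &on, sizeof(on)) < 0) {
        return -1;
    }
    if (!on) {
        return 0;
    }
#if defined(TCP_KEEPIDLE)
    /* Idle seconds before the first probe (cf. oob_tcp_keepalive_time). */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPIDLE, &time_s, sizeof(time_s)) < 0) {
        return -1;
    }
#endif
#if defined(TCP_KEEPINTVL)
    /* Seconds between probes (cf. oob_tcp_keepalive_intvl). */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPINTVL, &intvl_s, sizeof(intvl_s)) < 0) {
        return -1;
    }
#endif
#if defined(TCP_KEEPCNT)
    /* Missed probes before the connection is dropped (cf. oob_tcp_keepalive_probes). */
    if (setsockopt(sd, IPPROTO_TCP, TCP_KEEPCNT, &probes, sizeof(probes)) < 0) {
        return -1;
    }
#endif
    return 0;
}
```

A caller would apply this to each connected socket, for example `set_keepalive(sd, 300, 20, 9)`; those values are placeholders chosen for the example, not the defaults from this change.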
@goodell could you take a gander? I've also included a minor change that silences a compiler warning. Just didn't feel like separating it out.