Skip to content

Conversation

@rhc54
Copy link
Contributor

@rhc54 rhc54 commented Mar 16, 2015

Add keepalive support for the TCP inter-node connections so we detect when the cluster loses a node. This adds three new MCA params:

oob_tcp_keepalive_time: Idle time in seconds before starting to send keepalives (num <= 0 => disable keepalive)

oob_tcp_keepalive_intvl: Time between keepalives, in seconds

oob_tcp_keepalive_probes: Number of keepalives that can be missed before declaring error

@goodell could you take a gander? I've got a minor compiler warning silence change in there as well. Just didn't feel like separating it out.

@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/351/
Test PASSed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why have mca_oob_tcp_component.tcp_proto and this call to getprotobyname() instead of just using the IPPROTO_TCP constant everywhere? Is there some portability issue being dealt with here, or some sort of IPoIB compatibility?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: the => looks odd next to the <=, might be better to replace with --> or means

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 16, 2015

Great minds think alike on the proto lookup - I changed it literally as you sent your comment :-)

Will fix the other comment

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty frequent keepalive, compared to what I think most people would expect. I'd back off to something at least in the 30-60 second range.

@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/352/
Test PASSed.

@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/353/
Test PASSed.

@rhc54
Copy link
Contributor Author

rhc54 commented Mar 16, 2015

@goodell I've captured all your comments - see what you think

@mellanox-github
Copy link

Refer to this link for build results (access rights to CI server needed):
http://bgate.mellanox.com/jenkins/job/gh-ompi-master-pr/354/
Test PASSed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this should be > instead of <.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

argh, nevermind

@goodell
Copy link
Member

goodell commented Mar 16, 2015

👍

rhc54 pushed a commit that referenced this pull request Mar 16, 2015
Add keepalive support to the TCP OOB component
@rhc54 rhc54 merged commit ee23b7f into open-mpi:master Mar 16, 2015
@rhc54 rhc54 deleted the topic/keepalive branch March 19, 2015 23:32
jsquyres added a commit to jsquyres/ompi that referenced this pull request Nov 10, 2015
Refs open-mpi#627. Fix support for multi-threads with CUDA 7.0
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants