Skip to content

Conversation

@bwbarrett
Copy link
Member

No description provided.

Move the declaration of mca_btl_vader_poll_handle_frag() from
btl_vader_fbox.h to btl_vader_frag.h.  The function doesn't
require any fbox declarations, and is more associated with
frags than fboxes.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Add framework for disabling fastbox support in the vader btl, to
better debug platforms where fastboxes appear to be failing.

Signed-off-by: Brian Barrett <bbarrett@amazon.com>
@bwbarrett
Copy link
Member Author

This doesn't disable on any particular platform, just gives us the ability to disable in general at configure time.

*/

#if !defined(MCA_BTL_VADER_FBOX_H)
#if !defined(MCA_BTL_VADER_FBOX_H) && OPAL_BTL_VADER_FBOX_SUPPORT
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A minor nit: OPAL_BTL_VADER_FBOX_SUPPORT won't be defined until opal_config.h was previously included. If btl_vader_fbox.h is the first file included, then that macro won't be defined.

I realize that our coding guidelines say "always include opal_config.h first!", but sometimes people don't/forget to do that.

Do you want to put an #include "opal_config.h" here at the top of btl_vader_fbox.h to cover all the bases / be defensive?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll get a compiler error if you include fbox.h before opal_config.h, because we error on undefined preprocessor define behavior. So I don't think it's worth being over defensive here.

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

Just want to point out that this is affecting more than i386 so I need to fix the bug. Since nothing has really changed in the fast boxes for a long time its probably a bug in the "fixes". I wouldn't waste time on this PR.

@bwbarrett
Copy link
Member Author

Just want to point out that this is affecting more than i386 so I need to fix the bug. Since nothing has really changed in the fast boxes for a long time its probably a bug in the "fixes". I wouldn't waste time on this PR.

Why? Would it not be helpful to have amckinstry run without fboxes to make sure the issue actually is in the fbox code? Would it not be helpful to have an out so we can ship 4.0, if it comes to that? Plus, unless I'm missing something, the hard work's already done, so unless there's a reason it is 1) wrong or 2) has some moral objection I'm not missing, it seems like committing and moving on is the right option.

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

Nope. This is the fast path for messages. Shipping without it working on x86_64 is a blocker.

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

FWIW I have recreated the hang on x86_64 in a docker container. Digging into it now.

@bwbarrett
Copy link
Member Author

Nope. This is the fast path for messages. Shipping without it working on x86_64 is a blocker.

Ok, so we go from "i don't have time to work on i386 broken" to "we can't ship without it". You can't have it both ways. So pick one, and be prepared to stick with it.

As an RM, I'd ship with slightly lower shared memory performance in a heartbeat, if it meant getting the correct answer. 100% of my customers are willing to make that tradeoff.

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

The fixes we put in broke x86_64. Either we back out the "fixes" and do this PR or I have to figure out what is going on. Once x86_64 broke I have to find time to work on this.

@hjelmn
Copy link
Member

hjelmn commented Oct 5, 2018

Think I identified the problem on x86_64. Its a really subtle bug in the fixes. testing now. So far in 100 iterations it hasn't hung/faulted.

@bwbarrett bwbarrett closed this Oct 8, 2018
@hppritcha
Copy link
Member

4.0 release managers are getting really nervous about all these last minute vader fixes.
We will want this patch or something similar to allow for an out so as to not have to release a 4.0.1 within days of the 4.0.0.

@hppritcha hppritcha reopened this Oct 8, 2018
@hjelmn
Copy link
Member

hjelmn commented Oct 8, 2018

@hppritcha I disagree. The code that was in the releases worked on 64-bit platforms but had issues on 32-bit (discovered recently). In an attempt to make it work for 32-bit we broke 64-bit. The recent changes fix 64-bit while still allowing it to work with 32-bit/

@gpaulsen
Copy link
Member

gpaulsen commented Oct 8, 2018

@hjelmn, When @hppritcha and I talked earlier today, we were discussing that it would be nice to have either an mca parameter or configure parameter to enable|disable the fastbox feature of vader (defaulting to enabled).
Could you take a look at this PR, and see if you think it's suitable for disabling fastbox if a user so chose?

@hjelmn
Copy link
Member

hjelmn commented Oct 8, 2018

FWIW its possible to disable fast boxes without this. Just need to set an MCA parameter.

@gpaulsen
Copy link
Member

gpaulsen commented Oct 8, 2018

Oh, that might be sufficient. What is that parameter? I can't seem to find it.

@hjelmn
Copy link
Member

hjelmn commented Oct 8, 2018

btl_vader_fbox_max. I completely forgot about it. It was meant to control the memory footprint of vader.

@bwbarrett
Copy link
Member Author

Re-closing this PR. While Nathan and I might disagree about how critical fbox support is to Open MPI as a whole, he's right that btl_vader_fbox_max is a better control than this PR for A/B testing with/without fbox support and as a workaround to users.

@bwbarrett bwbarrett closed this Oct 8, 2018
@bwbarrett bwbarrett deleted the feature/vader-fastbox-optional branch May 8, 2020 00:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants