-
Notifications
You must be signed in to change notification settings - Fork 936
Allow disabling vader fastbox support #5848
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow disabling vader fastbox support #5848
Conversation
Move the declaration of mca_btl_vader_poll_handle_frag() from btl_vader_fbox.h to btl_vader_frag.h. The function doesn't require any fbox declarations, and is more associated with frags than fboxes. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
Add framework for disabling fastbox support in the vader btl, to better debug platforms where fastboxes appear to be failing. Signed-off-by: Brian Barrett <bbarrett@amazon.com>
|
This doesn't disable on any particular platform, just gives us the ability to disable in general at configure time. |
| */ | ||
|
|
||
| #if !defined(MCA_BTL_VADER_FBOX_H) | ||
| #if !defined(MCA_BTL_VADER_FBOX_H) && OPAL_BTL_VADER_FBOX_SUPPORT |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A minor nit: OPAL_BTL_VADER_FBOX_SUPPORT won't be defined until opal_config.h was previously included. If btl_vader_fbox.h is the first file included, then that macro won't be defined.
I realize that our coding guidelines say "always include opal_config.h first!", but sometimes people don't/forget to do that.
Do you want to put an #include "opal_config.h" here at the top of btl_vader_fbox.h to cover all the bases / be defensive?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll get a compiler error if you include fbox.h before opal_config.h, because we error on undefined preprocessor define behavior. So I don't think it's worth being over defensive here.
|
Just want to point out that this is affecting more than i386 so I need to fix the bug. Since nothing has really changed in the fast boxes for a long time its probably a bug in the "fixes". I wouldn't waste time on this PR. |
Why? Would it not be helpful to have amckinstry run without fboxes to make sure the issue actually is in the fbox code? Would it not be helpful to have an out so we can ship 4.0, if it comes to that? Plus, unless I'm missing something, the hard work's already done, so unless there's a reason it is 1) wrong or 2) has some moral objection I'm not missing, it seems like committing and moving on is the right option. |
|
Nope. This is the fast path for messages. Shipping without it working on x86_64 is a blocker. |
|
FWIW I have recreated the hang on x86_64 in a docker container. Digging into it now. |
Ok, so we go from "i don't have time to work on i386 broken" to "we can't ship without it". You can't have it both ways. So pick one, and be prepared to stick with it. As an RM, I'd ship with slightly lower shared memory performance in a heartbeat, if it meant getting the correct answer. 100% of my customers are willing to make that tradeoff. |
|
The fixes we put in broke x86_64. Either we back out the "fixes" and do this PR or I have to figure out what is going on. Once x86_64 broke I have to find time to work on this. |
|
Think I identified the problem on x86_64. Its a really subtle bug in the fixes. testing now. So far in 100 iterations it hasn't hung/faulted. |
|
4.0 release managers are getting really nervous about all these last minute vader fixes. |
|
@hppritcha I disagree. The code that was in the releases worked on 64-bit platforms but had issues on 32-bit (discovered recently). In an attempt to make it work for 32-bit we broke 64-bit. The recent changes fix 64-bit while still allowing it to work with 32-bit/ |
|
@hjelmn, When @hppritcha and I talked earlier today, we were discussing that it would be nice to have either an mca parameter or configure parameter to enable|disable the fastbox feature of vader (defaulting to enabled). |
|
FWIW its possible to disable fast boxes without this. Just need to set an MCA parameter. |
|
Oh, that might be sufficient. What is that parameter? I can't seem to find it. |
|
btl_vader_fbox_max. I completely forgot about it. It was meant to control the memory footprint of vader. |
|
Re-closing this PR. While Nathan and I might disagree about how critical fbox support is to Open MPI as a whole, he's right that btl_vader_fbox_max is a better control than this PR for A/B testing with/without fbox support and as a workaround to users. |
No description provided.