- make the split collective shared file pointer operations work #756
edgargabriel
commented
Jul 28, 2015
- minor code restructuring in io/ompio required for that.
|
@hppritcha this is now the cleaned-up version of the split collective operations, not containing any other fixes. If you have a chance to review it, please go ahead. If you prefer to review the PR to v2.x, let me know and I can merge this commit and file the PR. |
should this be using opal_output instead?
|
I'm getting segfaults with filetest. Please don't merge yet. |
|
OK, I will also incorporate your comments about using opal_output instead of printf tomorrow morning. Are you testing on the Cray? I can try on the Stuttgart machine as well. |
|
I don't think this problem is specific to Cray, but yes, I'm using the NERSC systems Hopper and Edison. |
|
If I don't use lustre, then filetest passes. By the way, what's going on with |
|
I'm okay with this PR, but it appears that for Lustre things are pretty broken. Filetest fails at the very Edison is using Lustre 2.5.0 (at least for the clients). I observe the same behavior on both systems' lustre file systems. I'm pretty sure things were working for lustre several weeks ago when I tested one of the previous ompi i/o PRs. I'll try to do some bisecting to narrow down the problem when I have time. |
|
hm, ok, thanks! I will merge the branch and check on Lustre as well. One simple test that you could run is to see whether things execute correctly if you exclude the lustre fs component; it should still work correctly, e.g.
mpirun --mca fs ^lustre -np 6 ./filetest
Also note that I made modifications to the filetest test suite in the last couple of days; maybe I inadvertently introduced a bug there. |
|
well, I will first fix the printf to opal_output stuff tomorrow morning, then I will merge. |
|
Will have to check mca_fs_lustre_file_get_size, did not look into that in a looong time |
|
Can confirm that I am able to reproduce the problem that you see on the Stuttgart Cray/Lustre system, and am debugging it. I know at least part of the problem; it should, however, be unrelated to this PR. |
|
The problem definitely comes from the lustre fs component. If I exclude it (on the Lustre system) and use the regular ufs component instead, everything works like a charm. I will merge this PR, and debug the lustre fs component to see what is going on. Thanks @hppritcha! |
change -0bind-to and -bind-to to --bind-to in the manpages