@edgargabriel
Member

  • minor code restructuring in io/ompio required for that.

@edgargabriel
Member Author

@hppritcha this is now the cleaned-up version of the split collective operations, not containing any other fixes. If you have a chance to review it, please go ahead. If you prefer to review the PR to v2.x, let me know and I can merge this commit and file the PR.

Member

should this be using opal_output instead?

@hppritcha
Member

I'm getting segfaults with filetest. Please don't merge yet.

@edgargabriel
Member Author

ok, I will also incorporate your comments about using opal_output instead of printf tomorrow morning. Are you testing on the Cray? I can try on the Stuttgart machine as well.
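To illustrate the printf versus opal_output point, here is a minimal sketch of why diagnostics are usually routed through one central output routine rather than bare printf calls. Note that `demo_output` and `demo_verbosity` are hypothetical names made up for this illustration. They are not Open MPI's `opal_output` or its implementation, and only mimic the general "stream id plus printf-style format" shape.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical stand-in for a framework output routine. This is NOT
 * Open MPI's opal_output; it only mirrors the idea of a central call
 * that takes a stream id and a printf-style format. */
static int demo_verbosity = 1;  /* assumed knob: 0 silences all diagnostics */

/* Format a tagged diagnostic into buf.
 * Returns the number of characters stored in buf (excluding the
 * terminating NUL), or 0 if the message was suppressed. */
static int demo_output(char *buf, size_t len, int stream_id,
                       const char *fmt, ...)
{
    assert(buf != NULL && fmt != NULL && len > 0);

    buf[0] = '\0';
    if (demo_verbosity == 0) {
        return 0;               /* one switch silences every call site */
    }

    /* Every message gets the same prefix, so output can be filtered. */
    int used = snprintf(buf, len, "[stream %d] ", stream_id);
    if (used < 0 || (size_t)used >= len) {
        buf[0] = '\0';
        return 0;               /* buffer too small even for the prefix */
    }

    va_list ap;
    va_start(ap, fmt);
    int body = vsnprintf(buf + used, len - (size_t)used, fmt, ap);
    va_end(ap);
    if (body < 0) {
        buf[used] = '\0';
        return used;            /* formatting error: keep only the prefix */
    }

    /* vsnprintf reports the untruncated length, so clamp to what fits. */
    size_t total = (size_t)used + (size_t)body;
    return (int)(total < len ? total : len - 1);
}
```

With a bare printf, each call site would need its own prefix and its own on/off logic. With a central routine, a call such as `demo_output(buf, sizeof buf, 3, "close failed: %d", -1)` picks up both automatically.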

@hppritcha
Member

I don't think this problem is specific to Cray, but yes, I'm using the NERSC systems Hopper and Edison.

@hppritcha
Member

If I don't use lustre, then filetest passes. By the way, what's going on with mca_fs_lustre_file_get_size?

@hppritcha
Member

I'm okay with this PR, but it appears that for Lustre things are pretty broken. Filetest fails at the very first ompio_io_ompio_file_close owing to some mishandling of the f_converter at line 369. That's not the whole problem, though, because if I comment out that line, things still end up segfaulting later in the tests. The signature is what one would get with heap corruption.

Edison is using Lustre 2.5.0 (at least for the clients).
Hopper is using Lustre 2.4.1.

I observe the same behavior on both systems' lustre file systems.

I'm pretty sure things were working for lustre several weeks ago when I tested one of the previous ompi i/o PRs. I'll try to do some bisecting to narrow down the problem when I have time.

@edgargabriel
Member Author

hm, ok, thanks! I will merge the branch and check on lustre as well. One simple test that you could run is to see whether things execute correctly if you exclude the lustre fs component; it should still work correctly, e.g.

mpirun --mca fs ^lustre -np 6 ./filetest

Also note that I made modifications to the filetest testsuite in the last couple of days; maybe I inadvertently introduced a bug there.

@edgargabriel
Member Author

well, I will first fix the printf to opal_output stuff tomorrow morning, then I will merge.

@edgargabriel
Member Author

Will have to check mca_fs_lustre_file_get_size; I have not looked into that in a long time.

@edgargabriel
Member Author

I can confirm that I am able to reproduce the problem that you see on the Stuttgart Cray/Lustre, and I am debugging it. I know at least part of the problem; it should, however, be unrelated to this PR.

@edgargabriel
Member Author

The problem definitely comes from the lustre fs component. If I exclude it (on the lustre system) and use the regular ufs component instead, everything works like a charm. I will merge this PR and debug the lustre fs component to see what is going on. Thanks @hppritcha!

edgargabriel added a commit that referenced this pull request Jul 29, 2015
- make the split collective shared file pointer operations work
@edgargabriel edgargabriel merged commit a3327fe into open-mpi:master Jul 29, 2015
@edgargabriel edgargabriel deleted the pr/nb-sharedfp-splitcoll2 branch August 5, 2015 23:28
jsquyres pushed a commit to jsquyres/ompi that referenced this pull request Sep 19, 2016
change -0bind-to and -bind-to to --bind-to in the manpages