ch4/ucx: MPI_Win_allocate causing NWChem problems again #6110
It was working as of v3.4.1 but is broken in the 4.x branches currently tested by Debian: nwchemgit/nwchem#633 (comment).
nwchemgit/nwchem#633 (comment) mentions that this is a regression between v4.0.0 and v4.0.1.
The likely offending commit is "b1ada30eab 03/15 22:24 ch4/posix: workaround for inter-process mutex on FreeBSD" from #5894. EDIT: Hmm, that commit only affects FreeBSD and shouldn't affect the ucx build. I am at a loss.
@jeffhammond I suspect this is potentially an issue with |
It's not ARMCI-MPI. The code has not changed significantly in the past 8 years, and has been tested literally thousands of times with NWChem against every RMA implementation out there, usually in tortuous single-node circumstances. ARMCI-MPI works fine with many other versions of MPICH. The bug is in MPICH:Ch4:UCX or UCX.
ARMCI-MPI does not assume MWA allocates shared memory, because there is no standard-compliant way to do that. I proposed a solution to that for the MPI standard.
This, and the corresponding difference with

    if (ARMCII_GLOBAL_STATE.use_win_allocate == 0) {
        if (local_size == 0) {
            alloc_slices[alloc_me].base = NULL;
        } else {
            MPI_Alloc_mem(local_size, alloc_shm_info, &(alloc_slices[alloc_me].base));
            ARMCII_Assert(alloc_slices[alloc_me].base != NULL);
        }
        MPI_Win_create(alloc_slices[alloc_me].base, (MPI_Aint) local_size, 1, MPI_INFO_NULL, group->comm, &mreg->window);
    }
    else if (ARMCII_GLOBAL_STATE.use_win_allocate == 1) {
        /* give hint to CASPER to avoid extra work for lock permission */
        if (alloc_shm_info == MPI_INFO_NULL)
            MPI_Info_create(&alloc_shm_info);
        MPI_Info_set(alloc_shm_info, "epochs_used", "lockall");
        MPI_Win_allocate((MPI_Aint) local_size, 1, alloc_shm_info, group->comm, &(alloc_slices[alloc_me].base), &mreg->window);
        if (local_size == 0) {
            /* TODO: Is this necessary? Is it a good idea anymore? */
            alloc_slices[alloc_me].base = NULL;
        } else {
            ARMCII_Assert(alloc_slices[alloc_me].base != NULL);
        }
    }
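
Both branches of the snippet leave each process with the same local invariant: a zero-size slice records a NULL base, and a non-empty slice must have a non-NULL base. That shared rule can be sketched in plain C without any MPI calls. The helper name `slice_base_or_null` is hypothetical and is not part of ARMCI-MPI; this only illustrates the check, not how the library is structured.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not in ARMCI-MPI): given the pointer handed back by
 * the allocation path and the local slice size, return the base pointer
 * that the slice table should record. */
void *slice_base_or_null(void *base, size_t local_size)
{
    if (local_size == 0) {
        /* Empty slice: record NULL regardless of what the allocator returned. */
        return NULL;
    }
    /* Non-empty slice: the allocator must have provided real memory. */
    assert(base != NULL);
    return base;
}
```

With a check like this, the two allocation paths differ only in who owns the memory (the caller via `MPI_Alloc_mem`, or the MPI library via `MPI_Win_allocate`), which is where the regression discussed here would have to live.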
Just to confirm @jeffhammond, I believe #6140 fixed this issue?
That's what I've been told by Edo, who is authoritative on all NWChem issues.
👍
See nwchemgit/nwchem#633 (comment) for the initial report. This is very similar to what I saw that led to bad898f.
I don't know when I'll have time to bisect this. Can somebody look at RMA and see if there were any nontrivial changes to shared-memory atomics recently?