SCR currently allows an application to restart with a different number of ranks. However, one cannot call the SCR restart API in that case.

https://scr.readthedocs.io/en/latest/users/integration.html#restart-without-scr

This is awkward for applications that can otherwise use the SCR restart API when restarting with the same number of ranks, since they then need two code paths:

- if restarting with the same number of ranks --> use the SCR restart API
- if restarting with a different number of ranks --> do not use the SCR restart API

It would be nice to merge these. It should be possible when leaving the files on the parallel file system, but there are checks and logic in the fetch process that currently do not support it.

One known problem is in reading the rank2file map. This scatters the file list using kvtree, and the read currently requires exactly the same number of ranks as the run that wrote the file.

scr/src/scr_fetch.c, line 169 in 79ff7ed:

```c
if (kvtree_read_scatter(rank2file, filelist, scr_comm_world) != KVTREE_SUCCESS) {
```

We could work around that to distribute the file info to the ranks in the current run. Either we let kvtree decide how the info gets spread out, or we modify the kvtree API so that the calling ranks can specify the new mapping.

For the remainder of the function, we `stat` each file to verify that it exists. It would be nice to keep that, and it's easy to handle.

scr/src/scr_fetch.c, lines 251 to 258 in 79ff7ed:

```c
      /* either can't read this file or it doesn't exist */
      success = 0;
      break;
    }
  }
```

The trickier part is that we then fill in the local filemap data structure with info about each file that a rank "owns". It's not clear what to do when the number of ranks has changed. One option would be to have each rank register every file, as though all files are shared by all ranks. This does not scale well, but it is perhaps the safest option, since we don't know how the files will be accessed.

See scr/src/scr_fetch.c, lines 267 to 272 in 79ff7ed.
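As a rough illustration of the "let the library decide how the info gets spread out" option, here is a minimal sketch in plain C. It assumes a simple block partition of the old per-rank entries over the ranks of the current run, followed by the same kind of per-file existence check. The names `old_rank_range` and `all_files_readable` are hypothetical and are not part of SCR or kvtree; the real fix would live inside `kvtree_read_scatter` or a variant of it.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper (not part of SCR or kvtree): block-partition the
 * old_size per-rank entries of a rank2file map across new_size ranks.
 * On return, new_rank is responsible for the entries written by old ranks
 * [*start, *start + *count). *count may be 0 when new_size > old_size. */
static void old_rank_range(int old_size, int new_size, int new_rank,
                           int* start, int* count)
{
  assert(old_size > 0 && new_size > 0);
  assert(new_rank >= 0 && new_rank < new_size);

  int base  = old_size / new_size;  /* entries every new rank receives */
  int extra = old_size % new_size;  /* first 'extra' ranks get one more */

  *count = base + (new_rank < extra ? 1 : 0);
  *start = new_rank * base + (new_rank < extra ? new_rank : extra);
}

/* Hypothetical helper: return 1 if every path in files[0..n) is readable
 * by the calling process, 0 if any is missing or unreadable. */
static int all_files_readable(const char* const* files, size_t n)
{
  for (size_t i = 0; i < n; i++) {
    if (access(files[i], R_OK) != 0) {
      /* either can't read this file or it doesn't exist */
      return 0;
    }
  }
  return 1;
}
```

With this mapping, a run restarted on 4 ranks from a 10-rank checkpoint would have rank 0 take the entries of old ranks 0 to 2, and rank 3 take those of old ranks 8 and 9. The per-rank result of the readability check would still need to be combined across all ranks before deciding whether the fetch succeeded, and this sketch says nothing about how the filemap should be filled in.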