Issues/bugs I noticed while running large-scale real applications on Frontier.
I am using this issue for bookkeeping and will create a PR to fix them in the future.
The issues reported here occur only at large scales, e.g., 628-node FLASH-X runs.
System information
The issues are not system dependent.
Describe the problem you're observing
Most of these issues eventually lead to Mercury TIMEOUT errors, after which the I/O (e.g., HDF5) fails.
1:
This bug is inside the unifyfs_invoke_filesize_rpc() function: the RPC id should be filesize_id, not metaget_id.
Because of the wrong id, filesize RPC calls are never handled and wait forever.
We need to carefully check whether similar bugs exist elsewhere. Ideally, unit tests should cover all RPC routines.
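A minimal sketch of what such a unit test could look like. Everything here is a hypothetical stand-in: rpc_ids_t, mock_create_request(), invoke_filesize(), and invoke_metaget() are not the actual UnifyFS or Margo/Mercury symbols. The idea is only that a test double records which id each invoke routine requests, so a copy-paste mistake like using metaget_id in the filesize routine is caught at small scale.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for the server's table of registered RPC ids. */
typedef struct {
    uint64_t filesize_id;
    uint64_t metaget_id;
} rpc_ids_t;

/* Test double: records the id a routine asked to create a request for. */
static uint64_t last_request_id;

static void mock_create_request(uint64_t rpc_id)
{
    last_request_id = rpc_id;
}

/* Sketch of an invoke routine; the id must match the routine's own RPC. */
static void invoke_filesize(const rpc_ids_t* ids)
{
    mock_create_request(ids->filesize_id); /* not ids->metaget_id */
}

static void invoke_metaget(const rpc_ids_t* ids)
{
    mock_create_request(ids->metaget_id);
}

/* Returns 1 if every routine requested its own id, 0 otherwise. */
static int check_rpc_ids(void)
{
    /* Distinct, arbitrary ids so that a mix-up is detectable. */
    const rpc_ids_t ids = { .filesize_id = 11, .metaget_id = 22 };

    invoke_filesize(&ids);
    if (last_request_id != ids.filesize_id) {
        return 0;
    }

    invoke_metaget(&ids);
    if (last_request_id != ids.metaget_id) {
        return 0;
    }

    return 1;
}
```

In a real test the mock would replace the request-creation call used by the invoke routines, and one check per RPC routine would be added to check_rpc_ids().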
2:
During server initialization, server rank 0 acts as the coordinator and performs a tree-based broadcast.
The hard-coded 5-second timeout may not be enough for a large number of servers. I had to increase it slightly to avoid the timeout error in 628-node FLASH-X runs.
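One possible alternative to a fixed constant, sketched under assumptions: derive the timeout from the depth of the broadcast tree, so it grows with the number of servers. This is not the change described above (which only raised the constant). The function names, the tree arity, and the per-level allowance are all hypothetical and would need tuning against real runs.

```c
#include <assert.h>
#include <stdint.h>

/* Number of levels below the root needed for a k-ary tree to cover
 * n_servers nodes. Hypothetical helper, not a UnifyFS function. */
static int tree_depth(int n_servers, int arity)
{
    if (n_servers <= 1) {
        return 0;
    }
    if (arity < 2) {
        arity = 2; /* clamp so the loop below always terminates */
    }

    uint64_t covered = 1;     /* nodes covered so far (the root) */
    uint64_t level_width = 1; /* nodes on the current deepest level */
    int depth = 0;

    while (covered < (uint64_t)n_servers) {
        level_width *= (uint64_t)arity;
        covered += level_width;
        depth++;
    }
    return depth;
}

/* Timeout in milliseconds: a fixed base plus an allowance per tree level.
 * base_ms and per_level_ms are assumed tunables, not measured values. */
static int64_t bcast_timeout_ms(int n_servers, int arity,
                                int64_t base_ms, int64_t per_level_ms)
{
    return base_ms + (int64_t)tree_depth(n_servers, arity) * per_level_ms;
}
```

For example, with a binary tree, a 5000 ms base, and an assumed 1000 ms per level, 628 servers need 9 levels, so bcast_timeout_ms(628, 2, 5000, 1000) evaluates to 14000. A configuration option for the timeout would be a simpler route if the scaling behavior is not well understood.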
Code location for issue 1: UnifyFS/server/src/unifyfs_p2p_rpc.c, line 981 in 58ece44
Code location for issue 2: UnifyFS/server/src/unifyfs_group_rpc.c, line 967 in 58ece44