Excessive metadata array reads in Workflow.write_commands
#667
Labels: bug (Something isn't working), persistence (Related to persistent workflow data storage/manipulation), zarr
When checking whether we need to add a loop termination command to the commands of an action, we call `Workflow.get_iteration_final_run_IDs`, which in turn calls `Workflow.get_loop_map`. This then calls `Workflow.get_EARs_from_IDs` (which reads the runs metadata array) on all run IDs from that submission, which could be many thousands of runs for large workflows.

In principle, this shouldn't be a problem, because Zarr supports multiprocess reading. In practice, something seems to go wrong here under high-concurrency scenarios (i.e. using a large job array when the cluster has very good availability).
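For illustration, a heavily simplified sketch of that call chain (only the method names come from this issue; the bodies and the `get_submission_run_IDs` helper are hypothetical, not the actual hpcflow source):

```python
# Heavily simplified, hypothetical rendering of the call chain; the real
# hpcflow methods differ in signature and detail.
class Workflow:
    def get_iteration_final_run_IDs(self, submission_idx):
        loop_map = self.get_loop_map(submission_idx)
        return {name: runs[-1] for name, runs in loop_map.items()}

    def get_loop_map(self, submission_idx):
        run_IDs = self.get_submission_run_IDs(submission_idx)  # hypothetical helper
        # The problem: this reads the runs metadata array for *every* run
        # in the submission -- one chunk read per run, potentially many
        # thousands of reads for a large workflow.
        EARs = self.get_EARs_from_IDs(run_IDs)
        loop_map = {}
        for ear in EARs:
            loop_map.setdefault(ear.loop_name, []).append(ear.id_)
        return loop_map
```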
We get random `RuntimeError`s from `numcodecs` during chunk decompression from this metadata array. These errors are guarded against using the `reretry` package. However, for tasks that should be quick, this introduces a potentially lengthy delay to execution, especially for large workflows.

Additionally, reading the whole array is slow on Lustre file systems in general, because this array must use a chunk size of one (one chunk, and therefore one file, per run) to allow for multi-process writing during execution. So we ideally want to avoid reading most or all of the array anyway.
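For illustration, a minimal sketch of the two behaviours described above; the array path, codec choice, and retry parameters are assumptions for the example, not hpcflow's actual configuration:

```python
import numcodecs
import zarr
from reretry import retry

num_runs = 10_000

# One chunk (hence one file on disk) per run, so many processes can each
# write their own run's metadata concurrently without locking. The cost:
# reading the whole array means opening ~num_runs small files, which is
# slow on Lustre.
runs = zarr.open(
    "workflow/metadata/runs",  # illustrative path
    mode="a",
    shape=(num_runs,),
    chunks=(1,),
    dtype=object,
    object_codec=numcodecs.JSON(),  # illustrative codec choice
)

# Under heavy concurrent access, chunk decompression can raise a random
# RuntimeError from numcodecs; retrying masks the error but can stall a
# task that should otherwise be quick.
@retry(RuntimeError, tries=5, delay=1, backoff=2)  # illustrative parameters
def read_run(run_id):
    return runs[run_id]
```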
Two steps to solve:

1. Avoid reading most or all of the runs metadata array when building the loop map (only retrieve the runs that are actually needed).
2. Skip the check entirely when it cannot apply, by guarding it with an `if` statement (see the sketch below).
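A minimal sketch of what the guard in step 2 might look like (the condition and helper names are hypothetical):

```python
# Hypothetical sketch: skip the expensive loop-map lookup entirely when
# the action's task is not part of any loop.
def write_commands(self, action):
    commands = list(action.commands)
    if self.task_is_in_loop(action.task):  # hypothetical helper
        # Only now do we touch the runs metadata array:
        final_IDs = self.workflow.get_iteration_final_run_IDs(self.submission_idx)
        if action.run_ID in final_IDs:
            commands.append(LOOP_TERMINATION_COMMAND)  # hypothetical constant
    return commands
```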