Excessive metadata array reads in Workflow.write_commands #667

Open
1 of 2 tasks
aplowman opened this issue May 1, 2024 · 1 comment
Labels
bug Something isn't working persistence Related to persistent workflow data storage/manipulation zarr

Comments

@aplowman
Contributor

aplowman commented May 1, 2024

When checking if we need to add a loop termination command to the commands of an action, we call Workflow.get_iteration_final_run_IDs, which in turn calls Workflow.get_loop_map. This then calls Workflow.get_EARs_from_IDs (which reads the runs metadata array) on all run IDs from that submission, which could be many thousands of runs for large workflows.

In principle, this shouldn't be a problem because Zarr supports multi-process reading. In practice, something seems to go wrong under high concurrency (i.e. when using a large job array while the cluster has very good availability, so many jobs run at once). We get random RuntimeErrors from numcodecs during chunk decompression of this metadata array. These errors are guarded against using the reretry package. However, for tasks that should be quick, the retries introduce a potentially lengthy delay to execution, especially for large workflows.
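To illustrate why the retry guard adds latency, here is a minimal sketch of a retry-with-backoff decorator. It is hand-rolled using only the standard library and is not the reretry API; the names `retry` and `read_metadata_chunk`, and the simulated failure, are assumptions made for illustration.

```python
import time
from functools import wraps


def retry(exceptions, tries=3, delay=0.0, backoff=2.0):
    """Hypothetical retry decorator: re-run `func` when `exceptions` is raised."""

    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(1, tries + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == tries:
                        raise  # out of attempts: propagate the error
                    time.sleep(wait)  # this sleep is the added delay
                    wait *= backoff

        return wrapper

    return decorator


calls = {"n": 0}


@retry(RuntimeError, tries=5, delay=0.01)
def read_metadata_chunk():
    # Simulate a read that fails twice before succeeding.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated decompression failure")
    return "chunk contents"


result = read_metadata_chunk()
```

Each failed attempt costs the read time plus a sleep that grows with the backoff factor, so a quick task can end up dominated by waiting.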

Additionally, reading the whole array is slow on Lustre file systems in general, because this array must be chunked with one run per chunk (one chunk/file per run) to allow for multi-process writing during execution. So we ideally want to avoid reading most or all of the array anyway.
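As a rough illustration of the cost, the sketch below mimics a one-file-per-run layout with plain JSON files rather than Zarr. The helpers `write_runs` and `read_runs` are hypothetical; the point is only that the number of files opened scales with the number of run IDs requested.

```python
import json
import tempfile
from pathlib import Path


def write_runs(store: Path, n_runs: int) -> None:
    """Write one small file per run, mimicking a one-chunk-per-run layout."""
    for run_id in range(n_runs):
        (store / f"{run_id}.json").write_text(json.dumps({"run_id": run_id}))


def read_runs(store: Path, run_ids) -> list:
    """Open only the files belonging to the requested run IDs."""
    return [json.loads((store / f"{i}.json").read_text()) for i in run_ids]


with tempfile.TemporaryDirectory() as tmp:
    store = Path(tmp)
    write_runs(store, 1000)
    everything = read_runs(store, range(1000))  # opens 1000 files
    needed = read_runs(store, [3, 4])  # opens 2 files
```

On a parallel file system where each file open carries noticeable overhead, restricting the read to the run IDs actually needed avoids most of that cost.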

Two steps to solve:

  • Fix for the case where the workflow has no loops. This is easy, and should just require wrapping some existing code in an if statement.
  • Fix for the case where the workflow has loops.
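
A minimal sketch of the first step, assuming the guard is an early return when the workflow defines no loops. `WorkflowSketch` is a hypothetical stand-in, not the real Workflow class; the method names are borrowed from the description above, and the return type and loop-map logic are placeholders.

```python
class WorkflowSketch:
    """Hypothetical stand-in for illustration; not the real Workflow class."""

    def __init__(self, loops, run_ids):
        self.loops = loops  # names of loops in the workflow (assumed)
        self.run_ids = run_ids
        self.metadata_reads = 0

    def get_EARs_from_IDs(self, run_ids):
        # Stand-in for the expensive read of the runs metadata array.
        self.metadata_reads += len(run_ids)
        return [{"id": i} for i in run_ids]

    def get_iteration_final_run_IDs(self):
        # Guard: with no loops there is nothing to terminate,
        # so skip the metadata read entirely.
        if not self.loops:
            return {}
        runs = self.get_EARs_from_IDs(self.run_ids)
        # Placeholder logic: the real loop map is more involved.
        return {name: [r["id"] for r in runs] for name in self.loops}


no_loops = WorkflowSketch(loops=[], run_ids=list(range(5000)))
final_ids = no_loops.get_iteration_final_run_IDs()
```

With the guard in place, a loop-free workflow never touches the metadata array on this code path, however many runs the submission contains.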
@aplowman aplowman added bug Something isn't working zarr persistence Related to persistent workflow data storage/manipulation labels May 1, 2024
@aplowman
Contributor Author

aplowman commented May 1, 2024

First step fixed in #668.
