
feat: parallelize viscosity calculation #174

Draft
ltalirz wants to merge 1 commit into `main` from `feat/parallelize-visc`

Conversation


ltalirz (Contributor) commented Mar 31, 2026

No description provided.


github-actions bot commented Mar 31, 2026

Coverage Report

| File | Stmts | Miss | Cover | Missing |
|---|---|---|---|---|
| `__init__.py` | 5 | 3 | 40% | 5–7 |
| `app.py` | 42 | 7 | 83% | 36–40, 80–81 |
| `config.py` | 18 | 4 | 77% | 17–19, 37 |
| `database.py` | 107 | 8 | 92% | 73, 117, 128–129, 210, 212, 214, 234 |
| `executor.py` | 43 | 28 | 34% | 48, 50, 57–59, 61, 66, 75–87, 101–104, 113, 126, 128, 134 |
| `models.py` | 191 | 17 | 91% | 67, 80–82, 88, 90, 94–104 |
| `routers/__init__.py` | 3 | 0 | 100% | |
| `routers/glasses.py` | 38 | 0 | 100% | |
| `routers/jobs.py` | 148 | 38 | 74% | 105–107, 192–193, 213, 245, 267, 289, 295, 299, 320, 322, 324–327, 330–332, 334–335, 340, 342–343, 345–346, 351–352, 354–356, 358, 360–364 |
| `routers/jobs_helpers.py` | 272 | 153 | 43% | 76, 134–135, 190, 192, 196, 202, 239, 242–243, 289–291, 298–300, 308–316, 320–322, 326–335, 344–345, 348–349, 353–360, 362–366, 369–372, 374, 376–381, 383–384, 389, 391–393, 395–397, 399–404, 406–411, 428–432, 434–444, 449–453, 455–456, 458–461, 463–464, 474, 478, 480, 482, 484, 494–499, 501, 504–506, 509–511, 513, 516–522, 525, 527, 538–540, 543–546, 549, 551 |
| `workflows/__init__.py` | 55 | 40 | 27% | 41–42, 52–53, 58–61, 72–77, 94, 96–98, 101, 103–105, 108–110, 120, 123–127, 129, 139–140, 151–155, 157 |
| `workflows/meltquench.py` | 24 | 17 | 29% | 39–41, 43–45, 47, 63, 65–66, 68–69, 71–73, 75, 88 |
| `workflows/analyses/__init__.py` | 5 | 0 | 100% | |
| `workflows/analyses/cte.py` | 100 | 88 | 12% | 22–23, 25–27, 29–30, 44, 71, 73, 80, 107, 109, 111, 132, 134–144, 146–148, 157–159, 161–162, 166–172, 174, 184, 197–198, 200, 216–218, 220–221, 223, 225, 253–254, 256, 258–262, 264–266, 268–273, 275–276, 278, 304, 306, 309–310, 312–318, 321–323, 325 |
| `workflows/analyses/elastic.py` | 35 | 27 | 22% | 22, 24, 26, 28, 36, 50–52, 54, 67–72, 74, 94–95, 124, 126, 128–130, 132–134, 136 |
| `workflows/analyses/meltquench_viz.py` | 90 | 90 | 0% | 3, 5–8, 10, 13, 15–20, 23, 32, 34, 36–38, 40–55, 57, 60, 67–69, 71, 80–82, 89, 92–94, 101, 107–111, 113–114, 116–118, 122–123, 125, 127, 129, 131, 134–137, 139, 142–149, 152, 154, 156, 158, 160, 163–164, 167–168, 170, 172, 214 |
| `workflows/analyses/structure.py` | 43 | 35 | 18% | 19, 21–22, 32–35, 37, 39, 41–47, 57, 62–63, 66–69, 72–76, 79–81, 83–85, 87 |
| `workflows/analyses/viscosity.py` | 125 | 108 | 13% | 30, 32, 34, 81–82, 85–86, 88–93, 95, 97–98, 101–102, 114–115, 118–119, 132–133, 135–140, 142, 164, 166–168, 170–172, 174–175, 177–180, 192–194, 196, 201, 203–205, 207–208, 220–221, 223, 236–237, 271, 274–278, 280, 289–295, 297, 306–310, 312, 322–324, 348–350, 378, 389–391, 411, 422–424, 444, 446–450, 452–459 |
| **TOTAL** | 1344 | 663 | 50% | |


ltalirz commented Mar 31, 2026

@jan-janssen So far I have only parallelized at the analysis level, but since the viscosity analysis is too slow, we also need to parallelize within it.

In this PR I special-case the viscosity workflow so I can still submit SLURM jobs for its internal steps, but it's not very elegant.

What is the suggested approach with executorlib for cases like these?

@jan-janssen

@ltalirz There are two constraints from my side:

  • One is the SLURM job manager: Is there any limit on how small the jobs should be, in terms of compute hours/minutes and number of cores? In principle, executorlib can submit jobs on the order of 1 minute of compute time, but this would result in a large number of small jobs. Instead we commonly use nested executors: https://executorlib.readthedocs.io/en/latest/3-hpc-job.html#slurm-with-flux
  • Two is the caching: For every job submitted to SLURM, executorlib creates one HDF5 file, while for nested executors we typically use socket-based communication; in this case the creation of an HDF5 file is optional. To choose the right caching level, we have to identify which kinds of results might be useful for a request from a different user later on.

Unless there is a strong preference for very small jobs from the job-manager perspective, or a drastic change in resource requirements, I would recommend packaging one user request into one SLURM job and then, within this SLURM job, using a nested executor to make efficient use of the available resources.
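The recommended nested-executor pattern can be sketched with the standard library's `concurrent.futures` as a stand-in for executorlib's SLURM/flux executors (`shear_step` and `viscosity_workflow` are hypothetical names, not from this PR):

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor

def shear_step(strain_rate):
    # Hypothetical inner task: one independent viscosity sub-calculation.
    return 2.0 * strain_rate

def viscosity_workflow(strain_rates):
    # Runs inside the single SLURM allocation; the nested executor fans the
    # sub-calculations out over the resources reserved for that allocation.
    with ThreadPoolExecutor(max_workers=4) as inner:
        futures = [inner.submit(shear_step, s) for s in strain_rates]
        return [f.result() for f in futures]

if __name__ == "__main__":
    # One outer job per user request (stand-in for the SLURM submission),
    # with the nested executor running inside it.
    with ProcessPoolExecutor(max_workers=1) as outer:
        print(outer.submit(viscosity_workflow, [1, 2, 3]).result())  # [2.0, 4.0, 6.0]
```

With executorlib the outer pool would be the SLURM-facing executor and the inner one a flux-backed executor inside the allocation, as described in the linked documentation.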


ltalirz commented Apr 13, 2026

Thanks for the detailed feedback, Jan!
Let me add a bit more detail to my question and then address your points as well.

My use case here is: at the top level, it makes sense to launch one workflow per analysis (CTE, viscosity, elastic constants). However, within each workflow it can be necessary to parallelize again. This is the case for the viscosity workflow: we are talking about many minutes to hours of runtime for each parallel job and, importantly, the parallel jobs may have significantly different runtimes.

Now, I could submit one "viscosity" SLURM job and then parallelize inside that job, but that would (a) require me to know in advance how many parallel jobs the viscosity workflow will want to spawn, so that I can allocate the right amount of computational resources (doable, but not elegant), and (b) be inefficient, because the entire reserved allocation stays blocked until the slowest of the parallel jobs has completed.

Alternatively, I could submit one SLURM job per parallel job (which is what I do in this PR), but then I lose the "workflow wrapper" around the viscosity analysis (no caching, and I have to use a different code path than the non-parallel workflows that run directly inside executors; see the code in this PR).

Ideally, I would like to do something like this: submit a 1-core job for the outer viscosity workflow that essentially just does the steering and data merging (in AiiDA, this comes for free via the AiiDA daemon process), and which then dynamically launches as many parallel computation jobs as needed.
Is that possible?
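The steering pattern described above, where the number of computation jobs is only decided at run time, can be sketched with stdlib executors (the task names are hypothetical; whether executorlib supports launching jobs from inside a running job is exactly the open question here):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_shear_simulation(temperature, seed):
    # Hypothetical computation job; in the real workflow each of these
    # would be a separate SLURM job running for minutes to hours.
    return temperature // 100 + seed

def steer_viscosity(temperature):
    # Lightweight 1-core steering step: the number of computation jobs
    # is only decided here, at run time.
    n_jobs = 3 if temperature > 1000 else 2
    with ThreadPoolExecutor(max_workers=n_jobs) as pool:
        futures = [pool.submit(run_shear_simulation, temperature, s)
                   for s in range(n_jobs)]
        # Merge results as each job finishes, so no large reservation
        # sits idle waiting for the slowest job.
        values = [f.result() for f in as_completed(futures)]
    return sum(values) / len(values)
```

In the desired setup the steering function would submit real SLURM jobs instead of thread-pool tasks, while itself occupying only a single core.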

Re 1.: We currently have some jobs that run for just a few seconds, where we could probably gain a bit by not running them through SLURM, but they are few, and the few seconds we lose here are not relevant in the big picture (and on the upside we get the caching); see below.

*[image attachment]*

Re 2.: In this particular case, a cache of the intermediate jobs would be nice to have, but not mandatory.

@jan-janssen

I am a bit confused. Based on the user input, do you know how many tasks you are going to create in the workflow, or is the number of tasks only determined at run time? If the number of tasks is only determined at run time, this requires hierarchical scheduling, because job schedulers like SLURM do not support this.

In terms of the comparison to AiiDA, you can create a local flux server and use it for short-running tasks. In this case you would submit the whole workflow to this flux scheduler and, inside this flux job, submit the individual tasks to SLURM. The question for me is whether you want to keep these flux jobs running while the SLURM jobs are waiting, or whether the task is stopped after the individual steps are submitted to SLURM.

In general, to me this comes down to the difference between caching and data management. Executorlib does caching, so when you submit the same task again we can reload the result; still, we do not claim to provide a research data management solution. So I would split the workflow into individual steps, but a hierarchy of caching is not something we provide at the moment.
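The per-task caching mentioned here (reloading a result when the same task is submitted again) boils down to keying results by function and arguments. A minimal file-based sketch of that idea, not executorlib's actual implementation (which, as noted above, uses one HDF5 file per SLURM job):

```python
import hashlib
import pickle
import tempfile
from pathlib import Path

# Throwaway cache directory for this sketch.
CACHE_DIR = Path(tempfile.mkdtemp(prefix="task_cache_"))

def cached_submit(fn, *args, **kwargs):
    # Key the result by function name and pickled arguments; executorlib
    # similarly keeps one cache file per submitted task.
    key = hashlib.sha256(
        pickle.dumps((fn.__name__, args, sorted(kwargs.items())))
    ).hexdigest()
    cache_file = CACHE_DIR / f"{key}.pkl"
    if cache_file.exists():                           # cache hit: reload
        return pickle.loads(cache_file.read_bytes())
    result = fn(*args, **kwargs)                      # cache miss: compute
    cache_file.write_bytes(pickle.dumps(result))
    return result
```

Such a cache is flat, one entry per task; a hierarchy of caches (workflow-level results layered over task-level ones) would have to be built on top, which is the gap identified in this discussion.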

