Implement HPC execution #192

Merged
merged 13 commits into from
Dec 10, 2021

rvhonorato
Member

I implemented a new libhpc to handle the HPC executions (SLURM, but we can add TORQUE later). I tried my best to follow the same design of Brian's libparallel.
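
For readers unfamiliar with SLURM, the kind of .job file such a library has to write can be sketched as below. This is only an illustration and not the code in libhpc: the function name `make_slurm_job`, its parameters, and the placeholder commands are hypothetical. Only the `#SBATCH` directives themselves are standard SLURM options.

```python
def make_slurm_job(job_name, commands, queue=None):
    """Build the text of a SLURM job file (hypothetical helper, not libhpc)."""
    lines = [
        "#!/usr/bin/env bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --output={job_name}.out",
        f"#SBATCH --error={job_name}.err",
    ]
    if queue:
        # The partition is optional; SLURM falls back to the cluster default.
        lines.append(f"#SBATCH --partition={queue}")
    # The commands are whatever the job should execute, one per line.
    lines.extend(commands)
    return "\n".join(lines) + "\n"


# Placeholder commands; a real job would call the actual executable.
job_text = make_slurm_job("topoaa_1", ["echo model 1", "echo model 2"], queue="short")
print(job_text)
```

A file with this content can then be handed to `sbatch`, and supporting TORQUE later would mostly mean emitting `#PBS` directives instead.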

I also implemented queue_limit, which defines the size of the submission batches, and concat, which sets how many models each .job produces, similar to what we have in haddock2.4

# Concatenate models inside each .job; with concat = 5, each .job will produce 5 models
concat = 1
# Limit the number of concurrent submissions to the queue
queue_limit = 100
[2021-12-09 12:44:57,404 __init__ INFO] [topoaa] Running CNS Jobs n=11
[2021-12-09 12:44:57,404 libhpc INFO] Concatenating, each .job will produce 5 (or less) models
[2021-12-09 12:44:57,405 libhpc INFO] > Running batch 1/1
[2021-12-09 12:44:57,503 libhpc INFO] >> topoaa_1.job submitted
[2021-12-09 12:44:57,531 libhpc INFO] >> topoaa_2.job submitted
[2021-12-09 12:44:57,555 libhpc INFO] >> topoaa_3.job submitted
[2021-12-09 12:44:57,556 libhpc INFO] >> 0% done
[2021-12-09 12:44:57,557 libhpc INFO] >> Waiting... (10.00s)
[2021-12-09 12:45:07,638 libhpc INFO] >> 100% done
[2021-12-09 12:45:07,638 libhpc INFO] >> Took 10.23s
[2021-12-09 12:45:07,638 libhpc INFO] > Batch 1/1 done
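
How concat and queue_limit interact can be sketched in a few lines. This is a minimal illustration under the assumption that both parameters simply chunk a list; the helper names `chunk` and `plan_submission` and the model names are hypothetical and are not taken from libhpc.

```python
def chunk(items, size):
    """Split items into consecutive lists of at most `size` elements."""
    if size < 1:
        raise ValueError("size must be a positive integer")
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_submission(models, concat=1, queue_limit=100):
    """Group models into .job files, then group .job files into batches."""
    jobs = chunk(models, concat)       # each .job runs up to `concat` models
    return chunk(jobs, queue_limit)    # each batch holds up to `queue_limit` .jobs


# 11 models with concat = 5, as in the log above (model names are made up).
models = [f"model_{i}" for i in range(1, 12)]
batches = plan_submission(models, concat=5, queue_limit=100)
print(len(batches), [len(job) for job in batches[0]])  # 1 [5, 5, 1]
```

With 11 models and concat = 5 this gives three .job files in a single batch, which matches the "5 (or less) models" message and the "batch 1/1" line in the log.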

I also added a "terminate signal" that will remove the .jobs from the queue:

^C[2021-12-09 12:41:54,179 libhpc INFO] Terminate signal recieved, removing jobs from the queue...
[2021-12-09 12:41:54,214 libhpc INFO] Canceling topoaa_1.job - 20353291
[2021-12-09 12:41:54,261 libhpc INFO] Canceling topoaa_2.job - 20353292
[2021-12-09 12:41:54,313 libhpc INFO] Canceling topoaa_3.job - 20353293
[2021-12-09 12:41:54,360 libhpc INFO] Canceling topoaa_4.job - 20353294
[2021-12-09 12:41:54,410 libhpc INFO] Canceling topoaa_5.job - 20353295
[2021-12-09 12:41:54,465 libhpc INFO] Canceling topoaa_6.job - 20353296
[2021-12-09 12:41:54,516 libhpc INFO] Canceling topoaa_7.job - 20353297
[2021-12-09 12:41:54,563 libhpc INFO] Canceling topoaa_8.job - 20353298
[2021-12-09 12:41:54,615 libhpc INFO] Canceling topoaa_9.job - 20353299
[2021-12-09 12:41:54,670 libhpc INFO] Canceling topoaa_10.job - 20353300
[2021-12-09 12:41:54,721 libhpc INFO] Canceling topoaa_11.job - 20353301
[2021-12-09 12:41:54,744 libhpc INFO] The jobs in the queue were terminated in a controlled way
[2021-12-09 12:41:54,745 libworkflow INFO] You have halted subprocess execution by hitting Ctrl+c
[2021-12-09 12:41:54,746 libworkflow INFO] Exiting...
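
The idea behind the terminate signal can be sketched as follows. This is an assumption about the general pattern, not the libhpc code: the function names and the injectable `run` argument are hypothetical, while `scancel <job id>` is the standard SLURM command for removing a job from the queue.

```python
import subprocess


def cancel_jobs(job_ids, run=subprocess.run):
    """Call `scancel <id>` for every job id and return the ids processed."""
    cancelled = []
    for job_id in job_ids:
        # check=False: a job that already finished should not abort the cleanup.
        run(["scancel", str(job_id)], check=False)
        cancelled.append(job_id)
    return cancelled


def run_with_cleanup(wait_for_jobs, job_ids, run=subprocess.run):
    """Wait for the jobs; on Ctrl+C remove them from the queue and re-raise."""
    try:
        wait_for_jobs()
    except KeyboardInterrupt:
        cancel_jobs(job_ids, run=run)
        # Re-raise so the caller (the workflow) can log and exit.
        raise
```

Passing `run` as a parameter keeps the sketch testable without a cluster, since a fake runner can record the commands instead of executing them.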

To check whether the .jobs have finished, I added an "adaptive timer" that is both HPC friendly and efficient: for the first batch submission it uses predefined wait times of 10s (<10 jobs), 30s (<50) and 60s (>50); after that it keeps track of how long each batch took to finish and waits for the average time.
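
The timer logic described here can be sketched as below. The function name `next_wait` is hypothetical, and the handling of exactly 50 jobs is an assumption, since the description only gives <50 and >50.

```python
from statistics import mean


def next_wait(n_jobs, batch_durations):
    """Return the number of seconds to wait before polling the queue again."""
    if batch_durations:
        # After the first batch, wait for the average time batches took so far.
        return mean(batch_durations)
    # First batch: predefined wait times based on the number of jobs.
    if n_jobs < 10:
        return 10.0
    if n_jobs < 50:
        return 30.0
    return 60.0  # assumption: exactly 50 jobs is treated like >50


print(next_wait(3, []))            # 10.0
print(next_wait(3, [10.0, 20.0]))  # 15.0
```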

I have NOT added the re-submission logic, let's do it in another PR. This one is just for the implementation of libhpc

Member

@joaomcteixeira joaomcteixeira left a comment

I am sending these comments for now and will do a deeper review of what we discussed on Slack later. 👍

src/haddock/libs/libhpc.py (resolved)
src/haddock/libs/libhpc.py (outdated, resolved)

codecov-commenter commented Dec 9, 2021

Codecov Report

Merging #192 (17bf453) into main (4cb69b3) will decrease coverage by 1.23%.
The diff coverage is 22.43%.


@@            Coverage Diff             @@
##             main     #192      +/-   ##
==========================================
- Coverage   41.30%   40.06%   -1.24%     
==========================================
  Files          41       42       +1     
  Lines        2271     2401     +130     
==========================================
+ Hits          938      962      +24     
- Misses       1333     1439     +106     
| Impacted Files | Coverage Δ |
|---|---|
| `src/haddock/modules/sampling/rigidbody/__init__.py` | 22.95% <16.66%> (-2.05%) ⬇️ |
| `src/haddock/modules/refinement/emref/__init__.py` | 22.58% <20.00%> (-1.62%) ⬇️ |
| `src/haddock/modules/refinement/flexref/__init__.py` | 22.58% <20.00%> (-1.62%) ⬇️ |
| `src/haddock/modules/refinement/mdref/__init__.py` | 22.95% <20.00%> (-1.64%) ⬇️ |
| `src/haddock/modules/topology/topoaa/__init__.py` | 19.76% <20.00%> (-1.17%) ⬇️ |
| `src/haddock/libs/libhpc.py` | 21.55% <21.55%> (ø) |
| `src/haddock/modules/__init__.py` | 35.00% <35.71%> (-0.64%) ⬇️ |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Also, modularized functions for better testing.

TODO
test whether it reproduces what @rvhonorato implemented
A small touch over @rvhonorato's implementation. Good work m8!

Done:

* Turns some variables into parameters to make things more configurable and testable.
* Leverages what was done in `benchmark` regarding job header creation.
* Creates a small factory for the `Engine` in the modules' `_run`.
* Defines default variables at the module level so they are synchronized in `libworkflow` and `libhpc`.
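
The `Engine` factory idea can be sketched roughly as below. Everything in this block is a hypothetical illustration, not the code merged in this PR: the class names, the mode names "local" and "hpc", and the `get_engine` function are all assumptions.

```python
from functools import partial


class LocalEngine:
    """Hypothetical stand-in for a local (multiprocessing) scheduler."""

    def __init__(self, jobs, **params):
        self.jobs = list(jobs)
        self.params = params


class HPCEngine:
    """Hypothetical stand-in for a queue-based scheduler."""

    def __init__(self, jobs, **params):
        self.jobs = list(jobs)
        self.params = params


# Mode names are assumptions made for this sketch.
ENGINES = {"local": LocalEngine, "hpc": HPCEngine}


def get_engine(mode, **params):
    """Return a callable that builds the engine for `mode` with `params` bound."""
    try:
        engine_class = ENGINES[mode]
    except KeyError:
        raise ValueError(f"unknown mode {mode!r}; choose from {sorted(ENGINES)}") from None
    return partial(engine_class, **params)


# A module's _run could then build its engine without knowing which one it is.
Engine = get_engine("hpc", queue_limit=100, concat=5)
engine = Engine(["topoaa_1.job", "topoaa_2.job"])
```

Keeping the choice in one place means each module only passes the mode and its parameters, and the engine classes can be tested on their own.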
@joaomcteixeira
Member

@rvhonorato take a look if you wish. Ready to merge from my side.

@joaomcteixeira joaomcteixeira added the feature New feature request label Dec 10, 2021
@joaomcteixeira joaomcteixeira added this to In Progress in Features via automation Dec 10, 2021
@joaomcteixeira joaomcteixeira added this to the v3.0.0 stable release milestone Dec 10, 2021
@rvhonorato rvhonorato merged commit 7acbc1a into main Dec 10, 2021
Features automation moved this from In Progress to Done Dec 10, 2021
@rvhonorato rvhonorato deleted the hpc branch December 10, 2021 17:21