Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support Purdue Anvil. #775

Merged
merged 13 commits into from
Oct 30, 2023
Merged

Support Purdue Anvil. #775

merged 13 commits into from
Oct 30, 2023

Conversation

joaander
Copy link
Member

@joaander joaander commented Oct 27, 2023

Description

Add environment for Purdue Anvil.

Motivation and Context

Anvil is a national HPC resources allocated though the NSF ACCESS program. Allow users to submit jobs to Anvil using flow.

Checklist:

@joaander joaander requested review from a team as code owners October 27, 2023 15:50
@joaander joaander requested review from b-butler and syjlee and removed request for a team October 27, 2023 15:50
@codecov
Copy link

codecov bot commented Oct 27, 2023

Codecov Report

Merging #775 (852bca2) into main (ca38c29) will increase coverage by 0.06%.
The diff coverage is 88.23%.

@@            Coverage Diff             @@
##             main     #775      +/-   ##
==========================================
+ Coverage   69.18%   69.25%   +0.06%     
==========================================
  Files          44       45       +1     
  Lines        4332     4348      +16     
  Branches     1053     1055       +2     
==========================================
+ Hits         2997     3011      +14     
- Misses       1129     1131       +2     
  Partials      206      206              
Files Coverage Δ
flow/environments/__init__.py 100.00% <100.00%> (ø)
flow/environments/purdue.py 87.50% <87.50%> (ø)

@joaander
Copy link
Member Author

I have successfully tested the partitions debug, shared (default) and wholenode in production scripts (hoomd-validation).

The test-workflow jobs produce results that appear correct for wide, highmem, gpu-debug, and gpu. I don't have a GPU allocation to use for testing GPU submissions.

I'm not sure about the hybrid OMP/MPI jobs generated by generate_template_reference_data. Without a 'cpus-per-task', SLURM may not correctly assign tasks and threads as the user intends. However, this is a general problem of the SLURM template and therefore fixing it is beyond the scope of this pull request.

Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the environment @joaander!

changelog.txt Show resolved Hide resolved
flow/templates/anvil.sh Outdated Show resolved Hide resolved
flow/templates/anvil.sh Outdated Show resolved Hide resolved
flow/templates/anvil.sh Outdated Show resolved Hide resolved
@joaander
Copy link
Member Author

joaander commented Oct 30, 2023

mpirun with the -n option is currently broken on Anvil. It attempts to bind many processes to the same core:

mpirun --bind-to core:overload-allowed -n 8 --report-bindings hostname  
a246.anvil.rcac.purdue.edu
a246.anvil.rcac.purdue.edu
a246.anvil.rcac.purdue.edu
a246.anvil.rcac.purdue.edu
a246.anvil.rcac.purdue.edu
a246.anvil.rcac.purdue.edu
a246.anvil.rcac.purdue.edu
[a246.anvil.rcac.purdue.edu:286853] MCW rank 0 bound to socket 0[core 0[hwt 0]]: [B/./././././././././././././././.][././././././././././././././././././././././././././././././././././././././.]
[a246.anvil.rcac.purdue.edu:286853] MCW rank 1 bound to socket 0[core 1[hwt 0]]: [./B/././././././././././././././.][././././././././././././././././././././././././././././././././././././././.]
[a246.anvil.rcac.purdue.edu:286853] MCW rank 2 bound to socket 1[core 17[hwt 0]]: [././././././././././././././././.][B/./././././././././././././././././././././././././././././././././././././.]
[a246.anvil.rcac.purdue.edu:286853] MCW rank 3 bound to socket 1[core 33[hwt 0]]: [././././././././././././././././.][././././././././././././././././B/./././././././././././././././././././././.]
[a246.anvil.rcac.purdue.edu:286853] MCW rank 4 bound to socket 1[core 47[hwt 0]]: [././././././././././././././././.][././././././././././././././././././././././././././././././B/./././././././.]
[a246.anvil.rcac.purdue.edu:286853] MCW rank 5 bound to socket 0[core 0[hwt 0]]: [B/./././././././././././././././.][././././././././././././././././././././././././././././././././././././././.]
[a246.anvil.rcac.purdue.edu:286853] MCW rank 6 bound to socket 0[core 2[hwt 0]]: [././B/./././././././././././././.][././././././././././././././././././././././././././././././././././././././.]
[a246.anvil.rcac.purdue.edu:286853] MCW rank 7 bound to socket 1[core 18[hwt 0]]: [././././././././././././././././.][./B/././././././././././././././././././././././././././././././././././././.]
a246.anvil.rcac.purdue.edu

It seems to work correctly without the -n option.

Edit: Leaving off the -n option produces incorrect bindings with --cpus-per-task. I will work around by disabling binding entirely.

Copy link
Member

@b-butler b-butler left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I misspoke in my previous review, so I went ahead and made some small changes to the template. Please check that they are correct, and if so feel free to (auto-)merge.

Edit: @joaander

@joaander
Copy link
Member Author

I didn't misinterpret your earlier comments, I took them to the extreme and removed unneeded code. sbatch will error if the user requests GPUs in a non-gpu partition, just as it will error if a user requests more than the maximum number of cores in a specific partition (which we don't check for in the template).

Similarly, the -N is not needed and/or could be set unconditionally. However, it is not an error to include. These will likely be revised #777. I also noticed that other elements of the slurm scripts could be refactored - like moving the account to the base slurm script, or at least an intermediate common template.

@joaander joaander enabled auto-merge (squash) October 30, 2023 22:13
@joaander joaander merged commit c429e06 into main Oct 30, 2023
14 checks passed
@joaander joaander deleted the feature/anvil branch October 30, 2023 22:50
@b-butler b-butler added this to the v0.27.0 milestone Jan 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants