@ilumsden
Collaborator

Description

This PR adds the infrastructure files for our dry runs ahead of the HPDC tutorial. The dry-run configuration is the same as the test infrastructure, with the following exceptions:

  • Max concurrent attendees is set to 14
  • The minimum number of nodes per zone is set to 2
  • The maximum number of nodes per zone is set to 8
  • The region has been changed to us-east-2 (i.e., Ohio) since that's what we'll use for HPDC
  • The availability zones have been changed to us-east-2a and us-east-2b

@slabasan this PR is ready for review, but please do not merge until after our dry-run.

@ilumsden ilumsden requested a review from slabasan July 10, 2025 19:22
@ilumsden ilumsden self-assigned this Jul 10, 2025
@ilumsden ilumsden force-pushed the dry-run-infrastructure branch from 3f4ba9b to 8c03b56 Compare July 13, 2025 20:08
@ilumsden
Collaborator Author

Some notes from testing and the dry run:

  • There are issues with vCPU and VPC account limits in different regions. The dry-run infrastructure has been switched to us-west-1 to avoid these issues. For the final infrastructure, we will use us-east-1 (N. Virginia) to resolve the vCPU limit issue, and we have reserved a VPC in that region to prevent issues with the VPC limits.
  • To limit the number of vCPUs we use without needing users to share an AWS instance, I've changed the instance type from c7i.24xlarge to c7i.12xlarge. These instances are the same, except that 12xlarge has fewer cores and less memory.
  • Flux cannot be (easily) configured to automatically treat hardware threads (i.e., vCPUs in Kubernetes/AWS terms) as cores. This causes issues with Benchpark on the c7i.12xlarge instances because Flux tells Benchpark that there are only 24 cores on the node. As a result, in our Kripke scaling experiment, Benchpark configures the last run to execute on 2 nodes, but we do not actually have 2 nodes to run on. To work around this, I've updated the entrypoint scripts to accept an optional command-line argument. When that argument is omitted, the scripts behave the same as before. When a user passes the "real" number of cores, the entrypoint scripts generate a Flux TOML config that forces Flux to report that number of cores. We can use this from the infrastructure YAML files to make sure Flux is configured correctly. This PR does not add anything to the YAML files yet; I'll save that for a follow-up PR that adds the final infrastructure.
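
For reviewers who want a feel for the core-count override described in the last note, here is a rough Python sketch of the idea. It is not the code in the entrypoint scripts. The function name `flux_resource_toml` and the hostname `node0` are hypothetical, and the TOML keys (`resource.noverify`, plus a `[[resource.config]]` entry with `hosts` and `cores`) are my recollection of Flux's resource configuration format, so they should be checked against the Flux documentation before being relied on.

```python
import sys
from typing import Optional


def flux_resource_toml(hostname: str, real_cores: Optional[int]) -> Optional[str]:
    """Build Flux resource config text, or return None to keep default discovery.

    Hypothetical helper: mirrors the "optional argument" behavior described
    above, where no argument means nothing changes.
    """
    if real_cores is None:
        return None
    if real_cores < 1:
        raise ValueError("core count must be positive")

    # Flux core lists are idsets such as "0-47" (assumed format).
    core_ids = "0" if real_cores == 1 else f"0-{real_cores - 1}"

    return (
        "[resource]\n"
        # Assumed key: skip the check of configured vs. detected hardware.
        "noverify = true\n"
        "\n"
        "[[resource.config]]\n"
        f'hosts = "{hostname}"\n'
        f'cores = "{core_ids}"\n'
    )


if __name__ == "__main__":
    # Optional first argument: the "real" number of cores to advertise.
    cores = int(sys.argv[1]) if len(sys.argv) > 1 else None
    text = flux_resource_toml("node0", cores)
    if text is not None:
        print(text, end="")
```

Run with no argument, the sketch prints nothing, which corresponds to the unchanged default behavior. Run with `48`, it emits a config that advertises cores `0-47` on the (hypothetical) host.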

@ilumsden ilumsden force-pushed the dry-run-infrastructure branch from b02543c to fef22d9 Compare July 15, 2025 19:41
@ilumsden
Collaborator Author

@slabasan once CI is done running, this PR will be ready for review and merge. Once it's merged, I'll move on to generating and testing the final infrastructure that we will use for the tutorial.

@slabasan slabasan merged commit d32abe5 into develop Jul 15, 2025
6 checks passed
@slabasan slabasan deleted the dry-run-infrastructure branch July 15, 2025 21:10