High-Performance Computing (HPC) #16182

hongbo-miao · 2024-04-27T06:23:53Z

High-Performance Computing (HPC)

AWS ParallelCluster

Tutorial
Tickets
- Terraform support: aws/aws-parallelcluster#6092 (Integrate parallelcluster deployment with Terraform)
- Amazon Linux 2023 support: aws/aws-parallelcluster#5214 (Feature request: AmazonLinux 2023 (al2023) support)
- price-capacity-optimized support: aws/aws-parallelcluster#6058 (Feature request: price-capacity-optimized)
- Lmod: aws/aws-parallelcluster#5808 (Support Lmod)
Config
- EnableMemoryBasedScheduling
  
  With EnableMemoryBasedScheduling: true, the Slurm scheduler tracks the amount of memory that each job requires on each node. Then, the Slurm scheduler uses this information to schedule multiple jobs on the same compute node. The total amount of memory that jobs require on a node can't be larger than the available node memory. The scheduler prevents a job from using more memory than what was requested when the job was submitted.
  With EnableMemoryBasedScheduling: false, jobs might compete for memory on a shared node and cause job failures and out-of-memory events.
- QueueUpdateStrategy
- Efa
  
  Specifies that Elastic Fabric Adapter (EFA) is enabled. To view the list of EC2 instances that support EFA, see Supported instance types in the Amazon EC2 User Guide for Linux Instances. For more information, see Elastic Fabric Adapter. We recommend that you use a cluster SlurmQueues / Networking / PlacementGroup to minimize latencies between instances.
  The default value is false.
- GdrSupport
  
  Starting with AWS ParallelCluster version 3.0.2, this setting has no effect.
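
As a rough illustration of where the settings above live, here is a minimal sketch of the Scheduling section of a ParallelCluster 3 config. The queue name is taken from the Slurm session in this issue; the compute resource name is guessed from the sbatch constraint, and the subnet ID, counts, and the DRAIN choice are placeholders and assumptions, not the actual contents of config/hm-hpc-cluster-config.yaml.

```yaml
Scheduling:
  Scheduler: slurm
  SlurmSettings:
    # Assumed value; the other documented options are COMPUTE_FLEET_STOP and TERMINATE
    QueueUpdateStrategy: DRAIN
    # Slurm tracks per-job memory so that several jobs can share one node
    EnableMemoryBasedScheduling: true
  SlurmQueues:
    - Name: spot-queue
      CapacityType: SPOT
      Networking:
        SubnetIds:
          - subnet-xxxxxxxx # placeholder, replace with a real subnet ID
        # Recommended together with EFA to reduce latency between instances
        PlacementGroup:
          Enabled: true
      ComputeResources:
        - Name: c7gn-16xlarge # assumed name, guessed from the sbatch constraint
          Instances:
            - InstanceType: c7gn.16xlarge
          MinCount: 0
          MaxCount: 1 # placeholder
          Efa:
            Enabled: true # the default is false
```

Each compute resource here lists a single instance type, because memory-based scheduling is not supported when one compute resource mixes several instance types.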

HPC job managing and scheduling tools

Slurm: https://slurm.schedmd.com/documentation.html
- Tutorial: https://www.c4.ucsf.edu/get-started/hello-world-job.html
(Old) PBS/TORQUE: https://carc.unm.edu/user-support-2/running-jobs/pbstorque.html

Comparison


hongbo-miao · 2024-04-29T00:24:55Z

Added by

Added diagram as part of the full architecture.

➜ pcluster create-cluster --cluster-name=hm-hpc-cluster --cluster-configuration=config/hm-hpc-cluster-config.yaml
{
  "cluster": {
    "clusterName": "hm-hpc-cluster",
    "cloudformationStackStatus": "CREATE_IN_PROGRESS",
    "cloudformationStackArn": "arn:aws:cloudformation:us-west-2:272394222652:stack/hm-hpc-cluster/d8cc5ef0-05b0-11ef-bf6e-0ac24af65aed",
    "region": "us-west-2",
    "version": "3.9.1",
    "clusterStatus": "CREATE_IN_PROGRESS",
    "scheduler": {
      "type": "slurm"
    }
  }
}
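
The create call returns while the stack is still CREATE_IN_PROGRESS, so something has to wait before `pcluster ssh` will work. A minimal sketch of one way to do that, assuming the JSON shape shown above: the helper names `cluster_status` and `wait_for_cluster` are hypothetical, and the 30-second interval is arbitrary.

```shell
#!/usr/bin/env bash

# Extract the "clusterStatus" value from pcluster JSON read on stdin.
# Hypothetical helper; relies only on the key name shown in the output above.
cluster_status() {
  sed -n 's/.*"clusterStatus": *"\([^"]*\)".*/\1/p' | head -n 1
}

# Poll `pcluster describe-cluster` until the cluster leaves CREATE_IN_PROGRESS.
# Hypothetical helper; $1 is the cluster name.
wait_for_cluster() {
  local name="$1" status
  while true; do
    status="$(pcluster describe-cluster --cluster-name="${name}" | cluster_status)"
    echo "clusterStatus: ${status}"
    if [ "${status}" != "CREATE_IN_PROGRESS" ]; then
      break
    fi
    sleep 30
  done
}
```

After sourcing the file, `wait_for_cluster hm-hpc-cluster` would block until the status changes. The loop stops on any other status, including failure states, so check the final status before connecting.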

➜ pcluster ssh --cluster-name=hm-hpc-cluster

ubuntu@ip-172-31-32-251:~$ sbatch --nodes=3 --partition=spot-queue --constraint="[c7gn-16xlarge*1&c7gn-metal*2]" --wrap="srun jobs/hello.sh"
Submitted batch job 1

ubuntu@ip-172-31-32-251:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1 spot-queu     wrap   ubuntu CF       0:06      3 spot-queue-dy-c7gn16xlarge-1,spot-queue-dy-c7gnmetal-[1-2]

ubuntu@ip-172-31-32-251:~$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)

ubuntu@ip-172-31-32-251:~$ ls
jobs  slurm-1.out

ubuntu@ip-172-31-32-251:~$ cat slurm-1.out
Hello from spot-queue-dy-c7gn16xlarge-1
Hello from spot-queue-dy-c7gnmetal-1
Hello from spot-queue-dy-c7gnmetal-2
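
The contents of jobs/hello.sh are not shown in this issue. A minimal sketch that would produce output of the form above, assuming the script only prints the node it runs on: the function name `hello` is hypothetical. SLURMD_NODENAME is set by Slurm for each task, with `hostname` as a fallback outside Slurm.

```shell
#!/usr/bin/env bash
# Hypothetical jobs/hello.sh: print a greeting that names the node running this task.

hello() {
  # SLURMD_NODENAME is set by Slurm per task; fall back to hostname otherwise.
  echo "Hello from ${SLURMD_NODENAME:-$(hostname)}"
}

hello
```

Because the sbatch command wraps the script in `srun`, it runs once per allocated node, which is why slurm-1.out has three lines.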

hongbo-miao added the feature request label Apr 27, 2024

hongbo-miao self-assigned this Apr 27, 2024

hongbo-miao changed the title ~~HPC job managing and scheduling~~ HPC Apr 27, 2024

hongbo-miao linked a pull request Apr 28, 2024 that will close this issue

feat(aws): add aws parallelcluster and slurm #16260

Merged

hongbo-miao changed the title ~~HPC~~ High-Performance Computing (HPC) Apr 28, 2024

hongbo-miao closed this as completed Apr 28, 2024

hongbo-miao linked a pull request Apr 29, 2024 that will close this issue

perf(aws-parallelcluster): set QueueUpdateStrategy, EnableMemoryBasedScheduling, Efa, PlacementGroup #16319

Merged

Repository owner locked and limited conversation to collaborators May 29, 2024

hongbo-miao converted this issue into discussion #16981 May 29, 2024
