
Infra

Overview

The job scheduler is Slurm, a workload manager commonly used on Linux clusters.

The GCP system is made of the following components:

  • a login node, where users connect to prepare, launch and monitor jobs
  • a controller node, which manages the scheduler and keeps track of the system state. Users do not interact directly with this node.
  • worker nodes, where jobs are sent to be executed
  • a shared filesystem node, mounted on all nodes so that the /home directory is available from all workers

Nodes are grouped into sets called partitions, so that different nodes can be assigned for different purposes. They function as job queues, with different resources and priority settings.

Each user is assigned to one or more partitions, and when submitting a job the user's default partition is used unless one is specified explicitly.
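To see which partitions are available and their current state, use the sinfo command (the output below is illustrative; the actual partition names and node counts depend on the cluster configuration):

$ sinfo

PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
g2*          up   infinite      2   idle mlmini-g2-ghpc-[0-1]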

There are many resources online with detailed documentation; the following is only a simplified guide.

Diagram

(architecture diagram: login node, controller node, worker nodes and shared filesystem)

Login

TODO decide if we are going to use the GCP console and OS Login or a conventional SSH server with private keys

Job Execution

There are two possible modes to run a job: interactive and batch. In interactive mode the shell is connected to the worker and commands can be run directly on the node. In batch mode a script with the commands to run and the resources required is sent to the scheduler, and it runs as soon as enough resources are available.

Interactive

Runs an interactive session in a compute node.

$ srun --partition=g2 --nodes=1 --gpus-per-node=1 --time=01:00:00 --pty bash -i

Common parameters:

  • --partition=g2, the partition name
  • --nodes=1, the node count, usually 1 for interactive sessions
  • --gpus-per-node=1, the number of GPUs to use per node
  • --time=01:00:00, to set a time limit (optional)
  • --pty bash -i, to make it interactive with a bash shell
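srun can also run a single command on a worker node without opening a shell, which is handy for quick checks. For example, to print the GPU information of an allocated node:

$ srun --partition=g2 --nodes=1 --gpus-per-node=1 nvidia-smi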

Batch

Create a job script to launch in the background:

#!/bin/bash

# Command line options go here
#SBATCH --partition=g2
#SBATCH --time=00:01:00
#SBATCH --nodes=1
#SBATCH --job-name=example
#SBATCH --output=example.out
#SBATCH --gpus-per-node=1

# Command(s) go here
nvidia-smi

Then send the job to the batch scheduler:

$ sbatch myjob.sh

Submitted batch job 4
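Options passed on the sbatch command line override the matching #SBATCH directives in the script, so one-off changes do not require editing the file. For example, to resubmit the same script with a longer time limit:

$ sbatch --time=00:30:00 myjob.sh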

The following error is displayed when the maximum number of batch job submissions is reached.

sbatch: error: AssocGrpSubmitJobsLimit
sbatch: error: Batch job submission failed: Job violates accounting/QOS policy (job submit limit, user's size and/or time limits)

If the requested number of nodes exceeds the partition limit, the following will be displayed in the NODELIST(REASON) column of the squeue command.

PartitionNodeLimit

If the requested execution time exceeds the maximum allowed per job, the NODELIST(REASON) column of the squeue command will display the following.

AssocMaxWallDurationPerJobLimit
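The limits behind these messages can be inspected with scontrol; fields such as MaxNodes and MaxTime in the output reflect the actual cluster configuration:

$ scontrol show partition g2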

Check status

Check the status of all jobs:

$ squeue

Example output:

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 6        g2  example asolano_  R       0:02      1 mlmini-g2-ghpc-0

Or a single job:

$ squeue --job $JOBID

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                10        g2  example asolano_ CF       0:06      2 mlmini-g2-ghpc-[0-1]
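To list only your own jobs, or to inspect jobs that have already finished and no longer appear in squeue (sacct requires accounting to be enabled on the cluster):

$ squeue --user=$USER
$ sacct -j $JOBID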

Cancel a job

Cancel a running job:

$ scancel $JOBID
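Or cancel all jobs belonging to the current user at once:

$ scancel --user=$USER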

Environment preparation

With the NVIDIA driver installed, the entire user environment for each project can be set up independently with Conda, minimizing the use of environment modules.

Miniconda

Download a recent version of the Miniconda installer from the Anaconda repository (https://repo.anaconda.com/miniconda/):

$ wget https://repo.anaconda.com/miniconda/Miniconda3-py310_23.10.0-1-Linux-x86_64.sh
$ bash Miniconda3-py310_23.10.0-1-Linux-x86_64.sh
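If the installer's automatic shell initialization was skipped, conda can be enabled manually (assuming the default install location ~/miniconda3):

$ source ~/miniconda3/etc/profile.d/conda.sh
$ conda init bash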

Python

To select a specific version of Python for a project, specify it when creating the environment:

$ conda create -n myenv python=3.9
$ conda activate myenv

CUDA Toolkit

To install a complete CUDA Toolkit with a specific version, choose a package from the nvidia channel on Anaconda.org. For example:

$ conda install nvidia/label/cuda-11.8.0::cuda-toolkit
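To confirm the toolkit is visible inside the environment:

$ conda activate myenv
$ nvcc --version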

PyTorch

Install a specific PyTorch version from the list of previous versions on the PyTorch website (https://pytorch.org/get-started/previous-versions/). It is recommended to match the CUDA version installed above. For example:

$ conda install pytorch==2.2.0 torchvision==0.17.0 torchaudio==2.2.0 pytorch-cuda=11.8 -c pytorch -c nvidia
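A quick way to check that PyTorch was built against the expected CUDA version and can see a GPU (run it on a worker node, e.g. in an interactive session, since the login node may not have a GPU):

$ python -c "import torch; print(torch.__version__, torch.cuda.is_available())"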

Activation

This combination of packages sets all the paths for the project automatically. To activate it, add the environment loading code to the job script:

# job.sh

# ...

# Activate the correct conda environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

# ...
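Putting it all together, a complete batch script might look like the following sketch (train.py is a hypothetical project script):

#!/bin/bash

#SBATCH --partition=g2
#SBATCH --nodes=1
#SBATCH --gpus-per-node=1
#SBATCH --time=01:00:00
#SBATCH --job-name=train
#SBATCH --output=train.out

# Activate the correct conda environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate myenv

# Run the project code (hypothetical script)
python train.py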