Example DDP model training

Installation on NeSI

Create a conda environment and install dependencies

module purge                                        # start from a clean module environment
module load Miniconda3/22.11.1-1
source $(conda info --base)/etc/profile.d/conda.sh  # make the conda command available in this shell
export PYTHONNOUSERSITE=1                           # ignore packages installed in the user's home directory
conda env create -f environment.lock.yml -p ./venv  # create the environment in the ./venv folder
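
To work with the environment interactively, it can then be activated in the same way. This is a minimal sketch, not taken from the repository scripts:

module purge
module load Miniconda3/22.11.1-1
source $(conda info --base)/etc/profile.d/conda.sh
conda activate ./venv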

Note: The environment.lock.yml file was generated by creating a conda environment from the environment.yml file and then exporting it with

conda env export -p ./venv --no-builds | sed '/^name: .*/d; /^prefix: .*/d' > environment.lock.yml

Getting started

Run the example via Slurm using

sbatch --account=ACCOUNT train_1node.sl

where ACCOUNT is your NeSI account.

The log files are saved in the logs/ folder.
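
To check on a submitted job and follow its output, something like the following can be used. The exact log file names depend on the #SBATCH --output setting in the Slurm script and are assumed here:

squeue -u $USER          # check the state of your jobs
ls logs/                 # list the generated log files
tail -f logs/<logfile>   # follow a log file while the job is running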

Notes

  • The rendez-vous address can be taken from SLURMD_NODENAME or set with

master_addr=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_ADDR=$master_addr

from https://gist.github.com/TengdaHan/1dd10d335c7ca6f13810fff41e809904?permalink_comment_id=3751671#gistcomment-3751671 (edited)

  • The multi-node scripts use a static rendez-vous backend. The C10d rendez-vous backend was not working: nodes failed to find each other on the platform. The reason is not clear and should be investigated further. A launch sketch using the static backend is given below.
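
As a rough illustration of a static rendez-vous launch from a Slurm batch script (the training script name train.py, the port 29500, and the 2 processes per node are assumptions, not values from this repository):

# Resolve the rendez-vous address on the first node of the allocation.
export MASTER_ADDR=$(scontrol show hostnames "$SLURM_JOB_NODELIST" | head -n 1)
export MASTER_PORT=29500  # assumed free port

# Launch one torchrun per node; single quotes so SLURM_NODEID is expanded
# on each node rather than once by the batch script.
srun --ntasks-per-node=1 bash -c 'torchrun \
    --nnodes=$SLURM_NNODES \
    --nproc_per_node=2 \
    --node_rank=$SLURM_NODEID \
    --rdzv_backend=static \
    --rdzv_endpoint=$MASTER_ADDR:$MASTER_PORT \
    train.py'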
