Efficient Usage of HPC for Deep Learning - Workshop Examples

You can install all required libraries using the following command:

conda env create -f environment.yml
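After creating the environment, it can be useful to confirm that PyTorch sees the GPUs on a compute node. A minimal sketch of such a check, assuming PyTorch is among the libraries listed in environment.yml:

import torch

# Quick sanity check that the environment and the node expose the GPUs
# the training scripts expect.
print("CUDA available:", torch.cuda.is_available())
print("GPU count:", torch.cuda.device_count())
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i))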

Bird Classification Example

Data is available at the 100 Bird Species Dataset.

Task Parallelization on a Single Node

The scripts train_singlenode.sh and train_singlenode.py demonstrate running multiple training tasks in parallel with distinct parameters. Each task operates independently on a separate GPU. Execute the example using the following command:

sbatch train_singlenode.sh

Details for each srun experiment can be found at Run 1 and Run 2. The expected training time for 10 epochs is approximately 30 minutes on a single Nvidia V100S GPU.
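The general pattern is that the batch script launches several srun steps, each pinned to one GPU and given its own hyperparameters, while the Python side simply trains on whichever single GPU it is handed. A minimal sketch of that per-task logic (illustrative only; the argument names --lr and --run-name are placeholders, not necessarily those used in train_singlenode.py):

import argparse
import torch

# Each srun task runs this script independently with its own arguments.
# SLURM (via --gpus-per-task or CUDA_VISIBLE_DEVICES) exposes exactly one
# GPU to the task, so "cuda:0" refers to a different physical GPU per task.
parser = argparse.ArgumentParser()
parser.add_argument("--lr", type=float, default=1e-3)        # placeholder hyperparameter
parser.add_argument("--run-name", type=str, default="run1")  # placeholder run identifier
args = parser.parse_args()

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
print(f"{args.run_name}: training with lr={args.lr} on {device}")
# ... build the model and data loaders, then train as usual on `device`.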

Data Parallelization on a Single Node using PyTorch DataParallel (constrained)

The scripts train_singlenode_dp.sh and train_singlenode_dp.py illustrate data parallelization using the DataParallel approach. PyTorch's DataParallel replicates the model on every available GPU within a single machine and splits each input batch into smaller chunks that are processed in parallel, one chunk per GPU. Each replica computes gradients on its chunk; the gradients are accumulated on the primary GPU, where the parameters are updated, and the updated model is broadcast back to the replicas on the next forward pass. Run the example using the following command:

sbatch train_singlenode_dp.sh

Details are available at Run. The expected training time for 10 epochs is approximately 20 minutes on 2 Nvidia V100S GPUs.
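In code, DataParallel is essentially a one-line wrapper around the model. A minimal sketch (the model and the class count of 100 are placeholders, not the bird classifier from the example):

import torch
import torch.nn as nn

# Placeholder model; the workshop example trains an image classifier instead.
model = nn.Sequential(nn.Flatten(), nn.Linear(3 * 224 * 224, 100))

if torch.cuda.device_count() > 1:
    # Replicates the model on every visible GPU; each forward pass splits the
    # input batch across the replicas and gathers the outputs on GPU 0.
    model = nn.DataParallel(model)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)

x = torch.randn(8, 3, 224, 224, device=device)  # dummy batch
out = model(x)
print(out.shape)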

Data Parallelization on Multiple Nodes using PyTorch DistributedDataParallel

The scripts train_multinode_ddp.sh and train_multinode_ddp.py illustrate data parallelization using the DistributedDataParallel approach. DistributedDataParallel allows efficient parallelization across multiple nodes: it runs one process per GPU, distributes the workload among the nodes, and uses collective communication operations to synchronize gradients and keep model parameters identical on every replica. Because gradient communication is overlapped with the backward pass, the overhead between nodes stays low, enabling faster training and better scaling for larger models and datasets.

Even on a single node, DistributedDataParallel incurs less overhead than DataParallel because it uses one process per GPU rather than a single multi-threaded process. Execute the following command to run the example on multiple nodes:

sbatch train_multinode_ddp.sh

Refer to the experiment's details at Run. The expected training time is approximately 10 minutes on 4 Nvidia V100S GPUs.
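The core DDP pattern on a cluster is: each process initializes a process group, pins itself to one GPU, and wraps the model in DistributedDataParallel; a DistributedSampler would additionally give every rank its own shard of the data. A minimal sketch, assuming the launcher (e.g. torchrun, or srun in the sbatch script) sets the usual MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE, and LOCAL_RANK environment variables; the model is again a placeholder:

import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Reads RANK/WORLD_SIZE/MASTER_ADDR/MASTER_PORT from the environment,
    # which the launcher is expected to provide.
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    # Placeholder model; the workshop script trains an image classifier and
    # would also use a DistributedSampler in its DataLoader.
    model = nn.Linear(10, 2).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])

    # Gradients are averaged across all ranks during backward(), so every
    # replica applies the same parameter update.
    x = torch.randn(4, 10).cuda(local_rank)
    loss = model(x).sum()
    loss.backward()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()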
