
Running on SLURM cluster with multiple GPUs gets stuck at dataset creation step #572

@simonblauth

Description

I am having trouble using this repository. Any ideas what to do or try?

I want to run my submission on a SLURM cluster without Docker.
For now I am trying to run the baselines to test the setup.
Running with 1 GPU works fine, but the process gets stuck when using multiple GPUs.
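
For context, the job is submitted roughly like this (a minimal sketch; partition, paths, and the env name are placeholders, not my exact script):

#!/bin/bash
#SBATCH --job-name=algoperf-mnist
#SBATCH --nodes=1
#SBATCH --gres=gpu:2           # 1 GPU works, 2 GPUs hang (see below)
#SBATCH --cpus-per-task=8
#SBATCH --time=01:00:00

# conda env set up as described in README.md; "algoperf" is a placeholder name
source activate algoperf

# the torchrun command from below, with --nproc_per_node matching the GPU count
torchrun --standalone --nnodes=1 --nproc_per_node=2 submission_runner.py \
  --framework=pytorch --workload=mnist --experiment_dir=experiments \
  --experiment_name=baseline_adamw \
  --submission_path=baselines/adamw/pytorch/submission.py \
  --tuning_search_space=baselines/adamw/tuning_search_space.json \
  --torch_compile=False --overwrite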

Running with 1 GPU

Command used:

torchrun --standalone --nnodes=1 --nproc_per_node=1 submission_runner.py --framework=pytorch --workload=mnist --experiment_dir=experiments --experiment_name=baseline_adamw --submission_path=baselines/adamw/pytorch/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --torch_compile=False --overwrite

This works fine and runs the submission as expected. Abbreviated console output:

master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
I1114 19:32:40.648843 140249311245440 distributed_c10d.py:442] Added key: store_based_barrier_key:1 to store for rank: 0
I1114 19:32:40.649153 140249311245440 distributed_c10d.py:476] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 1 nodes.
I1114 19:32:41.436739 140249311245440 logger_utils.py:76] Creating experiment directory at experiments/baseline_adamw/mnist_pytorch.
I1114 19:32:41.453138 140249311245440 submission_runner.py:526] Using RNG seed 1762019175
I1114 19:32:41.454235 140249311245440 submission_runner.py:535] --- Tuning run 1/1 ---
I1114 19:32:41.454341 140249311245440 submission_runner.py:540] Creating tuning directory at experiments/baseline_adamw/mnist_pytorch/trial_1.
I1114 19:32:41.455712 140249311245440 logger_utils.py:92] Saving hparams to experiments/baseline_adamw/mnist_pytorch/trial_1/hparams.json.
I1114 19:32:41.458003 140249311245440 submission_runner.py:205] Initializing dataset.
I1114 19:32:41.458118 140249311245440 submission_runner.py:212] Initializing model.
I1114 19:32:42.231244 140249311245440 submission_runner.py:243] Initializing optimizer.
I1114 19:32:42.231692 140249311245440 submission_runner.py:250] Initializing metrics bundle.
I1114 19:32:42.231785 140249311245440 submission_runner.py:268] Initializing checkpoint and logger.
I1114 19:32:42.233011 140249311245440 submission_runner.py:288] Saving meta data to experiments/baseline_adamw/mnist_pytorch/trial_1/meta_data_0.json.
I1114 19:32:42.460128 140249311245440 submission_runner.py:292] Saving flags to experiments/baseline_adamw/mnist_pytorch/trial_1/flags_0.json.
I1114 19:32:42.499783 140249311245440 submission_runner.py:302] Starting training loop.
I1114 19:32:42.512056 140249311245440 dataset_info.py:578] Load dataset info from /home/blauths/data/mnist/3.0.1
I1114 19:32:42.515055 140249311245440 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I1114 19:32:42.515373 140249311245440 dataset_builder.py:528] Reusing dataset mnist (/home/blauths/data/mnist/3.0.1)
I1114 19:32:42.597422 140249311245440 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train[:50000], from /home/blauths/data/mnist/3.0.1
I1114 19:32:44.257822 140230997542464 logging_writer.py:48] [0] global_step=0, grad_norm=208.853928, loss=18.818687
I1114 19:32:44.281792 140249311245440 submission.py:120] 0) loss = 18.819, grad_norm = 208.854
I1114 19:32:44.467674 140249311245440 spec.py:321] Evaluating on the training split.
I1114 19:32:44.470147 140249311245440 dataset_info.py:578] Load dataset info from /home/blauths/data/mnist/3.0.1
I1114 19:32:44.473052 140249311245440 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I1114 19:32:44.473357 140249311245440 dataset_builder.py:528] Reusing dataset mnist (/home/blauths/data/mnist/3.0.1)
I1114 19:32:44.522919 140249311245440 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train[:50000], from /home/blauths/data/mnist/3.0.1
I1114 19:32:49.728950 140249311245440 spec.py:333] Evaluating on the validation split.
I1114 19:32:49.731426 140249311245440 dataset_info.py:578] Load dataset info from /home/blauths/data/mnist/3.0.1
I1114 19:32:49.734574 140249311245440 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I1114 19:32:49.734902 140249311245440 dataset_builder.py:528] Reusing dataset mnist (/home/blauths/data/mnist/3.0.1)
I1114 19:32:49.786858 140249311245440 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train[50000:], from /home/blauths/data/mnist/3.0.1
I1114 19:32:50.280914 140249311245440 spec.py:349] Evaluating on the test split.
I1114 19:32:50.283240 140249311245440 dataset_info.py:578] Load dataset info from /home/blauths/data/mnist/3.0.1
I1114 19:32:50.286474 140249311245440 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I1114 19:32:50.286855 140249311245440 dataset_builder.py:528] Reusing dataset mnist (/home/blauths/data/mnist/3.0.1)
I1114 19:32:50.341554 140249311245440 logging_logger.py:49] Constructing tf.data.Dataset mnist for split test, from /home/blauths/data/mnist/3.0.1
I1114 19:32:50.871682 140249311245440 submission_runner.py:396] Time since start: 8.37s, 	Step: 1, 	{'train/accuracy': 0.0499, 'train/loss': 1.18980361328125, 'validation/accuracy': 0.0518, 'validation/loss': 1.19333125, 'validation/num_examples': 10000, 'test/accuracy': 0.0552, 'test/loss': 1.188576171875, 'test/num_examples': 10000, 'score': 1.7838444709777832, 'total_duration': 8.372186660766602, 'accumulated_submission_time': 1.7838444709777832, 'accumulated_eval_time': 6.404211759567261, 'accumulated_logging_time': 0}
.
.
.
I1114 19:33:11.261379 140249311245440 spec.py:321] Evaluating on the training split.
I1114 19:33:11.271509 140249311245440 spec.py:333] Evaluating on the validation split.
I1114 19:33:11.281817 140249311245440 spec.py:349] Evaluating on the test split.
.
.
.
I1114 19:33:55.048449 140249311245440 submission_runner.py:569] Timing: 60.00460410118103
I1114 19:33:55.048520 140249311245440 submission_runner.py:571] Total number of evals: 7
I1114 19:33:55.048587 140249311245440 submission_runner.py:572] ====================
I1114 19:33:55.048690 140249311245440 submission_runner.py:651] Final mnist score: 60.00460410118103

Running with 2 GPUs

Command used:

torchrun --redirects 1:0 --standalone --nnodes=1 --nproc_per_node=2 submission_runner.py --framework=pytorch --workload=mnist --experiment_dir=experiments --experiment_name=baseline_adamw --submission_path=baselines/adamw/pytorch/submission.py --tuning_search_space=baselines/adamw/tuning_search_space.json --torch_compile=False --overwrite

This gets stuck at the dataset creation step until it times out.
See the following console output for where it stops:

master_addr is only used for static rdzv_backend and when rdzv_endpoint is not specified.
I1114 19:39:46.754390 140163014198400 distributed_c10d.py:442] Added key: store_based_barrier_key:1 to store for rank: 0
I1114 19:39:46.754344 139726293963904 distributed_c10d.py:442] Added key: store_based_barrier_key:1 to store for rank: 1
I1114 19:39:46.754640 140163014198400 distributed_c10d.py:476] Rank 0: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
I1114 19:39:46.754694 139726293963904 distributed_c10d.py:476] Rank 1: Completed store-based barrier for key:store_based_barrier_key:1 with 2 nodes.
I1114 19:39:47.555879 140163014198400 logger_utils.py:61] Removing existing experiment directory experiments/baseline_adamw/mnist_pytorch because --overwrite was set.
I1114 19:39:47.560014 140163014198400 logger_utils.py:76] Creating experiment directory at experiments/baseline_adamw/mnist_pytorch.
I1114 19:39:47.574892 140163014198400 submission_runner.py:526] Using RNG seed 545346501
I1114 19:39:47.575957 140163014198400 submission_runner.py:535] --- Tuning run 1/1 ---
I1114 19:39:47.576064 140163014198400 submission_runner.py:540] Creating tuning directory at experiments/baseline_adamw/mnist_pytorch/trial_1.
I1114 19:39:47.577447 140163014198400 logger_utils.py:92] Saving hparams to experiments/baseline_adamw/mnist_pytorch/trial_1/hparams.json.
I1114 19:39:47.579491 140163014198400 submission_runner.py:205] Initializing dataset.
I1114 19:39:47.579612 140163014198400 submission_runner.py:212] Initializing model.
I1114 19:39:48.925061 140163014198400 submission_runner.py:243] Initializing optimizer.
I1114 19:39:48.925512 140163014198400 submission_runner.py:250] Initializing metrics bundle.
I1114 19:39:48.925613 140163014198400 submission_runner.py:268] Initializing checkpoint and logger.
I1114 19:39:48.926866 140163014198400 submission_runner.py:288] Saving meta data to experiments/baseline_adamw/mnist_pytorch/trial_1/meta_data_0.json.
I1114 19:39:49.135014 140163014198400 submission_runner.py:292] Saving flags to experiments/baseline_adamw/mnist_pytorch/trial_1/flags_0.json.
I1114 19:39:49.175033 140163014198400 submission_runner.py:302] Starting training loop.
I1114 19:39:49.187256 140163014198400 dataset_info.py:578] Load dataset info from /home/blauths/data/mnist/3.0.1
I1114 19:39:49.190353 140163014198400 dataset_info.py:669] Fields info.[citation, splits, supervised_keys, module_name] from disk and from code do not match. Keeping the one from code.
I1114 19:39:49.190659 140163014198400 dataset_builder.py:528] Reusing dataset mnist (/home/blauths/data/mnist/3.0.1)
I1114 19:39:49.270506 140163014198400 logging_logger.py:49] Constructing tf.data.Dataset mnist for split train[:50000], from /home/blauths/data/mnist/3.0.1
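
If it helps with diagnosis, the hang can be inspected with the standard PyTorch/NCCL debug switches (a sketch; the submission_runner.py flags are the same as above, and the py-spy PID is whichever worker is stuck):

# extra logging from torch.distributed and NCCL
export NCCL_DEBUG=INFO
export TORCH_DISTRIBUTED_DEBUG=DETAIL
torchrun --standalone --nnodes=1 --nproc_per_node=2 submission_runner.py ...

# once it hangs, dump the Python stacks of a stuck worker (requires py-spy)
py-spy dump --pid <worker-pid>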

torch.compile

The behavior does not change when enabling/disabling torch.compile.

Using Singularity/Apptainer

I built the Singularity container from the Docker image.
Running the above commands inside the Singularity container exhibits the same behavior:
it still gets stuck with two GPUs and works fine with one.
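
For reference, the container was built and run roughly like this (image and file names are placeholders, not the exact ones I used):

# build the Singularity/Apptainer image from the local Docker image
singularity build algoperf.sif docker-daemon://algoperf:latest

# run the same command inside the container with GPU support (--nv)
singularity exec --nv algoperf.sif torchrun --standalone --nnodes=1 --nproc_per_node=2 \
  submission_runner.py ...   # same flags as above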

Steps to Reproduce

  • Clone the repository
  • Set up the conda env as described in README.md
  • Follow the installation instructions for the PyTorch version
  • Run the above command on a machine with multiple GPUs
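
To rule out a general NCCL/DDP problem on the node, a minimal all_reduce smoke test can be run with the same launcher (ddp_sanity.py is a throwaway script, not part of the repository):

cat > ddp_sanity.py <<'EOF'
import os
import torch
import torch.distributed as dist

# torchrun sets LOCAL_RANK for every worker
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl")

# if this completes on both ranks, basic GPU communication works
t = torch.ones(1, device="cuda")
dist.all_reduce(t)
print(f"rank {dist.get_rank()}: all_reduce ok, value = {t.item()}")
dist.destroy_process_group()
EOF

torchrun --standalone --nnodes=1 --nproc_per_node=2 ddp_sanity.py

If this test passes on the same node, the hang is more likely in the input-pipeline synchronization than in NCCL itself.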
