
Integration of multi GPU processing (Issue #441) #990

Open
wants to merge 33 commits into base: master

Conversation

@entiri (Collaborator) commented Nov 15, 2021

Checklist

GitHub

  • I've given this PR a concise, self-descriptive, and meaningful title
  • I've linked relevant issues in the PR body
  • I've applied the relevant labels to this PR
  • I've assigned a reviewer

PR contents

Description

This PR addresses the feature request in issue #441. Data parallelism is implemented with PyTorch's Distributed Data Parallel (DDP): when multiple GPUs are available, training is distributed across them with DDP, which shortens training time.

The behavior of training/testing models with a single GPU or no GPU is unchanged. No new tests are required, and all existing tests pass. The --train and --test commands were tested with multiple contrasts.

Testing the PR:
To test this PR, train or test a model on a machine with multiple GPUs.
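For reviewers who have not worked with DDP before, the general pattern this PR relies on looks roughly like the sketch below: one worker process per GPU, each joining a shared process group and wrapping the model in DistributedDataParallel. This is a minimal illustration of the standard PyTorch pattern, not ivadomed's actual code; the placeholder model and the localhost address/port are assumptions.

# Minimal DDP worker sketch (standard PyTorch pattern; not ivadomed's actual code).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank: int, world_size: int):
    # Each spawned process joins the same process group under its own rank.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(16, 1).cuda(rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank])  # gradients are synchronized across ranks

    # ... regular training loop, each rank seeing its own shard of the data ...

    dist.destroy_process_group()

torch.multiprocessing.spawn launches one such worker per GPU and passes the rank as the first argument, which is why the train function in this PR gained a rank parameter.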

Linked issues

Addresses #441

resume_training=resume_training,
debugging=context["debugging"])
n_gpus = device_count()
if n_gpus > 1:
Member


Can we not make this work with the n_gpus=1 case as well (for the sake of minimizing code duplication)?
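One way to do that could be to always launch through mp.spawn and let nprocs degenerate to 1 when a single GPU is present. A rough sketch, assuming train already takes rank as its first argument and handles the world_size == 1 case internally (argument list abbreviated; this is not the PR's actual code):

# Sketch of a single launch path for the 1-GPU and multi-GPU cases (illustrative).
import torch.multiprocessing as mp
from torch.cuda import device_count

n_gpus = device_count()
if n_gpus >= 1:
    # With nprocs=1 this reduces to ordinary single-GPU training,
    # so the same code path covers both cases.
    mp.spawn(train, args=(model_params, dataset_train, dataset_val,
                          training_params, path_output, device),
             nprocs=n_gpus)
else:
    # CPU fallback; rank=-1 means "no DDP" per the docstring in this PR.
    train(-1, model_params, dataset_train, dataset_val,
          training_params, path_output, device)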

ivadomed/training.py (outdated review thread, resolved)
@coveralls commented Nov 20, 2021

Pull Request Test Coverage Report for Build 3412880409

  • 17 of 99 (17.17%) changed or added relevant lines in 3 files are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.7%) to 68.36%

Changes Missing Coverage        Covered Lines   Changed/Added Lines   %
ivadomed/main.py                4               24                    16.67%
ivadomed/training.py            6               68                    8.82%

Files with Coverage Reduction   New Missed Lines   %
ivadomed/main.py                1                  15.17%
ivadomed/training.py            2                  10.7%

Totals (Coverage Status):
  • Change from base Build 3370120657: -0.7%
  • Covered Lines: 4252
  • Relevant Lines: 6220

💛 - Coveralls

ivadomed/training.py (outdated review thread, resolved)
@@ -25,8 +27,9 @@
cudnn.benchmark = True


def train(model_params, dataset_train, dataset_val, training_params, path_output, device,
cuda_available=True, metric_fns=None, n_gif=0, resume_training=False, debugging=False):
def train(rank, model_params, dataset_train, dataset_val, training_params, path_output, device,
Member


See comment below for the rank (int) comment.

@dyt811 (Member) commented Nov 29, 2021

@entiri is already aware of this but just posting as a general update for others who are curious.

This PR still needs a bit of work in terms of getting the best_* values back from the multi-process spawning. Currently, tutorial testing crashes at the end when trying to report the best timing. These are now known issues (even though the trained model is still valid and can be used for testing). Once the mp.process adoption is completed by Edward, these issues should become clearer.

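For reference, a common way to get best_* values back from spawned workers is to have rank 0 write them into a structure shared with the parent process, for example a multiprocessing Manager dict. A minimal sketch under that assumption (names and placeholder values are illustrative, not the PR's actual code):

# Illustrative: returning best_* metrics from spawned DDP workers (a sketch).
import torch.multiprocessing as mp

def train_worker(rank, world_size, return_dict):
    # ... DDP training loop producing the best metrics ...
    best_metrics = {"best_validation_dice": 0.0, "best_validation_loss": 0.0}  # placeholders
    if rank == 0:
        # Only rank 0 reports; the Manager dict is visible to the parent process.
        return_dict.update(best_metrics)

def launch(world_size):
    manager = mp.Manager()
    return_dict = manager.dict()
    mp.spawn(train_worker, args=(world_size, return_dict), nprocs=world_size)
    return dict(return_dict)

The keys mirror the BEST_* constants added in this PR; the parent process can then report the best timing and metrics after mp.spawn returns.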

ivadomed/main.py (outdated review thread, resolved)
ivadomed/training.py (outdated review thread, resolved)
@dyt811 marked this pull request as ready for review October 31, 2022 03:18
@dyt811 self-assigned this Oct 31, 2022
@dyt811 marked this pull request as draft October 31, 2022 15:17
@dyt811 marked this pull request as ready for review October 31, 2022 16:11
@hermancollin self-requested a review October 31, 2022 18:30
@mariehbourget (Member) left a comment


Thanks @entiri for looking into this!

I did a first-pass review and ran into some questions/issues with training and testing.
I'm not at all familiar with DDP training, so some questions may be a bit naïve at this point.

  1. I tried training on rosenberg and the training on multiple GPUs seemed to work well.
    However, I was wondering what the relation is between the config parameter gpu_ids and the DDP process. I was under the impression that I could control which GPUs would be used, but found that the job was spawned on all available GPUs regardless of the gpu_ids values in the config file.

  2. Similarly, there is a line in the terminal output that says which GPU is used, and it remains unchanged when DDP is used:
    INFO | ivadomed.utils:define_device:166 - Using GPU ID 0

  3. I noticed that the log file in the path_output does not contain the progress of the training and stops logging after this line:
    ivadomed.main:run_command:472 - Spawning workers

  4. I was unable to use the --test command with my DDP-trained model; it failed with the following error:

2022-11-14 10:06:48.603 | INFO     | ivadomed.testing:test:52 - Loading model: ../data_extrassd_maboudb/test_pr990/spineGeneric_01/best_model.pt
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/bin/ivadomed", line 33, in <module>
    sys.exit(load_entry_point('ivadomed', 'console_scripts', 'ivadomed')())
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 677, in run_main
    run_command(context=context,
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 581, in run_command
    pred_metrics = imed_testing.test(model_params=model_params,
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/testing.py", line 55, in test
    model.cuda()
AttributeError: 'collections.OrderedDict' object has no attribute 'cuda'
  5. Similarly, I was unable to use the --segment command; it failed with the following error (a possible fix for items 4 and 5 is sketched after this comment):
2022-11-14 10:09:43.781 | INFO     | ivadomed.utils:define_device:166 - Using GPU ID 0
2022-11-14 10:09:43.783 | DEBUG    | ivadomed.inference:get_preds:71 - PyTorch model detected at: ../data_extrassd_maboudb/test_pr990/spineGeneric_01/my_model/my_model.pt
2022-11-14 10:09:43.783 | DEBUG    | ivadomed.inference:get_preds:72 - Loading model from: ../data_extrassd_maboudb/test_pr990/spineGeneric_01/my_model/my_model.pt
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/bin/ivadomed", line 33, in <module>
    sys.exit(load_entry_point('ivadomed', 'console_scripts', 'ivadomed')())
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 677, in run_main
    run_command(context=context,
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 397, in run_command
    run_segment_command(context, model_params, no_patch, overlap_2d)
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 314, in run_segment_command
    pred_list, target_list = imed_inference.segment_volume(str(path_model),
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/inference.py", line 496, in segment_volume
    preds = get_preds(context, fname_model, model_params, gpu_id, batch)
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/inference.py", line 73, in get_preds
    model = torch.load(fname_model, map_location=device)
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/serialization.py", line 1046, in _load
    result = unpickler.load()
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 750, in __setstate__
    self.process_group = _get_default_group()
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

For reference, I followed the One-class segmentation with 2D U-Net tutorial for training and testing.
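Regarding errors 4 and 5 above: the --test traceback ('collections.OrderedDict' object has no attribute 'cuda') suggests a bare state_dict was saved where a full model object is expected, while the --segment traceback suggests the DDP wrapper itself was pickled, which cannot be un-pickled without an initialized process group. A common pattern that avoids both is to save the underlying module from rank 0 only. A minimal sketch under that assumption (illustrative, not ivadomed's saving code):

# Illustrative checkpoint saving under DDP (a sketch, not ivadomed's code).
import torch
import torch.distributed as dist

def save_checkpoint(ddp_model, path):
    if dist.get_rank() == 0:
        # Unwrap the DDP container so the checkpoint can later be loaded
        # by --test/--segment without any process group being initialized.
        torch.save(ddp_model.module, path)
    dist.barrier()  # keep all ranks in sync before training continues

Equivalently, one could save ddp_model.module.state_dict() and rebuild the model before load_state_dict() at inference time; either way, the object written to best_model.pt should not be the DDP wrapper.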


@dataclass(frozen=True)
class MultiGPUsKW:
# Address/Port information for DDP inter process communicaitons.
Member


Suggested change
# Address/Port information for DDP inter process communicaitons.
# Address/Port information for DDP inter process communications.

BEST_TRAINING_DICE: str = "best_training_dice"
BEST_TRAINING_LOSS: str = "best_training_loss"
BEST_VALIDATION_DICE: str = "best_validation_dice"
BEST_VALIDATION_LOSS: str = "best_validation_loss"
Member


Suggested change
BEST_VALIDATION_LOSS: str = "best_validation_loss"
BEST_VALIDATION_LOSS: str = "best_validation_loss"

Comment on lines +39 to +41
rank (int): the rank of the training function as a process. Default value is set to -1,
indicating that Distributed Parallel Processing will not be used.
rank == 0 means single GPU usually
Member


Suggested change
rank (int): the rank of the training function as a process. Default value is set to -1,
indicating that Distributed Parallel Processing will not be used.
rank == 0 means single GPU usually
rank (int): the rank of the training function as a process. Default value is set to -1,
indicating that Distributed Parallel Processing will not be used.
rank == 0 means single GPU usually.

Comment on lines +59 to +60
n_process (int): the total number of processes that will be used to run train. Default is set to 1,
indicating that no other processes will be run in parallel.
Member


Suggested change
n_process (int): the total number of processes that will be used to run train. Default is set to 1,
indicating that no other processes will be run in parallel.
n_process (int): the total number of processes that will be used to run train. Default is set to 1,
indicating that no other processes will be run in parallel.

ddp_setup_detected: bool = (rank != -1 or torch.cuda.device_count() > 1)

# Enable wandb tracking if the required params are found in the config file and the api key is correct
# Also disable in DDP scenarios.
Member


Naïve question: why is wandb deactivated with DDP?
AFAIK, wandb is the primary tool used for tracking training in ivadomed; is it possible to have it with DDP as well?
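For context, a common way to keep wandb tracking under DDP is to initialize and log from rank 0 only, so a single process reports metrics for the whole run. A minimal sketch of that pattern (illustrative, not ivadomed's code; the rank-in-(0, -1) convention follows this PR's "rank == -1 means no DDP" docstring, and the project name is a placeholder):

# Illustrative rank-0-only wandb tracking under DDP (a sketch, not ivadomed's code).
import wandb

def maybe_init_wandb(rank: int, config: dict):
    # Only rank 0 (or a non-DDP run, rank == -1) talks to wandb.
    if rank in (0, -1):
        wandb.init(project=config.get("project_name", "ivadomed-ddp"), config=config)

def maybe_log(rank: int, metrics: dict, step: int):
    if rank in (0, -1):
        wandb.log(metrics, step=step)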

Comment on lines +117 to +128
if ddp_setup_detected:
    sampler_train = DistributedSampler(dataset=dataset_train, rank=rank, num_replicas=n_process)
    train_loader = DataLoader(dataset_train, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                              shuffle=False, pin_memory=True, sampler=sampler_train,
                              collate_fn=imed_loader_utils.imed_collate,
                              num_workers=0)
elif no_ddp_setup_detected:
    sampler_train, shuffle_train = get_sampler(dataset_train, conditions, training_params[TrainingParamsKW.BALANCE_SAMPLES][BalanceSamplesKW.TYPE])
    train_loader = DataLoader(dataset_train, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                              shuffle=shuffle_train, pin_memory=True, sampler=sampler_train,
                              collate_fn=imed_loader_utils.imed_collate,
                              num_workers=0)
Member


Here the behavior with DDP is not the same as without DDP.

Without DDP: we call the get_sampler function that may or may not use the BalancedSampler depending on a config parameter here.

Shouldn't we have the same principle with DDP?
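If true distributed balanced sampling is out of scope for this PR, one option that at least keeps the choice config-driven (and removes the duplicated train/validation loader code) is to branch on the balance setting inside a single helper and warn when it cannot be honored under DDP. A rough sketch; get_sampler, conditions, and the collate function stand in for the existing ivadomed helpers:

# Sketch of a single, config-aware loader builder for DDP and non-DDP runs
# (illustrative; get_sampler/conditions/collate_fn are the existing helpers).
from loguru import logger
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, batch_size, collate_fn, balance_type, conditions,
                 ddp, rank, n_process, get_sampler):
    if ddp:
        if balance_type:
            # A plain DistributedSampler cannot reproduce balanced sampling;
            # warn rather than silently changing behavior.
            logger.warning("balance_samples is currently ignored when DDP is enabled.")
        sampler, shuffle = DistributedSampler(dataset, rank=rank,
                                              num_replicas=n_process), False
    else:
        sampler, shuffle = get_sampler(dataset, conditions, balance_type)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                      pin_memory=True, sampler=sampler,
                      collate_fn=collate_fn, num_workers=0)

The same helper could then serve both the train and validation loaders, which would also address the "Same as above" comment below.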

Comment on lines +133 to +146
if ddp_setup_detected:
    sampler_val = DistributedSampler(dataset=dataset_val, rank=rank, num_replicas=n_process)
    val_loader = DataLoader(dataset_val, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                            shuffle=False, pin_memory=True, sampler=sampler_val,
                            collate_fn=imed_loader_utils.imed_collate,
                            num_workers=0)
else:
    sampler_val, shuffle_val = get_sampler(dataset_val, conditions,
                                           training_params[TrainingParamsKW.BALANCE_SAMPLES][
                                               BalanceSamplesKW.TYPE])
    val_loader = DataLoader(dataset_val, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                            shuffle=shuffle_val, pin_memory=True, sampler=sampler_val,
                            collate_fn=imed_loader_utils.imed_collate,
                            num_workers=0)
Member


Same as above

@hermancollin (Contributor)

I trained a model for 10 epochs with and without the feature to compare the results. On bireli with 2 GPUs, time reported a wall-clock time of 3 min 38 s. On the master branch, using only 1 GPU, we get a 6 min 2 s wall-clock time, which amounts to an impressive ~40% decrease in training time when using 2 GPUs. @mariehbourget, was there a noticeable change in training time when using 8 GPUs?

  1. I tried training on rosenberg and the training on multiple GPUs seemed to work well.
    However, I was wondering what the relation is between the config parameter gpu_ids and the DDP process. I was under the impression that I could control which GPUs would be used, but found that the job was spawned on all available GPUs regardless of the gpu_ids values in the config file.

Yes, this is something I was worried about too. While it is a good thing performance-wise to use all available GPUs, we sometimes have to share the cluster with someone else who is using other cores, so it would be great if we could control which GPU IDs can be used.

So, training-wise, I think the feature works, but aside from the problems @mariehbourget encountered, maybe the user should have more control over which GPUs are used (i.e. some config keys?). If it isn't possible to specify certain GPU IDs to torch, the user should at least be able to control whether they want to use one specific GPU or all of them, for the sake of sharing the cluster with other students. I'm guessing we can control how many GPUs to use with the API and the rank parameter, but this could be more user-friendly.
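For reference, one way to honor a gpu_ids-style config key together with DDP is to restrict the visible devices before any CUDA call and then spawn one process per selected GPU; the selected devices then appear to PyTorch as devices 0..N-1, so ranks map onto them directly. A minimal sketch (illustrative; gpu_ids stands in for the existing config parameter and train_worker for the spawned training function):

# Illustrative: honoring a gpu_ids-like config entry with DDP (a sketch).
import os
import torch.multiprocessing as mp

def launch(train_worker, gpu_ids):
    # Must be set before any CUDA initialization; the selected GPUs are then
    # renumbered 0..len(gpu_ids)-1 from PyTorch's point of view.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    n_proc = len(gpu_ids)
    if n_proc > 1:
        mp.spawn(train_worker, args=(n_proc,), nprocs=n_proc)
    else:
        train_worker(0, n_proc)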

@mariehbourget (Member)

@mariehbourget, was there a noticeable change in training time when using 8 GPUs?

Yes! Sorry I forgot to mention it.
For a 100-epoch training with the tutorial config, with and without the feature, on rosenberg, I had:

  • 8min48s with 1 GPU
  • 2min01s with 8 GPUs

Labels: None yet
Projects: None yet
7 participants