
Integration of multi GPU processing (Issue #441) #990

Open
wants to merge 33 commits into base: master

Conversation

@entiri (Collaborator) commented Nov 15, 2021

Checklist

GitHub

  • I've given this PR a concise, self-descriptive, and meaningful title
  • I've linked relevant issues in the PR body
  • I've applied the relevant labels to this PR
  • I've assigned a reviewer

PR contents

Description

This PR addresses the feature request in issue #441. Data parallelism is implemented with PyTorch's Distributed Data Parallel (DDP): when multiple GPUs are available, training is distributed across them with DDP, which shortens training time.

The behavior of training/testing models with a single GPU or no GPU is unchanged. No new tests are required, and all existing tests pass. The --train and --test commands were tested with multiple contrasts.

Testing the PR:
To test this PR, train or test a model on a machine with multiple GPUs.
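For reviewers who have not worked with DDP before, the general pattern this PR relies on looks roughly like the sketch below: one worker process per GPU, each joining a shared process group and wrapping the model in DistributedDataParallel. This is a minimal illustration of the standard PyTorch pattern, not ivadomed's actual code; the placeholder model and the localhost address/port are assumptions.

# Minimal DDP worker sketch (standard PyTorch pattern; not ivadomed's actual code).
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train_worker(rank: int, world_size: int):
    # Each spawned process joins the same process group under its own rank.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    model = torch.nn.Linear(16, 1).cuda(rank)  # placeholder model
    ddp_model = DDP(model, device_ids=[rank])  # gradients are synchronized across ranks

    # ... regular training loop, each rank seeing its own shard of the data ...

    dist.destroy_process_group()

torch.multiprocessing.spawn launches one such worker per GPU and passes the rank as the first argument, which is why the train function in this PR gained a rank parameter.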

Linked issues

Addresses #441

resume_training=resume_training,
debugging=context["debugging"])
n_gpus = device_count()
if n_gpus > 1:
Member


Can we not make this work with the n_gpus=1 case as well (for the sake of minimizing code duplication)?
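One way to do that could be to always launch through mp.spawn and let nprocs degenerate to 1 when a single GPU is present. A rough sketch, assuming train already takes rank as its first argument and handles the world_size == 1 case internally (argument list abbreviated; this is not the PR's actual code):

# Sketch of a single launch path for the 1-GPU and multi-GPU cases (illustrative).
import torch.multiprocessing as mp
from torch.cuda import device_count

n_gpus = device_count()
if n_gpus >= 1:
    # With nprocs=1 this reduces to ordinary single-GPU training,
    # so the same code path covers both cases.
    mp.spawn(train, args=(model_params, dataset_train, dataset_val,
                          training_params, path_output, device),
             nprocs=n_gpus)
else:
    # CPU fallback; rank=-1 means "no DDP" per the docstring in this PR.
    train(-1, model_params, dataset_train, dataset_val,
          training_params, path_output, device)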

ivadomed/training.py (outdated review thread, resolved)
@coveralls commented Nov 20, 2021

Pull Request Test Coverage Report for Build 3412880409

  • 17 of 99 (17.17%) changed or added relevant lines in 3 files are covered.
  • 3 unchanged lines in 2 files lost coverage.
  • Overall coverage decreased (-0.7%) to 68.36%

Changes Missing Coverage        Covered Lines   Changed/Added Lines   %
ivadomed/main.py                4               24                    16.67%
ivadomed/training.py            6               68                    8.82%

Files with Coverage Reduction   New Missed Lines   %
ivadomed/main.py                1                  15.17%
ivadomed/training.py            2                  10.7%

Totals (Coverage Status):
  • Change from base Build 3370120657: -0.7%
  • Covered Lines: 4252
  • Relevant Lines: 6220

💛 - Coveralls

ivadomed/training.py (outdated review thread, resolved)
@@ -25,8 +27,9 @@
cudnn.benchmark = True


def train(model_params, dataset_train, dataset_val, training_params, path_output, device,
cuda_available=True, metric_fns=None, n_gif=0, resume_training=False, debugging=False):
def train(rank, model_params, dataset_train, dataset_val, training_params, path_output, device,
Member


See comment below for the rank (int) comment.

@dyt811 (Member) commented Nov 29, 2021

@entiri is already aware of this but just posting as a general update for others who are curious.

This PR still needs a bit of work in terms of getting the best_* values back from the multi-process spawning. Currently, tutorial testing crashes at the end when trying to report the best timing. These are now known issues (even though the trained model is still valid and can be used for testing). Once the mp.process adoption is completed by Edward, these issues should become clearer.

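For reference, a common way to get best_* values back from spawned workers is to have rank 0 write them into a structure shared with the parent process, for example a multiprocessing Manager dict. A minimal sketch under that assumption (names and placeholder values are illustrative, not the PR's actual code):

# Illustrative: returning best_* metrics from spawned DDP workers (a sketch).
import torch.multiprocessing as mp

def train_worker(rank, world_size, return_dict):
    # ... DDP training loop producing the best metrics ...
    best_metrics = {"best_validation_dice": 0.0, "best_validation_loss": 0.0}  # placeholders
    if rank == 0:
        # Only rank 0 reports; the Manager dict is visible to the parent process.
        return_dict.update(best_metrics)

def launch(world_size):
    manager = mp.Manager()
    return_dict = manager.dict()
    mp.spawn(train_worker, args=(world_size, return_dict), nprocs=world_size)
    return dict(return_dict)

The keys mirror the BEST_* constants added in this PR; the parent process can then report the best timing and metrics after mp.spawn returns.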

ivadomed/main.py (outdated review thread, resolved)
ivadomed/training.py (outdated review thread, resolved)
@dyt811 marked this pull request as ready for review October 31, 2022 03:18
@dyt811 self-assigned this Oct 31, 2022
@dyt811 marked this pull request as draft October 31, 2022 15:17
@dyt811 marked this pull request as ready for review October 31, 2022 16:11
@hermancollin self-requested a review October 31, 2022 18:30
@mariehbourget (Member) left a comment


Thanks @entiri for looking into this!

I did a first-pass review and ran into some questions/issues with training and testing.
I'm not at all familiar with DDP training, so some questions may be a bit naïve at this point.

  1. I tried training on rosenberg and the training on multiple GPUs seemed to work well.
    However, I was wondering what the relation is between the config parameter gpu_ids and the DDP process. I was under the impression that I could control which GPUs would be used, but found that the job was spawned on all available GPUs regardless of the gpu_ids values in the config file.

  2. Similarly, there is a line in the terminal output that says which GPU is used, and it remains unchanged when DDP is used:
    INFO | ivadomed.utils:define_device:166 - Using GPU ID 0

  3. I noticed that the log file in the path_output does not contain the progress of the training and stops logging after this line:
    ivadomed.main:run_command:472 - Spawning workers

  4. I was unable to use the --test command with my DDP-trained model; it failed with the following error:

2022-11-14 10:06:48.603 | INFO     | ivadomed.testing:test:52 - Loading model: ../data_extrassd_maboudb/test_pr990/spineGeneric_01/best_model.pt
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/bin/ivadomed", line 33, in <module>
    sys.exit(load_entry_point('ivadomed', 'console_scripts', 'ivadomed')())
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 677, in run_main
    run_command(context=context,
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 581, in run_command
    pred_metrics = imed_testing.test(model_params=model_params,
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/testing.py", line 55, in test
    model.cuda()
AttributeError: 'collections.OrderedDict' object has no attribute 'cuda'
  5. Similarly, I was unable to use the --segment command; it failed with the following error (a possible fix for items 4 and 5 is sketched after this comment):
2022-11-14 10:09:43.781 | INFO     | ivadomed.utils:define_device:166 - Using GPU ID 0
2022-11-14 10:09:43.783 | DEBUG    | ivadomed.inference:get_preds:71 - PyTorch model detected at: ../data_extrassd_maboudb/test_pr990/spineGeneric_01/my_model/my_model.pt
2022-11-14 10:09:43.783 | DEBUG    | ivadomed.inference:get_preds:72 - Loading model from: ../data_extrassd_maboudb/test_pr990/spineGeneric_01/my_model/my_model.pt
Traceback (most recent call last):
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/bin/ivadomed", line 33, in <module>
    sys.exit(load_entry_point('ivadomed', 'console_scripts', 'ivadomed')())
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 677, in run_main
    run_command(context=context,
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 397, in run_command
    run_segment_command(context, model_params, no_patch, overlap_2d)
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/main.py", line 314, in run_segment_command
    pred_list, target_list = imed_inference.segment_volume(str(path_model),
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/inference.py", line 496, in segment_volume
    preds = get_preds(context, fname_model, model_params, gpu_id, batch)
  File "/home/GRAMES.POLYMTL.CA/maboudb/ivadomed/ivadomed/inference.py", line 73, in get_preds
    model = torch.load(fname_model, map_location=device)
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/serialization.py", line 712, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/serialization.py", line 1046, in _load
    result = unpickler.load()
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/nn/parallel/distributed.py", line 750, in __setstate__
    self.process_group = _get_default_group()
  File "/home/GRAMES.POLYMTL.CA/maboudb/venv-ivadomed-297/lib/python3.10/site-packages/torch/distributed/distributed_c10d.py", line 429, in _get_default_group
    raise RuntimeError(
RuntimeError: Default process group has not been initialized, please make sure to call init_process_group.

For reference, I followed the One-class segmentation with 2D U-Net tutorial for training and testing.
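Regarding errors 4 and 5 above: the --test traceback ('collections.OrderedDict' object has no attribute 'cuda') suggests a bare state_dict was saved where a full model object is expected, while the --segment traceback suggests the DDP wrapper itself was pickled, which cannot be un-pickled without an initialized process group. A common pattern that avoids both is to save the underlying module from rank 0 only. A minimal sketch under that assumption (illustrative, not ivadomed's saving code):

# Illustrative checkpoint saving under DDP (a sketch, not ivadomed's code).
import torch
import torch.distributed as dist

def save_checkpoint(ddp_model, path):
    if dist.get_rank() == 0:
        # Unwrap the DDP container so the checkpoint can later be loaded
        # by --test/--segment without any process group being initialized.
        torch.save(ddp_model.module, path)
    dist.barrier()  # keep all ranks in sync before training continues

Equivalently, one could save ddp_model.module.state_dict() and rebuild the model before load_state_dict() at inference time; either way, the object written to best_model.pt should not be the DDP wrapper.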


@dataclass(frozen=True)
class MultiGPUsKW:
# Address/Port information for DDP inter process communicaitons.
Member


Suggested change
# Address/Port information for DDP inter process communicaitons.
# Address/Port information for DDP inter process communications.

BEST_TRAINING_DICE: str = "best_training_dice"
BEST_TRAINING_LOSS: str = "best_training_loss"
BEST_VALIDATION_DICE: str = "best_validation_dice"
BEST_VALIDATION_LOSS: str = "best_validation_loss"
Member


Suggested change
BEST_VALIDATION_LOSS: str = "best_validation_loss"
BEST_VALIDATION_LOSS: str = "best_validation_loss"

Comment on lines +39 to +41
rank (int): the rank of the training function as a process. Default value is set to -1,
indicating that Distributed Parallel Processing will not be used.
rank == 0 means single GPU usually
Member


Suggested change
rank (int): the rank of the training function as a process. Default value is set to -1,
indicating that Distributed Parallel Processing will not be used.
rank == 0 means single GPU usually
rank (int): the rank of the training function as a process. Default value is set to -1,
indicating that Distributed Parallel Processing will not be used.
rank == 0 means single GPU usually.

Comment on lines +59 to +60
n_process (int): the total number of processes that will be used to run train. Default is set to 1,
indicating that no other processes will be run in parallel.
Member


Suggested change
n_process (int): the total number of processes that will be used to run train. Default is set to 1,
indicating that no other processes will be run in parallel.
n_process (int): the total number of processes that will be used to run train. Default is set to 1,
indicating that no other processes will be run in parallel.

ddp_setup_detected: bool = (rank != -1 or torch.cuda.device_count() > 1)

# Enable wandb tracking if the required params are found in the config file and the api key is correct
# Also disable in DDP scenarios.
Member


Naïve question: why is wandb deactivated with DDP?
AFAIK, wandb is the primary tool used for tracking training in ivadomed; is it possible to have it with DDP as well?
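For context, a common way to keep wandb tracking under DDP is to initialize and log from rank 0 only, so a single process reports metrics for the whole run. A minimal sketch of that pattern (illustrative, not ivadomed's code; the rank-in-(0, -1) convention follows this PR's "rank == -1 means no DDP" docstring, and the project name is a placeholder):

# Illustrative rank-0-only wandb tracking under DDP (a sketch, not ivadomed's code).
import wandb

def maybe_init_wandb(rank: int, config: dict):
    # Only rank 0 (or a non-DDP run, rank == -1) talks to wandb.
    if rank in (0, -1):
        wandb.init(project=config.get("project_name", "ivadomed-ddp"), config=config)

def maybe_log(rank: int, metrics: dict, step: int):
    if rank in (0, -1):
        wandb.log(metrics, step=step)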

Comment on lines +117 to +128
if ddp_setup_detected:
    sampler_train = DistributedSampler(dataset=dataset_train, rank=rank, num_replicas=n_process)
    train_loader = DataLoader(dataset_train, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                              shuffle=False, pin_memory=True, sampler=sampler_train,
                              collate_fn=imed_loader_utils.imed_collate,
                              num_workers=0)
elif no_ddp_setup_detected:
    sampler_train, shuffle_train = get_sampler(dataset_train, conditions, training_params[TrainingParamsKW.BALANCE_SAMPLES][BalanceSamplesKW.TYPE])
    train_loader = DataLoader(dataset_train, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                              shuffle=shuffle_train, pin_memory=True, sampler=sampler_train,
                              collate_fn=imed_loader_utils.imed_collate,
                              num_workers=0)
Member


Here the behavior with DDP is not the same as without DDP.

Without DDP: we call the get_sampler function that may or may not use the BalancedSampler depending on a config parameter here.

Shouldn't we have the same principle with DDP?
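If true distributed balanced sampling is out of scope for this PR, one option that at least keeps the choice config-driven (and removes the duplicated train/validation loader code) is to branch on the balance setting inside a single helper and warn when it cannot be honored under DDP. A rough sketch; get_sampler, conditions, and the collate function stand in for the existing ivadomed helpers:

# Sketch of a single, config-aware loader builder for DDP and non-DDP runs
# (illustrative; get_sampler/conditions/collate_fn are the existing helpers).
from loguru import logger
from torch.utils.data import DataLoader
from torch.utils.data.distributed import DistributedSampler

def build_loader(dataset, batch_size, collate_fn, balance_type, conditions,
                 ddp, rank, n_process, get_sampler):
    if ddp:
        if balance_type:
            # A plain DistributedSampler cannot reproduce balanced sampling;
            # warn rather than silently changing behavior.
            logger.warning("balance_samples is currently ignored when DDP is enabled.")
        sampler, shuffle = DistributedSampler(dataset, rank=rank,
                                              num_replicas=n_process), False
    else:
        sampler, shuffle = get_sampler(dataset, conditions, balance_type)
    return DataLoader(dataset, batch_size=batch_size, shuffle=shuffle,
                      pin_memory=True, sampler=sampler,
                      collate_fn=collate_fn, num_workers=0)

The same helper could then serve both the train and validation loaders, which would also address the "Same as above" comment below.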

Comment on lines +133 to +146
if ddp_setup_detected:
    sampler_val = DistributedSampler(dataset=dataset_val, rank=rank, num_replicas=n_process)
    val_loader = DataLoader(dataset_val, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                            shuffle=False, pin_memory=True, sampler=sampler_val,
                            collate_fn=imed_loader_utils.imed_collate,
                            num_workers=0)
else:
    sampler_val, shuffle_val = get_sampler(dataset_val, conditions,
                                           training_params[TrainingParamsKW.BALANCE_SAMPLES][
                                               BalanceSamplesKW.TYPE])
    val_loader = DataLoader(dataset_val, batch_size=training_params[TrainingParamsKW.BATCH_SIZE],
                            shuffle=shuffle_val, pin_memory=True, sampler=sampler_val,
                            collate_fn=imed_loader_utils.imed_collate,
                            num_workers=0)
Member


Same as above

@hermancollin (Contributor)

I trained a model for 10 epochs with and without the feature to compare the results. On bireli with 2 GPUs, time reported a wall-clock time of 3 min 38 s. On the master branch, using only 1 GPU, we get a 6 min 2 s wall-clock time, which amounts to an impressive ~40% decrease in training time when using 2 GPUs. @mariehbourget, was there a noticeable change in training time when using 8 GPUs?

  1. I tried training on rosenberg and the training on multiple GPUs seemed to work well.
    However, I was wondering what the relation is between the config parameter gpu_ids and the DDP process. I was under the impression that I could control which GPUs would be used, but found that the job was spawned on all available GPUs regardless of the gpu_ids values in the config file.

Yes, this is something I was worried about too. While it is a good thing performance-wise to use all available GPUs, we sometimes have to share the cluster with someone else who is using other cores, so it would be great if we could control which GPU IDs can be used.

So, training-wise, I think the feature works, but aside from the problems @mariehbourget encountered, maybe the user should have more control over which GPUs are used (i.e. some config keys?). If it isn't possible to specify certain GPU IDs to torch, the user should at least be able to control whether they want to use one specific GPU or all of them, for the sake of sharing the cluster with other students. I'm guessing we can control how many GPUs to use with the API and the rank parameter, but this could be more user-friendly.
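For reference, one way to honor a gpu_ids-style config key together with DDP is to restrict the visible devices before any CUDA call and then spawn one process per selected GPU; the selected devices then appear to PyTorch as devices 0..N-1, so ranks map onto them directly. A minimal sketch (illustrative; gpu_ids stands in for the existing config parameter and train_worker for the spawned training function):

# Illustrative: honoring a gpu_ids-like config entry with DDP (a sketch).
import os
import torch.multiprocessing as mp

def launch(train_worker, gpu_ids):
    # Must be set before any CUDA initialization; the selected GPUs are then
    # renumbered 0..len(gpu_ids)-1 from PyTorch's point of view.
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(str(i) for i in gpu_ids)
    n_proc = len(gpu_ids)
    if n_proc > 1:
        mp.spawn(train_worker, args=(n_proc,), nprocs=n_proc)
    else:
        train_worker(0, n_proc)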

@mariehbourget (Member)

@mariehbourget, was there a noticeable change in training time when using 8 GPUs?

Yes! Sorry I forgot to mention it.
For a 100-epoch training with the tutorial config, with and without the feature, on rosenberg, I had:

  • 8min48s with 1 GPU
  • 2min01s with 8 GPUs

Labels: None yet
Projects: None yet
7 participants