
An example of using pytorch distributed data parallel on 1 machine with arbitrary multiple GPUs. #155

Merged 27 commits into optuna:main on Feb 24, 2023

Conversation

li-li-github (Contributor)

Motivation

The existing example pytorch_distributed_simple.py is designed to be used with MPI, which is not trivial to set up. There is no example for training a model on arbitrary multiple GPUs on a single machine without MPI. This example is meant to be flexible enough to also apply to pytorch-lightning in cases where the existing pytorch_lightning_ddp.py may not work.

Description of the changes

This example uses torch.distributed.init_process_group and torch.distributed.destroy_process_group, together with torch.multiprocessing, to distribute the jobs that MPI handles in pytorch_distributed_simple.py.
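In rough outline, the structure looks like the sketch below (illustrative only: the gloo backend and the localhost/port rendezvous are assumptions here, and the actual example contains the full model, dataloaders, and Optuna study).

import os

import torch.distributed as dist
import torch.multiprocessing as mp


def run_optimize(rank, world_size):
    # Each spawned worker joins the same process group instead of being
    # launched by mpirun; the default env:// rendezvous reads these variables.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # ... build the model, wrap it in DistributedDataParallel, and run the
    # Optuna study (rank 0 drives study.optimize) ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # number of worker processes (e.g. one per selected GPU)
    mp.spawn(run_optimize, args=(world_size,), nprocs=world_size, join=True)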

github-actions bot commented Jan 1, 2023

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 1, 2023
@toshihikoyanase toshihikoyanase self-assigned this Jan 11, 2023
toshihikoyanase (Member) left a comment:

I'm sorry for the delayed response. I think the proposed example is useful, since it enables users to optimize hyperparameters more casually than with pytorch_distributed_simple.py.
Let me share my first comments.

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Jan 11, 2023


def objective(single_trial, device_id):
    device = torch.device("cuda", device_id)
Member:

Do we need this variable? I thought rank was enough when I implemented a similar example:
https://gist.github.com/nzw0301/9ba3c837a02539e8bde194f4f2c9d0e5

Contributor Author:

The code still runs without converting device_id to device, but I'm not sure how your example gets the variable rank in

def objective(single_trial):
    trial = optuna.integration.TorchDistributedTrial(single_trial, rank)

I can modify the code so that it only uses device_id, but I think objective still needs an argument for device_id unless I'm missing something.

Member:

Hmm, indeed. Thank you for your comment. As you did, a partial or lambda might be necessary.
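For instance, something along these lines (a sketch with a placeholder objective body; only the binding pattern is the point):

from functools import partial

import optuna


def objective(single_trial, device_id):
    # Placeholder body; the real objective builds and trains the DDP model on device_id.
    x = single_trial.suggest_float("x", -1.0, 1.0)
    return x


device_id = 0
study = optuna.create_study(direction="maximize")
# partial binds device_id so that study.optimize still receives a one-argument callable.
study.optimize(partial(objective, device_id=device_id), n_trials=1)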

Member:

def objective(single_trial):
    trial = optuna.integration.TorchDistributedTrial(single_trial, rank)

The new TorchDistributedTrial in optuna v3.1.0 will not require rank or device; it creates a process group with the gloo backend solely for Optuna's communication.
So we may be able to update this part after the release of optuna v3.1.0:
https://github.com/optuna/optuna/blob/25a135c26662e5e26f88d7266eb2922703bf72dd/optuna/integration/pytorch_distributed.py#L119
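After that release, the objective could look roughly like this (a sketch; the training code is elided):

import optuna
import torch


def objective(single_trial, device_id):
    # With optuna>=3.1.0, TorchDistributedTrial needs neither a rank nor a device
    # argument; it sets up its own gloo group for Optuna's communication.
    trial = optuna.integration.TorchDistributedTrial(single_trial)
    device = torch.device("cuda", device_id)
    # ... build and train the DDP model on `device`, using `trial` for
    # suggestions, reporting, and pruning ...
    ...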

Contributor Author:

Thank you @toshihikoyanase. Indeed, passing the rank argument yields an error with optuna v3.1.0, but without this argument, optuna < 3.1.0 raises an error (i.e., it is backward incompatible). I can push a commit to remove the rank argument, perhaps with a comment at the beginning of the code to let users know that this example only works with optuna >= 3.1.0.
Also, do we need to pin the optuna version in requirements.txt?

Comment on lines 34 to 35
# Arbitrary GPUs can be selected
DEVICE_IDS = [0, 1]
nzw0301 (Member) commented Jan 13, 2023:

Due to this part, the example does not actually work with arbitrary multiple GPUs. In my PyTorch experience, a torchrun-style script (e.g. https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py) would be more practical, since it doesn't require editing the script at all and the number of GPUs can be specified from the command line.

Contributor Author:

By arbitrary multiple GPUs, I mean that you can specify exactly which GPUs to use. I understand that in a large-scale setting with multiple nodes, each with multiple GPUs, using MPI (and, I guess, torchrun as well, since it seems to require MPI) would definitely be better practice.

This example, however, was written so that you don't need to install any MPI packages. Say I have a local machine with 4 GPUs and I want to run an experiment on GPUs 1 and 3, because GPU 0 is dedicated to graphics and GPU 2 is running other programs. I can just run the code as a normal Python script: python pytorch/pytorch_ddp_1machine.py.

I can make DEVICE_IDS a command-line argument so that users can specify exactly which GPUs to use.

nzw0301 (Member) commented Jan 15, 2023:

Thanks, I get the point. I should have discussed the necessity of this PR earlier, because #150 also addresses the same issue. (I suppose #150 still needs some updates to determine the rank, since it uses an MPI variable.)

In terms of the number of GPUs, we can manage it from the command line; for example, https://discuss.pytorch.org/t/how-to-change-the-default-device-of-gpu-device-ids-0/1041/2.

Contributor Author:

In terms of the number of GPUs, we can manage it from the command line; for example, https://discuss.pytorch.org/t/how-to-change-the-default-device-of-gpu-device-ids-0/1041/2.

Do you mean passing the environment variables in the command line like this?
CUDA_VISIBLE_DEVICES=1,2 python myscript.py

I was told that changing environment variables is generally not good practice, so I tried to avoid it. Also, with the above example, I guess you would still need to configure a default value in the Python code.
In the updated commit, I added a --device_ids argument to specify the device ids, with a default value of 0. Let me know if you like it or what you have in mind.
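Roughly what I have in mind (a sketch; the flag name matches the commit, the rest is illustrative):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device_ids",
    type=int,
    nargs="+",
    default=[0],
    help="GPU device ids to use, e.g. --device_ids 1 3",
)
args = parser.parse_args()
device_ids = args.device_ids  # later handed to mp.spawn / run_optimize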

nzw0301 (Member) commented Jan 15, 2023:

Do you mean ...

Yes!

I was told that changing environment variables is generally not good practice

Could you share any reference, ideally a PyTorch page? Thanks!

Contributor Author:

Could you share any reference, ideally a PyTorch page? Thanks!

I can't find a good reference, but from some googling I feel the idea comes more from the production or security side.
To summarize, my understanding is that:

Again, I was just told this by a colleague and I don't have any real experience with it myself. Apparently there are also many cases where environment variables should actively be used, so I don't have a strong opinion about this.
I just thought that if there was a less risky way, I would pick it.

So, if you think using an environment variable is better, I can change the code accordingly.

Comment on lines 187 to 190
storage="sqlite:///example.db",
direction="maximize",
study_name="pytorch_ddp",
load_if_exists=True,
nzw0301 (Member) commented Jan 13, 2023:

Do we need sqlite for this example? Personally, I think the example without a database would be simpler and easier for users to run, thanks to fewer dependencies.

Contributor Author:

I think I just followed the example of pytorch_lightning_ddp.py. I see your point, but I guess the most likely next step after getting the example to run is to find a way to log the tuning history, because in most cases one would want to check it. I don't mind removing this part, but I also feel it offers some convenience to users. We could make it optional with an argument, for example.

This should probably be a separate topic, but since you mentioned the database, I wonder whether optuna can avoid repeating the same (or very similar) parameters that it tried in the past without saving them to a database. I'm asking because it seems to me that optuna picks new parameters by referencing the database (or the tuning history), especially when working with multiple GPUs.

Member:

pytorch_lightning_ddp.py

I've reviewed that code, so I know a little of the background of that example and integration module. Due to technical difficulties, we would only support distributed training with a database; that is why the example code uses SQL. So I don't think we need to follow pytorch_lightning_ddp.py here. Another drawback of using SQL storage is its speed. Using SQL is widely documented in Optuna, so I believe Optuna users can extend an example without database storage to one with database storage.

Repeated same/similar params

As an optuna feature, there isn't one, but I think we can implement something similar: optuna/optuna#1001 (comment) or optuna/optuna#2021 (comment).
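A rough sketch of that idea (not an existing optuna API; the duplicate check is done by hand inside the objective):

import optuna


def objective(trial):
    params = {"lr": trial.suggest_float("lr", 1e-5, 1e-1, log=True)}
    # Reuse the result of any completed trial that already tried these parameters.
    for past in trial.study.get_trials(deepcopy=False):
        if (
            past.number != trial.number
            and past.state == optuna.trial.TrialState.COMPLETE
            and past.params == params
        ):
            return past.value
    # Otherwise train normally; the placeholder below just returns something.
    return params["lr"]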

Contributor Author:

we would only support distributed training with a database

Do you mean that a database is only supported in distributed training but not in other cases? If so, how should users save the tuning history?

Another drawback of using SQL storage is its speed.

What would be a better way to store the tuning history, then? I thought optuna only supported SQL up to the latest version, 3.1.0.
I'd also like to know whether there are alternative storage options.

As an optuna feature, there isn't one,

Do you mean that optuna doesn't have a feature to check the parameters it tried in the past?
I think I got my answer. When I run my code without specifying storage, it prints out
A new study created in memory with name: no-name-eefeae74-8d1d-4613-9b57-b18177f10da3
So it seems optuna stores the tuning history in memory instead of SQL, and I guess multiple trials can still refer to this in-memory storage. But I get the following error at the end of tuning if I don't specify storage.

Study statistics: 
  Number of finished trials:  0
  Number of pruned trials:  0
  Number of complete trials:  0
Best trial:
Traceback (most recent call last):
  File ".../optuna-examples/pytorch/pytorch_ddp_1machine.py", line 219, in <module>
    trial = study.best_trial
  File ".../anaconda3/envs/optuna/lib/python3.9/site-packages/optuna/study/study.py", line 159, in best_trial
    return copy.deepcopy(self._storage.get_best_trial(self._study_id))
  File ".../anaconda3/envs/optuna/lib/python3.9/site-packages/optuna/storages/_in_memory.py", line 262, in get_best_trial
    raise ValueError("No trials are completed yet.")
ValueError: No trials are completed yet.

So I guess, since I'm using distributed training, I still need SQL storage anyway?
I have no problem finishing the entire tuning with pytorch_simple.py, in which no storage is given.

nzw0301 (Member) commented Jan 15, 2023:

Do you mean that a database is only supported in distributed training but not in other cases? If so, how should users save the tuning history?

I mean that PyTorch Lightning with distributed data parallel training works only with a database. Optuna's docs say:

For the distributed data parallel training, the version of PyTorchLightning needs to be higher than or equal to v1.5.0. In addition, Study should be instantiated with RDB storage.

Reference: https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.PyTorchLightningPruningCallback.html


What would be a better way to store the tuning history, then?

I suppose saving the study object would be a simple way to save the optimisation history: https://optuna.readthedocs.io/en/stable/faq.html#how-can-i-save-and-resume-studies
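For example, along the lines of the FAQ (a minimal sketch using joblib; a plain pickle works similarly):

import joblib
import optuna

study = optuna.create_study(direction="maximize")
study.optimize(lambda t: t.suggest_float("x", -1.0, 1.0), n_trials=5)
joblib.dump(study, "study.pkl")

# Later: resume from the saved object.
study = joblib.load("study.pkl")
print(study.best_value)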

Another way is to use optuna's callback feature, e.g. the MLflow or wandb integrations.


Do you mean that optuna doesn't have a feature to check the parameters it tried in the past?

Yes.


So I guess, since I'm using distributed training, I still need SQL storage anyway?

I don't think so, because another distributed training example, https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py, doesn't use a database.

nzw0301 (Member) commented Jan 15, 2023:

Sorry, my last answer might be wrong. When we used spawn for DDP training with pytorch-lightning, we actually had a similar issue. So, as you said, spawn-based DDP might require database storage...

Contributor Author:

I suppose saving the study object would be a simple way to save the optimisation history

Thank you for the information. Good to know that a study can be saved as a pickle.

spawn-based DDP might require database storage

I see, thanks for the background information.

nzw0301 (Member) commented Jan 15, 2023

@li-li-github cc: @toshihikoyanase
Thank you for your contributions and comments. Let me summarise my thoughts on this PR. I believe #150 would support the nccl backend, which enables multi-GPU environments. This PR is also motivated by the nccl backend (i.e. GPUs), but it uses spawn inside the script rather than mpirun or torchrun from the CLI, which https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py assumes.

When we use spawn, I suppose we might require a database to store the study, as with PyTorchLightningPruningCallback, because child processes report intermediate values. Thus, as a TorchDistributedTrial-with-spawn example (not necessarily multiple GPUs, but that can be supported), this PR would be helpful. So I would like to suggest supporting other backends if possible, because the Optuna CI doesn't have GPUs.

li-li-github (Contributor Author)

Thank you for the summary.
I modified the code so that it uses the CPU and the gloo backend by default. Let me know what you think.
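Roughly, the default now behaves like this sketch (the function and flag names here are illustrative, not necessarily the ones in the script):

import torch
import torch.distributed as dist


def setup_process_group(rank, world_size, use_cuda=False):
    # gloo on CPU by default; nccl only when GPUs are explicitly requested.
    # MASTER_ADDR / MASTER_PORT are assumed to be set by the caller.
    backend = "nccl" if use_cuda and torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)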

nzw0301 (Member) commented Jan 16, 2023

Thanks! Before reviewing the updates, I'd like to hear @toshihikoyanase's opinion. What do you think about #155 (comment)?

toshihikoyanase (Member)

@nzw0301 @li-li-github Thank you for your insightful discussion.

but it uses spawn inside the script rather than mpirun or torchrun from the CLI, which https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py assumes.

The official PyTorch DDP tutorial also employs spawn, and it seems reasonable as a minimal setup. As @li-li-github said in the description, MPI setup is not a trivial task, and I guess more users will be able to try TorchDistributedTrial.

When we use spawn, I suppose we might require a database to store the study, as with PyTorchLightningPruningCallback, because child processes report intermediate values.

Hmm... In my understanding, InMemoryStorage can work with spawn, since only the rank-0 process accesses the storage (including writing/reading intermediate values) and the other processes receive broadcast data from rank 0. I confirmed that the example worked with InMemoryStorage except for https://github.com/li-li-github/optuna-examples/blob/926a86faa5114076d77299520d7731f02fcd49bb/pytorch/pytorch_ddp_1machine.py#L224-L239: the parent process could not refer to the storage in the rank-0 process, and the optimization result was lost.
So, I think it is reasonable to recommend RDBStorage instead of InMemoryStorage in this example.

Thus, as a TorchDistributedTrial-with-spawn example (not necessarily multiple GPUs, but that can be supported), this PR would be helpful. So I would like to suggest supporting other backends if possible, because the Optuna CI doesn't have GPUs.

I agree. @li-li-github kindly added support for gloo, and I confirmed that it worked with four processes.

In summary, I'd like to proceed with the review.

nzw0301 (Member) commented Jan 17, 2023

@toshihikoyanase Thank you for your comments and for checking the InMemoryStorage case! I totally agree with you.

li-li-github (Contributor Author)

Thank you @toshihikoyanase for the detailed information.

InMemoryStorage can work with spawn since only the rank-0 process ...

I see. It seems that the study must be created only under the condition if rank == 0:

I can modify the code so that the study is created under rank == 0, and use torch.multiprocessing.Manager to retrieve the study after mp.spawn, if you think that would make it a better example.

In run_optimize

def run_optimize(rank, world_size, device_ids, return_dict):
    ...

    if rank == 0:
        study = optuna.create_study(
            # storage="sqlite:///example.db",
            direction="maximize",
            study_name="pytorch_ddp",
            # load_if_exists=True,
        )
        study.optimize(
            partial(objective, device_id=device_ids[rank]),
            n_trials=N_TRIALS,
            timeout=300,
        )
        return_dict['study'] = study
        
    ...

Before & after mp.spawn

    manager = mp.Manager()
    return_dict = manager.dict()
    mp.spawn(
        run_optimize,
        args=(world_size, device_ids, return_dict),
        nprocs=world_size,
        join=True,
    )
    study = return_dict['study']

li-li-github (Contributor Author)

I just pushed the changes I mentioned above. Please let me know what you guys think! @toshihikoyanase @nzw0301

github-actions bot

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 29, 2023
toshihikoyanase (Member)

Since this PR adds a new example file, we need to update the CI config to test it.
Could you add a new step to test this example on CPU, next to Run multi-node examples?

- name: Run multi-node examples
  run: |
    export OMPI_MCA_rmaps_base_oversubscribe=yes
    STORAGE_URL=sqlite:///example.db
    mpirun -n 2 -- python pytorch/pytorch_distributed_simple.py
  env:
    OMP_NUM_THREADS: 1

toshihikoyanase (Member)

CI failed due to network errors. I'll retry the failed jobs after a while.

nzw0301 (Member) commented Feb 6, 2023

How about renaming the script to something like pytorch_distributed_spawn.py? I think spawn works with the mpi backend without GPUs, so the example can support a wider range of use cases.

toshihikoyanase (Member)

All CI jobs passed. The failure was due to GitHub Actions, not the example code.

li-li-github (Contributor Author)

Could you add a new step to test this example on CPU, next to Run multi-node examples?

I have added a test step next to it.

How about renaming the script to something like pytorch_distributed_spawn.py?

The file name has been changed.

Thank you @nzw0301 and @toshihikoyanase for the suggestions above. I also added a --master-port argument to avoid port-number conflicts in case the user wants to run more than one instance of this script.
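For reference, a sketch of how such a flag typically feeds into the rendezvous (the flag name is from this PR; the wiring shown is illustrative):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--master-port", type=int, default=12355)
args = parser.parse_args()

# The default env:// rendezvous of init_process_group reads these variables,
# so distinct ports let several instances of the script run side by side.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(args.master_port)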

toshihikoyanase (Member)

Thank you for your update. The example was tested in CI jobs as expected.

toshihikoyanase (Member) left a comment:

The change almost looks good to me. I have two minor comments.

@toshihikoyanase toshihikoyanase added the feature Change that does not break compatibility, but affects the public interfaces of examples. label Feb 9, 2023
li-li-github and others added 5 commits February 9, 2023 00:36
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
@li-li-github
Copy link
Contributor Author

Thank you again for the suggestions. I have accepted all of them.

github-actions bot

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Feb 16, 2023
toshihikoyanase (Member)

I'm sorry for the delayed response. I'll approve this PR after the CI jobs pass.

li-li-github (Contributor Author)

Much appreciated!

toshihikoyanase (Member) commented Feb 24, 2023

@li-li-github Sorry for bothering you, but a CI job failed due to an update of the pytorch-lightning integration. #172 resolved this problem, so could you rebase onto (or merge) master, please?

AttributeError: 'Trainer' object has no attribute 'strategy'

li-li-github (Contributor Author)

All tests passed after I merged the main branch.

toshihikoyanase (Member) left a comment:

Thank you for your update! LGTM!
I deeply appreciate your contribution.

@toshihikoyanase toshihikoyanase added this to the v3.2.0 milestone Feb 24, 2023
@toshihikoyanase toshihikoyanase merged commit f72383c into optuna:main Feb 24, 2023