
An example of using pytorch distributed data parallel on 1 machine with arbitrary multiple GPUs. #155

Merged 27 commits into optuna:main on Feb 24, 2023

Conversation

li-li-github (Contributor)

Motivation

The existing example pytorch_distributed_simple.py is designed to be used with MPI, which is not trivial to set up. There is no example for training a model on arbitrary multiple GPUs on a single machine without MPI. This example is meant to be flexible enough to also apply to pytorch-lightning in cases where the existing pytorch_lightning_ddp.py may not work.

Description of the changes

This example uses torch.distributed.init_process_group and torch.distributed.destroy_process_group, together with torch.multiprocessing, to distribute the jobs that MPI handles in pytorch_distributed_simple.py.
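In rough outline, the structure looks like the sketch below (illustrative only: the gloo backend and the localhost/port rendezvous are assumptions here, and the actual example contains the full model, dataloaders, and Optuna study).

import os

import torch.distributed as dist
import torch.multiprocessing as mp


def run_optimize(rank, world_size):
    # Each spawned worker joins the same process group instead of being
    # launched by mpirun; the default env:// rendezvous reads these variables.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "12355")
    dist.init_process_group(backend="gloo", rank=rank, world_size=world_size)

    # ... build the model, wrap it in DistributedDataParallel, and run the
    # Optuna study (rank 0 drives study.optimize) ...

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = 2  # number of worker processes (e.g. one per selected GPU)
    mp.spawn(run_optimize, args=(world_size,), nprocs=world_size, join=True)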

github-actions bot commented Jan 1, 2023

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 1, 2023
@toshihikoyanase toshihikoyanase self-assigned this Jan 11, 2023
toshihikoyanase (Member) left a comment:

I'm sorry for the delayed response. I think the proposed example is useful, since it enables users to optimize hyperparameters more casually than with pytorch_distributed_simple.py.
Let me share my first comments.

@github-actions github-actions bot removed the stale Exempt from stale bot labeling. label Jan 11, 2023


def objective(single_trial, device_id):
    device = torch.device("cuda", device_id)
Member:

Do we need this variable? I thought rank was enough when I implemented a similar example:
https://gist.github.com/nzw0301/9ba3c837a02539e8bde194f4f2c9d0e5

Contributor Author:

The code still runs without converting device_id to device, but I'm not sure how your example gets the variable rank in

def objective(single_trial):
    trial = optuna.integration.TorchDistributedTrial(single_trial, rank)

I can modify the code so that it only uses device_id, but I think objective still needs an argument for device_id unless I'm missing something.

Member:

Hmm, indeed. Thank you for your comment. As you did, a partial or lambda might be necessary.
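For instance, something along these lines (a sketch with a placeholder objective body; only the binding pattern is the point):

from functools import partial

import optuna


def objective(single_trial, device_id):
    # Placeholder body; the real objective builds and trains the DDP model on device_id.
    x = single_trial.suggest_float("x", -1.0, 1.0)
    return x


device_id = 0
study = optuna.create_study(direction="maximize")
# partial binds device_id so that study.optimize still receives a one-argument callable.
study.optimize(partial(objective, device_id=device_id), n_trials=1)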

Member:

def objective(single_trial):
    trial = optuna.integration.TorchDistributedTrial(single_trial, rank)

The new TorchDistributedTrial in optuna v3.1.0 will not require rank or device; it creates a process group with the gloo backend solely for Optuna's communication.
So we may be able to update this part after the release of optuna v3.1.0:
https://github.com/optuna/optuna/blob/25a135c26662e5e26f88d7266eb2922703bf72dd/optuna/integration/pytorch_distributed.py#L119
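After that release, the objective could look roughly like this (a sketch; the training code is elided):

import optuna
import torch


def objective(single_trial, device_id):
    # With optuna>=3.1.0, TorchDistributedTrial needs neither a rank nor a device
    # argument; it sets up its own gloo group for Optuna's communication.
    trial = optuna.integration.TorchDistributedTrial(single_trial)
    device = torch.device("cuda", device_id)
    # ... build and train the DDP model on `device`, using `trial` for
    # suggestions, reporting, and pruning ...
    ...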

Contributor Author:

Thank you @toshihikoyanase. Indeed, passing the rank argument yields an error with optuna v3.1.0, but without this argument, optuna < 3.1.0 raises an error (i.e., it is backward incompatible). I can push a commit to remove the rank argument, perhaps with a comment at the beginning of the code to let users know that this example only works with optuna >= 3.1.0.
Also, do we need to pin the optuna version in requirements.txt?

Comment on lines 34 to 35
# Arbitrary GPUs can be selected
DEVICE_IDS = [0, 1]
nzw0301 (Member) commented Jan 13, 2023:

Due to this part, the example does not actually work with arbitrary multiple GPUs. In my PyTorch experience, a torchrun-style script (e.g. https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py) would be more practical, since it doesn't require editing the script at all and the number of GPUs can be specified from the command line.

Contributor Author:

By arbitrary multiple GPUs, I mean that you can specify exactly which GPUs to use. I understand that in a large-scale setting with multiple nodes, each with multiple GPUs, using MPI (and, I guess, torchrun as well, since it seems to require MPI) would definitely be better practice.

This example, however, was written so that you don't need to install any MPI packages. Say I have a local machine with 4 GPUs and I want to run an experiment on GPUs 1 and 3, because GPU 0 is dedicated to graphics and GPU 2 is running other programs. I can just run the code as a normal Python script: python pytorch/pytorch_ddp_1machine.py.

I can make DEVICE_IDS a command-line argument so that users can specify exactly which GPUs to use.

nzw0301 (Member) commented Jan 15, 2023:

Thanks, I get the point. I should have discussed the necessity of this PR earlier, because #150 also addresses the same issue. (I suppose #150 still needs some updates to determine the rank, since it uses an MPI variable.)

In terms of the number of GPUs, we can manage it from the command line; for example, https://discuss.pytorch.org/t/how-to-change-the-default-device-of-gpu-device-ids-0/1041/2.

Contributor Author:

In terms of the number of GPUs, we can manage it from the command line; for example, https://discuss.pytorch.org/t/how-to-change-the-default-device-of-gpu-device-ids-0/1041/2.

Do you mean passing the environment variables in the command line like this?
CUDA_VISIBLE_DEVICES=1,2 python myscript.py

I was told that changing environment variables is generally not good practice, so I tried to avoid it. Also, with the above example, I guess you would still need to configure a default value in the Python code.
In the updated commit, I added a --device_ids argument to specify the device ids, with a default value of 0. Let me know if you like it or what you have in mind.
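Roughly what I have in mind (a sketch; the flag name matches the commit, the rest is illustrative):

import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--device_ids",
    type=int,
    nargs="+",
    default=[0],
    help="GPU device ids to use, e.g. --device_ids 1 3",
)
args = parser.parse_args()
device_ids = args.device_ids  # later handed to mp.spawn / run_optimize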

nzw0301 (Member) commented Jan 15, 2023:

Do you mean ...

Yes!

I was told that changing environment variables is generally not good practice

Could you share any reference, ideally a PyTorch page? Thanks!

Contributor Author:

Could you share any reference, ideally a PyTorch page? Thanks!

I can't find a good reference, but from some googling I feel the idea comes more from the production or security side.
To summarize, my understanding is that:

Again, I was just told this by a colleague and I don't have any real experience with it myself. Apparently there are also many cases where environment variables should actively be used, so I don't have a strong opinion about this.
I just thought that if there was a less risky way, I would pick it.

So, if you think using an environment variable is better, I can change the code accordingly.

Comment on lines 187 to 190
storage="sqlite:///example.db",
direction="maximize",
study_name="pytorch_ddp",
load_if_exists=True,
nzw0301 (Member) commented Jan 13, 2023:

Do we need sqlite for this example? Personally, I think the example without a database would be simpler and easier for users to run, thanks to fewer dependencies.

Contributor Author:

I think I just followed the example of pytorch_lightning_ddp.py. I see your point, but I guess the most likely next step after getting the example to run is to find a way to log the tuning history, because in most cases one would want to check it. I don't mind removing this part, but I also feel it offers some convenience to users. We could make it optional with an argument, for example.

This should probably be a separate topic, but since you mentioned the database, I wonder whether optuna can avoid repeating the same (or very similar) parameters that it tried in the past without saving them to a database. I'm asking because it seems to me that optuna picks new parameters by referencing the database (or the tuning history), especially when working with multiple GPUs.

Member:

pytorch_lightning_ddp.py

I've reviewed that code, so I know a little of the background of that example and integration module. Due to technical difficulties, we would only support distributed training with a database; that is why the example code uses SQL. So I don't think we need to follow pytorch_lightning_ddp.py here. Another drawback of using SQL storage is its speed. Using SQL is widely documented in Optuna, so I believe Optuna users can extend an example without database storage to one with database storage.

Repeated same/similar params

As an optuna feature, there isn't one, but I think we can implement something similar: optuna/optuna#1001 (comment) or optuna/optuna#2021 (comment).
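A rough sketch of that idea (not an existing optuna API; the duplicate check is done by hand inside the objective):

import optuna


def objective(trial):
    params = {"lr": trial.suggest_float("lr", 1e-5, 1e-1, log=True)}
    # Reuse the result of any completed trial that already tried these parameters.
    for past in trial.study.get_trials(deepcopy=False):
        if (
            past.number != trial.number
            and past.state == optuna.trial.TrialState.COMPLETE
            and past.params == params
        ):
            return past.value
    # Otherwise train normally; the placeholder below just returns something.
    return params["lr"]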

Contributor Author:

we would only support distributed training with a database

Do you mean that a database is only supported in distributed training but not in other cases? If so, how should users save the tuning history?

Another drawback of using SQL storage is its speed.

What would be a better way to store the tuning history, then? I thought optuna only supported SQL up to the latest version, 3.1.0.
I'd also like to know whether there are alternative storage options.

As an optuna feature, there isn't one,

Do you mean that optuna doesn't have a feature to check the parameters it tried in the past?
I think I got my answer. When I run my code without specifying storage, it prints out
A new study created in memory with name: no-name-eefeae74-8d1d-4613-9b57-b18177f10da3
So it seems optuna stores the tuning history in memory instead of SQL, and I guess multiple trials can still refer to this in-memory storage. But I get the following error at the end of tuning if I don't specify storage.

Study statistics: 
  Number of finished trials:  0
  Number of pruned trials:  0
  Number of complete trials:  0
Best trial:
Traceback (most recent call last):
  File ".../optuna-examples/pytorch/pytorch_ddp_1machine.py", line 219, in <module>
    trial = study.best_trial
  File ".../anaconda3/envs/optuna/lib/python3.9/site-packages/optuna/study/study.py", line 159, in best_trial
    return copy.deepcopy(self._storage.get_best_trial(self._study_id))
  File ".../anaconda3/envs/optuna/lib/python3.9/site-packages/optuna/storages/_in_memory.py", line 262, in get_best_trial
    raise ValueError("No trials are completed yet.")
ValueError: No trials are completed yet.

So I guess, since I'm using distributed training, I still need SQL storage anyway?
I have no problem finishing the entire tuning with pytorch_simple.py, in which no storage is given.

nzw0301 (Member) commented Jan 15, 2023:

Do you mean that a database is only supported in distributed training but not in other cases? If so, how should users save the tuning history?

I mean that PyTorch Lightning with distributed data parallel training works only with a database. Optuna's docs say:

For the distributed data parallel training, the version of PyTorchLightning needs to be higher than or equal to v1.5.0. In addition, Study should be instantiated with RDB storage.

Reference: https://optuna.readthedocs.io/en/stable/reference/generated/optuna.integration.PyTorchLightningPruningCallback.html


What would be a better way to store the tuning history, then?

I suppose saving the study object would be a simple way to save the optimisation history: https://optuna.readthedocs.io/en/stable/faq.html#how-can-i-save-and-resume-studies
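For example, along the lines of the FAQ (a minimal sketch using joblib; a plain pickle works similarly):

import joblib
import optuna

study = optuna.create_study(direction="maximize")
study.optimize(lambda t: t.suggest_float("x", -1.0, 1.0), n_trials=5)
joblib.dump(study, "study.pkl")

# Later: resume from the saved object.
study = joblib.load("study.pkl")
print(study.best_value)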

Another way is to use optuna's callback feature, e.g. the MLflow or wandb integrations.


Do you mean that optuna doesn't have a feature to check the parameters it tried in the past?

Yes.


So I guess, since I'm using distributed training, I still need SQL storage anyway?

I don't think so, because another distributed training example, https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py, doesn't use a database.

nzw0301 (Member) commented Jan 15, 2023:

Sorry, my last answer might be wrong. When we used spawn for DDP training with pytorch-lightning, we actually had a similar issue. So, as you said, spawn-based DDP might require database storage...

Contributor Author:

I suppose saving the study object would be a simple way to save the optimisation history

Thank you for the information. Good to know that a study can be saved as a pickle.

spawn-based DDP might require database storage

I see, thanks for the background information.

nzw0301 (Member) commented Jan 15, 2023

@li-li-github cc: @toshihikoyanase
Thank you for your contributions and comments. Let me summarise my thoughts on this PR. I believe #150 would support the nccl backend, which enables multi-GPU environments. This PR is also motivated by the nccl backend (i.e. GPUs), but it uses spawn inside the script rather than mpirun or torchrun from the CLI, which https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py assumes.

When we use spawn, I suppose we might require a database to store the study, as with PyTorchLightningPruningCallback, because child processes report intermediate values. Thus, as a TorchDistributedTrial-with-spawn example (not necessarily multiple GPUs, but that can be supported), this PR would be helpful. So I would like to suggest supporting other backends if possible, because the Optuna CI doesn't have GPUs.

li-li-github (Contributor Author)

Thank you for the summary.
I modified the code so that it uses the CPU and the gloo backend by default. Let me know what you think.
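Roughly, the default now behaves like this sketch (the function and flag names here are illustrative, not necessarily the ones in the script):

import torch
import torch.distributed as dist


def setup_process_group(rank, world_size, use_cuda=False):
    # gloo on CPU by default; nccl only when GPUs are explicitly requested.
    # MASTER_ADDR / MASTER_PORT are assumed to be set by the caller.
    backend = "nccl" if use_cuda and torch.cuda.is_available() else "gloo"
    dist.init_process_group(backend=backend, rank=rank, world_size=world_size)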

nzw0301 (Member) commented Jan 16, 2023

Thanks! Before reviewing the updates, I'd like to hear @toshihikoyanase's opinion. What do you think about #155 (comment)?

toshihikoyanase (Member)

@nzw0301 @li-li-github Thank you for your insightful discussion.

but it uses spawn inside the script rather than mpirun or torchrun from the CLI, which https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py assumes.

The official PyTorch DDP tutorial also employs spawn, and it seems reasonable as a minimal setup. As @li-li-github said in the description, MPI setup is not a trivial task, and I guess more users will be able to try TorchDistributedTrial.

When we use spawn, I suppose we might require a database to store the study, as with PyTorchLightningPruningCallback, because child processes report intermediate values.

Hmm... In my understanding, InMemoryStorage can work with spawn, since only the rank-0 process accesses the storage (including writing/reading intermediate values) and the other processes receive broadcast data from rank 0. I confirmed that the example worked with InMemoryStorage except for https://github.com/li-li-github/optuna-examples/blob/926a86faa5114076d77299520d7731f02fcd49bb/pytorch/pytorch_ddp_1machine.py#L224-L239: the parent process could not refer to the storage in the rank-0 process, and the optimization result was lost.
So, I think it is reasonable to recommend RDBStorage instead of InMemoryStorage in this example.

Thus, as a TorchDistributedTrial-with-spawn example (not necessarily multiple GPUs, but that can be supported), this PR would be helpful. So I would like to suggest supporting other backends if possible, because the Optuna CI doesn't have GPUs.

I agree. @li-li-github kindly added support for gloo, and I confirmed that it worked with four processes.

In summary, I'd like to proceed with the review.

nzw0301 (Member) commented Jan 17, 2023

@toshihikoyanase Thank you for your comments and for checking the InMemoryStorage case! I totally agree with you.

li-li-github (Contributor Author)

Thank you @toshihikoyanase for the detailed information.

InMemoryStorage can work with spawn since only the rank-0 process ...

I see. It seems that the study must be created only under the condition if rank == 0:

I can modify the code so that the study is created under rank == 0, and use torch.multiprocessing.Manager to retrieve the study after mp.spawn, if you think that would make it a better example.

In run_optimize

def run_optimize(rank, world_size, device_ids, return_dict):
    ...

    if rank == 0:
        study = optuna.create_study(
            # storage="sqlite:///example.db",
            direction="maximize",
            study_name="pytorch_ddp",
            # load_if_exists=True,
        )
        study.optimize(
            partial(objective, device_id=device_ids[rank]),
            n_trials=N_TRIALS,
            timeout=300,
        )
        return_dict['study'] = study
        
    ...

Before & after mp.spawn

    manager = mp.Manager()
    return_dict = manager.dict()
    mp.spawn(
        run_optimize,
        args=(world_size, device_ids, return_dict),
        nprocs=world_size,
        join=True,
    )
    study = return_dict['study']

li-li-github (Contributor Author)

I just pushed the changes I mentioned above. Please let me know what you guys think! @toshihikoyanase @nzw0301

github-actions bot

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 29, 2023
toshihikoyanase (Member)

Since this PR adds a new example file, we need to update the CI config to test it.
Could you add a new step to test this example on CPU, next to Run multi-node examples?

- name: Run multi-node examples
  run: |
    export OMPI_MCA_rmaps_base_oversubscribe=yes
    STORAGE_URL=sqlite:///example.db
    mpirun -n 2 -- python pytorch/pytorch_distributed_simple.py
  env:
    OMP_NUM_THREADS: 1

toshihikoyanase (Member)

CI failed due to network errors. I'll retry the failed jobs after a while.

nzw0301 (Member) commented Feb 6, 2023

How about renaming the script to something like pytorch_distributed_spawn.py? I think spawn works with the mpi backend without GPUs, so the example can support a wider range of use cases.

toshihikoyanase (Member)

All CI jobs passed. The failure was due to GitHub Actions, not the example code.

li-li-github (Contributor Author)

Could you add a new step to test this example on CPU, next to Run multi-node examples?

I have added a test step next to it.

How about renaming the script to something like pytorch_distributed_spawn.py?

The file name has been changed.

Thank you @nzw0301 and @toshihikoyanase for the suggestions above. I also added a --master-port argument to avoid port-number conflicts in case the user wants to run more than one instance of this script.
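For reference, a sketch of how such a flag typically feeds into the rendezvous (the flag name is from this PR; the wiring shown is illustrative):

import argparse
import os

parser = argparse.ArgumentParser()
parser.add_argument("--master-port", type=int, default=12355)
args = parser.parse_args()

# The default env:// rendezvous of init_process_group reads these variables,
# so distinct ports let several instances of the script run side by side.
os.environ["MASTER_ADDR"] = "localhost"
os.environ["MASTER_PORT"] = str(args.master_port)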

toshihikoyanase (Member)

Thank you for your update. The example was tested in CI jobs as expected.

toshihikoyanase (Member) left a comment:

The change almost looks good to me. I have two minor comments.

@toshihikoyanase toshihikoyanase added the feature Change that does not break compatibility, but affects the public interfaces of examples. label Feb 9, 2023
li-li-github and others added 5 commits February 9, 2023 00:36
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
@li-li-github
Copy link
Contributor Author

Thank you again for the suggestions. I have accepted all of them.

github-actions bot

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Feb 16, 2023
toshihikoyanase (Member)

I'm sorry for the delayed response. I'll approve this PR after the CI jobs pass.

li-li-github (Contributor Author)

Much appreciated!

toshihikoyanase (Member) commented Feb 24, 2023

@li-li-github Sorry for bothering you, but a CI job failed due to an update of the pytorch-lightning integration. #172 resolved this problem, so could you rebase onto (or merge) master, please?

AttributeError: 'Trainer' object has no attribute 'strategy'

li-li-github (Contributor Author)

All tests passed after I merged the main branch.

toshihikoyanase (Member) left a comment:

Thank you for your update! LGTM!
I deeply appreciate your contribution.

@toshihikoyanase toshihikoyanase added this to the v3.2.0 milestone Feb 24, 2023
@toshihikoyanase toshihikoyanase merged commit f72383c into optuna:main Feb 24, 2023