An example of using pytorch distributed data parallel on 1 machine with arbitrary multiple GPUs. #155
Conversation
This pull request has not seen any recent activity.
I'm sorry for the delayed response. I think the proposed example seems useful, since it enables users to optimize hyperparameters more casually compared with pytorch_distributed_simple.py.
Let me share my first comments.
pytorch/pytorch_ddp_1machine.py (Outdated)

    def objective(single_trial, device_id):
        device = torch.device("cuda", device_id)
Do we need this variable? I thought rank was enough when I implemented a similar example: https://gist.github.com/nzw0301/9ba3c837a02539e8bde194f4f2c9d0e5
The code still runs without converting device_id to device, but I'm not sure how your example gets the variable rank in

    def objective(single_trial):
        trial = optuna.integration.TorchDistributedTrial(single_trial, rank)

I can modify the code so that it only uses device_id, but I think objective still needs an argument for device_id unless I'm missing something.
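For context, in a torch.multiprocessing.spawn-based launcher, the rank is simply the first argument that spawn passes to the worker function. A minimal sketch of that mechanism (the run_worker name is hypothetical, not code from either example):

    import torch
    import torch.multiprocessing as mp

    def run_worker(rank, world_size):
        # mp.spawn calls run_worker(i, *args) with i = 0 .. nprocs-1,
        # so on one machine the rank can double as the CUDA device index.
        device = torch.device("cuda", rank)
        print(f"worker {rank}/{world_size} on {device}")

    if __name__ == "__main__":
        mp.spawn(run_worker, args=(2,), nprocs=2)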
Hmm, indeed. Thank you for your comment. As you did, partial or lambda might be necessary.
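For example (a rough sketch; the objective body is just a placeholder), partial can bind device_id so that study.optimize sees a one-argument callable:

    from functools import partial

    import optuna

    def objective(trial, device_id):
        x = trial.suggest_float("x", -10, 10)  # placeholder search space
        return x ** 2

    study = optuna.create_study()
    study.optimize(partial(objective, device_id=0), n_trials=10)
    # Equivalently: study.optimize(lambda t: objective(t, device_id=0), n_trials=10)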
    def objective(single_trial):
        trial = optuna.integration.TorchDistributedTrial(single_trial, rank)

The new TorchDistributedTrial in optuna v3.1.0 will not require rank or device. It creates a process group with the gloo backend only for Optuna's communication.
So, we may be able to update this part after the release of optuna v3.1.0.
https://github.com/optuna/optuna/blob/25a135c26662e5e26f88d7266eb2922703bf72dd/optuna/integration/pytorch_distributed.py#L119
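If I read the linked code correctly, the usage after v3.1.0 would reduce to something like this sketch (not tested against the release; it roughly follows the pattern of pytorch_distributed_simple.py, where rank 0 drives the study and the other ranks pass None):

    import optuna
    import torch.distributed as dist

    # Assumes dist.init_process_group(...) has already been called.

    def objective(single_trial):
        # single_trial is a real trial on rank 0 and None on the other ranks;
        # TorchDistributedTrial broadcasts the sampled values to every rank.
        trial = optuna.integration.TorchDistributedTrial(single_trial)
        lr = trial.suggest_float("lr", 1e-5, 1e-1, log=True)
        # ... training loop ...
        return 0.0  # placeholder objective value

    if dist.get_rank() == 0:
        study = optuna.create_study()
        study.optimize(objective, n_trials=20)
    else:
        for _ in range(20):
            objective(None)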
Thank you @toshihikoyanase. Indeed, having the rank argument yields an error with optuna v3.1.0, but without this argument, optuna < 3.1.0 will raise an error (i.e., it is backward incompatible). I can push a commit to remove the rank argument, perhaps with a comment at the beginning of the code to let users know that this example only works with optuna >= 3.1.0.
Also, do we need to specify the optuna version in requirements.txt?
pytorch/pytorch_ddp_1machine.py (Outdated)

    # Arbitrary GPUs can be selected
    DEVICE_IDS = [0, 1]
Due to this part, this example does not work with arbitrary multiple GPUs.
In my PyTorch experience, I believe a torchrun-style script (https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py) would be more practical, since it doesn't require updating the script at all and we can specify the number of GPUs from the command line.
By arbitrary multiple GPUs, I mean that you can specify exactly which GPUs to use. I understand that in a large-scale setting, where you have multiple nodes with multiple GPUs in each node, using MPI (and, I guess, torchrun as well, because it seems to require MPI) would definitely be the better practice.
This example, however, was made with the idea in mind that you don't need to install any MPI packages. Say I have a local machine with 4 GPUs and I want to run an experiment on GPU1 & 3, because I want GPU0 to be dedicated to graphics and GPU2 is running other programs. I can just run the code as a normal Python script: python pytorch/pytorch_ddp_1machine.py.
I can make DEVICE_IDS an argument so that users can specify which particular GPUs to use from the command line.
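A sketch of what that command-line option could look like (the exact parsing is illustrative):

    import argparse

    parser = argparse.ArgumentParser()
    # e.g. `python pytorch/pytorch_ddp_1machine.py --device_ids 1 3`
    # to use GPU1 and GPU3 while leaving GPU0 and GPU2 alone.
    parser.add_argument("--device_ids", type=int, nargs="+", default=[0])
    args = parser.parse_args()
    DEVICE_IDS = args.device_ids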
Thanks. I got the point. I should have discussed the necessity of this PR, because #150 also addresses the same issue. (I suppose #150 still needs some updates to decide the rank, since it uses MPI's variables.)
In terms of the number of GPUs, we can manage it from the command line; for example, https://discuss.pytorch.org/t/how-to-change-the-default-device-of-gpu-device-ids-0/1041/2.
In terms of the number of GPUs, we can manage it from the command line; for example, https://discuss.pytorch.org/t/how-to-change-the-default-device-of-gpu-device-ids-0/1041/2.

Do you mean passing the environment variables on the command line like this?

    CUDA_VISIBLE_DEVICES=1,2 python myscript.py

I was told that changing environment variables is generally not a good practice, so I tried to avoid this. Also, with the above example, I guess you might need to configure a default value in the Python code.
In the updated commit, I offered a suggestion using a --device_ids argument to specify device_ids, with a default value of 0. Let me know if you like it or what you have in mind.
Do you mean ...

Yes!

I was told that changing environment variables is generally not a good practice

Could you share any reference, ideally a PyTorch page? Thanks!
Could you share any reference, ideally a PyTorch page? Thanks!

I can't find a good reference, but after some Google searching, I feel the idea comes more from the production or security side.
To summarize, my understanding is that:
- It can make debugging harder, because an environment change may not be applied immediately: https://support.cloud.engineyard.com/hc/en-us/articles/205407508-Environment-Variables-and-Why-You-Shouldn-t-Use-Them
- There are risks of leaking sensitive information: https://blog.dnsimple.com/2018/05/using-environment-as-configuration/

Again, I was just told this by a colleague of mine and I don't really have any experience with it. Apparently, there also seem to be many cases where env variables should be actively used, so I don't have a strong opinion about this. I just thought that if there was a less risky way, I would pick that.
So, if you think using environment variables is better, then I can change the code accordingly.
pytorch/pytorch_ddp_1machine.py (Outdated)

    storage="sqlite:///example.db",
    direction="maximize",
    study_name="pytorch_ddp",
    load_if_exists=True,
Do we need sqlite for this example? Personally, I think the example without a database would be simpler and easier for users to run, thanks to fewer dependencies.
I think I just followed the example of pytorch_lightning_ddp.py. I got your point, but I guess the most likely next step after getting the example code to run is to find a way to log the tuning history, because in most cases one would want to check it. I don't mind removing this part, but I also feel it would offer some convenience to users. We could make it optional with an argument, for example.
This should probably be a separate topic, but since you mentioned the database: I wonder if optuna could avoid repeating the same (or very similar) parameters that it tried in the past without saving them to a database. I'm asking this because it seems to me that optuna obtains new parameters by referencing the database (or the tuning history), especially when working with multiple GPUs.
pytorch_lightning_ddp.py

I've reviewed that code, so I know a little of the background of that example and its integration module. Due to technical difficulties, we can only support distributed training with a database; this is why that example code uses SQL. So I don't think we need to follow pytorch_lightning_ddp.py. Another drawback of using SQL for storage is its speed. Using SQL is widely documented in Optuna, so I believe Optuna users can extend an example without database storage into one with database storage.

Repeated same/similar params

As an optuna feature, there isn't one, but I think we can implement something similar: optuna/optuna#1001 (comment) or optuna/optuna#2021 (comment).
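For illustration, the workaround in those linked comments amounts to scanning past trials inside the objective; a minimal sketch (returning the previous value for an exact duplicate is just one possible policy):

    import optuna
    from optuna.trial import TrialState

    def objective(trial):
        x = trial.suggest_int("x", 0, 5)
        # If an earlier completed trial used exactly the same parameters,
        # reuse its recorded value instead of re-running the evaluation.
        for past in trial.study.get_trials(deepcopy=False, states=(TrialState.COMPLETE,)):
            if past.params == trial.params:
                return past.value
        return x ** 2

    study = optuna.create_study()
    study.optimize(objective, n_trials=20)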
we would only support distributed training with a database

Do you mean that a database is only supported in distributed training but not in other cases? If so, how should users save the tuning history?

Another drawback of using SQL for storage is its speed.

What would be a better way to store the tuning history, then? I thought optuna only supported SQL up to the latest version, 3.1.0. I also want to know if there are alternative storage options.

As an optuna feature, there isn't one,

Do you mean that optuna doesn't have a feature to check parameters it tried in the past?
I think I got my answer. When I run my code without specifying storage, I see it print out

    A new study created in memory with name: no-name-eefeae74-8d1d-4613-9b57-b18177f10da3

So it seems that optuna stores the tuning history in memory instead of SQL, and I guess multiple trials can still refer to this storage in memory. But I get the following error at the end of tuning if I don't specify storage:
    Study statistics:
    Number of finished trials: 0
    Number of pruned trials: 0
    Number of complete trials: 0
    Best trial:
    Traceback (most recent call last):
      File ".../optuna-examples/pytorch/pytorch_ddp_1machine.py", line 219, in <module>
        trial = study.best_trial
      File ".../anaconda3/envs/optuna/lib/python3.9/site-packages/optuna/study/study.py", line 159, in best_trial
        return copy.deepcopy(self._storage.get_best_trial(self._study_id))
      File ".../anaconda3/envs/optuna/lib/python3.9/site-packages/optuna/storages/_in_memory.py", line 262, in get_best_trial
        raise ValueError("No trials are completed yet.")
    ValueError: No trials are completed yet.
So I guess, since I'm using distributed training, I still need SQL storage anyway?
I have no problem finishing the entire tuning with pytorch_simple.py, in which no storage is given.
Do you mean that a database is only supported in distributed training but not in other cases? If so, how should users save the tuning history?

I mean that PyTorch Lightning with distributed data parallel training works only with a database. Optuna's docs say:

    For the distributed data parallel training, the version of PyTorchLightning needs to be higher than or equal to v1.5.0. In addition, Study should be instantiated with RDB storage.

What would be a better way to store the tuning history, then?

I suppose saving the study would be a simple way to save the optimization history: https://optuna.readthedocs.io/en/stable/faq.html#how-can-i-save-and-resume-studies
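Per that FAQ entry, saving and resuming can be as simple as pickling the study object with joblib (a sketch):

    import joblib
    import optuna

    study = optuna.create_study()
    joblib.dump(study, "study.pkl")   # save the in-memory study to disk

    # Later, in another process:
    study = joblib.load("study.pkl")  # resume with the same history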
Another way is to use optuna's callback feature, such as mlflow or wandb.

Do you mean that optuna doesn't have a feature to check parameters it tried in the past?

Yes.

So I guess, since I'm using distributed training, I still need SQL storage anyway?

I don't think so, because another distributed training example, https://github.com/optuna/optuna-examples/blob/main/pytorch/pytorch_distributed_simple.py, doesn't use a database.
Sorry, my last answer might be wrong. When we used spawn for DDP training with pytorch-lightning, we actually had a similar issue. So, as you said, spawn-based DDP might require database storage...
I suppose saving the study would be a simple way to save the optimization history

Thank you for the information. Good to know that a study can be saved as a pickle.

spawn-based DDP might require database storage

I see, thanks for the background information.
@li-li-github cc: @toshihikoyanase When we use
Thank you for the summary.
Thanks! Before reviewing the updates, I'd like to hear the opinion of @toshihikoyanase. What do you think about #155 (comment)?
@nzw0301 @li-li-github Thank you for your insightful discussion.
The official tutorial of PyTorch DDP also employs spawn, and it seems reasonable as a minimal setup. As @li-li-github said in the description, the MPI setup is not a trivial task, and I guess more users can try
Hmm... In my understanding,
I agree. @li-li-github kindly added support for
In summary, I'd like to proceed with the review.
@toshihikoyanase Thank you for your comments and for checking the
Thank you @toshihikoyanase for the detailed information.
I got you. It seems that a study must be created under the condition of
I can modify the code such that
In run_optimize
Before & after

I just pushed the changes I mentioned above. Please let me know what you guys think! @toshihikoyanase @nzw0301
This pull request has not seen any recent activity.
Since this PR adds a new example file, we need to update the CI config to test it: optuna-examples/.github/workflows/pytorch.yml, lines 50 to 56 in 4baba3a.
CI failed due to network errors. I'll retry the failed jobs after a while.
How about renaming the script, such as
All CI jobs passed. The failure was due to GitHub Actions, not the example code.
I have added a test step next to it.
The file name has been changed. Thank you @nzw0301 and @toshihikoyanase for the above suggestions. I also added an argument
The change almost looks good to me. I have two minor comments.
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-authored-by: Toshihiko Yanase <toshihiko.yanase@gmail.com>
Co-authored-by: Kento Nozawa <k_nzw@klis.tsukuba.ac.jp>
Thank you again for the suggestions. I have accepted all of them.
This pull request has not seen any recent activity.
I'm sorry for the delayed response. I'll approve this PR after the CI jobs pass.
Much appreciated!
@li-li-github Sorry for bothering you, but a CI job failed due to an update of the pytorch-lightning integration. #172 resolved this problem, so could you rebase (or merge) master, please?
All tests passed after I merged the main branch.
Thank you for your update! LGTM!
I deeply appreciate your contribution.
Motivation
The existing example pytorch_distributed_simple.py is designed to be used with MPI. There is no example of training a model with arbitrary multiple GPUs on a single machine where no MPI, which is not trivial to set up, is needed. This example is very flexible and is also applicable to pytorch-lightning in cases where the existing pytorch_lightning_ddp.py may not work.
Description of the changes
This example uses torch.distributed.init_process_group and torch.distributed.destroy_process_group, as well as torch.multiprocessing, to distribute the jobs that MPI would do in pytorch_distributed_simple.py.
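A condensed sketch of that structure (assuming the nccl backend and a free local port; the actual script additionally wires in the Optuna objective and model training):

    import os

    import torch
    import torch.distributed as dist
    import torch.multiprocessing as mp

    def run(rank, world_size, device_ids):
        # Rendezvous over localhost instead of relying on an MPI launcher.
        os.environ["MASTER_ADDR"] = "localhost"
        os.environ["MASTER_PORT"] = "12355"
        dist.init_process_group("nccl", rank=rank, world_size=world_size)
        device = torch.device("cuda", device_ids[rank])
        # ... build the DDP model on `device` and run the trials here ...
        dist.destroy_process_group()

    if __name__ == "__main__":
        device_ids = [0, 1]  # the GPUs to use; one process is spawned per entry
        mp.spawn(run, args=(len(device_ids), device_ids), nprocs=len(device_ids))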