
Fix test_pytorch_lightning.py #4305

Conversation

Alnusjaponica
Collaborator

@Alnusjaponica Alnusjaponica commented Jan 5, 2023

Motivation

Resolve #3418 and #4116

Description of the changes

Refactor deprecated features

  • trainer.training_type_plugin was removed in v1.8 (PR#11239). The attribute training_type_plugin was simply renamed to strategy, so the code was refactored accordingly.

  • The optional Trainer argument accelerator no longer accepts ddp_cpu. Instead, we can pass ddp to strategy and cpu to accelerator. Likewise, num_processes will be removed, so the number of processes is passed to devices instead.

  • AcceleratorConnector.distributed_backend was removed, but AcceleratorConnector.is_distributed is available instead, so the code was refactored accordingly.

  • callback.on_init_start() was removed in v1.8 (Issue#10894, PR#10940, PR#14867).
    Although no exactly equivalent alternative is provided, this confirmation can be moved elsewhere. strategy.setup_environment seems like the natural place for the check, but implementing it as a method of Strategy would affect users' code. It is therefore more reasonable to implement it in callback.setup() or callback.on_fit_start().

@github-actions github-actions bot added the optuna.integration Related to the `optuna.integration` submodule. This is automatically labeled by github-actions. label Jan 5, 2023
@grburgess

Out of curiosity, does everything work with PL 1.8 on this branch w.r.t. the pruner?

@codecov-commenter

codecov-commenter commented Jan 5, 2023

Codecov Report

Merging #4305 (7e88045) into master (7bc4bb1) will decrease coverage by 0.04%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #4305      +/-   ##
==========================================
- Coverage   90.42%   90.38%   -0.04%     
==========================================
  Files         172      172              
  Lines       13660    13668       +8     
==========================================
+ Hits        12352    12354       +2     
- Misses       1308     1314       +6     
Impacted Files Coverage Δ
optuna/integration/pytorch_lightning.py 0.00% <0.00%> (ø)
optuna/trial/_trial.py 95.26% <0.00%> (-1.19%) ⬇️
optuna/study/study.py 94.96% <0.00%> (+0.77%) ⬆️
optuna/integration/botorch.py 97.55% <0.00%> (+0.81%) ⬆️


@Alnusjaponica
Collaborator Author

Thanks for the question. That is exactly what we aim to do in this PR. @grburgess

@grburgess

Thanks! I checked out this branch to try it (which I should have done in the first place), and indeed the issue seems fixed.

@Alnusjaponica Alnusjaponica marked this pull request as ready for review January 6, 2023 06:09
@Alnusjaponica Alnusjaponica marked this pull request as draft January 6, 2023 06:09
@Alnusjaponica
Collaborator Author

Thank you for trying the patch. Please note that the problem with distributed processing still remains for now.

@Alnusjaponica
Collaborator Author

Alnusjaponica commented Jan 12, 2023

Problem

When DDP is used and optuna.TrialPruned is raised from a child process, PyTorch Lightning tries to resynchronize the processes to recover from the "error" and ultimately treats it as a DeadlockDetectedException, which terminates the whole process. For more details, see reconciliate_processes. Fixing this would require changing an environment variable and a private variable.

Alternative

Stop supporting DDP in order to fix the daily integration tests (#4116) and to support PyTorch Lightning versions after v1.6.

@nzw0301
Member

nzw0301 commented Jan 12, 2023

When I tried to fix this issue before, I also thought that manually raising TrialPruned in the main process, as in the CatBoost callback, would be another approach:
https://github.com/optuna/optuna-examples/blob/main/catboost/catboost_pruning.py#L57.
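The pattern suggested above can be sketched as follows: instead of raising optuna.TrialPruned inside a callback (which may execute in a DDP child process), record a flag during training and raise from the objective in the main process, mirroring the CatBoost pruning example. All names below are illustrative stand-ins, not real Optuna or PyTorch Lightning APIs:

```python
class TrialPruned(Exception):
    """Stand-in for optuna.TrialPruned."""


class RecordingPruningCallback:
    """Records the pruning decision instead of raising immediately."""

    def __init__(self) -> None:
        self.should_prune = False

    def on_validation_end(self, prune_signal: bool) -> None:
        # In a DDP child process we must NOT raise; just record the decision
        # so that all ranks stay synchronized and no deadlock is detected.
        self.should_prune = self.should_prune or prune_signal


def objective(callback: RecordingPruningCallback) -> float:
    # ... trainer.fit(model) would run here, invoking the callback ...
    if callback.should_prune:  # checked safely in the main process
        raise TrialPruned()
    return 0.0


cb = RecordingPruningCallback()
cb.on_validation_end(prune_signal=True)  # simulate a pruning decision
try:
    objective(cb)
except TrialPruned:
    pass  # the study would record the trial as pruned here
```

Because the exception is raised only after fit() returns, every DDP rank finishes training normally and Lightning's deadlock detection never fires.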

@Alnusjaponica
Collaborator Author

Alnusjaponica commented Jan 13, 2023

@nzw0301
Thank you for your advice. I'll check whether the suggested approach works in this case.
If possible, I am going to apply the change to support DDP after #4322.

@github-actions
Contributor

This pull request has not seen any recent activity.

@github-actions github-actions bot added the stale Exempt from stale bot labeling. label Jan 26, 2023
@github-actions
Contributor

github-actions bot commented Feb 9, 2023

This pull request was closed automatically because it had not seen any recent activity. If you want to discuss it, you can reopen it freely.

@github-actions github-actions bot closed this Feb 9, 2023
@Alnusjaponica Alnusjaponica deleted the fix-checks-integration-pytorch-lightning branch February 16, 2023 00:45
Successfully merging this pull request may close these issues.

Support latest PyTorch lightning
4 participants