@farakiko farakiko commented Oct 23, 2023

Fixes two of the three issues mentioned in the previous PR (pytorch backend major update #240):

  • Broadcasts stale_epochs to all GPUs so that training stops cleanly (previously the code hung when stale_epochs > patience)
  • Audits the dist.barrier() calls during training to avoid bottlenecks where all GPUs wait unnecessarily for the others to reach the same line of code
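
The early-stopping fix above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the names `should_stop`, `stale_epochs`, and `patience` are assumptions. Rank 0 decides whether patience is exhausted, and the decision is broadcast so every rank leaves the training loop together instead of hanging at the next collective call:

```python
import torch
import torch.distributed as dist

def should_stop(stale_epochs: int, patience: int) -> bool:
    # Rank 0's view of stale_epochs is the one that counts; broadcasting
    # the decision guarantees all ranks agree on whether to stop.
    stop = torch.tensor(int(stale_epochs > patience))
    dist.broadcast(stop, src=0)  # every rank now holds rank 0's decision
    return bool(stop.item())
```

In the training loop, each rank would call `if should_stop(stale_epochs, patience): break`, so no rank is left waiting at a barrier that the others never reach.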

Remaining issue

  • Must fix num-workers>0 for gpus>1 (currently this errors because the code attempts to pickle the tfds dataset, which is not supported)
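
The failure mode above can be reproduced in miniature. With `num_workers > 0`, PyTorch's DataLoader pickles the dataset object to send it to each worker process; a dataset holding an unpicklable member fails at that point. The class below is a hypothetical stand-in (a `threading.Lock` plays the role of the tfds handle), not the project's dataset:

```python
import pickle
import threading

class UnpicklableDataset:
    """Stand-in for a dataset wrapping an unpicklable resource."""

    def __init__(self):
        # A lock, like a tfds reader or open file handle, cannot be pickled.
        self.handle = threading.Lock()

    def __getitem__(self, idx):
        return idx

    def __len__(self):
        return 4

def can_pickle(obj) -> bool:
    # DataLoader workers receive the dataset via pickling, so this check
    # predicts whether num_workers > 0 would fail for this dataset.
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False
```

The usual fixes are to open the underlying resource lazily inside `__getitem__` (or a worker init hook) rather than in `__init__`, or to fall back to `num_workers=0`.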

@jpata jpata merged commit 8178164 into jpata:main Oct 23, 2023
@farakiko farakiko deleted the pyg_training_optimization branch October 23, 2023 09:53
erwulff pushed a commit to erwulff/particleflow that referenced this pull request Oct 23, 2023
jpata pushed a commit that referenced this pull request Oct 25, 2023
* wip: implement HPO in pytorch pipeline

* fix: bugs after rebase

* chore: code formatting

* fix: minor bug

* fix: typo

* fix: lr cast to str when read from config

* try reducing --ntrain --ntest in tests

* update dist.barrier and fix stale epochs (#249)

* change pytorch CI/CD test to use gravnet model

* feat: implemented HPO using Ray Tune

Now able to perform hyperparameter search using random search with
automatic trial launching and Ray-compatible checkpointing.

Support is still missing for:
- Trial schedulers
- Advanced Ray Tune search algorithms

* fix: flake8 error

* chore: update default config values for pyg

---------

Co-authored-by: Farouk Mokhtar <farouk.mokhtar@gmail.com>
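
The random-search HPO described in the commit above can be sketched without the Ray Tune API. The search space, objective, and function name below are illustrative assumptions, not the pipeline's actual configuration:

```python
import random

def random_search(space, objective, n_trials=20, seed=0):
    """Sample configs uniformly from `space` and keep the best one."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_trials):
        # Each trial draws one value per hyperparameter independently.
        cfg = {k: rng.choice(v) for k, v in space.items()}
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Toy search space and objective for demonstration only.
space = {"lr": [1e-2, 1e-3, 1e-4], "batch_size": [16, 32, 64]}
best, loss = random_search(space, lambda c: c["lr"] + 1.0 / c["batch_size"])
```

Ray Tune adds what the commit notes are still missing here: trial scheduling, checkpointing, and smarter search algorithms on top of this basic sampling loop.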
farakiko added a commit to farakiko/particleflow that referenced this pull request Oct 25, 2023
farakiko added a commit to farakiko/particleflow that referenced this pull request Jan 23, 2024
farakiko added a commit to farakiko/particleflow that referenced this pull request Jan 23, 2024