@farakiko farakiko commented Oct 23, 2023

Fixes two of the three issues mentioned in the previous PR (pytorch backend major update #240):

  • Broadcasts stale_epochs to all GPUs so that training stops cleanly (previously the code hung when stale_epochs > patience)
  • Audits the dist.barrier() calls during training to avoid bottlenecks where all GPUs wait unnecessarily for the others to reach the same line of code
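
The early-stopping fix above can be sketched as follows. This is a minimal illustration, not the PR's actual code; the names `should_stop`, `stale_epochs`, and `patience` are assumptions. Rank 0 decides whether patience is exhausted, and the decision is broadcast so every rank leaves the training loop together instead of hanging at the next collective call:

```python
import torch
import torch.distributed as dist

def should_stop(stale_epochs: int, patience: int) -> bool:
    # Rank 0's view of stale_epochs is the one that counts; broadcasting
    # the decision guarantees all ranks agree on whether to stop.
    stop = torch.tensor(int(stale_epochs > patience))
    dist.broadcast(stop, src=0)  # every rank now holds rank 0's decision
    return bool(stop.item())
```

In the training loop, each rank would call `if should_stop(stale_epochs, patience): break`, so no rank is left waiting at a barrier that the others never reach.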

Remaining issue

  • Must fix num-workers>0 for gpus>1 (currently this errors because the code attempts to pickle the tfds dataset, which is not supported)
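
The failure mode above can be reproduced in miniature. With `num_workers > 0`, PyTorch's DataLoader pickles the dataset object to send it to each worker process; a dataset holding an unpicklable member fails at that point. The class below is a hypothetical stand-in (a `threading.Lock` plays the role of the tfds handle), not the project's dataset:

```python
import pickle
import threading

class UnpicklableDataset:
    """Stand-in for a dataset wrapping an unpicklable resource."""

    def __init__(self):
        # A lock, like a tfds reader or open file handle, cannot be pickled.
        self.handle = threading.Lock()

    def __getitem__(self, idx):
        return idx

    def __len__(self):
        return 4

def can_pickle(obj) -> bool:
    # DataLoader workers receive the dataset via pickling, so this check
    # predicts whether num_workers > 0 would fail for this dataset.
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False
```

The usual fixes are to open the underlying resource lazily inside `__getitem__` (or a worker init hook) rather than in `__init__`, or to fall back to `num_workers=0`.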

@jpata jpata merged commit 8178164 into jpata:main Oct 23, 2023
@farakiko farakiko deleted the pyg_training_optimization branch October 23, 2023 09:53
erwulff pushed a commit to erwulff/particleflow that referenced this pull request Oct 23, 2023
jpata pushed a commit that referenced this pull request Oct 25, 2023
* wip: implement HPO in pytorch pipeline

* fix: bugs after rebase

* chore: code formatting

* fix: minor bug

* fix: typo

* fix: lr cast to str when read from config

* try reducing --ntrain --ntest in tests

* update dist.barrier and fix stale epochs (#249)

* change pytorch CI/CD test to use gravnet model

* feat: implemented HPO using Ray Tune

Now able to perform hyperparameter search using random search with
automatic trial launching and Ray-compatible checkpointing.

Support is still missing for:
- Trial schedulers
- Advanced Ray Tune search algorithms

* fix: flake8 error

* chore: update default config values for pyg

---------

Co-authored-by: Farouk Mokhtar <farouk.mokhtar@gmail.com>
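
The random-search HPO described in the commit above can be sketched without the Ray Tune API. The search space, objective, and function name below are illustrative assumptions, not the pipeline's actual configuration:

```python
import random

def random_search(space, objective, n_trials=20, seed=0):
    """Sample configs uniformly from `space` and keep the best one."""
    rng = random.Random(seed)
    best_cfg, best_val = None, float("inf")
    for _ in range(n_trials):
        # Each trial draws one value per hyperparameter independently.
        cfg = {k: rng.choice(v) for k, v in space.items()}
        val = objective(cfg)
        if val < best_val:
            best_cfg, best_val = cfg, val
    return best_cfg, best_val

# Toy search space and objective for demonstration only.
space = {"lr": [1e-2, 1e-3, 1e-4], "batch_size": [16, 32, 64]}
best, loss = random_search(space, lambda c: c["lr"] + 1.0 / c["batch_size"])
```

Ray Tune adds what the commit notes are still missing here: trial scheduling, checkpointing, and smarter search algorithms on top of this basic sampling loop.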
farakiko added a commit to farakiko/particleflow that referenced this pull request Oct 25, 2023
farakiko added a commit to farakiko/particleflow that referenced this pull request Jan 23, 2024
farakiko added a commit to farakiko/particleflow that referenced this pull request Jan 23, 2024