
Add elastic run api #3503

Merged
merged 13 commits into master from branch-elastic-run-api on Jun 17, 2022


EnricoMi
Collaborator

@EnricoMi EnricoMi commented Apr 4, 2022

Currently, elastic training mode can only be used through horovodrun, not through the existing horovod.run API.

This PR allows horovod.run to run a function in elastic mode when min_num_proc or host_discovery_script is set.
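The mode selection described above can be sketched as a small dispatch rule: setting either elastic-only option switches the run into elastic mode. This is a minimal illustration, not Horovod's code; RunSettings, is_elastic and choose_launcher are hypothetical names, and only the field names min_num_proc and host_discovery_script come from the PR text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunSettings:
    """Hypothetical subset of launcher settings; field names follow the PR text."""
    num_proc: int = 1
    min_num_proc: Optional[int] = None
    host_discovery_script: Optional[str] = None


def is_elastic(settings: RunSettings) -> bool:
    # Elastic mode is selected as soon as either elastic-only option is set.
    return settings.min_num_proc is not None or settings.host_discovery_script is not None


def choose_launcher(settings: RunSettings) -> str:
    # Hypothetical dispatch: name the code path that would run the user's function.
    return "elastic" if is_elastic(settings) else "static"


if __name__ == "__main__":
    print(choose_launcher(RunSettings(num_proc=4)))                  # static
    print(choose_launcher(RunSettings(num_proc=4, min_num_proc=2)))  # elastic
```

Keying the decision on the presence of these options (rather than a separate flag) mirrors how the PR description frames it: the caller opts into elastic mode simply by supplying an elastic-only parameter.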

@github-actions

github-actions bot commented Apr 4, 2022

Unit Test Results

836 files (+19), 836 suites (+19), duration 9h 48m 38s (+35m 37s)
770 tests (+2): 727 passed (+2), 43 skipped (±0), 0 failed (±0)
18 776 runs (+38): 13 431 passed (+34), 5 345 skipped (+4), 0 failed (±0)

Results for commit af09b9a. ± Comparison against base commit a304c81.

♻️ This comment has been updated with latest results.

@github-actions

github-actions bot commented Apr 4, 2022

Unit Test Results (with flaky tests)

968 files (+19), 968 suites (+19), duration 10h 21m 50s (+40m 19s)
770 tests (+2): 727 passed (+2), 43 skipped (±0), 0 failed (±0)
22 042 runs (+38): 15 335 passed (+34), 6 707 skipped (+4), 0 failed (±0)

Results for commit af09b9a. ± Comparison against base commit a304c81.


@EnricoMi EnricoMi changed the base branch from master to branch-test-run-api-examples April 6, 2022 09:50
@EnricoMi EnricoMi added this to the v0.25.0 milestone Apr 22, 2022
@EnricoMi EnricoMi marked this pull request as ready for review April 26, 2022 21:17
@EnricoMi EnricoMi marked this pull request as draft April 26, 2022 21:21
Base automatically changed from branch-test-run-api-examples to master April 30, 2022 20:53
@EnricoMi EnricoMi marked this pull request as ready for review April 30, 2022 20:56
@EnricoMi EnricoMi force-pushed the branch-elastic-run-api branch 2 times, most recently from ac778f9 to 4dc5aab June 16, 2022 11:32
Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi
Collaborator Author

EnricoMi commented Jun 16, 2022

I managed to move the KVStoreServer code from launch.py into gloo_run.py, which is where remote hosts first become known and the common network interface can be determined. This removes the awkward need to pass all driver IPs to run_task.py with a very low timeout.
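Why the common interface can only be determined once remote hosts are known can be illustrated with a small sketch: each host reports its network interfaces, and the launcher keeps only those present on every host. This is a hypothetical helper under that assumption, not the code in gloo_run.py; common_interfaces and the host and interface names are illustrative.

```python
from typing import Dict, Iterable, Set


def common_interfaces(host_interfaces: Dict[str, Iterable[str]]) -> Set[str]:
    """Return the interfaces reported by every host (hypothetical helper)."""
    if not host_interfaces:
        # Nothing can be determined before any remote host has reported in.
        raise ValueError("no hosts reported any interfaces")
    sets = [set(ifaces) for ifaces in host_interfaces.values()]
    # Keep only interfaces that appear on all hosts.
    common = set.intersection(*sets)
    if not common:
        raise RuntimeError("hosts share no common network interface")
    return common


if __name__ == "__main__":
    reported = {
        "host-1": ["eth0", "lo", "ib0"],
        "host-2": ["eth0", "lo"],
        "host-3": ["eth0", "lo", "docker0"],
    }
    print(sorted(common_interfaces(reported)))  # ['eth0', 'lo']
```

Since the result depends on every host's report, the natural place to compute it is the point where the hosts are first known, which is the reasoning given for moving the code.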

…hanges

Reverts changes to run_task.py, launch.py and http_client.py.

Signed-off-by: Enrico Minack <github@enrico.minack.dev>
@EnricoMi EnricoMi merged commit aeb960c into master Jun 17, 2022
@EnricoMi EnricoMi deleted the branch-elastic-run-api branch June 17, 2022 08:19

2 participants