
feature(nyz): add new middleware distributed demo #321

Merged
merged 81 commits into main from dev-dist on Dec 12, 2022
Conversation


@PaParaZz1 PaParaZz1 commented May 15, 2022

Description

  • DataParallel demo
  • DistributedDataParallel demo (a generic sketch of the pattern follows this list)
  • tb logger example
  • Distributed RL demo (Ape-X type)
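
For readers unfamiliar with the pattern, the DistributedDataParallel demo follows the standard torch.distributed + DDP training loop. The snippet below is a minimal, generic sketch of that pattern, not the PR's actual demo; the toy linear network and random data are placeholders.

```python
# Minimal DDP sketch (placeholder model/data, CPU-only via the gloo backend).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR, so the default env:// init works.
    dist.init_process_group(backend="gloo")

    model = nn.Linear(16, 4)            # placeholder policy network
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)

    for _ in range(10):
        data = torch.randn(32, 16)      # each rank consumes its own data shard
        loss = ddp_model(data).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                 # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=2 ddp_sketch.py
```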

Related Issue

#102
#176

TODO

Check List

  • merge the latest version source branch/repo, and resolve all the conflicts
  • pass style check
  • pass all the tests

@PaParaZz1 PaParaZz1 added the enhancement (New feature or request) and parallel-dist (Parallel and distributed training related) labels on May 15, 2022

codecov bot commented May 15, 2022

Codecov Report

Merging #321 (dde6009) into main (dd2b3a5) will decrease coverage by 0.59%.
The diff coverage is 79.88%.

@@            Coverage Diff             @@
##             main     #321      +/-   ##
==========================================
- Coverage   85.39%   84.79%   -0.60%     
==========================================
  Files         532      556      +24     
  Lines       43943    44718     +775     
==========================================
+ Hits        37523    37919     +396     
- Misses       6420     6799     +379     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 84.79% <79.88%> (-0.60%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| ding/data/buffer/tests/test_buffer_benchmark.py | 37.70% <ø> (ø) |
| ding/entry/tests/test_cli_ditask.py | 100.00% <ø> (ø) |
| ding/policy/base_policy.py | 74.85% <ø> (+0.84%) ⬆️ |
| ding/policy/sac.py | 60.29% <ø> (-0.08%) ⬇️ |
| ...ework/middleware/functional/termination_checker.py | 22.50% <16.66%> (-8.75%) ⬇️ |
| ding/data/tests/test_model_loader.py | 23.63% <23.63%> (ø) |
| ding/framework/tests/test_task.py | 92.50% <35.71%> (-7.50%) ⬇️ |
| ding/framework/middleware/functional/trainer.py | 84.84% <40.00%> (-3.04%) ⬇️ |
| ding/policy/dqn.py | 87.34% <40.00%> (-1.54%) ⬇️ |
| ding/framework/middleware/functional/enhancer.py | 39.65% <42.85%> (-0.73%) ⬇️ |

... and 271 more



zxzzz0 commented Jun 21, 2022

What is the throughput of this? Does this beat SampleFactory? @PaParaZz1 @sailxjx


sailxjx commented Jun 22, 2022

@zxzzz0 This is not about comparing speed with Sample Factory. The bottleneck of RL training may appear in any of collecting, training, or evaluation; for example, collecting too fast can lead to too large a generation gap and underfitting of the model, and because of the GIL, deserializing data in the training process also slows down overall training efficiency. There are many such points we need to consider in this project.
This PR provides a new design pattern for overall RL training, starting from the idea that users can easily scale from single-machine experiments to large-scale distributed systems without large code-modification costs or performance losses.
If you are primarily concerned with environment-side collecting, you can use Sample Factory inside DI-engine to achieve the collection efficiency you expect.
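
To make the "scale from single-machine to distributed without rewriting the pipeline" idea concrete, here is a self-contained toy illustration of a middleware-style pipeline over a shared context. It deliberately avoids DI-engine's real API; every name in it (Context, collector, trainer, run) is hypothetical.

```python
# Toy middleware pipeline: the same pipeline definition is reused, and scaling
# out only changes which middlewares a given process plugs in.
from dataclasses import dataclass, field


@dataclass
class Context:
    step: int = 0
    trajectories: list = field(default_factory=list)


def collector(ctx: Context) -> None:
    # Pretend to interact with the environment and store a transition.
    ctx.trajectories.append(f"transition@{ctx.step}")


def trainer(ctx: Context) -> None:
    # Pretend to consume collected data for one optimization step.
    if ctx.trajectories:
        ctx.trajectories.pop(0)


def run(pipeline, max_step: int = 5) -> None:
    ctx = Context()
    for step in range(max_step):
        ctx.step = step
        for middleware in pipeline:
            middleware(ctx)


# Single-machine run: collector and trainer live in one process.
run([collector, trainer])

# Distributed run (conceptually): each process keeps the same pipeline shape but
# only enables the middlewares matching its role, plus exchangers that ship
# contexts and models between processes, which is the job that ContextExchanger
# and ModelExchanger take on in this PR.
```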


zxzzz0 commented Jun 22, 2022

> If you are primarily concerned with environment-side collecting

No. To clarify, we only care about overall performance, which means the time it will take to reach a certain reward in the end.

Usually, if you can squeeze every drop of performance out of the CPU/GPU, you can learn faster. Environment-side collecting is just one indicator; there are many others. You also have to pay attention to learner FPS, GPU utilization, and other metrics to understand the throughput of the whole system.

When benchmarking, the target is not the collector side but the overall growth speed of the reward.
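
As a concrete (and generic) example of the kind of throughput accounting described above, a small learner-FPS meter could look like the following. This is not part of DI-engine or this PR; ThroughputMeter is a hypothetical helper.

```python
# Hypothetical learner-FPS meter: frames consumed by the learner per second.
import time


class ThroughputMeter:
    def __init__(self):
        self.frames = 0
        self.start = time.time()

    def update(self, batch_frames: int) -> None:
        self.frames += batch_frames

    def fps(self) -> float:
        elapsed = max(time.time() - self.start, 1e-8)
        return self.frames / elapsed


meter = ThroughputMeter()
for _ in range(100):
    meter.update(batch_frames=32)   # e.g. one training batch of 32 transitions
print(f"learner FPS: {meter.fps():.1f}")
```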


sailxjx commented Jun 23, 2022

> No. To clarify, we only care about overall performance, which means the time it will take to reach a certain reward in the end.

Yes, that's right: the purpose of the distributed version is to maximize overall performance while not requiring too much effort to write code across multiple tasks.

Another consideration is that we need to go design-first. Only after the upper-layer interface is unified and stable will it be possible to gradually optimize every aspect of performance without disturbing users. You can see that from version 0.x to version 1.0 we have gradually settled on a definite interface style, and the purpose of this branch is to extend that interface style to distributed operation.


zxzzz0 commented Jun 23, 2022

Sounds good. In the future, please benchmark different designs/interfaces so that you can say with confidence that you've chosen the design with the best overall performance.

If you don't do benchmarking (as I did before for DI-engine) and then find a way to improve performance after the design is frozen in version 1.0, you won't be able to change it without a major version update.

@PaParaZz1 PaParaZz1 force-pushed the main branch 2 times, most recently from 6dfebeb to 813580f on July 31, 2022 09:31
PaParaZz1 and others added 3 commits September 9, 2022 00:20
…ader (#425)

* Add singleton log writer

* Use get_instance on writer

* feature(nyz): polish atari ddp demo and add dist demo

* Refactor dist version

* Wrap class based middleware

* Change if condition in wrapper

* Only run enhancer on learner

* Support new parallel mode on slurm cluster

* Temp data loader

* Stash commit

* Init data serializer

* Update dump part of code

* Test StorageLoader

* Turn data serializer into storage loader, add storage loader in context exchanger

* Add local id and startup interval

* Fix storage loader

* Support treetensor

* Add role on event name in context exchanger, use share_memory function on tensor

* Double size buffer

* Copy tensor to cpu, skip wait for context on collector and evaluator

* Remove data loader middleware

* Upgrade k8s parser

* Add epoch timer

* Dont use lb

* Change tensor to numpy

* Remove files when stop storage loader

* Discard shared object

* Ensure correct load shm memory

* Add model loader

* Rename model_exchanger to ModelExchanger

* Add model loader benchmark

* Shutdown loaders when task finish

* Upgrade supervisor

* Dont cleanup files when shutting down

* Fix async cleanup in model loader

* Check model loader on dqn

* Dont use loader in dqn example

* Fix style check

* Fix dp

* Fix github tests

* Skip github ci

* Fix bug in event loop

* Fix enhancer tests, move router from start to __init__

* Change default ttl

* Add comments

Co-authored-by: niuyazhe <niuyazhe@sensetime.com>
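
For reference, the "Add singleton log writer" / "Use get_instance on writer" commits above refer to a writer shared process-wide so that every middleware logs to the same place. Below is a minimal sketch of that singleton pattern, assuming torch.utils.tensorboard; SingletonWriter is a hypothetical name, not the class added by this PR.

```python
# Hypothetical singleton TensorBoard writer: get_instance() always returns the
# same SummaryWriter so all call sites share one log directory.
from torch.utils.tensorboard import SummaryWriter


class SingletonWriter(SummaryWriter):
    _instance = None

    @classmethod
    def get_instance(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = cls(*args, **kwargs)
        return cls._instance


writer = SingletonWriter.get_instance(log_dir="./log")
writer.add_scalar("train/loss", 0.5, global_step=1)
writer.close()
```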
@PaParaZz1 PaParaZz1 marked this pull request as ready for review October 19, 2022 09:07
@PaParaZz1 PaParaZz1 merged commit b4c152f into main Dec 12, 2022
@PaParaZz1 PaParaZz1 deleted the dev-dist branch December 12, 2022 02:54