
feature(nyz): add new middleware distributed demo #321

Merged
merged 81 commits into main from dev-dist on Dec 12, 2022
Conversation


@PaParaZz1 PaParaZz1 commented May 15, 2022

Description

  • DataParallel demo
  • DistributedDataParallel demo (a generic sketch of the pattern follows this list)
  • tb logger example
  • Distributed RL demo (Ape-X type)
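
For readers unfamiliar with the pattern, the DistributedDataParallel demo follows the standard torch.distributed + DDP training loop. The snippet below is a minimal, generic sketch of that pattern, not the PR's actual demo; the toy linear network and random data are placeholders.

```python
# Minimal DDP sketch (placeholder model/data, CPU-only via the gloo backend).
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP


def main():
    # torchrun sets RANK/WORLD_SIZE/MASTER_ADDR, so the default env:// init works.
    dist.init_process_group(backend="gloo")

    model = nn.Linear(16, 4)            # placeholder policy network
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=1e-2)

    for _ in range(10):
        data = torch.randn(32, 16)      # each rank consumes its own data shard
        loss = ddp_model(data).pow(2).mean()
        optimizer.zero_grad()
        loss.backward()                 # gradients are all-reduced across ranks
        optimizer.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    main()  # launch with: torchrun --nproc_per_node=2 ddp_sketch.py
```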

Related Issue

#102
#176

TODO

Check List

  • merge the latest version source branch/repo, and resolve all the conflicts
  • pass style check
  • pass all the tests

@PaParaZz1 PaParaZz1 added the enhancement (New feature or request) and parallel-dist (Parallel and distributed training related) labels on May 15, 2022

codecov bot commented May 15, 2022

Codecov Report

Merging #321 (dde6009) into main (dd2b3a5) will decrease coverage by 0.59%.
The diff coverage is 79.88%.

@@            Coverage Diff             @@
##             main     #321      +/-   ##
==========================================
- Coverage   85.39%   84.79%   -0.60%     
==========================================
  Files         532      556      +24     
  Lines       43943    44718     +775     
==========================================
+ Hits        37523    37919     +396     
- Misses       6420     6799     +379     
| Flag | Coverage Δ |
| --- | --- |
| unittests | 84.79% <79.88%> (-0.60%) ⬇️ |

Flags with carried forward coverage won't be shown.

| Impacted Files | Coverage Δ |
| --- | --- |
| ding/data/buffer/tests/test_buffer_benchmark.py | 37.70% <ø> (ø) |
| ding/entry/tests/test_cli_ditask.py | 100.00% <ø> (ø) |
| ding/policy/base_policy.py | 74.85% <ø> (+0.84%) ⬆️ |
| ding/policy/sac.py | 60.29% <ø> (-0.08%) ⬇️ |
| ...ework/middleware/functional/termination_checker.py | 22.50% <16.66%> (-8.75%) ⬇️ |
| ding/data/tests/test_model_loader.py | 23.63% <23.63%> (ø) |
| ding/framework/tests/test_task.py | 92.50% <35.71%> (-7.50%) ⬇️ |
| ding/framework/middleware/functional/trainer.py | 84.84% <40.00%> (-3.04%) ⬇️ |
| ding/policy/dqn.py | 87.34% <40.00%> (-1.54%) ⬇️ |
| ding/framework/middleware/functional/enhancer.py | 39.65% <42.85%> (-0.73%) ⬇️ |

... and 271 more



zxzzz0 commented Jun 21, 2022

What is the throughput of this? Does this beat SampleFactory? @PaParaZz1 @sailxjx


sailxjx commented Jun 22, 2022

@zxzzz0 This is not about comparing speed with Sample Factory. The bottleneck of RL training may appear in any of collecting, training, or evaluation; for example, collecting too fast can lead to too large a generation gap and underfitting of the model, and because of the GIL, deserializing data in the training process also slows down overall training efficiency. There are many such points we need to consider in this project.
This PR provides a new design pattern for overall RL training, starting from the idea that users can easily scale from single-machine experiments to large-scale distributed systems without large code-modification costs or performance losses.
If you are primarily concerned with environment-side collecting, you can use Sample Factory inside DI-engine to achieve the collection efficiency you expect.
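
To make the "scale from single-machine to distributed without rewriting the pipeline" idea concrete, here is a self-contained toy illustration of a middleware-style pipeline over a shared context. It deliberately avoids DI-engine's real API; every name in it (Context, collector, trainer, run) is hypothetical.

```python
# Toy middleware pipeline: the same pipeline definition is reused, and scaling
# out only changes which middlewares a given process plugs in.
from dataclasses import dataclass, field


@dataclass
class Context:
    step: int = 0
    trajectories: list = field(default_factory=list)


def collector(ctx: Context) -> None:
    # Pretend to interact with the environment and store a transition.
    ctx.trajectories.append(f"transition@{ctx.step}")


def trainer(ctx: Context) -> None:
    # Pretend to consume collected data for one optimization step.
    if ctx.trajectories:
        ctx.trajectories.pop(0)


def run(pipeline, max_step: int = 5) -> None:
    ctx = Context()
    for step in range(max_step):
        ctx.step = step
        for middleware in pipeline:
            middleware(ctx)


# Single-machine run: collector and trainer live in one process.
run([collector, trainer])

# Distributed run (conceptually): each process keeps the same pipeline shape but
# only enables the middlewares matching its role, plus exchangers that ship
# contexts and models between processes, which is the job that ContextExchanger
# and ModelExchanger take on in this PR.
```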


zxzzz0 commented Jun 22, 2022

> If you are primarily concerned with environment-side collecting

No. To clarify, we only care about overall performance, which means the time it will take to reach a certain reward in the end.

Usually, if you can squeeze every drop of performance out of the CPU/GPU, you can learn faster. Environment-side collecting is just one indicator; there are many others. You also have to pay attention to learner FPS, GPU utilization, and other metrics to understand the throughput of the whole system.

When benchmarking, the target is not the collector side but the overall growth speed of the reward.
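
As a concrete (and generic) example of the kind of throughput accounting described above, a small learner-FPS meter could look like the following. This is not part of DI-engine or this PR; ThroughputMeter is a hypothetical helper.

```python
# Hypothetical learner-FPS meter: frames consumed by the learner per second.
import time


class ThroughputMeter:
    def __init__(self):
        self.frames = 0
        self.start = time.time()

    def update(self, batch_frames: int) -> None:
        self.frames += batch_frames

    def fps(self) -> float:
        elapsed = max(time.time() - self.start, 1e-8)
        return self.frames / elapsed


meter = ThroughputMeter()
for _ in range(100):
    meter.update(batch_frames=32)   # e.g. one training batch of 32 transitions
print(f"learner FPS: {meter.fps():.1f}")
```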


sailxjx commented Jun 23, 2022

> No. To clarify, we only care about overall performance, which means the time it will take to reach a certain reward in the end.

Yes, that's right: the purpose of the distributed version is to maximize overall performance while not requiring too much effort to write code across multiple tasks.

Another consideration is that we need to go design-first. Only after the upper-layer interface is unified and stable will it be possible to gradually optimize every aspect of performance without disturbing users. You can see that from version 0.x to version 1.0 we have gradually settled on a definite interface style, and the purpose of this branch is to extend that interface style to distributed operation.


zxzzz0 commented Jun 23, 2022

Sounds good. In the future, please benchmark different designs/interfaces so that you can say with confidence that you've chosen the design with the best overall performance.

If you don't do benchmarking (as I did before for DI-engine) and then find a way to improve performance after the design is frozen in version 1.0, you won't be able to change it without a major version update.

@PaParaZz1 PaParaZz1 force-pushed the main branch 2 times, most recently from 6dfebeb to 813580f on July 31, 2022 09:31
PaParaZz1 and others added 3 commits September 9, 2022 00:20
…ader (#425)

* Add singleton log writer

* Use get_instance on writer

* feature(nyz): polish atari ddp demo and add dist demo

* Refactor dist version

* Wrap class based middleware

* Change if condition in wrapper

* Only run enhancer on learner

* Support new parallel mode on slurm cluster

* Temp data loader

* Stash commit

* Init data serializer

* Update dump part of code

* Test StorageLoader

* Turn data serializer into storage loader, add storage loader in context exchanger

* Add local id and startup interval

* Fix storage loader

* Support treetensor

* Add role on event name in context exchanger, use share_memory function on tensor

* Double size buffer

* Copy tensor to cpu, skip wait for context on collector and evaluator

* Remove data loader middleware

* Upgrade k8s parser

* Add epoch timer

* Dont use lb

* Change tensor to numpy

* Remove files when stop storage loader

* Discard shared object

* Ensure correct load shm memory

* Add model loader

* Rename model_exchanger to ModelExchanger

* Add model loader benchmark

* Shutdown loaders when task finish

* Upgrade supervisor

* Dont cleanup files when shutting down

* Fix async cleanup in model loader

* Check model loader on dqn

* Dont use loader in dqn example

* Fix style check

* Fix dp

* Fix github tests

* Skip github ci

* Fix bug in event loop

* Fix enhancer tests, move router from start to __init__

* Change default ttl

* Add comments

Co-authored-by: niuyazhe <niuyazhe@sensetime.com>
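
For reference, the "Add singleton log writer" / "Use get_instance on writer" commits above refer to a writer shared process-wide so that every middleware logs to the same place. Below is a minimal sketch of that singleton pattern, assuming torch.utils.tensorboard; SingletonWriter is a hypothetical name, not the class added by this PR.

```python
# Hypothetical singleton TensorBoard writer: get_instance() always returns the
# same SummaryWriter so all call sites share one log directory.
from torch.utils.tensorboard import SummaryWriter


class SingletonWriter(SummaryWriter):
    _instance = None

    @classmethod
    def get_instance(cls, *args, **kwargs):
        if cls._instance is None:
            cls._instance = cls(*args, **kwargs)
        return cls._instance


writer = SingletonWriter.get_instance(log_dir="./log")
writer.add_scalar("train/loss", 0.5, global_step=1)
writer.close()
```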
@PaParaZz1 PaParaZz1 marked this pull request as ready for review October 19, 2022 09:07
@PaParaZz1 PaParaZz1 merged commit b4c152f into main Dec 12, 2022
@PaParaZz1 PaParaZz1 deleted the dev-dist branch December 12, 2022 02:54