Proximal Policy Optimization example #470

wrzadkow · 2020-09-17T12:39:34Z

Reinforcement learning example using the Proximal Policy Optimization algorithm, prepared in close collaboration with @jheek and @lespeholt.

The implementation learns to play Atari games provided as OpenAI Gym environments. Tests on BeamRider, Breakout, Pong, Qbert, Seaquest, and SpaceInvaders show that the training performance reported in the original paper is reproduced. Training runs at ~1000 FPS on a VM with one V100 GPU. Unit tests and documentation are provided.
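For readers unfamiliar with the algorithm, the core of PPO is its clipped surrogate objective. The following is a minimal sketch of that objective in JAX, not the code from this PR: the function name `clipped_surrogate_loss`, its argument names, and the default `clip_eps=0.2` are assumptions chosen for illustration.

```python
import jax.numpy as jnp


def clipped_surrogate_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """PPO clipped objective, negated so that it can be minimized."""
    # Probability ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratios = jnp.exp(new_log_probs - old_log_probs)
    unclipped = ratios * advantages
    clipped = jnp.clip(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Take the pessimistic (smaller) objective per sample, then average.
    return -jnp.mean(jnp.minimum(unclipped, clipped))


# Toy batch of two samples (hypothetical numbers): ratios are 2.0 and 0.5.
old_log_probs = jnp.log(jnp.array([0.5, 0.5]))
new_log_probs = jnp.log(jnp.array([1.0, 0.25]))
advantages = jnp.array([1.0, -1.0])
loss = clipped_surrogate_loss(new_log_probs, old_log_probs, advantages)
```

In this toy batch the first ratio is clipped from 2.0 to 1.2 and the second from 0.5 to 0.8, so both samples contribute the clipped term.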

examples/ppo/main.py

examples/ppo/README.md

examples/ppo/agent.py

examples/ppo/unit_tests.py

examples/ppo/ppo_lib.py

PiperOrigin-RevId: 334765232

wrzadkow added 19 commits September 11, 2020 10:43

Initial PPO commit

d38a671

Use jax.nn.one_hot instead of list comprehension for speed

c0ff3ef
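The diff for this commit is not shown here. One plausible reading, sketched below under that assumption, is selecting the log-probability of each taken action with a one-hot mask instead of a Python-level loop. The array names and sizes are hypothetical.

```python
import jax
import jax.numpy as jnp

# Hypothetical batch: 3 samples, 4 possible actions.
logits = jnp.arange(12.0).reshape(3, 4)
log_probs = jax.nn.log_softmax(logits)
actions = jnp.array([2, 0, 1])

# Slow variant: one indexing operation per sample, driven from Python.
looped = jnp.array([log_probs[i, a] for i, a in enumerate(actions)])

# Vectorized variant: mask out everything except the taken action, then sum.
mask = jax.nn.one_hot(actions, log_probs.shape[1])
selected = jnp.sum(mask * log_probs, axis=1)
```

Both variants yield the same values; the masked version is a single array expression and therefore traces cleanly under `jax.jit`.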

Clarity: calculate only advantages in gae_advantages()

f576a76

jit-compile training step

11bc593

Clarity: get rid of most [:-1] indexing

5feeec7

jit & vmap Generalized Advantage Estimation

8be6677
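As an illustration of what jit-compiling and vmapping Generalized Advantage Estimation can look like, here is a minimal sketch. It is not the PR's `gae_advantages()`: the function names, the argument layout (actor axis first), and the use of `jax.lax.scan` are assumptions.

```python
import jax
import jax.numpy as jnp


def gae_single(rewards, terminal_masks, values, discount, gae_param):
    """GAE for one actor. rewards, terminal_masks: (T,); values: (T + 1,)."""
    # TD residuals: delta_t = r_t + gamma * mask_t * V(s_{t+1}) - V(s_t).
    deltas = rewards + discount * terminal_masks * values[1:] - values[:-1]

    def step(next_advantage, inputs):
        delta, mask = inputs
        advantage = delta + discount * gae_param * mask * next_advantage
        return advantage, advantage

    # Scan backwards in time; outputs stay aligned with the input time axis.
    init = jnp.zeros((), dtype=deltas.dtype)
    _, advantages = jax.lax.scan(step, init, (deltas, terminal_masks), reverse=True)
    return advantages


# Vectorize over the leading actor axis first, then compile the batched function.
gae_batched = jax.jit(jax.vmap(gae_single, in_axes=(0, 0, 0, None, None)))

rewards = jnp.ones((2, 3))          # 2 actors, 3 time steps (toy data)
terminal_masks = jnp.ones((2, 3))   # 1.0 means the episode continues
values = jnp.zeros((2, 4))          # includes the bootstrap value
advantages = gae_batched(rewards, terminal_masks, values, 0.5, 1.0)
```

With zero value estimates, a discount of 0.5, and a GAE parameter of 1.0, each actor's advantages reduce to discounted reward sums: 1.75, 1.5, 1.0.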

Add advantage normalization

670978f
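Advantage normalization usually means rescaling the advantages of a batch to zero mean and unit standard deviation before they enter the loss. A minimal sketch, with the function name and the `eps` constant chosen here for illustration:

```python
import jax.numpy as jnp


def normalize_advantages(advantages, eps=1e-8):
    """Shift and scale a batch of advantages to zero mean and unit std."""
    # eps guards against division by zero when all advantages are equal.
    return (advantages - advantages.mean()) / (advantages.std() + eps)


raw = jnp.array([1.0, 2.0, 3.0, 6.0])   # toy values
normalized = normalize_advantages(raw)
```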

Small code cleanup

3414100

Add some asserts & debug info logging

f40b049

Add unit tests

2bd52d8

Add more debugging info

b943afc

Add forward pass tests

b0543a9

Explicitly mention values shape being (batch, 1), not (batch,) (no influence on results)

6eedf84

Add more asserts, test more frequently

04763aa

Use log_probs from the start

be01451

Thread sync: wait for experience before starting the training

a99baac

Reduce amount of information printed when testing

c06e8d7

Clarity: use namedtuple instead of tuple

21a3540

Add README

c18dd9d

google-cla bot added the cla: yes label Sep 17, 2020

Enhance docstrings

d9ad5be

andsteing assigned jheek Sep 17, 2020

wrzadkow added 2 commits September 18, 2020 08:30

Allow more flexible game choice (don't hardcode game-specific features)

d0ff2ae

Correctly specify the number of frames

1af5bbb

wrzadkow force-pushed the rl-example-ppo branch from b809cea to 1af5bbb on September 18, 2020 10:25

jheek requested changes Sep 18, 2020

examples/ppo/main.py (outdated, resolved)

examples/ppo/main.py (outdated, resolved)

andsteing reviewed Sep 18, 2020

examples/ppo/README.md (outdated, resolved)

examples/ppo/README.md (resolved)

wrzadkow added 3 commits September 18, 2020 14:24

Add device_get() for speed as suggested by @jheek

f88e45b
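The commit does not show where `device_get()` is called. A minimal sketch of the general idea, assuming the goal is to fetch network outputs to host memory in one explicit transfer before they are used from Python; `policy_outputs` is a hypothetical stand-in for the real forward pass.

```python
import jax
import jax.numpy as jnp
import numpy as np


@jax.jit
def policy_outputs(observations):
    """Hypothetical stand-in for a policy network forward pass."""
    logits = observations @ jnp.ones((observations.shape[-1], 3))
    log_probs = jax.nn.log_softmax(logits)
    values = jnp.sum(observations, axis=-1, keepdims=True)  # shape (batch, 1)
    return log_probs, values


observations = jnp.ones((4, 8))
# One explicit transfer for the whole output pytree; the results are NumPy
# arrays, so later Python-side indexing does not go back to the accelerator.
log_probs, values = jax.device_get(policy_outputs(observations))
```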

Add requirements.txt

690a9c8

Use absl.flags for better hyperparameter handling

58c4ca0

lespeholt reviewed Sep 24, 2020

examples/ppo/agent.py (resolved)

wrzadkow added 2 commits September 24, 2020 16:35

Streamline training: use one thread, divide code into smaller chunks

50b2b79

Avoid using global variables

df3daa1

wrzadkow force-pushed the rl-example-ppo branch 4 times, most recently from d411337 to 1e89a8b on September 24, 2020 20:22

Adhere to file naming standard

7e036ae

wrzadkow force-pushed the rl-example-ppo branch from 1e89a8b to 7e036ae on September 25, 2020 07:29

wrzadkow added 2 commits September 25, 2020 08:10

Merge remote.py with agent.py due to similar function

9ff33b9

Use tensorboard for logging and add checkpointing

08bd344

wrzadkow force-pushed the rl-example-ppo branch from 5e3eb19 to 08bd344 on September 28, 2020 10:34

wrzadkow added 2 commits September 28, 2020 13:29

Simplify and format code

65faed8

Save checkpoints less frequently

68b8713

wrzadkow marked this pull request as ready for review September 29, 2020 10:15

wrzadkow added 6 commits September 29, 2020 12:24

Update the README

57dd0a3

Don't send values and log probs to remote process and back

d7a8fa4

Add tensorboard.dev trace

f9e37fe

Remove unneeded function get_state()

70d21f7

Small type hints & docstrings enhancement

342786b

Use ml_collections for hyperparameter handling

a4dade8

wrzadkow force-pushed the rl-example-ppo branch from aeabf2a to a4dade8 on September 30, 2020 22:27

Refactor a long statement

315902b

jheek reviewed Oct 1, 2020

examples/ppo/unit_tests.py (outdated, resolved)

examples/ppo/unit_tests.py (outdated, resolved)

examples/ppo/ppo_lib.py (outdated, resolved)

examples/ppo/ppo_lib.py (outdated, resolved)

copybara-service bot pushed a commit that referenced this pull request Oct 1, 2020

Merge pull request #470 from wrzadkow:rl-example-ppo

fed1aaf

PiperOrigin-RevId: 334765232

wrzadkow added 3 commits October 1, 2020 08:30

Test: use assertEqual and clip rewards when testing them

d2eae5c

Compile vectorized code instead of vectorizing compiled code

d444075

Specify static_argnums with proper int

f3a9d03
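The last two commits concern the order in which `jax.jit` and `jax.vmap` are composed and how `static_argnums` is given. The sketch below shows both orderings on a toy function; `partial_return` and its arguments are hypothetical and not taken from the PR.

```python
import jax
import jax.numpy as jnp


def partial_return(rewards, num_steps):
    """Sum of the first num_steps rewards of one trajectory."""
    total = 0.0
    for t in range(num_steps):  # needs a concrete Python int, hence static
        total = total + rewards[t]
    return total


# Compile the vectorized function: vmap first, then a single jit around it.
# static_argnums is given as a plain int (position of num_steps).
jit_of_vmap = jax.jit(jax.vmap(partial_return, in_axes=(0, None)), static_argnums=1)

# Vectorize the compiled function: same result, different composition order.
vmap_of_jit = jax.vmap(jax.jit(partial_return, static_argnums=1), in_axes=(0, None))

rewards = jnp.arange(12.0).reshape(3, 4)   # 3 trajectories, 4 steps each
result_a = jit_of_vmap(rewards, 2)
result_b = vmap_of_jit(rewards, 2)
```

Both compositions return the same values; the first hands the whole batched computation to the compiler as one unit, which is the ordering the commit title describes.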

copybara-service bot merged commit 45937af into google:master Oct 1, 2020

wrzadkow mentioned this pull request Oct 2, 2020

PPO Linen example #508

Merged

6 participants
