# PyTorch implementation of TRPO

Try my implementation of PPO (a newer, better-performing variant of TRPO) unless you need TRPO for some specific reason.

This is a PyTorch implementation of "Trust Region Policy Optimization (TRPO)".

The code is mostly ported from the original implementation by John Schulman. In contrast to another PyTorch implementation of TRPO, this one computes the exact Hessian-vector product instead of approximating it with finite differences.
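For reference, PyTorch's autograd makes the exact Hessian-vector product available through a double backward pass: differentiate the KL term once with `create_graph=True`, take the dot product of that gradient with the given vector, and differentiate again. The sketch below illustrates the idea; the function name, `damping` argument, and flattening scheme are illustrative assumptions and may differ from what `trpo.py` actually does.

```python
import torch

def hessian_vector_product(loss, params, vector, damping=1e-1):
    """Exact Hessian-vector product via double backprop (illustrative sketch).

    `loss` is typically the mean KL divergence between the old and new policy,
    `params` is the list of policy parameters, and `vector` is a flat tensor
    with the same total number of elements as the parameters.
    """
    # First backward pass: gradient of the loss w.r.t. the parameters,
    # keeping the graph so it can be differentiated a second time.
    grads = torch.autograd.grad(loss, params, create_graph=True)
    flat_grad = torch.cat([g.view(-1) for g in grads])

    # Differentiating (grad . vector) w.r.t. the parameters gives H v,
    # the exact Hessian-vector product -- no finite differences needed.
    grad_vector_dot = (flat_grad * vector).sum()
    hvp = torch.autograd.grad(grad_vector_dot, params)
    flat_hvp = torch.cat([h.contiguous().view(-1) for h in hvp])

    # A small damping term keeps the conjugate gradient solve well conditioned.
    return flat_hvp + damping * vector
```

This product is exactly what the conjugate gradient solver needs to compute the natural gradient step without ever forming the full Hessian.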

## Contributions

Contributions are very welcome. If you know how to make this code better, don't hesitate to send a pull request.

## Usage

```
python main.py --env-name "Reacher-v1"
```

## Recommended hyperparameters

- InvertedPendulum-v1: 5000
- Reacher-v1, InvertedDoublePendulum-v1: 15000
- HalfCheetah-v1, Hopper-v1, Swimmer-v1, Walker2d-v1: 25000
- Ant-v1, Humanoid-v1: 50000
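
For example, assuming these numbers refer to the batch size (environment steps collected per policy update) and that `main.py` exposes a matching `--batch-size` flag (check `python main.py --help`), a Hopper run might look like:

```
# Hypothetical invocation; the --batch-size flag is an assumption.
python main.py --env-name "Hopper-v1" --batch-size 25000
```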

## Results

Performance is more or less similar to that of the original code. Detailed results and plots are coming soon.

## Todo

- Plots.
- Collect data in multiple threads.