A parallel version of Trust Region Policy Optimization

A parallel implementation of Trust Region Policy Optimization on environments from OpenAI gym

Now includes hyperparameter adaptation as well! For more info, check my post on this project.

I'm working towards the ideas in this OpenAI research request. The code is based on this implementation.

I'm currently working with Danijar on an updated version of this preliminary paper, describing the multiple-actor setup.

How to run:

```
# This just runs a simple training on Reacher-v1.
python main.py

# For the commands used to recreate results, check trials.txt
```


- `--task`: which Gym environment to run on
- `--timesteps_per_batch`: how many timesteps to collect per policy iteration
- `--n_iter`: number of iterations
- `--gamma`: discount factor for future rewards
- `--max_kl`: maximum KL divergence between new and old policy
- `--cg_damping`: damping on the KL constraint (ratio of original gradient to use)
- `--num_threads`: how many async threads to use
- `--monitor`: whether to monitor progress for uploading results to Gym
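
As a rough illustration of what `--gamma` controls (this is a standalone sketch, not code from this repo), each timestep's return is its reward plus the discounted sum of all future rewards:

```python
def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = r_t + gamma * r_{t+1} + gamma^2 * r_{t+2} + ...
    A standard backward pass: each return reuses the one after it."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

print(discounted_returns([1.0, 1.0, 1.0], gamma=0.5))  # [1.75, 1.5, 1.0]
```

A `--gamma` close to 1 makes the policy weigh distant rewards almost as heavily as immediate ones; smaller values make it short-sighted.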