.. automodule:: stable_baselines.trpo_mpi

Trust Region Policy Optimization (TRPO) is an iterative approach for optimizing policies with guaranteed monotonic improvement.
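
As a rough sketch of what "trust region" means here (the notation follows the original paper and is not taken from this page), each iteration approximately solves a constrained problem: it improves a surrogate objective while keeping the new policy close to the old one in KL divergence.

.. math::

    \max_{\theta} \; \mathbb{E}_{s, a \sim \pi_{\theta_{\text{old}}}}
    \left[ \frac{\pi_{\theta}(a \mid s)}{\pi_{\theta_{\text{old}}}(a \mid s)} \, A^{\pi_{\theta_{\text{old}}}}(s, a) \right]
    \quad \text{subject to} \quad
    \mathbb{E}_{s} \left[ D_{\mathrm{KL}}\left( \pi_{\theta_{\text{old}}}(\cdot \mid s) \,\|\, \pi_{\theta}(\cdot \mid s) \right) \right] \le \delta

Here :math:`\pi_{\theta_{\text{old}}}` is the current policy, :math:`A^{\pi_{\theta_{\text{old}}}}` is an advantage estimate, and :math:`\delta` is the trust-region size. In this implementation, :math:`\delta` presumably corresponds to the ``max_kl`` argument of ``TRPO``.
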

.. note::

    TRPO requires :ref:`OpenMPI <openmpi>`. If OpenMPI isn't enabled, then TRPO isn't
    imported into the ``stable_baselines`` module.

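
Because TRPO is only exported when OpenMPI support is present, code that must also run on machines without MPI can guard the import. The following is a minimal sketch, assuming the missing export surfaces as an ``ImportError``; the fallback message is illustrative only.

.. code-block:: python

    # Sketch: detect whether TRPO is available before using it.
    try:
        from stable_baselines import TRPO
    except ImportError:
        # Either stable_baselines is not installed, or it was imported
        # without MPI support, so TRPO was not exported (assumed behaviour).
        TRPO = None

    if TRPO is None:
        print("TRPO is unavailable; check the OpenMPI installation.")
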
- Original paper: https://arxiv.org/abs/1502.05477
- OpenAI blog post: https://blog.openai.com/openai-baselines-ppo/
- ``mpirun -np 16 python -m stable_baselines.trpo_mpi.run_atari`` runs the algorithm
  for 40M frames = 10M timesteps on an Atari game. See help (``-h``) for more options.
- ``python -m stable_baselines.trpo_mpi.run_mujoco`` runs the algorithm for 1M timesteps
  on a MuJoCo environment.
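
The "40M frames = 10M timesteps" figure for Atari is consistent with each agent timestep covering four emulator frames. The snippet below only illustrates that arithmetic. The frame skip of 4 is an assumption inferred from the two numbers, and ``timesteps_to_frames`` is a hypothetical helper, not part of the library.

.. code-block:: python

    # Assumption: every agent timestep repeats its action for FRAME_SKIP frames.
    FRAME_SKIP = 4  # assumed, chosen to match "40M frames = 10M timesteps"


    def timesteps_to_frames(timesteps, frame_skip=FRAME_SKIP):
        """Convert agent timesteps to emulator frames (hypothetical helper)."""
        return timesteps * frame_skip


    print(timesteps_to_frames(10_000_000))  # 40000000
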
- Recurrent policies: ❌
- Multi processing: ✔️ (using MPI)
- Gym spaces:

============= ====== ===========
Space         Action Observation
============= ====== ===========
Discrete      ✔️      ✔️
Box           ✔️      ✔️
MultiDiscrete ✔️      ✔️
MultiBinary   ✔️      ✔️
============= ====== ===========


.. code-block:: python

    import gym

    from stable_baselines.common.policies import MlpPolicy
    from stable_baselines import TRPO

    env = gym.make('CartPole-v1')

    # Train the agent, then save it to disk
    model = TRPO(MlpPolicy, env, verbose=1)
    model.learn(total_timesteps=25000)
    model.save("trpo_cartpole")

    del model  # remove to demonstrate saving and loading

    model = TRPO.load("trpo_cartpole")

    # Run the trained agent indefinitely (interrupt to stop)
    obs = env.reset()
    while True:
        action, _states = model.predict(obs)
        obs, rewards, dones, info = env.step(action)
        env.render()
        if dones:
            # Start a new episode once the current one ends
            obs = env.reset()


.. autoclass:: TRPO
  :members:
  :inherited-members: