stable_baselines.trpo_mpi

TRPO

Trust Region Policy Optimization (TRPO) is an iterative approach for optimizing policies with guaranteed monotonic improvement.
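To make the "trust region" concrete, the problem TRPO solves at each iteration can be written as a constrained surrogate-objective maximization; the LaTeX below is a sketch using the original paper's notation (policy parameters \theta, advantage estimate \hat{A}_t, trust-region size \delta), which is not defined elsewhere on this page.

    \max_{\theta} \; \hat{\mathbb{E}}_t\left[ \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)} \, \hat{A}_t \right]
    \quad \text{s.t.} \quad
    \hat{\mathbb{E}}_t\left[ D_\mathrm{KL}\big(\pi_{\theta_\text{old}}(\cdot \mid s_t) \,\|\, \pi_\theta(\cdot \mid s_t)\big) \right] \le \delta

The guaranteed monotonic improvement mentioned above comes from bounding each update with this KL constraint rather than taking an unconstrained gradient step.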

Notes

  • Original paper: https://arxiv.org/abs/1502.05477

Can I use?

  • Recurrent policies: ✔️
  • Multiprocessing: ✔️ (using MPI; see the sketch after the table below)
  • Gym spaces:
    Space           Action   Observation
    Discrete        ✔️       ✔️
    Box             ✔️       ✔️
    MultiDiscrete   ✔️       ✔️
    MultiBinary     ✔️       ✔️
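Because TRPO parallelizes through MPI rather than through vectorized environments, multiprocessing amounts to launching the same training script under mpirun: each MPI worker collects its own rollouts and the updates are averaged across workers inside learn(). The snippet below is a minimal sketch of that pattern; the script name train_trpo.py and the worker count of 4 are illustrative, not prescribed.

# train_trpo.py -- minimal sketch of an MPI-parallel TRPO run.
# Launch with, e.g.:  mpirun -np 4 python train_trpo.py
import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import TRPO

# Each MPI worker builds its own environment and model; TRPO's MPI backend
# averages the updates across workers during learn().
env = DummyVecEnv([lambda: gym.make('CartPole-v1')])
model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)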

Example

import gym

from stable_baselines.common.policies import MlpPolicy, MlpLstmPolicy, MlpLnLstmPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import TRPO

env = gym.make('CartPole-v1')
env = DummyVecEnv([lambda: env])  # TRPO requires a vectorized environment

# Train the agent
model = TRPO(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save("trpo_cartpole")

del model  # remove to demonstrate saving and loading

# Load the trained agent
model = TRPO.load("trpo_cartpole")

# Run the trained agent
obs = env.reset()
while True:
    action, _states = model.predict(obs)
    obs, rewards, dones, info = env.step(action)
    env.render()

Parameters

TRPO
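The full constructor signature is rendered from the source by the API documentation. As a hedged illustration only, the snippet below names a few hyperparameters this implementation is expected to expose (timesteps_per_batch, max_kl, lam, entcoeff); treat the exact keyword names and defaults as assumptions and check the class docstring.

import gym

from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common.vec_env import DummyVecEnv
from stable_baselines import TRPO

env = DummyVecEnv([lambda: gym.make('CartPole-v1')])

# The keyword names and values below are assumptions drawn from the TRPO
# algorithm and this implementation, not a verified signature.
model = TRPO(
    MlpPolicy,
    env,
    gamma=0.99,                # discount factor
    timesteps_per_batch=1024,  # rollout length collected per update
    max_kl=0.01,               # trust-region size (KL constraint)
    lam=0.98,                  # GAE lambda
    entcoeff=0.0,              # entropy bonus coefficient
    verbose=1,
)
model.learn(total_timesteps=25000)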