# RL Exercise 2 - Proximal Policy Optimization

**GOAL:** The goal of this exercise is to demonstrate how to use the proximal policy optimization (PPO) algorithm.

To understand how to use **RLlib**, see the documentation at http://rllib.io.

PPO is described in detail in https://arxiv.org/abs/1707.06347. It is a variant of Trust Region Policy Optimization (TRPO), which is described in https://arxiv.org/abs/1502.05477.

PPO alternates between two phases. In the first phase, a large number of rollouts are performed in parallel. The rollouts are then aggregated on the driver, and a surrogate optimization objective is defined from them. In the second phase, SGD is used to find the policy that maximizes that objective, subject to a penalty term for diverging too far from the current policy.

![ppo](https://raw.githubusercontent.com/ucbrise/risecamp/risecamp2018/ray/tutorial/rllib_exercises/ppo.png)
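
To make the penalized objective concrete, here is a minimal pure-Python sketch of a surrogate objective with a KL penalty for a discrete action space. It is an illustration under simplifying assumptions, not RLlib's implementation: the function name, the fixed `kl_coeff`, and the toy numbers are all made up for this example, and the PPO paper also describes a clipped variant that is not shown here.

```python
import math


def surrogate_with_kl_penalty(old_probs, new_probs, actions, advantages, kl_coeff):
    """Sample estimate of E[ratio * advantage] - kl_coeff * KL(old || new).

    old_probs / new_probs hold, for each time step, the action probabilities
    under the rollout policy and under the candidate policy.
    """
    total_gain = 0.0
    total_kl = 0.0
    for p_old, p_new, a, adv in zip(old_probs, new_probs, actions, advantages):
        # The importance ratio corrects for the data coming from the old policy.
        ratio = p_new[a] / p_old[a]
        total_gain += ratio * adv
        # KL divergence between the two action distributions at this step.
        total_kl += sum(po * math.log(po / pn)
                        for po, pn in zip(p_old, p_new) if po > 0)
    n = len(actions)
    return total_gain / n - kl_coeff * total_kl / n


# Toy rollout (hypothetical numbers): two steps, two possible actions.
old = [[0.5, 0.5], [0.5, 0.5]]
actions = [0, 1]
advantages = [1.0, -1.0]

# Keeping the policy unchanged gives no gain and no penalty.
unchanged = surrogate_with_kl_penalty(old, old, actions, advantages, kl_coeff=0.2)
# Shifting probability toward the action with positive advantage raises the objective.
shifted = surrogate_with_kl_penalty(old, [[0.6, 0.4], [0.6, 0.4]],
                                    actions, advantages, kl_coeff=0.2)
# A very large penalty coefficient makes the same shift unattractive.
over_penalized = surrogate_with_kl_penalty(old, [[0.6, 0.4], [0.6, 0.4]],
                                           actions, advantages, kl_coeff=100.0)
print(unchanged, shifted, over_penalized)
```

The penalty coefficient sets how far the optimizer is allowed to move from the rollout policy: a small shift is rewarded when the coefficient is modest, and the same shift scores below "no change" when the coefficient is large.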

**NOTE:** The SGD optimization step is best performed in a data-parallel manner over multiple GPUs. This is exposed through the `num_gpus` field of the `config` dictionary (for this to work, you must be using a machine that has GPUs).

In [1]:
from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import gym
import ray
from ray.rllib.agents.ppo import PPOAgent, DEFAULT_CONFIG
from ray.tune.logger import pretty_print

Start up Ray. This must be done before we instantiate any RL agents.

In [2]:
ray.init(ignore_reinit_error=True)

Process STDOUT and STDERR is being redirected to /tmp/ray/session_2019-01-24_17-57-27_29893/logs.
Waiting for redis server at 127.0.0.1:45208 to respond...
Waiting for redis server at 127.0.0.1:12644 to respond...
Starting the Plasma object store with 20.0 GB memory using /dev/shm.

View the web UI at http://localhost:8890/notebooks/ray_ui.ipynb?token=489d93316da6e238de39d7b709a8dbc50ea6f352da6a5be8



{'node_ip_address': '192.168.23.45',
 'redis_address': '192.168.23.45:45208',
 'object_store_addresses': ['/tmp/ray/session_2019-01-24_17-57-27_29893/sockets/plasma_store'],
 'raylet_socket_names': ['/tmp/ray/session_2019-01-24_17-57-27_29893/sockets/raylet'],
 'webui_url': 'http://localhost:8890/notebooks/ray_ui.ipynb?token=489d93316da6e238de39d7b709a8dbc50ea6f352da6a5be8'}

Instantiate a PPOAgent object. We pass in a config object that specifies how the network and training procedure should be configured. Some of the parameters are the following.

- `num_workers` is the number of actors that the agent will create. This determines the degree of parallelism that will be used.
- `num_sgd_iter` is the number of epochs of SGD (passes through the data) that will be used to optimize the PPO surrogate objective at each iteration of PPO.
- `sgd_minibatch_size` is the SGD batch size that will be used to optimize the PPO surrogate objective.
- `model` contains a dictionary of parameters describing the neural net used to parameterize the policy. The `fcnet_hiddens` parameter is a list of the sizes of the hidden layers.
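
To get a feel for how `num_sgd_iter` and `sgd_minibatch_size` interact, the sketch below counts gradient steps per PPO iteration. It assumes each iteration trains on 4000 sampled steps (the `num_steps_trained` figure reported in this notebook's training logs) and that every epoch is split into minibatches with a final partial one. How RLlib actually handles a trailing partial minibatch is not modeled, and the function name is hypothetical.

```python
import math


def sgd_updates_per_iteration(train_batch_size, sgd_minibatch_size, num_sgd_iter):
    """Rough number of gradient steps taken in one PPO iteration."""
    # Each epoch is one pass over the batch, split into minibatches.
    minibatches_per_epoch = int(math.ceil(train_batch_size / float(sgd_minibatch_size)))
    return num_sgd_iter * minibatches_per_epoch


# Settings used in this notebook: 30 epochs with minibatches of 128.
heavy = sgd_updates_per_iteration(4000, 128, 30)
# A lighter, purely illustrative alternative: 10 epochs with minibatches of 500.
light = sgd_updates_per_iteration(4000, 500, 10)
print(heavy, light)
```

Under these assumptions the notebook's settings take 960 gradient steps per iteration, while the lighter alternative takes 80, which is why fewer epochs and larger minibatches shorten the optimization phase.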

In [3]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0  # This avoids running out of resources in the notebook environment when this cell is re-executed

agent = PPOAgent(config, 'CartPole-v0')

Created LogSyncer for /home/jared/ray_results/PPO_CartPole-v0_2019-01-24_17-57-28dot8dw05 -> None
  result = entry_point.load(False)
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "
2019-01-24 17:57:29,251	INFO multi_gpu_optimizer.py:62 -- LocalMultiGPUOptimizer devices ['/cpu:0']
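
One subtlety when building configs with `DEFAULT_CONFIG.copy()`: `dict.copy()` is shallow, so the nested `model` dictionary is shared with the original. The sketch below shows the effect on a stand-in dictionary (the `DEFAULTS` name and its values are hypothetical, not RLlib's real defaults) and how `copy.deepcopy` avoids it.

```python
import copy

# Stand-in for a nested config; NOT RLlib's real DEFAULT_CONFIG.
DEFAULTS = {'num_workers': 2, 'model': {'fcnet_hiddens': [256, 256]}}

shallow = DEFAULTS.copy()
shallow['num_workers'] = 3                      # top-level key: DEFAULTS is unaffected
shallow['model']['fcnet_hiddens'] = [100, 100]  # nested dict is shared: DEFAULTS changes too
leaked = DEFAULTS['model']['fcnet_hiddens']

DEFAULTS['model']['fcnet_hiddens'] = [256, 256]  # reset the stand-in
deep = copy.deepcopy(DEFAULTS)
deep['model']['fcnet_hiddens'] = [100, 100]      # independent copy: DEFAULTS is untouched
kept = DEFAULTS['model']['fcnet_hiddens']

print(leaked, kept)
```

In this notebook the shared dictionary is harmless because every cell writes the same `fcnet_hiddens` value, but a deep copy is the safer choice when experimenting with several different model settings in one session.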


Train the policy on the `CartPole-v0` environment for 2 training iterations. The CartPole problem is described at https://gym.openai.com/envs/CartPole-v0.

**EXERCISE:** Inspect how well the policy is doing by looking for lines in the printed results like

```
episode_reward_mean: 21.409523809523808
episode_len_mean: 21.409523809523808
```

This indicates how much reward the policy is receiving and how many time steps of the environment the policy ran. The maximum possible reward for this problem is 200. The reward and trajectory length are very close because the agent receives a reward of one for every time step that it survives (however, that is specific to this environment).

In [4]:
for i in range(2):
    result = agent.train()
    print(pretty_print(result))

  if np.issubdtype(value, float):


custom_metrics: {}
date: 2019-01-24_17-57-40
done: false
episode_len_mean: 21.409523809523808
episode_reward_max: 60.0
episode_reward_mean: 21.409523809523808
episode_reward_min: 8.0
episodes_this_iter: 210
episodes_total: 210
experiment_id: e1c8873880fa47cca619a31a8b3399f6
hostname: santaka
info:
  cur_lr: 4.999999873689376e-05
  entropy: 0.662660539150238
  grad_time_ms: 2363.136
  kl: 0.032554734498262405
  load_time_ms: 67.202
  num_steps_sampled: 4000
  num_steps_trained: 4000
  policy_loss: -0.041429586708545685
  sample_time_ms: 1610.597
  total_loss: 161.13922119140625
  update_time_ms: 784.02
  vf_explained_var: 0.03205602988600731
  vf_loss: 161.1741485595703
iterations_since_restore: 1
node_ip: 192.168.23.45
num_metric_batches_dropped: 0
pid: 29893
policy_reward_mean: {}
time_since_restore: 4.8688483238220215
time_this_iter_s: 4.8688483238220215
time_total_s: 4.8688483238220215
timestamp: 1548377860
timesteps_since_restore: 4000
timesteps_this_iter: 4000
timesteps_total: 4000

**EXERCISE:** The current network and training configuration are too large and heavy-duty for a simple problem like CartPole. Modify the configuration to use a smaller network and to speed up the optimization of the surrogate objective (fewer SGD iterations and a larger batch size should help).

In [5]:
config = DEFAULT_CONFIG.copy()
config['num_workers'] = 3
config['num_sgd_iter'] = 30
config['sgd_minibatch_size'] = 128
config['model']['fcnet_hiddens'] = [100, 100]
config['num_cpus_per_worker'] = 0

agent = PPOAgent(config, 'CartPole-v0')

Created LogSyncer for /home/jared/ray_results/PPO_CartPole-v0_2019-01-24_17-57-43z6o57v08 -> None
2019-01-24 17:57:44,853	INFO multi_gpu_optimizer.py:62 -- LocalMultiGPUOptimizer devices ['/cpu:0']


**EXERCISE:** Train the agent and try to get a reward of 200. If it's training too slowly, you may need to modify the config above to use fewer hidden units, a larger `sgd_minibatch_size`, a smaller `num_sgd_iter`, or a larger `num_workers`.

This should take around 20 or 30 training iterations.
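
Rather than hard-coding the number of iterations, one option is to loop until the mean reward reaches a target. The sketch below assumes `agent.train()` returns a dictionary containing `episode_reward_mean`, as the printed results in this notebook suggest. `train_until` and `FakeAgent` are hypothetical names introduced here so the example runs without Ray.

```python
def train_until(agent, target_reward, max_iters):
    """Call agent.train() until episode_reward_mean reaches target_reward.

    Returns (iterations_run, last_mean_reward). Stops after max_iters
    iterations even if the target was not reached.
    """
    mean_reward = float('-inf')
    for i in range(1, max_iters + 1):
        result = agent.train()
        mean_reward = result['episode_reward_mean']
        if mean_reward >= target_reward:
            return i, mean_reward
    return max_iters, mean_reward


class FakeAgent(object):
    """Stand-in for a real agent: the mean reward rises by 25 per call."""

    def __init__(self):
        self.reward = 0.0

    def train(self):
        self.reward += 25.0
        return {'episode_reward_mean': self.reward}


iters, reward = train_until(FakeAgent(), target_reward=200, max_iters=50)
print(iters, reward)
```

With a real agent the call would look like `train_until(agent, 200, 50)`. The `max_iters` cap keeps the loop from running forever if the configuration never reaches the target.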

In [6]:
for i in range(2):
    result = agent.train()
    print(pretty_print(result))

custom_metrics: {}
date: 2019-01-24_17-57-51
done: false
episode_len_mean: 21.956521739130434
episode_reward_max: 56.0
episode_reward_mean: 21.956521739130434
episode_reward_min: 9.0
episodes_this_iter: 207
episodes_total: 207
experiment_id: f8999da60abf40ca97844bc098ba9571
hostname: santaka
info:
  cur_lr: 4.999999873689376e-05
  entropy: 0.6624599099159241
  grad_time_ms: 2663.643
  kl: 0.0321684293448925
  load_time_ms: 36.153
  num_steps_sampled: 4000
  num_steps_trained: 4000
  policy_loss: -0.04183917120099068
  sample_time_ms: 985.151
  total_loss: 162.8946533203125
  update_time_ms: 364.865
  vf_explained_var: 0.050697725266218185
  vf_loss: 162.9300994873047
iterations_since_restore: 1
node_ip: 192.168.23.45
num_metric_batches_dropped: 0
pid: 29893
policy_reward_mean: {}
time_since_restore: 4.083256721496582
time_this_iter_s: 4.083256721496582
time_total_s: 4.083256721496582
timestamp: 1548377871
timesteps_since_restore: 4000
timesteps_this_iter: 4000
timesteps_total: 4000
tra

Checkpoint the current model. The call to `agent.save()` returns the path to the checkpointed model and can be used later to restore the model.

In [7]:
checkpoint_path = agent.save()
print(checkpoint_path)

/home/jared/ray_results/PPO_CartPole-v0_2019-01-24_17-57-43z6o57v08/checkpoint_2/checkpoint-2


Now let's use the trained policy to make predictions.

**NOTE:** Here we are loading the trained policy in the same process, but in practice, this would often be done in a different process (probably on a different machine).

In [8]:
trained_config = config.copy()

test_agent = PPOAgent(trained_config, 'CartPole-v0')
test_agent.restore(checkpoint_path)

Created LogSyncer for /home/jared/ray_results/PPO_CartPole-v0_2019-01-24_17-57-54gp29qa4v -> None
2019-01-24 17:57:55,394	INFO multi_gpu_optimizer.py:62 -- LocalMultiGPUOptimizer devices ['/cpu:0']


Now use the trained policy to act in an environment. The key line is the call to `test_agent.compute_action(state)`, which uses the trained policy to choose an action.

**EXERCISE:** Verify that the reward received roughly matches up with the reward printed in the training logs.

In [9]:
env = gym.make('CartPole-v0')
state = env.reset()
done = False
cumulative_reward = 0

while not done:
    action = test_agent.compute_action(state)
    state, reward, done, _ = env.step(action)
    cumulative_reward += reward

print(cumulative_reward)

27.0
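
A single episode is a noisy estimate, so averaging over several episodes gives a fairer comparison with `episode_reward_mean` from the training logs. The sketch below uses the same `reset()` / `step()` interface as the cell above. `average_return` and `CountdownEnv` are hypothetical names, and the stand-in environment exists only so the example runs without Gym.

```python
def average_return(env, choose_action, num_episodes):
    """Mean cumulative reward over num_episodes complete episodes."""
    total = 0.0
    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            state, reward, done, _ = env.step(choose_action(state))
            total += reward
    return total / num_episodes


class CountdownEnv(object):
    """Stand-in environment: reward 1 per step, every episode lasts 5 steps."""

    def reset(self):
        self.steps_left = 5
        return self.steps_left

    def step(self, action):
        self.steps_left -= 1
        return self.steps_left, 1.0, self.steps_left == 0, {}


mean_return = average_return(CountdownEnv(), lambda state: 0, num_episodes=10)
print(mean_return)
```

With the objects from this notebook, the equivalent call would be `average_return(env, test_agent.compute_action, 20)`.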


## Visualize results with TensorBoard

**EXERCISE:** Finally, you can visualize your training results using TensorBoard. To do this, open a new terminal in JupyterLab using the "+" button, and run:
    
`$ tensorboard --logdir=~/ray_results --host=0.0.0.0`

Then open your browser to the address printed (or change the current URL to go to port 6006). Check the "episode_reward_mean" learning curve of the PPO agent. Toggle the horizontal axis between the "STEP" and "RELATIVE" views to compare efficiency in number of timesteps versus wall-clock time.