# Ray RLlib - Extra Application Example - MountainCar-v0

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../../../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

This example uses [RLlib](https://ray.readthedocs.io/en/latest/rllib.html) to train a policy with the `MountainCar-v0` environment ([gym.openai.com/envs/MountainCar-v0/](https://gym.openai.com/envs/MountainCar-v0/)). The idea is that a cart starts at an arbitrary point on a hill. Without any "pushes", it rocks back and forth between the two sides of the valley below, never rising above the starting point. However, there are three actions: accelerate to the left (by some unit), accelerate to the right, or apply no acceleration. Timing accelerations in the appropriate directions at the appropriate steps is the key to getting to the top of the hill.
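
To make the "timing" point concrete, here is a minimal pure-Python sketch of the car's motion with three hand-written policies. It does not use Gym or RLlib. The constants and update rule are taken from the Gym `mountain_car.py` source linked later in this lesson, so treat them as assumptions if your Gym version differs. The policy names (`coast`, `always_right`, `follow_momentum`) are made up for this illustration.

```python
import math

# Constants as they appear in the Gym source (mountain_car.py);
# treat them as assumptions if your Gym version differs.
MIN_POSITION, MAX_POSITION = -1.2, 0.6
MAX_SPEED = 0.07
GOAL_POSITION = 0.5
FORCE = 0.001
GRAVITY = 0.0025


def step(position, velocity, action):
    """Advance one time step; action is 0 (left), 1 (none), or 2 (right)."""
    velocity += (action - 1) * FORCE - math.cos(3 * position) * GRAVITY
    velocity = max(-MAX_SPEED, min(MAX_SPEED, velocity))
    position += velocity
    position = max(MIN_POSITION, min(MAX_POSITION, position))
    if position == MIN_POSITION and velocity < 0:
        velocity = 0.0  # the car stops when it hits the left wall
    return position, velocity


def run_episode(policy, start_position=-0.5, max_steps=200):
    """Return the number of steps needed to reach the goal, or None on failure."""
    position, velocity = start_position, 0.0
    for t in range(1, max_steps + 1):
        action = policy(position, velocity)
        position, velocity = step(position, velocity, action)
        if position >= GOAL_POSITION:
            return t
    return None


def coast(position, velocity):
    return 1  # never accelerate


def always_right(position, velocity):
    return 2  # push toward the goal on every step


def follow_momentum(position, velocity):
    return 2 if velocity >= 0 else 0  # push in the direction of travel


for policy in (coast, always_right, follow_momentum):
    print(f"{policy.__name__:16s} steps to goal: {run_episode(policy)}")
```

In this sketch, coasting and pushing right on every step both time out, while pushing in the direction of travel builds enough momentum to reach the flag within the 200-step limit. That hand-written rule is the kind of behavior the trained policy has to discover on its own.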

The primary idea demonstrated in this lesson is how to start from a previous checkpoint. A checkpoint is provided in the `mountain-car-checkpoint` directory, captured after 200 training iterations. Still, even with the provided checkpoint and an additional 50 episodes of training, the cart is unable to reach the top.

Hence, you should consider this lesson a big exercise to try when you aren't pressed for time (as you would be in a class setting). Modifications you can try are discussed below.

> **Note:** The rollout step near the end of this lesson can only show its visualization popup windows when running on a local laptop.

Like `CartPole`, _MountainCar_ is one of OpenAI Gym's ["classic control"](https://gym.openai.com/envs/#classic_control) examples.

For more background about this problem, see:

* ["Efficient memory-based learning for robot control"](https://www.cl.cam.ac.uk/techreports/UCAM-CL-TR-209.pdf), [Andrew William Moore](https://www.cl.cam.ac.uk/~awm22/), University of Cambridge (1990)
* ["Solving Mountain Car with Q-Learning"](https://medium.com/@ts1829/solving-mountain-car-with-q-learning-b77bf71b1de2), [Tim Sullivan](https://twitter.com/ts_1829)

Import Ray and the PPO support, then start Ray…

In [1]:
import pandas as pd
import json, os, shutil, sys
import ray
import ray.rllib.agents.ppo as ppo

Let's start up Ray as in the previous lessons:

In [2]:
!../../../tools/start-ray.sh --check --verbose


INFO: Ray is not running. Run ../tools/start-ray.sh with no options in a terminal window to start Ray.
INFO: (You can start a terminal in Jupyter. Click the + under the Edit menu.)



In [3]:
ray.init(address='auto', ignore_reinit_error=True)

2020-06-13 12:14:28,026	INFO resource_spec.py:212 -- Starting Ray with 4.0 GiB memory available for workers and up to 2.02 GiB for objects. You can adjust these settings with ray.init(memory=<bytes>, object_store_memory=<bytes>).
2020-06-13 12:14:28,345	INFO services.py:1170 -- View the Ray dashboard at localhost:8265


{'node_ip_address': '192.168.1.149',
 'raylet_ip_address': '192.168.1.149',
 'redis_address': '192.168.1.149:43983',
 'object_store_address': '/tmp/ray/session_2020-06-13_12-14-28_013436_4596/sockets/plasma_store',
 'raylet_socket_name': '/tmp/ray/session_2020-06-13_12-14-28_013436_4596/sockets/raylet',
 'webui_url': 'localhost:8265',
 'session_dir': '/tmp/ray/session_2020-06-13_12-14-28_013436_4596'}

The Ray Dashboard is useful for monitoring Ray:

In [4]:
print(f'Dashboard URL: http://{ray.get_webui_url()}')

Dashboard URL: http://localhost:8265


Next we'll train an RLlib policy with the `MountainCar-v0` environment.

By default, training runs for `20` iterations (the `N_ITER` value set below). Increase the `N_ITER` setting if you want to train longer and see whether the resulting rewards improve.
Also note that *checkpoints* get saved after each iteration into the `tmp/ppo/mountain-car` directory.

For `MountainCar`, the environment has these parameters and behaviors (from this [source code](https://github.com/openai/gym/blob/master/gym/envs/classic_control/mountain_car.py)):

```
Observation
    Type: Box(2)
    Num    Observation               Min            Max
    0      Car Position              -1.2           0.6
    1      Car Velocity              -0.07          0.07
Actions:
    Type: Discrete(3)
    Num    Action
    0      Accelerate to the Left
    1      Don't accelerate
    2      Accelerate to the Right
    Note: This does not affect the amount of velocity affected by the
    gravitational pull acting on the car.
Reward:
     Reward of 0 is awarded if the agent reached the flag (position = 0.5)
     on top of the mountain.
     Reward of -1 is awarded if the position of the agent is less than 0.5.
Starting State:
     The position of the car is assigned a uniform random value in
     [-0.6 , -0.4].
     The starting velocity of the car is always assigned to 0.
 Episode Termination:
     The car position is more than 0.5
     Episode length is greater than 200
```

Clean up checkpoints left over from previous runs:

In [41]:
checkpoint_root = 'tmp/ppo/mountain-car'
shutil.rmtree(checkpoint_root, ignore_errors=True, onerror=None)   # clean up old runs

Here is the default configuration for PPO applied to this environment. There are no configuration parameters that are passed to _MountainCar_ itself:

In [33]:
ppo.DEFAULT_CONFIG

{'num_workers': 2,
 'num_envs_per_worker': 1,
 'rollout_fragment_length': 200,
 'sample_batch_size': -1,
 'batch_mode': 'truncate_episodes',
 'num_gpus': 0,
 'train_batch_size': 4000,
 'model': {'conv_filters': None,
  'conv_activation': 'relu',
  'fcnet_activation': 'tanh',
  'fcnet_hiddens': [256, 256],
  'free_log_std': False,
  'no_final_linear': False,
  'vf_share_layers': True,
  'use_lstm': False,
  'max_seq_len': 20,
  'lstm_cell_size': 256,
  'lstm_use_prev_action_reward': False,
  'state_shape': None,
  'framestack': True,
  'dim': 84,
  'grayscale': False,
  'zero_mean': True,
  'custom_model': None,
  'custom_action_dist': None,
  'custom_options': {},
  'custom_preprocessor': None},
 'optimizer': {},
 'gamma': 0.99,
 'horizon': None,
 'soft_horizon': False,
 'no_done_at_end': False,
 'env_config': {},
 'env': None,
 'normalize_actions': False,
 'clip_rewards': None,
 'clip_actions': True,
 'preprocessor_pref': 'deepmind',
 'lr': 5e-05,
 'monitor': False,
 'log_level': 'WAR

The next cell copies the default configuration and makes a few modifications, like a larger training batch size. Other changes you might consider are the following:

* Tweak the `model` parameters for the neural net.
* Try other `train_batch_size` values (default: `4000`).
* SGD parameters: `num_sgd_iter` and `sgd_minibatch_size`.

To speed up training:

* Increase the `num_workers` to fully utilize your available machine or cluster.
* Use GPUs if you have them available.
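
One caveat if you tweak the nested `model` parameters: `dict.copy()` is a shallow copy, so the nested `model` dictionary is still shared with the original. The sketch below uses a small stand-in dictionary rather than the real `ppo.DEFAULT_CONFIG`, and the helper `make_config` is hypothetical, not part of RLlib. It assumes the default configuration is a plain nested `dict`, as the listing above suggests.

```python
import copy

# Hypothetical stand-in for ppo.DEFAULT_CONFIG, trimmed to a few keys
# that appear in the listing above.
DEFAULT_CONFIG = {
    "num_workers": 2,
    "train_batch_size": 4000,
    "model": {"fcnet_hiddens": [256, 256], "fcnet_activation": "tanh"},
}

# Pitfall: a shallow copy shares the nested "model" dict with its source.
demo_defaults = copy.deepcopy(DEFAULT_CONFIG)
shallow = demo_defaults.copy()
shallow["model"]["fcnet_hiddens"] = [64, 64]
print(demo_defaults["model"]["fcnet_hiddens"])  # [64, 64]: the source changed too


def make_config(defaults, overrides):
    """Deep-copy the defaults, then apply overrides (merging nested dicts one level deep)."""
    config = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(config.get(key), dict):
            config[key].update(value)
        else:
            config[key] = value
    return config


config = make_config(
    DEFAULT_CONFIG,
    {"num_workers": 4, "model": {"fcnet_hiddens": [64, 64]}},
)
print(config["model"])
print(DEFAULT_CONFIG["model"])  # unchanged
```

Top-level assignments such as `config["num_workers"] = 4` are safe with a shallow copy; only in-place changes to nested values need the deep copy.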

In [43]:
SELECT_ENV = "MountainCar-v0"
N_ITER = 20

In [45]:
config = ppo.DEFAULT_CONFIG.copy()
config["log_level"] = "WARN"            # the default, at this time
config["num_workers"] = 4               # default = 2
config["train_batch_size"] = 10000      # default = 4000
config["sgd_minibatch_size"] = 256      # default = 128
config["evaluation_num_episodes"] = 50  # default = 10

agent = ppo.PPOTrainer(config, env=SELECT_ENV)

2020-06-13 13:52:03,349	INFO trainable.py:217 -- Getting current IP.


In [46]:
agent.restore('mountain-car-checkpoint/checkpoint-200')

2020-06-13 13:52:03,555	INFO trainable.py:217 -- Getting current IP.
2020-06-13 13:52:03,562	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: mountain-car-checkpoint/checkpoint-200
2020-06-13 13:52:03,568	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 200, '_timesteps_total': 1100000, '_time_total': 1285.8826241493225, '_episodes_total': 5500}
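
The restored state already hints at how well the checkpointed policy performs. A quick sketch of the arithmetic, using only the numbers printed in the log line above (the variable names are chosen for this illustration):

```python
# State dictionary copied from the "Current state after restoring" log line.
state = {
    "_iteration": 200,
    "_timesteps_total": 1100000,
    "_time_total": 1285.8826241493225,
    "_episodes_total": 5500,
}

mean_episode_len = state["_timesteps_total"] / state["_episodes_total"]
episodes_per_iteration = state["_episodes_total"] / state["_iteration"]
seconds_per_iteration = state["_time_total"] / state["_iteration"]

print(f"mean episode length:    {mean_episode_len:.1f} steps")
print(f"episodes per iteration: {episodes_per_iteration:.1f}")
print(f"seconds per iteration:  {seconds_per_iteration:.2f}")
```

The mean episode length works out to exactly 200 steps, which is the episode step limit. Since no episode can run longer than the limit, this suggests that every episode in the first 200 iterations timed out without reaching the flag.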


In [37]:
results = []
episode_data = []
episode_json = []

for n in range(N_ITER):
    result = agent.train()
    results.append(result)
    episode = {'n': n, 
               'episode_reward_min': result['episode_reward_min'], 
               'episode_reward_mean': result['episode_reward_mean'], 
               'episode_reward_max': result['episode_reward_max'],  
               'episode_len_mean': result['episode_len_mean']}
    episode_data.append(episode)
    episode_json.append(json.dumps(episode))
    file_name = agent.save(checkpoint_root)
    print(f'{n+1:3d}: Min/Mean/Max reward: {result["episode_reward_min"]:8.4f}/{result["episode_reward_mean"]:8.4f}/{result["episode_reward_max"]:8.4f}, len mean: {result["episode_len_mean"]:8.4f}. Checkpoint saved to {file_name}')

  1: Min/Mean/Max reward: -200.0000/-200.0000/-200.0000, len mean: 200.0000. Checkpoint saved to tmp/ppo/mountain-car/checkpoint_201/checkpoint-201
  2: Min/Mean/Max reward: -200.0000/-200.0000/-200.0000, len mean: 200.0000. Checkpoint saved to tmp/ppo/mountain-car/checkpoint_202/checkpoint-202
  3: Min/Mean/Max reward: -200.0000/-200.0000/-200.0000, len mean: 200.0000. Checkpoint saved to tmp/ppo/mountain-car/checkpoint_203/checkpoint-203
  4: Min/Mean/Max reward: -200.0000/-200.0000/-200.0000, len mean: 200.0000. Checkpoint saved to tmp/ppo/mountain-car/checkpoint_204/checkpoint-204
  5: Min/Mean/Max reward: -200.0000/-200.0000/-200.0000, len mean: 200.0000. Checkpoint saved to tmp/ppo/mountain-car/checkpoint_205/checkpoint-205
  6: Min/Mean/Max reward: -200.0000/-200.0000/-200.0000, len mean: 200.0000. Checkpoint saved to tmp/ppo/mountain-car/checkpoint_206/checkpoint-206
  7: Min/Mean/Max reward: -200.0000/-200.0000/-200.0000, len mean: 200.0000. Checkpoint saved to tmp/ppo/mountai
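
The loop above collects `episode_data` and `episode_json` but does not write them anywhere. A minimal sketch, assuming you want to keep the per-iteration statistics for later comparison, is to store them as JSON Lines using only the standard library. The file name `episodes.jsonl` and the helper names are hypothetical, and the sample records simply repeat the first three output lines shown above.

```python
import json
import os
import tempfile

# Sample records mirroring the first three iterations printed above.
episode_data = [
    {"n": n,
     "episode_reward_min": -200.0,
     "episode_reward_mean": -200.0,
     "episode_reward_max": -200.0,
     "episode_len_mean": 200.0}
    for n in range(3)
]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def read_jsonl(path):
    """Read the records back, skipping blank lines."""
    with open(path) as f:
        return [json.loads(line) for line in f if line.strip()]


path = os.path.join(tempfile.mkdtemp(), "episodes.jsonl")
write_jsonl(episode_data, path)
reloaded = read_jsonl(path)

# Iterations in which at least one episode beat the 200-step timeout.
improved = [r["n"] for r in reloaded if r["episode_reward_max"] > -200.0]
print(f"{len(reloaded)} records reloaded; iterations with a success: {improved}")
```

Checking `episode_reward_max` rather than the mean is a quick way to spot the first iteration in which any episode reached the flag.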

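Training gives up on an episode after 200 steps. Per the environment description above, every step on which the cart is short of the flag earns a reward of `-1`, so an episode that runs `N` steps without reaching the top scores `-1*N`, which is `-200` at the step limit. Reaching the top adds no positive reward; it only ends the episode early and stops the penalties. Hence, there are no incremental rewards that signal progress toward the goal; it's success or failure.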

Let's print out the policy and model to see the results of training in detail…

In [38]:
import pprint

policy = agent.get_policy()
model = policy.model

pprint.pprint(model.variables())
pprint.pprint(model.value_function())

print(model.base_model.summary())

[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(2, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/kernel:0' shape=(2, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_1/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/kernel:0' shape=(256, 256) dtype=float32>,
 <tf.Variable 'default_policy/fc_value_2/bias:0' shape=(256,) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/kernel:0' shape=(256, 3) dtype=float32>,
 <tf.Variable 'default_policy/fc_out/bias:0' shape=(3,) dtype=float32>,
 <tf.Variable 'default_policy/value_out/kernel:0' shape=(256, 1) dtype=float32>,
 <tf.Variable 'default_policy/value_out/bias:0' shape=(1,) dtype=float32>]
<tf.Tensor 'Reshape:0' shape=(?,) dtype=float32>
Model: "model"
__________

## Rollout

Next we'll use the [`rollout` script](https://ray.readthedocs.io/en/latest/rllib-training.html#evaluating-trained-policies) to evaluate the trained policy.

This visualizes the "car" agent operating within the simulation: rocking back and forth to gain momentum to overcome the mountain, using the last checkpoint. Edit the number in the checkpoint path if necessary! Also change the configuration to match the changes above.

> **Note:** This rollout can only show the visualization popup windows when running on a local laptop.

In [48]:
!RAY_ADDRESS=auto rllib rollout \
    tmp/ppo/mountain-car/checkpoint_200/checkpoint-200 \
    --config '{"env": "MountainCar-v0", "num_workers":4, "train_batch_size":10000, "sgd_minibatch_size":256, "evaluation_num_episodes":50}' --run PPO \
    --steps 2000

2020-06-13 13:55:39,634	INFO trainer.py:421 -- Tip: set 'eager': true or the --eager flag to enable TensorFlow eager execution
2020-06-13 13:55:39,651	INFO trainer.py:580 -- Current log_level is WARN. For more information, set 'log_level': 'INFO' / 'DEBUG' or use the -v and -vv flags.
2020-06-13 13:55:43,327	INFO trainable.py:217 -- Getting current IP.
2020-06-13 13:55:43,399	INFO trainable.py:217 -- Getting current IP.
2020-06-13 13:55:43,400	INFO trainable.py:423 -- Restored on 192.168.1.149 from checkpoint: tmp/ppo/mountain-car/checkpoint_200/checkpoint-200
2020-06-13 13:55:43,400	INFO trainable.py:430 -- Current state after restoring: {'_iteration': 200, '_timesteps_total': 1100000, '_time_total': 1285.8826241493225, '_episodes_total': 5500}
Episode #0: reward: -200.0
Episode #1: reward: -200.0
Episode #2: reward: -200.0
Episode #3: reward: -200.0
Episode #4: reward: -200.0
Episode #5: reward: -200.0
Episode #6: reward: -200.0
Episode #7: reward: -200.0
Episode #8: reward: -200.0
E

2020-06-13 14:05:02,754	ERROR worker.py:1092 -- listen_error_messages_raylet: Connection closed by server.
2020-06-13 14:05:02,758	ERROR import_thread.py:93 -- ImportThread: Connection closed by server.
2020-06-13 14:05:02,759	ERROR worker.py:992 -- print_logs: Connection closed by server.


The rollout uses checkpoint `200`, evaluated through `2000` steps.
Modify the path to view other checkpoints.
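
Because the `--config` argument has to match the training overrides, one option is to generate the command from the same values instead of editing the JSON by hand. The sketch below only builds and prints the command string. The helper `rollout_command` is hypothetical, and it uses only the `rllib rollout` arguments that appear in the cell above. The checkpoint path is taken from the training output earlier in this lesson.

```python
import json
import shlex


def rollout_command(checkpoint, env, overrides, steps, run="PPO"):
    """Build an `rllib rollout` command line with a shell-quoted JSON config."""
    config = dict(overrides, env=env)
    parts = [
        "rllib", "rollout", checkpoint,
        "--config", json.dumps(config),
        "--run", run,
        "--steps", str(steps),
    ]
    return " ".join(shlex.quote(part) for part in parts)


# The same overrides that were applied to the training configuration.
overrides = {
    "num_workers": 4,
    "train_batch_size": 10000,
    "sgd_minibatch_size": 256,
    "evaluation_num_episodes": 50,
}

command = rollout_command(
    "tmp/ppo/mountain-car/checkpoint_201/checkpoint-201",
    "MountainCar-v0",
    overrides,
    steps=2000,
)
print(command)
```

Paste the printed command into a notebook cell after `!`, prefixed with `RAY_ADDRESS=auto` as in the cell above if you are connecting to a running Ray cluster.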

## Exercise ("Homework")

In addition to _Mountain Car_ and _Cart Pole_, there are other so-called ["classic control"](https://gym.openai.com/envs/#classic_control) examples you can try.