# Exercise 02: Running RLlib Experiments

This tutorial walks you through running traffic simulations in Flow with trainable, RLlib-powered agents. Using the **RLlib** library, autonomous agents learn to maximize a reward over the course of simulation rollouts. Simulations of this form illustrate how RL agents can influence the traffic of a human-driven fleet to make the whole fleet more efficient with respect to a given metric.

In this exercise, we simulate an initially perturbed single-lane ring road into which we introduce a single autonomous vehicle. After some training, the autonomous vehicle learns to dissipate the "phantom jams" that form and propagate when only human driver dynamics are involved.

## 1. Components of a Simulation
All simulations, both with and without RL, require two components: a *scenario* and an *environment*. Scenarios describe the features of the transportation network used in simulation. This includes the positions and properties of the nodes and edges constituting the lanes and junctions, as well as the properties of the vehicles, traffic lights, inflows, etc. in the network. Environments, on the other hand, initialize, reset, and advance simulations, and act as the primary interface between the reinforcement learning algorithm and the scenario. Moreover, custom environments may be used to modify the dynamical features of a scenario. Finally, in the RL case, it is in the *environment* that the state/action spaces and the reward function are defined.

## 2. Setting up a Scenario
Flow contains a plethora of pre-designed scenarios used to replicate highways, intersections, and merges in both closed and open settings. All these scenarios are located in flow/scenarios. For this exercise, which involves a single lane ring road, we will use the scenario `LoopScenario`.

### 2.1 Setting up Scenario Parameters

The scenario mentioned at the start of this section, like all other scenarios in Flow, is parameterized by the following arguments:
* name
* generator_class
* vehicles
* net_params
* initial_config

These parameters are explained in detail in exercise 1. Moreover, all parameters excluding vehicles (covered in section 2.2) do not change from the previous exercise. Accordingly, we specify them as we have before, and leave further explanations of the parameters to exercise 1.

In [1]:
# ring road scenario class
scenario_name = "LoopScenario"

# ring road generator class
generator_name = "CircleGenerator"

# input parameter classes to the scenario class
from flow.core.params import NetParams, InitialConfig

# name of the scenario
name = "training_example"

# network-specific parameters
from flow.scenarios.loop.loop_scenario import ADDITIONAL_NET_PARAMS
net_params = NetParams(additional_params=ADDITIONAL_NET_PARAMS)

# initial configuration to vehicles
initial_config = InitialConfig(spacing="uniform", perturbation=1)
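To make `spacing="uniform"` and `perturbation=1` more concrete, the following sketch places vehicles evenly around a ring and then jitters each position with Gaussian noise. It is an illustration under assumptions rather than Flow's own code: the helper `uniform_positions`, the ring length of 230 m, and the reading of `perturbation` as a standard deviation in meters are all assumed here. The fleet size of 22 matches the number of vehicles used in this exercise.

```python
import random


def uniform_positions(num_vehicles, ring_length, perturbation=0.0, seed=None):
    """Return start positions (m) evenly spaced on a ring, with optional jitter.

    Hypothetical helper for illustration; not part of Flow's API.
    """
    rng = random.Random(seed)
    gap = ring_length / num_vehicles
    positions = []
    for i in range(num_vehicles):
        # assumed: perturbation is the std (in meters) of Gaussian noise
        jitter = rng.gauss(0.0, perturbation) if perturbation > 0 else 0.0
        # wrap around so that every position stays on the ring
        positions.append((i * gap + jitter) % ring_length)
    return positions


unperturbed = uniform_positions(22, 230.0)
perturbed = uniform_positions(22, 230.0, perturbation=1.0, seed=0)
print(unperturbed[:3])
print(perturbed[:3])
```

Without the perturbation every gap is identical; with it, the gaps differ slightly from the start, which is what makes the ring "initially perturbed".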

### 2.2 Adding Trainable Autonomous Vehicles
The `Vehicles` class stores state information on all vehicles in the network. This class is used to identify the dynamical features of a vehicle and whether it is controlled by a reinforcement learning agent. Moreover, information pertaining to the observations and reward function can be collected from various `get` methods within this class.

The dynamics of vehicles in the `Vehicles` class can be dictated either by sumo or by the dynamical models located in `flow/controllers`. For human-driven vehicles, we use the IDM model for acceleration behavior, with exogenous Gaussian acceleration noise with a standard deviation of 0.2 m/s² to induce perturbations that produce stop-and-go behavior. In addition, we use the `ContinuousRouter` routing controller so that the vehicles maintain their routes in closed networks.
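For readers unfamiliar with the IDM (Intelligent Driver Model), the sketch below writes out its standard acceleration law with optional Gaussian noise. It is a standalone illustration, not Flow's `IDMController`: the function name `idm_acceleration` and the default parameter values are assumptions chosen for the example.

```python
import math
import random


def idm_acceleration(v, lead_v, headway, v0=30.0, T=1.0, a=1.0, b=1.5,
                     delta=4, s0=2.0, noise_std=0.0, rng=random):
    """Intelligent Driver Model acceleration (m/s^2) for one follower.

    v: own speed (m/s), lead_v: leader speed (m/s),
    headway: bumper-to-bumper gap to the leader (m).
    Parameter defaults are illustrative assumptions.
    """
    # desired gap grows with speed and with the closing rate on the leader
    s_star = s0 + max(0.0, v * T + v * (v - lead_v) / (2 * math.sqrt(a * b)))
    # free-road term minus interaction term
    accel = a * (1 - (v / v0) ** delta - (s_star / headway) ** 2)
    if noise_std > 0:
        # exogenous Gaussian acceleration noise
        accel += rng.gauss(0.0, noise_std)
    return accel


free_road = idm_acceleration(v=0.0, lead_v=0.0, headway=1e6)
tailgating = idm_acceleration(v=10.0, lead_v=10.0, headway=5.0)
noisy = idm_acceleration(v=10.0, lead_v=10.0, headway=30.0,
                         noise_std=0.2, rng=random.Random(0))
print(free_road, tailgating, noisy)
```

On an empty road the model accelerates at close to `a`, while a small headway produces strong braking. The noise term is what keeps perturbing the human drivers so that stop-and-go waves can form.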

As we have done in exercise 1, human-driven vehicles are defined in the `Vehicles` class as follows:

In [2]:
# vehicles class
from flow.core.vehicles import Vehicles

# vehicles dynamics models
from flow.controllers import IDMController, ContinuousRouter

vehicles = Vehicles()
vehicles.add("human",
             acceleration_controller=(IDMController, {}),
             routing_controller=(ContinuousRouter, {}),
             num_vehicles=21)

The above addition to the `Vehicles` class accounts for only 21 of the 22 vehicles placed in the network. We now add a trainable autonomous vehicle whose actions are dictated by an RL agent. This is done by specifying an `RLController` as the acceleration controller of the vehicle.

In [3]:
from flow.controllers import RLController

Note that this controller serves primarily as a placeholder that marks the vehicle as a component of the RL agent, meaning that lane-changing and routing actions can also be specified by the RL agent for this vehicle.

We finally add the vehicle as follows, while again using the `ContinuousRouter` to perpetually maintain the vehicle within the network.

In [4]:
vehicles.add(veh_id="rl",
             acceleration_controller=(RLController, {}),
             routing_controller=(ContinuousRouter, {}),
             num_vehicles=1)

## 3. Setting up an Environment

Several environments in Flow exist to train RL agents of different forms (e.g. autonomous vehicles, traffic lights) to perform a variety of tasks. An environment lets us view the cumulative reward that simulation rollouts receive and specify the state/action spaces.

Environments in Flow are parametrized by three components:
* env_params
* sumo_params
* scenario
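As a rough picture of what an environment provides to an RL algorithm, the sketch below defines a toy environment with a Gym-style `reset`/`step` interface and accumulates the reward over one rollout. Everything in it is hypothetical: `ToyFollowEnv`, its one-dimensional state, and its reward are made up for illustration and are unrelated to Flow's actual environments.

```python
class ToyFollowEnv:
    """Hypothetical one-vehicle environment with a Gym-style interface."""

    def __init__(self, horizon=100, target_speed=5.0, sim_step=0.1):
        self.horizon = horizon
        self.target_speed = target_speed
        self.sim_step = sim_step
        self.speed = 0.0
        self.t = 0

    def reset(self):
        """Start a new rollout and return the initial observation."""
        self.speed = 0.0
        self.t = 0
        return [self.speed]

    def step(self, accel):
        """Advance one simulation step under the given acceleration."""
        self.speed = max(0.0, self.speed + accel * self.sim_step)
        self.t += 1
        # reward: negative distance from the target speed
        reward = -abs(self.target_speed - self.speed)
        done = self.t >= self.horizon
        return [self.speed], reward, done, {}


def rollout(env, policy):
    """Run one episode and return the cumulative reward."""
    obs = env.reset()
    total, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        total += reward
    return total


lazy = rollout(ToyFollowEnv(), lambda obs: 0.0)
eager = rollout(ToyFollowEnv(), lambda obs: 1.0 if obs[0] < 5.0 else 0.0)
print(lazy, eager)
```

The policy that accelerates toward the target speed collects a higher cumulative reward than the one that never moves, which is the kind of signal an RL algorithm optimizes.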

### 3.1 SumoParams
`SumoParams` specifies simulation-specific variables. These include the length of a simulation step and whether to render the GUI when running the experiment. For this example, we use a simulation step length of 0.1 s and, because we are training, run without the GUI.

**Note:** For training purposes, it is highly recommended to deactivate the GUI in order to avoid slowing down the simulation. To do so, specify `sumo_binary="sumo"`, as is done below.

In [5]:
from flow.core.params import SumoParams

sumo_params = SumoParams(sim_step=0.1, sumo_binary="sumo")

### 3.2 EnvParams

`EnvParams` specifies environment and experiment-specific parameters that either affect the training process or the dynamics of various components within the scenario. For the environment "WaveAttenuationPOEnv", these parameters are used to dictate bounds on the accelerations of the autonomous vehicles, as well as the range of ring lengths (and accordingly network densities) the agent is trained on.

Finally, it is important to specify here the *horizon* of the experiment, which is the number of simulation steps in one episode (during which the RL agent acquires data).

In [6]:
from flow.core.params import EnvParams

# number of simulation steps in one rollout
HORIZON = 100

env_params = EnvParams(
    # length of one rollout
    horizon=HORIZON,

    additional_params={
        # maximum acceleration of autonomous vehicles
        "max_accel": 1,
        # maximum deceleration of autonomous vehicles
        "max_decel": 1,
        # bounds on the ranges of ring road lengths the autonomous vehicle 
        # is trained on
        "ring_length": [220, 270],
    },
)
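The numbers chosen above translate into simulated time and traffic density as sketched below. The vehicle count, step length, horizon, and ring length bounds come from this exercise; the uniform sampling of a ring length is only an assumption about how a length within the bounds might be drawn, and `density` is a hypothetical helper.

```python
import math
import random

NUM_VEHICLES = 22                # 21 human-driven + 1 autonomous
SIM_STEP = 0.1                   # seconds per simulation step
HORIZON = 100                    # simulation steps per rollout
RING_LENGTH_RANGE = (220, 270)   # meters

# simulated time covered by one rollout
episode_seconds = HORIZON * SIM_STEP


def density(ring_length):
    """Vehicles per meter on a ring of the given length (hypothetical helper)."""
    return NUM_VEHICLES / ring_length


densest = density(RING_LENGTH_RANGE[0])
sparsest = density(RING_LENGTH_RANGE[1])

# assumed for illustration: a length is drawn uniformly within the bounds
rng = random.Random(0)
sampled_length = rng.uniform(*RING_LENGTH_RANGE)

print(episode_seconds, densest, sparsest, density(sampled_length))
```

Shorter rings pack the same 22 vehicles more tightly, so varying the ring length exposes the agent to a range of densities.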

### 3.3 Initializing a Gym Environment

Now, we have to specify our Gym Environment and the algorithm that our RL agents will use. To specify the environment, one has to use the environment's name (a simple string). A list of all environment names is located in `flow/envs/__init__.py`. The names of available environments can be seen below.

In [7]:
import flow.envs as flowenvs

print(flowenvs.__all__)

['Env', 'AccelEnv', 'LaneChangeAccelEnv', 'LaneChangeAccelPOEnv', 'GreenWaveTestEnv', 'GreenWaveEnv', 'WaveAttenuationMergePOEnv', 'TwoLoopsMergeEnv', 'BottleneckEnv', 'BottleNeckAccelEnv', 'WaveAttenuationEnv', 'WaveAttenuationPOEnv']


We will use the environment "WaveAttenuationPOEnv", which is used to train autonomous vehicles to attenuate the formation and propagation of waves in a partially observable variable density ring road. To create the Gym Environment, the only necessary parameters are the environment name plus the previously defined variables. These are defined as follows:

In [8]:
env_name = "WaveAttenuationPOEnv"

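## 4. Setting up Flow Parameters

`flow_params` is a dictionary that gathers everything defined so far: the experiment tag, the names of the environment, scenario, and generator classes, and the `SumoParams`, `EnvParams`, `NetParams`, `Vehicles`, and `InitialConfig` objects. In section 5, this dictionary is passed to `make_create_env`, which returns a function that creates the environment together with the name under which the environment is registered.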

In [9]:
flow_params = dict(
    exp_tag=name,
    env_name=env_name,
    scenario=scenario_name,
    generator=generator_name,
    sumo=sumo_params,
    env=env_params,
    net=net_params,
    veh=vehicles,
    initial=initial_config)

In [10]:
flow_params

{'env': <flow.core.params.EnvParams at 0x11020aa58>,
 'env_name': 'WaveAttenuationPOEnv',
 'exp_tag': 'training_example',
 'generator': 'CircleGenerator',
 'initial': <flow.core.params.InitialConfig at 0x10662d320>,
 'net': <flow.core.params.NetParams at 0x10662d2b0>,
 'scenario': 'LoopScenario',
 'sumo': <flow.core.params.SumoParams at 0x10662d0b8>,
 'veh': <flow.core.vehicles.Vehicles at 0x10664e5c0>}

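## 5. Running in Ray
With the parameters in place, we run the experiment in Ray. The cells below initialize Ray, configure the PPO algorithm from RLlib, serialize `flow_params` into the agent configuration so that the experiment can be replayed later, register the environment with RLlib, and launch training through `run_experiments`.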

In [11]:
import json

import ray
import ray.rllib.ppo as ppo
from ray.tune import run_experiments
from ray.tune.registry import register_env

from flow.utils.rllib import make_create_env, FlowParamsEncoder



In [12]:
# number of parallel workers
PARALLEL_ROLLOUTS = 2
# number of rollouts per training iteration
N_ROLLOUTS = 20

ray.init(num_cpus=PARALLEL_ROLLOUTS, redirect_output=True)

config = ppo.DEFAULT_CONFIG.copy()
config["num_workers"] = PARALLEL_ROLLOUTS
config["timesteps_per_batch"] = HORIZON * N_ROLLOUTS
config["gamma"] = 0.999  # discount rate
config["model"].update({"fcnet_hiddens": [16, 16]})
config["use_gae"] = True
config["lambda"] = 0.97
config["sgd_batchsize"] = min(16 * 1024, config["timesteps_per_batch"])
config["kl_target"] = 0.02
config["num_sgd_iter"] = 10
config["horizon"] = HORIZON

Process STDOUT and STDERR is being redirected to /tmp/raylogs/.
Waiting for redis server at 127.0.0.1:63546 to respond...
Waiting for redis server at 127.0.0.1:61018 to respond...
Starting local scheduler with the following resources: {'GPU': 0, 'CPU': 2}.

View the web UI at http://localhost:8889/notebooks/ray_ui42024.ipynb?token=e7812e62b34843c494ff6254189a1a30f00712901562ef4e
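The batch-related settings in the configuration above follow from a few constants, and the arithmetic can be checked in isolation. The sketch below only recomputes values from this exercise; the even split of rollouts across workers is an assumption made for illustration.

```python
HORIZON = 100            # steps per rollout
N_ROLLOUTS = 20          # rollouts per training iteration
PARALLEL_ROLLOUTS = 2    # number of parallel workers
GAMMA = 0.999            # discount rate

# environment steps collected per training iteration
timesteps_per_batch = HORIZON * N_ROLLOUTS

# the SGD batch is capped at 16 * 1024 samples, so here it is the full batch
sgd_batchsize = min(16 * 1024, timesteps_per_batch)

# assumed: rollouts are divided evenly among the workers
rollouts_per_worker = N_ROLLOUTS // PARALLEL_ROLLOUTS

# weight applied to a reward that lies HORIZON steps in the future
far_reward_weight = GAMMA ** HORIZON

print(timesteps_per_batch, sgd_batchsize, rollouts_per_worker, far_reward_weight)
```

With a discount rate this close to 1, rewards at the end of a 100-step rollout still carry most of their weight, so the agent is encouraged to account for the whole episode.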



In [13]:
# save the flow params for replay
flow_json = json.dumps(flow_params, cls=FlowParamsEncoder, sort_keys=True,
                       indent=4)
config['env_config']['flow_params'] = flow_json
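Plain `json.dumps` cannot serialize arbitrary Python objects, which is why a custom encoder class is passed through the `cls` argument. The sketch below shows the general pattern with a hypothetical `SketchParamsEncoder` and a stand-in `DemoSumoParams` class. It is not the implementation of `FlowParamsEncoder`, whose output format may differ.

```python
import json


class SketchParamsEncoder(json.JSONEncoder):
    """Hypothetical encoder for illustration; not Flow's FlowParamsEncoder."""

    def default(self, obj):
        # classes are stored by name
        if isinstance(obj, type):
            return obj.__name__
        # other objects are stored as their class name plus attributes
        if hasattr(obj, "__dict__"):
            return {"__class__": type(obj).__name__, **vars(obj)}
        return super().default(obj)


class DemoSumoParams:
    """Stand-in for a parameter object such as SumoParams."""

    def __init__(self, sim_step, sumo_binary):
        self.sim_step = sim_step
        self.sumo_binary = sumo_binary


demo_params = {
    "exp_tag": "training_example",
    "sumo": DemoSumoParams(sim_step=0.1, sumo_binary="sumo"),
}
demo_json = json.dumps(demo_params, cls=SketchParamsEncoder, sort_keys=True,
                       indent=4)
restored = json.loads(demo_json)
print(demo_json)
```

Storing the parameters as a JSON string alongside the training results means the settings of an experiment can be read back later without access to the original Python objects.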

In [14]:
create_env, env_name = make_create_env(params=flow_params, version=0)
# Register as rllib env
register_env(env_name, create_env)

In [15]:
trials = run_experiments({
    "ring_stabilize": {
        "run": "PPO",
        "env": env_name,
        "config": {
            **config
        },
        "checkpoint_freq": 20,
        "max_failures": 999,
        "stop": {
            "training_iteration": 200,
        },
        "repeat": 3,
        "trial_resources": {
            "cpu": 1,
            "gpu": 0,
            "extra_cpu": PARALLEL_ROLLOUTS - 1,
        },
    },
})

== Status ==
Using FIFO scheduling algorithm.
Result logdir: /Users/nishant/ray_results/ring_stabilize
PENDING trials:
 - PPO_WaveAttenuationPOEnv-v0_0:	PENDING
 - PPO_WaveAttenuationPOEnv-v0_1:	PENDING
 - PPO_WaveAttenuationPOEnv-v0_2:	PENDING

Created LogSyncer for /Users/nishant/ray_results/ring_stabilize/PPO_WaveAttenuationPOEnv-v0_0_2018-06-05_13-20-544g72xv0w -> 
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 2/2 CPUs, 0/0 GPUs
Result logdir: /Users/nishant/ray_results/ring_stabilize
PENDING trials:
 - PPO_WaveAttenuationPOEnv-v0_1:	PENDING
 - PPO_WaveAttenuationPOEnv-v0_2:	PENDING
RUNNING trials:
 - PPO_WaveAttenuationPOEnv-v0_0:	RUNNING

Remote function __init__ failed with:

Traceback (most recent call last):
  File "/Users/nishant/Development/research/ray/python/ray/worker.py", line 862, in _process_task
    *arguments)
  File "/Users/nishant/Development/research/ray/python/ray/actor.py", line 245, in actor_method_executor
    method_returns = me

Remote function __init__ failed with:

Traceback (most recent call last):
  File "/Users/nishant/Development/research/ray/python/ray/worker.py", line 862, in _process_task
    *arguments)
  File "/Users/nishant/Development/research/ray/python/ray/actor.py", line 245, in actor_method_executor
    method_returns = method(actor, *args)
  File "/Users/nishant/Development/research/ray/python/ray/rllib/agent.py", line 84, in __init__
    Trainable.__init__(self, config, registry, logger_creator)
  File "/Users/nishant/Development/research/ray/python/ray/tune/trainable.py", line 90, in __init__
    self._setup()
  File "/Users/nishant/Development/research/ray/python/ray/rllib/agent.py", line 91, in _setup
    self.env_creator = self.registry.get(ENV_CREATOR, env)
  File "/Users/nishant/Development/research/ray/python/ray/tune/registry.py", line 97, in get
    return _from_pinnable(ray.get(value))
  File "/Users/nishant/Development/research/ray/python/ray/worker.py", line 2474, in ge

Worker ip unknown, skipping log sync for /Users/nishant/ray_results/ring_stabilize/PPO_WaveAttenuationPOEnv-v0_2_2018-06-05_13-21-03ke6l47_0
== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/0 GPUs
Result logdir: /Users/nishant/ray_results/ring_stabilize
ERROR trials:
 - PPO_WaveAttenuationPOEnv-v0_0:	ERROR, 1 failures: /Users/nishant/ray_results/ring_stabilize/PPO_WaveAttenuationPOEnv-v0_0_2018-06-05_13-20-544g72xv0w/error_2018-06-05_13-21-00.txt
 - PPO_WaveAttenuationPOEnv-v0_1:	ERROR, 1 failures: /Users/nishant/ray_results/ring_stabilize/PPO_WaveAttenuationPOEnv-v0_1_2018-06-05_13-21-00bl0dt9u5/error_2018-06-05_13-21-03.txt
 - PPO_WaveAttenuationPOEnv-v0_2:	ERROR, 1 failures: /Users/nishant/ray_results/ring_stabilize/PPO_WaveAttenuationPOEnv-v0_2_2018-06-05_13-21-03ke6l47_0/error_2018-06-05_13-21-09.txt

== Status ==
Using FIFO scheduling algorithm.
Resources requested: 0/2 CPUs, 0/0 GPUs
Result logdir: /Users/nishant/ray_results/ring_stabilize
ERROR tri

TuneError: ('Trial did not complete', PPO_WaveAttenuationPOEnv-v0_0)
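The resource and stopping settings passed to `run_experiments` can be sanity-checked with a little arithmetic. The sketch below assumes that a trial's CPU demand is the sum of its `cpu` and `extra_cpu` entries, which is consistent with the `Resources requested: 2/2 CPUs` line in the status output; the remaining values are taken from the cells above.

```python
NUM_CPUS = 2                 # passed to ray.init
PARALLEL_ROLLOUTS = 2
REPEAT = 3                   # number of trials
TRAINING_ITERATIONS = 200    # stopping condition per trial
CHECKPOINT_FREQ = 20
TIMESTEPS_PER_BATCH = 2000   # HORIZON * N_ROLLOUTS

trial_resources = {"cpu": 1, "gpu": 0, "extra_cpu": PARALLEL_ROLLOUTS - 1}

# assumed: a trial occupies its driver CPU plus the extra worker CPUs
cpus_per_trial = trial_resources["cpu"] + trial_resources["extra_cpu"]

# with 2 CPUs in total, only one trial fits at a time
concurrent_trials = NUM_CPUS // cpus_per_trial

# checkpoints written by a trial that runs to completion
checkpoints_per_trial = TRAINING_ITERATIONS // CHECKPOINT_FREQ

# environment steps simulated if all trials run to completion
total_timesteps = REPEAT * TRAINING_ITERATIONS * TIMESTEPS_PER_BATCH

print(cpus_per_trial, concurrent_trials, checkpoints_per_trial, total_timesteps)
```

Under these assumptions the three trials run one after another, which matches the status output in which one trial is `RUNNING` while the other two remain `PENDING`.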