# Ray RLlib - Introduction to RLlib

© 2019-2020, Anyscale. All Rights Reserved

![Anyscale Academy](../images/AnyscaleAcademy_Logo_clearbanner_141x100.png)

In the [previous lesson](01-Introduction-to-Reinforcement-Learning.ipynb), we learned the basic concepts of reinforcement learning, with a "taste" of [RLlib](https://rllib.io) and [OpenAI Gym](https://gym.openai.com). This lesson takes a step back to provide more information about RLlib and the features it provides. The subsequent lessons will continue our exploration of RL concepts and RLlib tools.

For more detailed information about RLlib and its open source community, see the following:

* [rllib.io](http://rllib.io) (the documentation)
* [GitHub repo](https://github.com/ray-project/ray/tree/master/rllib#rllib-scalable-reinforcement-learning)

RLlib is structured conceptually like this:

![RLlib Stack](../images/rllib/RLlib-Stack-smaller.png)

The _(1) Application Support_ boxes are components used for particular RL algorithms. The _(2) Abstractions for RL_ provide building blocks used by the many algorithms that are implemented in RLlib (listed below). They also provide hooks for implementing your own algorithms. RLlib leverages Ray for efficient, cluster-wide, _(3) Distributed Execution_.

Let's start up Ray as in the previous lesson:

In [None]:
!../tools/start-ray.sh --check --verbose

## RLlib in 60 Seconds (plus some...)

Here is a fast introduction to using RLlib from a command line, adapted from the [documentation](https://docs.ray.io/en/latest/rllib.html#rllib-in-60-seconds).

First you would install [PyTorch](http://pytorch.org/) or [TensorFlow](https://www.tensorflow.org/), whichever you prefer.  Then install RLlib. **All of these items are already installed in this tutorial environment.**

```shell
pip install ray[rllib]  # or consider using: ray[debug]
```

Then **train** `CartPole` using _PPO_ with the `rllib` CLI. The command connects to the running Ray cluster, stops after 20 iterations, and saves a checkpoint every 10 iterations and at the end:

```shell
rllib train --run PPO --env CartPole-v1 --stop='{"training_iteration": 20}' --ray-address auto --checkpoint-freq 10 --checkpoint-at-end
```

The `rllib` CLI has a `--help` flag that prints details about the supported options:

```shell
rllib --help          # general help
rllib train --help    # specific help on the training options
rllib rollout --help  # specific help on the rollout options
```

_Rollout_ means running episodes with the trained model, which you specify by passing a checkpoint directory to the command. During a rollout, the agent repeatedly takes an action and observes the new state and reward, until a final state or a maximum number of steps is reached.
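The loop described above can be sketched in a few lines of plain Python. This is only an illustration: `ToyBalanceEnv` and `policy` are hypothetical stand-ins invented here, not the Gym or RLlib APIs, and a real rollout would load the policy from a checkpoint.

```python
import random


class ToyBalanceEnv:
    """Hypothetical stand-in for an environment such as CartPole (not the Gym API)."""

    def __init__(self, max_steps=50, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)

    def reset(self):
        self.position = 0
        self.steps = 0
        return self.position

    def step(self, action):
        # Action 0 nudges left, action 1 nudges right; noise perturbs the state.
        self.position += (1 if action == 1 else -1) + self.rng.choice([-1, 0, 1])
        self.steps += 1
        # The episode ends when the state drifts too far or the step limit is hit.
        done = abs(self.position) > 3 or self.steps >= self.max_steps
        reward = 1.0  # one point for every step survived
        return self.position, reward, done


def policy(observation):
    """Stand-in for a trained policy: push back toward the center."""
    return 0 if observation > 0 else 1


def rollout(env, policy):
    """Run one episode and return the total reward."""
    observation = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(observation)
        observation, reward, done = env.step(action)
        total_reward += reward
    return total_reward


episode_reward = rollout(ToyBalanceEnv(), policy)
print("Episode reward:", episode_reward)
```

The `while not done` loop is the part that corresponds to a rollout; everything else is scaffolding to make the sketch runnable.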

You can execute the same training logic using the following Python code, which leverages [Ray Tune](http://tune.io), specifically the [tune.run](https://docs.ray.io/en/latest/tune/api_docs/execution.html#tune-run) function:

```python
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

tune.run(
    PPOTrainer,
    config={"env": "CartPole-v1"},
    stop={"training_iteration": 20},
    checkpoint_freq=10,       # match --checkpoint-freq 10 in the CLI command
    checkpoint_at_end=True,
    verbose=2,                # 2 for INFO; change to 1 or 0 to reduce the output.
)
```

Try the `rllib` CLI just shown. The following cell will take between one and two minutes to run.

You could also run this command in a separate terminal window.

> **Tip:** The output will be long. When this happens for a cell, right click and select _Enable scrolling for outputs_.

In [1]:
!rllib train --run PPO --env CartPole-v1 --stop='{"training_iteration": 20}' --ray-address auto --checkpoint-freq 10 --checkpoint-at-end

  _np_qint8 = np.dtype([("qint8", np.int8, 1)])
  _np_quint8 = np.dtype([("quint8", np.uint8, 1)])
  _np_qint16 = np.dtype([("qint16", np.int16, 1)])
  _np_quint16 = np.dtype([("quint16", np.uint16, 1)])
  _np_qint32 = np.dtype([("qint32", np.int32, 1)])
  np_resource = np.dtype([("resource", np.ubyte, 1)])
I0718 22:59:49.436642 11352 11352 global_state_accessor.cc:25] Redis server address = 192.168.1.105:6379, is test flag = 0
I0718 22:59:49.437047 11352 11352 redis_client.cc:141] RedisClient connected.
I0718 22:59:49.437063 11352 11352 redis_gcs_client.cc:88] RedisGcsClient Connected.
I0718 22:59:49.437363 11352 11352 service_based_gcs_client.cc:75] ServiceBasedGcsClient Connected

You can also view the training results using [TensorBoard](https://www.tensorflow.org/tensorboard). The results during training were written to a directory under `$HOME/ray_results`.

If you are viewing this lesson on the Anyscale hosted platform, use the provided link to open TensorBoard.

If you are viewing this lesson on a laptop, open a terminal and run the following command, then open the URL shown in the output. (You can open a terminal using the `+` in the upper left-hand corner of Jupyter Lab.)

```shell
tensorboard --logdir=~/ray_results
```

Here is a [TensorBoard screenshot](../images/rllib/TensorBoard-CartPole-PPO.png).

The directory `$HOME/ray_results` will contain the results for all the RL training we'll do in this tutorial. You may wish to clean out old results periodically. For the run we just did, look for the results in `$HOME/ray_results/default/PPO_CartPole-v1_0_YYYY-MM-DD_HH-MM-SS*`.

### Rollout

> **WARNING:** The `rllib rollout` command discussed next won't work in a cloud environment, because it attempts to pop up a window. If you are taking a live class, the instructor will demonstrate what you would see. You can also watch the next video of a single _episode_.

In [2]:
from IPython.display import Video

cart_pole_sample_video='../images/rllib/Cart-Pole-Example-Video.mp4'
Video(cart_pole_sample_video)

If you are working on a laptop, you can use `rllib rollout <checkpoint> --run PPO` to run episodes from a `<checkpoint>`, which in this case will be a directory with a name like this:

```
$HOME/ray_results/default/PPO_CartPole-v1_0_YYYY-MM-DD_HH-MM-SS.../checkpoint_20/checkpoint-20/
```
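As a rough illustration, a checkpoint path of this shape could be located from Python with a glob over the results directory. The function name `find_latest_checkpoint` is hypothetical, and the sketch assumes the layout shown above plus a `.tune_metadata` file that may sit next to each checkpoint; adjust the pattern if your results directory looks different.

```python
import os
from pathlib import Path


def find_latest_checkpoint(results_dir, trial_glob="PPO_CartPole-v1_*"):
    """Return the most recently modified checkpoint path, or None if none exist."""
    pattern = f"{trial_glob}/checkpoint_*/checkpoint-*"
    root = Path(os.path.expanduser(results_dir))
    candidates = [
        path for path in root.glob(pattern)
        # Assumption: metadata files saved next to a checkpoint are not checkpoints.
        if not path.name.endswith(".tune_metadata")
    ]
    if not candidates:
        return None
    return max(candidates, key=lambda path: path.stat().st_mtime)


checkpoint = find_latest_checkpoint("~/ray_results/default")
print(checkpoint if checkpoint else "No checkpoint found; run the training step first.")
```

Sorting by modification time picks the newest checkpoint when several runs have left results behind.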

The following shell command will find the correct directory for you and run the `rllib rollout <checkpoint> --run PPO` command. (It will print the actual command, with the correct checkpoint directory.) If it finds more than one checkpoint directory, for example from a previous run, it uses the latest one.

> **Note:** If you are working in a cloud environment, add the flag `--no-render` to the command. Otherwise, an error will occur because RLlib won't be able to open the window discussed above.

In [3]:
!rollout.sh --episodes 5

/bin/bash: rollout.sh: command not found


See [this RLlib page on training policies](https://docs.ray.io/en/master/rllib-training.html) for more examples.

## More on RLlib Concepts: Policies, Environments, Samples, and Trainers

### Policies and Environments

[Policies](https://docs.ray.io/en/latest/rllib-concepts.html#policies) in RLlib are Python classes that define how an agent acts in an environment.

[Rollout workers](https://docs.ray.io/en/latest/rllib-concepts.html#policy-evaluation) query the policy to determine agent actions. 

In a [gym](https://docs.ray.io/en/latest/rllib-env.html#openai-gym) environment, there is a single agent and policy. In [vector environments](https://docs.ray.io/en/latest/rllib-env.html#vectorized), policy inference is for multiple agents at once, and in [multi-agent and hierarchical environments](https://docs.ray.io/en/latest/rllib-env.html#multi-agent-and-hierarchical), there may be multiple policies, each controlling one or more agents.

![Environments and Policies in RLlib](../images/rllib/multi-flat.svg)

The RLlib documentation on [environments](https://docs.ray.io/en/latest/rllib-env.html#rllib-environments) provides more details.

Policies can be implemented using any framework ([RLlib policy.py code](https://github.com/ray-project/ray/blob/master/rllib/policy/policy.py)). However, for TensorFlow and PyTorch, RLlib has [build_tf_policy](https://docs.ray.io/en/latest/rllib-concepts.html#building-policies-in-tensorflow) and [build_torch_policy](https://docs.ray.io/en/latest/rllib-concepts.html#building-policies-in-pytorch) helper functions, respectively, that let you define a trainable policy with a functional-style API. This example is taken from the [documentation](https://docs.ray.io/en/latest/rllib.html#policies):

```python
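import tensorflow as tf
from ray.rllib.policy.tf_policy_template import build_tf_policy

def policy_gradient_loss(policy, model, dist_class, train_batch):
    # Compute action logits for the batch of observations.
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    # Negative mean log-probability of each taken action, weighted by its reward.
    return -tf.reduce_mean(
        action_dist.logp(train_batch["actions"]) * train_batch["rewards"])

# <class 'ray.rllib.policy.tf_policy_template.MyTFPolicy'>
MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=policy_gradient_loss)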
```

### Sample Batches

From single processes to large clusters, all data interchange in RLlib uses [sample batches](https://github.com/ray-project/ray/blob/master/rllib/policy/sample_batch.py). Sample batches encode one or more fragments of a trajectory. Typically, RLlib collects batches of size `rollout_fragment_length` from rollout workers, and concatenates one or more of these batches into a batch of size `train_batch_size` that is the input to SGD (stochastic gradient descent).

A typical sample batch looks something like the following when summarized. Since all values are kept in arrays, this allows for efficient encoding and transmission across the network:

```python
 { 'action_logp': np.ndarray((200,), dtype=float32, min=-0.701, max=-0.685, mean=-0.694),
   'actions': np.ndarray((200,), dtype=int64, min=0.0, max=1.0, mean=0.495),
   'dones': np.ndarray((200,), dtype=bool, min=0.0, max=1.0, mean=0.055),
   'infos': np.ndarray((200,), dtype=object, head={}),
   'new_obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.018),
   'obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.016),
   'rewards': np.ndarray((200,), dtype=float32, min=1.0, max=1.0, mean=1.0),
   't': np.ndarray((200,), dtype=int64, min=0.0, max=34.0, mean=9.14)}
```
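To make the concatenation step concrete, here is a minimal sketch that builds a training batch from rollout fragments. It is not RLlib's `SampleBatch` implementation: plain Python lists stand in for NumPy arrays, and `make_fragment` and `concat_batches` are hypothetical helpers.

```python
def concat_batches(batches):
    """Concatenate column-wise: each key maps to one longer list of values."""
    return {key: [v for batch in batches for v in batch[key]] for key in batches[0]}


def make_fragment(start, length):
    """Hypothetical rollout fragment with two of the columns shown above."""
    return {
        "t": list(range(start, start + length)),
        "rewards": [1.0] * length,
    }


rollout_fragment_length = 200
train_batch_size = 800

# Keep collecting fragments until enough timesteps are available for one update.
fragments = []
collected = 0
while collected < train_batch_size:
    fragment = make_fragment(collected, rollout_fragment_length)
    fragments.append(fragment)
    collected += len(fragment["rewards"])

train_batch = concat_batches(fragments)
print(len(fragments), "fragments ->", len(train_batch["rewards"]), "timesteps")
```

Because every column is a flat sequence, concatenating fragments is just appending columns end to end, which is what makes the format cheap to ship between processes.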

In [multi-agent mode](https://docs.ray.io/en/latest/rllib-concepts.html#policies-in-multi-agent), sample batches are collected separately for each individual policy.
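The per-policy grouping can be pictured as follows. This is a simplified sketch, not RLlib's multi-agent batch class: the agent names, the experience tuples, and the body of `policy_mapping_fn` are all hypothetical, although RLlib's multi-agent configuration uses a mapping function of the same name.

```python
from collections import defaultdict


def policy_mapping_fn(agent_id):
    """Hypothetical mapping from agent IDs to policy IDs."""
    return "policy_a" if agent_id.startswith("car") else "policy_b"


# Hypothetical per-step experiences: (agent_id, observation, action, reward).
steps = [
    ("car_0", 0.1, 1, 1.0),
    ("light_0", 0.7, 0, 0.5),
    ("car_1", -0.3, 0, 1.0),
]

# One batch of columns per policy; agents sharing a policy share a batch.
batches = defaultdict(lambda: {"obs": [], "actions": [], "rewards": []})
for agent_id, obs, action, reward in steps:
    batch = batches[policy_mapping_fn(agent_id)]
    batch["obs"].append(obs)
    batch["actions"].append(action)
    batch["rewards"].append(reward)

for policy_id, batch in sorted(batches.items()):
    print(policy_id, len(batch["rewards"]), "steps")
```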

### Trainers

At a high level, RLlib provides [trainer classes](https://docs.ray.io/en/latest/rllib-concepts.html#trainers) ([Trainer source code](https://github.com/ray-project/ray/blob/master/rllib/agents/trainer.py)) that hold a policy for environment interaction. Through the trainer interface, the policy can be trained, checkpointed, or an action computed. In multi-agent training, the trainer manages the querying and optimization of multiple policies at once.

![RLlib API](../images/rllib/RLlib-API.svg)

The trainer classes coordinate the distributed workflow of running rollouts and optimizing policies. They do this by leveraging Ray [parallel iterators](https://docs.ray.io/en/latest/iter.html) (see also this lesson: [Ray Crash Course: 05 Ray Parallel Iterators](../ray-crash-course/05-Ray-Parallel-Iterators.ipynb)) to implement the desired computation pattern. The following figure shows *synchronous sampling*, the simplest of [these patterns](https://docs.ray.io/en/latest/rllib-algorithms.html):

![Synchronous Sampling](../images/rllib/a2c-arch.svg)

_Synchronous Sampling (e.g., A2C, PG, PPO)_
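The synchronous pattern can be imitated with a thread pool to show the control flow. This is a toy sketch under stated assumptions: RLlib uses Ray actors and parallel iterators rather than threads, and `sample` and `learn` are hypothetical stand-ins for a rollout worker and a gradient update.

```python
from concurrent.futures import ThreadPoolExecutor


def sample(worker_id, weights, fragment_length=4):
    """Stand-in for a rollout worker: returns a fragment tagged with the weights used."""
    return [
        {"worker": worker_id, "weights_version": weights["version"], "reward": 1.0}
        for _ in range(fragment_length)
    ]


def learn(weights, batch):
    """Stand-in for a gradient update: bumps a version counter."""
    return {"version": weights["version"] + 1, "seen": weights["seen"] + len(batch)}


weights = {"version": 0, "seen": 0}
num_workers = 3

with ThreadPoolExecutor(max_workers=num_workers) as pool:
    for iteration in range(2):
        # 1. Every worker samples with the same, current weights.
        futures = [pool.submit(sample, i, weights) for i in range(num_workers)]
        # 2. The learner blocks until all fragments have arrived.
        fragments = [future.result() for future in futures]
        # 3. Fragments are concatenated into one training batch.
        batch = [step for fragment in fragments for step in fragment]
        assert all(step["weights_version"] == weights["version"] for step in batch)
        # 4. One update; the new weights are used in the next round.
        weights = learn(weights, batch)

print(weights)
```

The defining property is step 2: no update happens until every worker has reported, so all samples in a batch come from the same version of the policy.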

RLlib uses [Ray actors](https://docs.ray.io/en/latest/actors.html) to scale training from a single core to many thousands of cores in a cluster. You can [configure the parallelism](https://docs.ray.io/en/latest/rllib-training.html#specifying-resources) used for training by changing the `num_workers` parameter. Check out the [scaling guide](https://docs.ray.io/en/latest/rllib-training.html#scaling-guide) for more details.
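For example, a trainer configuration that requests four rollout workers might look like the sketch below. The keys are RLlib configuration options, but the values are illustrative choices rather than recommendations, and the arithmetic at the end is only a rough estimate of how many timesteps one round of sampling yields.

```python
config = {
    "env": "CartPole-v1",
    "num_workers": 4,              # number of rollout worker actors
    "num_envs_per_worker": 1,      # environments stepped by each worker
    "num_gpus": 0,                 # GPUs reserved for the trainer process
    "rollout_fragment_length": 200,
    "train_batch_size": 4000,
}

# With Ray and RLlib installed, this dict would be passed as
# tune.run(PPOTrainer, config=config, ...) in the earlier example.
steps_per_round = (
    config["num_workers"]
    * config["num_envs_per_worker"]
    * config["rollout_fragment_length"]
)
print("Timesteps collected per sampling round:", steps_per_round)
```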

### Policies 

Each policy implementation defines a `learn_on_batch()` method that improves the policy given a sample batch of input. For TensorFlow and PyTorch policies, this is implemented using a _loss function_ that takes as input sample batch tensors and outputs a scalar loss value. Here are a few example loss functions:

* Simple [policy gradient loss](https://github.com/ray-project/ray/blob/master/rllib/agents/pg/pg_tf_policy.py)
* Simple [Q-function loss](https://github.com/ray-project/ray/blob/a1d2e1762325cd34e14dc411666d63bb15d6eaf0/rllib/agents/dqn/simple_q_policy.py#L136)
* Importance-weighted _APPO surrogate loss_ for [TensorFlow](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo_tf_policy.py), [PyTorch](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo_torch_policy.py)
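To show what "sample batch in, scalar loss out" means, here is a framework-free sketch of a simple policy gradient loss, the negative mean of log-probability times reward. It is an illustration only, not the code behind the links above; the batch values are made up and `log_softmax` is a helper written for this example.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities for one categorical distribution."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    return [x - log_norm for x in logits]


def policy_gradient_loss(logits_batch, actions, rewards):
    """Return -mean(log pi(a|s) * reward) over the batch."""
    weighted = [
        log_softmax(logits)[action] * reward
        for logits, action, reward in zip(logits_batch, actions, rewards)
    ]
    return -sum(weighted) / len(weighted)


# Hypothetical batch: two possible actions, a policy that is still close to uniform.
logits_batch = [[0.0, 0.0], [0.1, -0.1], [0.0, 0.0]]
actions = [0, 1, 1]
rewards = [1.0, 1.0, 1.0]

loss = policy_gradient_loss(logits_batch, actions, rewards)
print(f"loss = {loss:.3f}")
```

In a TensorFlow or PyTorch policy the same expression is built from tensors, so the framework can differentiate the scalar with respect to the model weights.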

## Offline Data

Beyond environments defined in Python, RLlib supports batch training on [offline datasets](https://docs.ray.io/en/latest/rllib-offline.html). This is an important use case for RL when it's not possible to run traditional training and rollout in a physical environment (like a chemical plant or assembly line) and a suitable simulator doesn't exist. In this approach, data for past activity is used to train a policy.

When the policy is trained to reproduce the behavior recorded in that data, the approach is sometimes called [imitation learning](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#advantage-re-weighted-imitation-learning-marwil).
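The idea of learning from logged activity, with no environment to step, can be sketched with the simplest possible form of behavior cloning: pick, for each observation, the action the log shows most often. This is a toy illustration and not MARWIL or RLlib's offline API; the observations, the log, and `fit_cloned_policy` are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical log of past activity: (discretized observation, action taken).
logged_steps = [
    ("leaning_left", 0), ("leaning_left", 0), ("leaning_left", 1),
    ("leaning_right", 1), ("leaning_right", 1),
]


def fit_cloned_policy(steps):
    """Map each observation to the action the logged behavior chose most often."""
    counts = defaultdict(Counter)
    for observation, action in steps:
        counts[observation][action] += 1
    return {obs: actions.most_common(1)[0][0] for obs, actions in counts.items()}


cloned_policy = fit_cloned_policy(logged_steps)
print(cloned_policy)
```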

## Application Support and Customization

[RLlib supports](https://docs.ray.io/en/latest/rllib.html#application-support) a variety of integration strategies for [external applications](https://docs.ray.io/en/latest/rllib-env.html#external-agents-and-applications).

RLlib provides ways to customize almost all aspects of training, including the [environment](https://docs.ray.io/en/latest/rllib-env.html#configuring-environments), [neural network model](https://docs.ray.io/en/latest/rllib-models.html#tensorflow-models), [action distributions](https://docs.ray.io/en/latest/rllib-models.html#custom-action-distributions), and [policy definitions](https://docs.ray.io/en/latest/rllib-concepts.html#policies).

![RLlib components](../images/rllib/RLlib-components.svg)

## Algorithms Implemented in RLlib

Here is the current list of supported algorithms in RLlib. The links go to the corresponding RLlib documentation, which includes links to the original papers and other references.

In this tutorial, we will mostly use [Proximal Policy Optimization (PPO)](https://docs.ray.io/en/latest/rllib-algorithms.html#proximal-policy-optimization-ppo), [Deep Q Networks (DQN, Rainbow, Parametric DQN)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#deep-q-networks-dqn-rainbow-parametric-dqn), and the contextual bandit algorithms, [Linear Upper Confidence Bound (LinUCB)](https://docs.ray.io/en/latest/rllib-algorithms.html#linear-upper-confidence-bound-contrib-linucb) and [Linear Thompson Sampling (LinTS)](https://docs.ray.io/en/latest/rllib-algorithms.html#linear-thompson-sampling-contrib-lints).

See also the documentation's [Feature Compatibility Matrix](https://docs.ray.io/en/latest/rllib-algorithms.html#feature-compatibility-matrix), which lists the algorithms and useful properties for them. It appears at the beginning of the descriptions of all the algorithms, with links to the research papers that introduced them and discussions of their strengths and weaknesses.

### High-throughput Architectures

* [Distributed Prioritized Experience Replay (Ape-X)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#distributed-prioritized-experience-replay-ape-x)
* [Importance Weighted Actor-Learner Architecture (IMPALA)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#importance-weighted-actor-learner-architecture-impala)
* [Asynchronous Proximal Policy Optimization (APPO)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#asynchronous-proximal-policy-optimization-appo)
* [Decentralized Distributed Proximal Policy Optimization (DD-PPO)](https://docs.ray.io/en/latest/rllib-algorithms.html#decentralized-distributed-proximal-policy-optimization-dd-ppo)

### Gradient-based

* [Advantage Actor-Critic (A2C, A3C)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#advantage-actor-critic-a2c-a3c)
* [Deep Deterministic Policy Gradients (DDPG, TD3)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#deep-deterministic-policy-gradients-ddpg-td3)
* [Deep Q Networks (DQN, Rainbow, Parametric DQN)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#deep-q-networks-dqn-rainbow-parametric-dqn)
* [Policy Gradients](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#policy-gradients)
* [Proximal Policy Optimization (PPO)](https://docs.ray.io/en/latest/rllib-algorithms.html#proximal-policy-optimization-ppo)
* [Soft Actor-Critic (SAC)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#soft-actor-critic-sac)

### Gradient-free

* [Augmented Random Search (ARS)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#augmented-random-search-ars)
* [Evolution Strategies](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#evolution-strategies)

### Multi-agent Specific

* [QMIX Monotonic Value Factorisation (QMIX, VDN, IQN)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#qmix-monotonic-value-factorisation-qmix-vdn-iqn)
* [Multi-Agent Deep Deterministic Policy Gradient (contrib/MADDPG)](https://docs.ray.io/en/latest/rllib-algorithms.html#multi-agent-deep-deterministic-policy-gradient-contrib-maddpg)

### Offline

* [Advantage Re-Weighted Imitation Learning (MARWIL)](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#advantage-re-weighted-imitation-learning-marwil)

### Contextual Bandits (contrib/bandits)

* [Linear Upper Confidence Bound (contrib/LinUCB)](https://docs.ray.io/en/latest/rllib-algorithms.html#linear-upper-confidence-bound-contrib-linucb)
* [Linear Thompson Sampling (contrib/LinTS)](https://docs.ray.io/en/latest/rllib-algorithms.html#linear-thompson-sampling-contrib-lints)

### Other

* [Single-Player Alpha Zero (contrib/AlphaZero)](https://docs.ray.io/en/latest/rllib-algorithms.html#single-player-alpha-zero-contrib-alphazero)

See the [Overview](00-Ray-RLlib-Overview.ipynb) for recommendations on which lessons to study next.