# RLlib
The purpose of this Jupyter notebook is to build a preliminary understanding of how RLlib is structured and how it works.
It is based on [the RLlib documentation](https://ray.readthedocs.io/en/latest/rllib.html).

![image of rllib stack](rllib-stack.svg)

After installing `ray[rllib]`, we can run RLlib in either of two ways:

In [None]:
# through shell command line
! rllib train --run=PPO --env=CartPole-v0  # -v [-vv] for verbose,
                                         # --eager [--trace] for eager execution,
                                         # --torch to use PyTorch

In [None]:
# with python API (using tune)
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer
tune.run(PPOTrainer, config={"env": "CartPole-v0"})  # "log_level": "INFO" for verbose,
                                                     # "eager": True for eager execution,
                                                     # "torch": True for PyTorch

## Key concepts in RLlib
There are three key concepts: policies, sample batches, and trainers.

### Policies
[Policies](https://ray.readthedocs.io/en/latest/rllib-concepts.html#policies) are Python classes that define how an agent acts in an environment. [Rollout workers](https://ray.readthedocs.io/en/latest/rllib-concepts.html#policy-evaluation) query the policy to determine agent actions. In a gym environment, there is a single agent and policy. In [vector envs](https://ray.readthedocs.io/en/latest/rllib-env.html#vectorized), policy inference is for multiple agents at once, and in [multi-agent](https://ray.readthedocs.io/en/latest/rllib-env.html#multi-agent-and-hierarchical), there may be multiple policies, each controlling one or more agents.  
Policies can be implemented using any framework. However, for TensorFlow and PyTorch, RLlib has [build_tf_policy](https://ray.readthedocs.io/en/latest/rllib-concepts.html#building-policies-in-tensorflow) and [build_torch_policy](https://ray.readthedocs.io/en/latest/rllib-concepts.html#building-policies-in-pytorch) helper functions that let you define a trainable policy with a functional-style API, for example:

In [None]:
import tensorflow as tf
from ray.rllib.policy.tf_policy_template import build_tf_policy

def policy_gradient_loss(policy, model, dist_class, train_batch):
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    return -tf.reduce_mean(
        action_dist.logp(train_batch["actions"]) * train_batch["rewards"])

# <class 'ray.rllib.policy.tf_policy_template.MyTFPolicy'>
MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=policy_gradient_loss)

### Sample Batches
Whether running in a single process or [large cluster](https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-resources), all data interchange in RLlib is in the form of [sample batches](https://github.com/ray-project/ray/blob/master/rllib/policy/sample_batch.py). Sample batches encode one or more fragments of a trajectory. Typically, RLlib collects batches of size `sample_batch_size` from rollout workers, and concatenates one or more of these batches into a batch of size `train_batch_size` that is the input to SGD.  
A typical sample batch looks something like the following when summarized. Since all values are kept in arrays, this allows for efficient encoding and transmission across the network:

```
{ 'action_logp': np.ndarray((200,), dtype=float32, min=-0.701, max=-0.685, mean=-0.694),
  'actions': np.ndarray((200,), dtype=int64, min=0.0, max=1.0, mean=0.495),
  'dones': np.ndarray((200,), dtype=bool, min=0.0, max=1.0, mean=0.055),
  'infos': np.ndarray((200,), dtype=object, head={}),
  'new_obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.018),
  'obs': np.ndarray((200, 4), dtype=float32, min=-2.46, max=2.259, mean=0.016),
  'rewards': np.ndarray((200,), dtype=float32, min=1.0, max=1.0, mean=1.0),
  't': np.ndarray((200,), dtype=int64, min=0.0, max=34.0, mean=9.14)}
```
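
The concatenation step described above can be sketched without RLlib. This is a minimal illustration, assuming a batch is a dict of equal-length columns; the helpers `make_fragment` and `concat_batches` are hypothetical, and plain Python lists stand in for the NumPy arrays that RLlib uses.

```python
def make_fragment(start, length):
    """Build a toy rollout fragment: a dict of equal-length columns."""
    return {
        "obs": [[float(t)] * 4 for t in range(start, start + length)],
        "actions": [t % 2 for t in range(start, start + length)],
        "rewards": [1.0] * length,
    }


def concat_batches(batches):
    """Concatenate column-wise, keeping the rows of every column aligned."""
    keys = batches[0].keys()
    return {k: [row for b in batches for row in b[k]] for k in keys}


sample_batch_size = 200   # rows collected per rollout fragment
train_batch_size = 800    # rows wanted for one SGD input

# Keep collecting fragments until there are enough rows for one train batch.
fragments = []
collected = 0
while collected < train_batch_size:
    fragments.append(make_fragment(collected, sample_batch_size))
    collected += sample_batch_size

train_batch = concat_batches(fragments)
print({k: len(v) for k, v in train_batch.items()})
```

Keeping every column the same length is what lets a whole batch be sliced, shuffled, or shipped as a handful of arrays.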

### Training
Policies each define a `learn_on_batch()` method that improves the policy given a sample batch of input. For TF and Torch policies, this is implemented using a loss function that takes as input sample batch tensors and outputs a scalar loss. Here are a few example loss functions:  
- Simple [policy gradient loss](https://github.com/ray-project/ray/blob/master/rllib/agents/pg/pg_tf_policy.py)
- Simple [Q-function loss](https://github.com/ray-project/ray/blob/a1d2e1762325cd34e14dc411666d63bb15d6eaf0/rllib/agents/dqn/simple_q_policy.py#L136)
- Importance-weighted [APPO surrogate loss](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/appo_policy.py)  

RLlib [Trainer classes](https://ray.readthedocs.io/en/latest/rllib-concepts.html#trainers) coordinate the distributed workflow of running rollouts and optimizing policies. They do this by leveraging [policy optimizers](https://ray.readthedocs.io/en/latest/rllib-concepts.html#policy-optimization) that implement the desired computation pattern. The following figure shows synchronous sampling, the simplest of [these patterns](https://ray.readthedocs.io/en/latest/rllib-algorithms.html):  

![a2c-arch](a2c-arch.svg)  


RLlib uses [Ray actors](https://ray.readthedocs.io/en/latest/actors.html) to scale training from a single core to many thousands of cores in a cluster. You can [configure the parallelism](https://ray.readthedocs.io/en/latest/rllib-training.html#specifying-resources) used for training by changing the `num_workers` parameter.

### Customization
RLlib provides ways to customize almost all aspects of training, including the [environment](https://ray.readthedocs.io/en/latest/rllib-env.html#configuring-environments), [neural network model](https://ray.readthedocs.io/en/latest/rllib-models.html#tensorflow-models), [action distribution](https://ray.readthedocs.io/en/latest/rllib-models.html#custom-action-distributions), and [policy definitions](https://ray.readthedocs.io/en/latest/rllib-concepts.html#policies):
![rllib_components](rllib-components.svg)

# Concepts and Custom Algorithms
The following are selected items taken from the [RLlib documentation](https://ray.readthedocs.io/en/latest/rllib-toc.html).

## [Policies](https://ray.readthedocs.io/en/latest/rllib-concepts.html#policies)
This section describes the internal concepts used to implement algorithms in RLlib. You might find this **useful if modifying or adding new algorithms to RLlib.**  
Policy classes encapsulate the core numerical components of RL algorithms. They typically include:
 - A policy model that determines which actions to take
 - A trajectory postprocessor for experiences
 - A loss function to improve the policy given postprocessed experiences  

For a simple example, see the PG [policy definition](https://github.com/ray-project/ray/blob/master/rllib/agents/pg/pg_tf_policy.py).  

Most of the interaction with deep learning is isolated to the [Policy interface](https://github.com/ray-project/ray/blob/master/rllib/policy/policy.py), allowing RLlib to support multiple frameworks. There are [TensorFlow](https://ray.readthedocs.io/en/latest/rllib-concepts.html#building-policies-in-tensorflow) and [PyTorch](https://ray.readthedocs.io/en/latest/rllib-concepts.html#building-policies-in-pytorch) specific templates.  
You can also write your own policy from scratch as follows:

In [None]:
from ray.rllib.policy.policy import Policy

class CustomPolicy(Policy):
    """Example of a custom policy written from scratch.

    You might find it more convenient to use the `build_tf_policy` and
    `build_torch_policy` helpers instead for a real policy, which are
    described in the next sections.
    """

    def __init__(self, observation_space, action_space, config):
        Policy.__init__(self, observation_space, action_space, config)
        # example parameter
        self.w = 1.0

    def compute_actions(self,
                        obs_batch,
                        state_batches,
                        prev_action_batch=None,
                        prev_reward_batch=None,
                        info_batch=None,
                        episodes=None,
                        **kwargs):
        # return action batch, RNN states, extra values to include in batch
        return [self.action_space.sample() for _ in obs_batch], [], {}

    def learn_on_batch(self, samples):
        # implement your learning code here
        return {}  # return stats

    def get_weights(self):
        return {"w": self.w}

    def set_weights(self, weights):
        self.w = weights["w"]

To see the policy abstraction used in a multi-agent setting, see the [rock-paper-scissors example](https://ray.readthedocs.io/en/latest/rllib-env.html#rock-paper-scissors-example).

### Building policies in TensorFlow
This section describes how to build a TensorFlow RLlib policy using `tf_policy_template.build_tf_policy()`.  
To start, we first have to define a loss function.

#### Define the loss function
In RLlib, loss functions are defined over batches of trajectory data produced by policy evaluation. A basic policy gradient loss that only tries to maximize the 1-step reward can be defined as follows:

In [None]:
import tensorflow as tf
from ray.rllib.policy.sample_batch import SampleBatch

def policy_gradient_loss(policy, model, dist_class, train_batch):
    actions = train_batch[SampleBatch.ACTIONS]
    rewards = train_batch[SampleBatch.REWARDS]
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    return -tf.reduce_mean(action_dist.logp(actions) * rewards)

Here, `actions` is a Tensor placeholder of shape \[batch_size, action_dim...\], and `rewards` is a placeholder of shape \[batch_size\].  
<font color='red'> Question: why does the function get `policy` as input? I don't see it used anywhere. </font>  
The `action_dist` object is an [ActionDistribution](https://ray.readthedocs.io/en/latest/rllib-package-ref.html#ray.rllib.models.ActionDistribution) that is parameterized by the output of the neural network policy model. Passing this loss function to `build_tf_policy` is enough to produce a very basic TF policy:

#### Build the policy

In [None]:
from ray.rllib.policy.tf_policy_template import build_tf_policy

# <class 'ray.rllib.policy.tf_policy_template.MyTFPolicy'>
MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=policy_gradient_loss)
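
The loss defined earlier in this section can be checked by hand on a tiny batch. The sketch below is plain Python rather than TensorFlow and assumes a categorical (softmax) action distribution over two discrete actions, as one would expect for CartPole; the function names and the numbers are made up for illustration.

```python
import math


def log_softmax(logits):
    """Numerically stable log-probabilities of a categorical distribution."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def policy_gradient_loss_py(logits_batch, actions, rewards):
    """-mean(logp(action) * reward), mirroring the TF expression above."""
    terms = [
        log_softmax(logits)[a] * r
        for logits, a, r in zip(logits_batch, actions, rewards)
    ]
    return -sum(terms) / len(terms)


# Toy batch: two actions, three timesteps, made-up numbers.
logits_batch = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0]]
actions = [0, 0, 0]
rewards = [1.0, 1.0, 1.0]

loss = policy_gradient_loss_py(logits_batch, actions, rewards)
print(round(loss, 4))
```

With positive rewards, raising the logit of the action that was taken raises its log-probability and therefore lowers the loss, which is the direction gradient descent pushes the model.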

#### Build a trainer
As an exercise (runnable file [here](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_tf_policy.py)), we can create a [Trainer](https://ray.readthedocs.io/en/latest/rllib-concepts.html#trainers) and try running this policy on a toy env with two parallel rollout workers:

In [None]:
import ray
from ray import tune
from ray.rllib.agents.trainer_template import build_trainer

# <class 'ray.rllib.agents.trainer_template.MyCustomTrainer'>
MyTrainer = build_trainer(
    name="MyCustomTrainer",
    default_policy=MyTFPolicy)

ray.init()
tune.run(MyTrainer, config={"env": "CartPole-v0", "num_workers": 2})

#### Extending with postprocessing
If we want to compute the advantage (here, the sum of rewards over time, discounted by `gamma`), we need to define a trajectory postprocessor for the policy. This can be done by defining a `postprocess_fn`:

In [None]:
from ray.rllib.evaluation.postprocessing import compute_advantages, \
    Postprocessing

def postprocess_advantages(policy,
                           sample_batch,
                           other_agent_batches=None,
                           episode=None):
    return compute_advantages(
        sample_batch, 0.0, policy.config["gamma"], use_gae=False, use_critic=False)

def policy_gradient_loss(policy, model, dist_class, train_batch):
    logits, _ = model.from_batch(train_batch)
    action_dist = dist_class(logits, model)
    return -tf.reduce_mean(
        action_dist.logp(train_batch[SampleBatch.ACTIONS]) *
        train_batch[Postprocessing.ADVANTAGES])

MyTFPolicy = build_tf_policy(
    name="MyTFPolicy",
    loss_fn=policy_gradient_loss,
    postprocess_fn=postprocess_advantages)
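
To make the "sum of rewards over time" concrete, here is a small plain-Python sketch. It assumes that with `use_gae=False` and `use_critic=False` the advantages reduce to discounted returns bootstrapped from the final value passed in (`0.0` above); the function `discounted_returns` is a hypothetical stand-in, not RLlib's implementation.

```python
def discounted_returns(rewards, gamma, last_r=0.0):
    """Return G_t = r_t + gamma * G_{t+1}, bootstrapped from last_r."""
    returns = [0.0] * len(rewards)
    running = last_r
    # Walk the trajectory backwards so each step reuses the next step's return.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


rewards = [1.0, 1.0, 1.0]
print(discounted_returns(rewards, gamma=0.5))  # [1.75, 1.5, 1.0]
```

Earlier timesteps get larger values because they are credited with everything that followed, discounted by how far away it was.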

How does RLlib make the advantages placeholder automatically available as `train_batch[Postprocessing.ADVANTAGES]`?  
When building your policy, RLlib will create a “dummy” trajectory batch where all observations, actions, rewards, etc. are zeros. It then calls your `postprocess_fn`, and generates TF placeholders based on the numpy shapes of the postprocessed batch. RLlib tracks which placeholders `loss_fn` and `stats_fn` access, and then feeds the corresponding sample data into those placeholders during loss optimization. You can also access these placeholders via `policy.get_placeholder(<name>)` after loss initialization.
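
The mechanism described above (run the postprocessor on an all-zero batch, then watch which columns the loss reads) can be imitated in a few lines of plain Python. This is only a sketch of the idea: `TrackingBatch`, `make_dummy_batch`, `toy_postprocess`, and `toy_loss` are hypothetical names, and no TF placeholders are involved.

```python
class TrackingBatch(dict):
    """Dict that remembers which keys were read."""

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.accessed = set()

    def __getitem__(self, key):
        self.accessed.add(key)
        return super().__getitem__(key)


def make_dummy_batch(batch_size=1):
    """All-zero stand-in for a trajectory batch (column names are made up)."""
    return {
        "obs": [[0.0] * 4 for _ in range(batch_size)],
        "actions": [0] * batch_size,
        "rewards": [0.0] * batch_size,
    }


def toy_postprocess(batch):
    """Adds a new column, as a postprocess_fn would add advantages."""
    out = dict(batch)
    out["advantages"] = list(batch["rewards"])
    return out


def toy_loss(batch):
    """Reads only the columns it needs."""
    return -sum(a * adv for a, adv in zip(batch["actions"], batch["advantages"]))


tracked = TrackingBatch(toy_postprocess(make_dummy_batch()))
toy_loss(tracked)
print(sorted(tracked.keys()))     # every column available after postprocessing
print(sorted(tracked.accessed))   # only the columns the loss actually read
```

The dummy values never matter; only the column names and shapes do, which is why zeros are enough.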

#### Building policies in Eager
Policies built with `build_tf_policy` (most of the reference algorithms are) can be run in eager mode by setting the `"eager": True` / `"eager_tracing": True` config options or using `rllib train --eager [--trace]`. This will tell RLlib to execute the model forward pass, action distribution, loss, and stats functions in eager mode.

Eager mode makes debugging much easier, since you can now use line-by-line debugging with breakpoints or Python `print()` to inspect intermediate tensor values. However, eager can be slower than graph mode unless tracing is enabled.

You can also selectively leverage eager operations within graph mode execution with tf.py_function. Here’s [an example](https://github.com/ray-project/ray/blob/master/rllib/examples/eager_execution.py) of using eager ops embedded within a loss function.

### Example: PPO implementation
In this example we'll see how the above flow is used to build the PPO trainer and how we can modify it.  
We'll go through the [PPO trainer definition](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo.py).

At the bottom of the file, `build_trainer` is used to build the `PPOTrainer`:

```
PPOTrainer = build_trainer(
    name="PPOTrainer",
    default_policy=PPOTFPolicy,
    default_config=DEFAULT_CONFIG,
    make_policy_optimizer=choose_policy_optimizer,
    validate_config=validate_config,
    after_optimizer_step=update_kl,
    before_train_step=warn_about_obs_filter,
    after_train_result=warn_about_bad_reward_scales)
```

Let's dive into some of the parameters used above.

#### choose_policy_optimizer
This function is passed as `make_policy_optimizer` and chooses which [Policy Optimizer](https://ray.readthedocs.io/en/latest/rllib-concepts.html#policy-optimization) to use for distributed training. You can think of these policy optimizers as coordinating the distributed workflow needed to improve the policy. Depending on the trainer config, PPO can switch between a simple synchronous optimizer and a multi-GPU optimizer that implements minibatch SGD (the default):

```
def choose_policy_optimizer(workers, config):
    if config["simple_optimizer"]:
        return SyncSamplesOptimizer(
            workers,
            num_sgd_iter=config["num_sgd_iter"],
            train_batch_size=config["train_batch_size"])

    return LocalMultiGPUOptimizer(
        workers,
        sgd_batch_size=config["sgd_minibatch_size"],
        num_sgd_iter=config["num_sgd_iter"],
        num_gpus=config["num_gpus"],
        sample_batch_size=config["sample_batch_size"],
        num_envs_per_worker=config["num_envs_per_worker"],
        train_batch_size=config["train_batch_size"],
        standardize_fields=["advantages"],
        straggler_mitigation=config["straggler_mitigation"])
```

Suppose we want to customize PPO to use an asynchronous-gradient optimization strategy similar to A3C. To do that, we could define a new function that returns `AsyncGradientsOptimizer` and override the `make_policy_optimizer` component of PPOTrainer.

The code below uses `with_updates` to override specific fields of a predefined trainer.
The `with_updates` method is also available for Torch and TF policies built from templates.

In [None]:
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.optimizers import AsyncGradientsOptimizer

def make_async_optimizer(workers, config):
    return AsyncGradientsOptimizer(workers, grads_per_step=100)

CustomTrainer = PPOTrainer.with_updates(
    make_policy_optimizer=make_async_optimizer)
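
One way to picture what a `with_updates`-style method does is a template builder that remembers its arguments and rebuilds itself with some of them replaced. The sketch below is a plain-Python illustration under that assumption; `build_component` and the two toy optimizer functions are hypothetical and do not reflect how RLlib's templates are actually written.

```python
def build_component(name, **fields):
    """Hypothetical template builder: returns a class configured by `fields`."""

    class Component:
        config = dict(fields)

        @staticmethod
        def with_updates(**overrides):
            # Merge the original arguments with the overrides and rebuild.
            merged = dict(name=name, **fields)
            merged.update(overrides)
            return build_component(**merged)

    Component.__name__ = name
    return Component


def sync_optimizer(workers):
    return "sync optimizer over %d workers" % workers


def async_optimizer(workers):
    return "async optimizer over %d workers" % workers


BaseTrainer = build_component(
    "BaseTrainer", make_policy_optimizer=sync_optimizer, lr=0.001)
CustomTrainer = BaseTrainer.with_updates(
    name="CustomTrainer", make_policy_optimizer=async_optimizer)

print(BaseTrainer.config["make_policy_optimizer"](2))
print(CustomTrainer.config["make_policy_optimizer"](2))
```

The original class is left untouched, and every field that is not overridden (here `lr`) carries over to the copy.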

#### update_kl
This is used to adaptively adjust the KL penalty coefficient on the PPO loss, which bounds the policy change per training step. You’ll notice the code handles both single and multi-agent cases (where there may be multiple policies, each with a different KL coefficient):
```
def update_kl(trainer, fetches):
    if "kl" in fetches:
        # single-agent
        trainer.workers.local_worker().for_policy(
            lambda pi: pi.update_kl(fetches["kl"]))
    else:

        def update(pi, pi_id):
            if pi_id in fetches:
                pi.update_kl(fetches[pi_id]["kl"])
            else:
                logger.debug("No data for {}, not updating kl".format(pi_id))

        # multi-agent
        trainer.workers.local_worker().foreach_trainable_policy(update)
```
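
The adaptive adjustment mentioned above can be illustrated with a small RLlib-free sketch. The class name `KLCoeffState`, the thresholds (2x and 0.5x the target), and the scaling factors (1.5 and 0.5) are assumptions chosen for illustration; the text above does not state the exact rule the policy applies.

```python
class KLCoeffState:
    """Toy holder for an adaptive KL penalty coefficient (hypothetical)."""

    def __init__(self, kl_coeff=0.2, kl_target=0.01):
        self.kl_coeff = kl_coeff
        self.kl_target = kl_target

    def update_kl(self, sampled_kl):
        # Policy moved too far: penalize KL more strongly next time.
        if sampled_kl > 2.0 * self.kl_target:
            self.kl_coeff *= 1.5
        # Policy barely moved: relax the penalty.
        elif sampled_kl < 0.5 * self.kl_target:
            self.kl_coeff *= 0.5
        # Otherwise the measured KL is close to the target: leave it alone.
        return self.kl_coeff


state = KLCoeffState()
for kl in [0.05, 0.05, 0.001, 0.01]:
    print(kl, "->", state.update_kl(kl))
```

The coefficient acts like a thermostat: it only changes when the measured KL drifts well away from the target.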

The `update_kl` method on the policy is defined in [PPOTFPolicy](https://github.com/ray-project/ray/blob/master/rllib/agents/ppo/ppo_policy.py) via the `KLCoeffMixin`, along with several other advanced features. Let’s look at each new feature used by the policy:
```
PPOTFPolicy = build_tf_policy(
    name="PPOTFPolicy",
    get_default_config=lambda: ray.rllib.agents.ppo.ppo.DEFAULT_CONFIG,
    loss_fn=ppo_surrogate_loss,
    stats_fn=kl_and_loss_stats,
    extra_action_fetches_fn=vf_preds_and_logits_fetches,
    postprocess_fn=postprocess_ppo_gae,
    gradients_fn=clip_gradients,
    before_loss_init=setup_mixins,
    mixins=[LearningRateSchedule, KLCoeffMixin, ValueNetworkMixin])
```

**`stats_fn`**:  
The stats function returns a dictionary of Tensors that will be reported with the training results. This also includes the `kl` metric which is used by the trainer to adjust the KL penalty. Note that many of the values below reference `policy.loss_obj`, which is assigned by `loss_fn` (not shown here since the PPO loss is quite complex). RLlib will always call `stats_fn` after `loss_fn`, so you can rely on using values saved by `loss_fn` as part of your statistics:

```
def kl_and_loss_stats(policy, train_batch):
    policy.explained_variance = explained_variance(
        train_batch[Postprocessing.VALUE_TARGETS], policy.model.value_function())

    stats_fetches = {
        "cur_kl_coeff": policy.kl_coeff,
        "cur_lr": tf.cast(policy.cur_lr, tf.float64),
        "total_loss": policy.loss_obj.loss,
        "policy_loss": policy.loss_obj.mean_policy_loss,
        "vf_loss": policy.loss_obj.mean_vf_loss,
        "vf_explained_var": policy.explained_variance,
        "kl": policy.loss_obj.mean_kl,
        "entropy": policy.loss_obj.mean_entropy,
    }

    return stats_fetches
```

**`extra_action_fetches_fn`**  
This function defines extra outputs that will be recorded when generating actions with the policy.  
For example, this enables saving the raw policy logits in the experience batch, which e.g. means it can be referenced in the PPO loss function via `batch[BEHAVIOUR_LOGITS]`. Other values such as the current value prediction can also be emitted for debugging or optimization purposes:
```
def vf_preds_and_logits_fetches(policy):
    return {
        SampleBatch.VF_PREDS: policy.model.value_function(),
        BEHAVIOUR_LOGITS: policy.model.last_output(),
    }
```

**`gradients_fn`**  
 If defined, this function returns TF gradients for the loss function. You’d typically only want to override this to apply transformations such as gradient clipping:  
```
def clip_gradients(policy, optimizer, loss):
    if policy.config["grad_clip"] is not None:
        grads = tf.gradients(loss, policy.model.trainable_variables())
        policy.grads, _ = tf.clip_by_global_norm(grads,
                                                 policy.config["grad_clip"])
        clipped_grads = list(zip(policy.grads, policy.model.trainable_variables()))
        return clipped_grads
    else:
        return optimizer.compute_gradients(
            loss, colocate_gradients_with_ops=True)
```
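
For readers unfamiliar with global-norm clipping, the arithmetic behind `tf.clip_by_global_norm` can be reproduced in plain Python: every gradient is multiplied by `clip_norm / max(global_norm, clip_norm)`. The sketch below uses nested lists instead of tensors and is meant only to show the numbers.

```python
import math


def clip_by_global_norm(grads, clip_norm):
    """Scale all gradients by one factor so their joint L2 norm <= clip_norm."""
    global_norm = math.sqrt(sum(g * g for grad in grads for g in grad))
    scale = clip_norm / max(global_norm, clip_norm)
    clipped = [[g * scale for g in grad] for grad in grads]
    return clipped, global_norm


grads = [[3.0, 0.0], [0.0, 4.0]]     # joint norm is sqrt(9 + 16) = 5.0
clipped, norm = clip_by_global_norm(grads, clip_norm=1.0)
print(norm, clipped)
```

Because a single scale factor is shared by all variables, the direction of the overall update is preserved; only its length is capped.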

**`mixins`**  
To add arbitrary stateful components, you can add mixin classes to the policy. Methods defined by these mixins will have higher priority than the base policy class, so you can use these to override methods (as in the case of `LearningRateSchedule`), or define extra methods and attributes (e.g., `KLCoeffMixin`, `ValueNetworkMixin`). Like any other Python superclass, these should be initialized at some point, which is what the `setup_mixins` function does:  
```
def setup_mixins(policy, obs_space, action_space, config):
    ValueNetworkMixin.__init__(policy, obs_space, action_space, config)
    KLCoeffMixin.__init__(policy, config)
    LearningRateSchedule.__init__(policy, config["lr"], config["lr_schedule"])
```
In PPO we run `setup_mixins` before the loss function is called (i.e., `before_loss_init`), but other callbacks you can use include `before_init` and `after_init`.
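
The priority rule for mixins follows from Python's method resolution order. The sketch below assumes the mixins are simply placed before the base class in the list of bases; all class and function names are hypothetical toy versions, not RLlib's classes.

```python
class BasePolicy:
    def __init__(self):
        self.lr = 0.01

    def optimizer_lr(self):
        return self.lr


class ScheduledLR:
    """Mixin that overrides a base-class method."""

    def __init__(self, schedule):
        self.schedule = schedule
        self.step = 0

    def optimizer_lr(self):
        return self.schedule[min(self.step, len(self.schedule) - 1)]


class KLState:
    """Mixin that only adds new state and a new method."""

    def __init__(self, kl_coeff):
        self.kl_coeff = kl_coeff

    def update_kl(self, factor):
        self.kl_coeff *= factor


def build_policy(name, mixins, setup):
    # Mixins come first in the bases, so their methods win in the MRO.
    bases = tuple(mixins) + (BasePolicy,)

    def __init__(self):
        BasePolicy.__init__(self)
        setup(self)  # explicit mixin initialization, like setup_mixins

    return type(name, bases, {"__init__": __init__})


def setup_mixins(policy):
    ScheduledLR.__init__(policy, schedule=[0.1, 0.05])
    KLState.__init__(policy, kl_coeff=0.2)


ToyPolicy = build_policy("ToyPolicy", [ScheduledLR, KLState], setup_mixins)
policy = ToyPolicy()
print(policy.optimizer_lr())  # resolved on ScheduledLR, not BasePolicy
```

The mixin constructors are called by hand because the generated `__init__` has no way to know which arguments each mixin expects.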

### Example: DQN implementation
Let’s look at how to implement a different family of policies, by looking at the [SimpleQ policy definition](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/simple_q_policy.py):  
(Note that this is a simplified version of [DQNTFPolicy](https://github.com/ray-project/ray/blob/master/rllib/agents/dqn/dqn_policy.py))

```
SimpleQPolicy = build_tf_policy(
    name="SimpleQPolicy",
    get_default_config=lambda: ray.rllib.agents.dqn.dqn.DEFAULT_CONFIG,
    make_model=build_q_models,
    action_sampler_fn=build_action_sampler,
    loss_fn=build_q_losses,
    extra_action_feed_fn=exploration_setting_inputs,
    extra_action_fetches_fn=lambda policy: {"q_values": policy.q_values},
    extra_learn_fetches_fn=lambda policy: {"td_error": policy.td_error},
    before_init=setup_early_mixins,
    after_init=setup_late_mixins,
    obs_include_prev_action_reward=False,
    mixins=[
        ExplorationStateMixin,
        TargetNetworkMixin,
    ])
```

The biggest difference from the policy gradient policies you saw previously is that SimpleQPolicy defines its own `make_model` and `action_sampler_fn`. This means that the policy builder will not internally create a model and action distribution, rather it will call `build_q_models` and `build_action_sampler` to get the output action tensors.

**`build_q_models`**  
The model creation function actually creates two different models for DQN: the base Q network, and also a target network. It requires each model to be of type `SimpleQModel`, which implements a `get_q_values()` method. The model catalog will raise an error if you try to use a custom ModelV2 model that isn’t a subclass of `SimpleQModel`. Similarly, the full DQN policy requires models to subclass `DistributionalQModel`, which implements `get_q_value_distributions()` and `get_state_value()`:  
```
def build_q_models(policy, obs_space, action_space, config):
    ...

    policy.q_model = ModelCatalog.get_model_v2(
        obs_space,
        action_space,
        num_outputs,
        config["model"],
        framework="tf",
        name=Q_SCOPE,
        model_interface=SimpleQModel,
        q_hiddens=config["hiddens"])

    policy.target_q_model = ModelCatalog.get_model_v2(
        obs_space,
        action_space,
        num_outputs,
        config["model"],
        framework="tf",
        name=Q_TARGET_SCOPE,
        model_interface=SimpleQModel,
        q_hiddens=config["hiddens"])

    return policy.q_model

```

**`build_action_sampler`**  
The action sampler is straightforward: it takes the q_model, runs a forward pass, and returns the argmax over the actions:
```
def build_action_sampler(policy, q_model, input_dict, obs_space, action_space,
                         config):
    # do max over Q values...
    ...
    return action, action_logp

```
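
As a rough illustration of "argmax over the actions", assume the Q-values for a batch are given as nested lists. The helper `greedy_actions` is hypothetical and is not the code elided above.

```python
def greedy_actions(q_values_batch):
    """Pick, for each row of Q-values, the index of the largest one."""
    return [max(range(len(q)), key=q.__getitem__) for q in q_values_batch]


q_values_batch = [[0.1, 0.9], [2.0, -1.0], [0.3, 0.3]]
print(greedy_actions(q_values_batch))  # [1, 0, 0]; ties go to the lowest index
```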

The remainder of DQN is similar to other algorithms. Target updates are handled by an `after_optimizer_step` callback that periodically copies the weights of the Q network to the target.
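
The periodic copy can be sketched with plain dicts standing in for network weights. Everything here is hypothetical: the class `ToyQPolicy`, the factory `make_after_step_callback`, and the choice to count optimizer steps (a real implementation may count environment timesteps instead).

```python
class ToyQPolicy:
    """Holds online and target weights as plain dicts (hypothetical)."""

    def __init__(self):
        self.q_weights = {"w": 0.0}
        self.target_weights = dict(self.q_weights)

    def learn(self):
        self.q_weights["w"] += 1.0   # stand-in for one SGD step

    def update_target(self):
        self.target_weights = dict(self.q_weights)


def make_after_step_callback(target_update_freq):
    """Returns a callback that syncs the target every N optimizer steps."""
    state = {"steps": 0}

    def after_optimizer_step(policy):
        state["steps"] += 1
        if state["steps"] % target_update_freq == 0:
            policy.update_target()

    return after_optimizer_step


policy = ToyQPolicy()
callback = make_after_step_callback(target_update_freq=3)
history = []
for _ in range(7):
    policy.learn()
    callback(policy)
    history.append(policy.target_weights["w"])
print(history)  # [0.0, 0.0, 3.0, 3.0, 3.0, 6.0, 6.0]
```

The target lags behind the online weights between syncs, which is what keeps the bootstrapped Q-targets from shifting on every step.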

Finally, note that you do not have to use `build_tf_policy` to define a TensorFlow policy. You can alternatively subclass `Policy`, `TFPolicy`, or `DynamicTFPolicy` as convenient.

### Extending existing policies
You can use the `with_updates` method on Trainers and Policy objects built with the `build_*` helpers to create a copy of the object with some changes, for example:  

```
from ray.rllib.agents.ppo import PPOTrainer
from ray.rllib.agents.ppo.ppo_tf_policy import PPOTFPolicy

CustomPolicy = PPOTFPolicy.with_updates(
    name="MyCustomPPOTFPolicy",
    loss_fn=some_custom_loss_fn)

CustomTrainer = PPOTrainer.with_updates(
    default_policy=CustomPolicy)
```

## Policy Evaluation
Given an environment and policy, policy evaluation produces batches of experiences. This is your classic “environment interaction loop”. Efficient policy evaluation can be burdensome to get right, especially when leveraging vectorization, RNNs, or when operating in a multi-agent environment. RLlib provides a RolloutWorker class that manages all of this, and this class is used in most RLlib algorithms.

You can use rollout workers standalone to produce batches of experiences. This can be done by calling `worker.sample()` on a worker instance, or `worker.sample.remote()` in parallel on worker instances created as Ray actors (see WorkerSet).

Here is an example of creating a set of rollout workers and using them to gather experiences in parallel. The trajectories are concatenated, the policy learns on the trajectory batch, and then we broadcast the policy weights to the workers for the next round of rollouts:
```
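import gym
import ray
from ray.rllib.evaluation.worker_set import WorkerSet
from ray.rllib.policy.sample_batch import SampleBatch

# Setup policy and rollout workers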
env = gym.make("CartPole-v0")
policy = CustomPolicy(env.observation_space, env.action_space, {})
workers = WorkerSet(
    policy=CustomPolicy,
    env_creator=lambda c: gym.make("CartPole-v0"),
    num_workers=10)

while True:
    # Gather a batch of samples
    T1 = SampleBatch.concat_samples(
        ray.get([w.sample.remote() for w in workers.remote_workers()]))

    # Improve the policy using the T1 batch
    policy.learn_on_batch(T1)

    # Broadcast weights to the policy evaluation workers
    weights = ray.put({"default_policy": policy.get_weights()})
    for w in workers.remote_workers():
        w.set_weights.remote(weights)

```
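
The same gather / learn / broadcast loop can be tried without Ray or gym by replacing the workers with ordinary objects. This is a minimal sketch of the control flow only: `ToyPolicy`, `ToyWorker`, and `concat_samples` are hypothetical stand-ins, the "learning" is a made-up update, and everything runs sequentially in one process.

```python
import random


class ToyPolicy:
    def __init__(self):
        self.w = 0.0

    def learn_on_batch(self, batch):
        # Stand-in update: move w halfway toward the batch's mean reward.
        mean_reward = sum(batch["rewards"]) / len(batch["rewards"])
        self.w += 0.5 * (mean_reward - self.w)

    def get_weights(self):
        return {"w": self.w}

    def set_weights(self, weights):
        self.w = weights["w"]


class ToyWorker:
    """Holds its own policy copy and produces small reward batches."""

    def __init__(self, seed):
        self.policy = ToyPolicy()
        self.rng = random.Random(seed)

    def sample(self, n=4):
        return {"rewards": [1.0 + 0.1 * self.rng.random() for _ in range(n)]}

    def set_weights(self, weights):
        self.policy.set_weights(weights)


def concat_samples(batches):
    return {"rewards": [r for b in batches for r in b["rewards"]]}


learner = ToyPolicy()
workers = [ToyWorker(seed) for seed in range(3)]

for _ in range(5):
    batch = concat_samples([w.sample() for w in workers])   # gather
    learner.learn_on_batch(batch)                           # improve
    weights = learner.get_weights()
    for w in workers:                                       # broadcast
        w.set_weights(weights)

print(learner.get_weights())
```

After each round every worker holds the learner's latest weights, so the next batch is always sampled with the current policy.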

## Policy Optimization  
Similar to how a [gradient-descent optimizer](https://www.tensorflow.org/api_docs/python/tf/train/GradientDescentOptimizer) can be used to improve a model, RLlib’s [policy optimizers](https://github.com/ray-project/ray/tree/master/rllib/optimizers) implement different strategies for improving a policy.

For example, in A3C you’d want to compute gradients asynchronously on different workers, and apply them to a central policy replica. This strategy is implemented by the [AsyncGradientsOptimizer](https://github.com/ray-project/ray/blob/master/rllib/optimizers/async_gradients_optimizer.py). Another alternative is to gather experiences synchronously in parallel and optimize the model centrally, as in [SyncSamplesOptimizer](https://github.com/ray-project/ray/blob/master/rllib/optimizers/sync_samples_optimizer.py). Policy optimizers abstract these strategies away into reusable modules.

This is how the example in the previous section looks when written using a policy optimizer:

```
# Same setup as before
workers = WorkerSet(
    policy=CustomPolicy,
    env_creator=lambda c: gym.make("CartPole-v0"),
    num_workers=10)

# this optimizer implements the IMPALA architecture
optimizer = AsyncSamplesOptimizer(workers, train_batch_size=500)

while True:
    optimizer.step()
```

## Trainers & API
Trainers are the boilerplate classes that put the above components together, making algorithms accessible via Python API and the command line. They manage algorithm configuration, setup of the rollout workers and optimizer, and collection of training metrics. Trainers also implement the Trainable API for easy experiment management.

![rllib_api](rllib-api.svg)

Example of three equivalent ways of interacting with the PPO trainer, all of which log results in `~/ray_results`:

**Method 1**

```
trainer = PPOTrainer(env="CartPole-v0", config={"train_batch_size": 4000})
while True:
    print(trainer.train())
```

**Method 2**
```
rllib train --run=PPO --env=CartPole-v0 --config='{"train_batch_size": 4000}'
```
or 
```
rllib train --run DQN --env CartPole-v0  # --eager [--trace] for eager execution
```
or, if we have a tuned example, we can provide the YAML file:
```
rllib train -f /path/to/tuned/example.yaml
```
Running the `rllib train` command is equivalent to running the `train.py` script.  
The most important options for the script are:  
`--env` - for choosing the environment (any OpenAI gym environment including ones registered by the user can be used)   
`--run` - for choosing the algorithm (available options are SAC, PPO, PG, A2C, A3C, IMPALA, ES, DDPG, DQN, MARWIL, APEX, and APEX_DDPG).


**Method 3**  
All RLlib trainers are compatible with the [Tune API](https://ray.readthedocs.io/en/latest/tune-usage.html). This enables them to be easily used in experiments with [Tune](https://ray.readthedocs.io/en/latest/tune.html).
```
from ray import tune
tune.run(PPOTrainer, config={"env": "CartPole-v0", "train_batch_size": 4000})
```

Another example (taken from [here](https://ray.readthedocs.io/en/latest/rllib-training.html#basic-python-api)):  
```
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"episode_reward_mean": 200},
    config={
        "env": "CartPole-v0",
        "num_gpus": 0,
        "num_workers": 1,
        "lr": tune.grid_search([0.01, 0.001, 0.0001]),
        "eager": False,
    },
)
```

### Configuration

#### Specifying parameters
Each algorithm has specific hyperparameters that can be set with `--config`, in addition to a number of [common hyperparameters](https://github.com/ray-project/ray/blob/master/rllib/agents/trainer.py). See the [algorithms documentation](https://ray.readthedocs.io/en/latest/rllib-algorithms.html) for more information.  

You can also find the [common parameters in the documentation](https://ray.readthedocs.io/en/latest/rllib-training.html#common-parameters).

#### Specifying Resources
You can control the degree of parallelism used by setting the `num_workers` hyperparameter for most algorithms.  
The number of GPUs the driver should use can be set via the `num_gpus` option.  
Similarly, the resource allocation to workers can be controlled via `num_cpus_per_worker`, `num_gpus_per_worker`, and `custom_resources_per_worker`.  
The number of GPUs can be a fractional quantity to allocate only a fraction of a GPU. For example, with DQN you can pack five trainers onto one GPU by setting `num_gpus: 0.2`.
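
The packing arithmetic behind fractional GPUs is simple division. The helper below is a hypothetical sketch, not part of RLlib; it uses `Fraction` so that values such as `0.2` divide exactly instead of relying on floating-point rounding.

```python
from fractions import Fraction


def trainers_per_gpu(num_gpus_per_trainer):
    """How many trainers fit on one GPU for a fractional request."""
    share = Fraction(str(num_gpus_per_trainer))
    return int(Fraction(1) / share)


print(trainers_per_gpu(0.2))  # 5
print(trainers_per_gpu(0.3))  # 3, with a tenth of the GPU left unused
```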

![rllib-config](rllib-config.svg)

### Evaluating trained policies
In order to save checkpoints from which to evaluate policies, set `--checkpoint-freq` (number of training iterations between checkpoints) when running `rllib train`.

An example of evaluating a previously trained DQN policy is as follows:
```
rllib rollout \
    ~/ray_results/default/DQN_CartPole-v0_0upjmdgr0/checkpoint_1/checkpoint-1 \
    --run DQN --env CartPole-v0 --steps 10000
```

For more advanced evaluation functionality, refer to [Customized Evaluation During Training](https://ray.readthedocs.io/en/latest/rllib-training.html#customized-evaluation-during-training)

## Basic Python API
The Python API provides the needed flexibility for applying RLlib to new problems. You will need to use this API if you wish to use custom environments, preprocessors, or models with RLlib.

### Training 
Here is an example of the basic usage (for a more complete example, see custom_env.py):

In [None]:
import ray
import ray.rllib.agents.ppo as ppo
from ray.tune.logger import pretty_print

ray.init()
config = ppo.DEFAULT_CONFIG.copy()
config["num_gpus"] = 0
config["num_workers"] = 1
config["eager"] = False
trainer = ppo.PPOTrainer(config=config, env="CartPole-v0")

# Can optionally call trainer.restore(path) to load a checkpoint.

for i in range(1000):
   # Perform one iteration of training the policy with PPO
   result = trainer.train()
   print(pretty_print(result))

   if i % 100 == 0:
       checkpoint = trainer.save()
       print("checkpoint saved at", checkpoint)

**Note**
It’s recommended that you run RLlib trainers with [Tune](https://ray.readthedocs.io/en/latest/tune.html), for easy experiment management and visualization of results. Just set `"run": ALG_NAME, "env": ENV_NAME` in the experiment config: 

In [None]:
import ray
from ray import tune

ray.init()
tune.run(
    "PPO",
    stop={"episode_reward_mean": 200},
    config={
        "env": "CartPole-v0",
        "num_gpus": 0,
        "num_workers": 1,
        "lr": tune.grid_search([0.01, 0.001, 0.0001]),
        "eager": False,
    },
)

### Computing Actions
The simplest way to programmatically compute actions from a trained agent is to use `trainer.compute_action()`. This method preprocesses and filters the observation before passing it to the agent policy. For more advanced usage, you can access the workers and policies held by the trainer directly as `compute_action()` does:



In [None]:
class Trainer(Trainable):

    @PublicAPI
    def compute_action(self,
                     observation,
                     state=None,
                     prev_action=None,
                     prev_reward=None,
                     info=None,
                     policy_id=DEFAULT_POLICY_ID,
                     full_fetch=False):
        """Computes an action for the specified policy.

      Note that you can also access the policy object through
      self.get_policy(policy_id) and call compute_actions() on it directly.

      Arguments:
          observation (obj): observation from the environment.
          state (list): RNN hidden state, if any. If state is not None,
                        then all of compute_single_action(...) is returned
                        (computed action, rnn state, logits dictionary).
                        Otherwise compute_single_action(...)[0] is
                        returned (computed action).
          prev_action (obj): previous action value, if any
          prev_reward (int): previous reward, if any
          info (dict): info object, if any
          policy_id (str): policy to query (only applies to multi-agent).
          full_fetch (bool): whether to return extra action fetch results.
              This is always set to true if RNN state is specified.

      Returns:
          Just the computed action if full_fetch=False, or the full output
          of policy.compute_actions() otherwise.
      """

        if state is None:
            state = []
        preprocessed = self.workers.local_worker().preprocessors[policy_id].transform(observation)
        filtered_obs = self.workers.local_worker().filters[policy_id](preprocessed, update=False)
        if state:
            return self.get_policy(policy_id).compute_single_action(
              filtered_obs,
              state,
              prev_action,
              prev_reward,
              info,
              clip_actions=self.config["clip_actions"])
        res = self.get_policy(policy_id).compute_single_action(
          filtered_obs,
          state,
          prev_action,
          prev_reward,
          info,
          clip_actions=self.config["clip_actions"])
        if full_fetch:
            return res
        else:
            return res[0]  # backwards compatibility
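A minimal sketch of a rollout loop built around such a per-observation action function. `run_episode` and `CountdownEnv` are hypothetical names; the env is assumed to follow the classic Gym interface, where `reset()` returns an observation and `step()` returns `(obs, reward, done, info)`.

```python
def run_episode(env, compute_action, max_steps=1000):
    """Roll out one episode and return (total_reward, steps)."""
    obs = env.reset()
    total_reward, steps, done = 0.0, 0, False
    while not done and steps < max_steps:
        action = compute_action(obs)
        obs, reward, done, _info = env.step(action)
        total_reward += reward
        steps += 1
    return total_reward, steps


class CountdownEnv:
    """Hypothetical stand-in env: the episode ends after `n` steps."""

    def __init__(self, n=5):
        self.n = n
        self.t = 0

    def reset(self):
        self.t = 0
        return self.t

    def step(self, action):
        self.t += 1
        return self.t, 1.0, self.t >= self.n, {}


print(run_episode(CountdownEnv(5), compute_action=lambda obs: 0))  # (5.0, 5)
```

With a trained trainer, one would pass `trainer.compute_action` as `compute_action` and an instance of the environment used for training as `env`.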

### Accessing policy states
It is common to need to access a trainer’s internal state, e.g., to set or get internal weights. In RLlib, trainer state is replicated across multiple rollout workers (Ray actors) in the cluster. However, you can easily get and update this state between calls to `train()` via `trainer.workers.foreach_worker()` or `trainer.workers.foreach_worker_with_index()`. These functions take a lambda function that is applied with the worker as an arg. You can also return values from these functions and those will be returned as a list.

You can also access just the “master” copy of the trainer state through `trainer.get_policy()` or `trainer.workers.local_worker()`, but note that updates here may not be immediately reflected in remote replicas if you have configured `num_workers > 0`. For example, to access the weights of a local TF policy, you can run `trainer.get_policy().get_weights()`. This is also equivalent to `trainer.workers.local_worker().policy_map["default_policy"].get_weights()`:

In [None]:
# Get weights of the default local policy
trainer.get_policy().get_weights()

# Same as above
trainer.workers.local_worker().policy_map["default_policy"].get_weights()

# Get list of weights of each worker, including remote replicas
trainer.workers.foreach_worker(lambda ev: ev.get_policy().get_weights())

# Same as above
trainer.workers.foreach_worker_with_index(lambda ev, i: ev.get_policy().get_weights())
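To see why an update to the local copy is not automatically visible in the replicas, here is a toy model of a worker set. `ToyWorker` and `ToyWorkerSet` are hypothetical stand-ins written for this note, not RLlib classes; only the method names `local_worker` and `foreach_worker` follow the text above.

```python
class ToyWorker:
    """Hypothetical stand-in for a rollout worker holding policy weights."""

    def __init__(self, weights):
        self.weights = dict(weights)

    def get_weights(self):
        return dict(self.weights)

    def set_weights(self, weights):
        self.weights = dict(weights)


class ToyWorkerSet:
    """One local worker plus `num_workers` independent replicas."""

    def __init__(self, weights, num_workers):
        self._local = ToyWorker(weights)
        self._remote = [ToyWorker(weights) for _ in range(num_workers)]

    def local_worker(self):
        return self._local

    def foreach_worker(self, fn):
        # Apply fn to the local worker first, then to each replica,
        # and collect the return values in a list.
        return [fn(w) for w in [self._local] + self._remote]


workers = ToyWorkerSet({"w": 0.0}, num_workers=2)
workers.local_worker().set_weights({"w": 1.0})  # only the local copy changes
print(workers.foreach_worker(lambda w: w.get_weights()["w"]))  # [1.0, 0.0, 0.0]

new_weights = workers.local_worker().get_weights()
workers.foreach_worker(lambda w: w.set_weights(new_weights))  # broadcast
print(workers.foreach_worker(lambda w: w.get_weights()["w"]))  # [1.0, 1.0, 1.0]
```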

### Accessing Model State
Similar to accessing policy state, you may want to get a reference to the underlying neural network model being trained. For example, you may want to pre-train it separately, or otherwise update its weights outside of RLlib. This can be done by accessing the `model` of the policy:


**Example**: Preprocessing observations for feeding into a model
```
>>> import gym
>>> env = gym.make("Pong-v0")

# RLlib uses preprocessors to implement transforms such as one-hot encoding
# and flattening of tuple and dict observations.
>>> from ray.rllib.models.preprocessors import get_preprocessor
>>> prep = get_preprocessor(env.observation_space)(env.observation_space)
<ray.rllib.models.preprocessors.GenericPixelPreprocessor object at 0x7fc4d049de80>

# Observations should be preprocessed prior to feeding into a model
>>> env.reset().shape
(210, 160, 3)
>>> prep.transform(env.reset()).shape
(84, 84, 3)
```
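The two transforms named in the comments above (one-hot encoding and flattening of tuple observations) can be sketched in plain Python. The function names are hypothetical and this is not RLlib's preprocessor code, only an illustration of the idea for tuples of discrete observations.

```python
def one_hot(index, n):
    """Encode a discrete observation in {0, ..., n-1} as a length-n 0/1 vector."""
    if not 0 <= index < n:
        raise ValueError("index out of range")
    return [1.0 if i == index else 0.0 for i in range(n)]


def flatten_tuple_obs(obs, sizes):
    """Flatten a tuple of discrete observations into one flat vector.

    `sizes[i]` is the number of possible values of component i.
    """
    flat = []
    for value, n in zip(obs, sizes):
        flat.extend(one_hot(value, n))
    return flat


print(flatten_tuple_obs((2, 0), sizes=(3, 2)))  # [0.0, 0.0, 1.0, 1.0, 0.0]
```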

**Example**: Querying a policy’s action distribution
```
# Get a reference to the policy
>>> import numpy as np
>>> from ray.rllib.agents.ppo import PPOTrainer
>>> trainer = PPOTrainer(env="CartPole-v0", config={"eager": True, "num_workers": 0})
>>> policy = trainer.get_policy()
<ray.rllib.policy.eager_tf_policy.PPOTFPolicy_eager object at 0x7fd020165470>

# Run a forward pass to get model output logits. Note that complex observations
# must be preprocessed as in the above code block.
>>> logits, _ = policy.model.from_batch({"obs": np.array([[0.1, 0.2, 0.3, 0.4]])})
(<tf.Tensor: id=1274, shape=(1, 2), dtype=float32, numpy=...>, [])

# Compute action distribution given logits
>>> policy.dist_class
<class_object 'ray.rllib.models.tf.tf_action_dist.Categorical'>
>>> dist = policy.dist_class(logits, policy.model)
<ray.rllib.models.tf.tf_action_dist.Categorical object at 0x7fd02301d710>

# Query the distribution for samples, sample logps
>>> dist.sample()
<tf.Tensor: id=661, shape=(1,), dtype=int64, numpy=..>
>>> dist.logp([1])
<tf.Tensor: id=1298, shape=(1,), dtype=float32, numpy=...>

# Get the estimated values for the most recent forward pass
>>> policy.model.value_function()
<tf.Tensor: id=670, shape=(1,), dtype=float32, numpy=...>

>>> policy.model.base_model.summary()
Model: "model"
_____________________________________________________________________
Layer (type)               Output Shape  Param #  Connected to
=====================================================================
observations (InputLayer)  [(None, 4)]   0
_____________________________________________________________________
fc_1 (Dense)               (None, 256)   1280     observations[0][0]
_____________________________________________________________________
fc_value_1 (Dense)         (None, 256)   1280     observations[0][0]
_____________________________________________________________________
fc_2 (Dense)               (None, 256)   65792    fc_1[0][0]
_____________________________________________________________________
fc_value_2 (Dense)         (None, 256)   65792    fc_value_1[0][0]
_____________________________________________________________________
fc_out (Dense)             (None, 2)     514      fc_2[0][0]
_____________________________________________________________________
value_out (Dense)          (None, 1)     257      fc_value_2[0][0]
=====================================================================
Total params: 134,915
Trainable params: 134,915
Non-trainable params: 0
_____________________________________________________________________
```
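What a categorical action distribution does with the logits can be sketched without TensorFlow. `ToyCategorical` is a hypothetical class written for this note; it only mimics the `sample()` and `logp()` calls shown above and takes a single logits vector instead of a batch.

```python
import math
import random


class ToyCategorical:
    """Categorical distribution over actions, parameterized by logits."""

    def __init__(self, logits, seed=None):
        m = max(logits)  # subtract the max for numerical stability
        exps = [math.exp(x - m) for x in logits]
        total = sum(exps)
        self.probs = [e / total for e in exps]  # softmax of the logits
        self._rng = random.Random(seed)

    def sample(self):
        """Draw an action index according to the softmax probabilities."""
        return self._rng.choices(range(len(self.probs)), weights=self.probs)[0]

    def logp(self, action):
        """Log-probability of the given action index."""
        return math.log(self.probs[action])


dist = ToyCategorical([0.0, math.log(3.0)], seed=0)
print(dist.probs)     # approximately [0.25, 0.75]
print(dist.logp(1))   # log(0.75)
print(dist.sample())  # 0 or 1
```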

**Example: Getting Q values from a DQN model**
```
# Get a reference to the model through the policy
>>> import numpy as np
>>> from ray.rllib.agents.dqn import DQNTrainer
>>> trainer = DQNTrainer(env="CartPole-v0", config={"eager": True})
>>> model = trainer.get_policy().model
<ray.rllib.models.catalog.FullyConnectedNetwork_as_DistributionalQModel ...>

# List of all model variables
>>> model.variables()
[<tf.Variable 'default_policy/fc_1/kernel:0' shape=(4, 256) dtype=float32>, ...]

# Run a forward pass to get base model output. Note that complex observations
# must be preprocessed. An example of preprocessing is examples/saving_experiences.py
>>> model_out = model.from_batch({"obs": np.array([[0.1, 0.2, 0.3, 0.4]])})
(<tf.Tensor: id=832, shape=(1, 256), dtype=float32, numpy=...)

# Access the base Keras models (all default models have a base)
>>> model.base_model.summary()
Model: "model"
_______________________________________________________________________
Layer (type)                Output Shape    Param #  Connected to
=======================================================================
observations (InputLayer)   [(None, 4)]     0
_______________________________________________________________________
fc_1 (Dense)                (None, 256)     1280     observations[0][0]
_______________________________________________________________________
fc_out (Dense)              (None, 256)     65792    fc_1[0][0]
_______________________________________________________________________
value_out (Dense)           (None, 1)       257      fc_1[0][0]
=======================================================================
Total params: 67,329
Trainable params: 67,329
Non-trainable params: 0
______________________________________________________________________________

# Access the Q value model (specific to DQN)
>>> model.get_q_value_distributions(model_out)
[<tf.Tensor: id=891, shape=(1, 2)>, <tf.Tensor: id=896, shape=(1, 2, 1)>]

>>> model.q_value_head.summary()
Model: "model_1"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
model_out (InputLayer)       [(None, 256)]             0
_________________________________________________________________
lambda (Lambda)              [(None, 2), (None, 2, 1), 66306
=================================================================
Total params: 66,306
Trainable params: 66,306
Non-trainable params: 0
_________________________________________________________________

# Access the state value model (specific to DQN)
>>> model.get_state_value(model_out)
<tf.Tensor: id=913, shape=(1, 1), dtype=float32>

>>> model.state_value_head.summary()
Model: "model_2"
_________________________________________________________________
Layer (type)                 Output Shape              Param #
=================================================================
model_out (InputLayer)       [(None, 256)]             0
_________________________________________________________________
lambda_1 (Lambda)            (None, 1)                 66049
=================================================================
Total params: 66,049
Trainable params: 66,049
Non-trainable params: 0
_________________________________________________________________
```
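The parameter counts in the Keras summaries above can be checked by hand: a dense layer with `n_in` inputs and `n_out` units has `n_in * n_out` weights plus `n_out` biases. The helper name `dense_params` is just for this note.

```python
def dense_params(n_in, n_out):
    """Weights plus biases of a fully connected layer."""
    return n_in * n_out + n_out


# PPO model: two 256-unit hidden layers for each of the policy and value
# branches, then a 2-unit policy output and a 1-unit value output.
ppo_total = (2 * dense_params(4, 256) + 2 * dense_params(256, 256)
             + dense_params(256, 2) + dense_params(256, 1))

# DQN base model: fc_1, fc_out and value_out.
dqn_base = dense_params(4, 256) + dense_params(256, 256) + dense_params(256, 1)

print(ppo_total, dqn_base)  # 134915 67329
```

The two DQN head totals (66,306 and 66,049) are consistent with one 256-unit hidden layer followed by a 2-unit or 1-unit output layer; that is an inference from the counts, not something the summaries state.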

This is especially useful when used with [custom model classes](https://ray.readthedocs.io/en/latest/rllib-models.html).

## [Advanced Python API](https://ray.readthedocs.io/en/latest/rllib-training.html#advanced-python-apis)

### Custom Training Workflows
In the [basic training example](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_env.py), Tune will call `train()` on your trainer once per training iteration and report the new training results. Sometimes, it is desirable to have full control over training, but still run inside Tune. Tune supports [custom trainable functions](https://ray.readthedocs.io/en/latest/tune-usage.html#trainable-api) that can be used to implement [custom training workflows (example)](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_train_fn.py).

For even finer-grained control over training, you can use RLlib’s lower-level [building blocks](https://ray.readthedocs.io/en/latest/rllib-concepts.html) directly to implement [fully customized training workflows](https://github.com/ray-project/ray/blob/master/rllib/examples/rollout_worker_custom_workflow.py).

### Global Coordination
Read more details [here](https://ray.readthedocs.io/en/latest/rllib-training.html#global-coordination).

### Callbacks and Custom Metrics
You can provide callback functions to be called at points during policy evaluation. These functions have access to an info dict containing state for the current [episode](https://github.com/ray-project/ray/blob/master/rllib/evaluation/episode.py). Custom state can be stored for the episode in the `info["episode"].user_data` dict, and custom scalar metrics reported by saving values to the `info["episode"].custom_metrics` dict. These custom metrics will be aggregated and reported as part of training results. The following example (full code [here](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_metrics_and_callbacks.py)) logs a custom metric from the environment:

In [None]:
import numpy as np
import ray
from ray import tune

def on_episode_start(info):
    print(info.keys())  # -> "env", "episode"
    episode = info["episode"]
    print("episode {} started".format(episode.episode_id))
    episode.user_data["pole_angles"] = []

def on_episode_step(info):
    episode = info["episode"]
    pole_angle = abs(episode.last_observation_for()[2])
    episode.user_data["pole_angles"].append(pole_angle)

def on_episode_end(info):
    episode = info["episode"]
    pole_angle = np.mean(episode.user_data["pole_angles"])
    print("episode {} ended with length {} and pole angles {}".format(
        episode.episode_id, episode.length, pole_angle))
    episode.custom_metrics["pole_angle"] = pole_angle

def on_train_result(info):
    print("trainer.train() result: {} -> {} episodes".format(
        type(info["trainer"]).__name__, info["result"]["episodes_this_iter"]))

def on_postprocess_traj(info):
    episode = info["episode"]
    batch = info["post_batch"]  # note: you can mutate this
    print("postprocessed {} steps".format(batch.count))

ray.init()
analysis = tune.run(
    "PG",
    config={
        "env": "CartPole-v0",
        "callbacks": {
            "on_episode_start": on_episode_start,
            "on_episode_step": on_episode_step,
            "on_episode_end": on_episode_end,
            "on_train_result": on_train_result,
            "on_postprocess_traj": on_postprocess_traj,
        },
    },
)
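A sketch of what "aggregated and reported" can look like for custom metrics, in plain Python. The function name is hypothetical, and the `_mean` / `_min` / `_max` key suffixes are an assumption about how the aggregated values are named in the training result.

```python
from statistics import mean


def summarize_custom_metrics(episodes):
    """Aggregate per-episode custom metrics into mean/min/max entries."""
    collected = {}
    for metrics in episodes:
        for name, value in metrics.items():
            collected.setdefault(name, []).append(value)
    summary = {}
    for name, values in collected.items():
        summary[name + "_mean"] = mean(values)
        summary[name + "_min"] = min(values)
        summary[name + "_max"] = max(values)
    return summary


# One dict per finished episode, as filled in by on_episode_end.
episodes = [{"pole_angle": 0.25}, {"pole_angle": 0.5}, {"pole_angle": 0.75}]
print(summarize_custom_metrics(episodes))
# {'pole_angle_mean': 0.5, 'pole_angle_min': 0.25, 'pole_angle_max': 0.75}
```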

**Visualization**


### Customized Exploration Behavior (Training and Evaluation)
RLlib offers a unified top-level API to configure and customize an agent’s exploration behavior, including the decisions (how and whether) to sample actions from distributions (stochastically or deterministically). The setup can be done via using built-in Exploration classes (see this [package](https://github.com/ray-project/ray/blob/master/rllib/utils/exploration/)), which are specified (and further configured) inside `Trainer.config["exploration_config"]`. Besides using built-in classes, one can sub-class any of these built-ins, add custom behavior to it, and use that new class in the config instead.

Every policy has an instance of one of the Exploration (sub-)classes. This Exploration object is created from the Trainer’s `config["exploration_config"]` dict, which specifies the class to use via the special `"type"` key, as well as constructor arguments via all other keys, e.g.:
```
# in Trainer.config:
"exploration_config": {
    "type": "StochasticSampling",  # <- Special `type` key provides class information
    "[c'tor arg]" : "[value]",  # <- Add any needed constructor args here.
    # etc
}
# ...
```
The following table lists all built-in Exploration sub-classes and the agents that currently use them by default:
![exploration_api](rllib-exploration-api-table.svg)
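The `"type"`-plus-constructor-args convention can be sketched as a small factory. Everything here is a toy written for this note: the two classes are empty stand-ins that only borrow the names of built-ins, and `REGISTRY` and `exploration_from_config` are hypothetical, not RLlib functions.

```python
class StochasticSampling:
    """Hypothetical stand-in: takes no constructor arguments."""


class EpsilonGreedy:
    """Hypothetical stand-in with three constructor arguments."""

    def __init__(self, initial_epsilon=1.0, final_epsilon=0.1, epsilon_timesteps=1000):
        self.initial_epsilon = initial_epsilon
        self.final_epsilon = final_epsilon
        self.epsilon_timesteps = epsilon_timesteps


# Toy registry mapping the value of the "type" key to a class.
REGISTRY = {cls.__name__: cls for cls in (StochasticSampling, EpsilonGreedy)}


def exploration_from_config(exploration_config):
    """Build an object from a dict: "type" picks the class, the rest are kwargs."""
    kwargs = dict(exploration_config)  # copy so the caller's dict is untouched
    cls = REGISTRY[kwargs.pop("type")]
    return cls(**kwargs)


explorer = exploration_from_config({"type": "EpsilonGreedy", "final_epsilon": 0.02})
print(type(explorer).__name__, explorer.final_epsilon)  # EpsilonGreedy 0.02
```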

An Exploration class implements the `get_exploration_action` method, in which the exact exploratory behavior is defined. It takes the model’s output, the action distribution class, the model itself, a timestep (the global env-sampling steps already taken), and an `explore` switch and outputs a tuple of 1) action and 2) log-likelihood:

In [None]:
def get_exploration_action(self,
                           distribution_inputs,
                           action_dist_class,
                           model=None,
                           explore=True,
                           timestep=None):
    """Returns a (possibly) exploratory action and its log-likelihood.

    Given the Model's logits outputs and action distribution, returns an
    exploratory action.

    Args:
        distribution_inputs (any): The output coming from the model,
            ready for parameterizing a distribution
            (e.g. q-values or PG-logits).
        action_dist_class (class): The action distribution class
            to use.
        model (ModelV2): The Model object.
        explore (bool): True: "Normal" exploration behavior.
            False: Suppress all exploratory behavior and return
                a deterministic action.
        timestep (int): The current sampling time step. If None, the
            component should try to use an internal counter, which it
            then increments by 1. If provided, will set the internal
            counter to the given value.

    Returns:
        Tuple:
        - The chosen exploration action or a tf-op to fetch the exploration
          action from the graph.
        - The log-likelihood of the exploration action.
    """
    pass
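A toy version of this contract for a discrete action space, assuming the distribution inputs have already been turned into action probabilities. `toy_exploration_action` is a hypothetical function with a simplified signature, and returning a log-likelihood of `0.0` for the deterministic branch is an assumption of this sketch.

```python
import math
import random


def toy_exploration_action(probs, explore=True, rng=random):
    """Return (action, log-likelihood) for categorical probabilities `probs`."""
    if explore:
        # "Normal" exploration: sample from the distribution.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        return action, math.log(probs[action])
    # No exploration: pick the most likely action deterministically.
    action = max(range(len(probs)), key=probs.__getitem__)
    return action, 0.0  # assumption: log(1) = 0 for a deterministic choice


rng = random.Random(0)
print(toy_exploration_action([0.2, 0.8], explore=True, rng=rng))
print(toy_exploration_action([0.2, 0.8], explore=False))  # (1, 0.0)
```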

At the highest level, the `Trainer.compute_action` and `Policy.compute_action(s)` methods have a boolean `explore` switch, which is passed into `Exploration.get_exploration_action`. If it is `None`, the value of `Trainer.config["explore"]` is used. Hence `config["explore"]` describes the default behavior of the policy and, e.g., allows switching off any exploration easily for evaluation purposes (see [Customized Evaluation During Training](https://ray.readthedocs.io/en/latest/rllib-training.html#customevaluation)).

The following are example excerpts from different Trainers’ configs (see `rllib/agents/trainer.py`) to set up different exploration behaviors:

In [None]:
# All of the following configs go into Trainer.config.

# 1) Switching *off* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its `explore`
# param will result in no exploration.
# However, explicitly calling `compute_action(s)` with `explore=True` will
# still(!) result in exploration (per-call overrides default).
"explore": False,

# 2) Switching *on* exploration by default.
# Behavior: Calling `compute_action(s)` without explicitly setting its
# explore param will result in exploration.
# However, explicitly calling `compute_action(s)` with `explore=False`
# will result in no(!) exploration (per-call overrides default).
"explore": True,

# 3) Example exploration_config usages:
# a) DQN: see rllib/agents/dqn/dqn.py
"explore": True,
"exploration_config": {
   "type": "EpsilonGreedy",  # <- Exploration sub-class by name or full path to module+class
                             # (e.g. "ray.rllib.utils.exploration.epsilon_greedy.EpsilonGreedy")
   # Parameters for the Exploration class' constructor:
   "initial_epsilon": 1.0,
   "final_epsilon": 0.02,
   "epsilon_timesteps": 10000,  # Timesteps over which to anneal epsilon.
},

# b) DQN Soft-Q: In order to switch to Soft-Q exploration, do instead:
"explore": True,
"exploration_config": {
   "type": "SoftQ",
   # Parameters for the Exploration class' constructor:
   "temperature": 1.0,
},

# c) PPO: see rllib/agents/ppo/ppo.py
# Behavior: The algo samples stochastically by default from the
# model-parameterized distribution. This is the global Trainer default
# setting defined in trainer.py and used by all PG-type algos.
"explore": True,
"exploration_config": {
   "type": "StochasticSampling",
},
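To make the parameters in a) and b) concrete, here is a sketch of an annealed epsilon and of soft-Q action probabilities. Both function names are hypothetical, and the linear shape of the annealing is an assumption; the sketch only illustrates what `initial_epsilon`, `final_epsilon`, `epsilon_timesteps` and `temperature` control.

```python
import math


def linear_epsilon(timestep, initial_epsilon=1.0, final_epsilon=0.02,
                   epsilon_timesteps=10000):
    """Linearly anneal epsilon, then hold it at `final_epsilon`."""
    frac = min(max(timestep, 0) / epsilon_timesteps, 1.0)
    return initial_epsilon + frac * (final_epsilon - initial_epsilon)


def soft_q_probs(q_values, temperature=1.0):
    """Softmax over Q-values divided by the temperature."""
    scaled = [q / temperature for q in q_values]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


for t in (0, 5000, 10000, 20000):
    print(t, round(linear_epsilon(t), 3))

print(soft_q_probs([1.0, 1.0]))                   # [0.5, 0.5]
print(soft_q_probs([1.0, 2.0], temperature=0.1))  # nearly greedy
```

A lower temperature concentrates the probability mass on the action with the highest Q-value, while a higher one moves the distribution toward uniform.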

### Customized Evaluation During Training
RLlib will report online training rewards; however, in some cases you may want to compute rewards with different settings (e.g., with exploration turned off, or on a specific set of environment configurations). You can evaluate policies during training by setting one or more of the `evaluation_interval`, `evaluation_num_episodes`, `evaluation_config`, `evaluation_num_workers`, and `custom_eval_function` configs (see trainer.py for further documentation).

By default, exploration is left as-is within `evaluation_config`. However, you can switch off any exploration behavior for the evaluation workers via:
```
# Switching off exploration behavior for evaluation workers
# (see rllib/agents/trainer.py)
"evaluation_config": {
   "explore": False
}
```
**IMPORTANT NOTE**: Policy gradient algorithms are able to find the optimal policy, even if this is a stochastic one. Setting `"explore": False` above will result in the evaluation workers not using this optimal policy.

There is an end-to-end example of how to set up custom online evaluation in [custom_eval.py](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_eval.py). Note that if you only want to evaluate your policy at the end of training, you can set `evaluation_interval: N`, where N is the number of training iterations before stopping.

### Rewriting Trajectories
Note that in the `on_postprocess_traj` callback you have full access to the trajectory batch (`post_batch`) and other training state. This can be used to rewrite the trajectory, which has a number of uses including:

- Backdating rewards to previous time steps (e.g., based on values in `info`).
- Adding model-based curiosity bonuses to rewards (you can train the model with a custom model supervised loss).
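As an illustration of the first use case, here is a sketch of backdating rewards on a plain list. `backdate_rewards` is a hypothetical helper, and the fixed `delay` is an assumption made for the example; how the rewritten column is written back into `post_batch` is left out.

```python
def backdate_rewards(rewards, delay):
    """Move each reward `delay` steps earlier.

    Rewards that would land before step 0 are credited to step 0,
    so the total reward of the trajectory is unchanged.
    """
    shifted = [0.0] * len(rewards)
    for t, r in enumerate(rewards):
        shifted[max(t - delay, 0)] += r
    return shifted


print(backdate_rewards([0.0, 0.0, 1.0, 0.0, 2.0], delay=2))
# [1.0, 0.0, 2.0, 0.0, 0.0]
```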

### Curriculum Learning
Let’s look at two ways to use the above APIs to implement [curriculum learning](https://bair.berkeley.edu/blog/2017/12/20/reverse-curriculum/). In curriculum learning, the agent task is adjusted over time to improve the learning process. Suppose that we have an environment class with a `set_phase()` method that we can call to adjust the task difficulty over time:

**Approach 1**: Use the Trainer API and update the environment between calls to `train()`. This example shows the trainer being run inside a Tune function:

In [None]:
import ray
from ray import tune
from ray.rllib.agents.ppo import PPOTrainer

def train(config, reporter):
    trainer = PPOTrainer(config=config, env=YourEnv)
    while True:
        result = trainer.train()
        reporter(**result)
        if result["episode_reward_mean"] > 200:
            phase = 2
        elif result["episode_reward_mean"] > 100:
            phase = 1
        else:
            phase = 0
        trainer.workers.foreach_worker(
            lambda ev: ev.foreach_env(
                lambda env: env.set_phase(phase)))

ray.init()
tune.run(
    train,
    config={
        "num_gpus": 0,
        "num_workers": 2,
    },
    resources_per_trial={
        "cpu": 1,
        "gpu": lambda spec: spec.config.num_gpus,
        "extra_cpu": lambda spec: spec.config.num_workers,
    },
)

**Approach 2**: Use the callbacks API to update the environment on new training results:

In [None]:
import ray
from ray import tune

def on_train_result(info):
    result = info["result"]
    if result["episode_reward_mean"] > 200:
        phase = 2
    elif result["episode_reward_mean"] > 100:
        phase = 1
    else:
        phase = 0
    trainer = info["trainer"]
    trainer.workers.foreach_worker(
        lambda ev: ev.foreach_env(
            lambda env: env.set_phase(phase)))

ray.init()
tune.run(
    "PPO",
    config={
        "env": YourEnv,
        "callbacks": {
            "on_train_result": on_train_result,
        },
    },
)
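Both approaches repeat the same reward-to-phase mapping, which can be factored into a small helper. `phase_for` is a hypothetical name; the thresholds 100 and 200 are the ones used in the two examples above.

```python
def phase_for(mean_reward, thresholds=(100, 200)):
    """Return the curriculum phase: the number of thresholds strictly exceeded."""
    return sum(mean_reward > t for t in thresholds)


for reward in (50, 100, 150, 250):
    print(reward, phase_for(reward))
# 50 0
# 100 0
# 150 1
# 250 2
```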

## Debugging

### Gym Monitor
The `"monitor": true` config can be used to save Gym episode videos to the result dir. For example:

```
rllib train --env=PongDeterministic-v4 \
    --run=A2C --config '{"num_workers": 2, "monitor": true}'

# videos will be saved in the ~/ray_results/<experiment> dir, for example
openaigym.video.0.31401.video000000.meta.json
openaigym.video.0.31401.video000000.mp4
openaigym.video.0.31403.video000000.meta.json
openaigym.video.0.31403.video000000.mp4
```

### Eager Mode
Policies built with `build_tf_policy` (most of the reference algorithms are) can be run in eager mode by setting the `"eager": True` / `"eager_tracing": True` config options or using `rllib train --eager [--trace]`. This will tell RLlib to execute the model forward pass, action distribution, loss, and stats functions in eager mode.

Eager mode makes debugging much easier, since you can now use line-by-line debugging with breakpoints or Python `print()` to inspect intermediate tensor values. However, eager can be slower than graph mode unless tracing is enabled.

### Episode Traces
You can use the data output API to save episode traces for debugging. For example, the following command will run PPO while saving episode traces to `/tmp/debug`:
```
rllib train --run=PPO --env=CartPole-v0 \
    --config='{"output": "/tmp/debug", "output_compress_columns": []}'

# episode traces will be saved in /tmp/debug, for example
output-2019-02-23_12-02-03_worker-2_0.json
output-2019-02-23_12-02-04_worker-1_0.json
```

### Log Verbosity
You can control the trainer log level via the `"log_level"` flag. Valid values are `"DEBUG"`, `"INFO"`, `"WARN"` (default), and `"ERROR"`. This can be used to increase or decrease the verbosity of internal logging. You can also use the `-v` and `-vv` flags.  
For example, the following two commands are roughly equivalent:

```
rllib train --env=PongDeterministic-v4 --run=A2C --config '{"num_workers": 2, "log_level": "DEBUG"}'

rllib train --env=PongDeterministic-v4 --run=A2C --config '{"num_workers": 2}' -vv
```

### Stack Traces
You can use the `ray stack` command to dump the stack traces of all the Python workers on a single node. This can be useful for debugging unexpected hangs or performance issues.

## REST API
In some cases (i.e., when interacting with an externally hosted simulator or production environment) it makes more sense to interact with RLlib as if it were an independently running service, rather than RLlib hosting the simulations itself. This is possible via RLlib’s external agents [interface](https://ray.readthedocs.io/en/latest/rllib-env.html#interfacing-with-external-agents).

For a full client / server example that you can run, see the example [client script](https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_client.py) and also the corresponding [server script](https://github.com/ray-project/ray/blob/master/rllib/examples/serving/cartpole_server.py), here configured to serve a policy for the toy CartPole-v0 environment.

# Batch RL ([Offline Datasets](https://ray.readthedocs.io/en/latest/rllib-offline.html#rllib-offline-datasets))

RLlib’s offline dataset APIs enable working with experiences read from offline storage (e.g., disk, cloud storage, streaming systems, HDFS). For example, you might want to read experiences saved from previous training runs, or gathered from policies deployed in [web applications](https://arxiv.org/abs/1811.00260). You can also log new agent experiences produced during online training for future use.

RLlib represents trajectory sequences (i.e., (s, a, r, s', ...) tuples) with [SampleBatch](https://github.com/ray-project/ray/blob/master/rllib/policy/sample_batch.py) objects. Using a batch format enables efficient encoding and compression of experiences. During online training, RLlib uses [policy evaluation](https://ray.readthedocs.io/en/latest/rllib-concepts.html#policy-evaluation) actors to generate batches of experiences in parallel using the current policy. RLlib also uses this same batch format for reading and writing experiences to offline storage.
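The batch format is columnar: one array per key, with one row per timestep. A toy version with plain lists shows the idea, including how a batch can be cut into episodes by its `eps_id` column. This `split_by_episode` is a sketch written for this note, not the `SampleBatch` implementation.

```python
from itertools import groupby


def split_by_episode(batch):
    """Split a columnar batch (dict of equal-length lists) into one batch per episode."""
    n = len(batch["eps_id"])
    if any(len(col) != n for col in batch.values()):
        raise ValueError("all columns must have the same length")
    episodes, start = [], 0
    for _, group in groupby(batch["eps_id"]):  # consecutive rows with the same id
        length = len(list(group))
        episodes.append({k: v[start:start + length] for k, v in batch.items()})
        start += length
    return episodes


batch = {
    "eps_id":  [7, 7, 7, 9, 9],
    "obs":     [0, 1, 2, 0, 1],
    "actions": [1, 0, 1, 1, 1],
    "rewards": [1.0, 1.0, 1.0, 1.0, 1.0],
}
for ep in split_by_episode(batch):
    print(ep["eps_id"][0], sum(ep["rewards"]))
# 7 3.0
# 9 2.0
```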


**Example: Training on previously saved experiences**
For custom models and environments, you’ll need to use the [Python API](https://ray.readthedocs.io/en/latest/rllib-training.html#basic-python-api).

In this example, we will save batches of experiences generated during online training to disk, and then leverage this saved data to train a policy offline using DQN. First, we run a simple policy gradient algorithm for 100k steps with `"output": "/tmp/cartpole-out"` to tell RLlib to write simulation outputs to the `/tmp/cartpole-out` directory.

```
$ rllib train \
    --run=PG \
    --env=CartPole-v0 \
    --config='{"output": "/tmp/cartpole-out", "output_max_file_size": 5000000}' \
    --stop='{"timesteps_total": 100000}'
```

The experiences will be saved in compressed JSON batch format:

```
$ ls -l /tmp/cartpole-out
total 11636
-rw-rw-r-- 1 eric eric 5022257 output-2019-01-01_15-58-57_worker-0_0.json
-rw-rw-r-- 1 eric eric 5002416 output-2019-01-01_15-59-22_worker-0_1.json
-rw-rw-r-- 1 eric eric 1881666 output-2019-01-01_15-59-47_worker-0_2.json
```

Then, we can tell DQN to train using these previously generated experiences with `"input": "/tmp/cartpole-out"`. We disable exploration since it has no effect on the input:

```
$ rllib train \
    --run=DQN \
    --env=CartPole-v0 \
    --config='{
        "input": "/tmp/cartpole-out",
        "input_evaluation": [],
        "explore": false}'
```
**Off-Policy estimation**
Since the input experiences are not from running simulations, RLlib cannot report the true policy performance during training. However, you can use `tensorboard --logdir=~/ray_results` to monitor training progress via other metrics such as estimated Q-value. Alternatively, off-policy estimation can be used, which requires both the source and target action probabilities to be available (i.e., the `action_prob` batch key). For DQN, this means enabling soft Q learning so that actions are sampled from a probability distribution:

```
$ rllib train \
    --run=DQN \
    --env=CartPole-v0 \
    --config='{
        "input": "/tmp/cartpole-out",
        "input_evaluation": ["is", "wis"],
        "exploration_config": {
            "type": "SoftQ",
            "temperature": 1.0
        }}'
```
This example plot shows the Q-value metric in addition to importance sampling (IS) and weighted importance sampling (WIS) gain estimates (>1.0 means there is an estimated improvement over the original policy):
![offline_q](offline-q.png)
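For intuition, here is a sketch of the textbook per-episode versions of the two estimators. The function names and the episode dict layout are hypothetical, and RLlib's own estimators may differ in detail (for example by weighting step by step), so this is not their implementation.

```python
def episode_weight(target_probs, behavior_probs):
    """Product over the episode of pi_target(a|s) / pi_behavior(a|s)."""
    w = 1.0
    for p_new, p_old in zip(target_probs, behavior_probs):
        w *= p_new / p_old
    return w


def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over the episode."""
    return sum(r * gamma ** t for t, r in enumerate(rewards))


def is_and_wis(episodes, gamma=0.99):
    """Per-episode IS and WIS estimates of the target policy's value.

    Each episode is a dict with 'rewards', 'behavior_probs' (the logged
    action probabilities) and 'target_probs' (the probability the new
    policy assigns to the logged actions).
    """
    weights = [episode_weight(e["target_probs"], e["behavior_probs"]) for e in episodes]
    returns = [discounted_return(e["rewards"], gamma) for e in episodes]
    weighted = sum(w * g for w, g in zip(weights, returns))
    v_is = weighted / len(episodes)  # ordinary importance sampling
    v_wis = weighted / sum(weights)  # weighted (self-normalized) variant
    return v_is, v_wis


episodes = [
    {"rewards": [1.0, 1.0], "behavior_probs": [0.5, 0.5], "target_probs": [1.0, 1.0]},
    {"rewards": [1.0], "behavior_probs": [0.5], "target_probs": [0.5]},
]
print(is_and_wis(episodes, gamma=1.0))  # (4.5, 1.8)
```

WIS divides by the sum of the weights instead of the number of episodes, which trades a small bias for lower variance when a few episodes carry very large weights.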

**Estimator Python API**: For greater control over the evaluation process, you can create off-policy estimators in your Python code and call `estimator.estimate(episode_batch)` to perform counterfactual estimation as needed. The estimators take in a policy object and gamma value for the environment:

In [None]:
from ray.rllib.agents.dqn import DQNTrainer

trainer = DQNTrainer(...)
...  # train policy offline

from ray.rllib.offline.json_reader import JsonReader
from ray.rllib.offline.wis_estimator import WeightedImportanceSamplingEstimator

estimator = WeightedImportanceSamplingEstimator(trainer.get_policy(), gamma=0.99)
reader = JsonReader("/path/to/data")
for _ in range(1000):
    batch = reader.next()
    for episode in batch.split_by_episode():
        print(estimator.estimate(episode))

**Simulation-based estimation**: If true simulation is also possible (i.e., your env supports `step()`), you can also set `"input_evaluation": ["simulation"]` to tell RLlib to run background simulations to estimate current policy performance. The output of these simulations will not be used for learning. Note that in all cases you still need to specify an environment object to define the action and observation spaces. However, you don’t need to implement functions like `reset()` and `step()`.

## Experience files
As written above, the experience buffers of each worker are saved to JSON files using the [JsonWriter](https://github.com/ray-project/ray/blob/master/rllib/offline/json_writer.py) class.

To inspect the content of these JSON files, we can use the [JsonReader](https://github.com/ray-project/ray/blob/master/rllib/offline/json_reader.py) class as follows:


In [1]:
from ray.rllib.offline import JsonReader
jsons_path = '/home/guy/share/Data/MLA/ray/cartpole-out/*.json'
reader = JsonReader(jsons_path)
batch = reader.next()
batch.keys()

dict_keys(['t', 'eps_id', 'agent_index', 'obs', 'actions', 'rewards', 'prev_actions', 'prev_rewards', 'dones', 'infos', 'new_obs', 'action_prob', 'action_logp', 'unroll_id', 'advantages', 'value_targets'])

In [6]:
{k: (type(v), v.shape) for k, v in batch.items()}

{'t': (numpy.ndarray, (200,)),
 'eps_id': (numpy.ndarray, (200,)),
 'agent_index': (numpy.ndarray, (200,)),
 'obs': (numpy.ndarray, (200, 4)),
 'actions': (numpy.ndarray, (200,)),
 'rewards': (numpy.ndarray, (200,)),
 'prev_actions': (numpy.ndarray, (200,)),
 'prev_rewards': (numpy.ndarray, (200,)),
 'dones': (numpy.ndarray, (200,)),
 'infos': (numpy.ndarray, (200,)),
 'new_obs': (numpy.ndarray, (200, 4)),
 'action_prob': (numpy.ndarray, (200,)),
 'action_logp': (numpy.ndarray, (200,)),
 'unroll_id': (numpy.ndarray, (200,)),
 'advantages': (numpy.ndarray, (200,)),
 'value_targets': (numpy.ndarray, (200,))}
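
Because every column shares the same leading dimension, per-episode statistics can be computed by grouping rows on `eps_id`. The sketch below uses a plain dict of NumPy arrays as a stand-in for the batch returned by `reader.next()`, so it runs without RLlib. The helper name `episode_returns` and the toy values are hypothetical.

```python
import numpy as np

def episode_returns(batch):
    """Sum the rewards of each episode, keyed by its eps_id."""
    returns = {}
    for eps_id, reward in zip(batch["eps_id"], batch["rewards"]):
        key = int(eps_id)
        returns[key] = returns.get(key, 0.0) + float(reward)
    return returns

# Toy stand-in for a batch: two episodes of length 3 and 2
toy_batch = {
    "eps_id": np.array([7, 7, 7, 9, 9]),
    "rewards": np.array([1.0, 1.0, 1.0, 1.0, 1.0]),
}
print(episode_returns(toy_batch))  # {7: 3.0, 9: 2.0}
```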

**Example: Converting external experiences to batch format**  
When the env does not support simulation (e.g., it is a web application), it is necessary to generate the `*.json` experience batch files outside of RLlib. This can be done by using the [JsonWriter](https://github.com/ray-project/ray/blob/master/rllib/offline/json_writer.py) class to write out batches. This [runnable example](https://github.com/ray-project/ray/blob/master/rllib/examples/saving_experiences.py) shows how to generate and save experience batches for CartPole-v0 to disk:

In [None]:
import gym
import numpy as np

from ray.rllib.models.preprocessors import get_preprocessor
from ray.rllib.evaluation.sample_batch_builder import SampleBatchBuilder
from ray.rllib.offline.json_writer import JsonWriter

if __name__ == "__main__":
    batch_builder = SampleBatchBuilder()  # or MultiAgentSampleBatchBuilder
    writer = JsonWriter("/tmp/demo-out")

    # You normally wouldn't want to manually create sample batches if a
    # simulator is available, but let's do it anyways for example purposes:
    env = gym.make("CartPole-v0")

    # RLlib uses preprocessors to implement transforms such as one-hot encoding
    # and flattening of tuple and dict observations. For CartPole a no-op
    # preprocessor is used, but this may be relevant for more complex envs.
    prep = get_preprocessor(env.observation_space)(env.observation_space)
    print("The preprocessor is", prep)

    for eps_id in range(100):
        obs = env.reset()
        prev_action = np.zeros_like(env.action_space.sample())
        prev_reward = 0
        done = False
        t = 0
        while not done:
            action = env.action_space.sample()
            new_obs, rew, done, info = env.step(action)
            batch_builder.add_values(
                t=t,
                eps_id=eps_id,
                agent_index=0,
                obs=prep.transform(obs),
                actions=action,
                # Placeholder: put the true action probability here (it would
                # be 0.5 for uniform sampling over CartPole's two actions)
                action_prob=1.0,
                rewards=rew,
                prev_actions=prev_action,
                prev_rewards=prev_reward,
                dones=done,
                infos=info,
                new_obs=prep.transform(new_obs))
            obs = new_obs
            prev_action = action
            prev_reward = rew
            t += 1
        writer.write(batch_builder.build_and_reset())

**On-policy algorithms and experience postprocessing**  
RLlib assumes that input batches are of [postprocessed experiences](https://github.com/ray-project/ray/blob/b8a9e3f1064c6f8d754884fd9c75e0b2f88df4d6/rllib/policy/policy.py#L103). This isn’t typically critical for off-policy algorithms (e.g., DQN’s [post-processing](https://github.com/ray-project/ray/blob/b8a9e3f1064c6f8d754884fd9c75e0b2f88df4d6/rllib/agents/dqn/dqn_policy.py#L514) is only needed if `n_step > 1` or `worker_side_prioritization: True`). For off-policy algorithms, you can also safely set the `postprocess_inputs: True` config to auto-postprocess data.

However, for on-policy algorithms like PPO, you’ll need to pass in the extra values added during policy evaluation and postprocessing to `batch_builder.add_values()`, e.g., `logits`, `vf_preds`, `value_targets`, and `advantages` for PPO. This is needed since the calculation of these values depends on the parameters of the behaviour policy, which RLlib does not have access to in the offline setting (in online training, these values are automatically added during policy evaluation).

Note that for on-policy algorithms, you’ll also have to throw away experiences generated by prior versions of the policy. This greatly reduces sample efficiency, which is typically undesirable for offline training, but can make sense for certain applications.
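
As an illustration of what the `advantages` and `value_targets` columns mentioned above could contain, here is a minimal NumPy sketch of the standard generalized advantage estimation (GAE) recursion over one episode. The function names and the `last_value` and `lambda_` parameters are assumptions made for this example. Whether it matches RLlib's internal postprocessing exactly should be checked against the RLlib source.

```python
import numpy as np

def discounted_cumsum(x, discount):
    """out[t] = x[t] + discount * x[t+1] + discount**2 * x[t+2] + ..."""
    out = np.zeros(len(x), dtype=float)
    running = 0.0
    for t in reversed(range(len(x))):
        running = x[t] + discount * running
        out[t] = running
    return out

def gae_columns(rewards, vf_preds, last_value=0.0, gamma=0.99, lambda_=1.0):
    """Return (advantages, value_targets) for one episode."""
    rewards = np.asarray(rewards, dtype=float)
    vf_preds = np.asarray(vf_preds, dtype=float)
    # Append the bootstrap value used after the final step (0 if terminal)
    values = np.append(vf_preds, last_value)
    # One-step TD residuals
    deltas = rewards + gamma * values[1:] - values[:-1]
    advantages = discounted_cumsum(deltas, gamma * lambda_)
    value_targets = advantages + vf_preds
    return advantages, value_targets

adv, targets = gae_columns([1, 1, 1], [0.5, 0.5, 0.5], gamma=1.0, lambda_=1.0)
print(adv, targets)
```

With `lambda_=1.0` the value targets reduce to the discounted returns of the episode, which is a convenient sanity check for hand-built batches.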

**Mixing simulation and offline data**  
RLlib supports multiplexing inputs from multiple input sources, including simulation. In the following example, we read 40% of our experiences from `/tmp/cartpole-out`, 30% from `hdfs:/archive/cartpole`, and the last 30% is produced via policy evaluation. Input sources are multiplexed using `np.random.choice`:

```
$ rllib train \
    --run=DQN \
    --env=CartPole-v0 \
    --config='{
        "input": {
            "/tmp/cartpole-out": 0.4,
            "hdfs:/archive/cartpole": 0.3,
            "sampler": 0.3
        },
        "explore": false}'
```
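
The weighted choice between sources can be pictured with the following sketch. It only mimics the selection step and never reads any data. The helper `pick_source` is hypothetical, and it uses NumPy's `Generator.choice` with a fixed seed rather than the global `np.random.choice`, so that the run is reproducible.

```python
import numpy as np

# Same weights as in the CLI example above
input_weights = {
    "/tmp/cartpole-out": 0.4,
    "hdfs:/archive/cartpole": 0.3,
    "sampler": 0.3,
}

def pick_source(weights, rng):
    """Pick one input source name with probability equal to its weight."""
    sources = list(weights)
    probs = np.array([weights[s] for s in sources], dtype=float)
    return sources[rng.choice(len(sources), p=probs)]

rng = np.random.default_rng(0)
picks = [pick_source(input_weights, rng) for _ in range(10000)]
for source in input_weights:
    # Empirical fractions should be close to the configured weights
    print(source, picks.count(source) / len(picks))
```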

**Scaling I/O throughput**  
Similar to scaling online training, you can scale offline I/O throughput by increasing the number of RLlib workers via the `num_workers` config. Each worker accesses offline storage independently in parallel, for linear scaling of I/O throughput. Within each read worker, files are chosen in random order for reads, but file contents are read sequentially.

**Input Pipeline for Supervised Losses**  
You can also define supervised model losses over offline data. This requires defining a [custom model loss](https://ray.readthedocs.io/en/latest/rllib-models.html#supervised-model-losses). We provide a convenience function, `InputReader.tf_input_ops()`, that can be used to convert any input reader to a TF input pipeline. See [custom_loss.py](https://github.com/ray-project/ray/blob/master/rllib/examples/custom_loss.py) for a runnable example of using these TF input ops in a custom loss. For example:

In [None]:
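from ray.rllib.offline import JsonReader

# Method of a custom model class (see the linked custom_loss.py example)
def custom_loss(self, policy_loss):
    input_reader = JsonReader("/tmp/cartpole-out")
    # print(input_reader.next())  # if you want to access imperatively

    # Convert the reader into TF tensors, one per batch column
    input_ops = input_reader.tf_input_ops()
    print(input_ops["obs"])  # -> output Tensor shape=[None, 4]
    print(input_ops["actions"])  # -> output Tensor shape=[None]

    # some_function_of is a placeholder for your own supervised loss
    supervised_loss = some_function_of(input_ops)
    return policy_loss + supervised_loss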
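
One possible choice for the supervised loss placeholder above is a behaviour-cloning (imitation) loss: the negative log-likelihood of the logged actions under the model's action logits. The sketch below is written in NumPy purely for illustration, whereas a real custom loss would use TF ops. The name `imitation_loss` and the assumption that the model outputs one logit per discrete action are hypothetical.

```python
import numpy as np

def imitation_loss(logits, actions):
    """Mean negative log-likelihood of logged discrete actions."""
    logits = np.asarray(logits, dtype=float)
    actions = np.asarray(actions, dtype=int)
    # Numerically stable log-softmax over the action dimension
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Log-probability of the action that was actually logged in each row
    chosen = log_probs[np.arange(len(actions)), actions]
    return float(-chosen.mean())

# Uniform logits over two actions give a loss of log(2)
print(imitation_loss(np.zeros((3, 2)), [0, 1, 0]))
```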

**Input API**  
You can configure experience input for an agent using the following options:
```
    # Specify how to generate experiences:
    #  - "sampler": generate experiences via online simulation (default)
    #  - a local directory or file glob expression (e.g., "/tmp/*.json")
    #  - a list of individual file paths/URIs (e.g., ["/tmp/1.json",
    #    "s3://bucket/2.json"])
    #  - a dict with string keys and sampling probabilities as values (e.g.,
    #    {"sampler": 0.4, "/tmp/*.json": 0.4, "s3://bucket/expert.json": 0.2}).
    #  - a function that returns a rllib.offline.InputReader
    "input": "sampler",
    # Specify how to evaluate the current policy. This only has an effect when
    # reading offline experiences. Available options:
    #  - "wis": the weighted step-wise importance sampling estimator.
    #  - "is": the step-wise importance sampling estimator.
    #  - "simulation": run the environment in the background, but use
    #    this data for evaluation only and not for learning.
    "input_evaluation": ["is", "wis"],
    # Whether to run postprocess_trajectory() on the trajectory fragments from
    # offline inputs. Note that postprocessing will be done using the *current*
    # policy, not the *behavior* policy, which is typically undesirable for
    # on-policy algorithms.
    "postprocess_inputs": False,
    # If positive, input batches will be shuffled via a sliding window buffer
    # of this number of batches. Use this if the input data is not in random
    # enough order. Input is delayed until the shuffle buffer is filled.
    "shuffle_buffer_size": 0,
```
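
The idea behind `shuffle_buffer_size` can be sketched as a sliding-window shuffle. This is an illustrative generator under stated assumptions, not RLlib's implementation: the name `sliding_window_shuffle` and the `seed` parameter are hypothetical. Nothing is emitted until the buffer holds `buffer_size` batches, which mirrors the delay described in the comment above.

```python
import random

def sliding_window_shuffle(batches, buffer_size, seed=None):
    """Yield batches in locally shuffled order using a fixed-size buffer."""
    if buffer_size <= 0:
        # Shuffling disabled: pass batches through unchanged
        yield from batches
        return
    rng = random.Random(seed)
    buffer = []
    for batch in batches:
        if len(buffer) < buffer_size:
            # Output is delayed until the buffer is full
            buffer.append(batch)
            continue
        # Emit a random buffered batch and put the new one in its slot
        i = rng.randrange(len(buffer))
        yield buffer[i]
        buffer[i] = batch
    # Input exhausted: drain what is left in random order
    rng.shuffle(buffer)
    yield from buffer

print(list(sliding_window_shuffle(range(10), buffer_size=3, seed=0)))
```

A larger buffer gives an order closer to a full shuffle, at the cost of more memory and a longer initial delay.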


**Output API**  
You can configure experience output for an agent using the following options:
```
    # Specify where experiences should be saved:
    #  - None: don't save any experiences
    #  - "logdir" to save to the agent log dir
    #  - a path/URI to save to a custom output directory (e.g., "s3://bucket/")
    #  - a function that returns a rllib.offline.OutputWriter
    "output": None,
    # What sample batch columns to LZ4 compress in the output data.
    "output_compress_columns": ["obs", "new_obs"],
    # Max output file size before rolling over to a new file.
    "output_max_file_size": 64 * 1024 * 1024,
```
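
From the Python API, these options go into the same `config` dict that is passed to `tune.run` at the top of this notebook. The values below are illustrative only: the output path reuses `/tmp/cartpole-out` from the earlier examples, and the sketch assumes `output_max_file_size` is given in bytes, as the default of `64 * 1024 * 1024` suggests.

```python
# Illustrative config fragment for writing experiences to disk
output_config = {
    "env": "CartPole-v0",
    # Directory where the JSON experience files are written
    "output": "/tmp/cartpole-out",
    # Columns to LZ4 compress in the output data
    "output_compress_columns": ["obs", "new_obs"],
    # Roll over to a new file after roughly 5 MiB (assumed to be bytes)
    "output_max_file_size": 5 * 1024 * 1024,
}

# It would then be passed on to training, e.g.:
# tune.run(PPOTrainer, config=output_config)
print(output_config["output_max_file_size"])  # 5242880
```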

# RLlib Models, Preprocessors and Action Distributions 
See the [documentation](https://ray.readthedocs.io/en/latest/rllib-models.html#rllib-models-preprocessors-and-action-distributions).

# [RLlib Environments](https://ray.readthedocs.io/en/latest/rllib-env.html#rllib-environments)

# [RLlib Algorithms](https://ray.readthedocs.io/en/latest/rllib-algorithms.html#rllib-algorithms)
A description of the algorithms already implemented and their default configurations.

# RLlib [Examples](https://ray.readthedocs.io/en/latest/rllib-examples.html#rllib-examples)
This page links to many useful usage examples.

# RLlib [Package Reference](https://ray.readthedocs.io/en/latest/rllib-package-ref.html#rllib-package-reference)

# Reference
1. [RLlib Documentation](https://ray.readthedocs.io/en/latest/rllib.html)
1. [Blog post: Functional RL with Keras and Tensorflow Eager](https://medium.com/riselab/functional-rl-with-keras-and-tensorflow-eager-7973f81d6345)