# `rlplay`-ing around

In [None]:
import torch
import numpy

import matplotlib.pyplot as plt
%matplotlib inline

<br>

## Rollout collection

Rollout collection is designed to be as `plug-n-play` as possible, i.e. it
supports **arbitrarily structured nested containers** of arrays or tensors for
environment observations and actions. The actor, however, should **expose**
a certain API (described below).

In [None]:
from rlplay.engine import collect  # the collector's core

# print(collect.__doc__)

Its role is to serve as a *middle-man* between the **actor-environment** pair
and the **training loop**: to track the trajectory of the actor in the environment,
and properly record it into the data buffer.

For example, it is not responsible for seeding or randomization of environments
(I'm looking at you, `AtariEnv`), nor for datatype casting (except for rewards,
which are cast to `fp32` automatically). In theory, there is **no need** for
special data preprocessing, except for, perhaps, casting data to proper dtypes,
like from `numpy.float64` observations to `float32` in `CartPole`.

#### Semantics

The collector just carefully records the trajectory following the
**REACT** $\rightarrow$ **STEP+EMIT** process:

* **REACT** The actor performs the following update
$$
    (x_t, a_{t-1}, r_t, d_t, h_t)
        \overset{\mathrm{Actor}}{\longrightarrow}
        (a_t, h_{t+1})
    \,. $$


* **STEP+EMIT** The environment updates its unobserved state and emits observed data
$$
    (s_t, a_t)
        \overset{\mathrm{Env}}{\longrightarrow}
        (s_{t+1}, x_{t+1}, r_{t+1}, d_{t+1})
    \,, $$

where $d_t = \top$ if $s_t$ is terminal, else $\bot$, $r_t$ is a scalar reward
(auxiliary actor's and env's info dicts are not shown in this diagram).
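
The two-phase process above can be sketched in plain Python. This is only an illustration, not the collector's actual code: `ToyEnv`, `toy_actor`, and `collect_toy` are hypothetical names, and the sketch assumes that a terminated environment is reset immediately, with $d_t = \top$ marking the first record of the new episode.

```python
class ToyEnv:
    """Hypothetical environment: a counter that terminates after `horizon` steps."""

    def __init__(self, horizon=3):
        self.horizon, self.s = horizon, 0

    def reset(self):
        self.s = 0
        return float(self.s)  # the initial observation x_0

    def step(self, action):
        # STEP+EMIT: (s_t, a_t) -> (s_{t+1}, x_{t+1}, r_{t+1}, d_{t+1})
        self.s += 1
        return float(self.s), float(action), self.s >= self.horizon


def toy_actor(obs, act, rew, fin, hx):
    # REACT: (x_t, a_{t-1}, r_t, d_t, h_t) -> (a_t, h_{t+1})
    hx = 0 if fin else hx  # forget the recurrent state on reset
    return int(obs) % 2, hx + 1


def collect_toy(env, actor, n_steps):
    """Record (x_t, a_{t-1}, r_t, d_t) for t = 0..n_steps-1."""
    # assumption: the very first record is flagged as a reset
    obs, act, rew, fin, hx = env.reset(), 0, 0.0, True, 0
    trajectory = []
    for _ in range(n_steps):
        trajectory.append((obs, act, rew, fin))
        act, hx = actor(obs, act, rew, fin, hx)  # REACT
        obs, rew, fin = env.step(act)            # STEP+EMIT
        if fin:
            obs = env.reset()  # assumption: reset right after termination

    return trajectory


traj = collect_toy(ToyEnv(horizon=3), toy_actor, n_steps=5)
```

Each record pairs the observation $x_t$ with the action $a_{t-1}$ and the reward $r_t$ that led to it, which is the same layout the loss functions later in this notebook assume for `batch.state`.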

##### Requirements

* all nested containers **must be** built from pure python `dicts`, `lists`, `tuples` or `namedtuples`

* the environment communicates either in **numpy arrays** or in python **scalars**, but not in data types that are incompatible with pytorch (such as `str` or `bytes`)

```python
# example
obs = {
    'camera': {
        'rear': numpy.zeros((3, 320, 240)),   # the shape is passed as a tuple
        'front': numpy.zeros((3, 320, 240)),
    },
    'proximity': (+0.1, +0.2, -0.1, +0.0,),
    'other': {
        'fuel_tank': 78.5,
        'passenger': False,
    },
}
```

* the actor communicates in torch tensors **only**

* the environment produces **float scalar** rewards (other data may be communicated through auxiliary environment info-dicts)
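
The restriction to plain python containers is what makes it possible to walk a structure recursively and treat every leaf uniformly. Below is a minimal sketch of such a traversal; `map_leaves` is a hypothetical helper written for this illustration and is not the utility that `rlplay` itself uses.

```python
from collections import namedtuple


def map_leaves(fn, struct):
    """Apply `fn` to every leaf of a nested dict/list/tuple/namedtuple."""
    if isinstance(struct, dict):
        return {k: map_leaves(fn, v) for k, v in struct.items()}

    if isinstance(struct, tuple) and hasattr(struct, '_fields'):
        # namedtuples are rebuilt from their positional fields
        return type(struct)(*(map_leaves(fn, v) for v in struct))

    if isinstance(struct, (list, tuple)):
        return type(struct)(map_leaves(fn, v) for v in struct)

    return fn(struct)  # anything else is treated as a leaf


Point = namedtuple('Point', ['x', 'y'])
obs = {
    'proximity': (+0.1, +0.2, -0.1, +0.0),
    'other': {'fuel_tank': 78.5, 'passenger': False},
    'position': Point(1, 2),
}

# cast every leaf to a python float, keeping the nesting intact
as_float = map_leaves(float, obs)
```

The result has the same keys and container types as `obs`, only the leaves change. Replacing `float` with a function that converts arrays to tensors would give the kind of structure-preserving cast that the requirements above allow for.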

### Creating the actors

Rollout collection relies on the following API of the actor:
* `.reset(j, hx)` reset the recurrent state of the j-th environment in the batch (if applicable)
  * `hx` contains tensors with shape `n_lstm_layers x batch x hidden`, or is an empty tuple

* `.step(obs, act, rew, fin, hx)` get the next action $a_t$, the recurrent state $h_{t+1}$, and
the **extra info** in response to $x_t$, $a_{t-1}$, $r_t$, $d_t$, and $h_t$, respectively.
  * extra info `dict` **should** include `value` key with a `T x B` tensor of value estimates
  * MUST allocate new `hx` if the recurrent state is updated
  * MUST NOT change the inputs in-place

* `.value(obs, act, rew, fin, hx)` compute the value-function estimate $
      v(s_t) \approx G_t = \mathbb{E} \sum_{j\geq t} \gamma^{j-t} r_{j+1}
  $.
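
For a finite recorded fragment, the discounted sum $G_t$ can be computed by a backward recursion. The sketch below is plain Python and only assumes the recursion `ret[t] = rew[t] + gamma * (1 - fin[t]) * ret[t+1]`, where `rew[t]` and `fin[t]` stand for $r_{t+1}$ and $d_{t+1}$, and `bootstrap` replaces the unknown tail beyond the fragment. `discounted_returns` is an illustrative name, not a function of `rlplay`.

```python
def discounted_returns(rew, fin, gamma, bootstrap=0.0):
    """Backward recursion for the returns over one fragment."""
    ret, future = [0.0] * len(rew), bootstrap
    for t in reversed(range(len(rew))):
        # a terminal flag cuts off everything that follows step t
        future = rew[t] + gamma * (0.0 if fin[t] else 1.0) * future
        ret[t] = future

    return ret


rew = [1.0, 1.0, 1.0, 1.0]
fin = [False, False, True, False]
ret = discounted_returns(rew, fin, gamma=0.5, bootstrap=2.0)
# ret == [1.75, 1.5, 1.0, 2.0]
```

The third step is terminal, so the bootstrap value only reaches the last entry and does not leak into the returns of the preceding episode.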

In [None]:
from rlplay.engine import BaseActorModule

# BaseActorModule??

`BaseActorModule` is essentially a thin sub-class of `torch.nn.Module`,
that implements the API through `.forward(obs, act, rew, fin, hx)`, which
should return three things:

1. `actions` prescribed actions in the environment, with data of shape `n_steps x batch x ...`
  * can be a nested container of dicts, lists, and tuples


2. `hx` data with shape `n_steps x batch x ...`
  * can be a nested container of dicts, lists, and tuples
  * **if an actor is not recurrent**, then must return an empty tuple `()`


3. `info` dict with extra `n_steps x batch x ...` data
  * `value` -- the value function estimates
  * `logits` -- the policy logits (if applicable)
  * and other stuff

Here is an example actor, that wraps a simple MLP policy.

In [None]:
from rlplay.utils.common import multinomial

class nonRecurrentPolicyWrapper(BaseActorModule):
    """Example wrapper for a non-recurrent policy.
    
    Details
    -------
    This example assumes flat `Discrete(n)` action space, and
    simple non-structured observation space, e.g. a python scalar
    or a `numpy.array`.
    """

    def __init__(self, policy, *, epsilon=0.1):
        super().__init__()
        self.policy, self.epsilon = policy, epsilon

    def forward(self, obs, act=None, rew=None, fin=None, *, hx=None, stepno=None):
        # Everything is  [T x B x ...]
        logits, hx = self.policy(obs, act, rew), ()

        # value must not have any trailing dims, i.e. T x B
        value = logits.new_zeros(fin.shape)

        # XXX eps-greedy?
        if self.training:
            unif = torch.tensor(1. / logits.shape[-1])

            prob = logits.detach().exp()
            prob.mul_(1 - self.epsilon)
            prob.add_(unif, alpha=self.epsilon)

            actions = multinomial(prob)

        else:
            actions = logits.argmax(dim=-1)

        return actions, hx, value, dict(logits=logits)

<br>

### Rollout collection (same-process)

Collect rollouts within the current process

In [None]:
from rlplay.engine.rollout import same

The parameters have the following meaning
```python
n_envs = 16     # the number of envs in the batch
n_steps = 51    # the length of each rollout fragment
sticky = False  # whether to stop interacting if an env resets mid-fragment
device = None   # specifies the device to put the actor's inputs onto
```


`rollout()` returns an iterator, which does the following, roughly.

Prepare the run-time context for the specified `actor` and the environments
```python
# spawn multiple envs
envs = [env_factory() for _ in range(n_envs)]

# initialize a buffer for one rollout fragment (optionally pinned)
buffer = prepare(envs[0], actor, n_steps, len(envs),
                 pinned=pinned, device=device)

# the running context for the actor and the envs
ctx, fragment = startup(envs, actor, buffer, pinned=pinned)
```

Now within the infinite loop it does the following
```python
# collect the fragment
collect(envs, actor, fragment, ctx, sticky=sticky, device=device)

# fragment.pyt -- torch tensors, fragment.npy -- numpy arrays (aliased)
# copy fragment.pyt onto `device`, and yield it to the user
```

The user has to manually limit the number of iterations using, for example,

```python
it = same.rollout(...)

for b, batch in zip(range(100), it):
    # train on batch
    pass

it.close()
```
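
An alternative way to bound an endless iterator is `itertools.islice`, combined with `try ... finally` so that the iterator is closed even if training raises. The sketch below uses a hypothetical stand-in generator, `endless_batches`, in place of a real rollout iterator.

```python
from itertools import islice


def endless_batches(log):
    """Hypothetical stand-in for a rollout iterator that never stops by itself."""
    n = 0
    try:
        while True:
            yield n
            n += 1

    finally:
        # runs when `.close()` is called on the generator
        log.append('closed')


log = []
it = endless_batches(log)
try:
    # take exactly five batches, then stop pulling from the iterator
    seen = [batch for batch in islice(it, 5)]

finally:
    it.close()
```

Note that in the `zip(range(100), it)` idiom the order of arguments matters: `zip` polls its arguments left to right, so with the `range` first no extra batch is fetched and discarded when the limit is hit.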

<br>

### Rollout collection (single-process)

Single-actor rollout sampler running in a parallel process (double-buffered).

In [None]:
from rlplay.engine.rollout import single

Under the hood the function creates **two** rollout fragment buffers, maintains
a reference to the specified `actor`, makes a shared copy of it (on the host), and
then spawns one worker process.

The worker, in turn, makes its own local copy of the actor on the specified device,
initializes the environments and the running context. During collection it alternates
between the buffers, into which it records the rollout fragments it collects. Except
for double buffering, the logic is identical to `rollout`.
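
The double-buffering idea can be illustrated with a small self-contained sketch. It makes several simplifying assumptions that differ from the real sampler: a thread stands in for the worker process, python lists stand in for shared tensors, and the names `worker`, `free`, and `ready` are made up for this example. What it does keep is the exchange of buffer *indices* rather than data.

```python
import queue
import threading


def worker(buffers, free, ready, n_fragments):
    """Fill released buffers with fragments and announce their indices."""
    for step in range(n_fragments):
        j = free.get()                            # wait for a released buffer
        buffers[j][:] = [step] * len(buffers[j])  # "collect" a fragment in-place
        ready.put(j)                              # hand over the index only

    ready.put(None)                               # sentinel: nothing more to come


buffers = [[0] * 4, [0] * 4]                      # two preallocated buffers
free, ready = queue.Queue(), queue.Queue()
for j in range(len(buffers)):
    free.put(j)

thread = threading.Thread(target=worker, args=(buffers, free, ready, 5))
thread.start()

received = []
while True:
    j = ready.get()
    if j is None:
        break

    received.append(list(buffers[j]))             # copy out before releasing
    free.put(j)

thread.join()
```

While the consumer is busy with one buffer, the worker is free to fill the other one, and a buffer is never overwritten before its index has been returned through `free`.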

The local copies of the actor are **automatically updated** from the maintained reference.

```python
it = single.rollout(
    factory,              # the environment factory
    actor,                # the actor reference, used to update the local actors

    n_steps,              # the duration of a rollout fragment
    n_envs,               # the number of independent environments in the batch

    sticky=False,         # do we freeze terminated environments until the end of the rollout?
                          #  required if we wish to leverage cudnn's fast RNN implementations,
                          #  instead of manually stepping through the RNN core.

    clone=True,           # should the worker use a local clone of the reference actor

    close=True,           # should we `.close()` the environments when cleaning up?
                          #  some envs are very particular about this, e.g. nle

    start_method='fork',  # `fork` in notebooks, `spawn` in linux/macos and if we interchange
                          #  cuda tensors between processes (we DO NOT do that: we exchange indices
                          #  to host-shared tensors)

    device=None,          # the device on which to collect rollouts (the local actor is moved
                          #  onto this device)
)

# ...

it.close()
```

<br>

### Rollout collection (multi-process)

A more load-balanced multi-actor multi-process sampler

In [None]:
from rlplay.engine.rollout import multi

This version of the rollout collector allocates several buffers and spawns
many parallel workers. Each worker creates its own local copy of the actor,
instantiates `n_envs` local environments and allocates a running context for
all of them. The rollout collection in each worker is **hardcoded to run on
the host device**.

```python
it = multi.rollout(
    factory,              # the environment factory
    actor,                # the actor reference, used to update the local actors

    n_steps,              # the duration of each rollout fragment

    n_actors,             # the number of parallel actors
    n_per_actor,          # the number of independent environments run in each actor
    n_buffers,            # the size of the pool of buffers, into which rollout
                          #  fragments are collected. Should not be less than `n_actors`.
    n_per_batch,          # the number of fragments collated into a batch

    sticky=False,         # do we freeze terminated environments until the end of the rollout?
                          #  required if we wish to leverage cudnn's fast RNN implementations,
                          #  instead of manually stepping through the RNN core.

    pinned=False,

    clone=True,           # should the parallel actors use a local clone of the reference actor

    close=True,           # should we `.close()` the environments when cleaning up?
                          #  some envs are very particular about this, e.g. nle

    device=None,          # the device onto which to move the rollout batches

    start_method='fork',  # `fork` in notebooks, `spawn` in linux/macos and if we interchange
                          #  cuda tensors between processes (we DO NOT do that: we exchange indices
                          #  to host-shared tensors)
)

# ...

it.close()
```

<br>

### Evaluation (same-process)

In order to evaluate an actor in a batch of environments, one can use `evaluate`.

In [None]:
from rlplay.engine import evaluate as core_evaluate

The function *does not* collect the rollout data, except for the rewards.
Below is the intended use case.
* **NB** this is run in the same process, hence blocks until completion, which
might take considerable time (esp. if `n_steps` is unbounded)

In [None]:
# same process
def same_evaluate(
    factory, actor, n_envs=4,
    *, n_steps=None, close=True, render=False, device=None
):
    # spawn a batch of environments
    envs = [factory() for _ in range(n_envs)]

    try:
        while True:
            rewards, bootstrap = core_evaluate(
                envs, actor, n_steps=n_steps,
                render=render, device=device)

            # get the accumulated rewards (gamma=1)
            yield sum(rewards)

    finally:
        if close:
            for e in envs:
                e.close()

<br>

### Evaluation (parallel process)

Like rollout collection, evaluation can (and probably should) be performed in
a parallel process, so that it does not burden the main thread with computations
not related to training.

In [None]:
from rlplay.engine.rollout.evaluate import evaluate

<br>

## CartPole with REINFORCE or A2C

In [None]:
import gym

# hotfix for gym's unresponsive viz (spawns gl threads!)
import rlplay.utils.integration.gym

The environment factory

In [None]:
import time
from rlplay.zoo.env import NarrowPath


class FP32Observation(gym.ObservationWrapper):
    def observation(self, observation):
        obs = observation.astype(numpy.float32)
        obs[0] = 0.  # mask the position info
        return obs

#     def step(self, action):
#         obs, reward, done, info = super().step(action)
#         reward -= abs(obs[1]) / 10  # punish for non-zero speed
#         return obs, reward, done, info

class OneHotObservation(gym.ObservationWrapper):
    def observation(self, observation):
        return numpy.eye(1, self.env.observation_space.n,
                         k=observation, dtype=numpy.float32)[0]

def factory():
    return FP32Observation(gym.make("CartPole-v0").unwrapped)
#     return gym.make("Taxi-v3").unwrapped
    # return OneHotObservation(NarrowPath())

Service functions for the pg algorithms

In [None]:
from rlplay.algo.returns import pyt_returns
from rlplay.engine.utils.plyr import suply, getitem

The reinforce PG algo
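
Before the torch implementation, here is a plain-Python sketch of the importance-weighted surrogate it computes, for scalar per-step log-probabilities and without autograd. The names `clipped_rho` and `surrogate` are illustrative and not part of `rlplay`. In the torch code the weight is computed from detached values, so it acts as a constant multiplier, which is how it is treated here.

```python
import math


def clipped_rho(log_pi_a, log_mu_a, c_rho=float('inf')):
    """rho = min(pi(a|s) / mu(a|s), c_rho), computed from log-probabilities."""
    return min(math.exp(log_pi_a - log_mu_a), c_rho)


def surrogate(log_pi_a, log_mu_a, ret, baseline, c_rho=float('inf')):
    """Average of rho_t * (G_t - b_t) * log pi(a_t | s_t) over the steps."""
    terms = [
        clipped_rho(lp, lm, c_rho) * (g - b) * lp
        for lp, lm, g, b in zip(log_pi_a, log_mu_a, ret, baseline)
    ]
    return sum(terms) / len(terms)


log_half = math.log(0.5)

# identical target and behaviour policies give a unit weight
on_policy = clipped_rho(log_half, log_half)

# a ratio of about 0.8 / 0.2 = 4 is clipped down to c_rho
off_policy = clipped_rho(math.log(0.8), math.log(0.2), c_rho=1.5)

score = surrogate([log_half], [log_half], ret=[2.0], baseline=[1.0])
```

With `c_rho=float('inf')`, the default in `reinforce` below, no clipping takes place, and the weight reduces to the plain likelihood ratio.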

In [None]:
def reinforce(batch, module, *, gamma=0.99, C_entropy=1e-2,
              c_rho=float('inf')):
    r"""The REINFORCE algorithm (importance-weighted off-policy).

    The basic policy-gradient algorithm with a baseline $b_t$:
    $$
        \nabla_\theta J(s_t)
            = \mathbb{E}_{a \sim \beta(a\mid s_t)}
                \frac{\pi(a\mid s_t)}{\beta(a\mid s_t)}
                    \bigl( r_{t+1} + \gamma G_{t+1} - b_t \bigr)
                \nabla_\theta \log \pi(a\mid s_t)
        \,. $$
    """

    # XXX `state[t]` = (x_t, a_{t-1}, r_t, d_t), t=0..T-1
    state = suply(getitem, batch.state, index=slice(None, -1))

    # XXX `state_next[t]` = (x_{t+1}, a_t, r_{t+1}, d_{t+1}), t=0..T-1
    state_next = suply(getitem, batch.state, index=slice(1, None))

    # REACT: (state[t], h_t) \to (\hat{a}_t, h_{t+1}, \hat{A}_{t+1})
    _, _, _, info = module(
        state.obs, state.act, state.rew, state.fin,
        hx=batch.hx, stepno=state.stepno)

    # The present value of the future rewards following `state[t]`:
    #    G_t = r_{t+1} + \gamma G_{t+1}
    ret = pyt_returns(state_next.rew, state_next.fin,
                      gamma=gamma, bootstrap=torch.tensor(0.))

    # Assume `.act` is unstructured: `state_next.act[t]` = a_t -->> T x B x 1
    act = state_next.act.unsqueeze(-1)

    # \pi is the target policy, \mu is the behaviour policy
    log_pi, log_mu = info['logits'], batch.actor['logits']

    # the importance weights
    log_pi_a = log_pi.gather(-1, act).squeeze(-1)
    log_mu_a = log_mu.gather(-1, act).squeeze(-1)
    rho = log_mu_a.sub_(log_pi_a.detach())\
                  .neg_().exp_().clamp_(max=c_rho)

    # the policy surrogate score
    #    \frac1T \sum_t \rho_t (G_t - b_t) \log \pi(a_t \mid s_t)
    reinfscore = log_pi_a.mul(ret.sub(ret.mean(dim=0)).mul_(rho)).mean()

    # the policy entropy score (neg entropy)
    #   - H(\pi(•\mid s)) = - (-1) \sum_a \pi(a\mid s) \log \pi(a\mid s)
    f_min = torch.finfo(log_pi.dtype).min
    negentropy = log_pi.exp().mul(log_pi.clamp(min=f_min)).sum(dim=-1).mean()

    # maximize the entropy and the reinforce score
    # \ell := - \frac1T \sum_t G_t \log \pi(a_t \mid s_t)
    #         - C \mathbb{H} \pi(\cdot \mid s_t)
    loss = C_entropy * negentropy - reinfscore
    return loss.mean(), dict(entropy=-float(negentropy),
                             policy_score=float(reinfscore),)

Actor-critic algo
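
The two terms that set A2C apart from REINFORCE are the advantage `G_t - v_t` in the policy score and the critic's regression loss. A plain-Python sketch for one trajectory is given below. It assumes on-policy data (all importance weights equal to one) and omits the entropy term; `a2c_terms` is an illustrative name.

```python
def a2c_terms(ret, value, log_pi_a):
    """Return (policy score, critic loss) for one trajectory of T steps."""
    n_steps = len(ret)

    # advantages are treated as constants in the policy score
    adv = [g - v for g, v in zip(ret, value)]
    score = sum(a * lp for a, lp in zip(adv, log_pi_a)) / n_steps

    # the critic regresses v_t onto the one-point estimate G_t (half MSE)
    critic = 0.5 * sum(a * a for a in adv) / n_steps
    return score, critic


score, critic = a2c_terms(
    ret=[2.0, 1.0], value=[1.0, 1.0], log_pi_a=[-0.5, -1.0])
# score == -0.25, critic == 0.25
```

In the second step the critic's estimate matches the return exactly, so that step contributes neither to the policy score nor to the critic loss.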

In [None]:
import torch.nn.functional as F

def a2c(batch, module, *, gamma=0.99, C_entropy=1e-2, C_value=0.25, c_rho=1.0):
    r"""The Advantage Actor-Critic algorithm (importance-weighted off-policy).

    Close to REINFORCE, but uses a separate baseline estimate to compute
    advantages in the policy grad.
    $$
        \nabla_\theta J(s_t)
            = \mathbb{E}_{a \sim \beta(a\mid s_t)}
                \frac{\pi(a\mid s_t)}{\beta(a\mid s_t)}
                    \bigl( r_{t+1} + \gamma G_{t+1} - v(s_t) \bigr)
                \nabla_\theta \log \pi(a\mid s_t)
        \,, $$
    where the critic estimates the value function under the current policy
    $
    v(s_t) \approx \mathbb{E}_{\pi_{\geq t}}
                    G_t(a_t, s_{t+1}, a_{t+1}, ... \mid s_t)
    $.
    """
    # XXX `state[t]` = (x_t, a_{t-1}, r_t, d_t), t=0..T-1
    state = suply(getitem, batch.state, index=slice(None, -1))

    # XXX `state_next[t]` = (x_{t+1}, a_t, r_{t+1}, d_{t+1}), t=0..T-1
    state_next = suply(getitem, batch.state, index=slice(1, None))

    # REACT: (state[t], h_t) \to (\hat{a}_t, h_{t+1}, \hat{A}_{t+1})
    _, _, value, info = module(
        state.obs, state.act, state.rew, state.fin,
        hx=batch.hx, stepno=state.stepno)
    # value = V(`.state[t]`)
    #       <<-->> v(x_t)
    #       \approx \mathbb{E}( G_t \mid x_t)
    #       \approx \mathbb{E}( r_{t+1} + \gamma r_{t+2} + ... \mid x_t)
    #       <<-->> npv(`.state[t+1:]`)
    # info['logits'] = \log \pi(... | .state[t] ) <<-->> \log \pi( \cdot \mid x_t)

    # Future rewards following `.state[t]` are recorded in `.state[t+1:]`
    # ret[t] = rew[t] + gamma * (1 - fin[t]) * (ret[t+1] or bootstrap)
    #     `bootstrap` <<-->> `.value[-1]` = V(`.state[-1]`)
    ret = pyt_returns(state_next.rew, state_next.fin,
                      gamma=gamma, bootstrap=batch.value[-1])
    # XXX post-mul by `1 - \gamma` fails to train, but seems appropriate
    # for the continuation/survival interpretation of the discount factor.
    #   <<-- but who says this is a good interpretation?
    # ret.mul_(1 - gamma)

    # Assume `.act` is unstructured: `state_next.act[t]` = a_t -->> T x B x 1
    act = state_next.act.unsqueeze(-1)

    # \pi is the target policy, \mu is the behaviour policy
    log_pi, log_mu = info['logits'], batch.actor['logits']

    # the importance weights
    log_pi_a = log_pi.gather(-1, act).squeeze(-1)
    log_mu_a = log_mu.gather(-1, act).squeeze(-1)
    rho = log_mu_a.sub_(log_pi_a.detach())\
                  .neg_().exp_().clamp_(max=c_rho)

    # the critic's loss (half of the mse, to be minimized)
    #  \frac1{2 T} \sum_t (G_t - v(s_t))^2
    critic_mse = 0.5 * F.mse_loss(value, ret, reduction='mean')
    # v(x_t) \approx \mathbb{E}( G_t \mid x_t )
    #        \approx G_t (one-point estimate)
    #        <<-->> ret[t]

    # the policy surrogate score
    #    \frac1T \sum_t \rho_t (G_t - v_t) \log \pi(a_t \mid s_t)
    a2c_score = log_pi_a.mul(ret.sub(value.detach()).mul_(rho)).mean()

    # the policy entropy score (neg entropy)
    #   - H(\pi(•\mid s)) = - (-1) \sum_a \pi(a\mid s) \log \pi(a\mid s)
    f_min = torch.finfo(log_pi.dtype).min
    negentropy = log_pi.exp().mul(log_pi.clamp(min=f_min)).sum(dim=-1).mean()

    # maximize the entropy and the a2c score, minimize the critic loss
    objective = C_entropy * negentropy + C_value * critic_mse - a2c_score
    return objective.mean(), dict(entropy=-float(negentropy),
                                  policy_score=float(a2c_score),
                                  value_loss=float(critic_mse))

In [None]:
# pyt_gae(batch.state.rew, batch.state.fin, batch.actor['value'], gamma=0.99, bootstrap=batch.bootstrap[0])

The policy of the actor

A more sophisticated recurrent learner:

In [None]:
from torch.nn.utils.rnn import pack_padded_sequence
from torch.nn.utils.rnn import pad_packed_sequence


class CartPoleActor(BaseActorModule):
    def __init__(self, epsilon=0.1, lstm=False):
        super().__init__()
        self.epsilon, self.lstm = epsilon, lstm

        self.features = torch.nn.ModuleDict(dict(
            obs=torch.nn.Sequential(
                torch.nn.Linear(4, 64),
                torch.nn.ReLU(),
            ),
            act=torch.nn.Embedding(2, 4),
            rew=torch.nn.Sequential(
                torch.nn.Linear(1, 4),
                torch.nn.ReLU(),
            ),
        ))

        n_features = 64 + 4 + 4
        if not self.lstm:
            self.core = torch.nn.Sequential(
                torch.nn.Linear(n_features, 64),
                torch.nn.ReLU(),
            )

        else:
            self.core = torch.nn.GRU(n_features, 64, 1)

        self.policy = torch.nn.Sequential(
            torch.nn.Linear(64, 2),
            torch.nn.LogSoftmax(dim=-1),
        )

        self.baseline = torch.nn.Sequential(
            torch.nn.Linear(64, 1),
        )

    def forward(self, obs, act, rew, fin, *, hx=None, stepno=None):
        # Everything is  [T x B x ...]
        inputs = torch.cat([
            self.features['obs'](obs),
            self.features['act'](act),
            self.features['rew'](rew.unsqueeze(-1)),
        ], dim=-1)
        
        if not self.lstm:
            output, hx = self.core(inputs), ()

        elif False:
            # sequence padding (MUST have sampling with `sticky=True`)
            n_steps, n_env, *_ = fin.shape
            input = inputs
            if n_steps > 1:
                # we assume sticky=True
                lengths = 1 + (~fin[1:]).sum(0).cpu()
                input = pack_padded_sequence(inputs, lengths, enforce_sorted=False)

            output, hx = self.core(input, hx)
            if n_steps > 1:
                output, lens = pad_packed_sequence(
                    output, batch_first=False, total_length=n_steps)

        else:
            # inputs is T x B x F, hx is either None, or a proper recurrent state
            outputs = []
            # manually step through the RNN core
            for input, mask in zip(inputs.unsqueeze(1), fin.unsqueeze(-1)):
                # zero if f indicates reset: multiplying by zero stops grad
                if hx is not None:
                    # stop hx grads if `reset` (mul-by-zero)
                    hx = suply(torch.Tensor.mul, hx, other=~mask)

                output, hx = self.core(input, hx)
                outputs.append(output)

            output = torch.cat(outputs, dim=0)

        # value must not have any trailing dims, i.e. T x B
        logits = self.policy(output)
        value = self.baseline(output).squeeze(-1)

        # XXX eps-greedy?
        if self.training:
            unif = torch.tensor(1. / logits.shape[-1])

            prob = logits.detach().exp()
            prob.mul_(1 - self.epsilon)
            prob.add_(unif, alpha=self.epsilon)

            actions = multinomial(prob)

        else:
            actions = logits.argmax(dim=-1)

        return actions, hx, value, dict(
            logits=logits,
            # entropy
            # logit_at_the_chosen_action
        )

Initialize the learner

In [None]:
# learner, sticky = nonRecurrentPolicyWrapper(policy()), False
learner = CartPoleActor(lstm=False)
sticky = False  # learner.lstm

learner.train()
device_ = torch.device('cpu')  # torch.device('cuda:0')
learner.to(device=device_)

# prepare the optimizer for the learner
optim = torch.optim.Adam(learner.parameters(), lr=1e-3, weight_decay=1e-3)

Load a better trained agent

Initialize the sampler

In [None]:
# T, B = 120, 20
# T, B = 120, 4
T, B = 21, 4

Pick one collector
* The NetHack environment `nle` does not like the `fork` start method, so we should use `spawn`, which is not notebook-friendly :(
  * essentially, it is better to prototype in a notebook with `same.rollout`, and then move to a non-interactive script or submodule that uses `multi.rollout`

In [None]:
# generator of rollout batches
batchit = multi.rollout(
    factory,
    learner,
    n_steps=T,
    n_actors=8,
    n_per_actor=B,
    n_buffers=16,
    n_per_batch=2,
    sticky=sticky,  # so that we can leverage cudnn's fast RNN implementations
    pinned=False,
    clone=False,
    close=False,
    device=device_,
    start_method='fork',  # fork in notebook for macos, spawn in linux
)

Implement your favourite training method

In [None]:
import tqdm
from torch.nn.utils import clip_grad_norm_

gamma = 0.99
losses, rewards = [], []

# generator of evaluation rewards
# test_it = test(factory, learner, n_envs=4, n_steps=500, device=device_)
test_it = evaluate(factory, learner, n_envs=4, n_steps=500,
                   clone=False, device=device_, start_method='fork')

# the training loop
exclude = {'returns'}
ewm, alpha = None, 0.5
for epoch in tqdm.tqdm(range(400)):
    for j, batch in zip(range(100), batchit):

        optim.zero_grad()
        loss, info = a2c(batch, learner, gamma=gamma, c_rho=1.5)
        loss.backward()
        grad_norm = clip_grad_norm_(learner.parameters(), max_norm=1e2)
        optim.step()

        losses.append({
            k: float(v) for k, v in info.items() if k not in exclude
        })
        losses[-1].update({'grad': float(grad_norm)})

    rewards.append(next(test_it)[0])
    
    # track minimal reward
    if ewm is None:
        ewm = rewards[-1].min()
    else:
        ewm += alpha * (rewards[-1].min() - ewm)

    if ewm > 498:
        break

# close the generators
batchit.close()
test_it.close()

In [None]:
batchit.close()

<br>

In [None]:
def collate(records):
    """collate identically keyed dicts"""
    out = {}
    for record in records:
        for k, v in record.items():
            out.setdefault(k, []).append(v)
    
    return out

In [None]:
data = {k: numpy.array(v) for k, v in collate(losses).items()}

In [None]:
if 'value_loss' in data:
    plt.semilogy(data['value_loss'])

In [None]:
plt.plot(data['entropy'])

In [None]:
plt.plot(data['policy_score'])

In [None]:
plt.semilogy(data['grad'])

In [None]:
rewards = numpy.stack(rewards, axis=0)

In [None]:
rewards

In [None]:
m, s = numpy.median(rewards, axis=-1), rewards.std(axis=-1)

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(5, 3), dpi=300)

ax.plot(numpy.mean(rewards, axis=-1))
ax.plot(numpy.median(rewards, axis=-1))
ax.plot(numpy.min(rewards, axis=-1))
ax.plot(numpy.std(rewards, axis=-1))
# ax.plot(m+s * 1.96)
# ax.plot(m-s * 1.96)

plt.show()

In [None]:
with factory() as env:
    learner.eval()
    rewards, bootstrap = core_evaluate([
        env
    ], learner, render=True, n_steps=1e4, device=device_)

print(sum(rewards), bootstrap[0])

In [None]:
from rlplay.algo.returns import npy_returns
plt.plot(
    npy_returns(rewards, numpy.zeros_like(rewards, dtype=bool),
                gamma=gamma, bootstrap=bootstrap[0]))

<br>

In [None]:
assert False

<br>

In [None]:
import matplotlib.pyplot as plt

p_l, v_l, ent = zip(*losses)

plt.plot(p_l)
plt.plot(ent)

In [None]:
plt.plot(v_l)

Run in the environment

In [None]:
plt.plot([
    sum(evaluate(factory, learner, render=False))
    for _ in range(200)
])

<br>

In [None]:
assert False

In [None]:
class Bar:
    def __init__(self, parent):
        self.parent = parent
        self._range = range(self.parent.n)
        self._it = iter(self._range)

    def __iter__(self):
        return self

    def __next__(self):
        return next(self._it)

class Foo:
    def __init__(self, n=10):
        self.n = n

    def __iter__(self):
        return Bar(self)


In [None]:
list(Foo())

In [None]:
class Bar:
    def __init__(self, parent):
        self.parent = parent

    def __iter__(self):
        yield from range(self.parent.n)

class Foo:
    def __init__(self, n=10):
        self.n = n

    def __iter__(self):
        return iter(Bar(self))


<br>