# Outlook

This notebook shows how to use a gymnasium environment as a BBRL agent in practice, using autoreset=True.
It is part of the [BBRL documentation](https://github.com/osigaud/bbrl/docs/index.html).

If this is your first contact with BBRL, you may start by having a look at [this more basic notebook](01-basic_concepts.student.ipynb) and [the one using autoreset=False](02-multi_env_noautoreset.student.ipynb).

## Installation and Imports

The BBRL library is [here](https://github.com/osigaud/bbrl).

Below, we import standard Python packages, PyTorch packages and gymnasium environments.

In [1]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[classic_control]")

[easypip] Installing bbrl_gymnasium>=0.2.0
[easypip] Installing bbrl_gymnasium[classic_control]


In [2]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime
OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)

In [3]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class
# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1, agent2, agent3, ...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached

from bbrl.agents import Agents, TemporalAgent
from bbrl.agents.gymnasium import ParallelGymAgent, make_env

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

## Definition of agents

We reuse the RandomAgent already used in the autoreset=False case.

In [4]:
class RandomAgent(Agent):
    def __init__(self, action_dim):
        super().__init__()
        self.action_dim = action_dim

    def forward(self, t: int, choose_action=True, **kwargs):
        """An Agent can use self.workspace"""
        obs = self.get(("env/env_obs", t))
        action = torch.randint(0, self.action_dim, (len(obs), ))
        self.set(("action", t), action)

As before, we create an Agent representing [the CartPole-v1 gym environment](https://gymnasium.farama.org/environments/classic_control/cart_pole/).
This is done using the [ParallelGymAgent](https://github.com/osigaud/bbrl/blob/40fe0468feb8998e62c3cd6bb3a575fef88e256f/src/bbrl/agents/gymnasium.py#L261) class.

### Single environment case

We start with a single instance of the CartPole environment.

In [5]:
# We deal with 1 environment (random seed 2139)

env_agent = ParallelGymAgent(partial(make_env, env_name='CartPole-v1', autoreset=True), num_envs=1).seed(2139)
obs_size, action_dim = env_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space R^{action_dim}")

# Each agent is run in the order given when constructing Agents

agents = Agents(env_agent, RandomAgent(action_dim))
t_agents = TemporalAgent(agents)

Environment: observation space in R^4 and action space R^2
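
The execution order of `Agents` and `TemporalAgent` can be pictured with a toy loop. The sketch below is plain Python, not BBRL code: `run_temporal`, `toy_env` and `toy_policy` are hypothetical stand-ins, and a dictionary only imitates a workspace. It assumes that the environment agent resets at t=0 and otherwise consumes the action written at t-1.

```python
# Toy stand-ins (hypothetical names), NOT the BBRL classes.

def run_temporal(agents, workspace, t=0, n_steps=1):
    """Imitates TemporalAgent(Agents(...)): at each time step,
    call every agent in the order in which it was given."""
    for step in range(t, t + n_steps):
        for agent in agents:
            agent(workspace, step)


def toy_env(workspace, t):
    """Writes an 'observation' at time t (assumption: reset at t=0,
    otherwise use the action that was written at t-1)."""
    if t == 0:
        obs = 0
    else:
        obs = workspace["obs"][t - 1] + workspace["action"][t - 1]
    workspace.setdefault("obs", {})[t] = obs


def toy_policy(workspace, t):
    """Reads the observation at time t and writes an action at time t."""
    _ = workspace["obs"][t]
    workspace.setdefault("action", {})[t] = 1


toy_workspace = {}
run_temporal([toy_env, toy_policy], toy_workspace, n_steps=4)
print(toy_workspace["obs"])
print(toy_workspace["action"])
```

Because the environment stand-in comes first in the list, an observation is always available at time t when the policy stand-in runs.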


Let us have a closer look at the content of the workspace.

In [16]:
# Creates a new workspace
workspace = Workspace() 
epoch_size = 15
t_agents(workspace, n_steps=epoch_size)

In [17]:
for key in workspace.variables.keys():
    print(key, workspace[key].shape, workspace[key])

env/env_obs torch.Size([15, 3, 4]) tensor([[[-2.4851e-03,  1.3973e-02, -1.4816e-02, -4.2812e-03],
         [-8.0932e-03,  1.2465e-02, -4.5459e-02,  1.7060e-03],
         [-2.9592e-02,  8.0873e-03,  4.2124e-02, -4.6108e-02]],

        [[-2.2056e-03, -1.8093e-01, -1.4901e-02,  2.8369e-01],
         [-7.8439e-03,  2.0821e-01, -4.5425e-02, -3.0497e-01],
         [-2.9430e-02,  2.0258e-01,  4.1202e-02, -3.2521e-01]],

        [[-5.8243e-03, -3.7584e-01, -9.2275e-03,  5.7164e-01],
         [-3.6798e-03,  1.3762e-02, -5.1524e-02, -2.6948e-02],
         [-2.5379e-02,  3.9709e-01,  3.4698e-02, -6.0462e-01]],

        [[-1.3341e-02, -1.8059e-01,  2.2052e-03,  2.7606e-01],
         [-3.4045e-03,  2.0958e-01, -5.2063e-02, -3.3543e-01],
         [-1.7437e-02,  5.9171e-01,  2.2605e-02, -8.8617e-01]],

        [[-1.6953e-02, -3.7574e-01,  7.7264e-03,  5.6944e-01],
         [ 7.8716e-04,  4.0541e-01, -5.8772e-02, -6.4407e-01],
         [-5.6027e-03,  7.8652e-01,  4.8820e-03, -1.1717e+00]],

        [[

In [19]:

# We get the transitions: each tensor is transformed so that:
# - we have the values at time steps t and t+1 (so the first dimension of every tensor has size 2)
# - there is no distinction between the different environments (here, there is just one environment to make it easy)
transitions = workspace.get_transitions()

display("Observations (first 4)", workspace["env/env_obs"][:4])

display("Transitions (first 3)")
for t in range(4):
    display(f'(s_{t}, s_{t+1})')
    # We ignore the first dimension as it corresponds to [t, t+1]
    display(transitions["env/env_obs"][:, t])

'Observations (first 4)'

tensor([[[-0.0025,  0.0140, -0.0148, -0.0043],
         [-0.0081,  0.0125, -0.0455,  0.0017],
         [-0.0296,  0.0081,  0.0421, -0.0461]],

        [[-0.0022, -0.1809, -0.0149,  0.2837],
         [-0.0078,  0.2082, -0.0454, -0.3050],
         [-0.0294,  0.2026,  0.0412, -0.3252]],

        [[-0.0058, -0.3758, -0.0092,  0.5716],
         [-0.0037,  0.0138, -0.0515, -0.0269],
         [-0.0254,  0.3971,  0.0347, -0.6046]],

        [[-0.0133, -0.1806,  0.0022,  0.2761],
         [-0.0034,  0.2096, -0.0521, -0.3354],
         [-0.0174,  0.5917,  0.0226, -0.8862]]])

'Transitions (first 3)'

'(s_0, s_1)'

tensor([[-0.0025,  0.0140, -0.0148, -0.0043],
        [-0.0022, -0.1809, -0.0149,  0.2837]])

'(s_1, s_2)'

tensor([[-0.0081,  0.0125, -0.0455,  0.0017],
        [-0.0078,  0.2082, -0.0454, -0.3050]])

'(s_2, s_3)'

tensor([[-0.0296,  0.0081,  0.0421, -0.0461],
        [-0.0294,  0.2026,  0.0412, -0.3252]])

'(s_3, s_4)'

tensor([[-0.0022, -0.1809, -0.0149,  0.2837],
        [-0.0058, -0.3758, -0.0092,  0.5716]])

You can see that each transition in the workspace corresponds to a pair of observations.

### Transitions as a workspace

A transition workspace is still a workspace. This is quite handy: since each transition can be seen as a mini-episode of two time steps, we can use our agents on it.

It is often the case in BBRL that we have to apply an agent to an already existing workspace
as shown below.

In [7]:
for key in transitions.variables.keys():
    print(key, transitions[key])

t_random_agent = TemporalAgent(RandomAgent(action_dim))
t_random_agent(transitions, t=0, n_steps=2)

# Here, the action tensor will have been overwritten by the new actions
print(f"new action, {transitions['action']}")

env/env_obs tensor([[[-0.0471,  0.0265,  0.0220, -0.0336],
         [-0.0466, -0.1689,  0.0214,  0.2660],
         [-0.0500, -0.3643,  0.0267,  0.5653],
         [-0.0572, -0.5598,  0.0380,  0.8663],
         [-0.0684, -0.7554,  0.0553,  1.1707]],

        [[-0.0466, -0.1689,  0.0214,  0.2660],
         [-0.0500, -0.3643,  0.0267,  0.5653],
         [-0.0572, -0.5598,  0.0380,  0.8663],
         [-0.0684, -0.7554,  0.0553,  1.1707],
         [-0.0836, -0.5611,  0.0787,  0.8959]]])
env/terminated tensor([[False, False, False, False, False],
        [False, False, False, False, False]])
env/truncated tensor([[False, False, False, False, False],
        [False, False, False, False, False]])
env/done tensor([[False, False, False, False, False],
        [False, False, False, False, False]])
env/reward tensor([[0., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])
env/cumulated_reward tensor([[0., 1., 2., 3., 4.],
        [1., 2., 3., 4., 5.]])
env/timestep tensor([[0, 1, 2, 3, 4],
        [1,

### Multiple environment case

Now we are using 3 environments.
Given how the transitions are organized, the transitions of one particular environment
are found every 3 entries of the transition workspace, since transitions are stored one environment after the other.

In [8]:
# We deal with 3 environments at a time (random seed 2139)

multienv_agent = ParallelGymAgent(partial(make_env, env_name='CartPole-v1', autoreset=True), num_envs=3).seed(2139)
obs_size, action_dim = multienv_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space R^{action_dim}")

agents = Agents(multienv_agent, RandomAgent(action_dim))
t_agents = TemporalAgent(agents)
workspace = Workspace() 
t_agents(workspace, n_steps=epoch_size)
transitions = workspace.get_transitions()

display("Observations (first 4)", workspace["env/env_obs"][:4])

display("Transitions (first 3)")
for t in range(3):
    display(f'(s_{t}, s_{t+1})')
    display(transitions["env/env_obs"][:, t])

Environment: observation space in R^4 and action space R^2


'Observations (first 4)'

tensor([[[-8.5048e-03, -4.2718e-02, -4.8940e-02,  2.1523e-02],
         [ 5.4922e-04,  2.4692e-03, -4.9253e-02, -4.0183e-02],
         [ 8.0318e-03,  2.0348e-02, -2.2937e-03, -8.5254e-03]],

        [[-9.3592e-03, -2.3711e-01, -4.8510e-02,  2.9837e-01],
         [ 5.9860e-04,  1.9826e-01, -5.0056e-02, -3.4799e-01],
         [ 8.4387e-03, -1.7474e-01, -2.4643e-03,  2.8343e-01]],

        [[-1.4101e-02, -4.1327e-02, -4.2542e-02, -9.2070e-03],
         [ 4.5638e-03,  3.9406e-01, -5.7016e-02, -6.5603e-01],
         [ 4.9439e-03, -3.6983e-01,  3.2044e-03,  5.7534e-01]],

        [[-1.4928e-02,  1.5438e-01, -4.2726e-02, -3.1500e-01],
         [ 1.2445e-02,  5.8993e-01, -7.0137e-02, -9.6611e-01],
         [-2.4526e-03, -5.6499e-01,  1.4711e-02,  8.6903e-01]]])

'Transitions (first 3)'

'(s_0, s_1)'

tensor([[-0.0085, -0.0427, -0.0489,  0.0215],
        [-0.0094, -0.2371, -0.0485,  0.2984]])

'(s_1, s_2)'

tensor([[ 0.0005,  0.0025, -0.0493, -0.0402],
        [ 0.0006,  0.1983, -0.0501, -0.3480]])

'(s_2, s_3)'

tensor([[ 0.0080,  0.0203, -0.0023, -0.0085],
        [ 0.0084, -0.1747, -0.0025,  0.2834]])

You can see how the transitions are organized in the workspace relative to the 3 environments.
You first get the first transition from the first environment.
Then the first transition from the second environment.
Then the first transition from the third environment.
Then the second transition from the first environment, etc.
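
The ordering above can be reproduced with a small plain-Python sketch. It is not BBRL's implementation: `toy_get_transitions` is a hypothetical helper working on nested lists indexed as `[t][env]`. It assumes that a pair starting at a step where `done` is True (the jump from the end of one episode to the start of the next) is dropped, which is how this notebook later describes `get_transitions()`.

```python
def toy_get_transitions(values, done):
    """Pairs each value at time t with the value at t+1 for the same
    environment, looping over time first and environments second.
    Pairs whose first element is flagged done are skipped (assumption)."""
    current, following = [], []
    for t in range(len(values) - 1):
        for env in range(len(values[t])):
            if not done[t][env]:
                current.append(values[t][env])
                following.append(values[t + 1][env])
    # Same layout as a transition workspace: index 0 is t, index 1 is t+1
    return [current, following]


# 4 time steps, 3 environments; labels make the ordering visible
toy_obs = [[f"env{env}_t{t}" for env in range(3)] for t in range(4)]
toy_done = [[False] * 3 for _ in range(4)]
toy_done[1][2] = True  # environment 2 ends an episode at t=1

toy_pairs = toy_get_transitions(toy_obs, toy_done)
for cur, nxt in zip(*toy_pairs):
    print(cur, "->", nxt)
```

With no episode end, the transitions of environment `k` sit at positions `k`, `k + 3`, `k + 6`, and so on. Every dropped pair shifts the positions that follow it.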

## The replay buffer

Unlike in the previous case, we use a replay buffer that stores
a set of transitions $(s_t, a_t, r_t, s_{t+1})$.
The replay buffer keeps slices [:, i, ...] of the transition
workspace (here at most 80 transitions).

In [9]:
rb = ReplayBuffer(max_size=80)

# We add the transitions to the buffer...
rb.put(transitions)

# ...and sample from them: here we get 3 tuples (s_t, s_{t+1})
rb.get_shuffled(3)["env/env_obs"]

tensor([[[ 0.0046,  0.3941, -0.0570, -0.6560],
         [-0.0025, -0.5650,  0.0147,  0.8690],
         [ 0.0049, -0.3698,  0.0032,  0.5753]],

        [[ 0.0124,  0.5899, -0.0701, -0.9661],
         [-0.0138, -0.3701,  0.0321,  0.5810],
         [-0.0025, -0.5650,  0.0147,  0.8690]]])
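
To make the role of `max_size` concrete, here is a toy bounded buffer in plain Python. It is only a sketch: `ToyReplayBuffer` is a hypothetical class, and it assumes that once the buffer is full the oldest entries are overwritten and that sampling is uniform with replacement. BBRL's `ReplayBuffer` stores tensors and may differ in these details.

```python
import random


class ToyReplayBuffer:
    """Hypothetical bounded buffer; NOT the BBRL ReplayBuffer."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.data = []
        self.position = 0  # index of the next slot to write

    def put(self, transitions):
        for transition in transitions:
            if len(self.data) < self.max_size:
                self.data.append(transition)
            else:
                # Buffer full: overwrite the oldest entry (assumption)
                self.data[self.position] = transition
            self.position = (self.position + 1) % self.max_size

    def size(self):
        return len(self.data)

    def get_shuffled(self, batch_size):
        # Uniform sampling with replacement (assumption)
        return random.choices(self.data, k=batch_size)


toy_rb = ToyReplayBuffer(max_size=80)
toy_sizes = []
for n_new in (42, 30, 28):
    toy_rb.put([("s_t", "a_t", "r_t", "s_t+1")] * n_new)
    toy_sizes.append(toy_rb.size())
print(toy_sizes)
toy_batch = toy_rb.get_shuffled(3)
```

The size grows with each `put` until it reaches `max_size`, then stays there.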

## Collecting several epochs into the same workspace

In the code below, the workspace only contains one epoch at a time.
The contents of these different epochs are concatenated into the replay buffer.

In [10]:
nb_steps = 0
max_steps = 100
epoch_size = 10

while nb_steps < max_steps:
    # Execute the agent in the workspace
    if nb_steps == 0:
        # In the first epoch, we start with t=0
        t_agents(workspace, t=0, n_steps=epoch_size)
    else:
        # Clear all gradient graphs from the workspace
        workspace.zero_grad()
        # Here we duplicate the last column of the previous epoch into the first column of the next epoch
        workspace.copy_n_last_steps(1)

        # In subsequent epochs, we start with t=1 so as to avoid overwriting the first column we just duplicated
        t_agents(workspace, t=1, n_steps=epoch_size)

    transition_workspace = workspace.get_transitions()

    # The part below counts the number of steps: it ignores the actions performed during the transition from one episode to the next,
    # as they have been discarded by the get_transitions() function

    action = transition_workspace["action"]
    nb_steps += action[0].shape[0]
    print(f"collecting new epoch, already performed {nb_steps} steps")

    if nb_steps > 0 or epoch_size > 1:
        rb.put(transition_workspace)
    print(f"replay buffer size: {rb.size()}")

collecting new epoch, already performed 27 steps
replay buffer size: 42
collecting new epoch, already performed 57 steps
replay buffer size: 72
collecting new epoch, already performed 85 steps
replay buffer size: 80
collecting new epoch, already performed 114 steps
replay buffer size: 80
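
The step counts printed above can be checked by hand. The sketch below is plain arithmetic, under the assumption that an epoch with `n_columns` time steps and `n_envs` environments yields `(n_columns - 1) * n_envs` transitions, minus the pairs discarded by `get_transitions()`. The first epoch has 10 columns. Later epochs have 11, namely the column copied by `copy_n_last_steps(1)` plus 10 new ones. The helper name `full_epoch_transitions` is hypothetical.

```python
def full_epoch_transitions(n_columns, n_envs):
    """Number of (t, t+1) pairs if no pair is discarded."""
    return (n_columns - 1) * n_envs


n_envs, epoch_size = 3, 10

# Cumulated step counts copied from the output above
cumulated = [27, 57, 85, 114]
increments = [cumulated[0]] + [b - a for a, b in zip(cumulated, cumulated[1:])]

# 10 columns in the first epoch, 11 afterwards (1 copied + 10 new)
columns = [epoch_size] + [epoch_size + 1] * (len(cumulated) - 1)

# Pairs missing with respect to a full epoch
dropped = [
    full_epoch_transitions(c, n_envs) - inc
    for c, inc in zip(columns, increments)
]
print(increments)  # [27, 30, 28, 29]
print(dropped)     # [0, 0, 2, 1]
```

If the assumption holds, the missing pairs in the last two epochs would correspond to episode ends that occurred during those epochs.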


## Exercise

Create a stupid agent that always outputs action 1, and run it for 10 epochs of 100 steps over 2 instances of the CartPole-v1 environment.
Put the data into a replay buffer of size 5000.

Then do the following:
- Count the number of episodes the agent performed in each environment by counting the number of "done=True" elements in the workspace before applying the `get_transitions()` function.
- Count the total number of episodes performed by the agent by measuring the difference between the size of the replay buffer and the number of steps performed by the agent.
- Make sure both counts are consistent.

Can we count the number of episodes performed in one environment using the second method? Why?