# Outlook

This notebook shows how to use a gymnasium environment as a BBRL agent in practice, in the autoreset=False mode.
It is part of the [BBRL documentation](https://github.com/osigaud/bbrl/docs/index.html).

If this is your first contact with BBRL, you may start by having a look at [this more basic notebook](01-basic_concepts.student.ipynb).

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

We use OmegaConf so that the code can be run with a given set of parameters
without calling an explicit main: we just define the `def run_dqn(cfg):`
function and then execute a long `params = {...}` variable at the bottom of
this colab.

More precisely, the code is run by calling

`config=OmegaConf.create(params)`

`run_dqn(config)`

at the very bottom of the colab, after starting tensorboard.

Below, we import standard python packages, pytorch packages and gymnasium
environments.

In [1]:
# Installs the necessary Python and system libraries
try:
    from easypip import easyimport, easyinstall, is_notebook
except ModuleNotFoundError as e:
    get_ipython().run_line_magic("pip", "install easypip")
    from easypip import easyimport, easyinstall, is_notebook

easyinstall("bbrl>=0.2.2")
easyinstall("swig")
easyinstall("bbrl_gymnasium>=0.2.0")
easyinstall("bbrl_gymnasium[box2d]")
easyinstall("bbrl_gymnasium[classic_control]")
easyinstall("tensorboard")
easyinstall("moviepy")
easyinstall("box2d-kengz")

[easypip] Installing bbrl_gymnasium>=0.2.0
[easypip] Installing bbrl_gymnasium[box2d]
[easypip] Installing bbrl_gymnasium[classic_control]


In [2]:
import os
import sys
from pathlib import Path
import math

from moviepy.editor import ipython_display as video_display
import time
from tqdm.auto import tqdm
from typing import Tuple, Optional
from functools import partial

from omegaconf import OmegaConf
import torch
import bbrl_gymnasium

import copy
from abc import abstractmethod, ABC
import torch.nn as nn
import torch.nn.functional as F
from time import strftime
OmegaConf.register_new_resolver(
    "current_time", lambda: strftime("%Y%m%d-%H%M%S"), replace=True
)


testing_mode = os.environ.get("TESTING_MODE", None) == "ON"

In [3]:
# Imports all the necessary classes and functions from BBRL
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class
# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1,agent2,agent3,...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace, 
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# ParallelGymAgent is an agent able to execute a batch of gymnasium environments,
# with or without auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/terminated',
# 'env/truncated', 'env/done', 'env/cumulated_reward', ...
#
# When called at timestep t=0, the environments are automatically reset. At
# timestep t>0, these agents will read the 'action' variable in the workspace at
# time t - 1
from bbrl.agents.gymnasium import GymAgent, ParallelGymAgent, make_env, record_video

# Replay buffers are useful to store past transitions when training
from bbrl.utils.replay_buffer import ReplayBuffer

In [4]:
# Utility function for launching tensorboard
# For Colab - otherwise, it is easier and better to launch tensorboard from
# the terminal
def setup_tensorboard(path):
    path = Path(path)
    answer = ""
    if is_notebook():
        if get_ipython().__class__.__module__ == "google.colab._shell":
            answer = "y"
        while answer not in ["y", "n"]:
            answer = input("Do you want to launch tensorboard in this notebook [y/n] ").lower()

    if answer == "y":
        get_ipython().run_line_magic("load_ext", "tensorboard")
        get_ipython().run_line_magic("tensorboard", f"--logdir {path.absolute()}")
    else:
        import sys
        import os
        import os.path as osp
        print(f"Launch tensorboard from the shell:\n{osp.dirname(sys.executable)}/tensorboard --logdir={path.absolute()}")

In [5]:
if not is_notebook():
    print("Not displaying video (hidden since not in a notebook)", file=sys.stderr)
    def video_display(*args, **kwargs):
        pass
    def display(*args, **kwargs):
        print(*args, **kwargs) 
    
testing_mode = os.environ.get("TESTING_MODE", None) == "ON"

## Definition of agents

We first create an Agent representing [the CartPole-v1 gym environment](https://gymnasium.farama.org/environments/classic_control/cart_pole/).
This is done using the [ParallelGymAgent](https://github.com/osigaud/bbrl/blob/40fe0468feb8998e62c3cd6bb3a575fef88e256f/src/bbrl/agents/gymnasium.py#L261) class.
We are working with batches (i.e. several episodes at the same time), so here our Agent uses n_envs = 3 environments.
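
Note that the environment agent receives a function that builds one environment, not an environment instance. The sketch below uses only the standard library to show how `functools.partial` turns a parameterized builder into such a zero-argument factory. `make_env_stub` is a hypothetical stand-in for `make_env`, and calling the factory once per environment is an assumption about how a batched agent might use it, not a description of BBRL's internals.

```python
from functools import partial


def make_env_stub(env_name, autoreset=False):
    # Hypothetical stand-in for make_env: returns a description, not a real environment
    return {"name": env_name, "autoreset": autoreset}


# Freeze the arguments now; the environments are built later, on demand
factory = partial(make_env_stub, "CartPole-v1", autoreset=False)

n_envs = 3
# One independent object per call, so each environment has its own state
envs = [factory() for _ in range(n_envs)]
print(len(envs), envs[0])
```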

In [6]:
# We run episodes over 3 environments at a time
n_envs = 3
env_agent = ParallelGymAgent(partial(make_env, 'CartPole-v1', autoreset=False), n_envs, reward_at_t=False)
# The random seed is set to 2139
env_agent.seed(2139)

obs_size, action_dim = env_agent.get_obs_and_actions_sizes()
print(f"Environment: observation space in R^{obs_size} and action space {{1, ..., {action_dim}}}")

Environment: observation space in R^4 and action space {1, ..., 2}


In [7]:
# Creates a new workspace
workspace = Workspace() 

# Execute the first step
env_agent(workspace, t=0)

# Our first set of observations. The size of the observation space is 4, and we have 3 environments.
obs = workspace.get("env/env_obs", 0)
print("Observation", obs)

Observation tensor([[-0.0085, -0.0427, -0.0489,  0.0215],
        [ 0.0005,  0.0025, -0.0493, -0.0402],
        [ 0.0080,  0.0203, -0.0023, -0.0085]])


### Random action without agent
We first set an action directly, without using an agent.

In [8]:
# Sets the action for t=0
action = torch.randint(0, action_dim, (n_envs, ))
workspace.set("action", 0, action)
print(action)

# Performs one step: the environments read the action set at t=0
env_agent(workspace, t=1)

# Reads the resulting observations
workspace.get("env/env_obs", 1)

tensor([1, 1, 0])


tensor([[-0.0094,  0.1531, -0.0485, -0.2862],
        [ 0.0006,  0.1983, -0.0501, -0.3480],
        [ 0.0084, -0.1747, -0.0025,  0.2834]])

Let us now look at the content of the workspace.

In [9]:
for key in workspace.variables.keys():
    print(key, workspace[key])

env/env_obs tensor([[[-0.0085, -0.0427, -0.0489,  0.0215],
         [ 0.0005,  0.0025, -0.0493, -0.0402],
         [ 0.0080,  0.0203, -0.0023, -0.0085]],

        [[-0.0094,  0.1531, -0.0485, -0.2862],
         [ 0.0006,  0.1983, -0.0501, -0.3480],
         [ 0.0084, -0.1747, -0.0025,  0.2834]]])
env/terminated tensor([[False, False, False],
        [False, False, False]])
env/truncated tensor([[False, False, False],
        [False, False, False]])
env/done tensor([[False, False, False],
        [False, False, False]])
env/reward tensor([[0., 0., 0.],
        [1., 1., 1.]])
env/cumulated_reward tensor([[0., 0., 0.],
        [1., 1., 1.]])
env/timestep tensor([[0, 0, 0],
        [1, 1, 1]])
action tensor([[1, 1, 0]])


You can observe that each environment variable holds two time steps, stored in
tensors whose first dimension is time. The `action` variable holds only one
time step, since we only set it at t=0.
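
To make this layout concrete, here is a small sketch with a toy tensor of the same shape as `env/env_obs` above (time, environment, observation). The values are made up and no BBRL call is involved.

```python
import torch

n_steps, n_envs, obs_size = 2, 3, 4
# Toy stand-in for workspace["env/env_obs"]: values 0..23 laid out time-first
toy_obs = torch.arange(n_steps * n_envs * obs_size, dtype=torch.float32).reshape(
    n_steps, n_envs, obs_size
)

obs_at_t1 = toy_obs[1]        # all environments at time step 1, shape (3, 4)
env0_history = toy_obs[:, 0]  # every time step of environment 0, shape (2, 4)

print(obs_at_t1.shape, env0_history.shape)
```

Indexing the first dimension corresponds to what `workspace.get("env/env_obs", 1)` returned above: one row per environment at a given time step.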

### Random agent

The process above can be automated with `Agents` and `TemporalAgent`, as shown
below. But first we have to create an agent that selects the actions (here, at
random).
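
Before using the real classes, the following pure-Python sketch illustrates the idea of this automation: at each time step every agent is called in order, and the loop stops once a stop variable is True for all environments. Everything here is hypothetical (the dictionary-based workspace, the toy agents, the fixed episode lengths); it is a conceptual illustration, not BBRL's implementation.

```python
import random

N_ENVS = 3
rng = random.Random(0)
# Hypothetical episode lengths: environment i is done from time step lengths[i] on
lengths = [2, 4, 3]


def toy_env_agent(workspace, t):
    # Writes the "done" flags for time t (a real environment would also read the action at t - 1)
    workspace.setdefault("env/done", []).append([t >= n for n in lengths])


def toy_random_agent(workspace, t):
    # Writes one random action in {0, 1} per environment at time t
    workspace.setdefault("action", []).append([rng.randint(0, 1) for _ in range(N_ENVS)])


def run_through_time(agents, workspace, stop_variable):
    # Calls every agent in order at each time step, until all environments are done
    t = 0
    while True:
        for agent in agents:
            agent(workspace, t)
        if all(workspace[stop_variable][t]):
            return t + 1  # number of time steps that were executed
        t += 1


workspace = {}
n_steps = run_through_time([toy_env_agent, toy_random_agent], workspace, "env/done")
print(n_steps, workspace["env/done"][-1])
```

In this sketch the environment agent runs first so that the action agent can rely on the data written at the same time step, which mirrors the ordering used in the cell below.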

In [10]:
class RandomAgent(Agent):
    def __init__(self, action_dim):
        super().__init__()
        self.action_dim = action_dim

    def forward(self, t: int, choose_action=True, **kwargs):
        """An Agent can use self.workspace"""
        obs = self.get(("env/env_obs", t))
        action = torch.randint(0, self.action_dim, (len(obs), ))
        self.set(("action", t), action)

# Each agent is run in the order given when constructing Agents
agents = Agents(env_agent, RandomAgent(action_dim))

# And the TemporalAgent allows us to run them through time
t_agents = TemporalAgent(agents)

In [11]:
# We can now run the agents through time with a simple call...

workspace = Workspace()
t_agents(workspace, t=0, stop_variable="env/done", stochastic=True)

### Termination

`env/done` tells us whether each episode is finished or not.
Here, in the autoreset=False mode, the run stops only once all episodes are "done",
and when an episode is finished its flag remains True.
Note that when an environment is done before the others, its content is copied until the termination of all environments.
This is convenient for collecting the final reward.
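
A minimal sketch of how these flags can be exploited, on a toy `done` tensor shaped like `env/done` (time, environment) with made-up values: because a flag stays True once set, the number of False entries in a column gives the time step at which that environment finished.

```python
import torch

# Toy stand-in for workspace["env/done"], shape (time, n_envs); values are made up
done = torch.tensor(
    [
        [False, False, False],
        [False, False, True],
        [True, False, True],
        [True, True, True],
    ]
)

# Flags are sticky, so counting the False entries gives the first "done" step
first_done_step = (~done).sum(dim=0)

# The run stops at the first time step where every environment is done
all_done_step = int(done.all(dim=1).float().argmax())

print(first_done_step.tolist(), all_done_step)
```

The same per-environment index can be used to ignore the copied entries that follow the end of an episode when computing statistics over a batch.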

In [12]:
workspace["env/done"].shape, workspace["env/done"][-10:]

(torch.Size([38, 3]),
 tensor([[False, False, False],
         [False, False, False],
         [False, False,  True],
         [ True, False,  True],
         [ True, False,  True],
         [ True, False,  True],
         [ True, False,  True],
         [ True, False,  True],
         [ True, False,  True],
         [ True,  True,  True]]))

The resulting tensor of observations, with the last two observations

In [13]:
workspace["env/env_obs"].shape, workspace["env/env_obs"][-2:]

(torch.Size([38, 3, 4]),
 tensor([[[ 0.1225, -0.1720, -0.2107, -0.4356],
          [-0.0660, -0.8039,  0.2033,  1.6900],
          [-0.0261, -0.7726,  0.2220,  1.4842]],
 
         [[ 0.1225, -0.1720, -0.2107, -0.4356],
          [-0.0821, -0.6117,  0.2371,  1.4669],
          [-0.0261, -0.7726,  0.2220,  1.4842]]]))

The resulting tensor of rewards, with the last 8 rewards

In [14]:
workspace["env/reward"].shape, workspace["env/reward"][-8:]

(torch.Size([38, 3]),
 tensor([[1., 1., 1.],
         [1., 1., 0.],
         [0., 1., 0.],
         [0., 1., 0.],
         [0., 1., 0.],
         [0., 1., 0.],
         [0., 1., 0.],
         [0., 1., 0.]]))

The resulting tensor of actions, with the last two actions

In [15]:
workspace["action"].shape, workspace["action"][-2:]

(torch.Size([38, 3]),
 tensor([[1, 1, 0],
         [1, 1, 1]]))