# BBRL in practice: the interaction loop

## Outlook

In this notebook, we start practicing with the BBRL model, which is explained in [this notebook](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing). Here we only implement a simple interaction loop.


What you will see here is very close to what Ludovic Denoyer shows in [this video](https://www.youtube.com/watch?v=CSkkoq_k5zU).

## Installation

Just run the following cell.

Note the trick: we first try to import `bbrl`; if that fails, we install it from the GitHub repository and import it again.

In [1]:
try:
  import bbrl
except ImportError:
  !pip install git+https://github.com/osigaud/bbrl.git
  import bbrl

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting git+https://github.com/osigaud/bbrl.git
  Cloning https://github.com/osigaud/bbrl.git to /tmp/pip-req-build-mbzhenyd
  Running command git clone --filter=blob:none --quiet https://github.com/osigaud/bbrl.git /tmp/pip-req-build-mbzhenyd
  Resolved https://github.com/osigaud/bbrl.git to commit 4d19640b3c9fc794ff5f65b55675f1001d6a1742
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Installing backend dependencies ... done
  Preparing metadata (pyproject.toml) ... done
Building wheels for collected packages: bbrl
  Building wheel for bbrl (pyproject.toml) ... done
  Created wheel for bbrl: filename=bbrl-0.1.11-py3-none-any.whl size=57146 sha256=4e83ac606b32cdddd835e2f4aaf7b48ae3589a284c503a0cb0c292245129595e
  Stored in directory: /tmp/pip-ephem-wheel-cache-wo83a10d/wheels/67/8e/1c
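
The `!pip` line in the cell above only works inside a notebook. Outside one, the same try-then-install pattern could be written roughly as follows. This is a minimal sketch using only the standard library; `import_or_install` is a hypothetical helper name, not something provided by BBRL.

```python
import importlib
import subprocess
import sys


def import_or_install(module_name, pip_spec):
    """Import module_name; if that fails, pip-install pip_spec and retry.

    Hypothetical helper for illustration, not part of BBRL.
    """
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Call pip through the current interpreter so the package is
        # installed into the environment that is actually running.
        subprocess.check_call([sys.executable, "-m", "pip", "install", pip_spec])
        # Make sure the import system notices the newly installed package.
        importlib.invalidate_caches()
        return importlib.import_module(module_name)


# Assumed usage, mirroring the notebook cell:
# bbrl = import_or_install("bbrl", "git+https://github.com/osigaud/bbrl.git")
```

Going through `sys.executable -m pip` rather than a bare `pip` command avoids installing into a different Python environment than the one running the script.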

In [2]:
import torch # just used to get a random Tensor


## BBRL imports

As explained in [the white paper](https://arxiv.org/pdf/2110.07910.pdf), everything in SaLinA (and also in BBRL) is an Agent.

This construct is defined as the `Agent` class in the [bbrl/agents/agent.py](https://github.com/osigaud/bbrl/blob/master/bbrl/agents/agent.py) file.

Any Agent class should come with a `forward(self, t, **kwargs)` method, where `t` represents a time step.

Some of the comments below are just copy-pasted from the paper or from the code.

In [3]:
from bbrl.workspace import Workspace

from bbrl.agents.agent import Agent

# Agents(agent1, agent2, agent3, ...) executes the given agents one after the other
# TemporalAgent(agent) executes an agent (e.g. an Agent) over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# NoAutoResetGymAgent (resp. AutoResetGymAgent) are agents able to execute a batch of gym environments
# without (resp. with) auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/done', 'env/initial_state', 'env/cumulated_reward',
# ... When called at timestep t=0, the environments are automatically reset.
# At timestep t>0, these agents will read the 'action' variable in the workspace at time t - 1
from bbrl.agents.gyma import AutoResetGymAgent, NoAutoResetGymAgent

Remember that a workspace contains tensors, so everything written into a workspace should be a tensor. In the examples below the agents will first write random tensors.

## Creating and running agents

To play with the BBRL model, we first create a simple `ActionAgent`.

In [4]:
class ActionAgent(Agent):
    # Create the action agent
    # This is a fake agent for illustration purposes
    # In a standard ActionAgent, there should be an architecture
    # to compute the action given the observation
    def __init__(self):
        super().__init__()

    def forward(self, t, **kwargs):
        obs = self.get(("obs", t))  # read the current observation (unused by this fake agent)
        action = torch.rand(1)  # a real agent would compute the action from obs

        self.set(("action", t), action)

Then we create an `EnvAgent`.

In [5]:
class EnvAgent(Agent):
    # Create the environment agent
    # This is a fake agent for illustration purposes
    # A standard EnvAgent would inherit from a GymAgent
    def __init__(self):
        super().__init__()

    def forward(self, t, **kwargs):
        if t == 0:
            # At the first step, the agent has not acted yet
            # A real GymAgent would call obs = reset()
            obs = torch.rand(2)
            reward = torch.randint(low=0, high=5, size=[1])
            done = torch.zeros(1, dtype=torch.bool)
        else:
            # Here, a real GymAgent would call obs, reward, done, info = step(action)
            action = self.get(("action", t - 1))  # beware, we take the previous action
            obs = torch.rand(2)
            reward = torch.randint(low=0, high=5, size=[1])
            done = torch.zeros(1, dtype=torch.bool)
        self.set(("obs", t), obs)
        self.set(("reward", t), reward)
        self.set(("done", t), done)
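
The comments in the cell above mention `reset()` and `step(action)`. As a rough illustration of that protocol, here is a plain-Python sketch of a toy environment and of the `t == 0` versus `t > 0` branching. `CountdownEnv` and `env_forward` are hypothetical names, the four-value return of `step` follows the convention quoted in the comments, and returning a reward of `0.0` at reset is an assumption made for this sketch.

```python
class CountdownEnv:
    """Hypothetical toy environment with a gym-like reset/step interface."""

    def __init__(self, horizon=3):
        self.horizon = horizon
        self.remaining = horizon

    def reset(self):
        # Start a new episode and return the initial observation.
        self.remaining = self.horizon
        return [float(self.remaining)]

    def step(self, action):
        # Advance one step: the episode ends when the countdown reaches zero.
        self.remaining -= 1
        obs = [float(self.remaining)]
        reward = 1.0 if action > 0.5 else 0.0
        done = self.remaining == 0
        return obs, reward, done, {}


def env_forward(env, t, previous_action=None):
    """Mirror the branching of EnvAgent.forward: reset at t == 0, step otherwise."""
    if t == 0:
        return env.reset(), 0.0, False
    obs, reward, done, _info = env.step(previous_action)
    return obs, reward, done


env = CountdownEnv(horizon=3)
print(env_forward(env, 0))                       # ([3.0], 0.0, False)
print(env_forward(env, 1, previous_action=0.9))  # ([2.0], 1.0, False)
```

In a BBRL agent, the three returned values would be converted to tensors and written with `self.set` at time `t`, as the fake `EnvAgent` does.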


We compose them with `Agents`, then wrap the result in a `TemporalAgent`.

In [6]:
action_agent = ActionAgent()
env_agent = EnvAgent()

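# Compose the two agents. The environment agent comes first so that the
# observation at time t is written before the action agent reads it
composed_agent = Agents(env_agent, action_agent)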
  
# Get a temporal agent that can be executed in a workspace
t_agent = TemporalAgent(composed_agent)

Finally, we execute it in a workspace.

In [7]:
# We create a workspace
workspace = Workspace()

# The temporal agent will be run for 10 steps on this workspace
t_agent(workspace, t=0, n_steps=10)

# We retrieve the variables as they are stored in the workspace
obs, action, reward, done = workspace["obs", "action", "reward", "done"]

# And we print them
print("obs:", obs)
print("action:", action)
print("reward:", reward)
print("done:", done)
# You should see that each variable has been recorded for the specified
# number of time steps (one row per step)

obs: tensor([[0.5653, 0.9000],
        [0.1912, 0.8366],
        [0.5505, 0.3652],
        [0.1115, 0.5979],
        [0.7808, 0.6478],
        [0.7699, 0.6776],
        [0.0700, 0.6624],
        [0.0648, 0.5195],
        [0.8674, 0.6475],
        [0.0628, 0.4697]])
action: tensor([[0.2723],
        [0.5645],
        [0.0146],
        [0.3624],
        [0.7831],
        [0.0625],
        [0.9255],
        [0.3427],
        [0.1136],
        [0.0562]])
reward: tensor([[0],
        [4],
        [2],
        [4],
        [4],
        [1],
        [2],
        [4],
        [3],
        [1]])
done: tensor([[False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False],
        [False]])
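
To make the execution order concrete, here is a plain-Python sketch of what the loop above does conceptually: at each time step, every agent is called in turn and writes its variables at index `t`. `ToyWorkspace`, `toy_env`, `toy_policy` and `run_temporal` are hypothetical stand-ins written for this illustration; BBRL's real `Workspace` and `TemporalAgent` store tensors and offer more features.

```python
import random


class ToyWorkspace:
    """Stores one value per (variable name, time step). Not BBRL's Workspace."""

    def __init__(self):
        self.data = {}

    def set(self, name, t, value):
        self.data.setdefault(name, {})[t] = value

    def get(self, name, t):
        return self.data[name][t]

    def __getitem__(self, name):
        # Return the values of one variable ordered by time step.
        steps = self.data[name]
        return [steps[t] for t in sorted(steps)]


def toy_env(ws, t):
    if t > 0:
        _ = ws.get("action", t - 1)  # the environment reads the previous action
    ws.set("obs", t, [random.random(), random.random()])
    ws.set("reward", t, random.randrange(5))
    ws.set("done", t, False)


def toy_policy(ws, t):
    _ = ws.get("obs", t)  # the policy reads the current observation
    ws.set("action", t, random.random())


def run_temporal(agents, ws, t=0, n_steps=10):
    # Outer loop over time, inner loop over the composed agents.
    for step in range(t, t + n_steps):
        for agent in agents:
            agent(ws, step)


ws = ToyWorkspace()
run_temporal([toy_env, toy_policy], ws, t=0, n_steps=10)
print(len(ws["obs"]), len(ws["action"]))  # 10 10
```

Swapping the order to `[toy_policy, toy_env]` raises a `KeyError` at `t=0` in this sketch, because the policy would try to read an observation that has not been written yet. This is why the environment agent comes first in the composition.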


## What's next?

In [the next notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing), we will replace these simple random agents with real ones: we will use a neural network ActionAgent and an RL environment from gym to write an elementary RL loop.