# Outlook

In this notebook, we will implement a simple version of the A2C algorithm using BBRL. To understand this code, you need [to know more about BBRL](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing). You should first have a look at [the BBRL interaction model](https://colab.research.google.com/drive/1gSdkOBPkIQi_my9TtwJ-qWZQS0b2X7jt?usp=sharing), then [a first example](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing) and, most importantly, details about the [NoAutoResetGymAgent](https://colab.research.google.com/drive/1EX5O03mmWFp9wCL_Gb_-p08JktfiL2l5?usp=sharing).

The A2C algorithm is explained in [this video](https://www.youtube.com/watch?v=BUmsTlIgrBI) and you can also read [the corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/a2c.pdf).

## Installation and Imports

### Installation

In [None]:
!pip install importlib-metadata==4.13.0



The BBRL library is [here](https://github.com/osigaud/bbrl).

OmegaConf lets us define the `def run_a2c(cfg):` function and a long `params = {...}` dictionary at the bottom of this colab, and then run the code with these parameters without writing an explicit main.

More precisely, the code is run by calling

`config=OmegaConf.create(params)`

`run_a2c(config)`

at the very bottom of the colab, after starting tensorboard.

In [None]:
import functools
import time
!pip install omegaconf
from omegaconf import OmegaConf

import gym
!pip install git+https://github.com/osigaud/my_gym.git
!pip install git+https://github.com/osigaud/bbrl.git

import bbrl



### Imports

Below, we import standard python packages, pytorch packages and gym environments.

[OpenAI gym](https://gym.openai.com/) is a collection of benchmark environments to evaluate RL algorithms.

In [None]:
import copy
import time

import torch
import torch.nn as nn
import torch.nn.functional as F

import gym

### BBRL imports

In [None]:
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1, agent2, agent3, ...) executes the given agents one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, RemoteAgent, TemporalAgent

# AutoResetGymAgent is an agent able to execute a batch of gym environments
# with auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/done', 'env/initial_state', 'env/cumulated_reward',
# ... When called at timestep t=0, then the environments are automatically reset.
# At timestep t>0, these agents will read the 'action' variable in the workspace at time t - 1
from bbrl.agents.gymb import AutoResetGymAgent

## Definition of agents

The [A2C](http://proceedings.mlr.press/v48/mniha16.pdf) algorithm is an actor-critic algorithm. Thus we need an Actor agent, a Critic agent and an Environment agent. 
The actor agent is built on an intermediate ProbAgent, see [this notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing) for explanations about the ProbAgent, the ActorAgent and the environment agent.

In [None]:
class ProbAgent(Agent):
    def __init__(self, observation_size, hidden_size, n_actions):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        scores = self.model(observation)
        probs = torch.softmax(scores, dim=-1)
        self.set(("action_probs", t), probs)

In [None]:
class ActorAgent(Agent):
    def __init__(self):
        super().__init__()

    def forward(self, t, stochastic, **kwargs):
        probs = self.get(("action_probs", t))
        if stochastic:
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = probs.argmax(1)

        self.set(("action", t), action)
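
The two branches of `ActorAgent.forward` can be tried in isolation on a hand-written probability tensor. The sketch below is only an illustration: the probabilities and the seed are made-up values, not outputs of the ProbAgent.

```python
import torch

# Illustrative action probabilities: one row per environment, one column per action
probs = torch.tensor([[0.1, 0.9], [0.7, 0.3]])

torch.manual_seed(0)  # arbitrary seed, only to make the sampling repeatable
# stochastic=True branch: draw one action per row according to the probabilities
sampled = torch.distributions.Categorical(probs).sample()
# stochastic=False branch: pick the most probable action in each row
greedy = probs.argmax(1)

print(sampled.shape)  # one action index per environment
print(greedy)         # most probable actions: index 1 for row 0, index 0 for row 1
```

Sampling is what provides exploration during training, whereas the greedy branch is the natural choice when evaluating a trained policy.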

In [None]:
def make_env(env_name):
    return gym.make(env_name)

### CriticAgent

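A CriticAgent is a small feed-forward neural network which takes an observation as input and outputs the value of this observation. It thus implements a $V(s)$ function. It would be straightforward to define another CriticAgent (call it a CriticQAgent, as opposed to a CriticVAgent) that would take an observation and an action as input.

The `squeeze(-1)` removes the last dimension of the tensor. The output layer produces one value per environment, with shape `(n_envs, 1)`; squeezing turns it into shape `(n_envs,)`, the same shape as the rewards at a given time step, so that the temporal difference computed later does not broadcast to an unintended shape.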
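
The effect of `squeeze(-1)` on shapes can be checked on a toy example. This is a minimal sketch with illustrative sizes and an untrained linear layer standing in for the critic network; it assumes, as in the training loop below, that rewards have one entry per environment.

```python
import torch
import torch.nn as nn

n_envs, observation_size = 2, 4              # illustrative sizes
critic_model = nn.Linear(observation_size, 1)  # stand-in for the critic network
observation = torch.zeros(n_envs, observation_size)
reward = torch.zeros(n_envs)                 # one reward per environment

raw = critic_model(observation)              # shape (n_envs, 1)
value = raw.squeeze(-1)                      # shape (n_envs,)
print(raw.shape, value.shape)

# Without the squeeze, broadcasting silently produces an (n_envs, n_envs) tensor
print((reward - raw).shape, (reward - value).shape)
```

No error is raised in the unsqueezed case, which is why this kind of shape mismatch is easy to miss.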

In [None]:
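class CriticAgent(Agent):
    def __init__(self, observation_size, hidden_size):
        super().__init__()
        # hidden_size may be a single int or a list of sizes such as [24, 36]
        hidden_sizes = [hidden_size] if isinstance(hidden_size, int) else list(hidden_size)
        layers_size = [observation_size] + hidden_sizes + [1]
        # One Linear + ReLU pair per hidden layer, then a linear output layer
        layers = []
        for i in range(len(layers_size) - 2):
            layers += [nn.Linear(layers_size[i], layers_size[i + 1]), nn.ReLU()]
        layers.append(nn.Linear(layers_size[-2], layers_size[-1]))
        self.critic_model = nn.Sequential(*layers)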

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        critic = self.critic_model(observation).squeeze(-1)
        self.set(("critic", t), critic)

### Create the A2C agent

The code below is rather straightforward. Note that so far we have not defined anything about data collection, such as a RolloutBuffer or another structure to store the n-step return. This comes inside the training loop below.

Interestingly, the loop between the policy and the environment is first defined as a collection of agents, and then embedded into a single TemporalAgent.

We delete the environment (not the environment agent) with `del env_agent.env` once we do not need it anymore just to avoid mistakes afterwards.

In [None]:
# Create the A2C Agent
def create_a2c_agent(cfg, env_agent):
  observation_size,  n_actions = env_agent.get_obs_and_actions_sizes()
  prob_agent = ProbAgent(
      observation_size, cfg.algorithm.architecture.actor_hidden_size, n_actions
  )
  action_agent = ActorAgent()
  critic_agent = CriticAgent(
    observation_size, cfg.algorithm.architecture.critic_hidden_size)

  # Combine env and policy agents
  agent = Agents(env_agent, prob_agent, action_agent)
  # Get an agent that is executed on a complete workspace
  agent = TemporalAgent(agent)
  agent.seed(cfg.algorithm.seed)
  return agent, prob_agent, critic_agent

### The Logger class

The logger class below is not generic, it is specifically designed in the context of this A2C colab.

The logger parameters are defined below in `params = { "logger":{ ...`

In this colab, the logger is defined as `bbrl.utils.logger.TFLogger` so as to use a tensorboard visualisation (see the parameters part below).
Note that the BBRL logger (inherited from SaLinA) also saves the log in a readable format, so that you can use `Logger.read_directories(...)` to read multiple logs, create a dataframe, and analyze many experiments afterwards, for instance in a notebook.

The code for the different kinds of loggers is available in the [bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/bbrl/utils/logger.py) file.

Having logging provided under the hood is one of the features where using RL libraries like BBRL will allow you to save time.

`instantiate_class` is an inner BBRL mechanism. The `instantiate_class` function is available in the [`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/bbrl/__init__.py) file.
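
To give an idea of what such a mechanism does, here is a minimal sketch of instantiating a class from a `classname` entry plus keyword arguments. It is not the BBRL code: the `*_sketch` function names are hypothetical, and the spec uses a standard-library class so that the example runs without BBRL.

```python
import importlib

def get_class_sketch(spec):
    """Resolve the dotted path stored under 'classname' to a Python object."""
    module_path, _, name = spec["classname"].rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)

def get_arguments_sketch(spec):
    """Return every entry except 'classname', to be used as keyword arguments."""
    return {k: v for k, v in spec.items() if k != "classname"}

def instantiate_class_sketch(spec):
    """Build the object described by the spec."""
    return get_class_sketch(spec)(**get_arguments_sketch(spec))

# Hypothetical spec, shaped like the "logger" or "optimizer" entries of params
spec = {"classname": "fractions.Fraction", "numerator": 3, "denominator": 4}
obj = instantiate_class_sketch(spec)
print(obj)  # 3/4
```

With this pattern, switching to another logger or optimizer only requires editing the parameter dictionary, not the code.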

In [None]:
class Logger():

  def __init__(self, cfg):
    self.logger = instantiate_class(cfg.logger)

  def add_log(self, log_string, loss, epoch):
    self.logger.add_scalar(log_string, loss.item(), epoch)

  # Log losses
  def log_losses(self, cfg, epoch, critic_loss, entropy_loss, a2c_loss):
    self.add_log("critic_loss", critic_loss, epoch)
    self.add_log("entropy_loss", entropy_loss, epoch)
    self.add_log("a2c_loss", a2c_loss, epoch)


### Setup the optimizer

We use a single optimizer to tune the parameters of the actor (in the prob_agent part) and the critic (in the critic_agent part). It would be possible to have two optimizers working separately on the parameters of each component agent, but it would be more complicated because the actor update depends on the advantage computed by the critic.

In [None]:
# Configure the optimizer over the a2c agent
def setup_optimizer(cfg, prob_agent, critic_agent):
  optimizer_args = get_arguments(cfg.optimizer)
  parameters = nn.Sequential(prob_agent, critic_agent).parameters()
  optimizer = get_class(cfg.optimizer)(parameters, **optimizer_args)
  return optimizer

### Execute agent

This is the tricky part with BBRL, the one we need to understand in detail. The difficulty lies in the copy of the last step and the way to deal with the n_steps return.

The call to `agent(workspace, t=1, n_steps=cfg.algorithm.n_timesteps - 1, stochastic=True)` makes the agent run a number of steps in the workspace. In practice, it calls the [`__call__(...)`](https://github.com/osigaud/bbrl/blob/master/bbrl/agents/agent.py#L54) function which makes a forward pass of the agent network using the workspace data and updates the workspace accordingly.

Now, if we start at the first epoch (`epoch=0`), we start from the first step (`t=0`). But when we subsequently perform the next epochs (`epoch>0`), we must not forget to cover the transition at the border between the previous epoch and the current epoch. To avoid this risk, we copy the information from the last time step of the previous epoch into the first time step of the next epoch. This is explained in more detail in [a previous notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5).
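
The copy at the epoch border can be illustrated without BBRL. In the sketch below, a plain Python list stands in for the workspace and an integer counter stands in for the environment state; these names are illustrative and not part of the library.

```python
n_timesteps = 4                    # plays the role of cfg.algorithm.n_timesteps
buffer = [None] * n_timesteps      # stand-in for the workspace time dimension
env_step = 0                       # stand-in for the state of the environment
seen_transitions = []

for epoch in range(3):
    if epoch > 0:
        # Analogue of workspace.copy_n_last_steps(1): the last step becomes step 0
        buffer[0] = buffer[-1]
        start = 1
    else:
        start = 0
    # Analogue of agent(workspace, t=start, n_steps=n_timesteps - start)
    for t in range(start, n_timesteps):
        buffer[t] = env_step
        env_step += 1
    # Pairs of consecutive steps available for learning in this epoch
    seen_transitions += list(zip(buffer[:-1], buffer[1:]))

print(seen_transitions)
```

The printed pairs form an unbroken chain from step 0 to step 9: the transition from step 3 to step 4, which straddles the first two epochs, is present only because of the copy.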

In [None]:
def execute_agent(cfg, epoch, workspace, agent):
  if epoch > 0:
    workspace.zero_grad()
    workspace.copy_n_last_steps(1)
    agent(
      workspace, t=1, n_steps=cfg.algorithm.n_timesteps - 1, stochastic=True
    )
  else:
    agent(workspace, t=0, n_steps=cfg.algorithm.n_timesteps, stochastic=True)

### Compute critic loss

Note the `critic[1:].detach()` in the computation of the temporal difference target. The idea is that we compute this target as a function of $V(s_{t+1})$, but we do not want to apply gradient descent on this $V(s_{t+1})$, we will only apply gradient descent to the $V(s_t)$ according to this target value.

In practice, `x.detach()` returns a tensor that holds the same values as `x` but is cut off from the computation graph, so no gradient is back-propagated through it.

Note also the trick to deal with terminal states. If the state is terminal, $V(s_{t+1})$ does not make sense. Thus we need to ignore this term. So we multiply the term by `must_bootstrap`: if `must_bootstrap` is True (converted into an int, it becomes a 1), we get the term. If `must_bootstrap` is False (=0), we are at a terminal state, so we ignore the term. This trick is used in many RL libraries, e.g. SB3.
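
The combined effect of `detach()` and `must_bootstrap` can be checked on hand-written numbers. This is a small sketch with made-up values: the tensors have two time steps and three environments, and the critic values are given directly as a leaf tensor instead of being produced by a network.

```python
import torch

discount_factor = 0.5  # illustrative value
# Shape (2 time steps, 3 environments); a leaf tensor so we can inspect its gradient
critic = torch.tensor([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]], requires_grad=True)
reward = torch.tensor([[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]])
# The second environment reached a terminal state: no bootstrapping there
must_bootstrap = torch.tensor([True, False, True])

target = reward[:-1] + discount_factor * critic[1:].detach() * must_bootstrap.int()
advantage = target - critic[:-1]
critic_loss = (advantage ** 2).mean()
critic_loss.backward()

print(target)       # the middle entry is just the reward, V(s_{t+1}) is masked out
print(critic.grad)  # the row for V(s_{t+1}) is all zeros thanks to detach()
```

Only the first row of `critic.grad`, which corresponds to $V(s_t)$, is non-zero: the target acts as a fixed regression label.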

In [None]:
def compute_advantage_loss(cfg, reward, must_bootstrap, critic):
  # Compute temporal difference
  target = reward[:-1] + cfg.algorithm.discount_factor * critic[1:].detach() * must_bootstrap.int()
  advantage = target - critic[:-1]

  # Compute critic loss
  td_error = advantage ** 2
  critic_loss = td_error.mean()
  return critic_loss, advantage

## Main training loop

Note that everything about the shared workspace between all the agents is completely hidden under the hood. This results in a gain of productivity, at the expense of having to dig into the BBRL code if you want to understand the details, change the multiprocessing model, etc.

This version uses an AutoResetGymAgent. If you haven't done so yet, read [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5?usp=sharing) which explains a lot of details. In particular, read it to understand the `execute_agent(...)` function and the `transition_workspace = train_workspace.get_transitions()` line. Read also [the notebook about TimeLimits](https://colab.research.google.com/drive/1erLbRKvdkdDy0Zn1X_JhC01s1QAt4BBj?usp=sharing) to know more about the computation of `must_bootstrap`.

Note the `optimizer.zero_grad()`, `loss.backward()` and `optimizer.step()` lines. Several things need to be explained here.
- `optimizer.zero_grad()` is necessary to reset the gradients computed at the previous iteration, since pytorch accumulates gradients by default.
- We sum all the losses, both for the critic and the actor, before applying back-propagation with `loss.backward()`. At first glance, summing these losses may look weird, as the actor and the critic receive different updates from different parts of the loss. This mechanism relies on a central property of tensor manipulation libraries like TensorFlow and pytorch. In pytorch, each loss tensor comes with its own computation graph for back-propagating the gradient, so that when you back-propagate the summed loss, each term only contributes gradients to the parameters it depends on. These mechanisms are partly explained [here](https://pytorch.org/tutorials/beginner/blitz/neural_networks_tutorial.html).
- Since the optimizer has been set to work with both the actor and critic parameters, `optimizer.step()` will optimize both agents and pytorch ensures that each receives its own part of the gradient.
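
The way a summed loss routes gradients to the right parameters can be verified on a toy problem. In this sketch, two scalar parameters stand in for the actor and critic networks, and the two quadratic losses and the SGD optimizer are arbitrary choices made for easy mental arithmetic.

```python
import torch

# Two scalar parameters standing in for the actor and critic networks
actor_param = torch.tensor(3.0, requires_grad=True)
critic_param = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([actor_param, critic_param], lr=0.1)

actor_loss = (actor_param - 1.0) ** 2           # depends only on actor_param
critic_loss = 3.0 * (critic_param + 2.0) ** 2   # depends only on critic_param
loss = actor_loss + critic_loss

optimizer.zero_grad()
loss.backward()
# Each parameter only receives the gradient of its own term: 4.0 and 18.0
print(actor_param.grad, critic_param.grad)

optimizer.step()
# A single step updates both parameters: 3.0 - 0.1 * 4.0 and 1.0 - 0.1 * 18.0
print(actor_param.item(), critic_param.item())
```

The gradient of `actor_param` is exactly what back-propagating `actor_loss` alone would give, which is why summing the losses does not mix the two updates.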

In [None]:
def run_a2c(cfg):
  # 1)  Build the  logger
  logger = Logger(cfg)
  
  # 2) Create the environment agent
  env_agent = AutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        cfg.algorithm.n_envs,
        cfg.algorithm.seed,
    )

  # 3) Create the A2C Agent
  a2c_agent, prob_agent, critic_agent = create_a2c_agent(cfg, env_agent)

  # 4) Create the temporal critic agent to compute critic values over the workspace
  tcritic_agent = TemporalAgent(critic_agent)

  # 5) Configure the workspace to the right dimension
  # Note that no parameter is needed to create the workspace. 
  # In the training loop, calling the agent() and critic_agent() 
  # will take the workspace as parameter
  train_workspace = Workspace()

  # 6) Configure the optimizer over the a2c agent
  optimizer = setup_optimizer(cfg, prob_agent, critic_agent)
  
  # 7) Training loop
  for epoch in range(cfg.algorithm.max_epochs):
    # Execute the agent in the workspace
    execute_agent(cfg, epoch, train_workspace, a2c_agent)

    # Compute the critic value over the whole workspace
    tcritic_agent(train_workspace, n_steps=cfg.algorithm.n_timesteps)

    transition_workspace = train_workspace.get_transitions()

    # Get relevant tensors (sizes are timestep x n_envs x ...)

    critic, done, reward, action, action_probs, truncated = transition_workspace[
                "critic", "env/done", "env/reward", "action", "action_probs", "env/truncated"]

    # Determines whether values of the critic should be propagated
    # True if the episode reached a time limit or if the task was not done
    # See https://colab.research.google.com/drive/1erLbRKvdkdDy0Zn1X_JhC01s1QAt4BBj
    must_bootstrap = torch.logical_or(~done[1], truncated[1])

    # Compute critic loss (see function above)
    critic_loss, advantage = compute_advantage_loss(cfg, reward, must_bootstrap, critic)

    # Take the log probability of the actions performed, after some reorganization
    action_logp = action_probs[0].gather(1, action[0].view(-1, 1)).squeeze().log()

    # Compute the policy gradient loss based on the log probability of the actions performed
    a2c_loss = action_logp * advantage.detach()
    a2c_loss = a2c_loss.mean()

    # Compute entropy loss
    entropy_loss = torch.distributions.Categorical(action_probs).entropy().mean()

    # Store the losses for tensorboard display
    logger.log_losses(cfg, epoch, critic_loss, entropy_loss, a2c_loss)

    # Compute the total loss
    loss = (
      -cfg.algorithm.entropy_coef * entropy_loss
      + cfg.algorithm.critic_coef * critic_loss
      - cfg.algorithm.a2c_coef * a2c_loss
    )

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()


    # Compute the cumulated reward on the final states
    creward = train_workspace["env/cumulated_reward"]
    done = train_workspace["env/done"]
    creward = creward[done]
    if creward.size()[0] > 0:
      # print(creward)
      logger.add_log("reward", creward.mean(), epoch)
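
The `gather` step of the training loop, which extracts the probability of the action actually taken in each environment, may be easier to follow on a small example. The sketch below uses made-up probabilities, actions and advantages for two environments, and uses `squeeze(-1)` where the loop uses `squeeze()`.

```python
import math
import torch

# Illustrative values: 2 environments, 2 possible actions
action_probs = torch.tensor([[0.2, 0.8], [0.6, 0.4]])
action = torch.tensor([1, 0])            # action taken in each environment
advantage = torch.tensor([2.0, -1.0])    # would be detached in the real loop

# Pick, in each row, the probability of the action that was taken
chosen = action_probs.gather(1, action.view(-1, 1)).squeeze(-1)
action_logp = chosen.log()

# Policy gradient term and entropy term, as in the training loop
a2c_loss = (action_logp * advantage).mean()
entropy_loss = torch.distributions.Categorical(action_probs).entropy().mean()

print(chosen)  # probabilities 0.8 and 0.6
print(a2c_loss.item(), entropy_loss.item())
```

Since the total loss subtracts `a2c_loss`, minimizing it increases the log probability of actions with a positive advantage and decreases it for actions with a negative advantage.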

## Definition of the parameters

The logger is defined as `bbrl.utils.logger.TFLogger` so as to use a tensorboard visualisation.

In [None]:
params={
  "logger":{
    "classname": "bbrl.utils.logger.TFLogger",
    "log_dir": "./tmp/" + str(time.time()),
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },

  "algorithm":{
    "seed": 432,
    "n_envs": 2,
    "n_timesteps": 16,
    "max_epochs": 7000,
    "discount_factor": 0.95,
    "entropy_coef": 0.001,
    "critic_coef": 1.0,
    "a2c_coef": 0.1,
    "architecture":{
      "actor_hidden_size": 32,
      "critic_hidden_size": [24, 36],
    },
  },

  "gym_env":{
    "classname": "__main__.make_env",
    "env_name": "CartPole-v1",
  },
  "optimizer":
  {
    "classname": "torch.optim.Adam",
    "lr": 0.01,
  }
}

### Launching tensorboard to visualize the results

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./tmp


In [None]:
config=OmegaConf.create(params)
torch.manual_seed(config.algorithm.seed)
run_a2c(config)


With the parameters provided in this colab, you should observe that the reward is collapsing after 6K time steps.

## What's next?

The simple version of A2C above suffers from several limitations:
- During training, the cumulated reward is measured on the training agent itself while it is changing. It is better practice to pause training from time to time and run a few evaluation episodes with the current agent.
- Separating the ProbAgent and the ActorAgent is nice for illustrating the properties of BBRL, but it is not so convenient, for instance when one wants to know the action of the agent in a given state without calling upon a workspace.
- The code above only illustrates A2C with discrete actions, though the algorithm can also deal with continuous actions. Doing so requires defining new Agent classes and unifying the way they are used, to avoid scattering "if discrete/continuous" branches through the code.

We will address all these limitations in [this notebook](https://colab.research.google.com/drive/1C_mgKSTvFEF04qNc_Ljj0cZPucTJDFlO?usp=sharing). We will also add a few features, such as saving and loading agents or drawing pictures of the policy and critic agents.