# Outlook

In [this previous notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh#scrollTo=X3_nZLp8wfjQ), we have seen how to create agents representing gym environments using the NoAutoResetGymAgent class from BBRL. We now explain how to do the same with the AutoResetGymAgent class. 

The first parts of this notebook are the same as in [the previous one](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh#scrollTo=X3_nZLp8wfjQ), so you do not need to read everything again.

## Installation and imports

The BBRL library is [here](https://github.com/osigaud/bbrl).

Note that we install the `my_gym` library to make sure we import gym version 0.21.0. The interface of later versions has been modified and may be incompatible with previous code.

In [1]:
import functools
import time

!pip install omegaconf
import omegaconf

# Install my_gym first so that the gym version it pins is the one that gets imported
!pip install git+https://github.com/osigaud/my_gym.git
import gym

try:
  import bbrl
except ImportError:
  from IPython.display import clear_output 
  !pip install git+https://github.com/osigaud/bbrl.git
  clear_output()
  import bbrl

### BBRL imports

As in the previous notebook, we import the BBRL agents.

In [2]:
from bbrl.workspace import Workspace

from bbrl.agents.agent import Agent

from bbrl import get_class, get_arguments, instantiate_class

# Agents(agent1, agent2, agent3, ...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent (e.g. an Agent) over multiple timesteps in the workspace,
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# NoAutoResetGymAgent (resp. AutoResetGymAgent) are agents able to execute a batch of gym environments
# without (resp. with) auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/done', 'env/initial_state', 'env/cumulated_reward',
# ... When called at timestep t=0, the environments are automatically reset.
# At timestep t>0, these agents read the 'action' variable in the workspace at time t - 1
from bbrl.agents.gyma import AutoResetGymAgent, NoAutoResetGymAgent

# Not present in the A2C version...
from bbrl.utils.logger import TFLogger

### Other Imports

In [3]:
import copy
import time

import torch
import torch.nn as nn
import torch.nn.functional as F

import my_gym

In [4]:
from omegaconf import OmegaConf

## Creating Neural RL agents

We will build two types of agents:
- First, a stochastic actor that outputs a stochastic discrete action given a state. This agent will be made of two parts, one for generating probabilities over actions, and the other one for choosing an action according to these probabilities.
- Second, a deterministic critic that outputs Q-values for discrete actions given a state. Here we will introduce more general functions to build neural networks from a set of specified layer sizes. 

### A probabilistic actor in two parts

Here we replace the simple ActionAgent of the previous notebook with a combination of two agents: the probabilistic agent, which contains the neural network, and the action agent, which selects an action based on the probabilities resulting from the probabilistic agent.

#### Probabilistic Agent

A ProbAgent is a one hidden layer neural network which takes an observation as input and whose output is a probability given by a final softmax layer.

The first layers are built in the `__init__()` function using the simple `nn.Sequential(...)` model from pytorch. We will do something more sophisticated to deal with an arbitrary number of layers later in another notebook.

Now, let us have a look at the `forward()` function, which is called each time the agent performs a step in the environment.

To get the input observation from the environment, we call
`observation = self.get(("env/env_obs", t))`
and to write the resulting action probabilities into the workspace, we call
`self.set(("action_probs", t), probs)`. In between, we call `torch.softmax()` to get probabilities from the output layer of the network.

In [5]:
class ProbAgent(Agent):
    def __init__(self, observation_size, hidden_size, n_actions):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        scores = self.model(observation)
        probs = torch.softmax(scores, dim=-1)
        self.set(("action_probs", t), probs)

#### Actor Agent

The ActorAgent takes action probabilities as input (coming from the ProbAgent) and outputs an action. In the deterministic case it takes the argmax; in the stochastic case it samples from the Categorical distribution. This agent does not have a neural network: it just makes a decision from the output of the ProbAgent.
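
The difference between the two selection modes can be seen on a toy example, without any BBRL machinery. The sketch below is a plain-Python stand-in (standard library only) for `torch.softmax` followed by either `argmax` or `Categorical(...).sample()`. The helper names `softmax`, `greedy_action` and `sample_action` are illustrative assumptions and are not part of BBRL or PyTorch.

```python
import math
import random


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    shift = max(scores)  # subtracting the max avoids overflow in exp()
    exps = [math.exp(s - shift) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_action(probs):
    """Deterministic case: index of the most probable action."""
    return max(range(len(probs)), key=lambda a: probs[a])


def sample_action(probs, rng):
    """Stochastic case: draw an action index according to probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


probs = softmax([2.0, 0.5])  # scores for two actions, as in CartPole
rng = random.Random(0)       # seeded so that the draws are reproducible
samples = [sample_action(probs, rng) for _ in range(1000)]

print(probs)                            # two probabilities summing to 1
print(greedy_action(probs))             # always the same action
print(samples.count(0) / len(samples))  # close to probs[0], not exactly equal
```

The greedy choice always returns the same action for a given observation, whereas sampling still picks the less probable action from time to time, which is what provides exploration during training.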

In [6]:
class ActorAgent(Agent):
    def __init__(self):
        super().__init__()

    def forward(self, t, stochastic, **kwargs):
        probs = self.get(("action_probs", t))
        if stochastic:
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = probs.argmax(1)

        self.set(("action", t), action)

Note that instead of having a ProbAgent followed by an ActorAgent, we could have built a single agent containing both. We are doing this mainly to illustrate the capabilities of BBRL to combine agents.

Note also that this pair of agents is adequate for environments with discrete actions, but we need something different if the environment takes continuous actions. In that case, instead of a categorical distribution, we would rather use a Gaussian distribution. We will build other versions of these agents later in another notebook.

### A deterministic critic agent

The function below builds a multi-layer perceptron where the size of each layer is given in the `sizes` list. We also specify the activation function of the neurons at each layer and, optionally, a different activation function for the final layer.

In [7]:
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)
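
To see how the `sizes` list maps to layers, the sketch below mirrors the loop of `build_mlp` with strings in place of torch modules, so it runs without PyTorch. The helper `describe_mlp` is hypothetical and only meant for illustration.

```python
def describe_mlp(sizes, activation="ReLU", output_activation="Identity"):
    """List (in_features, out_features, activation) for each linear layer.

    Hypothetical helper: same loop as build_mlp, but it records names
    instead of building torch modules.
    """
    layers = []
    for j in range(len(sizes) - 1):
        # Every layer but the last one uses the hidden activation
        act = activation if j < len(sizes) - 2 else output_activation
        layers.append((sizes[j], sizes[j + 1], act))
    return layers


# CartPole-like sizes: 4 observations, two hidden layers of 32 units, 2 actions
for layer in describe_mlp([4, 32, 32, 2]):
    print(layer)
```

A list of `n` sizes thus yields `n - 1` linear layers, and only the last one receives the output activation.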

The `DiscreteQAgent` class implements a critic such as the one used in DQN. It has one output neuron per action and its output is the Q-value of these actions given the state.

Note that, as any BBRL agent, it has a forward function that takes a time step as input. This forward function outputs the Q-values at the corresponding time step. Additionally, if the critic is used to choose an action, it also outputs the chosen action at the same time step.

Besides, it is also useful to get the network output (as a Q-value or as an action) given a state rather than a time step. This is what the `predict_action` and `predict_value` functions are used for.

In [8]:
class DiscreteQAgent(Agent):
    def __init__(self, state_dim, hidden_layers, action_dim):
        super().__init__()
        self.is_q_function = True
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [action_dim], activation=nn.ReLU()
        )

    def forward(self, t, choose_action=True, **kwargs):
        obs = self.get(("env/env_obs", t))
        q_values = self.model(obs).squeeze(-1)
        self.set(("q_values", t), q_values)
        if choose_action:
            action = q_values.argmax(1)
            self.set(("action", t), action)

    def predict_action(self, obs, stochastic):
        q_values = self.model(obs).squeeze(-1)
        if stochastic:
            probs = torch.softmax(q_values, dim=-1)
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = q_values.argmax(0)
        return action

    def predict_value(self, obs, action):
        q_values = self.model(obs).squeeze(-1)
        return q_values[0][action]

## Creating the environment agent

Now that we have an RL agent, let us create an environment. In the previous notebook, we were doing something simple building on the Agent class. Now we do something more sophisticated, building on the `GymAgent` class and encapsulating an OpenAI gym environment.

### Using a gym environment

The function below creates the environment. In OpenAI gym, an environment can be known by its name, here the string `env_name`. The environment is generally given a maximum number of steps, which is enforced by the `TimeLimit` wrapper. Therefore, we do not need to add our own `TimeLimit` wrapper: doing so may break the episode termination behavior.

In [9]:
def make_env(env_name):
    return gym.make(env_name)

To call the above function, we will use a reflexive instantiation mechanism and get the parameters of the function from the parameters dictionary (`params2` below), in the `"gym_env": {"classname": "__main__.make_env", "env_name": "CartPole-v1"}` part.

Using this instantiation approach is useful if, for instance, you define a new environment: you just change the `classname` and put the arguments of the constructor directly in the dictionary, and everything will work fine. This may not seem natural at first sight, but once you start using it, you will never go back :)

The `instantiate_class`, `get_class` and `get_arguments` functions are available in the [`bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/bbrl/__init__.py) file. The `get_class` function reads the `classname` in the parameters to create the appropriate type of object, and the `get_arguments` function reads the other parameters and their values to pass them to the corresponding object.
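
The mechanism described above can be sketched with the standard library alone. The functions below are simplified stand-ins written for this illustration (hence the `_sketch` suffix). They are not the BBRL implementations, which should be read in the file linked above.

```python
import importlib


def get_class_sketch(cfg):
    """Resolve the dotted 'classname' entry to a Python callable."""
    module_path, _, name = cfg["classname"].rpartition(".")
    module = importlib.import_module(module_path)
    return getattr(module, name)


def get_arguments_sketch(cfg):
    """Return every entry except 'classname' as keyword arguments."""
    return {k: v for k, v in cfg.items() if k != "classname"}


def instantiate_sketch(cfg):
    """Call the resolved class or function with the remaining entries."""
    return get_class_sketch(cfg)(**get_arguments_sketch(cfg))


# Any importable callable works; a standard-library class is used here
# so that the example runs without gym or BBRL.
cfg = {"classname": "fractions.Fraction", "numerator": 3, "denominator": 4}
print(get_arguments_sketch(cfg))
print(instantiate_sketch(cfg))
```

With this pattern, switching to another environment only requires editing the configuration entry. The code that builds the environment agent stays the same.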

## Running several episodes split into epochs with an AutoReset environment

The `NoAutoResetGymAgent` is the easiest environment agent to use. Let us now consider the `AutoResetGymAgent`, which we will use to run several episodes split into epochs. This more complicated type of environment agent is explained in [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5?usp=sharing).

### Creating the environment agent

This is as before, except that we use an `AutoResetGymAgent` instead of a `NoAutoResetGymAgent`.

In [10]:
def get_env(cfg):
    env_agent = AutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        cfg.algorithm.n_envs,
        cfg.algorithm.seed,
    )
    return env_agent

### Building the temporal agent

Creating the temporal agent is exactly as before.

In the parameters, we add a number of epochs (here 5), a number of steps per epoch (here 10) and this time we use several environments in parallel (here 3).

In [11]:
params2={
  "algorithm":{
    "seed": 432,
    "nb_epochs": 5,
    "n_steps": 10,
    "n_envs": 3,
    "architecture":{"hidden_size": 32},
  },
  "gym_env":{
    "classname": "__main__.make_env",
    "env_name": "CartPole-v1",
  },
}

In [12]:
config = OmegaConf.create(params2)

env_agent = get_env(config)
observation_size, n_actions = env_agent.get_obs_and_actions_sizes()
prob_agent = ProbAgent(observation_size, config.algorithm.architecture.hidden_size, n_actions)
action_agent = ActorAgent()
composed_agent = Agents(env_agent, prob_agent, action_agent)
  
# Get a temporal agent that can be executed in a workspace
t_agent = TemporalAgent(composed_agent)

### Running the main loop

We now write the main loop to run a number of epochs specified in the params. There are three parts inside the loop over epochs.

- In the first part, the temporal agent is run in the workspace for each epoch. Note that the first epoch is different, as in the next epochs we need to copy the last step of the previous epoch to avoid missing a transition. This is explained in detail in [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5).

- The second part consists in collecting the interaction data from the workspace. Note that to properly filter out the transitions from an episode to the next, we have to rearrange the data using the `get_transitions()` function. As a result, the data structures we get back from the workspace are a little more complicated. Again, this is explained in [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5).

- The third part simply consists in counting the steps. One must not forget to account for the number of environments. One cannot simply use `cfg.algorithm.n_steps`, as transitions from an episode to the next are filtered out, as explained in [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5).
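
The step counting described in the last item can be illustrated with plain Python lists. This sketch assumes, based on the description above, that a transition `(t, t+1)` is dropped when step `t` ends an episode, since step `t+1` then belongs to a new episode. The helper `count_kept_transitions` is hypothetical and does not call BBRL.

```python
def count_kept_transitions(done_flags):
    """Count pairs (t, t+1) for one environment, skipping the pairs whose
    first step is terminal (the next step starts a new episode)."""
    return sum(1 for t in range(len(done_flags) - 1) if not done_flags[t])


n_steps = 10

# Three environments, none of which finishes an episode in the window
no_reset = [[False] * n_steps for _ in range(3)]

# Same, but the second environment terminates at step 4 and auto-resets
one_reset = [row[:] for row in no_reset]
one_reset[1][4] = True

print(sum(count_kept_transitions(env) for env in no_reset))   # 27
print(sum(count_kept_transitions(env) for env in one_reset))  # 26
```

Without any reset, 10 steps give 9 transitions per environment, hence 27 for 3 environments. Each episode end inside the window removes one transition, which is why the count has to be read from the transition workspace rather than derived from the configuration.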

In [16]:
workspace = Workspace()

nb_steps = 0
for epoch in range(config.algorithm.nb_epochs):
    if epoch == 0:
        t_agent(workspace, t=0, n_steps=config.algorithm.n_steps, stochastic=True)
    else:
        workspace.zero_grad()
        workspace.copy_n_last_steps(1)
        t_agent(workspace, t=1, n_steps=config.algorithm.n_steps - 1, stochastic=True)

    transition_workspace = workspace.get_transitions()

    # We retrieve the information as it is stored in the workspace
    obs, done, truncated, reward, action = transition_workspace[
        "env/env_obs", "env/done", "env/truncated", "env/reward", "action"
    ]
    nb_steps += action[0].shape[0]
    # And we print it
    print("obs:", obs)
    print("action:", action)
    print("reward:", reward)
    print("done:", done)

27
obs: tensor([[[ 7.2509e-03,  4.4853e-02,  4.8312e-02,  3.3784e-04],
         [-4.5774e-02, -4.5540e-02, -2.9604e-03, -3.0580e-02],
         [ 9.8080e-03, -5.0475e-03, -4.6294e-02,  6.3606e-03],
         [ 8.1480e-03,  2.3925e-01,  4.8318e-02, -2.7672e-01],
         [-4.6685e-02, -2.4062e-01, -3.5720e-03,  2.6117e-01],
         [ 9.7070e-03,  1.9071e-01, -4.6167e-02, -3.0056e-01],
         [ 1.2933e-02,  4.3473e-02,  4.2784e-02,  3.0803e-02],
         [-5.1497e-02, -4.3569e-01,  1.6513e-03,  5.5272e-01],
         [ 1.3521e-02,  3.8646e-01, -5.2178e-02, -6.0744e-01],
         [ 1.3802e-02, -1.5224e-01,  4.3400e-02,  3.3667e-01],
         [-6.0211e-02, -2.4059e-01,  1.2706e-02,  2.6056e-01],
         [ 2.1250e-02,  5.8227e-01, -6.4327e-02, -9.1609e-01],
         [ 1.0758e-02, -3.4795e-01,  5.0134e-02,  6.4272e-01],
         [-6.5023e-02, -4.5653e-02,  1.7917e-02, -2.8089e-02],
         [ 3.2896e-02,  7.7820e-01, -8.2649e-02, -1.2283e+00],
         [ 3.7988e-03, -1.5356e-01,  6.2988e-02

### Understanding the stored data

Exercises: 
- how do we get the reward of the current time step?
- how do we get the observation of the next time step?

## What's next?

We are now ready to write a version of the DQN algorithm with the AutoResetGymAgent. We do so in [this notebook](https://colab.research.google.com/drive/1H9_gkenmb_APnbygme1oEdhqMLSDc_bM?usp=sharing).

Alternatively, we can start implementing the A2C algorithm, using our probabilistic agent. We do so in [this notebook](https://colab.research.google.com/drive/1yAQlrShysj4Q9EBpYM8pBsp2aXInhP7x?usp=sharing).