# Outlook

In [a previous notebook](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing), we saw the BBRL data collection approach and illustrated it with a simple pair of random agents. In this colab we explain how to create RL agents based on neural networks using PyTorch and how to create agents representing gym environments using the GymAgent classes from BBRL.

## Installation and imports

The BBRL library is [here](https://github.com/osigaud/bbrl).

Note that we install the `my_gym` library to make sure we import gym version 0.21.0. The interface of later versions has been modified and may be incompatible with previous code.

In [None]:
import functools
import time

!pip install omegaconf
import omegaconf

import gym
!pip install git+https://github.com/osigaud/my_gym.git

try:
  import bbrl
except ImportError:
  from IPython.display import clear_output 
  !pip install git+https://github.com/osigaud/bbrl.git
  clear_output()
  import bbrl

### BBRL imports

As in the previous notebook, we import BBRL agents.

In [None]:
from bbrl.workspace import Workspace

from bbrl.agents.agent import Agent

from bbrl import get_class, get_arguments, instantiate_class

# Agents(agent1,agent2,agent3,...) executes the different agents one after the other
# TemporalAgent(agent) executes an agent (e.g. an Agents instance) over multiple timesteps in the workspace, 
# or until a given condition is reached
from bbrl.agents import Agents, TemporalAgent

# NoAutoResetGymAgent (resp. AutoResetGymAgent) are agents able to execute a batch of gym environments
# without (resp. with) auto-resetting. These agents produce multiple variables in the workspace: 
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/done', 'env/initial_state', 'env/cumulated_reward', 
# ... When called at timestep t=0, the environments are automatically reset. 
# At timestep t>0, these agents will read the 'action' variable in the workspace at time t-1
from bbrl.agents.gyma import AutoResetGymAgent, NoAutoResetGymAgent

# Not present in the A2C version...
from bbrl.utils.logger import TFLogger

### Other Imports

In [None]:
import copy
import time

import torch
import torch.nn as nn
import torch.nn.functional as F

import my_gym

In [None]:
from omegaconf import OmegaConf

## Creating Neural RL agents

We will build two types of agents:
- First, a stochastic actor that outputs a stochastic discrete action given a state. This agent will be made of two parts, one for generating probabilities over actions, and the other one for choosing an action according to these probabilities.
- Second, a deterministic critic that outputs Q-values for discrete actions given a state. Here we will introduce more general functions to build neural networks from a set of specified layer sizes. 

### A probabilistic actor in two parts

Here we replace the simple ActionAgent of the previous notebook with a combination of two agents: the probabilistic agent, which contains the neural network, and the action agent, which selects an action based on the probabilities resulting from the probabilistic agent.

#### Probabilistic Agent

A ProbAgent is a neural network with one hidden layer which takes an observation as input and whose output is a probability distribution over actions, obtained through a final softmax.

The first layers are built in the `__init__()` function using the simple `nn.Sequential(...)` model from pytorch. We will do something more sophisticated to deal with an arbitrary number of layers later in another notebook.

Now, let us have a look at the `forward()` function, which is called each time the agent performs a step in the environment.

To get the input observation from the environment we call
`observation = self.get(("env/env_obs", t))`
and to write the resulting action probabilities into the workspace we call
`self.set(("action_probs", t), probs)`. In between, we call `torch.softmax()` to get probabilities from the output layer of the network.
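
As a small standalone check of the softmax step, the sketch below applies `torch.softmax` to a batch of hand-written scores. The values are purely illustrative (they are not produced by the network of this notebook); the point is only that each row becomes a probability distribution.

```python
import torch

# Hypothetical raw scores for a batch of 2 observations and 3 actions
scores = torch.tensor([[1.0, 2.0, 3.0],
                       [0.0, 0.0, 0.0]])

# Softmax over the last dimension turns each row into probabilities
probs = torch.softmax(scores, dim=-1)

print(probs)
print(probs.sum(dim=-1))  # each row sums to 1 (up to rounding)
```

Using `dim=-1` means the normalization is done over actions, independently for each observation of the batch.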

In [None]:
class ProbAgent(Agent):
    def __init__(self, observation_size, hidden_size, n_actions):
        super().__init__()
        self.model = nn.Sequential(
            nn.Linear(observation_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, n_actions),
        )

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        scores = self.model(observation)
        probs = torch.softmax(scores, dim=-1)
        self.set(("action_probs", t), probs)

#### Actor Agent

The ActorAgent takes action probabilities as input (coming from the ProbAgent) and outputs an action. In the deterministic case it takes the argmax, in the stochastic case it samples from the Categorical distribution. This agent does not have a neural network, it just takes a decision from the output of the ProbAgent.
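
The two selection modes can be compared outside of any agent. The following is a minimal sketch on hand-written probabilities (illustrative values, assuming a batch of 2 environments and 3 actions); the seed is set only to make the sampled actions repeatable.

```python
import torch

torch.manual_seed(0)  # only to make the sampled actions repeatable

# Hypothetical action probabilities: 2 environments, 3 actions
probs = torch.tensor([[0.1, 0.7, 0.2],
                      [0.6, 0.3, 0.1]])

# Deterministic choice: the most probable action in each row
greedy_action = probs.argmax(dim=1)

# Stochastic choice: one sample per row from the categorical distribution
sampled_action = torch.distributions.Categorical(probs).sample()

print("greedy:", greedy_action)
print("sampled:", sampled_action)
```

Both results hold one action index per environment, so they can be written into the workspace in the same way.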

In [None]:
class ActorAgent(Agent):
    def __init__(self):
        super().__init__()

    def forward(self, t, stochastic, **kwargs):
        probs = self.get(("action_probs", t))
        if stochastic:
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = probs.argmax(1)

        self.set(("action", t), action)

Note that instead of having a ProbAgent followed by an ActorAgent, we could have built a single agent containing both. We are doing this mainly to illustrate the capabilities of BBRL to combine agents.

Note also that this pair of agents is adequate for environments with discrete actions, but we need something different if the environment takes continuous actions. In that case, instead of a categorical distribution, we would rather use a Gaussian distribution. We will build other versions of these agents later in another notebook.

### A deterministic critic agent

The function below builds a multi-layer perceptron where the size of each layer is given in the `sizes` list. We also specify the activation function of neurons at each layer and optionally a different activation function for the final layer.
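
To make the target structure concrete, here is a sketch of the network that such a size-driven builder is expected to produce, written out by hand. The layer sizes `[4, 32, 32, 2]` are illustrative assumptions, with ReLU as hidden activation and an identity after the last layer.

```python
import torch
import torch.nn as nn

# Illustrative layer sizes: 4 inputs, two hidden layers of 32 units, 2 outputs
sizes = [4, 32, 32, 2]

# One Linear layer per consecutive pair of sizes, ReLU after hidden layers,
# and an Identity (no-op) activation after the final layer
mlp = nn.Sequential(
    nn.Linear(sizes[0], sizes[1]), nn.ReLU(),
    nn.Linear(sizes[1], sizes[2]), nn.ReLU(),
    nn.Linear(sizes[2], sizes[3]), nn.Identity(),
)

batch = torch.zeros(5, sizes[0])  # 5 dummy observations
out = mlp(batch)
print(out.shape)  # torch.Size([5, 2])
```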

In [None]:
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)

The `DiscreteQAgent` class implements a critic such as the one used in DQN. It has one output neuron per action and its output is the Q-value of these actions given the state. 

Note that, like any BBRL agent, it has a forward function that takes a time step as input. This forward function outputs the Q-values at the corresponding time step. Additionally, if the critic is used to choose an action, it also outputs the chosen action at the same time step.

Besides, it is also useful to get the network output (as a Q-value or as an action) given a state rather than a time step. This is what the `predict_action` and `predict_value` functions are used for.
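
The idea of querying the network from a state rather than from a time step can be sketched independently of BBRL. The example below uses a stand-in linear Q-network and a hand-written, unbatched state (both are assumptions made for illustration, not the agent of this notebook).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Stand-in Q-network: 4 state features in, one Q-value per action (2 actions) out
q_network = nn.Linear(4, 2)

state = torch.tensor([0.1, -0.2, 0.05, 0.3])  # a single, unbatched state

with torch.no_grad():
    q_values = q_network(state)  # shape: (2,)

greedy_action = q_values.argmax(dim=-1)    # index of the highest Q-value
value_of_action = q_values[greedy_action]  # Q-value of that action

print("Q-values:", q_values)
print("greedy action:", greedy_action.item())
print("its value:", value_of_action.item())
```

No workspace is involved here: the state goes in, and an action or a value comes out directly.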

In [None]:
class DiscreteQAgent(Agent):
    def __init__(self, state_dim, hidden_layers, action_dim):
        super().__init__()
        self.is_q_function = True
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [action_dim], activation=nn.ReLU()
        )

    def forward(self, t, choose_action=True, **kwargs):
        obs = self.get(("env/env_obs", t))
        q_values = self.model(obs).squeeze(-1)
        self.set(("q_values", t), q_values)
        if choose_action:
            action = q_values.argmax(1)
            self.set(("action", t), action)

    def predict_action(self, obs, stochastic):
        q_values = self.model(obs).squeeze(-1)
        if stochastic:
            probs = torch.softmax(q_values, dim=-1)
            action = torch.distributions.Categorical(probs).sample()
        else:
            action = q_values.argmax(-1)  # argmax over the action dimension
        return action

    def predict_value(self, obs, action):
        q_values = self.model(obs).squeeze(-1)
        return q_values[0][action]

## Creating the environment agent

Now that we have an RL agent, let us create an environment. In the previous notebook, we were doing something simple building on the Agent class. Now we do something more sophisticated, building on the `GymAgent` class and encapsulating an OpenAI gym environment.

### Using a gym environment

The function below creates the environment. In OpenAI gym, an environment can be known by its name, here the string `env_name`. The environment is generally given a maximum number of steps, which is enforced by the `TimeLimit` wrapper. Therefore, we do not need to add our own `TimeLimit`: doing so may break the episode termination behavior.
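
To give an intuition of what a step limit does, here is a toy sketch in plain Python. `EndlessEnv` and `StepLimit` are hypothetical classes written for this illustration; they are not gym's `TimeLimit` implementation, they only mimic the idea of forcing `done` after a fixed number of steps.

```python
class EndlessEnv:
    """Hypothetical toy environment whose episodes never end on their own."""

    def reset(self):
        return 0

    def step(self, action):
        # (observation, reward, done, info), in the style of the old gym API
        return 0, 1.0, False, {}


class StepLimit:
    """Hypothetical wrapper that forces `done` after `max_steps` steps."""

    def __init__(self, env, max_steps):
        self.env = env
        self.max_steps = max_steps
        self.elapsed = 0

    def reset(self):
        self.elapsed = 0
        return self.env.reset()

    def step(self, action):
        obs, reward, done, info = self.env.step(action)
        self.elapsed += 1
        if self.elapsed >= self.max_steps:
            done = True  # the episode is cut by the step limit
        return obs, reward, done, info


env = StepLimit(EndlessEnv(), max_steps=5)
env.reset()
n_steps, done = 0, False
while not done:
    _, _, done, _ = env.step(action=0)
    n_steps += 1
print(n_steps)  # 5
```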

In [None]:
def make_env(env_name):
    return gym.make(env_name)

To call the above function, we will use a reflexive instantiation mechanism and get the parameters of the function from the `gym_env` entry of the `params` dictionary: `"gym_env": {"classname": "__main__.make_env", "env_name": "CartPole-v1"}`.

Using this instantiation approach from a function is useful if, for instance, you define a new environment: you just change the `classname` and put the arguments of the constructor directly in the dictionary, and everything will work fine. This may not seem natural at first sight, but once you start using it, you will never go back :) 

The `instantiate_class`, `get_class` and `get_arguments` functions are available in the [`main/bbrl/__init__.py`](https://github.com/osigaud/bbrl/blob/master/bbrl/__init__.py) file. The `get_class` function reads the `classname` in the parameters to create the appropriate type of object, and the `get_arguments` function reads the local parameters and their values to pass them to the corresponding object. 
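
The general idea of such a reflexive mechanism can be sketched with the standard library only. The three `*_sketch` helpers below are hypothetical and written for this illustration; they are not BBRL's actual code, and they are applied to a standard-library class rather than to an environment.

```python
import importlib


def get_class_sketch(cfg):
    """Resolve the dotted path stored under 'classname' to a Python object."""
    module_path, name = cfg["classname"].rsplit(".", 1)
    module = importlib.import_module(module_path)
    return getattr(module, name)


def get_arguments_sketch(cfg):
    """Return every entry except 'classname' as keyword arguments."""
    return {key: value for key, value in cfg.items() if key != "classname"}


def instantiate_sketch(cfg):
    """Call the resolved class or function with the remaining entries."""
    return get_class_sketch(cfg)(**get_arguments_sketch(cfg))


# Illustration with a standard-library class instead of an environment
cfg = {"classname": "fractions.Fraction", "numerator": 3, "denominator": 4}
fraction = instantiate_sketch(cfg)
print(fraction)  # 3/4
```

Changing only the `classname` string and the remaining entries is enough to build a different object, which is what makes this pattern convenient in configuration files.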

### Creating the environment agent

Now, let us create the agent representing the environment. We do so using the `NoAutoResetGymAgent` which inherits from the [GymAgent](https://github.com/osigaud/bbrl/blob/master/bbrl/agents/gyma.py#L76) class and is provided by BBRL.
Essential information about this class is given in [this notebook](https://colab.research.google.com/drive/1EX5O03mmWFp9wCL_Gb_-p08JktfiL2l5?usp=sharing).

In the `get_env(cfg)` function below, the `NoAutoResetGymAgent` is created taking as arguments the environment creation function (here the `make_env` function that we defined above) with its parameters, then the number of environments and a seed. This seed serves to initialize the random number generator so that using the same seed will generate the same numbers again.
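
The role of a seed can be illustrated with Python's own `random` module. This is a generic sketch of seeded pseudo-random generation, not the seeding code used by gym or BBRL; the `draw` helper is hypothetical.

```python
import random


def draw(seed, n=5):
    """Draw n pseudo-random integers from a generator initialized with seed."""
    rng = random.Random(seed)
    return [rng.randint(0, 99) for _ in range(n)]


first = draw(432)
second = draw(432)
other = draw(433)

print(first == second)  # True: same seed, same sequence
print(first, other)
```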

These parameters are specified in a configuration object called `config`, which is described below.

In [None]:
def get_env(cfg):
    env_agent = NoAutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        cfg.algorithm.n_envs,
        cfg.algorithm.seed,
    )
    return env_agent

## Running the agents along a single NoAutoReset episode

Before running everything, we need to get the parameters of the algorithm and the environment agent. To do this, we use the `omegaconf` package, which builds a configuration object from a Python dictionary.

In [None]:
params={
  "algorithm":{
    "seed": 432,
    "n_envs": 1,
    "architecture":{"hidden_size": 32},
  },
  "gym_env":{
    "classname": "__main__.make_env",
    "env_name": "CartPole-v1",
  },
}

As in [this notebook](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing), once we have the RL agent and the environment agent, we bind them together into a TemporalAgent.

We have to create the environment agent first, because we need to know the size of the observation and action spaces to create the action agent which will interact with it.

In [None]:
config = OmegaConf.create(params)

env_agent = get_env(config)
observation_size, n_actions = env_agent.get_obs_and_actions_sizes()
prob_agent = ProbAgent(observation_size, config.algorithm.architecture.hidden_size, n_actions)
action_agent = ActorAgent()
composed_agent = Agents(env_agent, prob_agent, action_agent)
  
# Get a temporal agent that can be executed in a workspace
t_agent = TemporalAgent(composed_agent)

And finally we execute it in the workspace. Here, we run for 30 steps.

In [None]:
# We create a workspace
workspace = Workspace()

# The temporal agent will be run for 30 steps on this workspace
t_agent(workspace, t=0, n_steps=30, stochastic=True)

# We retrieve the information as it is stored in the workspace
obs, action, reward, done = workspace["env/env_obs", "action", "env/reward", "env/done"]

# And we print them
print("obs:", obs)
print("action:", action)
print("reward:", reward)
print("done:", done)
# You should see that each variable has been recorded for the number of specified 
# time steps...

If you run the above interaction loop for enough steps (say 30) and you look closely at the output of the above cell, you will see that after the task is done (that is, the pole falls down in our case), the workspace keeps being filled with copies of the last time step.

We explain in detail in [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5?usp=sharing) why this is so, together with several issues about data collection.
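
One practical consequence of these copies is that sums over the time axis would count the last step several times. The sketch below uses hand-written stand-in tensors of shape (time, n_envs), not actual workspace data, and assumes that once `done` is true it stays true; it shows one possible way to recover the true episode length and keep only the valid steps.

```python
import torch

# Hand-written stand-in for workspace variables of shape (time, n_envs),
# with one environment whose episode ends at time step 2
done = torch.tensor([[False], [False], [True], [True], [True]])
reward = torch.tensor([[1.0], [1.0], [1.0], [1.0], [1.0]])

# Once done, an environment stays done, so counting the not-done steps
# and adding the terminal step gives the true episode length
episode_length = (~done).sum(dim=0) + 1

# Keep only the steps that belong to the actual episode (first environment)
valid_reward = reward[: int(episode_length[0]), 0]

print(episode_length)      # tensor([3])
print(valid_reward.sum())  # tensor(3.)
```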

### Exercise

Do the same using a `DiscreteQAgent`, then print the Q-values stored in the workspace.

In [None]:
# Your code here

## What's next?

We are now ready to write a first version of the DQN algorithm, using the `DiscreteQAgent` defined above. We do so in [this notebook](https://colab.research.google.com/drive/1H9_gkenmb_APnbygme1oEdhqMLSDc_bM?usp=sharing).

Or we can switch directly to using the AutoResetGymAgent class. We do so in [this notebook](https://colab.research.google.com/drive/1VJUoDGhxKv3mmFjTmLj_JDpappVw29xh?usp=sharing).