# Outlook

In this notebook, using BBRL, we code a simple version of the DQN algorithm with neither a replay buffer nor a target network, so as to better understand its inner mechanisms. To understand this code, you need to know more about BBRL. You should first have a look at [the BBRL interaction model](https://colab.research.google.com/drive/1_yp-JKkxh_P8Yhctulqm0IrLbE41oK1p?usp=sharing), then [a first example](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing).

## Installation and Imports

### Installation

The BBRL library is [here](https://github.com/osigaud/bbrl).

OmegaConf makes it possible to run the code with the desired parameters without calling an explicit main: we just define the `run_dqn(cfg)` function, and then execute a long `params = {...}` variable at the bottom of this colab.

More precisely, the code is run by calling

`config=OmegaConf.create(params)`

`run_dqn(config)`

at the very bottom of the colab, after starting tensorboard.

In [None]:
import os
import functools
import time
!pip install omegaconf
from omegaconf import OmegaConf

import gym
!pip install git+https://github.com/osigaud/my_gym.git
!pip install git+https://github.com/osigaud/bbrl.git

import bbrl


### Imports

Below, we import standard Python packages, PyTorch packages and gym environments.

[OpenAI gym](https://gym.openai.com/) is a collection of benchmark environments to evaluate RL algorithms.

In [None]:
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F

import gym

### BBRL imports

In [None]:
from bbrl.agents.agent import Agent
from bbrl import get_arguments, get_class, instantiate_class

# The workspace is the main class in BBRL, this is where all data is collected and stored
from bbrl.workspace import Workspace

# Agents(agent1,agent2,agent3,...) executes the different agents the one after the other
# TemporalAgent(agent) executes an agent over multiple timesteps in the workspace, 
# or until a given condition is reached
from bbrl.agents import Agents, RemoteAgent, TemporalAgent

# NoAutoResetGymAgent is an agent able to execute a batch of gym environments
# without auto-resetting. These agents produce multiple variables in the workspace:
# 'env/env_obs', 'env/reward', 'env/timestep', 'env/done', 'env/initial_state', 'env/cumulated_reward',
# ... When called at timestep t=0, the environments are automatically reset.
# At timestep t>0, these agents read the 'action' variable in the workspace at time t - 1
from bbrl.agents.gymb import NoAutoResetGymAgent
# Not present in the A2C version...
from bbrl.utils.logger import TFLogger

## Definition of agents


See [this notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing) for previous explanations about agents and environment agents.

### The critic agent

The [DQN](https://daiwk.github.io/assets/dqn.pdf) algorithm is a critic-only algorithm. Thus we just need a Critic agent (which will also be used to output actions) and an Environment agent. We reuse the `DiscreteQAgent` class that we have already explained in [this notebook](https://colab.research.google.com/drive/1Ui481r47fNHCQsQfKwdoNEVrEiqAEokh?usp=sharing).

In [None]:
def build_mlp(sizes, activation, output_activation=nn.Identity()):
    layers = []
    for j in range(len(sizes) - 1):
        act = activation if j < len(sizes) - 2 else output_activation
        layers += [nn.Linear(sizes[j], sizes[j + 1]), act]
    return nn.Sequential(*layers)

In [None]:
class DiscreteQAgent(Agent):
    def __init__(self, state_dim, hidden_layers, action_dim):
        super().__init__()
        self.model = build_mlp(
            [state_dim] + list(hidden_layers) + [action_dim], activation=nn.ReLU()
        )

    def forward(self, t, choose_action=True, **kwargs):
        obs = self.get(("env/env_obs", t))
        q_values = self.model(obs).squeeze(-1)
        self.set(("q_values", t), q_values)
        if choose_action:
            action = q_values.argmax(1)
            self.set(("action", t), action)

In [None]:
def make_env(env_name):
    return gym.make(env_name)

### Creating an Exploration method

Like Q-learning, DQN needs some exploration to prevent premature convergence. Here we use the simple $\epsilon$-greedy exploration method, implemented as an agent which chooses an action based on the Q-values.
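
As a minimal sketch of the idea, independent of BBRL, the function below applies $\epsilon$-greedy selection to a batch of Q-values. The name `epsilon_greedy` and the use of `torch.where` are illustrative choices, not part of the notebook's agent.

```python
import torch


def epsilon_greedy(q_values: torch.Tensor, epsilon: float) -> torch.Tensor:
    """Pick one action per row of q_values (shape: [batch, nb_actions])."""
    batch_size, nb_actions = q_values.shape
    # True where the row explores, which happens with probability epsilon
    explore = torch.rand(batch_size) < epsilon
    random_action = torch.randint(low=0, high=nb_actions, size=(batch_size,))
    greedy_action = q_values.argmax(dim=1)
    # Take the random action where exploring, the greedy one elsewhere
    return torch.where(explore, random_action, greedy_action)


q = torch.tensor([[0.1, 0.9], [0.7, 0.2], [0.3, 0.4]])
greedy = epsilon_greedy(q, epsilon=0.0)  # never explores
noisy = epsilon_greedy(q, epsilon=1.0)   # always explores
```

With `epsilon=0.0` the result is the plain argmax of each row, while with `epsilon=1.0` every action is drawn uniformly at random.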

In [None]:
class EGreedyActionSelector(Agent):
    def __init__(self, epsilon):
        super().__init__()
        self.epsilon = epsilon

    def forward(self, t, **kwargs):
        q_values = self.get(("q_values", t))
        nb_actions = q_values.size()[1]
        size = q_values.size()[0]
        is_random = torch.rand(size).lt(self.epsilon).float()
        random_action = torch.randint(low=0, high=nb_actions, size=(size,))
        max_action = q_values.max(1)[1]
        action = is_random * random_action + (1 - is_random) * max_action
        action = action.long()
        self.set(("action", t), action)

### Training and evaluation environments

In actor-critic algorithms relying on a replay buffer, the actor can be trained at each step during an episode. Besides, the training signal is the reward obtained during these episodes. So it may seem natural to display a learning curve corresponding to the performance of the training agent along the set of training episodes.

But let us think about it. If the agent is changing during an episode, which agent are we truly evaluating? The one at the beginning of the episode? In the middle? At the end? Such evaluations based on an evolving agent make no sense.

It makes more sense to train an agent for a number of steps, then evaluate it on a few episodes to determine the performance of that particular agent, then resume training. With this approach, the learning curve is more meaningful: it shows the evolving performance of a succession of agents obtained after training sequences.
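
The alternation between training and evaluation can be sketched in plain Python. This is a toy illustration: `episode_lengths` stands in for the number of steps of successive training episodes, and `evaluation_points` is a hypothetical helper that returns the step counts at which an evaluation phase would be triggered.

```python
def evaluation_points(episode_lengths, eval_interval):
    """Return the cumulated step counts at which evaluation takes place."""
    nb_steps = 0    # total number of training steps so far
    last_eval = 0   # value of nb_steps at the previous evaluation
    points = []
    for length in episode_lengths:
        nb_steps += length  # one training episode
        if nb_steps - last_eval > eval_interval:
            points.append(nb_steps)  # evaluate the current agent here
            last_eval = nb_steps
    return points


points = evaluation_points([20, 35, 50, 10, 60, 45], eval_interval=100)
print(points)
```

Each entry of `points` corresponds to one point of the learning curve, measured on an agent that does not change during the evaluation.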

Separating training and evaluation provides additional opportunities. Often, we train the agent using exploration, but we evaluate it in a greedy, deterministic mode: if the problem is truly an MDP, a deterministic policy can be optimal.

We build two environments: one for training and another one for evaluation. The same agent is connected to these two environments in two instances of TemporalAgent so that we train and evaluate the same network.

In the context of this notebook, we will only use the [NoAutoResetGymAgent](https://github.com/osigaud/bbrl/blob/96e58f6e01065f6a551039c4b9f7c1036b5523e6/bbrl/agents/gyma.py#L331) class, which is explained in [this notebook](https://colab.research.google.com/drive/1EX5O03mmWFp9wCL_Gb_-p08JktfiL2l5?usp=sharing).

In practice, for training, it is more efficient to use an AutoResetGymAgent, as we do not want to waste time if the task is done in one environment sooner than in the others, but this is more involved so we keep this for [a later notebook](https://colab.research.google.com/drive/1H9_gkenmb_APnbygme1oEdhqMLSDc_bM?usp=sharing).

By contrast, for evaluation, we just need to perform a fixed number of episodes (for statistics), so it is more convenient to use a NoAutoResetGymAgent with a set of environments and just run one episode in each environment. We can then use `env/done` as the stop variable and average the cumulated reward over all environments.

To keep the story simple, we use a single environment for training.

In [None]:
def get_env_agents(cfg):
    train_env_agent = NoAutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        1,
        cfg.algorithm.seed,
    )
    eval_env_agent = NoAutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        cfg.algorithm.nb_evals,
        cfg.algorithm.seed,
    )
    return train_env_agent, eval_env_agent

### Create the DQN agent

Interestingly, the loop between the policy and the environment is first defined as a collection of agents, and then embedded into a single TemporalAgent.
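
To make this composition concrete, here is a toy sketch in plain Python. The classes `ToyAgents` and `ToyTemporalAgent` are stand-ins written for this illustration, not the BBRL classes, and the workspace is reduced to a dictionary indexed by `(variable, t)`. The sketch also simplifies the interaction: the toy environment does not read the action from the previous time step, whereas a real environment agent does.

```python
# Toy stand-ins written for this sketch; they are NOT the BBRL classes.
class ToyAgents:
    """Run several agents one after the other at a given time step."""

    def __init__(self, *agents):
        self.agents = agents

    def __call__(self, workspace, t):
        for agent in self.agents:
            agent(workspace, t)


class ToyTemporalAgent:
    """Repeat an agent over successive time steps until a stop flag is set."""

    def __init__(self, agent):
        self.agent = agent

    def __call__(self, workspace, t=0, stop_variable="done"):
        while True:
            self.agent(workspace, t)
            if workspace[(stop_variable, t)]:
                return
            t += 1


def toy_env(workspace, t):
    # Writes an observation and a termination flag; the episode lasts 3 steps
    workspace[("obs", t)] = t
    workspace[("done", t)] = t >= 2


def toy_policy(workspace, t):
    # Reads the observation written by the environment at the same time step
    workspace[("action", t)] = workspace[("obs", t)] % 2


workspace = {}
loop = ToyTemporalAgent(ToyAgents(toy_env, toy_policy))
loop(workspace, t=0, stop_variable="done")
actions = [workspace[("action", t)] for t in range(3)]
```

The point of the design is that the inner collection only knows how to act at one time step, while the temporal wrapper owns the loop over time and the stopping condition.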

In [None]:
def create_dqn_agent(cfg, train_env_agent, eval_env_agent):
    obs_size, act_size = train_env_agent.get_obs_and_actions_sizes()
    critic = DiscreteQAgent(obs_size, cfg.algorithm.architecture.hidden_size, act_size)
    explorer = EGreedyActionSelector(cfg.algorithm.epsilon)
    q_agent = TemporalAgent(critic)
    tr_agent = Agents(train_env_agent, critic, explorer)
    ev_agent = Agents(eval_env_agent, critic)

    # Get an agent that is executed on a complete workspace
    train_agent = TemporalAgent(tr_agent)
    eval_agent = TemporalAgent(ev_agent)
    train_agent.seed(cfg.algorithm.seed)
    return train_agent, eval_agent, q_agent

### The Logger class

The logger class below is not generic; it is specifically designed for this notebook.

The logger parameters are defined below in `params = { "logger":{ ...`

In this notebook, the logger is defined as `bbrl.utils.logger.TFLogger` so as to use a tensorboard visualisation (see the parameters part below).
Note that the BBRL Logger is also saving the log in a readable format such that you can use `Logger.read_directories(...)` to read multiple logs, create a dataframe, and analyze many experiments afterward in a notebook for instance. 

The code for the different kinds of loggers is available in the [bbrl/utils/logger.py](https://github.com/osigaud/bbrl/blob/master/bbrl/utils/logger.py) file.

Having logging provided under the hood is one of the ways in which RL libraries like BBRL save you time.

`instantiate_class` is an inner BBRL mechanism. The `instantiate_class` function is available in the [bbrl/__init__.py](https://github.com/osigaud/bbrl/blob/master/bbrl/__init__.py) file.

In [None]:
class Logger():

  def __init__(self, cfg):
    self.logger = instantiate_class(cfg.logger)

  def add_log(self, log_string, loss, epoch):
    self.logger.add_scalar(log_string, loss.item(), epoch)

  # Log losses
  def log_losses(self, cfg, epoch, critic_loss, entropy_loss, a2c_loss):
    self.add_log("critic_loss", critic_loss, epoch)
    self.add_log("entropy_loss", entropy_loss, epoch)
    self.add_log("a2c_loss", a2c_loss, epoch)


## Heart of the algorithm

### Computing the critic loss

The role of the `compute_critic_loss` function is to implement the Bellman backup rule. In Q-learning, this rule was written:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha [ r(s_t,a_t) + \gamma \max_a Q(s_{t+1},a) - Q(s_t,a_t)]$$

In DQN, the update rule $Q \leftarrow Q + \alpha [\delta]$ is replaced by a gradient descent step over the Q-network.

We first compute a target value: $target = r(s_t,a_t) + \gamma \max_a Q(s_{t+1},a)$ from a set of samples.

Then we get a TD error $\delta$ by subtracting $Q(s_t,a_t)$ from this target for these samples,

and we use the squared TD error as a loss function: $loss = (target - Q(s_t,a_t))^2$.
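
As a worked example of these three quantities on a single transition, the snippet below uses made-up numbers (a reward of 1, two actions in the next state) chosen only for illustration.

```python
gamma = 0.99          # discount factor
reward = 1.0          # r(s_t, a_t)
q_next = [0.5, 2.0]   # Q(s_{t+1}, a) for each action a
q_current = 1.5       # Q(s_t, a_t)

target = reward + gamma * max(q_next)  # 1.0 + 0.99 * 2.0
td_error = target - q_current
loss = td_error ** 2
```

Here the target is about 2.98, the TD error about 1.48 and the loss about 2.19.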

To implement the above calculation in BBRL, the difficulty consists in properly dealing with time indexes. We have left commented prints into the code so that you can have a look at the data structures during the computation.

The `compute_critic_loss` function receives rewards, q_values and actions as vectors (in practice, pytorch tensors) that have been computed over a complete episode.

We need to take `reward[:-1]`, which means all the rewards but the last one, because in BBRL, GymAgents repeat the reward at the last time step, as explained in [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5).

Conversely, to get $\max_a Q(s_{t+1}, a)$, we need to ignore the first of the max_q values, using `max_q[1:]`.
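
The shift between the two slices can be checked on a small example. The values below are arbitrary and describe a four-step episode in a single environment. The terminal-state handling is left out to focus on the time indexes.

```python
import torch

gamma = 0.5
# One value per time step of a 4-step episode
reward = torch.tensor([1.0, 1.0, 1.0, 1.0])
max_q = torch.tensor([10.0, 20.0, 30.0, 40.0])  # max_a Q(s_t, a) for t = 0..3

# Pair r(s_t, a_t) with max_a Q(s_{t+1}, a):
# drop the last reward and the first max_q
target = reward[:-1] + gamma * max_q[1:]
```

The resulting `target` has one entry fewer than the episode has time steps, since the last step has no successor.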

Note the `max_q[0].detach()` in the computation of the temporal difference target. First, the `[0]` is needed because the `max` function returns both the max and the indexes of the max. Second, regarding `.detach()`: we compute this target as a function of $\max_a Q(s_{t+1}, a)$, but we do not want to apply gradient descent on this $\max_a Q(s_{t+1}, a)$; we only apply gradient descent to $Q(s_t, a_t)$ according to this target value. In practice, `x.detach()` returns a tensor detached from the computation graph, so no gradient is propagated through it.
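
The effect of `.detach()` on the gradient can be observed on a deliberately tiny example. The one-parameter "critic" below, with $Q(s_t, a_t) = w$ and $\max_a Q(s_{t+1}, a) = 3w$, is an assumption made purely for illustration.

```python
import torch

gamma = 0.9
reward = 1.0


def grad_of_loss(detach_target: bool) -> float:
    # One-parameter "critic": Q(s_t, a_t) = w * 1 and max_a Q(s_{t+1}, a) = w * 3
    w = torch.tensor(2.0, requires_grad=True)
    q_current = w * 1.0
    q_next = w * 3.0
    if detach_target:
        q_next = q_next.detach()  # treat the bootstrap value as a constant
    target = reward + gamma * q_next
    loss = (target - q_current) ** 2
    loss.backward()
    return w.grad.item()


grad_detached = grad_of_loss(detach_target=True)
grad_full = grad_of_loss(detach_target=False)
```

With the target detached, the gradient only reflects the dependence of $Q(s_t, a_t)$ on `w`. Without it, the target's own dependence on `w` changes both the magnitude and the sign of the gradient.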

The `must_bootstrap` tensor is used as a trick to deal with terminal states. If the state is terminal, $Q(s_{t+1}, a)$ does not make sense. Thus we need to ignore this term. So we multiply the term by `must_bootstrap`: if `must_bootstrap` is True (converted into a float, it becomes a 1), we get the term. If `must_bootstrap` is False (=0), we are at a terminal state, so we ignore the term. This trick is used in many RL libraries, e.g. SB3. In [this notebook](https://colab.research.google.com/drive/1erLbRKvdkdDy0Zn1X_JhC01s1QAt4BBj?usp=sharing) we explain how to compute `must_bootstrap` so as to properly deal with time limits. In this version we use full episodes, thus `must_bootstrap` will always be True for all steps but the last one.
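
The snippet below enumerates the possible combinations of `done` and `truncated` and shows how the mask removes the bootstrap term. The numerical values are arbitrary.

```python
import torch

# Four situations for the state reached at some time step
done = torch.tensor([False, False, True, True])
truncated = torch.tensor([False, False, False, True])

# Bootstrap unless the episode ended in a true terminal state
must_bootstrap = torch.logical_or(~done, truncated)

gamma = 0.99
reward = torch.tensor([1.0, 1.0, 1.0, 1.0])
max_q_next = torch.tensor([2.0, 2.0, 2.0, 2.0])
target = reward + gamma * max_q_next * must_bootstrap.int()
```

Only the third case (done and not truncated) yields a target equal to the reward alone. When the episode stops because of a time limit, the bootstrap term is kept.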

To compute $Q(s_t,a_t)$ we use the `torch.gather()` function. This function is a little tricky to use, see [this page](https://medium.com/analytics-vidhya/understanding-indexing-with-pytorch-gather-33717a84ebc4) for useful explanations.

In particular, the `qvals` output that we get covers all time steps, including the last one, for which there is no next state and thus no target. Hence the need for `qvals[:-1]` (we ignore the last time step).
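
A small example of `torch.gather()` on Q-values may help. The tensors below are made up, with three time steps and two actions, and mimic the shapes obtained with a single environment once the environment dimension has been squeezed out of the Q-values.

```python
import torch

# Q-values for 3 time steps and 2 actions, shape [3, 2]
q_values = torch.tensor([[1.0, 2.0],
                         [3.0, 4.0],
                         [5.0, 6.0]])
# Action taken at each time step, shape [3, 1]
action = torch.tensor([[1], [0], [1]])

# For each row i, pick q_values[i, action[i, 0]]
qvals = torch.gather(q_values, dim=1, index=action)
# Drop the last time step, which has no successor to build a target from
qvals_without_last = qvals[:-1]
```

Note that `index` must have the same number of dimensions as the input, which is why the actions are kept as a column of shape `[3, 1]`.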

Finally, we just need to compute the difference `target - qvals`, square it, take the mean and return it as the loss.

In [None]:
# Compute the temporal difference loss from a dataset to update a critic

def compute_critic_loss(cfg, reward, must_bootstrap, q_values, action):
    # print(q_values)

    # We compute the max of Q-values over all actions
    max_q = q_values.max(-1)
    max_q = max_q[0].detach()
    # print(max_q)
    # print("r:", reward)

    # To get the max of Q(s_{t+1}, a), we take max_q[1:]
    # The same about must_bootstrap. 
    target = (
        reward[:-1] + cfg.algorithm.discount_factor * max_q[1:] * must_bootstrap[1:].int()
    )
    # print("t:", target, target.shape)
    # print(action, action.shape)

    # To get Q(s,a), we use the torch.gather() function which needs a specific data preparation
    vals = q_values.squeeze()
    # print("v", vals, vals.shape)
    qvals = torch.gather(vals, dim=1, index=action)
    qvals = qvals[:-1]
    # print("qvals", qvals, qvals.shape)
    td = target - qvals
    # print(td, td.shape)
    # Compute critic loss
    td_error = td**2
    critic_loss = td_error.mean()
    # print(critic_loss)
    return critic_loss


### Setting up the optimizer

The optimizer is used to tune the parameters of the DQN agent.

In [None]:
# Configure the optimizer over the q agent
def setup_optimizer(cfg, q_agent):
    optimizer_args = get_arguments(cfg.optimizer)
    parameters = q_agent.parameters()
    optimizer = get_class(cfg.optimizer)(parameters, **optimizer_args)
    return optimizer

## Main training loop

Note that everything about the shared workspace between all the agents is completely hidden under the hood. This results in a gain of productivity, at the expense of having to dig into the BBRL code if you want to understand the details, change the multiprocessing model, etc.

### Agent execution

This is the tricky part with BBRL, the one we need to understand in detail. The difficulty lies in the copy of the last step and the way to deal with the n_steps return.

The call to `train_agent(train_workspace, t=0, stop_variable="env/done", stochastic=True)` makes the agent run in the workspace until the episode is done. In practice, it calls the [__call__()](https://github.com/osigaud/bbrl/blob/master/bbrl/agents/agent.py#L54) function, which makes a forward pass of the agent network using the workspace data and updates the workspace accordingly.

Now, if we start at the first epoch (`epoch=0`), we start from the first step (`t=0`). But when we subsequently perform the next epochs (`epoch>0`), we must not forget to cover the transition at the border between the previous epoch and the current epoch. To avoid this risk, we copy the information from the last time step of the previous epoch into the first time step of the next epoch. This is explained in more detail in [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5).

Note the `optimizer.zero_grad()`, `critic_loss.backward()` and `optimizer.step()` lines.

`optimizer.zero_grad()` is necessary to reset the gradients computed at the previous iterations, as PyTorch accumulates gradients across successive calls to `backward()`.
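
The role of `optimizer.zero_grad()` can be checked on a one-parameter example. This is a minimal sketch with a toy parameter `w` and a plain SGD optimizer, both chosen for illustration only.

```python
import torch

w = torch.tensor(1.0, requires_grad=True)
optimizer = torch.optim.SGD([w], lr=0.1)

# Two backward passes without resetting: the gradients add up
(2.0 * w).backward()
(2.0 * w).backward()
accumulated = w.grad.item()

# Resetting first gives the gradient of the current loss only
optimizer.zero_grad()
(2.0 * w).backward()
fresh = w.grad.item()

optimizer.step()  # w <- w - lr * grad
```

Without the reset, the second update would have used a gradient twice as large as the one of the current loss.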


In [None]:
def run_dqn(cfg):
    # 1) Build the logger
    logger = Logger(cfg)
    best_reward = -10e9

    # 2) Create the environment agent
    train_env_agent = NoAutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        1,
        cfg.algorithm.seed,
    )
    eval_env_agent = NoAutoResetGymAgent(
        get_class(cfg.gym_env),
        get_arguments(cfg.gym_env),
        cfg.algorithm.nb_evals,
        cfg.algorithm.seed,
    )

    # 3) Create the DQN-like Agent
    train_agent, eval_agent, q_agent = create_dqn_agent(
        cfg, train_env_agent, eval_env_agent
    )

    # Note that no parameter is needed to create the workspace.
    # In the training loop, calling the train_agent
    # will take the workspace as parameter

    # 4) Configure the optimizer
    optimizer = setup_optimizer(cfg, q_agent)
    nb_steps = 0
    tmp_steps = 0
    nb_measures = 0

    while nb_measures < cfg.algorithm.nb_measures:
        train_workspace = Workspace()  # Used for training
        train_agent(train_workspace, t=0, stop_variable="env/done", stochastic=True)

        q_values, done, truncated, reward, action = train_workspace[
            "q_values", "env/done", "env/truncated", "env/reward", "action"
        ]
        nb_steps += len(q_values)
        # Determines whether values of the critic should be propagated
        # True if the episode reached a time limit or if the task was not done
        # See https://colab.research.google.com/drive/1erLbRKvdkdDy0Zn1X_JhC01s1QAt4BBj
        must_bootstrap = torch.logical_or(~done, truncated)
        # Compute critic loss
        critic_loss = compute_critic_loss(cfg, reward, must_bootstrap, q_values, action)

        # Store the loss for tensorboard display
        logger.add_log("critic_loss", critic_loss, nb_steps)

        optimizer.zero_grad()
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(
            q_agent.parameters(), cfg.algorithm.max_grad_norm
        )
        optimizer.step()

        if nb_steps - tmp_steps > cfg.algorithm.eval_interval:
            nb_measures += 1
            tmp_steps = nb_steps
            eval_workspace = Workspace()  # Used for evaluation
            eval_agent(
                eval_workspace, t=0, stop_variable="env/done", choose_action=True
            )
            rewards = eval_workspace["env/cumulated_reward"][-1]
            mean = rewards.mean()
            logger.add_log("reward", mean, nb_steps)
            print(f"nb_steps: {nb_steps}, reward: {mean}")
            if cfg.save_best and mean > best_reward:
                best_reward = mean
                directory = "./dqn_critic/"
                if not os.path.exists(directory):
                    os.makedirs(directory)
                filename = directory + "dqn_" + str(mean.item()) + ".agt"
                eval_agent.save_model(filename)
                

## Definition of the parameters

The logger is defined as `bbrl.utils.logger.TFLogger` so as to use a tensorboard visualisation.

In [None]:
params={
  "save_best": False,
  "logger":{
    "classname": "bbrl.utils.logger.TFLogger",
    "log_dir": "./tmp/" + str(time.time()),
    "cache_size": 10000,
    "every_n_seconds": 10,
    "verbose": False,    
    },

  "algorithm":{
    "seed": 5,
    "max_grad_norm": 0.5,
    "epsilon": 0.02,
    "n_envs": 1,
    "n_steps": 100,
    "eval_interval": 500,
    "nb_measures": 200,
    "nb_evals": 10,
    "discount_factor": 0.99,
    "architecture":{"hidden_size": [256, 256]},
  },
  "gym_env":{
    "classname": "__main__.make_env",
    "env_name": "CartPole-v1",
  },
  "optimizer":
  {
    "classname": "torch.optim.Adam",
    "lr": 2e-3,
  }
}

### Launching tensorboard to visualize the results

In [None]:
%load_ext tensorboard
%tensorboard --logdir ./tmp
config=OmegaConf.create(params)
run_dqn(config)

The version used in this colab uses $< s_t, a_t, r_t, s_{t+1}>$ samples. As an exercise, you may switch to $< s_t, a_t, r_{t+1}, s_{t+1}>$ samples, going back to the standard SaLinA notation. For that, import from `bbrl.agents.gyma` instead of `gymb`, and change the temporal difference update rule (in `compute_critic_loss(...)`) accordingly. See [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5) for more explanations.

## Exercise

The goal of the exercise is to add a target network to get closer to the true DQN algorithm. For that, you need to carry out the following steps:
- import copy, to make copies of the Q-network
- in the `create_dqn_agent` function, initialize a target_critic as a copy of the initial critic using `copy.deepcopy(...)`. Then build the corresponding target_q_agent. The function should return this agent in addition to the previous ones.
- in the `compute_critic_loss` function, add `target_q_values` in the parameters, and make sure that the target value is computed based on these values rather than on Q-values.
- in the `run_dqn` function, after running the q_agent on the training workspace, run the target_q_agent on the same workspace. Then get the recorded q_values as target_q_values.
- each time the number of time steps has increased by more than `cfg.algorithm.target_critic_update`, copy the current q_agent into the target_q_agent again, using `copy.deepcopy(...)`
- add `cfg.algorithm.target_critic_update` to the parameters, for instance with a value of 5000 steps
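
As a hint about what `copy.deepcopy(...)` provides, the sketch below uses a small `nn.Linear` layer as a stand-in for the Q-network. It only illustrates the copy semantics and is not a solution to the exercise.

```python
import copy

import torch
import torch.nn as nn

critic = nn.Linear(4, 2)               # stand-in for the Q-network
target_critic = copy.deepcopy(critic)  # same weights, separate parameters

# Simulate a training update that touches the critic only
with torch.no_grad():
    critic.weight.add_(1.0)

still_frozen = not torch.equal(critic.weight, target_critic.weight)

# Periodic synchronization: copy the current critic again
target_critic = copy.deepcopy(critic)
in_sync = torch.equal(critic.weight, target_critic.weight)
```

Because the copy owns its parameters, updating the critic leaves the target untouched until the next explicit copy.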

## What's next?

To get a full DQN, we need to do the following:
- Add a replay buffer. We can add a replay buffer independently from the target network. The version with a replay buffer and no target network corresponds to [the NFQ algorithm](https://link.springer.com/content/pdf/10.1007/11564096_32.pdf).
- Before adding the replay buffer, we will first move to a version of DQN which uses the AutoResetGymAgent. For that, you need to first read the content of [this notebook](https://colab.research.google.com/drive/1W9Y-3fa6LsPeR6cBC1vgwBjKfgMwZvP5). Then you can move to [the notebook where we build a naked DQN using the AutoResetGymAgent](https://colab.research.google.com/drive/1H9_gkenmb_APnbygme1oEdhqMLSDc_bM).
- We should also add a few extra mechanisms which are present in the full DQN version: starting to learn once the replay buffer is full enough, decreasing the exploration rate epsilon...
- We could also add visualization tools to visualize the learned Q network, by using the `plot_critic` function available in [bbrl/visu/visu_critics.py](https://github.com/osigaud/bbrl/blob/96e58f6e01065f6a551039c4b9f7c1036b5523e6/bbrl/visu/visu_critics.py#L13)


We may also easily code DDQN, by just changing one line in the `compute_critic_loss` function.
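
As a hedged sketch of what that change amounts to, assuming a target critic is available as in the exercise above, the snippet below contrasts the value used in the DQN target with the one used in Double DQN. The tensors `q_next` and `target_q_next` are made-up values for two next states and two actions.

```python
import torch

q_next = torch.tensor([[1.0, 5.0], [4.0, 2.0]])         # current critic at s_{t+1}
target_q_next = torch.tensor([[3.0, 2.0], [1.0, 6.0]])  # target critic at s_{t+1}

# DQN with a target network: max over the target critic's values
dqn_value = target_q_next.max(dim=1)[0]

# Double DQN: the current critic selects the action,
# the target critic evaluates it
best_action = q_next.argmax(dim=1, keepdim=True)
ddqn_value = torch.gather(target_q_next, dim=1, index=best_action).squeeze(1)
```

Decoupling action selection from action evaluation is what reduces the over-estimation of Q-values that the plain max tends to produce.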