 Copyright © Sorbonne University.

 This source code is licensed under the MIT license found in the LICENSE file
 in the root directory of this source tree.

# Outlook

In this notebook we code one version of the [Proximal Policy Optimization
(PPO)](https://arxiv.org/pdf/1707.06347.pdf) algorithms using BBRL. More
precisely, the version here is the one that clips the policy gradient.

The PPO algorithm is briefly explained in [this
video](https://www.youtube.com/watch?v=uRNL93jV2HE) and you can also read [the
corresponding slides](http://pages.isir.upmc.fr/~sigaud/teach/ps/10_ppo.pdf).

It is also a good idea to have a look at the [Spinning Up
documentation](https://spinningup.openai.com/en/latest/algorithms/ppo.html).

This version of PPO works, but it samples minibatches at random from the
rollouts, without making sure that each sample is used once and only once.
See
https://iclr-blog-track.github.io/2022/03/25/ppo-implementation-details/ for a
full description of all the coding tricks that should be integrated.
# Setting up the environment
We first need to set up the environment by installing the necessary Python and
system libraries.

In [None]:
try:
    from easypip import easyimport
except ModuleNotFoundError:
    from subprocess import run
    assert run(["pip", "install", "easypip"]).returncode == 0, "Could not install easypip"
    from easypip import easyimport

easyimport("swig")
easyimport("bbrl_utils").setup()

# [[imports]]

# Learning environment

## Configuration

The learning environment is controlled by a configuration that defines a few
important things, as described in the example below. This configuration can
hold as much extra information as you need; the example below is the minimal
one.

```python
params = {
    # Defines the path for logs and saved models
    "base_dir": "${gym_env.env_name}/myalgo_${current_time:}",

    # The Gymnasium environment
    "gym_env": {
        "env_name": "CartPoleContinuous-v1",
    },

    # Algorithm
    "algorithm": {
        # Seed used for the random number generator
        "seed": 1023,

        # Number of parallel training environments
        "n_envs": 8,
                
        # Minimum number of steps between two evaluations
        "eval_interval": 500,
        
        # Number of parallel evaluation environments
        "nb_evals": 10,

        # Number of epochs (loops)
        "max_epochs": 40000,

        # Number of steps (partial iteration)
        "n_steps": 100,
        
    },
}

# Creates the configuration object, i.e. cfg.algorithm.nb_evals is 10
cfg = OmegaConf.create(params)
```

## The RL algorithm

In this notebook, the RL algorithm is based on `EpisodicAlgo`, which defines
the algorithm environment when using episodes. To use such an environment, we
just need to subclass `EpisodicAlgo` and define two things, namely the
`train_policy` and the `eval_policy`. Both are BBRL agents that, given the
environment state, select the action to perform.

```py
class MyAlgo(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg)

        # Define the train and evaluation policies
        # (the agents compute the workspace `action` variable)
        self.train_policy = MyPolicyAgent(...)
        self.eval_policy = MyEvalAgent(...)

algo = MyAlgo(cfg)
```

The `EpisodicAlgo` defines useful objects:

- `algo.cfg` is the configuration
- `algo.nb_steps` (integer) is the number of steps since the training began
- `algo.logger` is a logger that can be used to collect statistics during training:
    - `algo.logger.add_log("critic_loss", critic_loss, algo.nb_steps)` registers the `critic_loss` value on tensorboard
- `algo.evaluate()` evaluates the current `eval_policy` if needed, and keeps the
  agent if it was the best so far (average cumulative reward)
- `algo.visualize_best()` runs the best agent on one episode, and displays the video



Besides, `iter_partial_episodes` allows iterating over partial episodes (with
`n_steps` from `n_envs` environments):

```python3
# with partial episodes
for workspace in algo.iter_partial_episodes():
    # workspace is a workspace containing 50 transitions
    # (with autoreset)
    ...
```

# Definition of PPO agents

## Critic agent

Like A2C, PPO uses a value function $V(s)$. We thus call upon the `VAgent`
class, which takes an observation as input and outputs the value of this
observation.

In [None]:
class VAgent(Agent):
    def __init__(self, state_dim, hidden_layers, name="critic"):
        super().__init__(name)
        self.is_q_function = False
        self.model = build_ortho_mlp(
            [state_dim] + list(hidden_layers) + [1], activation=nn.ReLU()
        )

    def forward(self, t, **kwargs):
        observation = self.get(("env/env_obs", t))
        critic = self.model(observation).squeeze(-1)
        self.set((f"{self.prefix}v_values", t), critic)
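
In the training loop later in this notebook, the critic is regressed towards a one-step bootstrapped target built from the reward and the value given by the previous critic, with no bootstrapping on terminated transitions. As a rough illustration on plain Python lists (the notebook itself works on torch tensors, and the helper names `td_targets` and `mse` are hypothetical):

```python
def td_targets(rewards, old_next_values, terminated, discount):
    """One-step targets r + discount * V_old(s') for non-terminated steps.

    Hypothetical helper for illustration only.
    """
    return [
        r + discount * v_next * (0.0 if done else 1.0)
        for r, v_next, done in zip(rewards, old_next_values, terminated)
    ]


def mse(predictions, targets):
    """Mean squared error between two lists of equal length."""
    return sum((p - t) ** 2 for p, t in zip(predictions, targets)) / len(targets)


targets = td_targets(
    rewards=[1.0, 1.0],
    old_next_values=[2.0, 3.0],
    terminated=[False, True],  # the second transition ends the episode
    discount=0.5,
)
critic_loss = mse([1.0, 1.0], targets)
```

Here the first target bootstraps on the next value while the second reduces to the reward alone, because the episode terminated.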

## The DiscretePolicy

The `DiscretePolicy` was already used in A2C to deal with discrete actions, but
we have added the possibility to compute only the log-probability of an action
already stored in the workspace, using the `predict_proba` argument of the
`forward()` function. The code is as follows.

In [None]:
class DiscretePolicy(Agent):
    def __init__(self, state_dim, hidden_size, n_actions, name="policy"):
        super().__init__(name=name)
        self.model = build_ortho_mlp(
            [state_dim] + list(hidden_size) + [n_actions], activation=nn.ReLU()
        )

    def dist(self, obs):
        scores = self.model(obs)
        probs = torch.softmax(scores, dim=-1)
        return torch.distributions.Categorical(probs)

    def forward(
        self,
        t,
        *,
        stochastic=True,
        predict_proba=False,
        compute_entropy=False,
        **kwargs,
    ):
        """
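        Compute the action at time step `t` from the observation stored in the
        workspace or, if `predict_proba` is set, the log-probability of the
        action already stored at `t`. Optionally compute the policy entropy.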
        """
        observation = self.get(("env/env_obs", t))
        scores = self.model(observation)
        probs = torch.softmax(scores, dim=-1)

        if predict_proba:
            action = self.get(("action", t))
            log_probs = probs[torch.arange(probs.size()[0]), action].log()
            self.set((f"{self.prefix}logprob_predict", t), log_probs)
        else:
            if stochastic:
                action = torch.distributions.Categorical(probs).sample()
            else:
                action = scores.argmax(1)
            self.set(("action", t), action)

        if compute_entropy:
            entropy = torch.distributions.Categorical(probs).entropy()
            self.set((f"{self.prefix}entropy", t), entropy)
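
To make the two quantities computed by `forward()` concrete, here is a minimal sketch in plain Python of the log-probability of a given action and of the entropy of a categorical distribution defined by a vector of scores. It mirrors the softmax, indexing and entropy steps above for a single observation; the function names are hypothetical and the notebook itself relies on torch.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def log_prob_and_entropy(scores, action):
    """Return log pi(action) and the entropy of the categorical policy.

    Hypothetical helper for illustration only.
    """
    probs = softmax(scores)
    log_prob = math.log(probs[action])
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return log_prob, entropy


# Two actions with equal scores: a uniform policy
log_prob, entropy = log_prob_and_entropy([0.0, 0.0], action=1)
```

With equal scores, each action has probability 0.5, so the log-probability is $\log 0.5$ and the entropy is $\log 2$, the maximum for two actions.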

## Main PPO agent

In the following, we create the PPO Agent, with one policy and one critic,
and their "delayed" versions (target network for the critic, and previous 
policy in the inner loop of the optimization).

In [None]:
class PPOClip(EpisodicAlgo):
    def __init__(self, cfg):
        super().__init__(cfg, autoreset=True)
        obs_size, act_size = self.train_env.get_obs_and_actions_sizes()

        self.train_policy = globals()[cfg.algorithm.policy_type](
            obs_size,
            cfg.algorithm.architecture.actor_hidden_size,
            act_size,
        ).with_prefix("current_policy/")

        self.eval_policy = KWAgentWrapper(
            self.train_policy, 
            stochastic=False,
            predict_proba=False,
            compute_entropy=False,
        )

        self.critic_agent = VAgent(
            obs_size, cfg.algorithm.architecture.critic_hidden_size
        ).with_prefix("critic/")
        self.old_critic_agent = copy.deepcopy(self.critic_agent).with_prefix("old_critic/")

        self.old_policy = copy.deepcopy(self.train_policy)
        self.old_policy.with_prefix("old_policy/")

        self.policy_optimizer = setup_optimizer(
            cfg.optimizer, self.train_policy
        )
        self.critic_optimizer = setup_optimizer(
            cfg.optimizer, self.critic_agent
        )

In the cell below, we optimize the policy loss for PPO-clip, i.e.

$$
L^{C L I P}(\theta)=\hat{\mathbb{E}}_t\left[\min \left(r_t(\theta) \hat{A}_t, \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]
$$
where $$r_t(\theta) = \frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)}$$

Useful torch functions:
- [torch.clamp](https://pytorch.org/docs/stable/generated/torch.clamp.html) computes $\min(\max(x_i, m_i), M_i)$ where $m_i$ and $M_i$ are the lower and upper bounds respectively
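
As a worked example of the formula, the sketch below evaluates the clipped objective on two transitions using plain Python floats. It is only an illustration under simplifying assumptions: the function name `clipped_surrogate` is hypothetical, and in the notebook the same arithmetic would be carried out on torch tensors with `torch.clamp`.

```python
import math


def clipped_surrogate(logp_new, logp_old, advantages, clip_range):
    """Average of min(r * A, clip(r, 1 - eps, 1 + eps) * A) over transitions.

    Hypothetical helper for illustration only.
    """
    terms = []
    for lp_new, lp_old, adv in zip(logp_new, logp_old, advantages):
        # Probability ratio r = pi_new(a|s) / pi_old(a|s), from log-probs
        ratio = math.exp(lp_new - lp_old)
        clipped_ratio = min(max(ratio, 1.0 - clip_range), 1.0 + clip_range)
        terms.append(min(ratio * adv, clipped_ratio * adv))
    return sum(terms) / len(terms)


objective = clipped_surrogate(
    logp_new=[math.log(0.6), math.log(0.2)],  # ratios of about 1.5 and 0.5
    logp_old=[math.log(0.4), math.log(0.4)],
    advantages=[1.0, -1.0],
    clip_range=0.2,
)
```

The first transition has a positive advantage and a ratio above $1+\epsilon$, so its term is capped at $1.2$. The second has a negative advantage and a ratio below $1-\epsilon$, so its term is $-0.8$ rather than $-0.5$. The objective is the mean of both terms, about $0.2$; since optimizers minimize, the loss is the negative of this objective.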

In [None]:
def run(ppo_clip: PPOClip):
    cfg = ppo_clip.cfg

    t_policy = TemporalAgent(ppo_clip.train_policy)
    t_old_policy = TemporalAgent(ppo_clip.old_policy)
    t_critic = TemporalAgent(ppo_clip.critic_agent)
    t_old_critic = TemporalAgent(ppo_clip.old_critic_agent)

    for train_workspace in iter_partial_episodes(
        ppo_clip, cfg.algorithm.n_steps
    ):
        # Run the current policy and evaluate the probability of its actions
        # according to the old policy. The old_policy can be run after the
        # train agent on the same workspace because it writes a
        # logprob_predict and not an action. That is, it does not choose a new
        # action: it only computes the probability, under the old policy, of
        # the action chosen by the current policy.

        with torch.no_grad():
            t_old_policy(
                train_workspace,
                t=0,
                n_steps=cfg.algorithm.n_steps,
                # Only computes the log-probability, under the old policy, of
                # the action stored in the workspace, to get the ratio of
                # probabilities
                predict_proba=True,
                compute_entropy=False,
            )

        # Compute the critic value over the whole workspace
        t_critic(train_workspace, t=0, n_steps=cfg.algorithm.n_steps)
        with torch.no_grad():
            t_old_critic(train_workspace, t=0, n_steps=cfg.algorithm.n_steps)

        ws_terminated, ws_reward, ws_v_value, ws_old_v_value = train_workspace[
            "env/terminated",
            "env/reward",
            "critic/v_values",
            "old_critic/v_values",
        ]

        # The critic values are clamped so that they do not move too far away
        # from the values of the previous critic
        if cfg.algorithm.clip_range_vf > 0:
            # Clip the difference between old and new values
            # NOTE: this depends on the reward scaling
            ws_v_value = ws_old_v_value + torch.clamp(
                ws_v_value - ws_old_v_value,
                -cfg.algorithm.clip_range_vf,
                cfg.algorithm.clip_range_vf,
            )

        # Compute the advantage using the (clamped) critic values
        with torch.no_grad():
            advantage = gae(
                ws_reward[1:],
                ws_v_value[1:],
                ~ws_terminated[1:],
                ws_v_value[:-1],
                cfg.algorithm.discount_factor,
                cfg.algorithm.gae,
            )

        ppo_clip.critic_optimizer.zero_grad()
        target = ws_reward[1:] + cfg.algorithm.discount_factor * ws_old_v_value[1:].detach() * (1 - ws_terminated[1:].int())
        critic_loss = torch.nn.functional.mse_loss(ws_v_value[:-1], target) * cfg.algorithm.critic_coef
        critic_loss.backward()
        torch.nn.utils.clip_grad_norm_(
            ppo_clip.critic_agent.parameters(), cfg.algorithm.max_grad_norm
        )
        ppo_clip.critic_optimizer.step()

        # We store the advantage into the transition_workspace
        if cfg.algorithm.normalize_advantage and advantage.shape[1] > 1:
            advantage = (advantage - advantage.mean()) / (advantage.std() + 1e-8)
        train_workspace.set_full("advantage", torch.cat(
            (advantage, torch.zeros(1, advantage.shape[1]))
        ))
        transition_workspace = train_workspace.get_transitions()

        # Inner optimization loop: we sample transitions and use them to learn
        # the policy
        for opt_epoch in range(cfg.algorithm.opt_epochs):
            if cfg.algorithm.batch_size > 0:
                sample_workspace = transition_workspace.select_batch_n(
                    cfg.algorithm.batch_size
                )
            else:
                sample_workspace = transition_workspace

            # Compute the policy loss:
            # 1. Compute the probability of the played actions according to
            #    the current policy. We do not replay the action: we use the
            #    one stored in the dataset, hence predict_proba=True. Note that
            #    the policy is not wrapped into a TemporalAgent; we use a
            #    single step.
            # 2. Compute the ratio of action probabilities.
            # 3. Compute the policy loss
            #    (using cfg.algorithm.clip_range and torch.clamp).
            # The code below expects `policy_loss`, `entropy` and
            # `policy_advantage` to be defined here.
            assert False, 'Not implemented yet'


            loss_policy = -cfg.algorithm.policy_coef * policy_loss

            # Entropy loss favors exploration. Note that the standard PPO
            # algorithms do not have an entropy term: they don't need it because
            # the KL term is supposed to deal with exploration. So, to run the
            # standard PPO algorithm, you should set
            # cfg.algorithm.entropy_coef=0
            assert len(entropy) == 1, f"{entropy.shape}"
            entropy_loss = entropy[0].mean()
            loss_entropy = -cfg.algorithm.entropy_coef * entropy_loss

            # Store the losses for tensorboard display
            ppo_clip.logger.log_losses(
                critic_loss, entropy_loss, policy_loss, ppo_clip.nb_steps
            )
            ppo_clip.logger.add_log(
                "advantage", policy_advantage[0].mean(), ppo_clip.nb_steps
            )

            loss = loss_policy + loss_entropy

            ppo_clip.policy_optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(
                ppo_clip.train_policy.parameters(), cfg.algorithm.max_grad_norm
            )
            ppo_clip.policy_optimizer.step()

        # Copy parameters
        copy_parameters(ppo_clip.train_policy, ppo_clip.old_policy)
        copy_parameters(ppo_clip.critic_agent, ppo_clip.old_critic_agent)

        # Evaluates our current algorithm if needed
        ppo_clip.evaluate()
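
The `gae` call in the training loop turns rewards and critic values into advantages. As a rough illustration, the sketch below implements the textbook Generalized Advantage Estimation recursion for a single environment on plain Python lists. The function name `gae_single_env` is hypothetical, and the actual BBRL `gae` function works on tensors covering all environments and may differ in details, for instance in how it handles autoreset.

```python
def gae_single_env(rewards, values, next_values, not_terminated, discount, lam):
    """Textbook GAE: A_t = delta_t + discount * lam * A_{t+1}.

    delta_t = r_t + discount * V(s_{t+1}) - V(s_t), with no bootstrapping
    (and no propagation of later advantages) across a terminated transition.
    Hypothetical helper for illustration only.
    """
    advantages = [0.0] * len(rewards)
    running = 0.0
    # Accumulate from the last transition back to the first
    for t in reversed(range(len(rewards))):
        mask = 1.0 if not_terminated[t] else 0.0
        delta = rewards[t] + discount * next_values[t] * mask - values[t]
        running = delta + discount * lam * mask * running
        advantages[t] = running
    return advantages


advantages = gae_single_env(
    rewards=[1.0, 1.0],
    values=[0.5, 0.5],
    next_values=[0.5, 0.0],
    not_terminated=[True, False],  # the episode ends after the second step
    discount=1.0,
    lam=1.0,
)
```

With `discount=1.0` and `lam=1.0`, the estimate reduces to the Monte Carlo return minus the value: the return from the first state is 2 and its value is 0.5, which gives an advantage of 1.5, and the second state gets 1 - 0.5 = 0.5.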

# Definition of the parameters

In [None]:
params = {
    "base_dir": "${gym_env.env_name}/ppo-clip-S${algorithm.seed}_${current_time:}",
    "save_best": False,
    "logger": {
        "classname": "bbrl.utils.logger.TFLogger",
        "cache_size": 10000,
        "every_n_seconds": 10,
        "verbose": False,
    },
    "algorithm": {
        "seed": 12,
        "max_grad_norm": 0.5,
        "n_envs": 8,
        "n_steps": 32,
        "eval_interval": 1000,
        "nb_evals": 10,
        "gae": 0.8,
        "discount_factor": 0.98,
        "normalize_advantage": False,
        "max_epochs": 5_000,
        "opt_epochs": 10,
        "batch_size": 256,
        "clip_range": 0.2,
        "clip_range_vf": 0,
        "entropy_coef": 2e-7,
        "policy_coef": 1,
        "critic_coef": 1.0,
        "policy_type": "DiscretePolicy",
        "architecture": {
            "actor_hidden_size": [64, 64],
            "critic_hidden_size": [64, 64],
        },
    },
    "gym_env": {
        "env_name": "CartPole-v1",
    },
    "optimizer": {
        "classname": "torch.optim.AdamW",
        "lr": 1e-3,
        "eps": 1e-5,
    },
}

## Launching tensorboard to visualize the results

In [None]:
# the terminal

setup_tensorboard("./outputs/tblogs")

In [None]:
ppo_clip = PPOClip(OmegaConf.create(params))
run(ppo_clip)

In [None]:
ppo_clip.visualize_best()