# Policy Gradients
<table align="left">
  <td>
    <a target="_blank" href="https://colab.research.google.com/github/matyama/deep-rl-hands-on/blob/main/11_policy_gradients.ipynb">
    <img src="https://www.tensorflow.org/images/colab_logo_32px.png" />
        Run in Google Colab
    </a>
  </td>
</table>

In [1]:
%%bash
!(stat -t /usr/local/lib/*/dist-packages/google/colab > /dev/null 2>&1) && exit

echo "Running on Google Colab, therefore installing dependencies..."
pip install "ptan>=0.7" pytorch-ignite

In [None]:
%load_ext tensorboard
%tensorboard --logdir runs

## Values and Policy
Unlike value iteration methods (Q-Learning), which try to estimate the state values (or state-action values), the *policy gradient* technique focuses directly on the policy $\pi(s)$.

Direct policy modeling has several advantages:
* From a certain point of view, we don't care that much about the expected discounted rewards but rather about the decision/action $\pi(s)$ to take in each state $s$
* As we saw earlier with the *Categorical DQN*, learning a distribution helps to better capture the underlying MDP (especially in stochastic environments)
* With value methods it becomes quite hard to determine the best action when the action space is large or even continuous. The DQN model of $Q(s, a)$ is highly non-linear and the optimization problem $a^* = \arg\max_a Q(s, a)$ can be hard to solve.

In the value iteration case our DQN parametrized the state-action values as $DQN(s) \to Q_\mathbf{w}(s, \cdot)$. Similarly, we will represent the policy as a probability distribution over actions $\pi_\mathbf{w}(s)$ parametrized by the NN.

*Modelling the output as action (class) probabilities is a typical technique in classification tasks that gives us a smooth representation (intuitively, changing NN weights $\mathbf{w}$ a bit changes $\pi$ a bit as well - compared to the case with discrete action labels which would change in steps).*
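
As a small illustration of this smoothness, the sketch below turns raw action scores into a probability distribution with softmax and samples an action from it. It is written in plain Python so it runs without PyTorch; the scores, the seed and the helper name `softmax` are made up for the example and are not taken from the notebook.

```python
import math
import random
from typing import List


def softmax(logits: List[float]) -> List[float]:
    """Convert raw action scores into probabilities that sum to 1."""
    # Subtract the maximum for numerical stability
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical scores for one state with three actions
logits = [2.0, 0.5, -1.0]
probs = softmax(logits)

# Sample an action from the policy pi(. | s)
rng = random.Random(0)
action = rng.choices(range(len(probs)), weights=probs, k=1)[0]

# Nudging one score slightly changes the probabilities only slightly
nudged = softmax([logits[0] + 0.01, logits[1], logits[2]])
max_change = max(abs(p - q) for p, q in zip(probs, nudged))
print(probs, action, max_change)
```

A hard label such as `argmax` over the same scores would not move at all under such a small nudge, and would then jump abruptly once the ordering of the scores changed.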

## Gradients of the Policy

*Policy Gradient* methods are closely related to the *Cross-Entropy Method* introduced earlier. The gradient gives the direction in which we want to change the NN weights to maximize the accumulated reward. Its scale is proportional to the state-action value $Q(s, a)$ and its direction is given by the gradient of the log-probability of the action taken:
$$
\nabla J \approx \mathbb{E}[Q(s, a) \nabla \log(\pi(a | s))]
$$
where the expectation means that we average the gradient over several steps.

Equivalently, we can say that we optimize the loss function $\mathcal{L} = -Q(s, a) \log(\pi(a | s))$ (Note: SGD minimizes the loss function but we want to maximize the objective $J$, hence the minus sign).
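
To see what minimizing this loss does, here is a minimal sketch for a single transition. It assumes the simplest possible policy, a softmax whose logits are the trainable parameters themselves (no neural network), so the gradient can be written by hand: for a softmax policy, $\partial \log \pi(a | s) / \partial z_j = \mathbb{1}[j = a] - \pi_j$. The function names, the learning rate and the numbers are illustrative only.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log of the softmax probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def pg_loss(logits: List[float], action: int, q: float) -> float:
    """Loss -Q(s, a) * log pi(a | s) for one transition."""
    return -q * log_softmax(logits)[action]


def pg_loss_grad(logits: List[float], action: int, q: float) -> List[float]:
    """Gradient of the loss above with respect to the logits."""
    probs = [math.exp(lp) for lp in log_softmax(logits)]
    return [q * (p - (1.0 if j == action else 0.0)) for j, p in enumerate(probs)]


# Two actions, uniform policy, action 1 was taken and had Q = 2
logits = [0.0, 0.0]
action, q = 1, 2.0
grad = pg_loss_grad(logits, action, q)

# One plain gradient descent step on the loss (illustrative step size)
lr = 0.1
new_logits = [x - lr * g for x, g in zip(logits, grad)]

old_prob = math.exp(log_softmax(logits)[action])
new_prob = math.exp(log_softmax(new_logits)[action])
print(grad, old_prob, new_prob)
```

With a positive $Q$ the step raises the probability of the action that was taken, and the size of the change scales with $Q$.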

Recall that in the *Cross-Entropy Method* we sampled the environment for a few episodes and trained only on transitions from the above-average ones. This corresponds to having $Q(s, a) = 1$ for the good transitions and $Q(s, a) = 0$ otherwise. In general, policy gradient methods differ in how they treat the $Q$ values, but in any case we want to use graded values $Q(s, a) \in [0, 1]$ instead of just the two extremes $0$ and $1$:
1. for a finer-grained separation of episodes
1. to incorporate the discount factor and thus the uncertainty about future rewards
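
The binary weighting can be sketched in a few lines. The episode returns below are made up, and `cem_weights` is a hypothetical helper that mirrors the "above-average" filter described above rather than any library function.

```python
from typing import List


def cem_weights(episode_returns: List[float]) -> List[float]:
    """Binary weights: 1 for above-average episodes, 0 for the rest."""
    mean_return = sum(episode_returns) / len(episode_returns)
    return [1.0 if r > mean_return else 0.0 for r in episode_returns]


# Made-up total rewards of four episodes (their mean is 26.25)
episode_returns = [12.0, 30.0, 18.0, 45.0]
binary = cem_weights(episode_returns)
print(binary)  # [0.0, 1.0, 0.0, 1.0]
```

The episodes with returns 30 and 45 end up with the same weight, which is exactly the information that graded $Q$ values keep.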

## The REINFORCE method
The outline of the *REINFORCE* method is as follows:
1. Initialize NN weights randomly
1. Play $N$ full episodes and collect experiences $(s, a, r, s')$
1. Compute actual $Q$ values for every played episode $k$ and step $t$: $Q_{k, t} = \sum_{i=0}^{T_k - t} \gamma^i r_{k, t+i}$, where $T_k$ is the last step of episode $k$
1. Compute the loss for all transitions: $\mathcal{L} = - \sum_{k, t} Q_{k, t} \log(\pi(a_{k, t} | s_{k, t}))$
1. Do one SGD step by minimizing the loss and update NN weights
1. Repeat from step 2. until convergence
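
Step 3 can be checked by hand on a tiny episode. The sketch below assumes three steps with a reward of 1 each and $\gamma = 0.9$, and evaluates the $Q$ value of each step directly as the discounted sum of the rewards collected from that step until the end of the episode. The function name is illustrative.

```python
from typing import List


def q_values_by_definition(rewards: List[float], gamma: float) -> List[float]:
    """Q_t = sum_i gamma^i * r_{t+i}, summed until the end of the episode."""
    n = len(rewards)
    return [
        sum(gamma ** i * rewards[t + i] for i in range(n - t))
        for t in range(n)
    ]


qs = q_values_by_definition([1.0, 1.0, 1.0], gamma=0.9)
print([round(q, 2) for q in qs])  # [2.71, 1.9, 1.0]
```

Earlier steps get larger values because more reward still lies ahead of them. Evaluating the sum this way costs quadratic time in the episode length; accumulating the rewards backwards from the last step gives the same numbers in a single pass.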

Properties of the REINFORCE method:
* We **don't need an explicit exploration policy** because we explore automatically using the policy our NN outputs.
* **On-policy** method: we can't train on data from old policies, so no ER buffer is needed. On the other hand, value methods typically converge faster (need fewer interactions with the environment).
* We train on actual Q values and not estimated ones, so we **don't need a target NN** to break the correlation in the Q value approximation either.

### CartPole REINFORCE

In [2]:
# flake8: noqa: E402,I001

from dataclasses import dataclass
from typing import Iterable, List, Tuple

import gym
import numpy as np
import ptan
import torch
import torch.nn as nn
from ptan.experience import ExperienceFirstLast
from tensorboardX import SummaryWriter


class PGN(nn.Module):
    """
    Policy Gradient Network that consumes states (observations)
    and outputs action logits (scores).
      - Note: Logits should be manually converted to log-probabilities
        with `log_softmax` for better numerical stability and optimization.
    """

    def __init__(self, input_shape: int, n_actions: int) -> None:
        super().__init__()

        # Simple, not really deep, forward network that outputs action logits
        self.net = nn.Sequential(
            nn.Linear(input_shape, 128),
            nn.ReLU(),
            nn.Linear(128, n_actions),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


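def compute_q_values(rewards: List[float], gamma: float) -> Iterable[float]:
    """
    Compute the discounted return (reward-to-go) for every step of an episode
    by accumulating the rewards backwards from the last step.
    """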
    qs = []
    sum_r = 0.0

    for r in reversed(rewards):
        sum_r *= gamma
        sum_r += r
        qs.append(sum_r)

    return reversed(qs)


def train_reinforce(
    env_name: str,
    gamma: float = 0.99,
    learning_rate: float = 0.01,
    n_played_episodes: int = 4,
    reward_bound: int = 195,
    log_period: int = 10,
) -> None:

    # Create the environment
    env = gym.make(env_name)

    # Create PG network
    net = PGN(
        input_shape=env.observation_space.shape[0],
        n_actions=env.action_space.n,
    )
    print(net)

    # Initialize an agent
    #  - Notice: We instruct it to apply softmax to the PGN output
    agent = ptan.agent.PolicyAgent(
        net,
        preprocessor=ptan.agent.float32_preprocessor,
        apply_softmax=True,
    )

    # Create experience source and optimizer

    exp_source = ptan.experience.ExperienceSourceFirstLast(
        env=env,
        agent=agent,
        gamma=gamma,
    )

    optimizer = torch.optim.Adam(net.parameters(), lr=learning_rate)

    with SummaryWriter(comment=f"-{env_name}-reinforce") as writer:

        done_episodes = 0
        batch_episodes = 0

        batch_states, batch_actions, batch_q_values = [], [], []

        episode_rewards = []
        total_rewards = []

        # Interact with the environment and consume experiences
        for i, exp in enumerate(exp_source):

            # Add the new experience to current batch
            batch_states.append(exp.state)
            batch_actions.append(int(exp.action))

            # Buffer immediate rewards during each episode
            episode_rewards.append(exp.reward)

            # Compute Q values from immediate rewards when episode ends
            if exp.last_state is None:
                batch_q_values += compute_q_values(episode_rewards, gamma)
                episode_rewards.clear()
                batch_episodes += 1

            # Handle new rewards
            new_rewards = exp_source.pop_total_rewards()
            if new_rewards:

                done_episodes += 1

                # Collect total rewards
                reward = new_rewards[0]
                total_rewards.append(reward)

                # Compute the mean reward over last 100 episodes
                mean_rewards = float(np.mean(total_rewards[-100:]))

                # Log training progress
                if done_episodes % log_period == 0:
                    print(
                        f"{i}: reward: {reward:6.2f}, "
                        f"mean_100: {mean_rewards:6.2f}, "
                        f"episodes: {done_episodes}"
                    )

                # Record metrics for TensorBoard
                writer.add_scalar("reward", reward, i)
                writer.add_scalar("reward_100", mean_rewards, i)
                writer.add_scalar("episodes", done_episodes, i)

                # Check if the learned policy is good enough
                if mean_rewards > reward_bound:
                    print(f"Solved in {i} steps and {done_episodes} episodes!")
                    break

            # Play N episodes to accumulate Q values before training step
            if batch_episodes < n_played_episodes:
                continue

            n_states = len(batch_states)

            # Reset gradients
            optimizer.zero_grad()

            # Convert batch parts to tensors
            #  - Note: States are stacked into a single array first because
            #    building a tensor from a list of arrays is slow
            states = torch.FloatTensor(np.array(batch_states))
            actions = torch.LongTensor(batch_actions)
            q_values = torch.FloatTensor(batch_q_values)

            # Compute action scores (logits)
            #  - Note: There's just a single forward pass through the PGN
            #    (DQN needs two)
            logits = net(states)

            # Compute the loss function defined in the previous section
            log_action_prob = nn.functional.log_softmax(logits, dim=1)
            exp_values = q_values * log_action_prob[range(n_states), actions]
            loss = -exp_values.mean()

            # Compute gradient of the loss function and make one SGD step
            loss.backward()
            optimizer.step()

            # Reset current batch
            batch_episodes = 0
            batch_states.clear()
            batch_actions.clear()
            batch_q_values.clear()


# Run REINFORCE to solve the CartPole environment
train_reinforce(env_name="CartPole-v0")

PGN(
  (net): Sequential(
    (0): Linear(in_features=4, out_features=128, bias=True)
    (1): ReLU()
    (2): Linear(in_features=128, out_features=2, bias=True)
  )
)
144: reward:  13.00, mean_100:  14.40, episodes: 10
319: reward:  17.00, mean_100:  15.95, episodes: 20
537: reward:  15.00, mean_100:  17.90, episodes: 30
887: reward:  55.00, mean_100:  22.18, episodes: 40
1349: reward:  51.00, mean_100:  26.98, episodes: 50
1815: reward:  41.00, mean_100:  30.25, episodes: 60
2509: reward:  97.00, mean_100:  35.84, episodes: 70
3107: reward:  38.00, mean_100:  38.84, episodes: 80
3545: reward:  38.00, mean_100:  39.39, episodes: 90
3918: reward:  26.00, mean_100:  39.18, episodes: 100
4184: reward:  37.00, mean_100:  40.40, episodes: 110
4483: reward:  28.00, mean_100:  41.64, episodes: 120
4776: reward:  24.00, mean_100:  42.39, episodes: 130
5088: reward:  46.00, mean_100:  42.01, episodes: 140
5535: reward:  53.00, mean_100:  41.86, episodes: 150
6082: reward:  64.00, mean_100:  42.67, episodes: 160
6763: reward:  52.00, mean_100:  42.54, episodes: 170
7421: reward:  56.00, mean_100:  43.14, episodes: 180
8046: reward:  34.00, mean_100:  45.01, episodes: 190
8739: reward:  88.00, mean_100:  48.21, episodes: 200
9418: reward:  30.00, mean_100:  52.3