In [1]:
import gymnasium as gym
import numpy as np
from itertools import count
import matplotlib.pyplot as plt
import random
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.multiprocessing as mp
from tqdm import tqdm
import os
os.chdir('..')

# Deep deterministic policy gradient (DDPG)

[Deterministic policy gradient paper](https://proceedings.mlr.press/v32/silver14.pdf)

[Deep deterministic policy gradient paper](https://arxiv.org/pdf/1509.02971.pdf)

- **Only applicable to *continuous* action spaces**
- In continuous action spaces, greedy policy improvement via $\underset{a'}{\mathrm{argmax}}\;Q(s',a'; \theta)$ is impractical, because it requires a global maximisation over actions at every step
- Alternatively, we can use a deterministic policy function to approximate the best action and move the policy in the direction of the gradient of $Q$:
  - The critic, $Q(s,a;\theta^Q)$, is learned using the Bellman equation, as in Q-learning/DQN
  - The actor, $\mu(s;\theta^\mu)$, is updated by applying the chain rule to the expected return $J$ with respect to the actor parameters. This follows from the *Deterministic Policy Gradient Theorem*, [proven by Silver et al.](https://proceedings.mlr.press/v32/silver14.pdf), where $\rho^\beta$ is the state distribution under the behaviour policy $\beta$:
    - $$\nabla_{\theta^\mu}J\approx \mathbb{E}_{s\sim \rho^\beta}\left[\nabla_{\theta^\mu} Q(s,\mu(s;\theta^\mu);\theta^Q)\right]$$
    - $$=\mathbb{E}_{s\sim \rho^\beta}\left[\nabla_{a}Q(s,a;\theta^Q)\big|_{a=\mu(s;\theta^\mu)}\, \nabla_{\theta^\mu}\mu(s;\theta^\mu)\right]$$
- DDPG uses many of the same stabilizing techniques as DQN, including:
  - Experience replay
  - Target networks (for both actor and critic)
- DDPG typically learns off-policy: experience is collected with an exploration policy $\mu'$ that adds noise sampled from a noise process $\mathcal{N}$ to the actor's output
  - $$\mu'(s_t)=\mu(s_t;\theta_t^\mu) + \mathcal{N}$$
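
The exploration policy can be sketched in a few lines. This is a minimal illustration, not the notebook's implementation: the DDPG paper used an Ornstein-Uhlenbeck process for $\mathcal{N}$, whereas this sketch swaps in uncorrelated Gaussian noise for brevity. The helper name `explore`, the noise scale `sigma`, and the action bounds are all assumptions made for the example.

```python
import numpy as np

def explore(action, sigma, low, high, rng):
    """Return mu(s) + N with N ~ Normal(0, sigma^2), clipped to the action bounds.

    `action` stands in for the deterministic actor output mu(s_t; theta^mu).
    Gaussian noise is an assumption here; any noise process could be used.
    """
    noise = rng.normal(0.0, sigma, size=np.shape(action))
    return np.clip(action + noise, low, high)

rng = np.random.default_rng(0)
greedy = np.array([0.9, -0.2])  # hypothetical actor output for one state
noisy = explore(greedy, sigma=0.1, low=-1.0, high=1.0, rng=rng)
```

Clipping keeps the perturbed action inside the environment's bounds. Setting `sigma=0` recovers the deterministic policy, which is what one would typically use at evaluation time.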

![ddpg](ddpg.png)
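
The critic and actor updates described above can be sketched as a single training step. This is a hedged illustration under stated assumptions rather than a definitive implementation: the network sizes, learning rates, `gamma`, `tau`, the helper names `q` and `ddpg_update`, and the random minibatch (standing in for a replay-buffer sample) are all chosen for the example. The soft target update follows the scheme used in the DDPG paper.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
state_dim, action_dim, gamma, tau = 3, 1, 0.99, 0.005  # illustrative values

actor = nn.Sequential(nn.Linear(state_dim, 32), nn.ReLU(),
                      nn.Linear(32, action_dim), nn.Tanh())
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 32), nn.ReLU(),
                       nn.Linear(32, 1))
actor_target, critic_target = copy.deepcopy(actor), copy.deepcopy(critic)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)

def q(net, s, a):
    """Evaluate a critic network on concatenated (state, action) pairs."""
    return net(torch.cat([s, a], dim=1))

def ddpg_update(s, a, r, s2, done):
    # Critic: regress Q(s, a) onto the Bellman target built from target networks.
    with torch.no_grad():
        y = r + gamma * (1.0 - done) * q(critic_target, s2, actor_target(s2))
    critic_loss = F.mse_loss(q(critic, s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: ascend Q(s, mu(s)); autograd applies the chain rule
    # grad_a Q * grad_theta mu. Critic gradients produced here are discarded
    # by critic_opt.zero_grad() on the next call.
    actor_loss = -q(critic, s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft target update: theta_target <- (1 - tau) * theta_target + tau * theta
    with torch.no_grad():
        for net, target in ((actor, actor_target), (critic, critic_target)):
            for p, p_t in zip(net.parameters(), target.parameters()):
                p_t.mul_(1.0 - tau).add_(tau * p)
    return critic_loss.item(), actor_loss.item()

# Random minibatch standing in for a replay-buffer sample
batch = 16
s, a = torch.randn(batch, state_dim), torch.rand(batch, action_dim) * 2 - 1
r, s2 = torch.randn(batch, 1), torch.randn(batch, state_dim)
done = torch.zeros(batch, 1)
critic_loss, actor_loss = ddpg_update(s, a, r, s2, done)
```

Minimising `-Q(s, mu(s))` is how the deterministic policy gradient is usually expressed in an autograd framework: only `actor_opt` steps on that loss, so the critic's weights are left unchanged by the actor update.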