# Basic Knowledge of Reinforcement Learning

Written by [Junkun Yuan](https://junkunyuan.github.io/) (yuanjk0921@outlook.com).

See paper reading list and notes [here](https://junkunyuan.github.io/paper_reading_list/paper_reading_list.html).

Last updated on Sep 7, 2025; &nbsp; First committed on Sep 6, 2025.

**References**

- [**Hands-on RL**](https://github.com/boyu-ai/Hands-on-RL/blob/main/%E7%AC%AC2%E7%AB%A0-%E5%A4%9A%E8%87%82%E8%80%81%E8%99%8E%E6%9C%BA%E9%97%AE%E9%A2%98.ipynb)

**Contents**
- RL Fundamentals
- Temporal Difference & Q-Learning & Deep Q-Network (DQN)
- Policy Gradient Method
- Actor-Critic Algorithm
- Trust Region Policy Optimization (TRPO)
- Proximal Policy Optimization (PPO)
- Deep Deterministic Policy Gradient (DDPG)

## RL Fundamentals

This section lists the fundamental concepts that the rest of these notes rely on.

The following concepts can be learned from the Multi-Armed Bandit problem, [here](https://github.com/junkunyuan/junkunyuan.github.io/blob/main/paper_reading_list/resource/jupyters/Multi-Armed%20Bandit.ipynb):

- **Agent**
- **Action**
- **State**
- **Reward**
- **Return**
- **Exploration vs. Exploitation**

The following concepts can be learned from the Markov Decision Process notebook, [here](https://github.com/junkunyuan/junkunyuan.github.io/blob/main/paper_reading_list/resource/jupyters/Markov%20Decision%20Process.ipynb):

- **Markov Process**
- **Markov Reward Process**
- **Markov Decision Process**
- **Policy**
- **Value Function**
- **Q-function**
- **Monte-Carlo Method**

## Temporal Difference & Q-Learning & Deep Q-Network (DQN)


**Motivation.** Dynamic programming requires a fully known Markov decision process (transition probabilities and reward function), a requirement that is rarely met in real-world scenarios.

[**Model-free reinforcement learning**](https://en.wikipedia.org/wiki/Model-free_(reinforcement_learning)) does not estimate the *transition probability distribution* and the *reward function* associated with the Markov decision process.

[**Temporal Difference (TD)**](https://en.wikipedia.org/wiki/Temporal_difference_learning) is a type of model-free reinforcement learning method which learns by bootstrapping from the current estimate of the *value function*.

TD resembles the Monte-Carlo method in that it learns from sampled data, but it updates at every time step rather than after a whole episode.

TD resembles dynamic programming in that it uses the Bellman equation, but it is model-free.

Given state value $V(s_t)$ at time step $t$ and return (cumulative reward) $G_t$, we have:

\begin{equation}
\begin{aligned}
V_{\pi}(s)
=&\mathbb{E}_{\pi}(G_t|S_t=s) \\
=&\mathbb{E}_{\pi}[\sum_{k=0}^{\infty}\gamma^k R_{t+k}|S_t=s] \\
=&\mathbb{E}_{\pi}[R_t + \gamma\sum_{k=0}^{\infty}\gamma^k R_{t+k+1}|S_t=s] \\
=&\mathbb{E}_{\pi}[R_t + \gamma V_{\pi}(S_{t+1})|S_t
=s]
\end{aligned}
\end{equation}

Monte-Carlo uses the first equation to learn, so it needs a whole episode to update.

TD uses the last equation to learn, replacing the expectation with a single sampled transition and step size $\alpha$:

$$
V(s_t) \leftarrow V(s_t) + \alpha[r_t+\gamma V(s_{t+1})-V(s_t)].
$$
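The update above can be sketched for a tabular value function as follows. This is a minimal illustration: the function name `td0_update`, the dictionary-based table, and the `done` flag (which zeroes the bootstrap term at terminal states) are assumptions, not part of the original notes.

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one TD(0) update to the value table V in place and return the TD error."""
    # Assumption: the value of a terminal next state is taken to be zero.
    target = r + (0.0 if done else gamma * V[s_next])
    td_error = target - V[s]
    V[s] += alpha * td_error
    return td_error


# Hypothetical two-state example.
V = {"A": 0.0, "B": 1.0}
delta = td0_update(V, "A", r=0.5, s_next="B")
print(delta, V["A"])
```

The TD error here is $0.5 + 0.9\times 1.0 - 0 = 1.4$, so $V(A)$ moves from $0$ to about $0.14$.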



[**State–action–reward–state–action (Sarsa)**](https://en.wikipedia.org/wiki/State%E2%80%93action%E2%80%93reward%E2%80%93state%E2%80%93action) employs TD to evaluate action-value function $Q$:

$$
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[r_t+\gamma Q(s_{t+1},a_{t+1})-Q(s_t,a_t)].
$$

- Initialize $Q(s,a)$
- for $e=1\rightarrow E$ do
    - Initialize state $s$
    - Select action $a$ by $\epsilon$-greedy
    - for time step $t=1\rightarrow T$ do
        - Obtain feedback $r$, $s'$
        - Select action $a'$ by $\epsilon$-greedy
        - Update Q-function by $Q(s,a) \leftarrow Q(s,a) + \alpha[r+\gamma Q(s',a')-Q(s,a)]$
        - $s\leftarrow s'$, $a\leftarrow a'$
    - end for
- end for
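The Sarsa loop above can be sketched in plain Python. The 5-state chain environment, the `step` function, and all hyper-parameter values are hypothetical choices made only to keep the example self-contained.

```python
import random

N_STATES, ACTIONS = 5, (0, 1)  # hypothetical chain: action 0 = left, 1 = right


def step(s, a):
    """Toy environment: reward 1 on reaching the right end, which is terminal."""
    s_next = max(0, s - 1) if a == 0 else s + 1
    done = s_next == N_STATES - 1
    return (1.0 if done else 0.0), s_next, done


def epsilon_greedy(Q, s, eps, rng):
    """Pick a random action with probability eps, otherwise a greedy one."""
    if rng.random() < eps:
        return rng.choice(ACTIONS)
    best = max(Q[s])
    return rng.choice([a for a in ACTIONS if Q[s][a] == best])  # random tie-break


def sarsa(episodes=500, alpha=0.1, gamma=0.9, eps=0.1, max_steps=100, seed=0):
    rng = random.Random(seed)
    Q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        s = 0
        a = epsilon_greedy(Q, s, eps, rng)
        for _ in range(max_steps):
            r, s_next, done = step(s, a)
            a_next = epsilon_greedy(Q, s_next, eps, rng)
            # On-policy target: bootstrap from the action actually selected next.
            target = r + (0.0 if done else gamma * Q[s_next][a_next])
            Q[s][a] += alpha * (target - Q[s][a])
            s, a = s_next, a_next
            if done:
                break
    return Q


Q = sarsa()
```

Because the next action $a'$ is drawn from the same $\epsilon$-greedy policy that is being evaluated, the learned values reflect the exploring policy rather than the purely greedy one.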

**Multi-step Sarsa** borrows the multi-step return of Monte-Carlo and bootstraps only after $n$ steps:

$$
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[r_t+\gamma r_{t+1}+...+\gamma^{n-1}r_{t+n-1}+\gamma^n Q(s_{t+n},a_{t+n})-Q(s_t,a_t)].
$$
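As a small illustration of the multi-step target, the sketch below sums $n$ discounted rewards and then bootstraps from a given $Q(s_{t+n},a_{t+n})$. The function name `n_step_target` and its arguments are assumptions made for this example.

```python
def n_step_target(rewards, q_bootstrap, gamma=0.9):
    """Return r_t + gamma*r_{t+1} + ... + gamma^(n-1)*r_{t+n-1} + gamma^n * Q(s_{t+n}, a_{t+n}).

    rewards: the n observed rewards [r_t, ..., r_{t+n-1}].
    q_bootstrap: the current estimate Q(s_{t+n}, a_{t+n}).
    """
    n = len(rewards)
    discounted = sum(gamma**k * r for k, r in enumerate(rewards))
    return discounted + gamma**n * q_bootstrap


# Two-step example: 1 + 0.5 * 1 + 0.25 * 10 = 4.0
target = n_step_target([1.0, 1.0], q_bootstrap=10.0, gamma=0.5)
```

With a single reward the function reduces to the one-step Sarsa target $r_t+\gamma Q(s_{t+1},a_{t+1})$.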

[**Q-Learning**](https://en.wikipedia.org/wiki/Q-learning) updates the Q-function by:

$$
Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[r_t+\gamma \max_{a}Q(s_{t+1},a)-Q(s_t,a_t)].
$$
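A minimal sketch of this update for a tabular Q-function stored as a list of lists. The function name `q_learning_update` and the `done` flag are assumptions for illustration.

```python
def q_learning_update(Q, s, a, r, s_next, alpha=0.1, gamma=0.9, done=False):
    """Apply one Q-Learning update in place and return the TD error."""
    # The target takes the maximum over next actions, regardless of which
    # action the data-collecting policy will actually take in s_next.
    best_next = 0.0 if done else max(Q[s_next])
    td_error = r + gamma * best_next - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error


# Hypothetical table with two states and two actions.
Q = [[0.0, 0.0], [1.0, 3.0]]
delta = q_learning_update(Q, s=0, a=1, r=1.0, s_next=1)
```

Here the target is $1 + 0.9\times\max(1, 3) = 3.7$. The only difference from Sarsa is the $\max$ in the target, which replaces the value of the sampled next action.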

**Behavior policy** is the policy to collect data. 

**Target policy** is the policy to be updated using the data.

**On-policy** means the behavior policy and target policy are the same policy; otherwise, it is **off-policy**.

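Off-policy methods can learn from historical data collected by other policies, so they are typically more sample-efficient and more widely used.

Tabular Q-Learning stores one value per state-action pair, so it can only be applied to discrete states and actions.

[**Deep Q-Networks (DQN)**](https://en.wikipedia.org/wiki/Q-learning#Deep_Q-learning) handle continuous state spaces by estimating the action-value function with function approximation; the actions are still discrete, since the update takes a maximum over them.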

The model used for fitting is called **Q-Network**. It is optimized by

$$
\omega^*=\arg\min_{\omega}\frac{1}{2N}\sum_{i=1}^{N}[Q_{\omega}(s_i,a_i)-(r_i+\gamma\max_{a'}Q_{\omega}(s_i',a'))]^2.
$$

**Experience replay** saves the collected transitions, i.e., (state, action, reward, next state), in a pool and randomly samples from the pool to train the Q-network. It serves two purposes: (1) random sampling breaks the correlation between consecutive transitions, so the training data is closer to I.I.D.; (2) reusing transitions increases data efficiency.
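An experience replay pool can be sketched with the standard library as below. The class name `ReplayBuffer`, the fixed capacity, and the extra `done` flag stored with each transition are assumptions for illustration.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size pool of transitions with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted first
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform sampling without replacement decorrelates consecutive steps.
        batch = self.rng.sample(list(self.buffer), batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return states, actions, rewards, next_states, dones

    def __len__(self):
        return len(self.buffer)


buffer = ReplayBuffer(capacity=3, seed=0)
for i in range(5):
    buffer.add(state=i, action=0, reward=1.0, next_state=i + 1, done=False)
states, actions, rewards, next_states, dones = buffer.sample(2)
```

With a capacity of 3, only the transitions starting from states 2, 3, and 4 remain in the pool after five insertions.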

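**Target network** $Q_{\omega^-}$ is employed to stabilize training: the TD target is computed with $Q_{\omega^-}$, whose parameters are only periodically synchronized with $\omega$.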

The training process is:

- Initialize $Q_{\omega}(s,a)$, and copy it to initialize $Q_{\omega^-}$
- Initialize experience replay pool $R$
- for $e=1\rightarrow E$ do
    - Initialize state $s_1$
    - for $t=1\rightarrow T$ do
        - Select an action $a_t$ from $Q_{\omega}(s_t,\cdot)$ by $\epsilon$-greedy
        - Get reward $r_t$ and next state $s_{t+1}$
        - Save $(s_t, a_t, r_t, s_{t+1})$ to $R$
        - Sample $N$ transitions $\{(s_i, a_i, r_i, s_i')\}_{i=1, ..., N}$ from $R$
        - Get target $y_i=r_i+\gamma\max_{a}Q_{\omega^-}(s_i',a)$
        - Calculate the loss $\frac{1}{2N}\sum_{i=1}^{N}(y_i-Q_{\omega}(s_i,a_i))^2$ and optimize $Q_{\omega}$
        - Update the target network $Q_{\omega^-}$ (e.g., copy $\omega$ to $\omega^-$ every fixed number of steps)
    - end for
- end for
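The target and loss computation inside the loop can be sketched without any deep learning library. Here `target_q` is a hypothetical callable standing in for the target network: it maps a state to a list of action values. The `dones` flags are also an assumption; they disable bootstrapping on terminal transitions.

```python
def dqn_targets(rewards, next_states, dones, target_q, gamma=0.99):
    """Compute y_i = r_i + gamma * max_a Q_target(s'_i, a) for a sampled batch."""
    return [
        r if done else r + gamma * max(target_q(s_next))
        for r, s_next, done in zip(rewards, next_states, dones)
    ]


def half_mse(predictions, targets):
    """Loss 1/(2N) * sum_i (Q(s_i, a_i) - y_i)^2."""
    n = len(targets)
    return sum((q - y) ** 2 for q, y in zip(predictions, targets)) / (2 * n)


def toy_target_q(state):
    """Hypothetical stand-in for the target network, with two actions."""
    return [state, 2.0 * state]


y = dqn_targets([1.0, 0.0], [1.0, 2.0], [False, True], toy_target_q, gamma=0.5)
loss = half_mse([1.0, 1.0], y)
```

In a real implementation the targets are treated as constants, and only the online network $Q_{\omega}$ receives gradients from the loss.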

## Policy Gradient Method

[**Policy gradient methods**](https://en.wikipedia.org/wiki/Policy_gradient_method) directly learn a policy that selects actions without consulting a value function.

The learning objective is to maximize the expected return from the initial state: $J(\theta)=\mathbb{E}_{s_0}[V^{\pi_{\theta}}(s_0)]$.

Given state visitation distribution $\nu^{\pi}$, we have the gradient

\begin{equation}
\begin{aligned}
\nabla_{\theta}J(\theta)\propto&\sum_{s\in\mathcal{S}}\nu^{\pi_{\theta}}(s)\sum_{a\in\mathcal{A}}Q^{\pi_{\theta}}(s,a)\nabla_{\theta}\pi_{\theta}(a|s) \\
=&\sum_{s\in\mathcal{S}}\nu^{\pi_{\theta}}(s)\sum_{a\in\mathcal{A}}\pi_{\theta}(a|s)Q^{\pi_{\theta}}(s,a)\frac{\nabla_{\theta}\pi_{\theta}(a|s)}{\pi_{\theta}(a|s)} \\
=&\mathbb{E}_{\pi_{\theta}}[Q^{\pi_{\theta}}(s,a)\nabla_{\theta}(\log\pi_{\theta}(a|s))].
\end{aligned}
\end{equation}

**REINFORCE** estimates this gradient by replacing $Q^{\pi_{\theta}}(s,a)$ with Monte-Carlo returns sampled from finite trajectories:

$$
\nabla_{\theta}J(\theta)=\mathbb{E}_{\pi_{\theta}}[\sum_{t=0}^{T}(\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'})\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)].
$$

The process to implement REINFORCE:

- Initialize $\theta$
- for $e=1\rightarrow E$ do
    - Sample data using $\pi_{\theta}$: $\{s_1, a_1, r_1, s_2, a_2, r_2, ..., s_T, a_T, r_T\}$
    - Calculate the return $\psi_t=\sum_{t'=t}^{T}\gamma^{t'-t}r_{t'}$ for every time step $t$
    - Update $\theta$ by $\theta=\theta+\alpha\sum_{t=1}^{T}\psi_{t}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)$
- end for
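The two inner steps above (computing $\psi_t$ and updating $\theta$) can be sketched as follows. To stay self-contained, the example assumes a state-independent softmax policy whose parameters $\theta$ are the action logits; the names `discounted_returns` and `reinforce_update` are hypothetical.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """psi_t = sum_{t' >= t} gamma^(t' - t) * r_{t'}, computed backwards in O(T)."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def reinforce_update(theta, actions, rewards, alpha=0.1, gamma=0.99):
    """One REINFORCE step for a softmax policy with logits theta (assumed state-independent)."""
    psi = discounted_returns(rewards, gamma)
    probs = softmax(theta)
    grad = [0.0] * len(theta)
    for a_t, psi_t in zip(actions, psi):
        for i in range(len(theta)):
            # d/d theta_i of log softmax(theta)[a_t] = 1[i == a_t] - probs[i]
            grad[i] += psi_t * ((1.0 if i == a_t else 0.0) - probs[i])
    return [th + alpha * g for th, g in zip(theta, grad)]


returns = discounted_returns([1.0, 1.0, 1.0], gamma=0.5)
new_theta = reinforce_update([0.0, 0.0], actions=[0], rewards=[1.0], alpha=1.0)
```

In the last call, action 0 received a positive return, so its logit increases and the other logit decreases by the same amount.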

## Actor-Critic Algorithm

[**Actor-Critic algorithm**](https://en.wikipedia.org/wiki/Actor-critic_algorithm) improves training stability of policy-based method by combining policy-based and value-based methods.

The **actor** determines which actions to take according to a policy function, and the **critic** evaluates those actions according to a value function.

The actor is updated by the policy gradient method. The critic is trained by minimizing the squared TD error, treating the TD target $r_t+\gamma V_{\omega}(s_{t+1})$ as a constant:

$$
\mathcal{L}_{\omega}=\frac{1}{2}(r_t+\gamma V_{\omega}(s_{t+1})-V_{\omega}(s_t))^2.
$$

The optimization process is:

- Initialize $\theta$ and $\omega$
- for $e=1\rightarrow E$ do
    - Sample data using $\pi_{\theta}$: $s_1,a_1,r_1,s_2,a_2,r_2,...$
    - Calculate $\delta_t=r_t+\gamma V_{\omega}(s_{t+1})-V_{\omega}(s_t)$
    - Update value $\omega=\omega+\alpha_{\omega}\sum_{t}\delta_t\nabla_{\omega}V_{\omega}(s_t)$
    - Update policy $\theta=\theta+\alpha_{\theta}\sum_{t}\delta_{t}\nabla_{\theta}\log\pi_{\theta}(a_t|s_t)$
- end for
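One iteration of the loop above can be sketched with a tabular critic and a per-state softmax actor. Both parameterizations, the function name `actor_critic_step`, and the `done` flag are assumptions chosen to keep the example dependency-free.

```python
import math


def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def actor_critic_step(V, logits, transitions, alpha_v=0.1, alpha_pi=0.1, gamma=0.99):
    """Update a tabular critic V[s] and softmax actor logits[s] from sampled transitions.

    transitions: list of (s, a, r, s_next, done) tuples collected with the current policy.
    """
    v_grad = {s: 0.0 for s in V}
    pi_grad = {s: [0.0] * len(l) for s, l in logits.items()}
    for s, a, r, s_next, done in transitions:
        delta = r + (0.0 if done else gamma * V[s_next]) - V[s]
        v_grad[s] += delta  # tabular critic: the gradient of V[s] w.r.t. its entry is 1
        probs = softmax(logits[s])
        for i, p in enumerate(probs):
            # delta * grad of log pi(a|s) w.r.t. logit i
            pi_grad[s][i] += delta * ((1.0 if i == a else 0.0) - p)
    for s in V:
        V[s] += alpha_v * v_grad[s]
    for s in logits:
        logits[s] = [l + alpha_pi * g for l, g in zip(logits[s], pi_grad[s])]
    return V, logits


V = {0: 0.0, 1: 0.0}
logits = {0: [0.0, 0.0], 1: [0.0, 0.0]}
V, logits = actor_critic_step(V, logits, [(0, 1, 1.0, 1, True)], alpha_v=0.5, alpha_pi=0.5)
```

All gradients are accumulated with the parameters from before the update, which matches the summed updates in the listing.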

## Trust Region Policy Optimization (TRPO)

**TRPO** improves policies by taking the largest possible update step while enforcing a trust-region constraint to ensure stable and monotonic performance improvement.

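Given the current policy $\pi_{\theta}$, suppose we look for new parameters $\theta'$ such that $J(\theta')\ge J(\theta)$. Since the initial state distribution does not depend on the policy, $J(\theta)$ can be written as an expectation over trajectories of $\pi_{\theta'}$: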

\begin{equation}
\begin{aligned}
J(\theta)
=&\mathbb{E}_{s_0}[V^{\pi_{\theta}}(s_0)] \\
=&\mathbb{E}_{\pi_{\theta'}}[\sum_{t=0}^{\infty}\gamma^t V^{\pi_{\theta}}(s_t) - \sum_{t=1}^{\infty}\gamma^t V^{\pi_{\theta}}(s_t)] \\
=&-\mathbb{E}_{\pi_{\theta'}}[\sum_{t=0}^{\infty}\gamma^t (\gamma V^{\pi_{\theta}}(s_{t+1}) - V^{\pi_{\theta}}(s_t))].
\end{aligned}
\end{equation}

Define the temporal-difference residual as the **advantage function** $A$ and the state visitation distribution as $\nu^{\pi}(s)=(1-\gamma)\sum_{t=0}^{\infty}\gamma^{t}P_{t}^{\pi}(s)$. The performance difference is then:

\begin{equation}
\begin{aligned}
J(\theta') - J(\theta)
=&\mathbb{E}_{s_0}[V^{\pi_{\theta'}}(s_0)]-\mathbb{E}_{s_0}[V^{\pi_{\theta}}(s_0)] \\
=&\mathbb{E}_{\pi_{\theta'}}[\sum_{t=0}^{\infty}\gamma^t r(s_t, a_t)] + \mathbb{E}_{\pi_{\theta'}}[\sum_{t=0}^{\infty}\gamma^t(\gamma V^{\pi_{\theta}}(s_{t+1})-V^{\pi_{\theta}}(s_t))] \\
=&\mathbb{E}_{\pi_{\theta'}}[\sum_{t=0}^{\infty}\gamma^t[r(s_t,a_t)+\gamma V^{\pi_{\theta}}(s_{t+1})-V^{\pi_{\theta}}(s_t)]] \\
=&\mathbb{E}_{\pi_{\theta'}}[\sum_{t=0}^{\infty}\gamma^t A^{\pi_{\theta}}(s_t,a_t)] \\
=&\sum_{t=0}^{\infty}\gamma^{t}\mathbb{E}_{s_t\sim P_t^{\pi_{\theta'}}}\mathbb{E}_{a_t\sim\pi_{\theta'}(\cdot|s_t)}[A^{\pi_{\theta}}(s_t, a_t)] \\
=&\frac{1}{1-\gamma}\mathbb{E}_{s\sim\nu^{\pi_{\theta'}}}\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}[A^{\pi_{\theta}}(s,a)]
\end{aligned}
\end{equation}

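Thus, if the new policy satisfies $\mathbb{E}_{s\sim\nu^{\pi_{\theta'}}}\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}[A^{\pi_{\theta}}(s,a)]\ge 0$, then the policy update is guaranteed to improve performance monotonically, i.e., $J(\theta')\ge J(\theta)$.

However, this expectation requires states sampled from the unknown new policy $\pi_{\theta'}$. TRPO therefore approximates the state visitation distribution $\nu^{\pi_{\theta'}}$ with that of the current policy, $\nu^{\pi_{\theta}}$, which is reasonable when the two policies are close: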

$$
L_{\theta}(\theta')=J(\theta)+\frac{1}{1-\gamma}\mathbb{E}_{s\sim\nu^{\pi_{\theta}}}\mathbb{E}_{a\sim\pi_{\theta'}(\cdot|s)}[A^{\pi_{\theta}}(s,a)].
$$

We apply importance sampling so that actions are also sampled from the current policy $\pi_{\theta}$:

$$
L_{\theta}(\theta')=J(\theta)+\frac{1}{1-\gamma}\mathbb{E}_{s\sim\nu^{\pi_{\theta}}}\mathbb{E}_{a\sim\pi_{\theta}(\cdot|s)}[\frac{\pi_{\theta'}(a|s)}{\pi_{\theta}(a|s)}A^{\pi_{\theta}}(s,a)].
$$

TRPO then constrains the policy update with the KL divergence, where $\theta_k$ denotes the current parameters $\theta$ at iteration $k$:

\begin{equation}
\begin{aligned}
&\max_{\theta'}L_{\theta_k}(\theta') \\
&\mathrm{s.t.}\ \ \mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}[D_{\mathrm{KL}}(\pi_{\theta_k}(\cdot|s),\pi_{\theta'}(\cdot|s))] \le \delta
\end{aligned}
\end{equation}

We use **Taylor expansion** around $\theta_k$ to approximate the objective (to first order) and the constraint (to second order):

$$
\mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}\mathbb{E}_{a\sim\pi_{\theta_k}(\cdot|s)}[\frac{\pi_{\theta'}(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)]\approx g^{T}(\theta'-\theta_k),
$$

$$
\mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}[D_{\mathrm{KL}}(\pi_{\theta_k}(\cdot|s),\pi_{\theta'}(\cdot|s))]\approx\frac{1}{2}(\theta'-\theta_k)^T H(\theta'-\theta_k),
$$

where $g=\nabla_{\theta'}\mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}\mathbb{E}_{a\sim\pi_{\theta_k}(\cdot|s)}[\frac{\pi_{\theta'}(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)]$ is the gradient of the objective, and $H=\mathbf{H}[\mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}[D_{\mathrm{KL}}(\pi_{\theta_k}(\cdot|s),\pi_{\theta'}(\cdot|s))]]$ is the Hessian matrix of the average KL divergence, both evaluated at $\theta'=\theta_k$.

Then, our objective is:

$$
\theta_{k+1}=\arg\max_{\theta'} g^T(\theta'-\theta_k) \ \ \mathrm{s.t.} \ \ \frac{1}{2}(\theta'-\theta_k)^T H(\theta'-\theta_k)\le \delta.
$$

We can use the Karush-Kuhn-Tucker (KKT) conditions to solve it:

$$
\theta_{k+1}=\theta_k+\sqrt{\frac{2\delta}{g^T H^{-1}g}}H^{-1}g.
$$

TRPO uses the **conjugate gradient method** to solve $Hx=g$ for $x=H^{-1}g$ without explicitly forming or inverting the Hessian matrix, which gives

$$
\theta_{k+1}=\theta_{k}+\sqrt{\frac{2\delta}{x^{T}Hx}}x.
$$
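The step above can be sketched in plain Python. The solver only needs a Hessian-vector product function; here `hvp` is built from a small explicit matrix purely for illustration, and the names `conjugate_gradient` and `trpo_step` are hypothetical.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def mat_vec(H, v):
    return [dot(row, v) for row in H]


def conjugate_gradient(hvp, g, iters=10, tol=1e-10):
    """Approximately solve H x = g using only Hessian-vector products hvp(v) = H v."""
    x = [0.0] * len(g)
    r = list(g)  # residual g - H x, starting from x = 0
    p = list(g)  # search direction
    rr = dot(r, r)
    for _ in range(iters):
        if rr < tol:
            break
        Hp = hvp(p)
        alpha = rr / dot(p, Hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hpi for ri, hpi in zip(r, Hp)]
        rr_new = dot(r, r)
        p = [ri + (rr_new / rr) * pi for ri, pi in zip(r, p)]
        rr = rr_new
    return x


def trpo_step(hvp, g, delta):
    """Return theta_{k+1} - theta_k = sqrt(2 * delta / (x^T H x)) * x with H x = g."""
    x = conjugate_gradient(hvp, g)
    scale = (2.0 * delta / dot(x, hvp(x))) ** 0.5
    return [scale * xi for xi in x]


# Hypothetical positive-definite H and gradient g; the exact solution is x = [1, 1].
H = [[2.0, 0.0], [0.0, 1.0]]
g = [2.0, 1.0]
update = trpo_step(lambda v: mat_vec(H, v), g, delta=0.01)
```

The resulting update lies exactly on the boundary of the quadratic constraint, i.e., $\frac{1}{2}(\theta'-\theta_k)^T H(\theta'-\theta_k)=\delta$ up to floating-point error.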

A common way to estimate the advantage is **Generalized Advantage Estimation (GAE)**. 

First, we have:

\begin{equation}
\begin{aligned}
&A_{t}^{(1)} = \delta_t = -V(s_t)+r_t+\gamma V(s_{t+1}) \\
&A_{t}^{(2)} = \delta_t + \gamma \delta_{t+1} = -V(s_t) + r_t + \gamma r_{t+1} + \gamma^2 V(s_{t+2}) \\
&... \\
&A_{t}^{(k)}=\sum_{l=0}^{k-1}\gamma^{l}\delta_{t+l}=-V(s_t)+r_t+\gamma r_{t+1}+...+\gamma^{k-1}r_{t+k-1}+\gamma^k V(s_{t+k}).
\end{aligned}
\end{equation}

Given a hyper-parameter $\lambda\in[0,1]$, GAE takes an exponentially weighted average of these multi-step estimates:

\begin{equation}
\begin{aligned}
A_{t}^{GAE}
=&(1-\lambda)(A_{t}^{(1)} + \lambda A_{t}^{(2)} + \lambda^2 A_{t}^{(3)} + ...) \\
=&(1-\lambda)(\delta_t + \lambda(\delta_t + \gamma \delta_{t+1}) + \lambda^2(\delta_t + \gamma\delta_{t+1} + \gamma^2\delta_{t+2}) + ...) \\
=&(1-\lambda)(\delta_t(1+\lambda+\lambda^2+...)+\gamma\delta_{t+1}(\lambda+\lambda^2+\lambda^3+...) + \gamma^2\delta_{t+2}(\lambda^2+\lambda^3+\lambda^4+...)+...) \\
=&(1-\lambda)(\delta_{t}\frac{1}{1-\lambda}+\gamma\delta_{t+1}\frac{\lambda}{1-\lambda}+\gamma^2\delta_{t+2}\frac{\lambda^2}{1-\lambda}+...) \\
=&\sum_{l=0}^{\infty}(\gamma\lambda)^l \delta_{t+l}.
\end{aligned}
\end{equation}
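The final sum can be computed with a backward recursion, $A_t^{GAE}=\delta_t+\gamma\lambda A_{t+1}^{GAE}$. The sketch below assumes a finite trajectory, so the infinite sum is truncated at the last step; the function name `compute_gae` and the convention that `values` has one extra entry for the final state are assumptions.

```python
def compute_gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation over one finite trajectory.

    rewards: [r_0, ..., r_{T-1}]
    values:  [V(s_0), ..., V(s_T)], one entry longer than rewards
             (use 0.0 for the last entry if s_T is terminal).
    """
    advantages = [0.0] * len(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]  # TD error delta_t
        gae = delta + gamma * lam * gae
        advantages[t] = gae
    return advantages


adv = compute_gae([1.0, 1.0], [0.0, 0.0, 0.0], gamma=0.5, lam=0.5)
```

In this example both TD errors equal 1, so the last advantage is $1$ and the first is $1+0.25\times 1=1.25$. Setting $\lambda=0$ recovers the one-step TD error $A_t^{(1)}$.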



## Proximal Policy Optimization (PPO)

[**Proximal Policy Optimization (PPO)**](https://en.wikipedia.org/wiki/Proximal_policy_optimization) simplifies the calculation of TRPO.

**PPO-Penalty** moves the KL constraint into the objective as a penalty term weighted by a Lagrange multiplier $\beta$:

$$
\arg\max_{\theta}\mathbb{E}_{s\sim\nu^{\pi_{\theta_{k}}}}\mathbb{E}_{a\sim\pi_{\theta_k}(\cdot|s)}[\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a)-\beta D_{\mathrm{KL}}[\pi_{\theta_k}(\cdot|s), \pi_{\theta}(\cdot|s)]].
$$

Let $d_k=D_{\mathrm{KL}}^{\nu^{\pi_{\theta_k}}}(\pi_{\theta_k},\pi_{\theta})$ be the average KL divergence after the update at iteration $k$. The coefficient $\beta$ is then adapted by:
- If $d_k < \delta / 1.5$, then $\beta_{k+1}=\beta_k /2$
- If $d_k > \delta \times 1.5$, then $\beta_{k+1}=\beta_k\times 2$
- Otherwise, $\beta_{k+1}=\beta_k$
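The adaptation rule for the penalty coefficient is simple enough to write directly. The function name `update_beta` is an assumption for this sketch.

```python
def update_beta(beta, d_k, delta):
    """Adapt the KL penalty coefficient from the measured KL divergence d_k."""
    if d_k < delta / 1.5:
        return beta / 2.0  # policy moved too little: weaken the penalty
    if d_k > delta * 1.5:
        return beta * 2.0  # policy moved too much: strengthen the penalty
    return beta


beta = 1.0
beta = update_beta(beta, d_k=0.001, delta=0.01)  # KL well below the target
```

The factors 1.5 and 2 are the ones used in the rule above; the measured divergence is only compared against thresholds, so the rule is not very sensitive to their exact values.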

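**PPO-Clip** is the more widely used variant. It removes the KL term and instead clips the probability ratio to the range $[1-\epsilon, 1+\epsilon]$: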

$$
\arg\max_{\theta}\mathbb{E}_{s\sim\nu^{\pi_{\theta_k}}}\mathbb{E}_{a\sim\pi_{\theta_k}(\cdot|s)}[\min(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}A^{\pi_{\theta_k}}(s,a),\mathrm{clip}(\frac{\pi_{\theta}(a|s)}{\pi_{\theta_k}(a|s)}, 1-\epsilon, 1+\epsilon)A^{\pi_{\theta_k}}(s,a))]
$$
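A sample-based estimate of the clipped objective can be sketched as below, given precomputed probability ratios $\pi_{\theta}(a|s)/\pi_{\theta_k}(a|s)$ and advantage estimates. The function name `ppo_clip_objective` and the batch-mean estimate of the expectation are assumptions.

```python
def ppo_clip_objective(ratios, advantages, eps=0.2):
    """Mean of min(ratio * A, clip(ratio, 1 - eps, 1 + eps) * A) over a batch."""
    terms = []
    for ratio, adv in zip(ratios, advantages):
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        terms.append(min(ratio * adv, clipped * adv))
    return sum(terms) / len(terms)


# First sample: positive advantage, ratio 1.5 is capped at 1.2.
# Second sample: negative advantage, ratio 0.5 is raised to 0.8, giving the lower value -0.8.
objective = ppo_clip_objective([1.5, 0.5], [1.0, -1.0], eps=0.2)
```

In both samples the clipped term is selected, so the objective no longer rewards moving the ratio further away from 1 in the direction that the advantage favors.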

## Deep Deterministic Policy Gradient (DDPG)

TRPO and PPO are on-policy algorithms, which have low sample efficiency.

DDPG is an off-policy actor-critic method that learns a deterministic policy $a=\mu_{\theta}(s)$. It relies on the **deterministic policy gradient theorem**:

$$
\nabla_{\theta}J(\pi_{\theta})=\mathbb{E}_{s\sim\nu^{\pi_{\beta}}}[\nabla_{\theta}\mu_{\theta}(s)\nabla_{a}Q_{\omega}^{\mu}(s,a)|_{a=\mu_{\theta}(s)}],
$$

where $\pi_{\beta}$ is the behavior policy used to collect the data.
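The chain rule in this theorem can be illustrated with a one-dimensional toy problem. Everything here is hypothetical: the linear policy $\mu_{\theta}(s)=\theta s$, the fixed analytic critic $Q(s,a)=-(a-2s)^2$, and the handful of states standing in for samples from the behavior policy.

```python
def dpg_gradient(theta, states, dq_da):
    """Sample estimate of E_s[grad_theta mu_theta(s) * grad_a Q(s, a) at a = mu_theta(s)].

    Assumes the linear policy mu_theta(s) = theta * s, so grad_theta mu_theta(s) = s.
    """
    total = 0.0
    for s in states:
        a = theta * s  # deterministic action
        total += s * dq_da(s, a)
    return total / len(states)


def dq_da(s, a):
    """Action gradient of the hypothetical critic Q(s, a) = -(a - 2 s)^2."""
    return -2.0 * (a - 2.0 * s)


theta = 0.0
states = [0.5, 1.0, 1.5]  # stand-in for states visited by the behavior policy
for _ in range(200):
    theta += 0.1 * dpg_gradient(theta, states, dq_da)  # gradient ascent on J
```

Since this critic is maximized by $a=2s$, gradient ascent drives $\theta$ toward 2. In DDPG the critic is a learned network $Q_{\omega}$ and the action gradient is obtained by backpropagation rather than in closed form.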