# State values and Bellman equation

```{note}
This section introduces a core concept and an important tool. The core concept
is the state value, which is defined as the expected discounted return that an agent obtains if
it starts from a given state and follows a given policy. The greater the state value is, the better the corresponding
policy is. The tool for analyzing state values is the
Bellman equation, which describes the relationships between the values of all states.
```

## State values

Starting from time step $t$ and following a policy $\pi$, we can obtain a state-action-reward trajectory:

$$S_{t}\overset{A_{t}}{\longrightarrow}S_{t+1},R_{t+1} \overset{A_{t+1}}{\longrightarrow}S_{t+2},R_{t+2} \overset{A_{t+2}}{\longrightarrow}S_{t+3},R_{t+3}\dots$$

By definition, the discounted return along the trajectory is

$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \dots$$

where $\gamma\in(0,1)$ is the discount rate. Note that $G_{t}$ is a random variable since $R_{t+1},R_{t+2},\dots$
are all random variables. Since $G_{t}$ is a random variable, we can calculate its expected value:

$$v_{\pi}(s) := \mathbb{E}[G_{t}|S_{t}=s]$$

Here, $v_{\pi}(s)$ is called the state-value function or simply the state value of $s$. Some important
remarks are given below.

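* $v_{\pi}(s)$ depends on $s$, because it is a conditional expectation given that the agent starts from $S_{t}=s$.

* $v_{\pi}(s)$ depends on $\pi$, because the trajectories, and hence the returns, are generated by following $\pi$.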

* $v_{\pi}(s)$ does not depend on $t$: if the model and the policy are stationary, the value of a state is the same at every time step.
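
To make the definition concrete, the sketch below computes a discounted return for a finite reward sequence and then estimates a state value by averaging sampled returns. The one-state process with coin-flip rewards, the truncation horizon, and the names `discounted_return`, `v_estimate`, and `v_exact` are illustrative assumptions, not part of the text above.

```python
import numpy as np


def discounted_return(rewards, gamma):
    """Return R_{t+1} + gamma * R_{t+2} + ... for a finite list of rewards."""
    g = 0.0
    # Accumulate backwards using G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


gamma = 0.9
rng = np.random.default_rng(0)

# Assumed toy process: a single state in which every step yields
# a reward of 0 or 1 with equal probability.
horizon, episodes = 200, 2000
samples = rng.integers(0, 2, size=(episodes, horizon))

# The state value is the expectation of the return, so we average
# the (truncated) discounted returns of many sampled trajectories.
returns = [discounted_return(list(episode), gamma) for episode in samples]
v_estimate = float(np.mean(returns))

# Exact value for this toy process: E[R] / (1 - gamma).
v_exact = 0.5 / (1 - gamma)
print(v_estimate, v_exact)
```

For this toy process the exact value is $0.5/(1-\gamma)=5$, and the sample average should land close to it. Truncating each trajectory at 200 steps changes a return by at most $\gamma^{200}/(1-\gamma)$, which is negligible here.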

## Bellman equation

We now introduce the Bellman equation, a set of linear equations that describe the
relationships between the values of all the states. Note that the state value can be written as

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[G_{t}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1}|S_{t}=s] + \gamma\mathbb{E}[G_{t+1}|S_{t}=s]
\end{aligned}
$$

The first term can be calculated by using the law of total expectation:

$$
\begin{aligned}
\mathbb{E}[R_{t+1}|S_{t}=s] &= \sum_{a\in\mathcal{A}}\pi(a|s)\mathbb{E}[R_{t+1}|S_{t}=s,A_{t}=a]\\
&= \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r
\end{aligned}
$$

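The second term can be calculated as follows, where the second equality uses the Markov property and $p(s'|s)=\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s)$: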

$$
\begin{aligned}
\mathbb{E}[G_{t+1}|S_{t}=s] &= \sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_{t}=s,S_{t+1}=s']p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_{t+1}=s']p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}v_{\pi}(s')p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}v_{\pi}(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s)
\end{aligned}
$$

This leads to:

$$
\begin{aligned}
v_{\pi}(s) &=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}v_{\pi}(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s)\\
&=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\right]
\end{aligned}
$$

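The Bellman equation can be written in matrix form, where $[r_{\pi}]_{s} = \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r$ and $[P_{\pi}]_{ss'} = \sum_{a\in\mathcal{A}}\pi(a|s)p(s'|s,a)$: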

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$
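
Because the matrix form is a linear system, it can be solved directly as $v_{\pi} = (I-\gamma P_{\pi})^{-1}r_{\pi}$, or iteratively by repeating $v_{k+1} = r_{\pi} + \gamma P_{\pi}v_{k}$. The following is a minimal sketch of both approaches. The 3-state, 2-action model, the uniform policy, and the array names `P`, `R`, and `pi` are assumptions made for illustration only.

```python
import numpy as np

gamma = 0.9

# Hypothetical model (an assumption for illustration).
# P[a, s, t] = p(t | s, a); every row sums to 1.
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
# R[s, a] = sum_r p(r | s, a) * r, the expected immediate reward.
R = np.array([[0.0, -1.0], [0.0, -1.0], [1.0, 1.0]])
# pi[s, a] = pi(a | s); here a uniformly random policy.
pi = np.full((3, 2), 0.5)

# Policy-averaged reward vector and transition matrix.
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,ast->st", pi, P)

# Closed-form solution of v = r_pi + gamma * P_pi v.
v_closed = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# Iterative solution: repeatedly apply the right-hand side.
v_iter = np.zeros(3)
for _ in range(1000):
    v_iter = r_pi + gamma * P_pi @ v_iter

print(v_closed)
print(v_iter)
```

In this assumed model, state 2 is absorbing with reward 1, so its value is $1/(1-\gamma)=10$ under any policy, which gives a quick sanity check on both solutions.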

In terms of the action values $q_{\pi}(s,a) := \mathbb{E}[G_{t}|S_{t}=s,A_{t}=a]$, the Bellman equation reads:

$$
\begin{aligned}
q_{\pi}(s, a) &= \sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\\
&= \sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\sum_{a'\in\mathcal{A}}\pi(a'|s')q_{\pi}(s',a')
\end{aligned}
$$
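
The relationship between state values and action values can be checked numerically. The sketch below computes $q_{\pi}$ from $v_{\pi}$ using the first line of the equation above, then recovers $v_{\pi}(s)=\sum_{a}\pi(a|s)q_{\pi}(s,a)$ and verifies the second line. The model, the policy, and all array names are hypothetical and chosen only for illustration.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state, 2-action model (an assumption for illustration).
# P[a, s, t] = p(t | s, a), R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, -1.0], [0.0, -1.0], [1.0, 1.0]])
pi = np.full((3, 2), 0.5)  # pi[s, a] = pi(a | s)

# State values from the matrix-form Bellman equation.
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,ast->st", pi, P)
v = np.linalg.solve(np.eye(3) - gamma * P_pi, r_pi)

# First line: q(s, a) = R[s, a] + gamma * sum_t p(t | s, a) v(t).
q = R + gamma * np.einsum("ast,t->sa", P, v)

# Averaging q over the policy recovers the state values.
v_from_q = (pi * q).sum(axis=1)

# Second line: the same q expressed through the next-step action values.
q_rhs = R + gamma * np.einsum("ast,t->sa", P, v_from_q)

print(q)
```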

## Bellman optimality equation

While the ultimate goal of reinforcement learning is to obtain optimal policies, it is necessary to first define what an optimal policy is. The definition is based on the state values.

A policy $\pi^{\ast}$ is optimal if $v_{\pi^{\ast}}(s) \ge v_{\pi}(s)$ for all $s\in\mathcal{S}$ and for every policy $\pi$.

* Does an optimal policy exist?
* Is it unique?
* If it exists, how can we obtain it?

The Bellman optimality equation is:

$$
\begin{aligned}
v(s) &= \underset{\pi\in\Pi}{\max} \sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v(s')\right]\\
&= \underset{\pi\in\Pi}{\max} \sum_{a\in\mathcal{A}}\pi(a|s)q(s, a)
\end{aligned}
$$

In matrix form:

$$v=\underset{\pi\in\Pi}{\max}(r_{\pi} + \gamma P_{\pi}v) = f(v)$$

We can show that $f$ is a contraction mapping, that is, $\left \| f(v_{1}) - f(v_{2}) \right \|_{\infty} \le \gamma\left \| v_{1} - v_{2} \right \|_{\infty}$ for any $v_{1}$ and $v_{2}$. By the contraction mapping theorem, we conclude that the Bellman optimality equation has a unique solution.
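
The contraction property can be illustrated numerically. Since $\sum_{a}\pi(a|s)q(s,a)$ is a weighted average of the action values, its maximum over policies is attained by putting all probability on a maximizing action, so $f(v)$ can be evaluated as a maximum over actions. The sketch below checks the contraction inequality on random vectors and then iterates $v_{k+1}=f(v_{k})$ to approximate the fixed point. The model and the name `bellman_optimality_operator` are assumptions for illustration.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state, 2-action model (an assumption for illustration).
# P[a, s, t] = p(t | s, a), R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, -1.0], [0.0, -1.0], [1.0, 1.0]])


def bellman_optimality_operator(v):
    """f(v)[s] = max_a ( R[s, a] + gamma * sum_t p(t | s, a) v[t] )."""
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return q.max(axis=1)


# Check ||f(v1) - f(v2)||_inf <= gamma * ||v1 - v2||_inf on random pairs.
rng = np.random.default_rng(0)
ratios = []
for _ in range(100):
    v1, v2 = rng.normal(size=3), rng.normal(size=3)
    lhs = np.max(np.abs(bellman_optimality_operator(v1)
                        - bellman_optimality_operator(v2)))
    rhs = np.max(np.abs(v1 - v2))
    ratios.append(lhs / rhs)
max_ratio = max(ratios)

# Fixed-point iteration v_{k+1} = f(v_k), starting from zero.
v_star = np.zeros(3)
for _ in range(1000):
    v_star = bellman_optimality_operator(v_star)
residual = np.max(np.abs(bellman_optimality_operator(v_star) - v_star))

print(max_ratio, residual)
print(v_star)
```

The random check is only evidence, not a proof, but the observed ratio should never exceed $\gamma$, and the fixed-point iteration should converge from any starting vector.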

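Next, we assert that the solution $v^{\ast}$ of the Bellman optimality equation is the optimal state value, and that the corresponding greedy policy, defined with $q^{\ast}(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v^{\ast}(s')$ as

$$\pi^{\ast}(a|s)=\begin{cases}1, & a=\underset{a'\in\mathcal{A}}{\text{argmax}}\ q^{\ast}(s,a')\\ 0, & \text{otherwise}\end{cases}$$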

is an optimal policy.

**Proof**: For any policy $\pi$, it holds that

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$

Since

$$v^{\ast} = \underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v^{\ast}) = r_{\pi^{\ast}} + \gamma P_{\pi^{\ast}}v^{\ast} \ge r_{\pi} + \gamma P_{\pi}v^{\ast}$$

we have

$$v^{\ast} - v_{\pi} \ge (r_{\pi} + \gamma P_{\pi}v^{\ast}) - (r_{\pi} + \gamma P_{\pi}v_{\pi}) = \gamma P_{\pi}(v^{\ast} - v_{\pi})$$

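Since every entry of $P_{\pi}$ is nonnegative, multiplying both sides by $\gamma P_{\pi}$ preserves the inequality. Repeatedly applying the above inequality therefore gives $v^{\ast} - v_{\pi} \ge \gamma P_{\pi}(v^{\ast} - v_{\pi}) \ge \gamma^{2} P_{\pi}^{2}(v^{\ast} - v_{\pi})\ge \dots$. It follows that

$$v^{\ast} - v_{\pi} \ge \lim_{n\to\infty}\gamma^{n} P_{\pi}^{n}(v^{\ast} - v_{\pi})=0$$

where the limit is zero because $\gamma<1$ and $P_{\pi}^{n}$ is a stochastic matrix whose entries lie in $[0,1]$. Hence $v^{\ast}\ge v_{\pi}$ for every policy $\pi$. Moreover, $v^{\ast} = r_{\pi^{\ast}} + \gamma P_{\pi^{\ast}}v^{\ast}$ is exactly the Bellman equation of $\pi^{\ast}$, so $v_{\pi^{\ast}} = v^{\ast}$ and $\pi^{\ast}$ is an optimal policy.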
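
The optimality claim can also be checked numerically. The sketch below approximates $v^{\ast}$ by fixed-point iteration, builds the greedy policy from $q^{\ast}$, and compares $v^{\ast}$ against the state values of randomly sampled policies. The model, the helper `evaluate`, and the sampling of policies from a Dirichlet distribution are assumptions made for illustration.

```python
import numpy as np

gamma = 0.9

# Hypothetical 3-state, 2-action model (an assumption for illustration).
# P[a, s, t] = p(t | s, a), R[s, a] = expected immediate reward.
P = np.array([
    [[0.9, 0.1, 0.0], [0.0, 0.9, 0.1], [0.0, 0.0, 1.0]],  # action 0
    [[0.1, 0.9, 0.0], [0.1, 0.0, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, -1.0], [0.0, -1.0], [1.0, 1.0]])
n_states, n_actions = R.shape


def evaluate(pi):
    """Solve v = r_pi + gamma * P_pi v for a policy pi[s, a]."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,ast->st", pi, P)
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


# Approximate v* by iterating the Bellman optimality operator.
v_star = np.zeros(n_states)
for _ in range(1000):
    v_star = (R + gamma * np.einsum("ast,t->sa", P, v_star)).max(axis=1)

# Greedy policy: probability 1 on an action that maximizes q*(s, a).
q_star = R + gamma * np.einsum("ast,t->sa", P, v_star)
greedy = np.zeros((n_states, n_actions))
greedy[np.arange(n_states), q_star.argmax(axis=1)] = 1.0
v_greedy = evaluate(greedy)

# Compare v* with the values of randomly sampled stochastic policies.
rng = np.random.default_rng(0)
gaps = []
for _ in range(200):
    pi = rng.dirichlet(np.ones(n_actions), size=n_states)
    gaps.append(np.min(v_star - evaluate(pi)))
min_gap = min(gaps)

print(v_star, v_greedy, min_gap)
```

If the result above holds, `v_greedy` matches `v_star` and `min_gap` is nonnegative up to floating-point error, meaning no sampled policy beats $v^{\ast}$ in any state.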