# State values and Bellman equation

```{note}
This section introduces a core concept and an important tool. The core concept
is the state value, which is defined as the expected return that an agent obtains if
it starts from a given state and follows a given policy. The greater the state value is,
the better the corresponding policy is for that state. While state values are important,
how can we analyze them? The answer is the Bellman equation, which describes the
relationships between the values of all states.
```

## State values

Starting from time step $t$ and following a policy $\pi$, we can obtain a state-action-reward trajectory:

$$S_{t}\overset{A_{t}}{\longrightarrow}S_{t+1},R_{t+1} \overset{A_{t+1}}{\longrightarrow}S_{t+2},R_{t+2} \overset{A_{t+2}}{\longrightarrow}S_{t+3},R_{t+3}\dots$$

By definition, the discounted return along the trajectory is

$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \dots$$

where $\gamma\in(0,1)$ is the discount rate. Since $R_{t+1},R_{t+2},\dots$ are all random variables,
$G_{t}$ is also a random variable, and we can take its expected value conditioned on the starting state:

$$v_{\pi}(s) := \mathbb{E}[G_{t}|S_{t}=s]$$

Here, $v_{\pi}(s)$ is called the state-value function or simply the state value of $s$. Some important
remarks are given below.

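* $v_{\pi}(s)$ depends on $s$, because it is a conditional expectation given that the agent starts from $S_{t}=s$.

* $v_{\pi}(s)$ depends on $\pi$, because the trajectories are generated by following $\pi$.

* $v_{\pi}(s)$ does not depend on $t$, as long as the policy and the model do not change over time.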
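The remarks above can be illustrated numerically. The following is a minimal sketch, not a method from this section: it approximates the expectation in the definition by averaging sampled, truncated returns. The two-state, two-action MDP, the two policies, and the names `p0`, `r`, and `estimate_state_value` are all hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# Hypothetical 2-state, 2-action MDP (all numbers are illustrative assumptions).
# p0[s, a] = probability of moving to state 0 after taking action a in state s.
p0 = np.array([[0.9, 0.2],
               [0.5, 0.0]])
# r[s, a] = reward received after taking action a in state s (deterministic here).
r = np.array([[0.0, 1.0],
              [2.0, -1.0]])

def estimate_state_value(pi0, s, episodes=5000, horizon=100):
    """Sample-average estimate of v_pi(s); pi0[s] = probability of action 0 in s."""
    states = np.full(episodes, s)
    returns = np.zeros(episodes)
    for k in range(horizon):  # truncate the infinite sum; gamma**100 is negligible
        actions = (rng.random(episodes) >= pi0[states]).astype(int)
        returns += gamma**k * r[states, actions]
        states = (rng.random(episodes) >= p0[states, actions]).astype(int)
    return returns.mean()

policy_a = np.array([0.5, 0.3])  # probability of action 0 in each state
policy_b = np.array([1.0, 1.0])  # always take action 0

# The estimate changes with the starting state and with the policy.
for name, pi0 in [("policy_a", policy_a), ("policy_b", policy_b)]:
    print(name, [round(estimate_state_value(pi0, s), 2) for s in (0, 1)])
```

Because the estimates are sample averages, the printed numbers vary slightly with the random seed and the number of episodes.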

## Bellman equation

We now introduce the Bellman equation, a set of linear equations that describe the
relationships between the values of all the states. First, note that $G_{t}$ can be rewritten as

$$
\begin{aligned}
G_{t} &= R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \dots\\
&=R_{t+1} + \gamma(R_{t+2} + \gamma R_{t+3} + \dots)\\
&=R_{t+1} + \gamma G_{t+1}
\end{aligned}
$$
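The recursion $G_{t} = R_{t+1} + \gamma G_{t+1}$ can be checked on a concrete reward sequence. The sketch below assumes a hypothetical finite sequence of rewards (with zero reward afterwards) and compares the direct discounted sum with the value obtained by working backward through the recursion.

```python
import numpy as np

gamma = 0.9
# Hypothetical finite reward sequence R_{t+1}, R_{t+2}, ... (assumed zero afterwards).
rewards = np.array([0.0, 1.0, 1.0, -1.0, 2.0])

# Direct evaluation of the discounted sum for t = 0.
direct = sum(gamma**k * reward for k, reward in enumerate(rewards))

# Backward recursion G_t = R_{t+1} + gamma * G_{t+1}, starting from G = 0 at the end.
returns = np.zeros(len(rewards) + 1)
for t in reversed(range(len(rewards))):
    returns[t] = rewards[t] + gamma * returns[t + 1]

# Both numbers agree up to floating-point rounding.
print(direct, returns[0])
```

The backward pass also yields the return from every intermediate time step in `returns`, not only from the first one.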

Using this recursion and the linearity of expectation, the state value can be written as

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[G_{t}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1}|S_{t}=s] + \gamma\mathbb{E}[G_{t+1}|S_{t}=s]
\end{aligned}
$$

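The first term is obtained by conditioning on the action and applying the law of total expectation:

$$
\begin{aligned}
\mathbb{E}[R_{t+1}|S_{t}=s] &= \sum_{a\in\mathcal{A}}\pi(a|s)\mathbb{E}[R_{t+1}|S_{t}=s,A_{t}=a]\\
&= \sum_{a\in\mathcal{A}}\pi(a|s)r(s,a)
\end{aligned}
$$

where $r(s,a) := \mathbb{E}[R_{t+1}|S_{t}=s,A_{t}=a]$ is the expected immediate reward.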

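The second term can be calculated by conditioning on the next state. The second equality below uses the Markov property: given $S_{t+1}$, the future return $G_{t+1}$ does not depend on $S_{t}$.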

$$
\begin{aligned}
\mathbb{E}[G_{t+1}|S_{t}=s] &= \sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_{t}=s,S_{t+1}=s']p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_{t+1}=s']p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}v_{\pi}(s')p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}v_{\pi}(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s)
\end{aligned}
$$

This leads to the Bellman equation:

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1})|S_{t}=s] \\
&=\sum_{a\in\mathcal{A}}\pi(a|s)\left[r(s,a) + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\right]
\end{aligned}
$$
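To make the elementwise form concrete, the sketch below evaluates the right-hand side of the Bellman equation with explicit sums and repeatedly substitutes the current guess into it. This fixed-point iteration is only one possible way to obtain values that satisfy the equation, and it is not derived in this section. The MDP arrays `p`, `r`, `pi` and the helper `bellman_rhs` are hypothetical.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: p[s, a, s2] = p(s2 | s, a), r[s, a] = r(s, a), pi[s, a] = pi(a | s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, -1.0]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])
n_states, n_actions = r.shape

def bellman_rhs(v):
    """Evaluate the right-hand side of the Bellman equation for every state."""
    new_v = np.zeros(n_states)
    for s in range(n_states):
        for a in range(n_actions):
            expected_next = sum(p[s, a, s2] * v[s2] for s2 in range(n_states))
            new_v[s] += pi[s, a] * (r[s, a] + gamma * expected_next)
    return new_v

# Repeatedly substitute the current guess into the right-hand side.
v = np.zeros(n_states)
for _ in range(1000):
    v = bellman_rhs(v)

# After many substitutions, v satisfies v = bellman_rhs(v) up to numerical precision.
print(v, np.max(np.abs(v - bellman_rhs(v))))
```

The nested loops mirror the two sums in the equation one-to-one; a vectorized version would be shorter but harder to compare with the formula.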

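The Bellman equation can be written in matrix form:

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$

where $v_{\pi}$ and $r_{\pi}$ are vectors with entries $v_{\pi}(s)$ and $r_{\pi}(s) := \sum_{a\in\mathcal{A}}\pi(a|s)r(s,a)$, and $P_{\pi}$ is the state transition matrix with entries $[P_{\pi}]_{ss'} := \sum_{a\in\mathcal{A}}\pi(a|s)p(s'|s,a)$.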
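Since the matrix form is a linear system, it can be rearranged to $(I - \gamma P_{\pi})v_{\pi} = r_{\pi}$ and solved directly when the number of states is small. The sketch below builds $r_{\pi}$ and $P_{\pi}$ from a hypothetical MDP and policy (the arrays `p`, `r`, and `pi` are illustrative assumptions) and solves the system with NumPy.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: p[s, a, s2] = p(s2 | s, a), r[s, a] = r(s, a), pi[s, a] = pi(a | s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, -1.0]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

# r_pi[s] = sum_a pi(a|s) r(s,a);  P_pi[s, s2] = sum_a pi(a|s) p(s2|s,a).
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, p)

# Rearranging v = r_pi + gamma * P_pi v gives (I - gamma * P_pi) v = r_pi.
v_pi = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# The solution satisfies the matrix-form Bellman equation.
print(v_pi, np.allclose(v_pi, r_pi + gamma * P_pi @ v_pi))
```

Each row of `P_pi` sums to one, and with $\gamma<1$ the matrix $I - \gamma P_{\pi}$ is invertible, so the solve step is well defined.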

## Action values

The action value of a state-action pair $(s,a)$ is defined as

$$q_{\pi}(s,a) := \mathbb{E}[G_{t}|S_{t}=s,A_{t}=a]$$

As can be seen, the action value is defined as the expected return that can be obtained
after taking an action at a state.

Action values and state values are related as follows: the state value is the average of the action values weighted by the policy.

$$v_{\pi}(s) = \sum_{a\in\mathcal{A}}\pi(a|s)q_{\pi}(s, a)$$
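As a small numerical illustration of this relationship, the sketch below computes state values from a table of action values. The numbers in `pi` and `q` are hypothetical and are not derived from any particular model.

```python
import numpy as np

# Hypothetical tables: pi[s, a] = pi(a | s), q[s, a] = q_pi(s, a).
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])
q = np.array([[1.0, 3.0],
              [2.0, -1.0]])

# v_pi(s) = sum_a pi(a | s) * q_pi(s, a): a policy-weighted average over actions.
v = (pi * q).sum(axis=1)
print(v)
```

Because the weights $\pi(a|s)$ sum to one, each state value lies between the smallest and largest action value of that state.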

The Bellman equation that we previously introduced was defined based on state values.
In fact, it can also be expressed in terms of action values:

$$
\begin{aligned}
q_{\pi}(s,a) &= \mathbb{E}[R_{t+1} + \gamma q_{\pi}(S_{t+1}, A_{t+1})|S_{t}=s,A_{t}=a]\\
&= r(s,a) + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\sum_{a'\in\mathcal{A}}\pi(a'|s')q_{\pi}(s',a')
\end{aligned}
$$
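The two forms of the Bellman equation can be checked against each other numerically. In the sketch below, the state values are obtained from the matrix form, the action values are then computed as $q_{\pi}(s,a) = r(s,a) + \gamma\sum_{s'}p(s'|s,a)v_{\pi}(s')$ (which follows from substituting $v_{\pi}(s') = \sum_{a'}\pi(a'|s')q_{\pi}(s',a')$ into the equation above), and both equations are verified. The MDP arrays `p`, `r`, and `pi` are hypothetical.

```python
import numpy as np

gamma = 0.9
# Hypothetical MDP: p[s, a, s2] = p(s2 | s, a), r[s, a] = r(s, a), pi[s, a] = pi(a | s).
p = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.0, 1.0]]])
r = np.array([[0.0, 1.0],
              [2.0, -1.0]])
pi = np.array([[0.5, 0.5],
               [0.3, 0.7]])

# State values from the matrix form (I - gamma * P_pi) v = r_pi.
r_pi = (pi * r).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, p)
v = np.linalg.solve(np.eye(len(r_pi)) - gamma * P_pi, r_pi)

# Action values: q[s, a] = r[s, a] + gamma * sum_s2 p[s, a, s2] * v[s2].
q = r + gamma * (p @ v)

# Right-hand side of the action-value Bellman equation, summing over s2 (t) and a2 (b).
rhs = r + gamma * np.einsum("sat,tb,tb->sa", p, pi, q)

print(np.allclose(q, rhs), np.allclose(v, (pi * q).sum(axis=1)))
```

This route computes $q_{\pi}$ through $v_{\pi}$; solving a linear system directly over state-action pairs would be an alternative design.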