# Chapter 4 - Dynamic Programming
This chapter considers one class of algorithms that can solve an RL problem (i.e. find an optimal policy) when a perfect model of the (finite) MDP is known. The authors note that this is rarely the case in practice. However, ideas from Dynamic Programming underpin the more advanced RL techniques developed in later chapters.

### Exercise 4.1

If $\pi$ is the equiprobable random policy, what is $Q^{\pi}(11, down)$? What is $Q^{\pi}(7, down)$?
<br>

From the Bellman equations we have:
$$
\begin{aligned}
    Q^{\pi}(s, a) &= \mathbb{E} \left[ r_{t+1} + \gamma V^{\pi} (s_{t+1}) \mid s_{t} = s, a_{t} = a \right] \\
    &= \sum_{s'} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma V^{\pi} (s') \right)
\end{aligned}
$$

Given the gridworld definition and the values in Figure 4.2 (left), $s_{t} = 11$ and $a_{t} = down$ imply $s_{t+1} = terminal$, $r_{t+1} = -1$ and $V^{\pi}(terminal) = 0$, so $\boxed{Q^{\pi}(11, down) = 1 \cdot \left( -1 + \gamma \cdot 0 \right) = -1}$.
<br>

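Similarly, taking $down$ from state $7$ leads to state $11$ with reward $-1$. Since the task is undiscounted ($\gamma = 1$) and Figure 4.2 gives $V^{\pi}(11) = -14$, we get $\boxed{ Q^{\pi}(7, down) = 1 \cdot \left( -1 + \gamma V^{\pi}(11) \right) = -15}$.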
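Both values can be checked numerically. The following is a minimal sketch, not the book's code: it assumes the 4x4 gridworld of the chapter with cells numbered 0 to 15 row by row (so cells 1 to 14 coincide with states 1 to 14 and cells 0 and 15 are the terminal state), a reward of $-1$ on every transition and $\gamma = 1$. The names `step`, `evaluate_random_policy` and `q_value` are hypothetical helpers introduced for illustration.

```python
# Assumed layout: cells 0..15 row by row; cells 0 and 15 are terminal.
N = 4
TERMINAL = {0, N * N - 1}
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(s, a):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    row, col = divmod(s, N)
    d_row, d_col = ACTIONS[a]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return s


def evaluate_random_policy(gamma=1.0, theta=1e-10):
    """In-place iterative policy evaluation for the equiprobable random policy."""
    V = [0.0] * (N * N)
    while True:
        delta = 0.0
        for s in range(N * N):
            if s in TERMINAL:
                continue  # the terminal value stays at 0
            v_new = sum(0.25 * (-1 + gamma * V[step(s, a)]) for a in ACTIONS)
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


def q_value(V, s, a, gamma=1.0):
    """Q(s, a) = r + gamma * V(s') for a deterministic transition with r = -1."""
    return -1 + gamma * V[step(s, a)]


V = evaluate_random_policy()
print(round(q_value(V, 11, "down"), 3))
print(round(q_value(V, 7, "down"), 3))
```

Under these assumptions the two printed numbers should match the boxed results above, up to the convergence tolerance `theta`.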

### Exercise 4.2
Suppose a new state $15$ is added below state $13$, and the actions $left$, $up$, $right$ and $down$ take the agent to states $12$, $13$, $14$ and $15$, respectively. 
- If transitions _from_ the original states are unchanged, what is $V^{\pi}(15)$ for the equiprobable random policy?
- If the dynamics of state $13$ are changed such that $\mathcal{P}_{13 \to 15}^{down} = 1$, what is $V^{\pi}(15)$ for the same policy?
<br>

__Answers__:
<br>
- when the transitions from the original states don't change, their state-values don't change either, since state $15$ is unreachable from any other state. In particular, $(s_t = 13, a_t = down) \longrightarrow s_{t+1} = 13$. So $V^{\pi}(15)$ follows directly from its own Bellman equation and the values in Figure 4.2 (with $\gamma = 1$ from the second line on, since the task is undiscounted):
$$
\begin{aligned}
    V^{\pi}(15) &= \frac{1}{4} \left[ (-1 + \gamma V^{\pi}(12)) + (-1 + \gamma V^{\pi}(13)) + (-1 + \gamma V^{\pi}(14)) + (-1 + \gamma V^{\pi}(15)) \right] \\
    &= \frac{1}{4} \left[ -4 + V^{\pi}(12) + V^{\pi}(13) + V^{\pi}(14) \right] + \frac{1}{4} V^{\pi}(15) \\
    &= \frac{4}{3} \cdot \frac{1}{4} \left[ -4 + V^{\pi}(12) + V^{\pi}(13) + V^{\pi}(14) \right] \\
    &= \frac{1}{3} \cdot (-60) \\
    &= \boxed{-20} \\
\end{aligned}
$$

- when the transition from state $13$ with action $down$ leads to state $15$, we have to re-evaluate $V^{\pi}(13)$ in particular and, in principle, all the other state-values iteratively as per the algorithm in __Figure 4.1__, since they are interdependent. We drop $\gamma$ to simplify the equations, since this task is undiscounted, i.e. $\gamma = 1$. We start by re-evaluating $V^{\pi}(13)$ using the value of $V^{\pi}(15)$ calculated above:
$$
\begin{aligned}
    V^{\pi}(13) &= \frac{1}{4} \left[ (-1 + V^{\pi}(12)) + (-1 + V^{\pi}(9)) + (-1 + V^{\pi}(14)) + (-1 + V^{\pi}(15)) \right] \\
    &= \frac{1}{4} \left[ (-23) + (-21) + (-15) + (-21) \right] \\
    &= -20 \\
\end{aligned}
$$
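So the value of $V^{\pi}(13)$ remains the same. The original state-values together with $V^{\pi}(15) = -20$ therefore satisfy every Bellman equation of the modified MDP, and since that solution is unique, all other state-values remain the same as well, including $\boxed{V^{\pi}(15) = -20}$.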
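Both answers can be cross-checked by running iterative policy evaluation on the extended gridworld. This is a rough sketch under the same assumptions as the exercise (reward $-1$ per step, $\gamma = 1$, equiprobable random policy). The helpers `build_transitions` and `evaluate` are hypothetical names, and the new state is stored under the key `15`, which is distinct from grid cell 15 (the terminal corner, represented here by `"T"`).

```python
N = 4
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
T = "T"  # the single terminal state (both shaded corner cells)


def build_transitions(redirect_13_down):
    """Successor table {state: {action: next_state}} for states 1..14 plus 15."""

    def cell_to_state(cell):
        # Corner cells 0 and 15 are terminal; every other cell is its own state.
        return T if cell in (0, N * N - 1) else cell

    nxt = {}
    for s in range(1, N * N - 1):
        row, col = divmod(s, N)
        nxt[s] = {}
        for a, (d_row, d_col) in MOVES.items():
            new_row, new_col = row + d_row, col + d_col
            inside = 0 <= new_row < N and 0 <= new_col < N
            nxt[s][a] = cell_to_state(new_row * N + new_col) if inside else s
    # New state 15 below state 13, with the transitions given in the exercise.
    nxt[15] = {"left": 12, "up": 13, "right": 14, "down": 15}
    if redirect_13_down:
        nxt[13]["down"] = 15  # second part of the exercise
    return nxt


def evaluate(nxt, gamma=1.0, theta=1e-10):
    """In-place iterative policy evaluation for the equiprobable random policy."""
    V = {s: 0.0 for s in nxt}
    V[T] = 0.0
    while True:
        delta = 0.0
        for s in nxt:
            v_new = sum(0.25 * (-1 + gamma * V[s2]) for s2 in nxt[s].values())
            delta = max(delta, abs(v_new - V[s]))
            V[s] = v_new
        if delta < theta:
            return V


for redirect in (False, True):
    V = evaluate(build_transitions(redirect))
    print(redirect, round(V[13], 3), round(V[15], 3))
```

In both configurations the printed values of $V^{\pi}(13)$ and $V^{\pi}(15)$ should agree with the hand calculation, up to the tolerance `theta`.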

### Exercise 4.3
What are the equations analogous to (4.3), (4.4) and (4.5) for the action-value function $Q^{\pi}$ and its successive approximation by a sequence of functions $Q_{0}$, $Q_{1}$, $Q_{2}$,...?

Let's start from the Bellman equation for $Q^{\pi}$ (see exercise 3.8):
$$
\begin{aligned}
    Q^{\pi}(s, a) &= \mathbb{E}_{\pi} \left[ r_{t+1} + \gamma r_{t+2} + \gamma^2 r_{t+3} + \dots \mid s_{t} = s, a_{t} = a \right] \\
    & = \dots \\
    & = \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(s', a') \cdot Q^{\pi} (s', a') \right) \\
\end{aligned}
$$

Now, in order to successively approximate $Q^{\pi}$ starting from some initial guess $Q_{0}$, we turn the Bellman equation into an update rule:
$$
\begin{aligned}
    Q_{k+1}(s, a) & = \sum_{s' \in \mathcal{S}} \mathcal{P}_{ss'}^{a} \left( \mathcal{R}_{ss'}^{a} + \gamma \sum_{a' \in \mathcal{A}(s')} \pi(s', a') \cdot Q_{k} (s', a') \right) \text{, for all } k \geq 0.\\
\end{aligned}
$$
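To see the update rule in action, here is a minimal sketch that applies it to the 4x4 gridworld of Exercise 4.1 under the equiprobable random policy. It assumes cells numbered 0 to 15 row by row with cells 0 and 15 terminal, a reward of $-1$ per step, $\gamma = 1$ and $Q_{0} = 0$. The names `step` and `evaluate_q` are hypothetical. Each sweep builds $Q_{k+1}$ entirely from $Q_{k}$, matching the equation above literally.

```python
N = 4
TERMINAL = {0, N * N - 1}
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def step(s, a):
    """Deterministic move; stepping off the grid leaves the state unchanged."""
    row, col = divmod(s, N)
    d_row, d_col = MOVES[a]
    new_row, new_col = row + d_row, col + d_col
    if 0 <= new_row < N and 0 <= new_col < N:
        return new_row * N + new_col
    return s


def evaluate_q(gamma=1.0, theta=1e-10):
    """Synchronous iterative evaluation of Q for the equiprobable random policy."""
    # Entries for terminal cells are never updated, so they stay at 0.
    Q = {(s, a): 0.0 for s in range(N * N) for a in MOVES}
    while True:
        Q_next = dict(Q)
        for s in range(N * N):
            if s in TERMINAL:
                continue
            for a in MOVES:
                s2 = step(s, a)
                # Inner sum over a': expected Q_k in the successor state under pi.
                expected_next = sum(0.25 * Q[(s2, a2)] for a2 in MOVES)
                Q_next[(s, a)] = -1 + gamma * expected_next
        delta = max(abs(Q_next[key] - Q[key]) for key in Q)
        Q = Q_next
        if delta < theta:
            return Q


Q = evaluate_q()
print(round(Q[(11, "down")], 3))
print(round(Q[(7, "down")], 3))
```

Under these assumptions the iteration should reproduce the two action-values found in Exercise 4.1.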