# Chapter 4

This notebook contains my notes and code from Chapter 4 of "Reinforcement Learning: An Introduction" by Sutton and Barto.

Dynamic programming (DP) refers to a collection of algorithms that can be used to compute optimal policies, given a perfect model of the environment as a Markov decision process (MDP).

**"Classical DP algorithms are of limited utility in reinforcement learning both because of their assumption of a perfect model and because of their great computational expense, but they are still important theoretically."**

We usually assume that the environment is a finite MDP. Dynamic programming can also be applied to problems with continuous state and action spaces, but exact solutions are then possible only in special cases. One common way to get approximate solutions for such tasks is to quantize the state and action spaces and then apply the finite-state DP algorithms. The methods explored in Part 2 can be applied to continuous problems and are an extension of that approach.
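To make the quantization idea concrete, here is a minimal sketch of mapping a continuous one-dimensional state onto a finite set of bin indices. The helper names `make_edges` and `quantize`, the range $[-1, 1]$, and the choice of four equal-width bins are my own assumptions for illustration; they are not from the book.

```python
from bisect import bisect_right

def make_edges(low, high, n_bins):
    """Interior edges of n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    return [low + i * width for i in range(1, n_bins)]

def quantize(x, edges):
    """Index (0 .. len(edges)) of the bin containing x.

    Values below the first edge fall in bin 0 and values at or above the
    last edge fall in the final bin, so out-of-range inputs are clipped.
    """
    return bisect_right(edges, x)

# Hypothetical example: a state variable in [-1, 1] split into 4 bins.
edges = make_edges(-1.0, 1.0, 4)   # interior edges at -0.5, 0.0, 0.5
bins = [quantize(x, edges) for x in (-1.0, -0.2, 0.0, 0.7, 1.0)]
```

Each bin index can then be treated as one state of a finite MDP. How fine the bins need to be depends on the problem, and this sketch says nothing about how to estimate the transition probabilities between bins.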

"The key idea of DP, and of reinforcement learning generally, is the use of value functions to organize and structure the search for good policies." In this chapter we'll show how dynamic programming can be used to compute the value functions defined in Chapter 3. We can easily obtain optimal policies once we've found the optimal value functions, $v_*$ or $q_*$, which satisfy the Bellman optimality equations:

$v_*(s) = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a]$

$v_*(s) = \max_a \sum_{s', r} p(s', r | s, a)[r + \gamma v_*(s')] \quad (4.1) \quad$, or

$q_*(s, a) = \mathbb{E}[R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a]$

$q_*(s, a) = \sum_{s', r} p(s', r | s, a) [r + \gamma \max_{a'} q_*(s', a')] \quad (4.2) \quad$

for all $s \in S$, $a \in A(s)$, and $s' \in S^+$, where $S^+$ is $S$ plus a terminal state if the problem is episodic.

Dynamic programming algorithms are obtained by turning Bellman equations such as these into assignments, that is, into update rules for improving approximations of the desired value functions.
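As a rough illustration of what "turning an equation into an update rule" means, the sketch below uses the right-hand side of (4.1) as an assignment on a tiny MDP. The MDP itself (states `A`, `B`, terminal state `T`, actions `go` and `quit`), the `dynamics` dictionary layout, and the function names are all assumptions I made up for this example; they are not from the book.

```python
# Hypothetical toy MDP: dynamics[(s, a)] lists (probability, next_state,
# reward) triples, playing the role of p(s', r | s, a).
dynamics = {
    ("A", "go"):   [(1.0, "B", 0.0)],
    ("A", "quit"): [(1.0, "T", 1.0)],
    ("B", "go"):   [(1.0, "T", 10.0)],
    ("B", "quit"): [(1.0, "T", 1.0)],
}
states = ["A", "B"]        # non-terminal states S
actions = ["go", "quit"]   # A(s), the same in every state here
gamma = 0.9

def q_value(v, s, a):
    """Expected one-step return of action a in state s under estimate v."""
    return sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])

def optimality_backup(v, s):
    """Right-hand side of (4.1), evaluated with the current estimate v."""
    return max(q_value(v, s, a) for a in actions)

# Start from an arbitrary estimate; the terminal state T stays at 0.
v = {"A": 0.0, "B": 0.0, "T": 0.0}
for sweep in range(2):     # two sweeps are enough for this tiny example
    for s in states:
        v[s] = optimality_backup(v, s)

# Once v approximates v_*, acting greedily with respect to it gives a policy.
greedy = {s: max(actions, key=lambda a: q_value(v, s, a)) for s in states}
```

In this example the estimate settles at $v(B) = 10$ and $v(A) = 0.9 \cdot 10 = 9$ (up to floating-point rounding), and the greedy action is `go` in both states. Two sweeps suffice only because the example is so small; in general the sweeps would be repeated until the values stop changing by more than some tolerance.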

### 4.1 Policy Evaluation (Prediction)

Policy evaluation is the computation of the state-value function $v_\pi$ for an arbitrary policy $\pi$. It is also referred to as the prediction problem. Recall from Chapter 3 that, for all $s \in S$,

$v_\pi(s) \stackrel{.}{=} \mathbb{E}_\pi [G_t | S_t = s]$

$v_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma G_{t+1} | S_t = s] \quad \quad$ (from (3.9))

$v_\pi(s) = \mathbb{E}_\pi [R_{t+1} + \gamma v_\pi(S_{t+1}) | S_t = s] \quad \quad (4.3)$

$v_\pi(s) = \sum_a \pi(a | s) \sum_{s', r} p(s', r | s, a)[r + \gamma v_\pi(s')] \quad \quad (4.4)$,

where $\pi(a | s)$ is the probability of taking action $a$ in state $s$ under policy $\pi$. The expectations and value functions are subscripted by $\pi$ to indicate that they are conditional on $\pi$ being followed. "The existence and uniqueness of $v_\pi$ are guaranteed as long as either $\gamma < 1$ or eventual termination is guaranteed from all states under the policy $\pi$."
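To see the Bellman equation for $v_\pi$ at work, here is a minimal sketch that applies its right-hand side repeatedly as an update until the values stop changing. The toy MDP, the equiprobable `policy`, the tolerance `theta`, and all function names are my own assumptions for illustration, not something taken from the book.

```python
# Hypothetical toy MDP: dynamics[(s, a)] lists (probability, next_state,
# reward) triples, playing the role of p(s', r | s, a).
dynamics = {
    ("A", "go"):   [(1.0, "B", 0.0)],
    ("A", "quit"): [(1.0, "T", 1.0)],
    ("B", "go"):   [(1.0, "T", 10.0)],
    ("B", "quit"): [(1.0, "T", 1.0)],
}
states = ["A", "B"]        # non-terminal states S
actions = ["go", "quit"]
gamma = 0.9

# policy[s][a] is pi(a | s); here an equiprobable random policy.
policy = {s: {a: 0.5 for a in actions} for s in states}

def expectation_backup(v, s):
    """Sum over actions and outcomes of pi(a|s) p(s',r|s,a) [r + gamma v(s')]."""
    return sum(
        policy[s][a]
        * sum(p * (r + gamma * v[s2]) for p, s2, r in dynamics[(s, a)])
        for a in actions
    )

theta = 1e-12                        # stopping tolerance (an assumption)
v = {"A": 0.0, "B": 0.0, "T": 0.0}   # terminal state T stays at 0
while True:
    delta = 0.0
    for s in states:
        new_value = expectation_backup(v, s)
        delta = max(delta, abs(new_value - v[s]))
        v[s] = new_value
    if delta < theta:
        break
```

For this example the values can be checked by hand: $v_\pi(B) = 0.5 \cdot 10 + 0.5 \cdot 1 = 5.5$ and $v_\pi(A) = 0.5 \cdot (0 + 0.9 \cdot 5.5) + 0.5 \cdot 1 = 2.975$. Every episode here terminates within two steps, so the condition of eventual termination from all states holds regardless of $\gamma$.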