# Dynamic Programming
This chapter is a bit of an oddity to me because I have always been taught to think of dynamic programming as a matter of decomposition and memoization, which it turns out does play into the history of optimal control...  The thing that throws me for a loop here is that I usually think of dynamic programming as an algorithmic strategy that you apply for some problems with the right structure.  I guess the structure is simply that you can express the optimal solution recursively, in terms of the optimal solution to a smaller problem, and that is certainly the gist of the Bellman equations, so there's that... 

Two quick notes about DP approaches to RL.  First, note that these algorithms require the dynamics, which is not a factor in the more usual DP algorithms you see in intro algo courses.  Second, we very often don't have the dynamics in real problems, but the approaches used in those situations often try to approximate these algorithms.  So it's worth understanding them!

Anyhow, this chapter starts to get into how to find optimal policies for _finite MDPs_ (i.e., finite state, reward, and action spaces) when we have a good model of the dynamics $p(s', r | s, a)$.  The basic idea is to use the Bellman optimality equations.  As a refresher, here they are again: 

$$
\begin{align}
v_*(s) & = \max_a \mathbb{E}[R_{t+1} + \gamma v_*(S_{t+1}) | S_t = s, A_t = a ] \\
       & = \max_a \sum_{s', r} p(s', r|s,a) [r + \gamma v_*(s')] \\
q_*(s,a) & = \mathbb{E} \left[ R_{t+1} + \gamma \max_{a'} q_*(S_{t+1}, a') | S_t = s, A_t = a \right] \\
         & = \sum_{s', r} p(s', r|s,a) [r + \gamma \max_{a'} q_*(s',a')]
\end{align}
$$

Finding an optimal policy (the _control_ problem) requires that we have a way of figuring out how good a given policy is (the _evaluation_ problem), which is basically estimating the value function.  So that's where we start.

## Iterative Policy Evaluation
First off, it seems reasonable that in order to improve on a given policy, $\pi$, we need a way to evaluate how good it is.  This is called _policy evaluation_, or the _prediction problem_.  Note that we are not claiming that $\pi$ is optimal - we are just asking - how do we evaluate the value of an arbitrary $\pi$?  Recall one form of the Bellman equation for state value functions: 

$$
v_{\pi}(s) = \sum_a \pi(a|s) \sum_{s', r} p(s',r|s,a) [r + \gamma v_{\pi}(s')]
$$

Note that $v_{\pi}$ exists and is unique as long as $\gamma < 1$ or termination is guaranteed from all states under the policy.  In principle, the above relation defines a linear system of $|\mathcal{S}|$ equations in $|\mathcal{S}|$ unknowns that we could solve directly, but that is tedious and in any event this kind of approach simply won't scale for bigger problems.  So we focus on iterative approaches.  We start with an initial guess $v_0$ at $v_{\pi}$, and we improve on it (make it closer to the actual value function) iteratively using the update: 

For each state:
$$
\begin{align}
v_{k+1}(s) & = \mathbb{E}_{\pi} [ R_{t+1} + \gamma v_k(S_{t+1}) | S_t = s ] \\
           & = \sum_a \pi(a|s) \sum_{s',r} p(s',r|s,a) [r + \gamma v_k(s')]
\end{align}
$$

That is, we simply use the Bellman equation as an update.  Under the same conditions that guarantee the existence and uniqueness of $v_{\pi}$, we can prove that the sequence of $v_k$ converges to $v_{\pi}$ - we can definitely see that the latter is a fixed point of these updates, so this certainly seems reasonable!  Note - these kinds of updates are called _expected updates_ because we are calculating them using the expectation of the next state and reward rather than a single experiment or sample of the next state and reward...  

### How we do this in practice
First, analogous to Gibbs sampling, this is most often done _in place_.  Say we are representing $v_k$ as an array with an element for each state $s$.  The most straightforward way to run this is to start with such an array for iteration $k$, and then for iteration $k+1$ write into a new array representing $v_{k+1}$.  In practice, however, we usually converge faster if we keep a single array (i.e., we update values in place) so that later updates in a sweep get the benefit of the improved estimates for earlier states.  We also need some way to decide when we've effectively converged - we can just test whether the largest change in a sweep, $\max_s |v_{k+1}(s) - v_k(s)|$, is below some small threshold $\epsilon$.  If it is, we stop.
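
To make the in-place version concrete, here is a minimal sketch in Python.  The model layout (`P[s][a]` as a list of `(prob, next_state, reward)` tuples), the function name `policy_evaluation`, and the three-state toy MDP are all assumptions made up for illustration - they are not from the book.

```python
def policy_evaluation(P, pi, gamma, epsilon=1e-8):
    """In-place iterative policy evaluation.

    Assumed layout (hypothetical, for illustration only):
      P[s][a]  -> list of (prob, next_state, reward) tuples
      pi[s][a] -> probability of taking action a in state s
    """
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            old = V[s]
            # Expected update: average over actions, next states and rewards.
            V[s] = sum(
                pi[s][a] * prob * (reward + gamma * V[s_next])
                for a in range(len(P[s]))
                for prob, s_next, reward in P[s][a]
            )
            delta = max(delta, abs(old - V[s]))
        if delta < epsilon:
            return V


# Toy MDP: two ordinary states and an absorbing, zero-reward terminal state 2.
P = [
    [[(1.0, 1, 0.0)], [(1.0, 2, 1.0)]],  # state 0
    [[(1.0, 2, 4.0)], [(1.0, 2, 0.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # state 2 (terminal)
]
uniform = [[0.5, 0.5] for _ in P]  # equiprobable random policy
V = policy_evaluation(P, uniform, gamma=0.5)
print(V)  # approximately [1.0, 2.0, 0.0]
```

Because `V` is a single list, the update for state 0 in a given sweep already sees whatever was written for other states earlier in that sweep, which is exactly the in-place behaviour described above.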
           

<b>Exercise 4.3</b>  What are the analogous equations for the action-value function $q_{\pi}$?

For each state and action pair, we update:
$$
q_{k+1}(s,a) = \sum_{s',r} p(s',r|s,a) \left[ r + \gamma \sum_{a'} \pi(a'|s') q_k(s',a') \right]
$$

## Policy Improvement
So now that we can evaluate a policy, how do we actually improve on a given policy $\pi$?  Specifically, say we are considering changing $\pi$ to deterministically select action $a$ in state $s$.  The value of doing this is given simply by the current action-value function (assume we have used policy evaluation to get $v_{\pi}$): 

$$
q_{\pi}(s,a) = \sum_{s',r} p(s',r|s,a) [r + \gamma v_{\pi}(s')]
$$

If this is greater than $v_{\pi}(s)$, then, because of the Markov property, it is reasonable to assume that we should do action $a$ every time we are in state $s$, so we should update $\pi$ to do $a$ when in $s$.  This is the basic idea behind the _policy improvement theorem_ : let $\pi$ and $\pi'$ be any pair of deterministic policies, with: 

$$
q_{\pi}(s, \pi'(s)) \ge v_{\pi}(s) \text{  } \forall s \in \mathcal{S}
$$

This is just saying the same thing as the above, but jumping straight to the relevant comparison.  Then $\pi'$ is as good as or better than $\pi$: 

$$
v_{\pi'}(s) \ge v_{\pi}(s) \text{  } \forall s \in \mathcal{S}
$$

The proof of this is simple - see page 78.  

So what happens if we expand this to consider changes to all states and all possible actions?  That is, for each state, update $\pi \rightarrow \pi'$ to select the action $a$ in a greedy fashion? 

For each state, we update the policy:
$$
\begin{align}
\pi'(s) & = \underset{a}{\operatorname{argmax}} \; q_{\pi}(s,a) \\
        & = \underset{a}{\operatorname{argmax}} \; \mathbb{E}[R_{t+1} + \gamma v_{\pi}(S_{t+1}) | S_t = s, A_t = a ] \\
        & = \underset{a}{\operatorname{argmax}} \sum_{s',r} p(s',r|s,a) [r + \gamma v_{\pi}(s')]
\end{align}
$$
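
Here is a small sketch of that greedy step, assuming a model stored as `P[s][a]` = list of `(prob, next_state, reward)` tuples.  The helper names `action_value` and `greedy_policy` and the toy MDP are hypothetical, just for illustration.

```python
def action_value(P, V, s, a, gamma):
    """One-step lookahead: q(s, a) computed from the model and V."""
    return sum(prob * (reward + gamma * V[s_next])
               for prob, s_next, reward in P[s][a])


def greedy_policy(P, V, gamma):
    """Deterministic policy that is greedy with respect to V.

    Ties go to the lowest action index, because max() returns the
    first maximal element it sees.
    """
    return [
        max(range(len(P[s])), key=lambda a: action_value(P, V, s, a, gamma))
        for s in range(len(P))
    ]


# Toy MDP (assumed layout): state 2 is an absorbing, zero-reward terminal state.
P = [
    [[(1.0, 1, 0.0)], [(1.0, 2, 1.0)]],  # state 0
    [[(1.0, 2, 4.0)], [(1.0, 2, 0.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # state 2 (terminal)
]
gamma = 0.5
V_old = [1.0, 0.0, 0.0]  # value function of the policy "always take action 1"
pi_new = greedy_policy(P, V_old, gamma)
print(pi_new)  # [1, 0, 0]
```

In this toy problem the improved policy fixes state 1 but still takes action 1 in state 0, because the lookahead uses the old value of state 1.  The policy is better than the one we started with, but it is not yet optimal.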

## Policy Iteration
Now let's say that we do alternating rounds of policy evaluation and improvement until we are no longer making any updates.  Then we know that the Bellman optimality condition holds, since it certainly holds for any given state by construction.  Thus, when we have converged, we should have an optimal value function and policy!  This algorithm is called _policy iteration_.

When I first read this, I was a bit confused - won't this immediately give you the optimal policy?  But no, it won't - recall that the improvement step is greedy with respect to the value function of the current policy $\pi$, and that value function changes each time we update $\pi$ (so we have to do another round of policy evaluation after each round of policy improvement).  The process looks like this.  Say we start with an initial policy $\pi_0$...  We do iterations of policy evaluation and improvement: 

$$
\pi_0 \overset{E}{\rightarrow} v_{\pi_0} \overset{I}{\rightarrow} \pi_1 \overset{E}{\rightarrow} v_{\pi_1} \overset{I}{\rightarrow} \pi_2 \overset{E}{\rightarrow} v_{\pi_2} \overset{I}{\rightarrow} \pi_3 \dots \overset{I}{\rightarrow} \pi_* \overset{E}{\rightarrow} v_{\pi_*}
$$

After each round of policy improvement, we've improved the policy but still only with respect to the current policy - we don't necessarily have the optimal policy!  So we have to keep going until we are no longer updating the value function or policy, in which case we have converged per the discussion above.

Another way of describing this is that each policy improvement step makes the policy greedy with respect to the value function of the current policy; by the policy improvement theorem, this is a good step.  Then we need to estimate the value function for the new policy, and around we go until we have a policy that is already greedy with respect to its value function, at which point the Bellman optimality condition holds by definition and we are done.

Finally, note that this was all written with respect to deterministic policies, but in fact holds for stochastic policies as well - just assign zero probability to all non-maximizing actions in the update. Each round of policy iteration strictly improves on the previous policy unless that policy is already optimal, and since a finite MDP has only finitely many deterministic policies, this must converge in a finite number of iterations.

### Policy Iteration
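```python
def q_value(P, V, s, a, gamma):
    # Assumed model layout: P[s][a] is a list of (prob, next_state, reward) tuples.
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def eval_policy(P, pi, gamma, epsilon=1e-6):
    """In-place policy evaluation for a deterministic policy pi[s]."""
    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = V[s]
            V[s] = q_value(P, V, s, pi[s], gamma)
            delta = max(delta, abs(v - V[s]))
        if delta < epsilon:
            break
    return V, sum(V)  # sum(V) is a scalar summary of how good pi is


def policy_iteration(P, gamma, epsilon=1e-6):
    pi = [0] * len(P)  # arbitrary initial deterministic policy
    prev_policy_value = None
    while True:
        # Policy evaluation
        V, current_policy_value = eval_policy(P, pi, gamma, epsilon)
        # Exercise 4.4 guard: if the total value has stopped changing, we are
        # only switching between equally good policies, so stop.
        if (prev_policy_value is not None
                and abs(current_policy_value - prev_policy_value) < epsilon):
            return V, pi
        prev_policy_value = current_policy_value

        # Policy improvement
        policy_stable = True
        for s in range(len(P)):
            old_action = pi[s]
            pi[s] = max(range(len(P[s])),
                        key=lambda a: q_value(P, V, s, a, gamma))
            if old_action != pi[s]:
                policy_stable = False
        if policy_stable:
            return V, pi  # these are close to optimal now
```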

#### Exercise 4.4
The policy iteration algorithm on page 80 has a subtle bug in that it may never terminate if the policy continually switches between two or more policies that are equally good. This is ok for pedagogy, but not for actual use. Modify the pseudocode so that convergence is guaranteed.

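See the code block above: besides stopping when the policy is stable, it also stops when the total value of the policy stops changing from one evaluation to the next, so switching between equally good policies cannot go on forever.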


#### Exercise 4.5
How would policy iteration be defined for action values? Give a complete algorithm for computing $q_*$, analogous to that on page 80 for computing $v_*$.

Here is the skeleton.  I don't quite understand how this should be very different from the policy iteration algorithm.  Certainly if you ran policy iteration and arrived at an optimal value function and policy, it would be trivial to get an estimate for $q_*(s,a)$ - you could sweep over states and actions and just calculate the expected immediate reward plus the discounted value of the next state under $v_*$... 

Okay, so I think the only reasonable squaring of the circle is to say that instead of estimating $v_{\pi}$ given $\pi$, we are intended to estimate $q_{\pi}$ given $\pi$, and then use that $q_{\pi}$ to improve on the policy.  So that's pretty straightforward:

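```python
def eval_q(P, pi, gamma, epsilon=1e-6):
    """Policy evaluation on action values for a deterministic policy pi[s].

    Assumed model layout: P[s][a] is a list of (prob, next_state, reward) tuples.
    """
    Q = [[0.0] * len(P[s]) for s in range(len(P))]
    while True:
        delta = 0.0
        for s in range(len(P)):
            for a in range(len(P[s])):
                q = Q[s][a]
                # pi is deterministic, so the sum over a' of pi(a'|s') * Q[s'][a']
                # collapses to Q[s'][pi[s']].
                Q[s][a] = sum(p * (r + gamma * Q[s2][pi[s2]])
                              for p, s2, r in P[s][a])
                delta = max(delta, abs(q - Q[s][a]))  # track every (s, a) pair
        if delta < epsilon:
            return Q


def q_policy_iteration(P, gamma, epsilon=1e-6):
    pi = [0] * len(P)  # arbitrary initial deterministic policy
    while True:
        # Policy evaluation
        Q = eval_q(P, pi, gamma, epsilon)

        # Policy improvement: only switch when another action is better by
        # more than epsilon, which guards against flip-flopping between ties.
        policy_stable = True
        for s in range(len(P)):
            best = max(range(len(P[s])), key=lambda a: Q[s][a])
            if Q[s][best] > Q[s][pi[s]] + epsilon:
                pi[s] = best
                policy_stable = False
        if policy_stable:
            return Q, pi
```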


#### Exercise 4.6
Suppose you are restricted to considering only policies that are $\epsilon$-soft, meaning that the probability of selecting each action in each state, $s$, is at least $\epsilon/|\mathcal{A}(s)|$. Describe qualitatively the changes that would be required in each of the steps 3, 2, and 1, in that order, of the policy iteration algorithm for $v_*$ on page 80.

Step 3: Basically, instead of putting all the probability on the argmax action, we need to spread it out so each action receives at least $\epsilon / |\mathcal{A}(s)|$ mass.  

Step 2: We now have a stochastic policy so the evaluation routine needs to assign V[s] to be the expectation over the actions from state s instead of assuming the policy is deterministic.

Step 1: The initialization of the policy needs to change to ensure the starting policy adheres to our constraint.

## Value Iteration
The above algorithm must do policy evaluation every iteration, which is itself a potentially quite expensive iterative process.  It turns out that we can just truncate it after one iteration and combine the update with policy improvement - the resulting algorithm is called _value iteration_:

For each state, we update: 
$$
v_{k+1}(s) = \max_a \sum_{s',r} p(s',r|s,a)[ r + \gamma v_k(s') ]
$$

Note that unlike the policy evaluation iterations, we are not doing an expectation over actions under policy $\pi$ - instead we are taking the max, effectively updating $v$ while also trying to improve the policy.  This converges to $v_*$ under the same conditions that guarantee the existence of $v_*$.  As for policy evaluation, we stop when $v$ changes very little over all states.

### Value Iteration
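```python
def value_iteration(P, gamma, epsilon=1e-6):
    """Assumed model layout: P[s][a] is a list of (prob, next_state, reward) tuples.

    Terminal states are modelled as absorbing with zero reward, so their
    value stays at 0.
    """
    def q(s, a):
        return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])

    V = [0.0] * len(P)
    while True:
        delta = 0.0
        for s in range(len(P)):
            v = V[s]
            V[s] = max(q(s, a) for a in range(len(P[s])))
            delta = max(delta, abs(v - V[s]))
        if delta < epsilon:
            break
    # Deterministic policy that is greedy with respect to V; it should be a
    # good approximation to an optimal policy.
    pi = [max(range(len(P[s])), key=lambda a: q(s, a)) for s in range(len(P))]
    return V, pi
```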


### Exercises 4.8-10
<b>Exercise 4.8</b> There is nothing to be gained from winning more than 100, so it doesn't pay to ever bet more than 50.  That way, if you lose on that bet, then you can start from something again. 

<b>Exercise 4.9</b>  Implement value iteration for the gambler's problem and solve for $p_h = 0.25$ and $p_h = 0.55$.  Show results graphically as in figure 4.3.  

See below... 

<b>Exercise 4.10</b> What is the analog of the value iteration update (4.10) for action values $q_{k+1}(s,a)$?

$$
q_{k+1}(s,a) = \sum_{s',r} p(s',r|s,a) [r + \gamma \max_{a'} q_k(s',a')]
$$
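
As a sanity check, here is a minimal sketch of value iteration on action values, where the backup takes a max over the next action $a'$ instead of an expectation under $\pi$.  The model layout (`P[s][a]` as a list of `(prob, next_state, reward)` tuples), the name `q_value_iteration`, and the toy MDP are assumptions for illustration only.

```python
def q_value_iteration(P, gamma, epsilon=1e-8):
    """Value iteration on action values (in place).

    Assumed layout: P[s][a] -> list of (prob, next_state, reward) tuples.
    """
    Q = [[0.0] * len(P[s]) for s in range(len(P))]
    while True:
        delta = 0.0
        for s in range(len(P)):
            for a in range(len(P[s])):
                old = Q[s][a]
                # Back up the best action value available in the next state.
                Q[s][a] = sum(prob * (reward + gamma * max(Q[s_next]))
                              for prob, s_next, reward in P[s][a])
                delta = max(delta, abs(old - Q[s][a]))
        if delta < epsilon:
            return Q


# Toy MDP: state 2 is an absorbing, zero-reward terminal state.
P = [
    [[(1.0, 1, 0.0)], [(1.0, 2, 1.0)]],  # state 0
    [[(1.0, 2, 4.0)], [(1.0, 2, 0.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # state 2 (terminal)
]
Q = q_value_iteration(P, gamma=0.5)
print(Q)  # approximately [[2.0, 1.0], [4.0, 0.0], [0.0, 0.0]]
```

A greedy policy can then be read off directly as the argmax over each row of `Q`, with no further use of the model.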

## Asynchronous Dynamic Programming
Each iteration of policy evaluation and policy improvement as described above requires a full sweep through all states $s \in \mathcal{S}$, which can be very expensive.  It often works well to update the values of only some subset of states per iteration.  Indeed, some states may have their values updated multiple times before others are updated once.  However, in order to converge, we cannot ignore any state forever - every state has to keep getting updated.  At an extreme, we can update the value of only a single state at a time!  This sort of game is called asynchronous dynamic programming, and is very useful for online learning situations where we are in a single state IRL!
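
A minimal sketch of the single-state extreme, using the value iteration update on one randomly chosen state at a time.  Picking states uniformly at random is purely my stand-in for "whatever state we happen to be in"; the model layout (`P[s][a]` as a list of `(prob, next_state, reward)` tuples), the function name, and the toy MDP are likewise assumptions.

```python
import random


def async_value_iteration(P, gamma, num_updates=1000, seed=0):
    """Asynchronous value iteration: one state is updated per step.

    Assumed layout: P[s][a] -> list of (prob, next_state, reward) tuples.
    States are drawn uniformly at random, so every state keeps being updated.
    """
    rng = random.Random(seed)
    V = [0.0] * len(P)
    for _ in range(num_updates):
        s = rng.randrange(len(P))
        V[s] = max(
            sum(prob * (reward + gamma * V[s_next])
                for prob, s_next, reward in P[s][a])
            for a in range(len(P[s]))
        )
    return V


# Toy MDP: state 2 is an absorbing, zero-reward terminal state.
P = [
    [[(1.0, 1, 0.0)], [(1.0, 2, 1.0)]],  # state 0
    [[(1.0, 2, 4.0)], [(1.0, 2, 0.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # state 2 (terminal)
]
V_async = async_value_iteration(P, gamma=0.5)
print(V_async)
```

There is no sweep and no stopping test here, just a fixed budget of updates.  A real implementation would need some way to decide when to stop, or would simply keep running alongside the agent.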

## Generalized Policy Iteration
There is a lot of room between our two extremes: full policy evaluation followed by policy improvement (policy iteration) vs. value iteration.  For instance, one can often get fast convergence by doing multiple sweeps of policy evaluation for each policy improvement sweep.  GPI is an umbrella term for algorithms that alternate policy evaluation (not always full - see value iteration!) and policy improvement, and includes almost all RL algorithms in practice.  
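
Here is a rough sketch of one point in that middle ground: a fixed number of evaluation sweeps per improvement sweep.  The name `truncated_policy_iteration`, the `eval_sweeps` parameter, the stopping rule based on the Bellman residual, and the model layout (`P[s][a]` as a list of `(prob, next_state, reward)` tuples) are all my own assumptions, and I have only checked it on the toy problem below.

```python
def truncated_policy_iteration(P, gamma, eval_sweeps=3, epsilon=1e-8):
    """Alternate a few evaluation sweeps with one greedy improvement sweep.

    Assumed layout: P[s][a] -> list of (prob, next_state, reward) tuples.
    """
    n = len(P)
    V = [0.0] * n
    pi = [len(P[s]) - 1 for s in range(n)]  # arbitrary start: last action everywhere

    def q(s, a):
        return sum(prob * (reward + gamma * V[s_next])
                   for prob, s_next, reward in P[s][a])

    while True:
        # Partial policy evaluation: a fixed number of in-place sweeps.
        for _ in range(eval_sweeps):
            for s in range(n):
                V[s] = q(s, pi[s])
        # Policy improvement, tracking how far V is from satisfying the
        # Bellman optimality equation.
        residual = 0.0
        for s in range(n):
            best = max(range(len(P[s])), key=lambda a: q(s, a))
            residual = max(residual, abs(q(s, best) - V[s]))
            pi[s] = best
        if residual < epsilon:
            return V, pi


# Toy MDP: state 2 is an absorbing, zero-reward terminal state.
P = [
    [[(1.0, 1, 0.0)], [(1.0, 2, 1.0)]],  # state 0
    [[(1.0, 2, 4.0)], [(1.0, 2, 0.0)]],  # state 1
    [[(1.0, 2, 0.0)], [(1.0, 2, 0.0)]],  # state 2 (terminal)
]
V_gpi, pi_gpi = truncated_policy_iteration(P, gamma=0.5)
print(V_gpi, pi_gpi)  # approximately [2.0, 4.0, 0.0] and [0, 0, 0]
```

With `eval_sweeps=1` this behaves much like value iteration, and with a large `eval_sweeps` it approaches ordinary policy iteration.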

## Approximate DP
There is a cool video about applications of DP to fleet management, where the state space is really huge and exact DP is out of the question!  See Warren Powell's page <a href='https://castlelab.princeton.edu/jungle/'>here</a> for some very cool stuff!  An interesting question is how you can improve on these algorithms and their robustness when you have real time sensors (to bring things back to what we do at work...).  Also, how do things change for automated vehicles?