# Markov Decision Processes

A Markov decision process (MDP) is a framework for formalizing `reinforcement learning` problems. Its components are:<br>
`state`: $S$ — e.g. each board combo in 2048<br>
`model`: $T(s, a, s') \sim \Pr(s' \mid s, a)$ — i.e. the rules of the game that we're playing
* it gives us the probability that we'll end up in state $s'$ given that we were in state $s$, and took action $a$

`actions`: $A(s), A$ — e.g. up, down, left, and right in 2048<br>
`reward`: $R(s)$ — i.e. the immediate reward for entering state $s$<br>
`policy`: $\pi(s) \rightarrow a$ — for any given state that we're in, it determines the action that we should take. 
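
As a minimal sketch of how these pieces might be written down in code, the block below encodes a made-up two-state MDP (not 2048). The state and action names, the probabilities, the rewards, and the `step` helper are all assumptions chosen for illustration.

```python
import random

# Hypothetical toy MDP: two states, two actions.
states = [0, 1]
actions = ["stay", "go"]

# model: T[s][a] is a list of (probability, next_state) pairs, i.e. Pr(s' | s, a)
T = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}

# reward: R[s] is the reward for entering state s
R = {0: 0.0, 1: 1.0}

# policy: maps each state to the action to take there
policy = {0: "go", 1: "stay"}

def step(s, a, rng=random):
    """Sample a next state s' according to T(s, a, .)."""
    probs, next_states = zip(*T[s][a])
    return rng.choices(next_states, weights=probs, k=1)[0]

# Sanity check: every T(s, a, .) must be a probability distribution.
for s in T:
    for a in T[s]:
        assert abs(sum(p for p, _ in T[s][a]) - 1.0) < 1e-12

# Follow the policy for a few steps from state 0.
s = 0
for _ in range(5):
    s = step(s, policy[s])
```

A nested dictionary is only one possible layout; for larger problems a dense array indexed by `(s, a, s')` is a common alternative.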

We're trying to find $\pi^*$, the optimal policy: the one that maximizes our total expected reward.

### Properties
* `markov property` — only the current state matters; the history of how we got there does not
* `stationarity` — the rules (the model $T$) don't change over time

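Our goal is to find $\pi^*$, the policy that maximizes the expected discounted sum of rewards, where $\gamma$ is the discount factor ($0 \le \gamma < 1$):
$$\pi^* = \arg\max_{\pi} E\left[ \sum_t \gamma^t R(s_t) \mid \pi\right]$$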

There's also the notion of `delayed reward`, which we can account for using `utility`: the expected discounted reward from starting in state $s$ and following $\pi$ from then on.
$$U^{\pi}(s) = E\left[ \sum_t \gamma^t R(s_t) \mid \pi, s_0 = s\right]$$
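
To make the discounted sum concrete, here is a small sketch that evaluates it for one fixed reward sequence. The function name `discounted_return` and the numbers are assumptions for illustration. Note that this is the return of a single trajectory, whereas the utility is the expectation of this quantity over all trajectories the policy can produce.

```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite sequence of rewards r_0, r_1, ..."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

# A reward of 1 that only arrives at t = 2 is worth gamma**2 today.
print(discounted_return([0.0, 0.0, 1.0], gamma=0.5))  # 0.25
```

The later a reward arrives, the more it is discounted, which is how `delayed reward` enters the utility.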

It's important to note that $R(s) \ne U^{\pi}(s)$

Now that we know `utility`, we can define our strategy at state $s$ by using:
$$\pi^*(s) = \arg\max_a \sum_{s'} T(s, a, s')U(s')$$
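
The rule above can be sketched in a few lines: for each state, pick the action with the highest expected utility of the next state. The toy model `T`, the utilities `U`, and the helper name `greedy_policy` are assumptions made up for this example.

```python
# Hypothetical toy MDP: T[s][a] is a list of (probability, next_state) pairs.
T = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}
U = {0: 1.0, 1: 5.0}  # assumed utilities, only for illustration

def greedy_policy(T, U):
    """For each state, choose argmax_a of sum_{s'} T(s, a, s') * U(s')."""
    policy = {}
    for s, outcomes_by_action in T.items():
        policy[s] = max(
            outcomes_by_action,
            key=lambda a: sum(p * U[s2] for p, s2 in outcomes_by_action[a]),
        )
    return policy

print(greedy_policy(T, U))  # {0: 'go', 1: 'stay'}
```

In state 0, `go` has expected utility 0.8 * 5 + 0.2 * 1 = 4.2 versus 1.0 for `stay`, so `go` is chosen. In state 1, `stay` keeps the utility of 5.0 while `go` leads to 1.0.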

`utility` factors in the long-term consequences, whereas `reward` is moment to moment.

### Bellman's Equation
$$U(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s') U(s')$$

How do we solve this equation? Well, if there are $N$ states, then there are $N$ equations in $N$ unknowns. Unfortunately, the equations are not linear because of the `max` component, so we can't solve them directly as a linear system.

To solve it, we use the following iterative algorithm:<br>
1) start with arbitrary utilities<br>
2) update utilities based on neighbors<br>
3) repeat until convergence

### Value Iteration
At every iteration, we update our estimate of the utility of state $s$: the new estimate is the actual reward we get for entering $s$, plus the discounted utility we expect from the best action, computed from the previous iteration's estimates.
$$\hat{U}_{t+1}(s) = R(s) + \gamma \max_a \sum_{s'} T(s, a, s')\hat{U}_t(s')$$

We start off with an arbitrary function, and slowly add <i>truth</i> enough times that it converges to the right answer.
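
A minimal sketch of this update loop, assuming a made-up two-state MDP stored as nested dictionaries. The function name `value_iteration`, the stopping tolerance `tol`, and the iteration cap `max_iters` are assumptions for illustration, not part of the notes above.

```python
# Hypothetical toy MDP: T[s][a] is a list of (probability, next_state) pairs.
T = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}
R = {0: 0.0, 1: 1.0}  # reward for entering each state

def value_iteration(T, R, gamma=0.9, tol=1e-8, max_iters=10_000):
    """Repeat the Bellman update until the estimates stop changing."""
    U = {s: 0.0 for s in T}  # arbitrary starting utilities
    for _ in range(max_iters):
        U_next = {
            s: R[s] + gamma * max(
                sum(p * U[s2] for p, s2 in outcomes)
                for outcomes in T[s].values()
            )
            for s in T
        }
        # Largest change in any state's estimate during this sweep.
        delta = max(abs(U_next[s] - U[s]) for s in T)
        U = U_next
        if delta < tol:
            break
    return U

U = value_iteration(T, R)
print(U)
```

For this toy MDP the fixed point can be checked by hand: staying in state 1 gives $U(1) = 1 + 0.9\,U(1)$, so $U(1) = 10$, and choosing `go` in state 0 gives $U(0) = 0.9\,(0.8 \cdot 10 + 0.2\,U(0))$, so $U(0) = 7.2 / 0.82 \approx 8.78$. The loop should land close to those values.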

### Policy Iteration
- start with $\pi_0$, an initial guess
- evaluate: given $\pi_t$ calculate $U_t = U^{\pi_t}$
- improve: $\pi_{t+1}(s) = \arg\max_a \sum_{s'} T(s, a, s')U_t(s')$

In the evaluate step the policy already fixes the action in every state, so we're able to get rid of the `max`. That leaves a linear system of `N` equations in `N` unknowns:

$$U_t(s) = R(s) + \gamma \sum_{s'} T(s, \pi_t(s), s')U_t(s')$$
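
The evaluate and improve steps can be sketched as follows, assuming the same kind of made-up two-state MDP. The evaluate step solves the linear system $(I - \gamma T_{\pi})U = R$ exactly. The helper names (`solve_linear`, `evaluate_policy`, `policy_iteration`) and the small Gauss-Jordan solver are assumptions for illustration; in practice a linear algebra library would usually do the solve.

```python
# Hypothetical toy MDP: T[s][a] is a list of (probability, next_state) pairs.
T = {
    0: {"stay": [(1.0, 0)], "go": [(0.8, 1), (0.2, 0)]},
    1: {"stay": [(1.0, 1)], "go": [(1.0, 0)]},
}
R = {0: 0.0, 1: 1.0}

def solve_linear(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]

def evaluate_policy(T, R, policy, gamma):
    """Solve U(s) = R(s) + gamma * sum_{s'} T(s, pi(s), s') U(s') for U."""
    states = list(T)
    idx = {s: i for i, s in enumerate(states)}
    n = len(states)
    # Build A = I - gamma * T_pi, one row per state.
    A = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for s in states:
        for p, s2 in T[s][policy[s]]:
            A[idx[s]][idx[s2]] -= gamma * p
    U = solve_linear(A, [R[s] for s in states])
    return dict(zip(states, U))

def policy_iteration(T, R, gamma=0.9):
    policy = {s: next(iter(T[s])) for s in T}  # arbitrary initial guess
    while True:
        U = evaluate_policy(T, R, policy, gamma)          # evaluate
        improved = {                                      # improve
            s: max(T[s], key=lambda a: sum(p * U[s2] for p, s2 in T[s][a]))
            for s in T
        }
        if improved == policy:  # policy is stable, so stop
            return policy, U
        policy = improved

policy, U = policy_iteration(T, R)
print(policy, U)
```

On this toy MDP the loop starts from `stay` everywhere, switches state 0 to `go` after the first evaluation, and then stops because the next improvement step changes nothing.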