In this notebook we introduce reinforcement learning formalised as a Markov Decision Process.

# Markov Decision Processes

**State**: $s_k \in \mathbb{S}$

**Action**: $a_k \in \mathbb{A}$

**Reward**: $r_k \in \mathbb{R}$

**Policy**: $\pi_k(a \mid s) = \text{Pr}(a_k = a \mid s_k = s)$

**Model**: $P(s', r|s, a)$

**Objective**: $\max_{\mathbf{a}}\sum_{k=0}^\infty \lambda^k r(s_k, a_k)$

**Markov Property**: $\text{Pr}(s_{k+1} = s', r_k = r \mid s_k, a_k, r_{k-1}, s_{k-1}, a_{k-1}, \ldots) = \text{Pr}(s_{k+1} = s', r_k = r \mid s_k, a_k)$
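To make the objective concrete, here is a minimal sketch that evaluates the discounted sum $\sum_k \lambda^k r_k$ for a finite reward sequence. The function name `discounted_return` and the example rewards are illustrative assumptions, not part of the original notebook.

```python
def discounted_return(rewards, lam):
    """Return sum_k lam**k * rewards[k] for a finite reward sequence."""
    total = 0.0
    # Accumulate from the last reward backwards: G_k = r_k + lam * G_{k+1}
    for r in reversed(rewards):
        total = r + lam * total
    return total


# Hypothetical reward sequence, chosen only for illustration.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, lam=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

With $\lambda = 0.5$, the reward of 2 received at step 2 contributes only 0.5 to the total, which shows how discounting down-weights later rewards.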


## Value Iteration

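The **objective** can be rewritten as a cost minimisation, with state $x_k$, control input $u_k$ and cost $c(x_k, u_k)$ in place of the state, action and (negated) reward:

$$\min_{\mathbf{u}}\sum_{k=0}^\infty \lambda^k c(x_k, u_k)$$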

Referring back to [`optimal_control.ipynb`](optimal_control.ipynb), we can show that the discounted Bellman equation is

$$V(x, k) = \min_u(c(x, u) + \lambda V(f(x, u), k + 1)), \quad k = h-1, h-2, ..., 1, 0$$

where $0 \leq \lambda \leq 1$ is the discount factor; a lower value places more weight on earlier costs (rewards).
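The backward recursion can be sketched in a few lines of Python. This is a minimal illustration, assuming a deterministic model $f$, a zero terminal cost $V(x, h) = 0$, and a small hypothetical problem (positions 0 to 4 on a line, unit cost until position 4 is reached); none of these specifics come from the original notebook.

```python
def backward_recursion(states, actions, f, c, lam, h):
    """Return V[k][x] for k = 0..h via V(x,k) = min_u(c(x,u) + lam*V(f(x,u),k+1))."""
    # Assumption: zero terminal cost, V(x, h) = 0 for every state.
    V = [{x: 0.0 for x in states} for _ in range(h + 1)]
    for k in range(h - 1, -1, -1):  # k = h-1, h-2, ..., 0
        for x in states:
            V[k][x] = min(c(x, u) + lam * V[k + 1][f(x, u)] for u in actions)
    return V


# Hypothetical toy problem: walk along positions 0..4, pay 1 per step away from 4.
states = range(5)
actions = (-1, 0, 1)
f = lambda x, u: min(max(x + u, 0), 4)   # deterministic dynamics, clipped to the grid
c = lambda x, u: 0.0 if x == 4 else 1.0  # unit cost until position 4 is reached

V = backward_recursion(states, actions, f, c, lam=1.0, h=3)
print(V[0])  # {0: 3.0, 1: 3.0, 2: 2.0, 3: 1.0, 4: 0.0}
```

Positions 0 and 1 share the value 3 because the horizon $h = 3$ ends before position 0 can reach the goal, so the value of a state depends on the time step $k$ as well as on $x$.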

Episodic problems have a finite horizon: they stop once the state reaches a stopping set $X_s$, which satisfies, for all $x \in X_s$ and every $u$:

- $f(x, u) \in X_s$
- $c(x, u) = 0$

Because termination is determined by the stopping set rather than by the time step, we do not need separate values for the same state at different times, which yields

$$V(x) = \min_u(c(x, u) + \lambda V(f(x, u)))$$

which is solved by the fixed-point iteration (value iteration)

$$V_{k+1}(x) = \min_u(c(x, u) + \lambda V_k(f(x, u)))$$
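The iteration above can be sketched as follows. This is a minimal example under stated assumptions: a deterministic model, an initial guess $V_0 = 0$, a sup-norm stopping tolerance `tol`, and a hypothetical problem with stopping set $X_s = \{4\}$ that is absorbing and cost-free, as the episodic conditions require.

```python
def value_iteration(states, actions, f, c, lam, tol=1e-10, max_iters=10_000):
    """Iterate V_{k+1}(x) = min_u(c(x,u) + lam*V_k(f(x,u))) until the update is below tol."""
    V = {x: 0.0 for x in states}  # V_0 = 0 (arbitrary initial guess)
    for _ in range(max_iters):
        V_next = {x: min(c(x, u) + lam * V[f(x, u)] for u in actions) for x in states}
        if max(abs(V_next[x] - V[x]) for x in states) < tol:
            return V_next
        V = V_next
    return V


# Hypothetical episodic problem: positions 0..4 with stopping set X_s = {4}.
X_s = {4}
states = range(5)
actions = (-1, 0, 1)
f = lambda x, u: x if x in X_s else min(max(x + u, 0), 4)  # X_s is absorbing
c = lambda x, u: 0.0 if x in X_s else 1.0                  # X_s is cost-free

V = value_iteration(states, actions, f, c, lam=0.9)
print({x: round(v, 3) for x, v in V.items()})
# {0: 3.439, 1: 2.71, 2: 1.9, 3: 1.0, 4: 0.0}
```

The result satisfies $V(x) = 1 + 0.9\,V(x+1)$ for $x < 4$ and $V(4) = 0$, which is the time-independent Bellman equation for this toy problem.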

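**Proof of Convergence** $(\lambda < 1)$: let $B$ denote the Bellman operator, $(BV)(x) = \min_u(c(x, u) + \lambda V(f(x, u)))$, so that the iteration reads $V_{k+1} = BV_k$. For any two value functions $V_1$ and $V_2$: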

$$
\begin{align*}
|BV_1(x) - BV_2(x)| &= |\min_u(c(x, u) + \lambda V_1(f(x, u))) - \min_u(c(x, u) + \lambda V_2(f(x, u)))| \\
&\leq \max_u|\lambda V_1(f(x,u)) - \lambda V_2(f(x,u))| \\
&= \lambda \max_u|V_1(f(x,u)) - V_2(f(x,u))|
\end{align*}
$$
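Since $f(x, u)$ is itself a state, the right-hand side is at most $\lambda \max_{x'}|V_1(x') - V_2(x')|$, so applying the operator should shrink the sup-norm distance between any two value functions by a factor of at least $\lambda$. The sketch below checks this numerically on random value functions. The toy dynamics, the cost and the helper names `bellman_operator` and `sup_distance` are assumptions made for illustration, and a numerical check of this kind supports the bound but does not replace the proof.

```python
import random


def bellman_operator(V, states, actions, f, c, lam):
    """Apply (BV)(x) = min_u(c(x,u) + lam*V(f(x,u))) to every state."""
    return {x: min(c(x, u) + lam * V[f(x, u)] for u in actions) for x in states}


def sup_distance(V1, V2):
    """Sup-norm distance max_x |V1(x) - V2(x)|."""
    return max(abs(V1[x] - V2[x]) for x in V1)


# Hypothetical toy problem, chosen only for illustration.
states = range(5)
actions = (-1, 0, 1)
f = lambda x, u: min(max(x + u, 0), 4)   # deterministic dynamics on positions 0..4
c = lambda x, u: abs(u) + (x - 2) ** 2   # arbitrary cost
lam = 0.9

random.seed(0)
worst_ratio = 0.0
for _ in range(1000):
    V1 = {x: random.uniform(-10, 10) for x in states}
    V2 = {x: random.uniform(-10, 10) for x in states}
    before = sup_distance(V1, V2)
    after = sup_distance(
        bellman_operator(V1, states, actions, f, c, lam),
        bellman_operator(V2, states, actions, f, c, lam),
    )
    worst_ratio = max(worst_ratio, after / before)

# The largest observed ratio should not exceed lam (up to rounding error).
print(worst_ratio, worst_ratio <= lam + 1e-9)
```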
