# Basic Concepts

In supervised learning, we saw algorithms that tried to make their outputs mimic the labels $y$ given in the training set. In that setting, the labels gave an unambiguous "right answer" for each of the inputs $x$. In contrast, for many sequential decision-making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the "correct" actions to take are to make it walk.

In the reinforcement learning framework, we will instead provide our algorithms only a reward function, which indicates to the learning agent when it is doing well, and when it is doing poorly. In the four-legged walking example, the reward function might give the robot positive rewards for moving forwards, and negative rewards for either moving backwards or falling over. It will then be the learning algorithm's job to figure out how to choose actions over time so as to obtain large rewards.

<center><img src="images/reinforce.png" width="500px"></center>

That is:

$$s_{t} \xrightarrow[]{\text{Policy}} a_{t} \xrightarrow[]{\text{Environment}} r_{t+1},s_{t+1}$$

Policy: $\pi(a|s)$, the probability of taking action $a$ in state $s$.

The environment controls the state transitions and the reward process, both of which have the Markov property:

$$p(s_{t+1}|s_{t},a_{t},\dots,s_{0},a_{0})=p(s_{t+1}|s_{t},a_{t})$$
$$p(r_{t+1}|s_{t},a_{t},\dots,s_{0},a_{0})=p(r_{t+1}|s_{t},a_{t})$$

Following the policy in the environment yields a trajectory $(s_{0},a_{0},r_{1},s_{1},a_{1},r_{2},s_{2},a_{2},\dots)$.
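The interaction loop above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two-state, two-action MDP (arrays `P` and `R`), the uniform policy `pi`, and the helper `sample_trajectory` are all hypothetical, and the reward is assumed to be deterministic given $(s_{t}, a_{t})$ to keep the example short.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy MDP (an assumption for illustration only).
# P[s, a, s'] = p(s' | s, a); R[s, a] = reward received after taking a in s.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)  # pi[s, a] = pi(a | s), uniform over both actions


def sample_trajectory(s0, steps):
    """Roll out `steps` transitions as (s_t, a_t, r_{t+1}, s_{t+1}) tuples."""
    traj, s = [], s0
    for _ in range(steps):
        a = int(rng.choice(2, p=pi[s]))            # policy picks the action
        r = float(R[s, a])                         # environment emits the reward
        s_next = int(rng.choice(2, p=P[s, a]))     # environment picks the next state
        traj.append((s, a, r, s_next))
        s = s_next
    return traj


trajectory = sample_trajectory(s0=0, steps=5)
```

Each tuple depends only on the current state and action, which is exactly what the Markov property requires.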

## Bellman equation

The discounted return, with discount rate $\gamma\in[0,1)$, is:

$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \dots$$
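For a finite reward sequence, the return can be accumulated backward using the recursion $G_{t} = R_{t+1} + \gamma G_{t+1}$. The following is a minimal sketch, assuming the sequence is truncated after finitely many rewards; the helper name `discounted_return` is illustrative.

```python
def discounted_return(rewards, gamma):
    """Return G_t for rewards = [R_{t+1}, R_{t+2}, ...] (finite list)."""
    g = 0.0
    # Walk backward so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# 1 + 0.5 * 0 + 0.5**2 * 2 = 1.5
example_return = discounted_return([1.0, 0.0, 2.0], gamma=0.5)
```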

The state value is the expected discounted return. The Bellman equation follows from expanding the return one step forward:

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[G_{t}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1}|S_{t}=s] + \gamma\mathbb{E}[G_{t+1}|S_{t}=s]
\end{aligned}
$$

The first term follows from the law of total expectation:

$$
\begin{aligned}
\mathbb{E}[R_{t+1}|S_{t}=s] &= \sum_{a\in\mathcal{A}}\pi(a|s)\mathbb{E}[R_{t+1}|S_{t}=s,A_{t}=a]\\
&= \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r
\end{aligned}
$$

The second term is obtained by conditioning on the next state and using the Markov property:

$$
\begin{aligned}
\mathbb{E}[G_{t+1}|S_{t}=s] &= \sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_{t}=s,S_{t+1}=s']p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}\mathbb{E}[G_{t+1}|S_{t+1}=s']p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}v_{\pi}(s')p(s'|s)\\
&= \sum_{s'\in\mathcal{S}}v_{\pi}(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s)
\end{aligned}
$$

This leads to:

$$
\begin{aligned}
v_{\pi}(s) &=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}v_{\pi}(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s)\\
&=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\right]
\end{aligned}
$$

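With $[r_{\pi}]_{s} = \sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r$ and $[P_{\pi}]_{s,s'} = \sum_{a\in\mathcal{A}}\pi(a|s)p(s'|s,a)$, the Bellman equation can be written in matrix form: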

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$
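Because the matrix form is linear in $v_{\pi}$, it can be solved directly as $v_{\pi} = (I - \gamma P_{\pi})^{-1} r_{\pi}$ when $\gamma < 1$, or by iterating the right-hand side. The sketch below does both on a hypothetical two-state, two-action MDP; the arrays `P`, `R`, and the uniform policy `pi` are assumptions made for illustration, with `R[s, a]` standing for the expected reward $\sum_{r}p(r|s,a)r$.

```python
import numpy as np

gamma = 0.9

# Hypothetical toy MDP (an assumption for illustration only).
# P[s, a, s'] = p(s' | s, a); R[s, a] = sum_r p(r | s, a) * r.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.full((2, 2), 0.5)  # pi[s, a] = pi(a | s)

# Policy-averaged reward vector and transition matrix.
r_pi = (pi * R).sum(axis=1)               # r_pi[s] = sum_a pi(a|s) R[s, a]
P_pi = np.einsum("sa,sat->st", pi, P)     # P_pi[s, s'] = sum_a pi(a|s) p(s'|s, a)

# Closed form: solve (I - gamma * P_pi) v = r_pi.
v_closed = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# Iterative form: repeatedly apply v <- r_pi + gamma * P_pi v.
v_iter = np.zeros(2)
for _ in range(1000):
    v_iter = r_pi + gamma * P_pi @ v_iter
```

For this toy problem both approaches give $v_{\pi} = (7.25, 7.75)$. The direct solve is convenient for small state spaces, while the iteration avoids forming and factoring $I - \gamma P_{\pi}$.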

In terms of the action values:

$$
\begin{aligned}
q_{\pi}(s, a) &= \sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\\
&= \sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)\sum_{a'\in\mathcal{A}}\pi(a'|s')q_{\pi}(s',a')
\end{aligned}
$$
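The two action-value identities can be checked numerically. The sketch below reuses a hypothetical two-state, two-action MDP (the arrays `P`, `R`, and the uniform policy `pi` are assumptions for illustration), computes $q_{\pi}$ from $v_{\pi}$, and confirms that $v_{\pi}(s)=\sum_{a}\pi(a|s)q_{\pi}(s,a)$ and that $q_{\pi}$ satisfies its own Bellman equation.

```python
import numpy as np

gamma = 0.9

# Hypothetical toy MDP (an assumption for illustration only).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])   # P[s, a, s'] = p(s' | s, a)
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                 # R[s, a] = expected reward
pi = np.full((2, 2), 0.5)                  # pi[s, a] = pi(a | s)

# State values from the matrix-form Bellman equation.
r_pi = (pi * R).sum(axis=1)
P_pi = np.einsum("sa,sat->st", pi, P)
v = np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)

# First identity: q(s, a) = R[s, a] + gamma * sum_s' p(s'|s, a) v(s').
q = R + gamma * (P @ v)

# Averaging q over the policy recovers v.
v_from_q = (pi * q).sum(axis=1)

# Second identity: right-hand side written purely in terms of q.
q_rhs = R + gamma * (P @ (pi * q).sum(axis=1))
```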

## Bellman optimality equation

While the ultimate goal of reinforcement learning is to obtain optimal policies, it is necessary to first define what an optimal policy is. The definition is based on the state values.

A policy $\pi^{\ast}$ is optimal if $v_{\pi^{\ast}}(s) \ge v_{\pi}(s)$ for all $s\in\mathcal{S}$ and for every policy $\pi$.

* Does an optimal policy exist?
* Is it unique?
* If it exists, how can it be obtained?

The Bellman optimality equation, in which the unknown is the value function $v$ and $q(s,a)$ abbreviates the bracketed term:

$$
\begin{aligned}
v(s) &= \underset{\pi\in\Pi}{\max} \sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v(s')\right]\\
&= \underset{\pi\in\Pi}{\max} \sum_{a\in\mathcal{A}}\pi(a|s)q(s, a)
\end{aligned}
$$

In matrix form, with the maximum taken elementwise:

$$v=\underset{\pi\in\Pi}{\max}(r_{\pi} + \gamma P_{\pi}v) = f(v)$$

One can show that $f$ is a contraction in the max norm, $\left \| f(v_{1}) - f(v_{2}) \right \|_{\infty} \le \gamma\left \| v_{1} - v_{2} \right \|_{\infty}$ with $\gamma < 1$. By the contraction mapping theorem, the Bellman optimality equation therefore has a unique solution $v^{\ast}$, and the iteration $v_{k+1} = f(v_{k})$ converges to it from any starting point.
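The contraction property suggests a simple way to compute the solution: apply $f$ repeatedly (value iteration). The sketch below is an illustration on a hypothetical two-state, two-action MDP (arrays `P` and `R` are assumptions). It uses the fact that a weighted average $\sum_{a}\pi(a|s)q(s,a)$ can never exceed the largest $q(s,a)$, so the maximum over policies reduces to a maximum over actions.

```python
import numpy as np

gamma = 0.9

# Hypothetical toy MDP (an assumption for illustration only).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])   # P[s, a, s'] = p(s' | s, a)
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                 # R[s, a] = expected reward


def f(v):
    """Right-hand side of the Bellman optimality equation."""
    q = R + gamma * (P @ v)    # q[s, a] for the current guess v
    return q.max(axis=1)       # best action in each state


# Numerical check of the contraction inequality on one random pair.
rng = np.random.default_rng(0)
v1, v2 = rng.normal(size=2), rng.normal(size=2)
lhs = np.abs(f(v1) - f(v2)).max()
rhs = gamma * np.abs(v1 - v2).max()

# Value iteration: v_{k+1} = f(v_k), starting from zero.
v_star = np.zeros(2)
for _ in range(2000):
    v_star = f(v_star)
```

For this toy problem the iteration converges to $v^{\ast} = (19, 20)$: staying in the second state earns a reward of 2 forever, and the first state reaches it in one step.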

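Next, we assert that the solution $v^{\ast}$ of the Bellman optimality equation is the optimal state value, and that the corresponding greedy policy

$$\pi^{\ast}(a|s)=\begin{cases}1, & a=\underset{a'\in\mathcal{A}}{\text{argmax}}\ q^{\ast}(s,a')\\ 0, & \text{otherwise}\end{cases}$$

is an optimal policy, where $q^{\ast}(s,a)=\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v^{\ast}(s')$ and ties in the argmax are broken arbitrarily.<br/>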
**Proof**: For any policy $\pi$, it holds that

$$v_{\pi} = r_{\pi} + \gamma P_{\pi}v_{\pi}$$

Since

$$v^{\ast} = \underset{\pi}{\max}(r_{\pi} + \gamma P_{\pi}v^{\ast}) = r_{\pi^{\ast}} + \gamma P_{\pi^{\ast}}v^{\ast} \ge r_{\pi} + \gamma P_{\pi}v^{\ast}$$

we have

$$v^{\ast} - v_{\pi} \ge (r_{\pi} + \gamma P_{\pi}v^{\ast}) - (r_{\pi} + \gamma P_{\pi}v_{\pi}) = \gamma P_{\pi}(v^{\ast} - v_{\pi})$$

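Repeatedly applying the above inequality, which is valid because $P_{\pi}$ has nonnegative entries, gives $v^{\ast} - v_{\pi} \ge \gamma P_{\pi}(v^{\ast} - v_{\pi}) \ge \gamma^{2} P_{\pi}^{2}(v^{\ast} - v_{\pi})\ge \dots$. It follows that

$$v^{\ast} - v_{\pi} \ge \lim_{n\to\infty}\gamma^{n} P_{\pi}^{n}(v^{\ast} - v_{\pi})=0$$

where the limit vanishes because $\gamma < 1$ and every entry of $P_{\pi}^{n}$ lies in $[0,1]$. Hence $v^{\ast} \ge v_{\pi}$ elementwise for every policy $\pi$. Since $v^{\ast} = r_{\pi^{\ast}} + \gamma P_{\pi^{\ast}}v^{\ast}$ is exactly the Bellman equation of $\pi^{\ast}$, we also have $v^{\ast} = v_{\pi^{\ast}}$, so $\pi^{\ast}$ is an optimal policy.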
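The result can be sanity-checked numerically. The sketch below is an illustration, not a proof: on a hypothetical two-state, two-action MDP (arrays `P` and `R` are assumptions) it computes $v^{\ast}$ by iterating the optimality equation, extracts the greedy policy, and compares $v^{\ast}$ against the values of randomly drawn stochastic policies. The helper `evaluate` is an illustrative name.

```python
import numpy as np

gamma = 0.9

# Hypothetical toy MDP (an assumption for illustration only).
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])   # P[s, a, s'] = p(s' | s, a)
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])                 # R[s, a] = expected reward


def evaluate(pi):
    """Solve v_pi = r_pi + gamma * P_pi v_pi for a policy pi[s, a]."""
    r_pi = (pi * R).sum(axis=1)
    P_pi = np.einsum("sa,sat->st", pi, P)
    return np.linalg.solve(np.eye(2) - gamma * P_pi, r_pi)


# Solve the Bellman optimality equation by fixed-point iteration.
v_star = np.zeros(2)
for _ in range(2000):
    v_star = (R + gamma * (P @ v_star)).max(axis=1)

# Greedy policy with respect to q*: one-hot on the argmax action.
q_star = R + gamma * (P @ v_star)
greedy_actions = q_star.argmax(axis=1)
pi_star = np.eye(2)[greedy_actions]        # pi_star[s, a]

# Compare against 100 random stochastic policies (rows sum to 1).
rng = np.random.default_rng(0)
random_policies = rng.dirichlet(np.ones(2), size=(100, 2))
worst_gap = min((v_star - evaluate(pi)).min() for pi in random_policies)
```

Here `worst_gap` is the smallest entry of $v^{\ast} - v_{\pi}$ over all sampled policies; it should be nonnegative up to floating-point error, and `evaluate(pi_star)` should reproduce `v_star`.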