# Basic Concepts

In supervised learning, we saw algorithms that tried to make their outputs mimic the labels $y$ given in the training set. In that setting, the labels gave an unambiguous "right answer" for each of the inputs $x$. In contrast, for many sequential decision-making and control problems, it is very difficult to provide this type of explicit supervision to a learning algorithm. For example, if we have just built a four-legged robot and are trying to program it to walk, then initially we have no idea what the "correct" actions to take are to make it walk.

In the reinforcement learning framework, we will instead provide our algorithms only a reward function, which indicates to the learning agent when it is doing well, and when it is doing poorly. In the four-legged walking example, the reward function might give the robot positive rewards for moving forwards, and negative rewards for either moving backwards or falling over. It will then be the learning algorithm's job to figure out how to choose actions over time so as to obtain large rewards.

<center><img src="images/reinforce.png" width="500px"></center>

That is:

$$s_{t} \xrightarrow[]{\text{Policy}} a_{t} \xrightarrow[]{\text{Environment}} r_{t+1},s_{t+1}$$

Policy: $\pi(a|s)$

The environment has the Markov property: the state transition $p(s'|s, a)$ and the reward process $p(r|s, a)$ depend only on the current state and action, not on the earlier history:

$$p(s_{t+1}|s_{t},a_{t},\dots,s_{0},a_{0})=p(s_{t+1}|s_{t},a_{t})$$
$$p(r_{t+1}|s_{t},a_{t},\dots,s_{0},a_{0})=p(r_{t+1}|s_{t},a_{t})$$

Running the policy in the environment produces a trajectory $(s_{0},a_{0},r_{1},s_{1},a_{1},r_{2},s_{2},a_{2},\dots)$
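As a minimal sketch of the interaction loop above, the following samples a short trajectory from a tabular policy and environment. The two-state, two-action MDP, its probabilities, and the names `policy`, `transition`, `reward`, and `sample_trajectory` are all hypothetical and chosen only for illustration; rewards are taken to be deterministic given $(s, a)$, which is a special case of $p(r|s,a)$.

```python
import random

# Hypothetical two-state, two-action MDP (illustrative numbers only).
STATES = [0, 1]
ACTIONS = [0, 1]

# policy[s][a] = pi(a|s)
policy = {0: [0.5, 0.5], 1: [0.9, 0.1]}

# transition[(s, a)][s_next] = p(s_next|s, a)
transition = {
    (0, 0): [0.8, 0.2],
    (0, 1): [0.1, 0.9],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}

# reward[(s, a)] = reward received after taking a in s
# (assumed deterministic here, a special case of p(r|s, a))
reward = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): -1.0, (1, 1): 2.0}


def sample_trajectory(s0, steps, rng):
    """Return [(s_0, a_0, r_1), (s_1, a_1, r_2), ...] with `steps` entries."""
    trajectory = []
    s = s0
    for _ in range(steps):
        # The policy picks an action from the current state only.
        a = rng.choices(ACTIONS, weights=policy[s])[0]
        # The environment answers using only (s, a): the Markov property.
        r = reward[(s, a)]
        s_next = rng.choices(STATES, weights=transition[(s, a)])[0]
        trajectory.append((s, a, r))
        s = s_next
    return trajectory


trajectory = sample_trajectory(s0=0, steps=5, rng=random.Random(0))
```

Each tuple holds $(s_{t}, a_{t}, r_{t+1})$, so flattening the list reproduces the ordering of the trajectory written above.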

## Bellman equation

Discounted return:

$$G_{t} = R_{t+1} + \gamma R_{t+2} + \gamma^{2}R_{t+3} + \dots$$
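For a finite list of rewards, the discounted return can be computed with a single backward pass, since $G_{t} = R_{t+1} + \gamma G_{t+1}$. The sketch below assumes the episode ends after the last listed reward; the function name `discounted_return` and the sample rewards are illustrative.

```python
def discounted_return(rewards, gamma):
    """Compute R_{t+1} + gamma * R_{t+2} + gamma^2 * R_{t+3} + ...

    `rewards` is the finite list [R_{t+1}, R_{t+2}, ...]; rewards after
    the end of the list are assumed to be zero.
    """
    g = 0.0
    # Walk backwards so each step applies G_k = R_{k+1} + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
g0 = discounted_return(rewards, gamma=0.5)  # 1 + 0.5 * 0 + 0.25 * 2 = 1.5
```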

The state value is the expected discounted return when starting from state $s$ and following policy $\pi$:

$$v_{\pi}(s) := \mathbb{E}[G_{t}|S_{t}=s]$$

The Bellman equation is obtained by expanding the return one step forward:

$$
\begin{aligned}
v_{\pi}(s) &= \mathbb{E}[G_{t}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1} + \gamma G_{t+1}|S_{t}=s]\\
&=\mathbb{E}[R_{t+1}|S_{t}=s] + \gamma\mathbb{E}[G_{t+1}|S_{t}=s]\\
&=\sum_{a\in\mathcal{A}}\pi(a|s)\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}v_{\pi}(s')\sum_{a\in\mathcal{A}}p(s'|s,a)\pi(a|s)\\
&=\sum_{a\in\mathcal{A}}\pi(a|s)\left[\sum_{r\in\mathcal{R}}p(r|s,a)r + \gamma\sum_{s'\in\mathcal{S}}p(s'|s,a)v_{\pi}(s')\right]
\end{aligned}
$$
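One way to make the Bellman equation concrete is to apply its right-hand side repeatedly to a guess for $v_{\pi}$ until the guess stops changing (iterative policy evaluation). The sketch below does this on a hypothetical two-state, two-action MDP; the numbers, the discount factor `GAMMA = 0.9`, and the names `pi`, `P`, `R`, `bellman_backup`, and `evaluate_policy` are assumptions made for illustration. `R[(s, a)]` stands for the expected reward $\sum_{r} p(r|s,a)r$.

```python
GAMMA = 0.9  # assumed discount factor
STATES = [0, 1]
ACTIONS = [0, 1]

# pi[s][a] = pi(a|s)
pi = {0: [0.5, 0.5], 1: [0.9, 0.1]}

# P[(s, a)][s_next] = p(s_next|s, a)
P = {
    (0, 0): [0.8, 0.2],
    (0, 1): [0.1, 0.9],
    (1, 0): [0.5, 0.5],
    (1, 1): [0.0, 1.0],
}

# R[(s, a)] = expected immediate reward, i.e. sum_r p(r|s, a) * r
R = {(0, 0): 0.0, (0, 1): 1.0, (1, 0): -1.0, (1, 1): 2.0}


def bellman_backup(v):
    """Apply the right-hand side of the Bellman equation to every state."""
    new_v = []
    for s in STATES:
        total = 0.0
        for a in ACTIONS:
            # gamma-discounted expected value of the next state
            expected_next = sum(P[(s, a)][s2] * v[s2] for s2 in STATES)
            total += pi[s][a] * (R[(s, a)] + GAMMA * expected_next)
        new_v.append(total)
    return new_v


def evaluate_policy(tol=1e-10, max_iters=10_000):
    """Iterate the backup from v = 0 until successive values differ by < tol."""
    v = [0.0 for _ in STATES]
    for _ in range(max_iters):
        new_v = bellman_backup(v)
        if max(abs(x - y) for x, y in zip(new_v, v)) < tol:
            return new_v
        v = new_v
    return v


v_pi = evaluate_policy()
```

Because the returned `v_pi` is (approximately) a fixed point of `bellman_backup`, it satisfies the Bellman equation for this toy MDP up to the chosen tolerance. For a small state space the same values could also be found by solving the equation directly as a linear system.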