# Reinforcement Learning, an introduction, Sutton and Barto

My open questions:
* find an example of an MDP where the optimal eps-greedy policy has a value function such that the greedy policy with respect to those values is not optimal (in the general sense).
    * the answer is in the exercise 6.6 - Cliff walking gridworld. The best eps-greedy policy (with fixed eps) learns to avoid the cliff, because occasionally choosing a random action next to the edge would make it fall off. The environment is deterministic, so the optimal policy is to walk right along the cliff.


## Chapter 2. Multi-arm Bandits

You are faced repeatedly with a choice among $k$ different options or actions. How do you choose so as to maximize your return over the long term?

Notation:
- $A_t$ - action at time t
- $R_t$ - reward after taking action $A_t$
- $q_*(a)=E[R_t|A_t=a]$ - value of action $a$
- $Q_t(a)$ - estimate of $q_*(a)$ at time t

**Action-value methods**
A natural way to estimate $q_*(a)$ is to compute the **sample average**:
$$Q_t(a) = \frac{\text{sum of rewards when a taken prior to t}}{\text{number of times a taken prior to t}}$$

and then choose the arm with the biggest estimate (**greedy action**). 

Incremental implementation:

$$Q_{n+1} = Q_n + \frac{1}{n}[R_n-Q_n]$$

For non-stationary problems a (**recency**) **weighted average** with a constant step size $\alpha\in(0,1]$ might work better:
$$Q_{n+1} = Q_n + \alpha[R_n-Q_n]$$
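
The two update rules above can be sketched in a few lines. This is only an illustration: the function names and the reward sequence are made up, and the initial estimate is assumed to be 0.

```python
def sample_average_update(q, n, reward):
    """Return the estimate after the n-th reward (n counts from 1)."""
    return q + (reward - q) / n


def constant_step_update(q, reward, alpha=0.1):
    """Recency-weighted update: recent rewards get exponentially more weight."""
    return q + alpha * (reward - q)


rewards = [1.0, 0.0, 2.0, 1.0]  # made-up rewards for a single arm

q_avg = 0.0
for n, r in enumerate(rewards, start=1):
    q_avg = sample_average_update(q_avg, n, r)

q_recent = 0.0
for r in rewards:
    q_recent = constant_step_update(q_recent, r, alpha=0.5)

print(q_avg)     # equals the plain mean of the rewards
print(q_recent)  # weights the later rewards more heavily
```

With the $\frac{1}{n}$ step size the result is exactly the arithmetic mean, while a constant $\alpha$ never stops reacting to new rewards, which is what we want when $q_*(a)$ drifts.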

The fundamental problem of multi-armed bandits is the **exploration vs exploitation trade-off**: deciding in each step whether to explore or to exploit. Some approaches:

- $\epsilon$-greedy - choose a random arm with probability $\epsilon$ and the greedy one with probability $1-\epsilon$,
- optimistic initial values - start with large value estimates for each arm, then choose greedily and update the estimates as you go (this way each arm gets tried several times early on),
- Upper Confidence Bound, you choose according to arm value:
$$A_t=\arg\max_a[Q_t(a)+c\sqrt{\frac{\log{t}}{N_t(a)}}]$$
where $c$ is a constant that controls the degree of exploration and $N_t(a)$ is the number of times arm $a$ was chosen prior to time $t$.
- gradient bandit algorithms - here we estimate a **policy** directly, which is a probability distribution on arms:
$$\pi_t(a) = P(A_t=a) = \frac{e^{H_t(a)}}{\sum_{b=1}^{k}e^{H_t(b)}}$$
where the preferences $H_t(a)$ are updated, after selecting arm $A_t$ and observing $R_t$, according to:
$$H_{t+1}(A_t)=H_t(A_t)+\alpha(R_t-\overline{R_t})(1-\pi_t(A_t))$$ and
$$H_{t+1}(a)=H_t(a)-\alpha(R_t-\overline{R_t})\pi_t(a)$$ for $a\neq A_t$. It can be shown that the rule above can be derived from:
$$H_{t+1}(a)=H_t(a)+\alpha \frac{\partial E[R_t]}{\partial H_t(a)}$$
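Here $\overline{R_t}$ is the average of the rewards observed so far and serves as a baseline. Subtracting it lowers the variance of the updates (it could be skipped without breaking the algorithm).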
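
The selection rules listed above could look roughly like this in code. It is a sketch under assumptions: the function names, the Gaussian reward noise with unit variance and the rule "try every untried arm first" in UCB are my choices for illustration, not something prescribed by the notes.

```python
import math
import random


def epsilon_greedy(q, epsilon, rng):
    """Pick a random arm with probability epsilon, else a greedy arm."""
    if rng.random() < epsilon:
        return rng.randrange(len(q))
    return max(range(len(q)), key=lambda a: q[a])


def ucb(q, counts, t, c=2.0):
    """Upper Confidence Bound selection; untried arms are picked first."""
    for a, n in enumerate(counts):
        if n == 0:
            return a
    return max(range(len(q)),
               key=lambda a: q[a] + c * math.sqrt(math.log(t) / counts[a]))


def softmax(h):
    """Turn preferences H(a) into a probability distribution over arms."""
    m = max(h)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in h]
    total = sum(exps)
    return [e / total for e in exps]


def gradient_bandit_update(h, action, reward, baseline, alpha):
    """One preference update; baseline plays the role of the average reward."""
    pi = softmax(h)
    return [
        x + alpha * (reward - baseline) * ((1.0 if a == action else 0.0) - pi[a])
        for a, x in enumerate(h)
    ]


def run_epsilon_greedy(true_means, steps=1000, epsilon=0.1, seed=0):
    """Play a Gaussian bandit and return (estimates, pull counts)."""
    rng = random.Random(seed)
    k = len(true_means)
    q, counts = [0.0] * k, [0] * k
    for _ in range(steps):
        a = epsilon_greedy(q, epsilon, rng)
        reward = rng.gauss(true_means[a], 1.0)
        counts[a] += 1
        q[a] += (reward - q[a]) / counts[a]  # incremental sample average
    return q, counts


q, counts = run_epsilon_greedy([0.1, 0.5, 0.9])
print(q, counts)
```

Note that the single list comprehension in `gradient_bandit_update` covers both cases of the update rule: the indicator is 1 for the selected arm and 0 for all the others.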



## Chapter 3. Finite Markov Decision Processes

This chapter basically introduces notation and sets up the problem for future chapters.

At each time step $t=0,1,2,3,...$ the agent receives some representation of the environment's **state**, $S_t\in\mathcal{S}$, where $\mathcal{S}$ is the set of all possible states. Then the agent chooses some **action** $A_t\in\mathcal{A}(S_t)$, where $\mathcal{A}(S_t)$ is the set of actions available in state $S_t$. One time step later the agent receives a numerical **reward** (as a consequence of its action), $R_{t+1} \in \mathcal{R} \subset\mathbb{R}$, and finds itself in a new state $S_{t+1}$.

At each step the agent implements a mapping from states to distributions over actions, which is called a **policy** and is denoted $\pi_t$, where $\pi_t(a|s)$ is the probability that $A_t=a$ if $S_t=s$.

The **return** is a sum of rewards over time:
$$G_t=\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}$$
where $\gamma\in[0,1]$ is the discount rate. It is usually equal to $1$ for **episodic tasks** (tasks that end with probability $1$, so that $R_t$ is always $0$ after some final time $T$) and less than $1$ for **continuing tasks**.
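
For a finite episode the return can be computed backwards using $G_t = R_{t+1} + \gamma G_{t+1}$. A minimal sketch (the function name and the reward list are illustrative):

```python
def discounted_return(rewards, gamma):
    """Compute G_0 from the list [R_1, R_2, ..., R_T] of one episode."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g  # G_t = R_{t+1} + gamma * G_{t+1}
    return g


print(discounted_return([1.0, 1.0, 1.0], gamma=0.5))  # 1 + 0.5 + 0.25
print(discounted_return([1.0, 1.0, 1.0], gamma=1.0))  # plain sum of rewards
```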

The environment has **the Markov property** when its response at time $t+1$ depends only on its state and the action taken at time $t$:
$$\mathbb{P}(S_{t+1},R_{t+1}|S_0,A_0,R_1,...,S_{t-1},A_{t-1},R_t,S_t,A_t) = \mathbb{P}(S_{t+1},R_{t+1}|S_t,A_t)$$

A reinforcement learning task that satisfies the Markov property is called a **Markov decision process** (MDP). If the state and action spaces are finite then it's called a **finite** MDP.

Some more notation:

$$p(s',r|s,a)=\mathbb{P}(S_{t+1}=s',R_{t+1}=r|S_t=s,A_t=a)$$
$$r(s,a) = \mathbb{E}[R_{t+1}|S_t=s,A_t=a] = \sum_{r\in\mathcal{R}} r\sum_{s'\in\mathcal{S}} p(s',r|s,a)$$

**Value function** for policy $\pi$:
$$v_\pi(s) = \mathbb{E}_\pi[G_t|S_t=s]=\mathbb{E}_\pi[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}|S_t=s]$$
**Action-value function** for policy $\pi$:
$$q_\pi(s,a) = \mathbb{E}_\pi[G_t|S_t=s,A_t=a]=\mathbb{E}_\pi[\sum_{k=0}^{\infty}\gamma^kR_{t+k+1}|S_t=s,A_t=a]$$

Bellman equation for value functions:
$$v_\pi(s)=\sum_a\pi(a|s)\sum_{s',r}p(s',r|s,a)(r+\gamma v_\pi(s'))$$
$$q_\pi(s,a)=\sum_{s',r}p(s',r|s,a)(r+\gamma \sum_{a'}\pi(a'|s') q_\pi(s',a'))$$

**Optimal value function** for finite MDP is:
$$v_*(s) = \max_\pi v_\pi(s)$$
$$q_*(s,a) = \max_\pi q_\pi(s,a)$$

**Bellman optimality equations are**:
$$v_*(s)=\max_a\sum_{s',r}p(s',r|s,a)[r+\gamma v_*(s')]$$
$$q_*(s,a)=\sum_{s',r}p(s',r|s,a)[r+\gamma \max_{a'}q_*(s',a')]$$

## Chapter 4. Dynamic Programming

**Key idea of DP:** Use value functions to organize search for good policies.

**Key characteristics:**
* complete knowledge of environment - knowing the transition probabilities $p(s',r|s,a)$

Observations:
* **Iterative policy evaluation:** We can use Bellman equations to compute value function for given policy $\pi$: 
$$v_{k+1}(s)=\mathbb{E}_{\pi}[R_{t+1}+\gamma v_k(S_{t+1})| S_t=s] = \sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)[r+\gamma v_k(s')]$$
* once we have a value function we can construct new policy that is greedy with respect to the value function:
    * if resulting policy is the same as the initial one - great! We have found the optimal policy.
    * if it's different then let's find its value function (it's certainly not worse because of the policy improvement theorem - see below).
* **Policy improvement theorem:**
$$\forall_s q_{\pi}(s, \pi'(s)) \geq v_{\pi}(s) \Rightarrow \forall_s v_{\pi'}(s)\geq v_{\pi}(s)$$
* **Policy iteration** is the procedure described above.
* The value function often does not need to converge for the greedy policy with respect to it to be better than the original one. **Value iteration** is based on this observation:
$$v_{k+1}(s)= \max_a\sum_{s',r}p(s',r|s,a)[r+\gamma v_k(s')]$$
After it converges, the policy that is greedy with respect to it will be optimal.
* It is not necessary to update the values of all states in one sweep. Even if we update some states more often than others, the algorithm will converge as long as every state gets updated infinitely many times in the limit. Such a procedure is called **asynchronous dynamic programming**.
* **Generalized policy iteration** is an idea of letting policy evaluation (finding its value function) and policy improvement (finding better policy based on value function for current policy) interact.
* DP methods are polynomial in the number of states and actions.
* DP methods assume complete knowledge of the MDP.
* **In-place updates** are ones in which only one table of values is kept. This means that, depending on the order of the sweep, some values are updated using other values that were already updated earlier in the same sweep.
* **Bootstrapping** - an idea of updating estimates based on other estimates.
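
Value iteration with in-place updates can be sketched as below. The representation of the model (`P[s][a]` as a list of `(probability, next_state, reward)` triples, with terminal states simply absent from `P`) and the tiny two-state MDP are assumptions made for this example only.

```python
def q_from_v(P, v, s, a, gamma):
    """Expected one-step return of taking a in s and then following values v."""
    return sum(p * (r + gamma * v.get(s2, 0.0)) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, theta=1e-10):
    """Sweep over states until the largest change in a sweep is below theta."""
    v = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_from_v(P, v, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - v[s]))
            v[s] = best  # in-place: later states in this sweep see the new value
        if delta < theta:
            return v


def greedy_policy(P, v, gamma=0.9):
    """Policy that is greedy with respect to the value function v."""
    return {s: max(P[s], key=lambda a: q_from_v(P, v, s, a, gamma)) for s in P}


# Hypothetical deterministic MDP: 0 -> 1 -> terminal "T", reward 1 on reaching "T".
P = {
    0: {"stay": [(1.0, 0, 0.0)], "right": [(1.0, 1, 0.0)]},
    1: {"left": [(1.0, 0, 0.0)], "right": [(1.0, "T", 1.0)]},
}

v_star = value_iteration(P)
print(v_star)                    # v(1) = 1, v(0) = gamma * v(1)
print(greedy_policy(P, v_star))  # go right in both states
```

Replacing the `max` over actions with an expectation under a fixed policy turns the same loop into iterative policy evaluation.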


## Chapter 5. Monte Carlo Methods

Monte Carlo methods are based only on experience. They don't require a model and do not bootstrap. If you have a model then you can simulate experience. The basic idea is (for episodic tasks): simulate episodes and estimate state-action values as means of observed returns.

Observations:
* with no model of the environment, given a state, we don't know which actions will lead us to which further states, so what we estimate will be state-action values (not state values)! (Based on values we still wouldn't know which action to choose.)
* **soft policy** is a policy that selects all actions in all states with non-zero probability.
* **off-policy learning** is the kind of learning where the policy used to generate episodes (**behaviour policy**) is different from the one being evaluated and improved (**target policy**).
* when we do off-policy learning we need to adjust the state-action values we are learning for the fact that the behaviour policy chooses actions with different frequencies than the target policy. Those adjustments are called **importance sampling**. There are two basic ways of doing it (and other, more complex ones):

    path weights: $$\rho_t^T=\frac{\prod_{k=t}^{T-1}\pi(A_k|S_k)p(S_{k+1}|S_k, A_k)}{\prod_{k=t}^{T-1}\mu(A_k|S_k)p(S_{k+1}|S_k, A_k)}=\prod_{k=t}^{T-1}\frac{\pi(A_k|S_k)}{\mu(A_k|S_k)}$$

    **ordinary importance sampling**: $$V(s)=\frac{\sum_{t\in \mathcal{T}(s)}\rho_t^{T(t)}G_t}{|\mathcal{T}(s)|}$$ 
    
    **weighted importance sampling**: $$V(s)=\frac{\sum_{t\in \mathcal{T}(s)}\rho_t^{T(t)}G_t}{\sum_{t\in \mathcal{T}(s)}\rho_t^{T(t)}}$$
    
    where we keep numbering time across episodes (if the first ends at time 100 then the next begins at time 101), $\mathcal{T}(s)$ is the set of all time steps in which state $s$ was visited and $T(t)$ is the time of the final step of the episode in which time $t$ occurred.
* in the final algorithm we estimate state-action values, and based on the algorithm from page 119 this is the estimate (with weighted importance sampling):
    $$Q(s,a)=\frac{\sum_{t\in \mathcal{T}(s,a)}\rho_{t+1}^{T(t+1)}G_t}{\sum_{t\in \mathcal{T}(s,a)}\rho_{t+1}^{T(t+1)}}$$
    
    where $\mathcal{T}(s,a)$ is the set of time steps at which action $a$ was taken in state $s$. I would compute it like this:
    $$Q(s,a)=\frac{\sum_{t\in \mathcal{T}(s,a)}\left(R_{t+1} + \gamma V(S_{t+1})\right)}{|\mathcal{T}(s,a)|}$$
    
    where $V(S_{t+1})$ is the weighted importance sampling version of value estimation given above. Hmm, **the only difference is in how the expected reward from taking action $a$ in state $s$ is computed**: in the first case it is a weighted average and in the second a sample average.
    
* Ordinary importance sampling is unbiased, but its variance is potentially unbounded (even infinite). Weighted importance sampling is only asymptotically unbiased, but it has bounded variance (if the returns are bounded then the variance converges to zero). In practice the latter is preferred.
* In the perfect scenario we are able to do **exploring starts** which means we can start an episode in arbitrary state taking arbitrary action. Then for policy evaluation we compute sample returns from exploring starts. To find optimal policy we interleave policy evaluation and improvement (we can speed it up by doing policy improvements after each episode). This procedure is called **Monte Carlo ES**. **Caveat:** it has not been formally proven that this procedure converges to optimal policy.
* My remark: a big disadvantage of MC is that it doesn't work if some possible policies make the episode infinite. A simple example is going deterministically back and forth between two states. It might be impossible to learn off-policy from such a behaviour policy. There might be some tweak to avoid this, but it's not obvious.
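
To see the difference between the two importance sampling estimators, here is a small sketch. The one-state task, the policies stored as nested dicts and all function names are assumptions made for illustration.

```python
def importance_ratio(steps, pi, mu):
    """rho for one trajectory: product of pi(a|s) / mu(a|s) over (state, action) steps."""
    rho = 1.0
    for s, a in steps:
        rho *= pi[s][a] / mu[s][a]
    return rho


def ordinary_is(rhos, returns):
    """Sum of weighted returns divided by the number of returns."""
    return sum(rho * g for rho, g in zip(rhos, returns)) / len(returns)


def weighted_is(rhos, returns):
    """Sum of weighted returns divided by the sum of the weights."""
    total = sum(rhos)
    if total == 0.0:
        return 0.0  # no observed trajectory is possible under the target policy
    return sum(rho * g for rho, g in zip(rhos, returns)) / total


# Hypothetical one-state, one-step task: action "a" pays 1, action "b" pays 0.
pi = {"s": {"a": 1.0, "b": 0.0}}  # target policy: always "a", so v_pi(s) = 1
mu = {"s": {"a": 0.5, "b": 0.5}}  # behaviour policy: uniform
episodes = [([("s", "a")], 1.0), ([("s", "b")], 0.0),
            ([("s", "a")], 1.0), ([("s", "a")], 1.0)]

rhos = [importance_ratio(steps, pi, mu) for steps, _ in episodes]
returns = [g for _, g in episodes]
v_ordinary = ordinary_is(rhos, returns)  # (2 + 0 + 2 + 2) / 4
v_weighted = weighted_is(rhos, returns)  # (2 + 0 + 2 + 2) / (2 + 0 + 2 + 2)
print(v_ordinary, v_weighted)
```

On this particular sample the ordinary estimate overshoots the true value of 1 (it is only right on average), while the weighted estimate hits it exactly, which hints at the variance difference described above.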

Some questions I had along the chapter:
* the state-action values we are learning are non-stationary, because we do value and policy iteration. (It becomes stationary at some point) Then why not use weighted averages instead of sample averages (like in bandits for non-stationary problems). Hmm, if it becomes stationary then it doesn't matter in the limit.
* how to explore smartly? Using the info we have so far (kind of using UCB)? Maybe limiting amount of exploration moves within a path?
* maybe some way to do approx of exploring starts?
* how to improve on long paths? (For episodic tasks. Unless you are punished for long episodes.)

## Chapter 6. Temporal-Difference Learning

* Temporal difference algorithms combine some features of Monte Carlo (sampling) and Dynamic Programming (bootstrapping).
* The basic idea is to update on every step (state, action, reward tuple) of a sample (this is known as TD(0) algorithm):
$$v(S_t) \leftarrow v(S_t) + \alpha [R_{t+1} + \gamma v(S_{t+1}) - v(S_t)]$$
using current estimates of values ($v(S_{t+1})$) and sample reward ($R_{t+1}$). The name 'temporal difference' comes from the fact that we use estimate from another time-step to compute current one.
* Value of a state is:
$$v_{\pi}(s) = \mathbb{E}_{\pi}[G_t|S_t=s] = \mathbb{E}_{\pi}[R_{t+1} + \gamma v_{\pi}(S_{t+1})|S_t=s]$$
Monte Carlo methods use the first expectation in the above equation as a target, while Dynamic Programming uses the second. TD combines both: it estimates the expected value from a sample like MC and uses the current estimate of $v_{\pi}(S_{t+1})$ like DP.
* Some summary of differences between DP, MC and TD:

| feature | TD | MC | DP |
| --- | --- | --- | --- |
| requires full knowledge of the environment (esp. transition probabilities) | ✖ | ✖ | ✔ |
| kind of backup | sample | sample | full |
| bootstrap (~ using estimates to compute estimates) | ✔ | ✖ | ✔ |
| online | fully incremental | has to wait for end of episode | ✖ |
| learns on exploration | ✔ | uses only on-policy parts| not applicable |
* it has not been proven formally, but in practice TD methods converge faster than constant-$\alpha$ MC methods ($v(S_t) \leftarrow v(S_t) + \alpha [G_t - v(S_t)] $) on stochastic tasks.
* **batch updating** is a learning method where value function increments are computed for every non-terminal step in the batch of data, then value function is updated by sum of those, then whole procedure is repeated until convergence.
* TD(0) and constant-$\alpha$ MC converge deterministically under batch updating (for sufficiently small $\alpha$), but to different answers! Batch Monte Carlo converges to the solution that minimizes mean-squared error on the training set, whereas batch TD(0) finds the estimates that would be correct for the maximum-likelihood model of the Markov process (the model whose probability of generating the data is the greatest). This doesn't hold for non-batch methods, but illustrates that TD generally moves in a better direction, which could explain why it is faster.
* A nice example of why TD is faster is batch learning of value function estimates on the following batch of episodes (each entry lists the visited states, each followed by its reward):

|||
| --- | --- |
| A,0,B,0 | B,1 |
| B,1 | B,1 |
| B,1 | B,1 |
| B,1 | B,0 |

MC would learn $v(A) = 0$ (the only sample for A gives return 0) while TD would learn $v(A)=\frac{3}{4}$ (A leads to B with estimated prob 1 and B leads to reward 1 with estimated prob $\frac{3}{4}$).
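
The two answers can be reproduced with a short script. This is a sketch assuming $\gamma=1$, zero initial values and a small step size; the function names and the number of sweeps are arbitrary choices.

```python
def batch_mc(episodes):
    """Average the observed (undiscounted) returns for every state."""
    returns = {}
    for ep in episodes:
        g = 0.0
        for state, reward in reversed(ep):  # walk backwards to accumulate returns
            g += reward
            returns.setdefault(state, []).append(g)
    return {s: sum(gs) / len(gs) for s, gs in returns.items()}


def batch_td0(episodes, alpha=0.01, sweeps=5000):
    """Batch TD(0): sum the increments over the whole batch, then apply them."""
    v = {s: 0.0 for ep in episodes for s, _ in ep}
    for _ in range(sweeps):
        inc = {s: 0.0 for s in v}
        for ep in episodes:
            for i, (s, r) in enumerate(ep):
                # value of the next state, or 0 if the episode ends here
                v_next = v[ep[i + 1][0]] if i + 1 < len(ep) else 0.0
                inc[s] += alpha * (r + v_next - v[s])
        for s in v:
            v[s] += inc[s]
    return v


# Each episode is a list of (state, reward received after leaving that state).
episodes = [[("A", 0), ("B", 0)]] + [[("B", 1)]] * 6 + [[("B", 0)]]

v_mc = batch_mc(episodes)
v_td = batch_td0(episodes)
print(v_mc)  # A: 0, B: 0.75
print(v_td)  # A and B both approach 0.75
```

Both methods agree on $v(B)$; they only disagree on $v(A)$, because TD reuses what it learned about B from the other seven episodes.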

* **SARSA** is an on-policy TD control method (an algorithm solving the problem of finding an optimal policy). The name comes from the tuple $(S_t,A_t,R_{t+1},S_{t+1},A_{t+1})$ that is used to compute the value update:
    * $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma Q(S_{t+1},A_{t+1})-Q(S_t,A_t)]$
    * if $S_{t+1}$ is terminal then $Q(S_{t+1},A_{t+1})$ is defined as $0$
    * SARSA is not formally proven to converge, but some unofficial results claim that it converges with probability 1 to an optimal policy as long as all state-action pairs are visited an infinite number of times and the policy converges in the limit to the greedy one (this can be achieved by making it $\epsilon$-greedy with $\epsilon$ converging to 0)
    * SARSA doesn't have the problem (which MC methods do) with learning from a policy that produces an infinite episode (going back and forth between two states), as it learns during the episode (and changes the policy).
* **Q-learning** is an off-policy TD control method defined by:
    * $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma \max_aQ(S_{t+1},a)-Q(S_t,A_t)]$
    * convergence (with probability 1) is guaranteed when all state-action pairs are visited infinitely many times and the usual stochastic approximation conditions hold for the sequence of step-size parameters (see page 35: $\sum_{n=1}^{\infty}\alpha_n(a)=\infty$ and $\sum_{n=1}^{\infty}\alpha_n^2(a)<\infty$).
* **Expected SARSA** is a learning algorithm that uses the expected value (with respect to the current policy) of the next state to update the current one:
    * $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma \mathbb{E}_{\pi}[Q(S_{t+1}, A)|S_{t+1}] -Q(S_t,A_t)]$, which is:
    * $Q(S_t,A_t) \leftarrow Q(S_t,A_t) + \alpha [R_{t+1} + \gamma \sum_a \pi(a|S_{t+1})Q(S_{t+1}, a) -Q(S_t,A_t)]$
    * it eliminates the variance that comes from sampling the action $A_{t+1}$ (as in SARSA), which is why we might expect it to perform better (and it does in practice),
    * it can be either an on- or off-policy algorithm (depending on the choice of behaviour policy),
    * when state transitions are deterministic it can safely work with $\alpha = 1$.
* **Maximization Bias and Double learning**:
    * imagine a MDP where you have two nonterminal states A and B:
    
    ![maximization bias example](figs/fig_6_8_part.png)
    
    * you always start at A; going right from A ends the episode with reward 0, going left leads to B with reward 0, and every action going left from B gives a reward sampled from the normal distribution $N(-0.1, 1)$,
    * the optimal policy is to always go right, but due to randomness some of the actions going left from B will initially appear to have positive value (their estimates converge to $-0.1$ with time), so Q-learning, which backs up the currently highest action value, will be biased towards going left. The bias is induced by using the maximum in the TD target, which is why it is called maximization bias (this generalizes to any algorithm that relies on maximization for choosing a policy).
    * a key problem is that we use the same estimates both for choosing the maximizing action and for estimating its value,
    * suppose we divide the steps into two groups and use each group to learn a separate Q-function, so we would have $Q_1(s,a)$ and $Q_2(s,a)$ for every state-action pair. We then use one of them to choose the maximizing action and the other to estimate its value:
$$Q_1(S_t,A_t) \leftarrow Q_1(S_t,A_t) + \alpha [R_{t+1} + \gamma Q_2(S_{t+1},\arg\max_aQ_1(S_{t+1},a))-Q_1(S_t,A_t)]$$
    * learning $Q_2$ is done analogously (by swapping the indices), with a coin flip deciding on each step which of the two gets updated
    * the procedure of learning two Q functions is called **double Q-learning**; it can be applied to other methods like SARSA or Expected SARSA
* **Games, afterstates and special cases**:
    * the methods presented so far can be tailored for concrete problems at hand to get more adequate solutions
    * one example is considering **after-states** instead of states: in games where transitions between states are deterministic, we might want to compute the values of the states reached after choosing a particular action in a given state. This has great value when (for example) choosing action $a_1$ in state $s_1$ leads to the same state $s$ as choosing $a_2$ in $s_2$: by focusing on after-states we keep only one value estimate for both (which speeds up learning and reduces the number of state-action pairs to remember)
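
The SARSA, Q-learning and Expected SARSA updates differ only in the target they move towards, which is easy to see side by side. In this sketch the table `Q` is assumed to be a `defaultdict` keyed by `(state, action)`, and the function names and the `terminal` flag are illustrative.

```python
from collections import defaultdict


def sarsa_update(Q, s, a, r, s2, a2, alpha, gamma, terminal=False):
    """Target uses the action a2 actually selected in s2 (on-policy)."""
    target = r if terminal else r + gamma * Q[(s2, a2)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def q_learning_update(Q, s, a, r, s2, actions, alpha, gamma, terminal=False):
    """Target uses the best action in s2, whatever is taken next (off-policy)."""
    target = r if terminal else r + gamma * max(Q[(s2, b)] for b in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def expected_sarsa_update(Q, s, a, r, s2, probs, alpha, gamma, terminal=False):
    """Target averages over actions; probs maps action -> pi(action | s2)."""
    expected = sum(p * Q[(s2, b)] for b, p in probs.items())
    target = r if terminal else r + gamma * expected
    Q[(s, a)] += alpha * (target - Q[(s, a)])


def fresh_table():
    """Made-up starting table: the next state s2 has action values 1 and 3."""
    Q = defaultdict(float)
    Q[("s2", "x")], Q[("s2", "y")] = 1.0, 3.0
    return Q


Q_sarsa, Q_ql, Q_exp = fresh_table(), fresh_table(), fresh_table()
sarsa_update(Q_sarsa, "s", "a", 1.0, "s2", "x", alpha=0.5, gamma=0.5)
q_learning_update(Q_ql, "s", "a", 1.0, "s2", ["x", "y"], alpha=0.5, gamma=0.5)
expected_sarsa_update(Q_exp, "s", "a", 1.0, "s2", {"x": 0.5, "y": 0.5},
                      alpha=0.5, gamma=0.5)
print(Q_sarsa[("s", "a")], Q_ql[("s", "a")], Q_exp[("s", "a")])
```

With a greedy `probs` distribution the Expected SARSA target coincides with the Q-learning one.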
    

## Chapter 7. Multi-step Bootstrapping

* The TD method presented in chapter 6 first updates states that are close to terminal ones, and it might take some time to update the rest. One way to amend this, using the fact that we have a whole sequence of rewards from an episode, is to use:
$$v(S_t) \leftarrow v(S_t) + \alpha [R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^{n} v(S_{t+n}) - v(S_t)]$$, for some $n>0$ instead of:
$$v(S_t) \leftarrow v(S_t) + \alpha [R_{t+1} + \gamma v(S_{t+1}) - v(S_t)]$$
* This is called **$n$-step TD method** and is a kind of intermediate algorithm between TD(0) and Monte Carlo. The $n$-step update is done towards a target called $n$-step return:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^{n} V_{t+n-1}(S_{t+n})$$
We will be making slight modifications to this equation to adjust it for usage with previous (SARSA, Q-learning) and new algorithms.
* The expectation of the $n$-step return is guaranteed to be a better estimate of $v_{\pi}$ than $V_{t+n-1}$ in a worst-state sense - **the error reduction property** of $n$-step returns:
$$\max_s\lvert \mathbb{E}_{\pi}[G_t^{(n)}|S_t=s] - v_{\pi}(s)\rvert \leq \gamma^n \max_s \lvert V_{t+n-1}(s) - v_{\pi}(s)\rvert $$
* based on an example of a random walk over 19 states where you can only walk left or right (you can imagine it as walking along a horizontal line over the integers 1 to 19, where 0 and 20 are terminal states), it seems that, as we would expect, an $n$-step TD method with an intermediate $n$ can outperform both TD(0) and MC.
* **$n$-step SARSA** - we can apply $G_t^{(n)}$ to SARSA presented in chapter 6 to obtain $n$-step SARSA:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^{n} Q_{t+n-1}(S_{t+n}, A_{t+n})$$
where $G_t^{(n)} = G_t$ for $t+n \geq T$, and the update is:
$$Q_{t+n}(S_t,A_t) \leftarrow Q_{t+n-1}(S_t,A_t) + \alpha [G_{t}^{(n)}-Q_{t+n-1}(S_t,A_t)]$$
* to get **$n$-step Expected SARSA** we change the return to:
$$G_t^{(n)} = R_{t+1} + \gamma R_{t+2} + ... + \gamma^{n-1} R_{t+n} + \gamma^{n}\sum_a \pi(a|S_{t+n}) Q_{t+n-1}(S_{t+n}, a)$$
* we can modify $n$-step SARSA and expected SARSA by **adding importance sampling to allow for off-policy learning**:
$$\rho_t^{t+n}=\prod_{k=t}^{\min(t+n-1, T-1)}\frac{\pi(A_k|S_k)}{\mu(A_k|S_k)}$$
by adding them to the update equation:
$$Q_{t+n}(S_t,A_t) \leftarrow Q_{t+n-1}(S_t,A_t) + \alpha \rho_{t+1}^{t+n}[G_{t}^{(n)}-Q_{t+n-1}(S_t,A_t)]$$
for expected SARSA we would use $\rho_{t+1}^{t+n-1}$ (as it uses expectation in the last term of $G_{t}^{(n)}$).
* **$n$-step Tree Backup** algorithm is a kind of generalization of Expected SARSA (or different way of adjusting it to use $n$-step backups). The idea is to take expectation for every step of the $n$-step update (instead of only at the end). It takes some effort to derive it (see Sutton, page 160). We compute returns in the following way:
$$V_t = \sum_a \pi(a|S_t)Q_{t-1}(S_t, a)$$
$$\delta_t = R_{t+1} + \gamma V_{t+1} - Q_{t-1}(S_t, A_t)$$
$$G_t^{(n)} = Q_{t-1}(S_t, A_t) + \sum_{k=t}^{\min(t+n-1, T-1)}\delta_k \prod_{i=t+1}^k \gamma\pi(A_i|S_i)$$
* all the algorithms presented so far in this chapter can be seen as special cases of a more general algorithm, **$n$-step Q($\sigma$)**. One variation on what we've seen so far would be an algorithm that, within the $n$-step update, uses samples for some of the steps (like SARSA) and expectations for the rest (like tree backup). It could be generalised further to use a fraction $\sigma \in [0, 1]$ of the sample backup and $1-\sigma$ of the expected one at a given step. A crude way of doing this would be to:
    * take the $G_t^{(n)}$ for $n$-step SARSA with importance sampling and multiply the appropriate components of the sum by $\sigma$
    * take the $G_t^{(n)}$ for tree backup and multiply the appropriate components of the sum by $1-\sigma$
    * then sum the two and use the result as the final $G_t^{(n)}$.

A more compact way of doing this and the complete algorithm are given in the book. It looks interesting, but the question remains whether it makes sense to do so. It might be a way of reducing the variance of the sample updates of SARSA while keeping its ability to make big changes, especially in the beginning (this is my speculation).
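
The basic $n$-step return for state values can be computed as follows. The indexing convention is an assumption of this sketch: `rewards[k]` holds $R_{k+1}$, `values[k]` holds the current estimate $V(S_k)$, and the episode ends at $T$ = `len(rewards)`.

```python
def n_step_return(rewards, values, t, n, gamma):
    """G_t^{(n)}: n discounted rewards plus a bootstrapped tail value."""
    T = len(rewards)
    g = 0.0
    for k in range(t, min(t + n, T)):
        g += gamma ** (k - t) * rewards[k]  # rewards[k] is R_{k+1}
    if t + n < T:
        g += gamma ** n * values[t + n]     # bootstrap from V(S_{t+n})
    return g                                # for t + n >= T this is the full return G_t


def n_step_td_update(values, rewards, t, n, alpha, gamma):
    """Move V(S_t) towards the n-step return (updates the list in place)."""
    g = n_step_return(rewards, values, t, n, gamma)
    values[t] += alpha * (g - values[t])


rewards = [0.0, 0.0, 1.0]  # made-up episode of length T = 3
values = [0.1, 0.2, 0.3]   # made-up current estimates of V(S_0), V(S_1), V(S_2)

print(n_step_return(rewards, values, t=0, n=1, gamma=1.0))  # TD(0) target
print(n_step_return(rewards, values, t=0, n=3, gamma=1.0))  # Monte Carlo return
```

Setting $n=1$ recovers the TD(0) target and any $n \geq T-t$ recovers the Monte Carlo return, which is the sense in which the method sits between the two.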


## Chapter 8. Planning and Learning with Tabular Methods

* A **model** of the environment is anything that serves as a tool to predict future states and rewards based on current ones. Distribution models try to compute the whole distribution of states and rewards, while sample models are only able to return a sample. Distribution models are stronger (as they also allow us to draw a sample), but might be harder to estimate.
* Searching for a policy based on model of environment is called **planning**, while searching based on experience is **learning**.
* **Plan-space planning** covers methods that search directly through the space of plans (for example with evolutionary methods). **State-space planning** involves searching through the space of states for an optimal policy or a path towards a goal.
* Having models we can **simulate experience** (state transitions with rewards or even whole episodes) on which we can use methods from previous chapters to train policy.
* Relationships between learning and planning:
$$\text{value/policy} \xrightarrow{\text{acting}} \text{experience} \xrightarrow{\text{direct RL}} \text{value/policy}$$
$$\text{value/policy} \xrightarrow{\text{acting}} \text{experience} \xrightarrow{\text{model learning}} \text{model} \xrightarrow{\text{planning}} \text{value/policy}$$
* We can combine both of the above to facilitate RL algorithms. One example is the Dyna architecture, which does exactly that.
* Tabular Dyna-Q:
    0. Initialize $Q(s,a)$ and $Model(s,a)$ for all $s \in \mathcal{S}$, $a \in \mathcal{A}(s)$
    1. $S \leftarrow$ current (nonterminal) state
    2. $A \leftarrow \epsilon$-greedy$(S,Q)$
    3. Execute $A$, observe $R$ and $S'$
    4. $Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_a Q(S',a) - Q(S,A)]$
    5. $Model(S,A) \leftarrow R,S'$ (the model simply memorizes the last observed transition; we assume a deterministic environment)
    6. Repeat $n$ times:
        * $S \leftarrow $ random previously observed state
        * $A \leftarrow$ random action previously taken in $S$
        * $R, S' \leftarrow Model(S,A)$
        * $Q(S,A) \leftarrow Q(S,A) + \alpha [R + \gamma \max_a Q(S',a) - Q(S,A)]$
    7. go to 2.
* Adding planning to RL greatly speeds up learning in tabular environments, but this may not generalise easily. When the model is wrong we may be learning a suboptimal policy. There are two ways of this happening and their influence is asymmetric:
    * we are over-optimistic about some states - this is the better case, because we will be acting upon it a lot, which in turn will make us learn the true value of the state (the bias will disappear),
    * we are too pessimistic about a state - in this case we might stop visiting it and won't be able to learn its true value. (This may also be a problem when the environment itself changes - we might want to explore states for this reason.)
    
    This is basically the exploration vs exploitation problem in a planning setting. Dyna-Q+ is an algorithm that in the planning process (step 6 of tabular Dyna-Q) adds a bonus reward to actions depending on how long they haven't been tried: $r + \kappa \sqrt{\tau}$, where $\kappa$ is some small constant and $\tau$ is the number of time steps since the action was last tried. Another change introduced by Dyna-Q+ is that actions never yet taken in visited states are also considered in planning, with an initial reward of 0 and leading back to the same state.
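
A compact sketch of tabular Dyna-Q (without the Dyna-Q+ bonus) follows. The corridor environment, the `step` interface returning `(reward, next_state, done)` and all hyperparameters are hypothetical. As a simplification, planning samples uniformly from the observed state-action pairs rather than first picking a state and then an action.

```python
import random
from collections import defaultdict


def corridor_step(state, action):
    """Hypothetical deterministic environment: states 0..4, state 4 is terminal."""
    nxt = max(0, min(4, state + (1 if action == "right" else -1)))
    return (1.0 if nxt == 4 else 0.0), nxt, nxt == 4


def dyna_q(step, start, actions, episodes=30, n_planning=10,
           alpha=0.5, gamma=0.9, epsilon=0.1, seed=0):
    rng = random.Random(seed)
    Q = defaultdict(float)
    model = {}  # (state, action) -> last observed (reward, next_state, done)

    def greedy(s):
        best = max(Q[(s, a)] for a in actions)
        return rng.choice([a for a in actions if Q[(s, a)] == best])  # random ties

    def backup(s, a, r, s2, done):
        target = r if done else r + gamma * max(Q[(s2, b)] for b in actions)
        Q[(s, a)] += alpha * (target - Q[(s, a)])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = rng.choice(actions) if rng.random() < epsilon else greedy(s)
            r, s2, done = step(s, a)
            backup(s, a, r, s2, done)       # direct RL from real experience
            model[(s, a)] = (r, s2, done)   # model learning
            for _ in range(n_planning):     # planning from simulated experience
                ps, pa = rng.choice(list(model))
                pr, ps2, pdone = model[(ps, pa)]
                backup(ps, pa, pr, ps2, pdone)
            s = s2
    return Q


Q = dyna_q(corridor_step, start=0, actions=["left", "right"])
print({s: max(Q[(s, a)] for a in ["left", "right"]) for s in range(4)})
```

Setting `n_planning=0` turns this into plain one-step Q-learning, which makes it easy to compare how quickly values propagate back from the goal with and without planning.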
