# Chap 10: On-policy Control with Approximation 

## 10.1 Episodic Semi-gradient Control
- **Goal**: approximate the action-value function, __$\hat{q} \approx q_\pi$__
- **Gradient-update**:<br>
$
\begin{align}
\boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \alpha[U_t - \hat{q}(S_t, A_t, \boldsymbol{w}_t)]\bigtriangledown\hat{q}(S_t, A_t, \boldsymbol{w}_t)
\end{align}
$
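where $U_t$ is the update target: any approximation of $q_\pi(S_t, A_t)$, such as the Monte Carlo return $G_t$ or an _n_-step Sarsa return.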
- For one-step Sarsa:<br>
$
\begin{align}
\boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \alpha[R_{t+1} + \gamma\hat{q}(S_{t+1}, A_{t+1}, \boldsymbol{w}_t) - \hat{q}(S_t, A_t, \boldsymbol{w}_t)]\bigtriangledown\hat{q}(S_t, A_t, \boldsymbol{w}_t)
\end{align}
$
- To form **control methods**, we need to couple such action-value prediction methods with techniques for policy improvement and action selection. Suitable techniques for continuous actions, or for actions from large discrete sets, are a topic of ongoing research with as yet no clear resolution.
- **Pseudo-code** for Semi-gradient Sarsa:<br>
    <img src='./images/chap10/chap10_1_ Sarsa Pseudo code.png' style='width: 600px'>
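
The one-step update and the action-selection step can be sketched in a few lines of Python. This is a minimal illustration, assuming a linear approximator $\hat{q}(s,a,\boldsymbol{w}) = \boldsymbol{w}^\top\boldsymbol{x}(s,a)$ (so the gradient is just the feature vector) and ε-greedy action selection. The names `q_hat`, `epsilon_greedy`, `sarsa_update` and `feature_fn` are hypothetical helpers, not part of the notes above.

```python
import random
from typing import Callable, List, Optional, Sequence


def q_hat(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear action-value estimate: q_hat(s, a, w) = w . x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def epsilon_greedy(w: Sequence[float],
                   feature_fn: Callable[[int, int], List[float]],
                   state: int,
                   actions: Sequence[int],
                   epsilon: float,
                   rng: random.Random) -> int:
    """Pick a random action with probability epsilon, else a greedy one."""
    if rng.random() < epsilon:
        return rng.choice(list(actions))
    return max(actions, key=lambda a: q_hat(w, feature_fn(state, a)))


def sarsa_update(w: Sequence[float],
                 x: Sequence[float],
                 reward: float,
                 x_next: Optional[Sequence[float]],
                 alpha: float,
                 gamma: float) -> List[float]:
    """One semi-gradient Sarsa step; x_next=None marks a terminal transition."""
    # Update target U_t: just R at termination, otherwise R + gamma * q_hat(S', A', w).
    target = reward if x_next is None else reward + gamma * q_hat(w, x_next)
    td_error = target - q_hat(w, x)
    # For a linear q_hat, the gradient with respect to w is the feature vector x.
    return [wi + alpha * td_error * xi for wi, xi in zip(w, x)]


# Usage: two updates with one-hot features (an assumption made for illustration).
w = sarsa_update([0.0, 0.0], [1.0, 0.0], 1.0, [0.0, 1.0], alpha=0.5, gamma=0.9)
w = sarsa_update(w, [0.0, 1.0], 2.0, None, alpha=0.5, gamma=0.9)
```

The target is computed from the current weights and treated as a constant when differentiating, which is what makes the method "semi-gradient".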


## 10.2 Semi-gradient n-step Sarsa
- **_n_-step return**:<br>
$
\begin{align}
G_{t:t+n} &= R_{t+1} + \gamma R_{t+2} + \cdots + \gamma^{n-1}R_{t+n} + \gamma^n\hat{q}(S_{t+n}, A_{t+n}, \boldsymbol{w}_{t+n-1}), \quad t+n<T\\ \\
\boldsymbol{w}_{t+n} &= \boldsymbol{w}_{t+n-1} + \alpha [G_{t:t+n} - \hat{q}(S_t, A_t, \boldsymbol{w}_{t+n-1})]\bigtriangledown\hat{q}(S_t, A_t, \boldsymbol{w}_{t+n-1}), \quad 0 \le t < T
\end{align}
$
- **Pseudo-code** for Semi-gradient n-step Sarsa:<br>
    <img src='./images/chap10/chap10_2_ n-step Sarsa Pseudo code.png' style='width: 600px'>
- On the Mountain Car task, this algorithm tends to learn **faster at n=8 than at n=1**:<br>
    <img src='./images/chap10/chap10_fig10-3.png' style='width: 700px'>
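
The _n_-step return above can be computed with a short backward recursion. The sketch below is only an illustration. `n_step_return` is a hypothetical helper, and `bootstrap_value` stands for $\hat{q}(S_{t+n}, A_{t+n}, \boldsymbol{w}_{t+n-1})$, which the caller is assumed to have evaluated already.

```python
from typing import Sequence


def n_step_return(rewards: Sequence[float],
                  gamma: float,
                  bootstrap_value: float = 0.0) -> float:
    """Return R_{t+1} + gamma*R_{t+2} + ... + gamma^(n-1)*R_{t+n} + gamma^n * bootstrap_value.

    `rewards` holds R_{t+1}, ..., R_{t+n} in time order.
    """
    g = bootstrap_value
    # Work backwards so that each earlier reward discounts everything after it once.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Usage: n = 3, gamma = 0.5, bootstrap estimate 8.0
g = n_step_return([1.0, 2.0, 3.0], gamma=0.5, bootstrap_value=8.0)
# 1 + 0.5*2 + 0.25*3 + 0.125*8 = 3.75
```

If the episode ends within the next _n_ steps, passing only the remaining rewards and leaving `bootstrap_value` at zero gives the ordinary full return.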

## 10.3 Average Reward: A New Problem Setting for Continuing Tasks
+ **Average reward**:
    - Applies to continuing problems
    - No discounting
    - Commonly used in dynamic programming, less often in reinforcement learning
    - A better setting than discounting when function approximation is used (see 10.4)
+ **Quality of policy $\pi$**: average rate of reward<br>
    $
    \begin{align}
    r(\pi) &= \lim_{h\to\infty} \frac{1}{h}\sum\limits_{t=1}^{h} \mathbb{E}[R_t | S_0, A_{0:t-1}\sim\pi] \\
           &= \lim_{t\to\infty} \mathbb{E}[R_t | S_0, A_{0:t-1}\sim\pi]\\
           &= \sum_{s}\mu_{\pi}(s)\sum_{a}\pi(a|s)\sum_{s',r}p(s',r|s,a)r
    \end{align}
    $
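+ **$\mu_{\pi}$** is the steady-state distribution, $\mu_{\pi}(s) = \lim_{t\to\infty}\Pr\{S_t = s | A_{0:t-1}\sim\pi\}$, which is left unchanged by one step under $\pi$:<br>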
    $
    \begin{align}
    \sum_{s}\mu_{\pi}(s)\sum_{a}\pi(a|s)p(s'|s,a) = \mu_{\pi}(s')
    \end{align}
    $
+ **Differential return**:<br>
    $
    \begin{align}
    G_t = R_{t+1} - r(\pi) + R_{t+2} - r(\pi) + R_{t+3} - r(\pi) + ...
    \end{align}
    $
+ **Bellman equations** with differential return: <br>
    $
    \begin{align}
    v_{\pi}(s) &= \sum_{a}\pi(a|s)\sum_{r,s'}p(s',r|s,a)[r-r(\pi)+v_{\pi}(s')],\\
    q_{\pi}(s,a) &= \sum_{r,s'}p(s',r|s,a)[r-r(\pi)+\sum_{a'}\pi(a'|s')q_{\pi}(s',a')],\\
    v_*(s) &= \max_{a}\sum_{r,s'}p(s',r|s,a)[r-\max_{\pi}r(\pi)+v_*(s')],\\
    q_*(s,a) &= \sum_{r,s'}p(s',r|s,a)[r - \max_{\pi}r(\pi) + \max_{a'}q_*(s',a')]
    \end{align}
    $
+ **Differential form of TD errors** ($\overline{R}_t$ is the estimate at time $t$ of the average reward $r(\pi)$):<br>
    $
    \begin{align}
    \delta_t &= R_{t+1} - \overline{R}_{t+1} + \hat{v}(S_{t+1}, \boldsymbol{w}_t) -\hat{v}(S_t,\boldsymbol{w}_t)\\
    \delta_t &= R_{t+1} - \overline{R}_{t+1} + \hat{q}(S_{t+1}, A_{t+1}, \boldsymbol{w}_t) -\hat{q}(S_t,A_t,\boldsymbol{w}_t)
    \end{align}
    $
+ **Update rule:**<br>
    $
    \begin{align}
    \boldsymbol{w}_{t+1} = \boldsymbol{w}_t + \alpha\delta_t\bigtriangledown\hat{q}(S_t, A_t, \boldsymbol{w}_t)
    \end{align}
    $
+ **Pseudo-code for differential semi-gradient Sarsa**<br>
    <img src='./images/chap10/chap10_3_differential Sarsa.png' style='width: 600px'>
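
The differential TD error and update rule can be tried out on a tiny continuing problem. The sketch below is an illustration under stated assumptions: a linear $\hat{q}$ with one-hot features, a separate step size `beta` for the average-reward estimate (updated as $\overline{R} \leftarrow \overline{R} + \beta\delta$), and a made-up two-state cycle with a single action and rewards 1 and 3. None of the function names come from the notes.

```python
from typing import List, Sequence, Tuple


def q_hat(w: Sequence[float], x: Sequence[float]) -> float:
    """Linear action-value estimate: w . x(s, a)."""
    return sum(wi * xi for wi, xi in zip(w, x))


def differential_sarsa_update(w: Sequence[float],
                              avg_reward: float,
                              x: Sequence[float],
                              reward: float,
                              x_next: Sequence[float],
                              alpha: float,
                              beta: float) -> Tuple[List[float], float, float]:
    """One differential semi-gradient Sarsa step (no discounting)."""
    # delta = R - R_bar + q_hat(S', A', w) - q_hat(S, A, w)
    delta = reward - avg_reward + q_hat(w, x_next) - q_hat(w, x)
    new_avg_reward = avg_reward + beta * delta
    # Linear q_hat: the gradient with respect to w is the feature vector x.
    new_w = [wi + alpha * delta * xi for wi, xi in zip(w, x)]
    return new_w, new_avg_reward, delta


def run_two_state_cycle(steps: int,
                        alpha: float = 0.1,
                        beta: float = 0.1) -> Tuple[List[float], float]:
    """Hypothetical toy task: state 0 -> 1 pays 1, state 1 -> 0 pays 3, forever."""
    features = [[1.0, 0.0], [0.0, 1.0]]  # one-hot feature per state (single action)
    rewards = [1.0, 3.0]
    w, avg_reward, state = [0.0, 0.0], 0.0, 0
    for _ in range(steps):
        next_state = 1 - state
        w, avg_reward, _ = differential_sarsa_update(
            w, avg_reward, features[state], rewards[state],
            features[next_state], alpha, beta)
        state = next_state
    return w, avg_reward


w, avg_reward = run_two_state_cycle(400)
# avg_reward settles near 2.0, the mean of the rewards 1 and 3.
```

With only one action there is nothing to control, so this toy only exercises the update rule. In a real control problem the next action would be chosen, for example ε-greedily, before computing `delta`.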

## 10.4 Deprecating the Discounted Setting
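- Discounting may not be a meaningful objective for continuing tasks with function approximation.
- **Setup:** an infinite sequence of returns with no beginning or end and no clearly identified states. In the extreme case, all states share the same feature vector, so performance can only be assessed from the reward sequence (and the actions).
- **Discounted return** could then only be assessed by averaging it over a long time interval.
- **The average of the discounted returns** is proportional to the average reward, namely $r(\pi)/(1-\gamma)$, so $\gamma$ has no effect on the ordering of policies.<br>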
    <img src='./images/chap10/chap10_4_Futility of discounting in continuing problems.png' style='width: 600px'>
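- The underlying difficulty is that the policy improvement theorem is lost under function approximation. This holds in the discounted setting and likewise in the episodic and average-reward settings, so a policy change is **not guaranteed to improve the policy**.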
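
The proportionality between the averaged discounted return and the average reward can be checked numerically. The sketch below assumes a reward sequence that repeats a fixed cycle forever, a deliberately simple case in which every phase of the cycle is visited equally often. `discounted_returns_periodic` and `average_reward` are hypothetical helper names.

```python
from typing import List, Sequence


def average_reward(rewards: Sequence[float]) -> float:
    """Average reward per step of an endlessly repeating reward cycle."""
    return sum(rewards) / len(rewards)


def discounted_returns_periodic(rewards: Sequence[float], gamma: float) -> List[float]:
    """Exact discounted return starting from each phase of the repeating cycle.

    One pass through the cycle contributes `one_cycle`; later passes repeat it
    scaled by gamma^period, which sums to the factor 1 / (1 - gamma^period).
    """
    period = len(rewards)
    returns = []
    for start in range(period):
        one_cycle = sum(gamma ** k * rewards[(start + k) % period]
                        for k in range(period))
        returns.append(one_cycle / (1.0 - gamma ** period))
    return returns


cycle = [1.0, 3.0]
gamma = 0.5
returns = discounted_returns_periodic(cycle, gamma)
mean_discounted = sum(returns) / len(returns)
# mean_discounted matches average_reward(cycle) / (1 - gamma), up to rounding.
```

Changing `gamma` rescales `mean_discounted` by $1/(1-\gamma)$ but never changes which of two reward cycles scores higher, which is the sense in which discounting adds nothing here.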

## 10.5 Differential Semi-gradient _n_-step Sarsa 
+ **_n_-step version of TD error:**<br>
    $
    \begin{align}
    G_{t:t+n} &= R_{t+1} - \overline{R}_{t+1} + R_{t+2}  - \overline{R}_{t+2} + ... + R_{t+n}  - \overline{R}_{t+n}+ \hat{q}(S_{t+n}, A_{t+n}, \boldsymbol{w}_{t+n-1})\\
    \delta_t &= G_{t:t+n} - \hat{q}(S_t, A_t, \boldsymbol{w})
    \end{align}
    $
+ **Pseudo-code for differential semi-gradient _n_-step Sarsa:**<br>
    <img src='./images/chap10/chap10_5_Differential semi-gradient n-step Sarsa.png' style='width: 600px'>
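
The differential _n_-step return and its TD error reduce to a sum of reward-minus-average terms plus a bootstrap value. The sketch below follows the notation of this section, with one average-reward estimate per step. The function names are hypothetical, and the caller is assumed to have evaluated the two $\hat{q}$ values already.

```python
from typing import Sequence


def differential_n_step_return(rewards: Sequence[float],
                               avg_rewards: Sequence[float],
                               bootstrap_value: float) -> float:
    """Sum of (R_{t+k} - R_bar_{t+k}) for k = 1..n, plus q_hat(S_{t+n}, A_{t+n}, w)."""
    if len(rewards) != len(avg_rewards):
        raise ValueError("need one average-reward estimate per reward")
    return sum(r - r_bar for r, r_bar in zip(rewards, avg_rewards)) + bootstrap_value


def n_step_td_error(rewards: Sequence[float],
                    avg_rewards: Sequence[float],
                    q_bootstrap: float,
                    q_current: float) -> float:
    """delta_t = G_{t:t+n} - q_hat(S_t, A_t, w)."""
    return differential_n_step_return(rewards, avg_rewards, q_bootstrap) - q_current


# Usage: n = 3 with a constant average-reward estimate of 2.0
delta = n_step_td_error([1.0, 3.0, 2.0], [2.0, 2.0, 2.0],
                        q_bootstrap=0.5, q_current=0.25)
# (1-2) + (3-2) + (2-2) + 0.5 - 0.25 = 0.25
```

The resulting `delta` would then drive the same weight update as in the one-step case, $\boldsymbol{w} \leftarrow \boldsymbol{w} + \alpha\delta\bigtriangledown\hat{q}(S_t, A_t, \boldsymbol{w})$.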