## Chapter 5: Model-Free Control

Once model-free prediction algorithms have estimated the value function of an unknown MDP, we can use that estimate to approximate an optimal policy for the MDP.

Model-free control algorithms can be divided into two groups: on-policy control and off-policy control.

- **On-policy control**: Learn about policy $\pi$ from experience sampled from $\pi$

- **Off-policy control**: Learn about policy $\pi$ from experience sampled from a different policy $\mu \neq \pi$

Both types of control algorithms are widely used in RL.

### On-policy Monte-Carlo Control

The overall idea of model-free control is to follow the same pattern as **Generalized Policy Iteration (GPI)**: alternate between policy evaluation and policy improvement.

For policy evaluation, Monte-Carlo policy evaluation can be used, and for policy improvement, **greedy policy improvement** can be used.

</br>
</br>
<font size="3">
$$\begin{align}
Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha(G_t - Q(S_t, A_t))
\end{align}$$
</font>

</br>
</br>
<font size="4">
$$\begin{align}
\pi'(s) = \text{argmax}_{a \in \mathcal{A}} Q(s, a)
\end{align}$$
</font>

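To make Monte-Carlo control model-free, the action-value function $Q(s, a)$ has to be used instead of the state-value function $V(s)$: greedy improvement over $V(s)$ requires a model of the MDP's transitions and rewards, whereas greedy improvement over $Q(s, a)$ only requires taking the argmax over actions.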
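The two equations above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function names `mc_update` and `greedy_action`, the dictionary-based Q-table, and the toy episode are all assumptions made for the example, and the update is the every-visit variant.

```python
from collections import defaultdict

def mc_update(Q, episode, alpha=0.1, gamma=1.0):
    """Every-visit incremental Monte-Carlo update of Q from one finished episode.

    episode: list of (state, action, reward) tuples, where reward is R_{t+1}.
    (Hypothetical data layout chosen for this sketch.)
    """
    G = 0.0
    # Walk backwards so that G accumulates the discounted return G_t.
    for state, action, reward in reversed(episode):
        G = reward + gamma * G
        Q[(state, action)] += alpha * (G - Q[(state, action)])
    return Q

def greedy_action(Q, state, actions):
    """Greedy policy improvement: pick argmax_a Q(state, a)."""
    return max(actions, key=lambda a: Q[(state, a)])

# Toy usage with made-up states and actions.
Q = defaultdict(float)
episode = [("s0", "left", 0.0), ("s1", "right", 1.0)]
mc_update(Q, episode, alpha=0.5, gamma=1.0)
best = greedy_action(Q, "s0", ["left", "right"])
```

Note that `greedy_action` needs only the Q-table, not the transition model, which is the reason for using $Q(s, a)$ here.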

#### $\epsilon$-greedy policy improvement

The policy improvement theorem (Chapter 3) guarantees that the greedily improved policy $\pi'$ is at least as good as the policy $\pi$. However, without exploration, acting greedily with respect to an estimated $Q$ does not guarantee that $\pi'$ converges to the optimal policy. To ensure continual exploration, $\epsilon$-greedy exploration can be used: with probability $1-\epsilon$ choose the greedy action, and with probability $\epsilon$ choose uniformly at random among all $m$ actions.
</br>
</br>
<font size="4">
$$ \pi(a|s)=\begin{cases}
    \frac{\epsilon}{m} + 1-\epsilon, & \text{if $a = a^*=\text{argmax}_{a' \in \mathcal{A}}Q(s,a')$}.\\
    \frac{\epsilon}{m}, & \text{otherwise}.
  \end{cases}$$
</font>

For any $\epsilon$-greedy policy $\pi$, the $\epsilon$-greedy policy $\pi'$ with respect to $q_\pi$ is an improvement: $v_{\pi'}(s) \geq v_\pi(s)$.
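The $\epsilon$-greedy distribution can be written out directly. The sketch below assumes the Q-values of one state are given as a plain list indexed by action; `epsilon_greedy_probs` and `sample_action` are illustrative names, and ties are broken in favour of the lowest action index.

```python
import random

def epsilon_greedy_probs(q_values, epsilon):
    """Return pi(a|s) for every action, given the Q-values of one state."""
    m = len(q_values)
    # Index of the greedy action a* (first maximum wins on ties).
    best = max(range(m), key=lambda a: q_values[a])
    probs = [epsilon / m] * m          # every action gets epsilon / m
    probs[best] += 1.0 - epsilon       # the greedy action gets the remainder
    return probs

def sample_action(q_values, epsilon, rng=random):
    """Draw one action index from the epsilon-greedy distribution."""
    probs = epsilon_greedy_probs(q_values, epsilon)
    return rng.choices(range(len(q_values)), weights=probs, k=1)[0]

probs = epsilon_greedy_probs([0.1, 0.5, 0.2, 0.0], epsilon=0.2)
action = sample_action([0.1, 0.5, 0.2, 0.0], epsilon=0.2)
```

With four actions and $\epsilon = 0.2$, each non-greedy action receives probability $0.05$ and the greedy action receives $0.85$.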

#### Greedy in the Limit with Infinite Exploration (GLIE)

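Exploration does not have to stay fixed forever: it can be reduced over time as long as every state-action pair is still explored infinitely many times. **GLIE** therefore requires two things: all state-action pairs are explored infinitely many times, and the policy converges to a greedy policy as $k \rightarrow \infty$. An $\epsilon$-greedy policy $\pi_k$ with a suitably decaying $\epsilon_k$ satisfies both.

For example, in Monte-Carlo control, $\epsilon_k$ can be set to $\frac{1}{k}$, where $k$ is the number of episodes sampled so far.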
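As a small illustration of the $\frac{1}{k}$ schedule, the helper below returns the exploration rate for a given episode. The name `glie_epsilon` and the 1-indexed episode counter are assumptions made for this sketch.

```python
def glie_epsilon(k):
    """Exploration rate for episode k (1-indexed): epsilon_k = 1 / k."""
    if k < 1:
        raise ValueError("episode index k must be >= 1")
    return 1.0 / k

# Exploration rate after 1, 2, 10 and 1000 episodes.
schedule = [glie_epsilon(k) for k in (1, 2, 10, 1000)]
```

The first episode is fully random ($\epsilon_1 = 1$), and the policy becomes greedy in the limit as $\epsilon_k \rightarrow 0$.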


### Off-policy Monte-Carlo Control

Off-policy Monte-Carlo control estimates the state-action values of the target policy $\pi$ from trajectories sampled from a different behaviour policy $\mu$ by using importance sampling.

#### Importance Sampling

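Importance sampling estimates an expectation under one distribution from samples drawn from another by weighting each sample with the ratio of the two probabilities. Here, a return sampled under $\mu$ is weighted by the ratio of the probabilities that $\pi$ and $\mu$ assign to the actions taken along the trajectory.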
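A minimal sketch of how an importance-sampled Monte-Carlo return could be computed is given below. It assumes ordinary (unnormalised) importance sampling over a whole episode, that `pi` and `mu` are callables returning the probability of an action in a state, and that $\mu(a|s) > 0$ wherever $\pi(a|s) > 0$. All names are hypothetical.

```python
def importance_weighted_return(episode, pi, mu, gamma=1.0):
    """Return rho * G for one episode sampled under the behaviour policy mu.

    episode: list of (state, action, reward) tuples.
    pi, mu:  callables (action, state) -> probability (assumed interface).
    """
    rho = 1.0       # product of per-step probability ratios pi / mu
    G = 0.0         # discounted return of the episode
    discount = 1.0
    for state, action, reward in episode:
        rho *= pi(action, state) / mu(action, state)
        G += discount * reward
        discount *= gamma
    return rho * G

# Toy policies: pi always takes "a", mu picks "a" or "b" uniformly.
pi = lambda action, state: 1.0 if action == "a" else 0.0
mu = lambda action, state: 0.5

on_target = importance_weighted_return([("s", "a", 1.0), ("s", "a", 1.0)], pi, mu)
off_target = importance_weighted_return([("s", "a", 1.0), ("s", "b", 1.0)], pi, mu)
```

Trajectories that $\pi$ would never produce get weight zero, while trajectories that $\pi$ favours more than $\mu$ are weighted up. The product of ratios is also why this estimator can have very high variance on long episodes.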

### On-policy Temporal-Difference Control 

Just as **Temporal-Difference (TD)** learning can be used for model-free prediction, it can also be used for model-free control.


TD methods are used only for the policy evaluation (prediction) step, since TD control follows the pattern of **Generalized Policy Iteration (GPI)**. For the policy improvement step, an $\epsilon$-greedy policy, possibly with a GLIE schedule, can be used.

For on-policy TD control, the best-known algorithm is **SARSA**.

#### SARSA

**SARSA** is an on-policy temporal-difference control algorithm that updates state-action values (Q-values) using the quintuple of events $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$, which gives the algorithm its name.

![sarsa.png](attachment:sarsa.png)

SARSA updates state-action values from $(S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1})$ sampled from the **current policy $\pi$**.

</br>
</br>
<font size="4">
$$ Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma Q(S', A') - Q(S, A)]$$
</font>

The policy is then updated by $\epsilon$-greedy improvement, optionally with a GLIE schedule.
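The SARSA update rule can be sketched as a single function. The dictionary-based Q-table, the function name `sarsa_update`, and the `done` flag (which drops the bootstrap term at a terminal state) are assumptions of this example rather than part of the rule as stated.

```python
from collections import defaultdict

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.9, done=False):
    """One SARSA step: Q(S,A) <- Q(S,A) + alpha * [R + gamma * Q(S',A') - Q(S,A)].

    a_next must be the action actually chosen in s_next by the current policy.
    """
    # At a terminal transition there is no next state-action value to bootstrap from.
    target = r if done else r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Toy usage with made-up states and actions.
Q = defaultdict(float)
Q[("s1", "right")] = 1.0
new_value = sarsa_update(Q, "s0", "left", r=0.5,
                         s_next="s1", a_next="right", alpha=0.1, gamma=0.9)
```

In a full control loop, `a_next` would be drawn from the $\epsilon$-greedy policy before the update and then executed in the next step, which is what makes the method on-policy.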

SARSA converges to the optimal action-value function $q_*(s,a)$ under the following conditions.

1. policy $\pi_t(a|s)$ derived from GLIE
2. Robbins-Monro sequence of step-sizes $\alpha_t$
</br>
</br>
<font size="3">
$$ \sum^{\infty}_{t=1} \alpha_t = \infty, \quad \sum^{\infty}_{t=1} \alpha_t^2 < \infty$$
</font>
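
As a quick check of the step-size conditions, consider two common choices (a worked example, not part of the convergence proof):

$$ \alpha_t = \frac{1}{t}: \quad \sum_{t=1}^{\infty} \frac{1}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{1}{t^2} = \frac{\pi^2}{6} < \infty $$

$$ \alpha_t = \alpha > 0 \text{ (constant)}: \quad \sum_{t=1}^{\infty} \alpha = \infty, \qquad \sum_{t=1}^{\infty} \alpha^2 = \infty $$

So $\alpha_t = \frac{1}{t}$ is a Robbins-Monro sequence, while a constant step size satisfies the first condition but violates the second.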




### Off-policy Temporal-Difference Control


#### Q-learning

Q-learning is an off-policy TD control algorithm, often described as one of the most important breakthroughs in reinforcement learning.

Unlike SARSA, Q-learning requires only $(S_t, A_t, R_{t+1}, S_{t+1})$; the next action $A_{t+1}$ is not needed. State-action values are updated as follows:

</br>
</br>
<font size="4">
$$ Q(S, A) \leftarrow Q(S, A) + \alpha [R + \gamma \max_{a}Q(S', a) - Q(S, A)]$$
</font>

![q-learning.png](attachment:q-learning.png)

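What distinguishes Q-learning from SARSA is its target value, $R+ \gamma \max_{a}Q(S', a)$.

The target $R+ \gamma \max_{a}Q(S', a)$ means that the target policy is greedy with respect to $Q(s, a)$, regardless of which action the behaviour policy actually takes next. If the next action in the SARSA target is chosen greedily, the target simplifies to the Q-learning target: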

</br>
</br>
<font size="4">
$$\begin{align}
R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) &= R_{t+1} + \gamma Q(S_{t+1}, \text{argmax}_{a'}Q(S_{t+1}, a')) \\
&= R_{t+1} + \gamma \max_{a'} Q(S_{t+1}, a')
\end{align}$$
</font>

The policy $\pi$ is then updated by $\epsilon$-greedy improvement, optionally with GLIE. Q-learning is also proven to converge to $q_*$ under the same conditions required for SARSA to converge.
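
The Q-learning update can be sketched in the same style. As before, the dictionary-based Q-table, the name `q_learning_update`, the explicit `actions` list, and the `done` flag are assumptions made for this example.

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.9, done=False):
    """One Q-learning step:
    Q(S,A) <- Q(S,A) + alpha * [R + gamma * max_a' Q(S',a') - Q(S,A)].
    """
    # The target uses the greedy value of the next state, not the action
    # the behaviour policy will actually take there.
    best_next = 0.0 if done else max(Q[(s_next, a2)] for a2 in actions)
    target = r + gamma * best_next
    Q[(s, a)] += alpha * (target - Q[(s, a)])
    return Q[(s, a)]

# Toy usage with made-up states and actions.
Q = defaultdict(float)
Q[("s1", "left")] = 0.2
Q[("s1", "right")] = 1.0
new_value = q_learning_update(Q, "s0", "left", r=0.5,
                              s_next="s1", actions=["left", "right"])
```

Compared with a SARSA update, the only difference is the target: no `a_next` argument is needed, because the maximum over actions replaces the sampled next action.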