# Off-Policy Methods with Approximation

The tabular off-policy methods extend readily to semi-gradient methods, but the resulting algorithms do not converge as robustly as their on-policy counterparts and may diverge. While this can be addressed to a limited extent, off-policy learning with function approximation is still an active research area with many open questions.

Recall that off-policy learning seeks to estimate state or action values for a target policy $\pi$, which may be fixed or changing, using only data generated by a different behavior policy $b$.

The challenges of off-policy learning can be divided into two components.

1. The TD error must be reweighted to account for the different probabilities with which $b$ and $\pi$ select each action.

2. Much more importantly, the distribution of states visited under the behavior policy $b$ can differ greatly from the distribution of states visited under the target policy $\pi$.

## 11.1 Semi-Gradient Methods

It is straightforward to turn the updates of on-policy semi-gradient methods into updates for semi-gradient off-policy methods by recalling the _importance sampling_ ratio:
$$
\rho_t = \rho_{t:t} = \frac{\pi(A_t|S_t)}{b(A_t|S_t)}
$$

For TD(0), the update then becomes:

$$
\mathbf{w}_{t+1} \quad\dot{=}\quad \mathbf{w}_t + \alpha\rho_t\delta_t\nabla\hat{v}(S_t, \mathbf{w}_t)
$$


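where $\delta_t$ is defined appropriately for either the episodic (discounted) case or the continuing (differential) case, respectively:

$$
\begin{align}
\delta_t \quad&\dot{=}\quad R_{t+1} + \gamma\hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)\\
\delta_t \quad&\dot{=}\quad R_{t+1} - \bar{R}_t + \hat{v}(S_{t+1}, \mathbf{w}_t) - \hat{v}(S_t, \mathbf{w}_t)
\end{align}
$$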
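The off-policy TD(0) update can be sketched in a few lines for the special case of a linear value function, where $\hat{v}(s, \mathbf{w}) = \mathbf{w}^\top\mathbf{x}(s)$ and the gradient is simply $\mathbf{x}(s)$. This is a minimal illustration under that assumption, using the episodic TD error; the helper name `td0_update` and the toy feature vectors are hypothetical and not part of the original notes.

```julia
# Hypothetical helper: one off-policy semi-gradient TD(0) step for a linear
# value function v̂(s, w) = w ⋅ x(s), whose gradient with respect to w is x(s).
function td0_update(w, x_s, x_sp, r, ρ; α=0.1, γ=0.99)
    # Episodic (discounted) TD error.
    δ = r + γ * sum(w .* x_sp) - sum(w .* x_s)
    # The importance sampling ratio ρ scales the whole step.
    return w + α * ρ * δ * x_s
end

w = zeros(3)
x_s, x_sp = [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]   # toy features of S_t and S_{t+1}
ρ = 1.0 / 0.5                                   # π(A_t|S_t) = 1, b(A_t|S_t) = 0.5
w_new = td0_update(w, x_s, x_sp, 1.0, ρ)
```

When $\rho_t = 0$, that is, when the target policy would never have taken the observed action, the step leaves the weights unchanged.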

For action values, the Expected Sarsa update is:

$$
\mathbf{w}_{t+1} \quad\dot{=}\quad \mathbf{w}_t + \alpha\delta_t\nabla\hat{q}(S_t, A_t, \mathbf{w}_t)
$$

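where $\delta_t$ is the Expected Sarsa TD error, whose target uses $\sum_a \pi(a|S_{t+1})\hat{q}(S_{t+1}, a, \mathbf{w}_t)$ in place of the value of a single sampled next action, in either its episodic or its continuing form. Note that there is no importance sampling ratio in this one-step update; the reason is explained later.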
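As a rough sketch of this update, assume a linear $\hat{q}$ with one weight row per action, so that $\hat{q}(s, a, \mathbf{W}) = (\mathbf{W}\mathbf{x}(s))_a$, and the episodic form of the TD error. The helper name `expected_sarsa_update`, the integer action indices, and the toy numbers are illustrative assumptions.

```julia
# Hypothetical helper: one semi-gradient Expected Sarsa step for a linear q̂
# with one weight row per action, q̂(s, a, W) = (W * x(s))[a].
# `π_sp[a]` is the target policy's probability of action a in the next state.
function expected_sarsa_update(W, x_s, a, r, x_sp, π_sp; α=0.1, γ=0.99)
    q_sp = W * x_sp                                # q̂(S', ·, W) for every action
    δ = r + γ * sum(π_sp .* q_sp) - (W * x_s)[a]   # expectation under π, no ratio
    W_new = copy(W)
    W_new[a, :] .+= α * δ .* x_s                   # gradient is x(s) in row a only
    return W_new
end

W = [1.0 0.0; 0.0 2.0]             # 2 actions × 2 features
x_s, x_sp = [1.0, 0.0], [0.0, 1.0]
π_sp = [0.5, 0.5]
W_new = expected_sarsa_update(W, x_s, 1, 1.0, x_sp, π_sp; α=0.1, γ=1.0)
```

The behavior policy never appears in the function: the action actually taken only selects which row is updated, while the next-state value is averaged under $\pi$.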

In the multi-step case, the importance sampling ratio is the product of the one-step ratios:
$$
\rho_{t+1:t+n} = \prod_{k=t+1}^{t+n}{\frac{\pi(A_k|S_k)}{b(A_k|S_k)}}
$$
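To make the product concrete, the sketch below multiplies per-step ratios over a short window of actions. The helper name `is_ratio` and the probability vectors are hypothetical; each vector entry is assumed to hold the probability that the corresponding policy assigns to the action actually taken at that step.

```julia
# Hypothetical helper: product of per-step ratios π(A_k|S_k) / b(A_k|S_k).
# `π_probs[k]` and `b_probs[k]` are the two policies' probabilities of the
# action actually taken at step k of the window.
is_ratio(π_probs, b_probs) = prod(π_probs ./ b_probs)

ρ = is_ratio([1.0, 1.0, 1.0], [0.5, 0.5, 0.25])   # 2 * 2 * 4
ρ_zero = is_ratio([1.0, 0.0], [0.5, 0.5])         # one action π never takes
```

Because the factors multiply, a few unlikely behavior actions can make the ratio large, and a single action that $\pi$ never takes makes it zero.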


1. Convert the equation of $n$-step off-policy TD (7.9) to semi-gradient form. Give accompanying definitions of the return for both the episodic and continuing cases.

$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \rho_t \left[ G_{t:t+n} - \hat{v}(S_t, \mathbf{w}_t)\right]\nabla\hat{v}(S_t, \mathbf{w}_t)
$$

Here $\rho_t$ denotes the $n$-step product of ratios rather than the one-step ratio. For the episodic case, we have:

$$
\begin{align}
\rho_t &= \prod_{k=t}^{\min(T, t + n) - 1}{\frac{\pi(A_k|S_k)}{b(A_k|S_k)}}\\\\
G_{t:t+n} &= \sum_{k=t+1}^{\min(T, t + n)}{\gamma^{k-t-1}R_k} + \gamma^n\hat{v}(S_{t+n}, \mathbf{w}_t)\mathbb{1}[t + n < T]
\end{align}
$$

While in the continuing case we have:

$$
\begin{align}
\rho_t &= \prod_{k=t}^{t + n - 1}{\frac{\pi(A_k|S_k)}{b(A_k|S_k)}}\\\\
G_{t:t+n} &= \sum_{k=t+1}^{t + n}{(R_k - \bar{R}_{k-1})} + \hat{v}(S_{t+n}, \mathbf{w}_t)
\end{align}
$$
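The episodic $n$-step return can be computed as in the following sketch, which sums the discounted rewards $R_{t+1}$ through $R_{\min(T, t+n)}$ and bootstraps only when $t + n < T$. The helper name `n_step_return` and the storage convention (`rewards[k]` holds $R_k$, with $t$ counted from $0$) are assumptions made for illustration.

```julia
# Hypothetical helper. `rewards[k]` stores R_k for k = 1, …, T, and `v_boot`
# is the bootstrap value v̂(S_{t+n}, w), which is ignored once t + n ≥ T.
function n_step_return(rewards, t, n, v_boot; γ=0.99)
    T = length(rewards)
    h = min(T, t + n)
    G = 0.0
    for k in t+1:h
        G += γ^(k - t - 1) * rewards[k]   # discounted reward R_k
    end
    # Bootstrap from the value estimate only if the episode has not ended.
    return t + n < T ? G + γ^n * v_boot : G
end

rewards = [1.0, 1.0, 1.0]                            # T = 3
G_boot = n_step_return(rewards, 0, 2, 4.0; γ=0.5)    # 1 + 0.5 + 0.25 * 4
G_end = n_step_return(rewards, 2, 2, 4.0; γ=0.5)     # only R_3 remains
```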

2. Convert the equations of $n$-step $Q(\sigma)$ (7.11 and 7.17) to semi-gradient form. Give definitions that cover both the episodic and continuing cases.

$$
\mathbf{w}_{t+1} = \mathbf{w}_t + \alpha \left[ G_{t:t+n} - \hat{q}(S_t, A_t, \mathbf{w}_t)\right]\nabla\hat{q}(S_t, A_t, \mathbf{w}_t)
$$

Where $G_{t:h}$ is defined in the discounted case as:

$$
\begin{align}
h &= \min(T, t + n)\\
G_{t:h} \quad&\dot{=}\quad R_{t+1} + \gamma(\sigma_{t+1}\rho_{t+1} + (1 - \sigma_{t+1})\pi(A_{t+1}|S_{t+1}))(G_{t+1:h} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)) + \gamma \bar{v}(S_{t+1}, \mathbf{w}_t)
\end{align}
$$

And $G_{t:h}$ is defined in the continuing case as:

$$
\begin{align}
G_{t:h} \quad\dot{=}\quad R_{t+1} - \bar{R}_t + (\sigma_{t+1}\rho_{t+1} + (1 - \sigma_{t+1})\pi(A_{t+1}|S_{t+1}))(G_{t+1:h} - \hat{q}(S_{t+1}, A_{t+1}, \mathbf{w}_t)) + \bar{v}(S_{t+1}, \mathbf{w}_t)
\end{align}
$$

Where

$$
\bar{v}(s, \mathbf{w}) = \sum_{a\in\scr{A}(s)}\pi(a|s)\hat{q}(s, a, \mathbf{w})
$$

and $\sigma_k \in [0, 1]$ is chosen by some procedure at each step $k$ from $t+1$ to $h$, with $\sigma_k = 1$ giving full sampling and $\sigma_k = 0$ a pure expectation.

3. Apply one-step semi-gradient Q-learning to Baird's counterexample and show empirically that its weights diverge.

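```julia
# Baird's counterexample: seven states, two actions, and all rewards zero.
𝓢 = collect(1:7)
𝓐 = [:dashed, :solid]
γ = 0.99

# Target policy π always takes :solid; behavior policy b takes :dashed with probability 6/7.
π_(s, a) = a == :solid ? 1 : 0
π_sample(s) = :solid
b(s, a) = a == :dashed ? 6/7 : 1/7
b_sample(s) = rand() < 6/7 ? :dashed : :solid

# :dashed leads to one of states 1-6 uniformly at random, :solid leads to state 7.
# Returns (next state, reward).
p_sample(s, a) = (a == :dashed ? rand(1:6) : 7, 0)

# Feature vector of state s.
function x(s)
    ϕ = zeros(8)
    if s == 7
        ϕ[7] = 1
        ϕ[8] = 2
    else
        ϕ[s] = 2
        ϕ[8] = 1
    end
    return ϕ
end

# Row of the weight matrix that holds the weights of action a.
row(a) = a == :dashed ? 1 : 2

# One-step semi-gradient Q-learning with linear q̂(s, a) = (W * x(s))[row(a)].
# W0 must be nonzero: with zero weights and zero rewards every TD error is zero.
# The default reuses the usual initial weights of Baird's counterexample,
# (1, 1, 1, 1, 1, 1, 10, 1), for both actions; that reuse is an assumption.
function q_learning(; N=1000, α=0.1, W0=repeat([1.0 1 1 1 1 1 10 1], 2))
    W = copy(W0)
    q(s, a) = (W * x(s))[row(a)]
    max_abs_w = zeros(N)          # largest weight magnitude after each step
    s = rand(𝓢)
    for i in 1:N
        a = b_sample(s)           # act according to the behavior policy
        sp, r = p_sample(s, a)
        grad = zeros(2, 8)        # ∇q̂ is x(s) in the row of a, zero elsewhere
        grad[row(a), :] = x(s)
        δ = r + γ * maximum(ap -> q(sp, ap), 𝓐) - q(s, a)
        W .+= α * δ * grad
        max_abs_w[i] = maximum(abs, W)
        s = sp
    end
    return W, max_abs_w
end
```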