In [1]:
%%capture
%load_ext autoreload
%autoreload 2
%matplotlib inline
%load_ext training_rl
%set_random_seed 12

In [2]:
%presentation_style

In [3]:
%load_latex_macros


$\newcommand{\vect}[1]{{\mathbf{\boldsymbol{#1}} }}$
$\newcommand{\amax}{{\text{argmax}}}$
$\newcommand{\P}{{\mathbb{P}}}$
$\newcommand{\E}{{\mathbb{E}}}$
$\newcommand{\R}{{\mathbb{R}}}$
$\newcommand{\Z}{{\mathbb{Z}}}$
$\newcommand{\N}{{\mathbb{N}}}$
$\newcommand{\C}{{\mathbb{C}}}$
$\newcommand{\abs}[1]{{ \left| #1 \right| }}$
$\newcommand{\simpl}[1]{{\Delta^{#1} }}$


# Offline RL theory Test

## Main issues

Can techniques from online RL, known for solving complex problems effectively, be applied to offline RL? 

<img src="_static/images/nb_94_on_policy_vs_off_policy.png" alt="offline_rl" style="width:60%">
<div class="slide title"> On-policy vs. off-policy approaches.
</div>

In off-policy online RL, we use a replay buffer to store transitions $(s, a, r, s')$, and the buffer is refreshed with new data as the learned policy improves. Why not apply an off-policy algorithm and fill the replay buffer directly with the collected data?
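To make the idea concrete, here is a minimal sketch of a replay buffer that is filled once from logged data and never refreshed. The `Transition` fields, the `ReplayBuffer` class, `fill_from_dataset`, and the toy `logged_data` are all hypothetical names introduced for illustration; they are not part of the `training_rl` package.

```python
import random
from collections import deque, namedtuple

# Hypothetical transition layout: (state, action, reward, next_state, done).
Transition = namedtuple(
    "Transition", ["state", "action", "reward", "next_state", "done"]
)


class ReplayBuffer:
    """Fixed-capacity FIFO buffer of transitions."""

    def __init__(self, capacity):
        self.storage = deque(maxlen=capacity)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size, rng=random):
        # Uniform sampling without replacement from the stored transitions.
        return rng.sample(list(self.storage), batch_size)

    def __len__(self):
        return len(self.storage)


def fill_from_dataset(buffer, dataset):
    """Offline setting: load the logged transitions once, with no new interaction."""
    for raw in dataset:
        buffer.add(Transition(*raw))
    return buffer


# Toy logged data, made up for this example.
logged_data = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
]
offline_buffer = fill_from_dataset(ReplayBuffer(capacity=100), logged_data)
batch = offline_buffer.sample(2, rng=random.Random(0))
```

In the online case the training loop would keep calling `add` with fresh transitions from the current policy; in the offline case the buffer content stays frozen after `fill_from_dataset`.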

<img src="_static/images/nb_94_off_policy_vs_offline.png" alt="offline_rl" style="width:70%">

**This is only a qualitative parallel: offline RL is expected to work even when the data in the replay buffer is far from optimal.**

**However, although the two approaches look similar, off-policy methods generally cannot be applied directly to a fixed, previously collected dataset.**

**A bit of review:**

In particular, many off-policy RL algorithms alternate between the following two steps:

$$
\hat{Q}^{k+1} \leftarrow L_1 = \arg \min_{Q_\phi} \mathbb{E}_{(s,a,s')\sim D} \left[\left( Q_\phi(s, a) - \left( r(s, a) + \gamma\, \mathbb{E}_{a' \sim\pi_k(a'|s')}\left[\hat{Q}^{k}(s', a')\right] \right) \right)^2 \right]  \tag{Evaluation}
$$

$$
\pi_{k+1} \leftarrow L_2 = \arg \max_{\pi_\theta} \mathbb{E}_{s\sim D} \left[ \mathbb{E}_{a \sim\pi_\theta(a|s)} \left[ \hat{Q}^{k+1}(s, a) \right] \right] \tag{Improvement}
$$

with:

$$ 
Q^\pi(s, a) = \mathbb{E}_\pi \left[ r_0 + \gamma r_1 + \gamma^2 r_2 + \ldots \mid s_0 = s, a_0 = a \right]
\tag{Q-value}
$$


where $D$ is the replay buffer, which in the offline RL case will be filled with the collected dataset.
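The alternation between the (Evaluation) and (Improvement) steps can be sketched in a tabular setting, where the least-squares minimizer for each $(s, a)$ pair in $D$ is simply the average of its bootstrapped targets. This is an illustrative sketch under strong assumptions: a made-up three-state dataset `D`, a deterministic greedy policy, and hypothetical helper names `evaluate` and `improve`. It is not the implementation used in this training.

```python
from collections import defaultdict

GAMMA = 0.9
STATES = [0, 1, 2]
ACTIONS = [0, 1]

# Hypothetical dataset D of (s, a, r, s_next, done) transitions.
D = [
    (0, 0, 0.0, 0, False),
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
    (1, 0, 0.0, 0, False),
]


def evaluate(q_prev, policy, dataset, gamma=GAMMA):
    """(Evaluation): fit Q to the targets r + gamma * Q_prev(s', pi(s'))."""
    targets = defaultdict(list)
    for s, a, r, s_next, done in dataset:
        bootstrap = 0.0 if done else q_prev[(s_next, policy[s_next])]
        targets[(s, a)].append(r + gamma * bootstrap)
    q_new = dict(q_prev)  # pairs that never appear in D keep their old value
    for key, values in targets.items():
        q_new[key] = sum(values) / len(values)  # tabular least-squares solution
    return q_new


def improve(q, states=STATES, actions=ACTIONS):
    """(Improvement): act greedily with respect to the current Q estimate."""
    return {s: max(actions, key=lambda a: q[(s, a)]) for s in states}


q = {(s, a): 0.0 for s in STATES for a in ACTIONS}
policy = {s: 0 for s in STATES}
for _ in range(10):
    q = evaluate(q, policy, D)
    policy = improve(q)
```

In this toy dataset every action is observed in every non-terminal state, so the bootstrapped value $Q(s', a')$ is always backed by data. The difficulty discussed next appears when that is not the case.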

**In the (Evaluation) step, $s$, $a$, and $s'$ all come from the dataset $D$; only the action $a'$ is sampled from the learned policy. Evaluating $Q$ at $(s', a')$ is therefore the only potential source of out-of-distribution (o.o.d.) queries.**

<img src="_static/images/nb_94_q_value.png" alt="offline_rl" style="height:200px;">
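One rough way to see this issue in a tabular setting is to count how often the bootstrapped pair $(s', a')$ never appears in the dataset. The sketch below assumes a deterministic policy given as a dictionary and a made-up dataset; `ood_bootstrap_fraction` and both example policies are hypothetical names used only for illustration.

```python
def ood_bootstrap_fraction(dataset, policy):
    """Fraction of non-terminal transitions whose (s', pi(s')) is absent from the data."""
    seen_pairs = {(s, a) for s, a, _, _, _ in dataset}
    non_terminal = [t for t in dataset if not t[4]]
    if not non_terminal:
        return 0.0
    n_ood = sum(
        (s_next, policy[s_next]) not in seen_pairs
        for _, _, _, s_next, _ in non_terminal
    )
    return n_ood / len(non_terminal)


# Hypothetical dataset of (s, a, r, s_next, done); action 0 is never taken in state 1.
D = [
    (0, 1, 0.0, 1, False),
    (1, 1, 1.0, 2, True),
    (0, 0, 0.0, 0, False),
]

in_support_policy = {0: 1, 1: 1}  # only picks actions observed in D
unseen_policy = {0: 1, 1: 0}      # picks the unobserved action in state 1

frac_in_support = ood_bootstrap_fraction(D, in_support_policy)
frac_unseen = ood_bootstrap_fraction(D, unseen_policy)
```

For `unseen_policy`, the target of the transition that lands in state 1 relies on a $Q$ value that no data point has ever constrained, which is exactly the o.o.d. query described above.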
