# "Offline Reinforcement Learning"
> "This is a summary of Sergey Levine's talk on Offline RL"

- toc: True
- branch: master
- badges: true
- comments: true
- categories: [fastpages, jupyter]
- image: images/some_folder/your_image.png
- hide: false
- search_exclude: true


# what makes modern machine learning work?

so to start off let's start with a big question what is it that makes machine learning work you know i'm going to give a
simple answer to a big question this is
going to be
maybe somewhat controversial but perhaps
many of you will agree with this at
least at a very very high level
i think that what makes machine learning
work today is really
the combination of large and highly
diverse data sets
and large and high capacity models and
what we've seen time and time again
is that across a range of domains from
image recognition
to machine translation to text
recognition and speech recognition
this kind of formula seems to be the
formula that leads to good results
if you collect a large data set like
imagenet or ms coco
and then train a huge model with dozens
or even hundreds of layers
that's going to be the thing that leads
to very good performance and arguably
the widespread enthusiasm about machine
learning in recent years
has really been spurred on by
applications
that follow this basic formula
so what about reinforcement learning
what about using
learning to figure out how to make
decisions
well reinforcement learning is
fundamentally
at least in the textbook setting an
active learning framework where you have
an agent
that interacts with the world collects
some experience
uses that experience to update its model
policy or value function
and then collects more experience and
this process is repeated many many times
and we've seen that this basic recipe
does lead to good results across
a range of domains from playing
video games to basic robotic
locomotion and manipulation skills and
even to playing the game of go
however when you want
learning systems that generalize
effectively to large-scale real-world
settings
like the ones that i showed on the
previous slide you still have to collect
large and diverse data sets and in an
active learning framework
this means that each time the agent
interacts with the world it needs to
collect
a breadth of experience that covers the
sort of situations it might see in the
world
for example if you imagine using
reinforcement learning to train an
autonomous driving system
now this system might need to practice
driving in
many many different cities each time it
updates the model
so it goes and drives in san francisco
new york berlin
london updates the model and goes back
to san francisco new york berlin and
london
now this very quickly becomes
impractical and indeed
if we look at the kind of domains in
which modern reinforcement learning
algorithms have been successful
and then contrast them side by side with
the kind of domains where we've seen the
success of supervised learning methods
that can leverage very large and diverse
data sets
we see that there's a really big gulf
it's not that the reinforcement learning
algorithms are not capable they're
learning very sophisticated strategies
very sophisticated behaviors
but they're not exhibiting the same kind
of open world generalization
all of the domains that i'm showing on
the left can all be characterized as
closed world settings
we know exactly what the rules of go are
we know how the emulator for the video
game works even the robotics application
is a laboratory setting
whereas all the applications shown on
the right are open world domains
images mined from flickr from all over
the world
natural text natural speech collected
from real human beings
actually speaking or writing text in the
real world
so if we want to bridge this gulf of
generalization
what can we do how can we bring
reinforcement learning into these open
world settings
to address the kind of applications that
we actually want
well i'm going to posit that in order to
enable this we really need to develop
data-driven reinforcement learning
methods and data driven
means that you need to be able to use
large and diverse previously collected
data sets
in the same way that supervised learning
algorithms can utilize large and diverse
previously collected data sets
the classic textbook version of
reinforcement learning
is really an on policy formulation in an
on policy reinforcement learning
algorithm
you have a setting where each time the
agent updates its policy
it has to go and collect new data
a very common area of study in
reinforcement learning is also to study
off policy reinforcement learning
algorithms and these algorithms can
utilize previous data
but they typically still learn in an
active online fashion
meaning that the agent interacts with
the world collects some data
adds that data to its buffer and then
uses the latest data and all past data
to improve its policy
and we might think that we could simply
cut that connection we could simply use
prior data
using the same exact algorithms but as
i'll discuss in today's talk
that actually doesn't work very well and
if we really want data driven
rl algorithms then we need to come up
with some new algorithmic techniques to
make this possible
and develop a new class of what i'm
going to call offline reinforcement
learning methods
these have also been called batch
reinforcement learning methods in the literature
the idea here is that a data set was
collected previously
with some behavior policy that i'm going
to denote pi beta in general you might not
even know what pi beta was maybe it was
a person driving a car a person
controlling the robot
maybe it was the robot's own experience
on other tasks whatever it was
it collected a buffer of data a large
and diverse data set
that covers many different open world
situations and now you have to learn the
best policy you can
for your current task using just that
data without the
luxury of interacting with the world
further to try out different methods
now of course in reality you might want
something more like a hybrid in reality
maybe
what we have is we have a big data set
from all the past interactions that our
agent has had
maybe you have a robot that has cleaned
the kitchen and mopped the floors and
done many other things and that's sort of
its foundational data its
imagenet
style data set that it's going to use to
get generalizable skills
we're going to use that data set in an
offline rl fashion
and then perhaps we'll actually come
back and interact with the world a
little bit more
just to fine-tune that skill just to
collect task specific
data in small amounts and i'll actually
discuss some methods that can
do this as well but the main component
the main challenge of this recipe is
really the first part is using large and
diverse previous data sets
to get the best policy you can without
having to revisit all those diverse and
varied situations
without having the car go back and drive
in san francisco and new york and berlin
and london each time its policy changes
and if we can accomplish this if we can
build algorithms that have this
capability
then we will not only be able to train
better policies for
robots and things like that but we'll
also be able to apply reinforcement
learning
to domains where conventionally it has
been very difficult to use
for instance we could imagine using past
data to use reinforcement learning
to train a policy for advising doctors
on treatment plans and diagnoses
it would be pretty crazy to imagine an
active exploration procedure for doing
this because that would result in
enormous costs and liability but if you
can use previously collected offline data
to get the best decision making system
you can to support doctors in a medical
setting
well that seems like a pretty good
recipe you can imagine the same
procedure being applied to scientific
discovery problems
controlling power plants chemical plants
operations research
logistics and inventory management
finance in all of these domains
there's ample previously collected
historical data
but it's extremely difficult and costly
to run active online data collection so
we would expect that these would be the
domains that could be revolutionized
with effective offline reinforcement
learning algorithms
so in today's talk i'm going to cover
the following topics
first i'll discuss why offline
reinforcement learning is difficult
then i'll talk about some basic recipes
for designing offline rl algorithms and
a little bit about recent progress in
this area i'll talk about
some of our recent work on model based
offline rl
and then i'll discuss a new algorithm
that we're pretty excited about called
conservative q learning
which seems to be a way to do offline
rl that works quite a bit better
than many of the previous methods and
then i'll conclude with a discussion
of how we should be evaluating our
offline reinforcement learning
algorithms
and discuss a benchmark that we've
developed recently called d4rl

# why is offline reinforcement learning hard?

all right but let's start with the first
topic let's talk about what makes
offline reinforcement learning hard
and why we can't use our standard off
policy rl algorithms
to solve it so first
let's start with a quick primer on off
policy reinforcement learning
in a standard reinforcement learning
problem you have an agent that interacts
with the world by selecting actions
and the world responds with states and
the rewards the consequences of those
actions
what the agent needs to do is select a
policy which i'm going to denote as pi
of a given s
and that policy needs to be selected so
as to maximize
the reinforcement learning objective which
is the sum of expected rewards over all
time
so the agent doesn't just take the
action that yields the highest reward
right now
it's supposed to take the action that
yields the highest rewards
in the long run now a very useful object
for maximizing this reinforcement
learning objective
is the q function the q function tells
you
if you start in a particular state take a
particular action
and then follow your current policy pi
what is the total reward that you will
accumulate
and if you can somehow determine the q
function for a given policy
then you can always improve that policy
by taking an action with probability one
if it is the argmax of the q function and
zero otherwise
so this greedy policy with respect to q
pi will always be at least as good or
better than the previous pi
and you can also skip the middleman and
directly enforce
the bellman optimality equation for all
state action tuples
so if you can make q of s comma a equal
to r of s comma a
plus the max of the q value at the next
time step then you can also show
that recovering the greedy policy from
this will get you the optimal policy
and one way you could do this is you
could enforce this equation at all
states
by minimizing the difference between its
left-hand side and right-hand side
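
Written out, the objects described above look roughly like this. This is a sketch in standard notation rather than the slide's exact formulas: the time index $t$, the next state $s'$, and the expectation over transitions are notational assumptions, and a discount factor is left out to mirror the spoken version.

$$
\begin{aligned}
\text{objective:}\quad & \max_{\pi}\; \mathbb{E}_{\pi}\Big[\sum_{t} r(s_t, a_t)\Big] \\
\text{Q function:}\quad & Q^{\pi}(s, a) = r(s, a) + \mathbb{E}_{s',\, a' \sim \pi(\cdot \mid s')}\big[Q^{\pi}(s', a')\big] \\
\text{greedy improvement:}\quad & \pi_{\text{new}}(a \mid s) = \mathbb{1}\big[a = \arg\max_{a''} Q^{\pi}(s, a'')\big] \\
\text{Bellman optimality:}\quad & Q^{\star}(s, a) = r(s, a) + \mathbb{E}_{s'}\big[\max_{a'} Q^{\star}(s', a')\big] \\
\text{training loss:}\quad & \min_{Q}\; \mathbb{E}_{(s, a, s')}\Big[\big(Q(s, a) - \big[r(s, a) + \max_{a'} Q(s', a')\big]\big)^2\Big]
\end{aligned}
$$
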
now this is all fairly basic stuff this
is kind of what we learned
when we first learned about
reinforcement learning it's textbook
rl but as many of you might
already know
the nice thing about this equation is
that you don't necessarily need
on policy data to enforce it now this
talk is going to focus pretty much
entirely on approximate dynamic
programming methods basically
q learning style methods or actor critic
methods there are many other methods for
offline rl
based more around policy gradient style
estimators and i'm not going to talk
about those so we'll basically start with
q learning and actor critic and sort of
stay there just as a warning
okay so this procedure that i've laid
out here this classic q learning
approach
is an off policy algorithm which means
that you don't need on policy data
to make the left-hand side and
right-hand side of the bellman
optimality equation
equal to each other so you could imagine
an off policy
q learning method or a fitted q
iteration procedure that has the
following basic recipe this is sort of
classic stuff from the literature collect
a data set of transitions
using some policy add it to a buffer
sample a batch from that buffer minimize
the difference between the left hand
side and right hand side
of the bellman optimality equation and
then repeat this process some number of
times
and then periodically you could go out
and collect more data
you could also think of this graphically
you interacted with the world
collected a data set of transitions
you're going to run your q-learning
procedure on that data set
and then periodically go back and
explore and of course if we're doing
offline rl then this is the part that we
would omit we would just
take our buffer and just keep crunching
away with q learning on that buffer
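
The recipe above, with the exploration step omitted, can be sketched in a few lines. This is a minimal tabular stand-in for fitted Q iteration, not the method from the talk: the 4-state chain environment, the uniformly random behavior policy, the discount `GAMMA`, and all hyperparameters are assumptions made up for illustration.

```python
import random

# Hypothetical toy problem (not from the talk): a 4-state chain where moving
# right from state 2 reaches the goal state 3 and pays reward 1.
N_STATES, N_ACTIONS, GAMMA = 4, 2, 0.9


def step(state, action):
    """Deterministic chain dynamics: action 0 moves left, action 1 moves right."""
    next_state = max(0, state - 1) if action == 0 else min(N_STATES - 1, state + 1)
    reward = 1.0 if next_state == N_STATES - 1 else 0.0
    done = next_state == N_STATES - 1
    return next_state, reward, done


def collect_buffer(num_transitions, rng):
    """Collect transitions once, with a uniformly random behavior policy."""
    buffer, state = [], 0
    for _ in range(num_transitions):
        action = rng.randrange(N_ACTIONS)
        next_state, reward, done = step(state, action)
        buffer.append((state, action, reward, next_state, done))
        state = 0 if done else next_state
    return buffer


def offline_q_learning(buffer, num_updates, batch_size, lr, rng):
    """Sample batches from a fixed buffer and shrink the Bellman error on them."""
    q = [[0.0] * N_ACTIONS for _ in range(N_STATES)]
    for _ in range(num_updates):
        for s, a, r, s2, done in rng.sample(buffer, batch_size):
            target = r if done else r + GAMMA * max(q[s2])
            q[s][a] += lr * (target - q[s][a])
    return q


rng = random.Random(0)
buffer = collect_buffer(500, rng)  # no further interaction after this line
q = offline_q_learning(buffer, num_updates=200, batch_size=32, lr=0.5, rng=rng)
greedy = [max(range(N_ACTIONS), key=lambda a: q[s][a]) for s in range(N_STATES - 1)]
print(greedy)
```

In this toy case the fixed buffer is enough, because the random behavior policy covers every state and action and the table has no approximation error. The failure discussed next shows up when the max in the target is taken over actions the data does not cover.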
so this is an off policy algorithm and
it can accommodate previously collected
data in principle
so does it solve the offline rl problem
and if not what goes wrong
well we studied this question a little
bit so what i'm going to show next is
some results from a large-scale
reinforcement learning project that we
actually did at google a couple of years
back called QT-Opt
our goal in this project was to develop a
highly generalizable and powerful
robotic grasping system that could learn
to grasp any object
directly from raw images but in the
process of doing this research
we actually evaluated a little bit how
these q learning style methods compare
in the fully offline regime
versus the regime where they're allowed
to collect additional online data
now here everything is
kind of scaled up so there are seven
different robots that are all collecting
lots of data in parallel they're pushing
it to a
distributed buffer
storing all the data from past
experiments and there are multiple
workers that are actually updating q
values
on everything in the buffer
and we looked at how the system worked
it actually worked really well
it could pick up novel objects heavy
objects awkward objects it could respond
to perturbations that it had never seen
before so basically
when we scale up reinforcement learning
it does actually generalize something
is really working and we're seeing
generalization that kind of resembles
the kind of
open world generalization that we saw
in the supervised system
of course here one might argue it's not
entirely open world it's still in the
laboratory
but it is generalizing to never before
seen objects
so we evaluated this method on a test
set of objects that the policy never saw
during training we actually bought
entirely new objects just to ensure that
they were not in the lab
during the training process
and then we evaluated the algorithm trained
in fully offline mode
meaning using just the data collected
before to train the policy and see how it
does and this was fairly good offline
data it came from past rl experiments
and we also evaluated the algorithm in
fine-tuned mode where
the policy was allowed to collect a
little bit of additional data and
fine-tune online the offline data set
consisted of
580,000 episodes
so this is a very large grasping data set
the online fine-tuning added another
28,000
so the amount of additional online data
was pretty negligible
so it was clearly not the increase in
the data set size it was really the fact
that the data was collected online
and we saw the following numbers the
success rate was 87 percent
for the offline method and 96 percent with the
additional online fine-tuning
and while this might seem like a small
difference if we rewrite this as error
rates we have an error rate of 13 percent
for the offline method and 4 percent
for the fine-tuned method so
less than a third of the number of
mistakes that's a pretty big difference
actually
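
The arithmetic behind that comparison is short enough to check directly. The numbers below are the ones quoted in the talk; the variable names are just for illustration.

```python
offline_success, finetuned_success = 0.87, 0.96  # success rates quoted in the talk
offline_episodes, online_episodes = 580_000, 28_000

offline_error = 1.0 - offline_success      # about 0.13
finetuned_error = 1.0 - finetuned_success  # about 0.04
ratio = offline_error / finetuned_error    # how many times more mistakes offline

# Fraction of extra data that the online fine-tuning phase contributed.
extra_fraction = online_episodes / offline_episodes

print(f"error ratio {ratio:.2f}, extra data {extra_fraction:.1%}")
```

So fine-tuning adds under five percent more episodes while cutting the error rate by a factor of a bit more than three.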
so the system clearly works very well
but
you're getting about three times fewer
mistakes
if you fine-tune with even a small
amount of online data so something about the
fully offline setting
seems to be pretty hard now
more recently we actually studied
this question
in a paper called stabilizing off-policy
q-learning via bootstrapping error
reduction
by Aviral Kumar Justin Fu George
Tucker and myself
and we wanted to understand what
the reasons are why this might be so
difficult
we had a few hypotheses that we wanted
to investigate one hypothesis we had was
maybe there's some kind of overfitting
effect maybe when you train on offline
data
if you train too long you sort of
overfit to that data and you start
seeing bad performance
but if it was an overfitting effect then
what we would expect
is that increasing the data set size
should decrease the problem
so what this plot shows is offline rl
performance
on the half cheetah benchmark and
different colors denote different data
sets so
blue is 1,000 transitions and
red is 1 million transitions
and you can see that there is virtually
no difference between 1,000 and 1
million so
if there is an overfitting effect it is
not the conventional kind of overfitting
it doesn't seem to go away as you add
more data
now we also looked at how well the q
function thought it was doing
meaning if you actually look at the
q values
for the current policy that tells you
how well the q function
thinks it's going to do when it's
executed in the world
and this is a plot showing that now
something to note here is that the y
axis here is actually a log scale so
what this is showing is
enormous overestimation actual
performance is below zero
estimated performance is between ten to
the seventh and ten to the twentieth
power
so the q function thinks it'll do
extremely well and it's doing extremely
poorly
well that's kind of weird another
hypothesis we had was well
maybe the trouble is that the training
data is just not good like
maybe the best you can do with this data
is -250
now one way that you can evaluate that
hypothesis is you can just look at the
best
transitions in the data set if you just
copy the best transitions
how well would we do we'd
actually do pretty well so
this is usually not the case in these
offline rl problems it's not that the
training data doesn't have good behavior
it's that somehow the q function becomes
excessively confident and optimistic
about actions that are not actually very
good
to understand what's happening here we
first have to understand distributional
shift
so here's distribution shift in a
nutshell
when we run supervised learning we're
typically solving what's called an
empirical risk minimization problem
so we have some data distribution p of x
we have some label distribution p of y
given x
our training set consists of samples
from p of x and p of y given x
and then we're going to minimize some
loss on those samples
and if we're doing bellman error
minimization then we're minimizing the
squared error
between the values predicted by our
function f theta
and the target values y now when we run
empirical risk minimization
we could ask the question given some new
data point x star
is f theta of x star going to be correct
well one thing that we do know is if we
had enough samples
then the expected value of the error
under our training distribution should
be low that's what generalization means
generalization means the training error
which is the empirical risk is
representative of the
generalization error if you have a large
training set minimizing training error
is going to minimize generalization
error unless you overfit
but that doesn't necessarily mean that
we're going to make an accurate
prediction
on some other point x star
for example the expected value of our
error under some other distribution over
x
p bar of x is not in general going to be
low
unless p bar of x is equal to p of x so
our error under a different
input distribution might be in fact very
high
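
In symbols, the setup just described is roughly the following. The sample count $N$ and the index $i$ are notational assumptions; $\bar{p}(x)$ is the "p bar of x" from the talk.

$$
\theta^{\star} = \arg\min_{\theta}\; \frac{1}{N}\sum_{i=1}^{N} \big(f_{\theta}(x_i) - y_i\big)^2,
\qquad x_i \sim p(x),\quad y_i \sim p(y \mid x_i)
$$

$$
\mathbb{E}_{x \sim p(x)}\big[(f_{\theta}(x) - y)^2\big] \text{ is low with enough samples,}
\quad \text{but} \quad
\mathbb{E}_{x \sim \bar{p}(x)}\big[(f_{\theta}(x) - y)^2\big] \text{ need not be unless } \bar{p}(x) = p(x)
$$
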
in fact even if x star was sampled from
p of x we're not guaranteed that error
on x star is low because we're
minimizing expected error
so we might still have some points with
high error we're not minimizing pointwise
error just expected error
now you might say that usually when we
train deep neural networks
we're not too worried about this kind of
distributional shift because deep
networks are really powerful they
generalize really really well
so maybe it's okay but what if we select
x star so as to maximize f theta of x
see we might have a function that fits
the true function really well
let's say that the green curve
represents the true function and the
blue curve represents our fit
mostly our fit is extremely good however
if we pick the largest value of f theta
of x
which point are we going to land on
we're going to land exactly on that peak
we're going to find the point that has
the largest error in the positive
direction
so even if we're generalizing really
really well even if we're doing well for
most x star points even ones outside of the
distribution
if we maximize over x we're going to get
big errors
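
A small numerical sketch makes this concrete. Everything here is a made-up illustration, not an experiment from the talk: the true function is assumed to be zero everywhere, and the fitted function is assumed to equal the truth plus small zero-mean Gaussian noise.

```python
import random

rng = random.Random(0)

# Hypothetical setup: true function is 0 everywhere; the fit is the truth
# plus small zero-mean noise at each of 1000 candidate inputs.
NUM_POINTS, NOISE_STD = 1000, 0.1
true_values = [0.0] * NUM_POINTS
fitted_values = [t + rng.gauss(0.0, NOISE_STD) for t in true_values]

# On average the fit is good: the mean signed error is close to zero.
mean_error = sum(f - t for f, t in zip(fitted_values, true_values)) / NUM_POINTS

# Picking x* = argmax f(x) lands on the point with the largest positive error.
x_star = max(range(NUM_POINTS), key=lambda i: fitted_values[i])
error_at_x_star = fitted_values[x_star] - true_values[x_star]

print(f"mean error {mean_error:+.3f}, error at argmax {error_at_x_star:+.3f}")
```

The average error stays near zero, while the error at the maximizer is several noise standard deviations above it, because the argmax selects for overestimation.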
so what does this have to do with q
learning well
let's look at our target values in q
learning it's r of s comma
a plus the max over a prime of q of s
prime comma a prime
and i'm going to rewrite this in kind of
a funny way i'm going to write this
as r plus the expected value of q of s prime comma a prime
with a prime sampled from the policy pi new where pi new is
this argmax policy so pi new
assigns a probability of 1 to the
argmax now this is kind of just a really
weird way of writing the max
but i think it makes the distribution
shift problem much more apparent
so let's say that our target values are
called y
what is the objective for the q function
well it's to minimize
the empirical risk the empirical error
against y
under the distribution over states and
actions induced by our behavior policy
pi beta
which means that we would expect q to
give us accurate estimates of q values
under pi beta
right pi beta is the behavior policy
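
Written as equations, the rewrite of the target and the training objective look roughly like this. The argmax policy is written $\pi_{\text{new}}$ here, and the next state $s'$ is a notational assumption.

$$
y(s, a) = r(s, a) + \max_{a'} Q(s', a')
= r(s, a) + \mathbb{E}_{a' \sim \pi_{\text{new}}(\cdot \mid s')}\big[Q(s', a')\big],
\qquad
\pi_{\text{new}}(a' \mid s') = \mathbb{1}\big[a' = \arg\max_{a''} Q(s', a'')\big]
$$

$$
\min_{Q}\; \mathbb{E}_{(s, a) \sim \pi_{\beta}}\Big[\big(Q(s, a) - y(s, a)\big)^2\Big]
$$

The loss is averaged over actions from $\pi_{\beta}$, while the target queries $Q$ at actions from $\pi_{\text{new}}$.
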
so we expect good accuracy when pi beta
is equal to pi new so if pi new is pi beta
then we're fine we're going
to have low error in our estimates
but how often is that true pi new is
chosen to maximize the q value so unless
pi beta
was actually also maximizing those same
q values before it even knew what they
were
these things are not going to match and
it's even worse
pi new is selected to be the argmax
and if you think back to the previous
slide when you select your point with a
max
you get some bad news and that's why we
see
in the half cheetah experiments that the
policy does poorly
whereas the q function thinks it's doing
really well because it's finding the
actions
with the largest error in the positive
direction and that's why just naively
applying off-policy rl methods to the
offline setting
does not in general yield great results
so we have to somehow combat this
overestimation
issue we have to combat the
distributional shift
in the action space so let's talk about
how we can do this

# how do we design offline rl algorithms?

how do we design offline rl algorithms
well one very large class of methods in
the literature and this is summarized in
a tutorial
that we assembled recently called
offline reinforcement learning tutorial review
and perspectives on open problems
one very large class of methods is what
we're going to call policy constraint
methods
so it's easiest to describe policy
constraint methods in the actor critic
setting but they can be applied in the q
learning setting too
so here we're going to update our q
function just like before using the
expected value under pi new
and then we'll update our policy not
just by taking the
argmax but by taking the argmax subject to a
constraint
that some measure of divergence between
pi new and pi beta
is bounded by epsilon so the idea is
that if
pi new stays close to pi beta then the
distributional shift
will be bounded and therefore the
effect of out of distribution actions will
be bounded
and then you repeat this process so it's
just like an actor critic algorithm
only with this additional constraint
against the behavior policy
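
As a sketch, one iteration of such a policy constraint method can be written as the two updates below. The divergence $D$ is left generic, as in the talk, and the new policy is written $\pi_{\text{new}}$; the next state $s'$ and the choice to average over states from the data set are assumptions.

$$
\begin{aligned}
Q &\leftarrow \arg\min_{Q}\; \mathbb{E}_{(s, a) \sim \pi_{\beta}}\Big[\big(Q(s, a) - \big[r(s, a) + \mathbb{E}_{a' \sim \pi_{\text{new}}(\cdot \mid s')}[Q(s', a')]\big]\big)^2\Big] \\
\pi_{\text{new}} &\leftarrow \arg\max_{\pi}\; \mathbb{E}_{s,\; a \sim \pi(\cdot \mid s)}\big[Q(s, a)\big]
\quad \text{s.t.} \quad D\big(\pi, \pi_{\beta}\big) \le \epsilon
\end{aligned}
$$
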
so does this solve distributional shift
does it mean that we have no more
erroneous values
well to a degree it actually does so
there's a
pretty large class of methods that have
explored various kinds of policy
constraints
it's a very old idea we coined
the term policy constraint i think in
our tutorial
but these kinds of approaches
have been called many things in the
literature
including kl divergence control
maximum entropy rl linearly solvable mdps all
sorts of things
and there are many researchers that
have studied this for decades
not just in the offline rl literature
but in the rl literature more broadly
more recently this has been used
extensively for various off policy and
offline rl methods
here is just a collection of recent
citations kind of from the last five
years
that have done some variant of this
now this does have a number of problems
one problem is that
this kind of approach might be way too
conservative
imagine the following scenario let's say
that you have
a behavior policy that is
highly random in some places and almost
deterministic in other places
now when the behavior policy is highly
random you actually get pretty good
coverage of actions
and the effect of out of distribution actions is
actually minimized in fact if the
behavior policy was
uniformly random nothing would be out of
distribution because it has full support
but if you have a policy that is highly
random in some places and highly
deterministic in others
it's actually very difficult to pick a
value of epsilon that works
because if you pick a value of
epsilon that is too high
then in the places where the behavioral
policy is highly random you'll be able
to do all sorts of interesting things in
its support
but in the places where it's not highly
random you might take a really bad
action
right so imagine it's kind of close to
deterministic when it's
crossing a narrow bridge where with some
small probability it falls off
if you admit that amount of error you
might fall off the bridge
if you use a tighter constraint if you
limit the amount of deviation from the
behavior policy very strictly
then you won't fall off the bridge
you'll match it in that highly
deterministic region but you'll also be
forced
to match the highly random distribution
in the region where it's random and
that's just useless right
if the policy is being very random being
just as random
doesn't seem to give you anything like
if it's random you can just choose
within that support whatever you want
so it's hard to pick a single value of
epsilon that works well everywhere
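
The bridge example can be put into numbers with a small sketch. The two behavior distributions, the candidate policies, and the two epsilon values are all made up for illustration, and KL(pi || pi beta) is an assumed choice of divergence; the talk leaves the divergence unspecified.

```python
import math


def kl(p, q):
    """KL(p || q) for discrete distributions as lists; terms with p_i = 0 contribute 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical behavior policy over 4 actions in two states.
beta_open = [0.25, 0.25, 0.25, 0.25]    # highly random: every action is well covered
beta_bridge = [0.97, 0.01, 0.01, 0.01]  # nearly deterministic: action 0 stays on the bridge

greedy_open = [1.0, 0.0, 0.0, 0.0]      # commit to one well-covered action
risky_bridge = [0.6, 0.4, 0.0, 0.0]     # 40% chance of a rarely seen, dangerous action

kl_open = kl(greedy_open, beta_open)       # log(4), about 1.386
kl_bridge = kl(risky_bridge, beta_bridge)  # about 1.19

loose_eps, tight_eps = 1.4, 0.05
# Loose constraint: both policies pass, including the risky one on the bridge.
print(kl_open <= loose_eps, kl_bridge <= loose_eps)
# Tight constraint: both policies are rejected, including the useful one in the open state.
print(kl_open <= tight_eps, kl_bridge <= tight_eps)
```

Under these assumptions no single epsilon admits the useful deterministic choice in the random state while ruling out the risky policy on the bridge, since the risky policy has the smaller divergence of the two.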
now one thing that can mitigate this is
to use a support constraint
basically say it's not that you have to
match the distribution you have to match
the support of that distribution
so if the distribution is highly random
you can do sort of anything you want
within that support
but if it's highly deterministic you
really need to do whatever it did
because you don't have any wiggle room
to go out of support and we explored
this a little bit
in a paper from 2019 that
introduced an algorithm called BEAR
which uses a support constraint the second issue
issue
which is a bit tougher to deal with and
actually pretty problematic in practice
is that estimating
the behavior policy itself
in order to enforce this constraint can
be very difficult
if you don't know what the behavior
policy was maybe the data was collected
by a human
you have to somehow figure out what
policy that human was following
and if you make a mistake in figuring it
out then your constraint might be
incorrect
so for example if you fit the behavior
policy with supervised learning
we know that supervised learning will
make certain mistakes it has kind of a
moment matching effect which means that
it will average across modes
which means you might have high
probability actions under your estimated behavior
policy
that have very low probability under the
true behavior policy
and when you enforce a constraint
against that the constraint will be
highly imperfect
and this can lead to very poor
performance in practice
now when is estimating the behavior
policy especially hard
well it's especially hard when the
behavior policy actually consists of a
mixture of multiple different behaviors
and this is actually exactly the setting
that we want because remember i
motivated all of this
by talking about the setting where you
have a large and highly diverse data set
and that's exactly what you would expect
that your
behavior policy would actually be a
mixture of many different policies
and at that point estimating it with a
single neural network is actually very
hard
so the easy case is when all the data
comes from the same markovian policy
but this is not very common or very
realistic if your data came
from humans they're certainly not going
to be markovian if it came from many
past tasks you've done
while each individual task might be
markovian the mixture might not be
so the hard case is where the data comes
from many different policies
and this is very common in reality and
it's also very common when you're doing
online fine-tuning so if you remember
when i motivated all this i also said
that we want to collect lots of data
from past behaviors train a policy with
offline rl and then maybe fine tune it
further with a little bit of online data
so let's use this online fine-tuning as
a kind of test case
in reality what we care about is the
setting where data comes from many
different policies but the online
fine-tuning situation
is a nice sort of petri dish in which to
explore this problem
so i'm going to talk about a method that
we've developed that specifically
targets that setting
where first you have offline training on
a large previously collected data
set and then you have some online
fine-tuning and during online
fine-tuning you're adding more data
from many different policies
so this is work by two of my students
Ashvin Nair and Abhishek Gupta together
with an undergraduate student
named Murtaza Dalal and so the experiment is
that we do online fine-tuning from
offline initialization
and what i'm showing in this plot is the
log likelihood
of the behavior policy fit so this is
for one of these standard
policy constraint methods in this case
BEAR and the left side shows the
offline training so you can see during
offline training we have pretty good log
likelihood for fitting our behavior
policy
but during online training when we have
these additional samples from
many different policies being pushed to
the buffer the likelihood
of the policy estimate drops
precipitously it drops very sharply for
the online data and even drops for the
old offline data because
you have to also model the new online
data so you do worse on the
offline data so our fit
gets bad and in fact we can see that
this method this is BEAR but this would
be true i think for most policy
constraint methods
doesn't do so well so this is just
showing the online fine tuning
and you're gonna have two choices you
can use a strict constraint or a loose
constraint
if you use a strict constraint then you
do a little bit better at the beginning
that's the yellow line
but you fail to improve because as the
behavior policy deteriorates
that strict constraint just basically
causes everything to get stuck
if you use a loose constraint then you
improve over the course of online
training but you start off in a really
bad place
and in this case it's no better than
just initializing from scratch
and in fact this general principle is
borne out in a few other papers for
example
this paper i have cited here called EMaQ
shows that using a much more powerful
class of behavior policies
leads to substantial improvement
implying that behavior policy modeling
is really a major bottleneck for these
policy constraint methods
so one of the things we could do is we
could actually enforce a constraint
without explicit modeling of the
behavior policy so this is a kind of
implicit constraint method
so here's the idea the problem we want
to solve
is this constrained argmax
it turns out that you can prove that the
solution to this constrained
optimization problem
can be expressed in closed form and the
way that you do this is you write down
the lagrangian
solve for the dual and then you can
actually write down the optimum for the
dual
and the optimal solution is given by one
over z times the behavior policy
times the exponential of the advantage
function
and the advantage function is multiplied
by one over lambda where lambda
is a lagrange multiplier and this is
straightforward to show using a duality
argument
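
Written out, the constrained problem and its closed-form solution look roughly as follows. This is a sketch reconstructed from the spoken description: the direction of the kl constraint, the bound epsilon and the normalizer Z(s) are assumptions, since the slide itself is not part of the transcript.

```latex
\pi^\star = \arg\max_{\pi} \; \mathbb{E}_{a \sim \pi(\cdot \mid s)}\left[ A^{\pi}(s, a) \right]
\quad \text{s.t.} \quad
D_{\mathrm{KL}}\left( \pi(\cdot \mid s) \,\|\, \pi_\beta(\cdot \mid s) \right) \le \epsilon

\pi^\star(a \mid s) = \frac{1}{Z(s)} \, \pi_\beta(a \mid s) \,
\exp\left( \frac{1}{\lambda} A^{\pi}(s, a) \right)
```
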
uh we didn't actually come up with this
this has actually been brought up
in a number of previous papers including
the
reps papers by peters et al the
rawlik et al psi learning paper and many
others this is actually a very
well known identity
in kl regularized control
but the interesting thing about this
equation is that it shows
that we can approximate the solution to
the constrained optimization
using a weighted maximum likelihood
objective
and the way to do this is to note that
matching this pi
star is the same as
maximizing the likelihood of actions
taken from
pi star and you can do that by taking
actions instead from pi
beta and weighting them by the
exponentiated advantage
so if you can find the
argmax of the expression i have written
here then you will get pi star
provided you can match pi star exactly
and the cool thing about this
expression now is that
pi beta only shows up as the
distribution of the expectation so even
though you need pi beta for this
in reality all you actually need is
samples from pi beta which means that
you do not need to actually know the
distribution pi beta
and samples from pi beta that's exactly
uh what our data set is composed of
so we no longer need to estimate the
behavioral policy we just need samples
from it and our data set already
contains those
and of course the advantage we get from
our critic so you could imagine an actor
critic algorithm
which updates the critic to get the
advantages and then updates the policy
using this weighted maximum likelihood
expression
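
To make the weighted maximum likelihood step concrete, here is a minimal sketch for a single state with a categorical policy, where the weighted fit has a closed form. The function name, the toy data and the advantage values are all hypothetical, and a real implementation would fit a neural network policy by gradient ascent rather than by counting.

```python
import numpy as np

def advantage_weighted_policy(actions, advantages, num_actions, lam=1.0):
    """Closed-form weighted maximum likelihood fit of a categorical policy.

    actions:    integer actions sampled from the (unknown) behavior policy
    advantages: the critic's advantage estimate for each sampled action
    lam:        temperature playing the role of the lagrange multiplier
    """
    actions = np.asarray(actions)
    advantages = np.asarray(advantages, dtype=float)
    # Subtracting the max rescales every weight by the same constant,
    # which cancels after normalization but avoids overflow in exp.
    weights = np.exp((advantages - advantages.max()) / lam)
    # For a categorical policy the weighted MLE is the normalized
    # sum of weights per action.
    totals = np.bincount(actions, weights=weights, minlength=num_actions)
    return totals / totals.sum()

# Toy single-state data set: the behavior policy picked action 0
# three times out of four. We never use pi beta itself, only samples.
data_actions = np.array([0, 0, 0, 1])
adv_per_action = np.array([0.0, 1.0])  # hypothetical critic output
policy = advantage_weighted_policy(
    data_actions, adv_per_action[data_actions], num_actions=2, lam=1.0
)
```

In this toy case the result is proportional to the empirical behavior policy times the exponentiated advantage, which is the closed-form solution described above.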
all right so does this procedure work
well if we look at some results from the
paper we evaluated on these kind of
dexterous manipulation tasks
where we had a data set from human
demonstrations
uh with a data glove and then if
you look at what those human
demonstrations do they get a success
rate of about 24 percent
on this door task and with some online fine
tuning we get an 88 percent
success rate so this is doing much
better than if you just copy the
demonstrations
and you get meaningful fine tuning now
of course this is an offline rl method
so it doesn't actually need
demonstrations per se
that's just what we had in these
experiments we had other experiments
that also used random data
and quantitatively the method actually
does really well it's shown by this red
line here
the pen task is the easiest the door is
kind of medium and the relocate task is
the hardest and you can see that as the
tasks get harder
this method denoted as awac ends up
doing the best
so it kind of matches previous work on
the easy task and then greatly exceeds
it on the harder tasks
so this shows that something you know is
really working with these implicit
constraints
all right now let me take a little
detour to
delve a little bit into the world of
model based reinforcement learning
because
we can also develop effective offline rl
algorithms in the model based world too
so uh just a brief overview of
model-based rl
for those of you that aren't familiar
with it in model based rl
what we're going to do using data that
we collect from the environment is we're
going to fit a model we're going to fit
an
estimate to the transition probabilities
p hat of st plus 1
given st comma at and then we'll
somehow use that model
to train a policy pi of a t given st and
then typically repeat the process
so one way that you could imagine such a
model based rl algorithm working and
this is sort of a dyna style recipe
is that you collect real data denoted by
these black trajectories
you use that real data to fit your model
and then you're going to use that model
to make
kind of little short roll outs from
different states in your data set
and these little short rollouts will be
on policy using your latest policy
so the model kind of answers these
what-if questions if i were in this
state that i had seen before
what would have happened if i took a
different action and this of course lets
you train your policy much more
efficiently
and of course in offline rl we're going
to omit this arrow we're not going to
allow the algorithm to collect more data
it has to content itself with the data
that it already has so what could go
wrong
well as you might have already guessed
the problem again is going to be
distributional shift
when we make these rollouts under the
model
if the policy that we're now evaluating
is very different from the policy that
originally collected the data
rolling out the model under that policy
will result in state visitations
that are very different from the states
under which the model was trained
and the model will make very large
mistakes and of course since the policy
is again being trained to maximize the
return
it'll learn to exploit the model to
essentially cheat so the model
erroneously produces good states
now in the in the literature one of the
ways that people have mitigated this
effect
is by just shortening these rollouts so
if you don't allow the policy to roll out for
very many steps under the model there's
only so much exploiting that it can do
but in the fully offline setting of
course you don't want to do this because
your policy might deviate drastically
from the behavior policy so you want to
give it longer roll outs
so you can actually tell how well it's
doing
so the model's predictions will be
invalid if you get these out of
distribution states
very much like the problem we saw before
so
one solution we've developed and this is
uh joint work with
tianhe yu and garrett thomas who are
the co-first authors as well as a number
of collaborators
most of the team here is from stanford
university led by professors
chelsea finn and tengyu ma and
there's also some concurrent work
by kidambi et al that also proposes a
very closely related algorithm called
morel
but i'm going to talk about our paper
called mopo
so the solution is to basically punish
the policy for exploiting these out of
distribution states
so what we're going to do is we're going
to take our original reward function
and we'll subtract a penalty
multiplied by a coefficient lambda and this
penalty u
is essentially going to be some quantity
that is larger for states that are more
out of distribution or state action
tuples that are more out of distribution
so it's a kind of uncertainty penalty
and that's the only change we're going
to make we're just going to change our
reward like this
and then use some existing model based
rl algorithm the one that we used in
mopo is based on an algorithm called
mbpo model based policy optimization
so the idea now is when you visit one of
these out of distribution states
you get a really big penalty and that
results in the policy not wanting to
visit those out of distribution states
and instead curving back
to the support of the data
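
As a rough illustration of the reward change, the sketch below subtracts lambda times an uncertainty penalty u from the reward. The particular disagreement measure (largest distance of an ensemble member from the ensemble mean) is an assumption made for this example, not necessarily what the paper uses, and the function name and toy numbers are hypothetical.

```python
import numpy as np

def penalized_reward(reward, ensemble_next_states, lam=1.0):
    """Return reward - lam * u, with u an ensemble-disagreement penalty.

    reward:               shape (N,), rewards for N state-action pairs
    ensemble_next_states: shape (K, N, D), next-state predictions from
                          K models for the same N state-action pairs
    """
    preds = np.asarray(ensemble_next_states, dtype=float)
    mean_pred = preds.mean(axis=0)                       # (N, D)
    # Assumed disagreement measure: largest distance of any ensemble
    # member from the ensemble mean prediction.
    dists = np.linalg.norm(preds - mean_pred, axis=-1)   # (K, N)
    u = dists.max(axis=0)                                # (N,)
    return np.asarray(reward, dtype=float) - lam * u

rewards = np.array([1.0, 1.0])
# Two models, two state-action pairs, 1-d states: the models agree on
# the first pair and disagree on the second.
ensemble = np.array([[[0.0], [0.0]],
                     [[0.0], [2.0]]])
r_tilde = penalized_reward(rewards, ensemble, lam=0.5)
```

The pair the models agree on keeps its reward, while the pair they disagree on is penalized, so a policy trained on the penalized reward is pulled back toward regions where the model is trusted.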
okay so there's some pretty interesting
theoretical analysis that we can do for
this algorithm
and the theory this is kind of the
statement from the main theorem in the
paper
uh which states that under two technical
assumptions
the learned policy pi hat in the mopo
algorithm satisfies the property that
the return of pi hat
is greater than or equal to the supremum
over pi
of uh the return of pi minus two times
lambda
times epsilon u okay so what are
all these things
there are two technical assumptions one
is that we can basically represent
the value function so things will break
down badly if you have really bad
function approximation error and the
second one
is an assumption on u which says that
the true model error
is bounded above by u so this is
essentially saying that u is a
consistent estimator
for the error in the model and you need
this property
basically because you need u to
mean something in practice the way that
we estimate u
is by training an ensemble of models and
measuring their disagreement but any
estimator that upper bounds
the error in the model will do the job
so this is the true return of the policy
trained under the model so not the return
the model thinks it has but the
return it actually has and
epsilon here is just the expected value
of u
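
In symbols, the statement of the theorem as described here can be sketched like this. The notation eta_M for the true return is an assumption, and the talk does not say exactly which state-action distribution the expectation in epsilon_u is taken under, so that part is left generic.

```latex
\eta_M(\hat{\pi}) \;\ge\; \sup_{\pi} \left\{ \eta_M(\pi) - 2 \lambda \, \epsilon_u(\pi) \right\},
\qquad
\epsilon_u(\pi) = \mathbb{E}_{(s, a) \sim \pi}\left[ u(s, a) \right]
```
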
so what this is saying is that the true
return of a policy we optimize under our
model
will actually be
at least as high
as the best policy that still has to pay
this
error price another way of saying that is
it'll be at least as good
as the best in distribution policy of
course you can't get the optimal policy
because you don't know what happens out
of distribution but you can
do at least as well
as the best in distribution policy
another way of saying this is that um
you can construct this policy pi delta
which is
the best policy whose error is bounded
basically the best policy that
doesn't visit states with error larger
than delta
so this is under the true mdp but it's
saying you're going to
find the best policy under the true mdp
that doesn't visit states where the
model would have made mistakes
and this result says that the policy we
learn with model-based rl will be at
least as good as that policy
minus an error term that scales as two
times lambda times delta
so this basically shows that we'll get a
good policy within the support of our
data
so some implications of this this means
that we always improve over the behavior
policy
and we can actually quantify the
optimality gap against the optimal
policy in terms of model error so
basically
if our error on the states that the
optimal policy would visit is small
then we'll get close to the optimal
policy
empirically this method does very well
it outperforms regular mbpo it also
outperforms
quite a few of the previously proposed
policy constraint methods
all right so now let me discuss the last
algorithm i'm going to cover
which is called conservative q learning
conservative q learning takes a slightly
different approach
to offline rl from policy constraint
methods so
just to remind you the problem we saw
before is that the q function kept
thinking that it's going to do much
better than it's actually going to do
so it's going to overestimate like this
because when we take the
argmax we get the point with the largest
positive error
so instead of trying to constrain the
policy what if we try to directly fix
the problem what if we directly push
down on erroneously high q values
one of the ways we could do this is we
can formulate a q
learning objective with a penalty so we
have the usual term which says
minimize the bellman error and then we
have a penalty term which says
pick the actions with the high q values
and push down on those q values so
minimize the q values under this
distribution mu
which is chosen so that the q values
under mu are high
so this will basically find these
erroneous positive points and push them
down
it turns out that with this very simple
regularizer we can actually prove
that the q function we learn is a lower
bound on the true q function for the
policy pi
if you choose alpha the weight on the
regularizer appropriately
that's pretty cool we're actually
guaranteed not to overestimate if we do
this
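
One way to write the penalized objective just described is sketched below. The factor of one half, the target q function written as Q bar and the bellman backup operator B^pi are notational assumptions, since the slide is not part of the transcript.

```latex
\hat{Q}^{\pi} = \arg\min_{Q} \;
\alpha \, \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(\cdot \mid s)}\left[ Q(s, a) \right]
+ \frac{1}{2} \, \mathbb{E}_{(s, a) \sim \mathcal{D}}\left[
\left( Q(s, a) - \mathcal{B}^{\pi} \bar{Q}(s, a) \right)^2 \right]
```
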
um so this is work that was primarily
led by my student aviral kumar
and the particular algorithm that
aviral proposed is a kind of actor
critic style algorithm
where you learn this lower bound q
function for the current policy
uh so that q hat is less than or equal
to q
and then you update your policy now you
don't actually need to represent the
policy explicitly
you can still have a q
learning algorithm where it's implicit
where it's the argmax policy
but it's a little easier to explain in
the actor critic setting we call this
conservative q learning though because
it's also very simple to instantiate as
a q learning method
now you can also derive a much better
bound for conservative q learning the
bound i had on the previous page
was actually too conservative it
actually pushed down the q values too
much
what you can do is you can push down on
the high q values
but you can compensate by also pushing
up on the q values in the data set
so you push down on the q values that the
q function thinks are high
and then you push up on the q values uh
for the actions in the data set
intuitively what this means is that
if the high q values are all for actions
that are in the data set
then these right two terms will cancel
out and the regularizer goes away so
when you're in distribution
there is no regularization if you go
more out of distribution you get more
regularization
so these are the two error terms push
down on actions from mu
push up on actions from d from the
dataset
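
Here is a small sketch of this two-term regularizer for discrete actions. It assumes one common choice of mu, a soft maximum over actions, which turns the push-down term into a logsumexp of the q values. That choice, the function name and the toy q table are assumptions for illustration and are not taken from the talk.

```python
import numpy as np

def cql_regularizer(q_values, data_actions):
    """Per-state gap: logsumexp_a Q(s, a) - Q(s, a_data).

    q_values:     shape (N, A), q estimates for every action in N states
    data_actions: shape (N,), the action stored in the data set per state
    """
    q = np.asarray(q_values, dtype=float)
    idx = np.asarray(data_actions)
    # Numerically stable logsumexp: a soft maximum over actions, which
    # corresponds to a mu that favors actions with high q values.
    q_max = q.max(axis=1, keepdims=True)
    push_down = q_max[:, 0] + np.log(np.exp(q - q_max).sum(axis=1))
    # q value of the action that actually appears in the data set.
    push_up = q[np.arange(len(q)), idx]
    return push_down - push_up

q = np.array([[10.0, 0.0, 0.0],    # high q value on the data action
              [0.0, 10.0, 0.0]])   # high q value on an unseen action
penalty = cql_regularizer(q, data_actions=[0, 0])
```

In the first state the high q value belongs to the action in the data set, so the two terms nearly cancel. In the second state the high q value belongs to an action that was never taken, so the penalty is large.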
now you're no longer guaranteed to have
a bound for all state action tuples if
you do this
but you are it turns out still guaranteed
to have a bound in expectation under
your policy
so the expected q value under pi for q
hat will still be less than or equal to
the expected value
under the uh the true q function for pi
for all the states provided that alpha
is chosen appropriately of course
so um the full conservative q
learning algorithm is shown here you
minimize the big q values and you
maximize
uh the q values under the data the full
bound is written out here
so the left side is the estimated value
function it's less than or equal to the
true value function
minus a positive pessimism term due to
the regularizer and then there's this
error term that accounts for sampling
error
so you just have to choose alpha so that
this positive pessimism term
is larger than the positive sampling
error term obviously if you have more
data then your sampling error is lower
and you don't have to worry as much if
you have high sampling error then you
need a higher alpha
to be conservative
okay so does this bound hold in practice
one of the things we did is we actually
empirically measured underestimation
versus overestimation on a simple
benchmark task
so what i'm going to be showing is
basically the expected value of the
learned q function
q hat minus the expected value of the
true q function
so if these values are negative that
means that q hat is less than
q in expectation if they're positive
that means we're overestimating
and we get the true q function by
actually rolling out the policy many
times and using monte carlo
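
The measurement can be sketched as follows: estimate the true value by averaging discounted returns from rollouts, then subtract it from the mean learned q value. All names and numbers here are hypothetical and only illustrate the sign convention.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over one rollout."""
    rewards = np.asarray(rewards, dtype=float)
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * rewards))

def estimation_gap(q_hat_values, rollouts, gamma):
    """Mean learned q value minus the mean monte carlo return.

    Negative means underestimation, positive means overestimation.
    """
    mc_returns = [discounted_return(r, gamma) for r in rollouts]
    return float(np.mean(q_hat_values) - np.mean(mc_returns))

# Hypothetical numbers: two rollouts of the policy and the critic's
# q estimates for the corresponding starting state-action pairs.
rollouts = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
gap = estimation_gap(q_hat_values=[1.5, 1.5], rollouts=rollouts, gamma=0.5)
```

Here each rollout has a discounted return of 1.75, so a learned value of 1.5 gives a negative gap, meaning the critic is pessimistic.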
so here are the results so the first
column shows this
the full cql method with both the
minimization and maximization term
the second column shows just the basic
method that has just the minimization
but no maximization
then we have four different ensembles so
an ensemble of two networks four
networks 10 networks and 20 networks
and then we have bear which is a
representative policy constraint method
now the first thing that you might note
is that all of the ensembles and the
policy constraint method
are overestimating massively despite the
policy constraint
the simple cql variant that just has the
minimization
is underestimating but by quite a lot so
rewards here are on the order of a few
hundred
so getting minus 150 means that you're
very heavily underestimating
whereas the full cql method
underestimates but only by a little bit
so we do have a lower
bound and the lower bound is pretty good
so cql always has negative errors
which means that it's pessimistic
all right now before i tell you how well
cql actually does empirically
i want to talk a little bit about how we
should evaluate offline rl methods
so how should we evaluate them well uh
ultimately what we're going to want to
do is train our offline rl methods using
large and highly diverse data sets
but in the meantime when we're just
prototyping algorithmic ideas
we need to somehow collect uh you know
some data to set up a benchmark task so
maybe one thing we could do is we could
collect some data using some online rl
algorithm and then use it to evaluate
offline rl
this is actually pretty typical in prior
work you train pi beta with online rl
and then you either collect data
throughout training or you take data
from the final policy
i'm going to claim that this is a really
bad idea this is a really terrible way
to evaluate offline reinforcement
learning methods and if you're doing
research on offline rl
you should not use data sets that have
just this kind of data
because if you already have a good
policy why bother with offline rl
but perhaps more importantly in the real
world that's not what your data is going
to look like
your data is not going to come from a
markovian policy it's going to come from
humans from many different sources from
your past
behavior it's going to be multitask it's
going to be diverse it's not going to
look like the replay buffer
for an online rl run so human users and
engineered policies etc
so if you really want to evaluate your
offline rl method you really have to use
data that is representative
of real-world settings and leaves plenty
of room for improvement
and tests whether offline rl allows you to learn
policies that are much better than the
behavior policy that are better than the
best thing in your data set
without testing for these properties you
can't really trust
that our offline rl algorithms are good
and
of course in past work from my group
we've also been guilty of doing this but
we're we're mending our ways we're not
going to do this anymore
so we developed a benchmark suite called
d4rl it stands for data sets for deep data
driven rl
this was led by my student justin fu
together with aviral kumar
uh ofir nachum and george tucker
and d4rl is actually rapidly uh you know
picking up it's rapidly becoming the
most popular benchmark for offline rl
because it really exercises the kinds of
properties you want in offline rl
algorithms
so what are some important principles to
keep in mind well you want data from
non-rl policies
including data from humans so we
included things like the dexterous
manipulation data that i showed before
this was based on data collected by
aravind rajeswaran and colleagues
you want to evaluate whether your
algorithm can put together different
parts of different trajectories we'll
call this stitching
so if you've seen for example you can go
from a to b and you've seen that you can
go from b to c
but it's more optimal to go from a to c
the data actually tells you everything
you need to know to figure that out
so you should be able to do this and do
better than the best trajectory in the
data set
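
A tiny tabular sketch shows why dynamic programming can stitch. The data below holds two separate trajectories, one going from a to b and one going from b to the goal c, and q iteration over just those stored transitions still assigns value to starting at a. The function name, reward values and discount are hypothetical choices for this example.

```python
def fitted_q_from_data(transitions, gamma=0.9, sweeps=50):
    """Tabular q iteration that only ever uses transitions in the data set.

    transitions: list of (state, action, reward, next_state, done)
    """
    q = {}
    # Record which actions were seen in each state, so the backup only
    # maximizes over actions that appear in the data.
    seen = {}
    for s, a, _, _, _ in transitions:
        seen.setdefault(s, set()).add(a)
        q[(s, a)] = 0.0
    for _ in range(sweeps):
        for s, a, r, s2, done in transitions:
            future = 0.0
            if not done and s2 in seen:
                future = max(q[(s2, a2)] for a2 in seen[s2])
            q[(s, a)] = r + gamma * future
    return q

# Two separate trajectories: one goes a -> b, the other b -> c (goal).
data = [
    ("a", "right", 0.0, "b", False),  # trajectory 1, never reaches the goal
    ("b", "right", 1.0, "c", True),   # trajectory 2, starts at b
]
q = fitted_q_from_data(data)
```

No single trajectory goes from a to c, yet the backup through b propagates the goal reward to a, which is the stitching behavior these maze tasks are meant to test.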
uh so we evaluate this using some maze
navigation tasks both in a simple low
dimensional 2d setting
and a complex setting where you've got a
simulated four-legged robot that has to
walk through the maze
so you never see full paths from the
start to the goal but you see paths
between different places in the maze in
your data set
and you also have to have realistic
tasks we included first person driving
data from images using the carla
simulator
data manipulating objects in a kitchen
from a paper by abhishek gupta called
relay policy learning
and traffic control data from professor
alexandre bayen's lab
that simulates the effect of autonomous
cars on traffic
so the set of uh d4rl tasks includes the
standard
mujoco gym tasks
with some difficult data sets the
stitching tasks and the mazes
dexterous manipulation tasks with data
from real humans
robot manipulation tasks in this kitchen
environment again using human data
traffic control data from the flow
simulator and
image-based driving in carla
now if we look at how cql compares on
this benchmark first let me show you the
prior methods
one of the things to note is on the
harder benchmark tasks
we actually see that first of all
nothing works on the harder stitching
task so
on the larger mazes previous methods
basically don't learn anything these
scores are all normalized between 0 and
100.
the most competitive baseline across all
of these harder tasks
is just simple behavior cloning which
suggests that previously proposed
offline reinforcement learning methods
which have primarily been tested
on data from other rl policies are not
actually doing very well
so nothing beats behavior cloning on
these harder tasks
if we look at the performance of cql it
achieves state-of-the-art results on
nearly all of these tasks
so i'm showing two variants of cql and
one of these two variants
is the best on all but one of the tasks
or tied for the best so it's up to five
times better on the harder dexterous
manipulation tasks
50 to 300 percent better
on
just the human data 10 to 30 percent better on
the kitchen tasks
and essentially infinitely better on the
larger mazes where it's the only
algorithm that's able to exhibit the
stitching behavior
we also evaluated the method on atari
data from a paper by agarwal et al and
we saw there also that
cql is 50 to 600 percent better uh than
previously proposed algorithms
so this method is doing really well it
seems to work quite well across many
tasks
and we seem to know why it works because
of this lower bound property
of course there's still plenty of room
for improvement so if you want to
develop better offline rl methods
there's plenty of work to do and
plenty of ways
in which you can improve the results all
right
so just to wrap up and conclude i talked
about how offline rl
is quite difficult but has enormous
promise and initial results suggest that
it can be extremely powerful
i talked about how effective dynamic
programming offline rl methods can be
implemented by imposing constraints on
the policy and perhaps implicit
constraints
can get around the need to model the
behavior policy and i talked about how
learning a lower bound on the q function
using conservative q learning
can substantially improve offline rl
performance thank you very much