
Difference with Conservative-Q Learning (CQL) #3

Closed
emailweixu opened this issue Jul 27, 2022 · 5 comments

@emailweixu

The relative pessimism objective (Eqs. (1) and (2)) proposed in ATAC seems exactly the same as the learning objective (3) in [1], and Algorithm 2 in ATAC looks remarkably similar to Algorithm 1 in [1], apart from some implementation caveats. Could you explain what the major difference between ATAC and CQL is?

[1] Kumar et al. Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS 2020.

@tengyangxie

Hi @emailweixu, thanks for reaching out!

One notable difference is that CQL includes an inner maximization over a policy in its critic objective, whereas ATAC does not. This difference enables ATAC to recover imitation learning (when the Bellman term is turned off) and further leads to the robust policy improvement property, which CQL does not have.
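
Schematically (a rough sketch in notation loosely following the paper, not the exact equations), ATAC solves

$$
\hat{\pi} \in \operatorname*{argmax}_{\pi \in \Pi} \; \mathcal{L}_{\mu}(\pi, f^{\pi}),
\qquad
f^{\pi} \in \operatorname*{argmin}_{f \in \mathcal{F}} \; \mathcal{L}_{\mu}(\pi, f) + \beta \, \mathcal{E}_{\mu}(\pi, f),
$$

where $\mathcal{L}_{\mu}(\pi, f) = \mathbb{E}_{\mu}[f(s, \pi) - f(s, a)]$ is the relative pessimism term and $\mathcal{E}_{\mu}(\pi, f)$ is a Bellman error term. There is no inner maximization over a separate policy inside the critic objective, and with $\beta = 0$ the Bellman term disappears, so the remaining adversarial game pushes $\pi$ toward the data actions, i.e., imitation learning.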

For a detailed comparison between ATAC and CQL, please refer to Appendix D of our paper.

@emailweixu
Author

@tengyangxie, thanks for the response. Appendix D is very helpful in explaining some theoretical differences such as minmax vs. maxmin. But as a practitioner, I am more interested in the differences between the practical algorithms (i.e., Algorithm 2 in ATAC and Algorithm 1 in CQL). As shown in the attached pictures, steps 4 and 5 of ATAC correspond to steps 3 and 4 of CQL, and it seems to me that they are doing the same thing. I understand that there are some differences in implementation details, such as using only $f_1$ for the actor update, different learning rates, and the use of the DQRA loss.
[Screenshots: Algorithm 2 from the ATAC paper and Algorithm 1 from the CQL paper]

@chinganc
Collaborator

chinganc commented Aug 2, 2022

Hi @emailweixu, you're right about those implementation differences. Among them, using a two-timescale stepsize and a single critic in the actor update is crucial to achieving proper adversarial training.
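
For concreteness, here is a minimal PyTorch-style sketch of the two-timescale / single-critic structure (toy networks and placeholder losses with the Bellman term omitted; this is not the released ATAC implementation):

```python
import torch
import torch.nn as nn

# Toy sketch: two-timescale stepsizes and a single critic in the actor update.
torch.manual_seed(0)
state_dim, act_dim = 4, 2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

critic1, critic2 = mlp(state_dim + act_dim, 1), mlp(state_dim + act_dim, 1)
actor = mlp(state_dim, act_dim)

# Two-timescale: the critic (adversary) uses a much larger stepsize than the actor.
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=5e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-7)

def q(critic, s, a):
    return critic(torch.cat([s, a], dim=-1)).squeeze(-1)

for _ in range(3):  # stand-in for minibatch iterations over the offline dataset
    s = torch.randn(32, state_dim)        # states from the dataset
    a_data = torch.randn(32, act_dim)     # actions from the dataset
    a_pi = torch.tanh(actor(s)).detach()  # actions from the learner policy

    # Critic step (fast): both critics minimize the relative-pessimism loss
    # (the Bellman error term is omitted in this toy sketch).
    critic_loss = sum((q(c, s, a_pi) - q(c, s, a_data)).mean() for c in (critic1, critic2))
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor step (slow): only critic1 appears, so the actor plays against a single
    # adversarially trained critic rather than a min over two critics.
    actor_loss = -q(critic1, s, torch.tanh(actor(s))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```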

For the comparison of algorithm pseudocodes you mentioned, I would like to point out that line 3 of CQL is actually quite different from line 4 of ATAC. CQL uses Eq 4 in the CQL paper to update the critic, whereas the closest analogy of ATAC's critic objective would be Eq 2 in the CQL paper if we take \mu there to be \pi, though in ATAC we don't estimate the behavior policy. This leads to a large difference in both theory and implementation. This objective difference (together with the difference in the stepsize and single critic choice above) leads to the difference between minmax vs. maxmin we discussed in theory, so you can expect that they learn different critics.

For the code implementation, the log-sum-exp term in CQL requires summing over the action space, which is intractable in high dimensional problems (in their implementation, it heuristically takes samples from the learner policy and a bunch of other random actions to approximate the sum). On the other hand, ATAC just takes samples from the learner and the data to directly compute the critic objective, which is straightforward to implement.
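
To make that implementation point concrete, here is a toy snippet contrasting the two penalty terms (hypothetical shapes and networks, just for illustration; not either paper's code):

```python
import torch
import torch.nn as nn

# Toy illustration of the two critic penalty terms (hypothetical shapes/networks).
torch.manual_seed(0)
state_dim, act_dim, batch = 4, 2, 32
f = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def q(s, a):
    return f(torch.cat([s, a], dim=-1)).squeeze(-1)

s = torch.randn(batch, state_dim)               # states from the offline dataset
a_data = torch.randn(batch, act_dim)            # actions from the offline dataset
a_pi = torch.tanh(torch.randn(batch, act_dim))  # actions sampled from the learner policy

# ATAC-style relative pessimism: only learner samples and data samples are needed.
atac_term = (q(s, a_pi) - q(s, a_data)).mean()

# CQL(H)-style log-sum-exp over actions, approximated here by sampling a handful of
# actions per state (a rough caricature of the heuristic used in their code).
n = 10
a_rand = torch.rand(batch, n, act_dim) * 2 - 1  # uniform actions in [-1, 1]
s_rep = s.unsqueeze(1).expand(-1, n, -1).reshape(-1, state_dim)
q_rand = q(s_rep, a_rand.reshape(-1, act_dim)).reshape(batch, n)
cql_term = torch.logsumexp(q_rand, dim=1).mean() - q(s, a_data).mean()

print(float(atac_term), float(cql_term))
```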

@emailweixu
Author

@chinganc Thanks for the detailed explanation. It's really helpful.

> For the comparison of algorithm pseudocodes you mentioned, I would like to point out that line 3 of CQL is actually quite different from line 4 of ATAC. CQL uses Eq 4 in the CQL paper to update the critic, whereas the closest analogy of ATAC's critic objective would be Eq 2 in the CQL paper if we take \mu there to be \pi,

I agree that (4) in the CQL paper is different from ATAC. But (3) in CQL seems quite similar to ATAC's objective, except for the order of max and min.

> though in ATAC we don't estimate the behavior policy.

My understanding is that CQL does not estimate the behavior policy either?

> This leads to a large difference in both theory and implementation.
> This objective difference (together with the difference in the stepsize and single critic choice above) leads to the difference between minmax vs. maxmin we discussed in theory, so you can expect that they learn different critics.

> For the code implementation, the log-sum-exp term in CQL requires summing over the action space, which is intractable in high dimensional problems (in their implementation, it heuristically takes samples from the learner policy and a bunch of other random actions to approximate the sum).

I agree with you that the log-sum-exp in (4) of CQL is problematic.

> On the other hand, ATAC just takes samples from the learner and the data to directly compute the critic objective, which is straightforward to implement.

It seems that (3) of CQL can be implemented in a similar way.

@chinganc
Collaborator

Sorry I missed the previous response.

RE: (3) vs ATAC. Yes, they look very similar, as we pointed out in the paper too, but the order of min and max matters a lot here. If we exchange the order (e.g., using (3) in CQL), we no longer get the important robust policy improvement property of ATAC. Intuitively, this is similar to running a GAN with a weak, slowly updated discriminator, which would not enable learning a good generator of the data. (The analogy here is that after exchanging the order, the policy would no longer prefer to imitate the data, which is the source of robust policy improvement.)
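
Schematically (a rough sketch of the structure only, not the exact equations of either paper):

$$
\text{ATAC:}\quad \max_{\pi}\ \min_{f}\ \mathbb{E}_{\mu}\big[f(s,\pi)-f(s,a)\big] + \beta\,(\text{Bellman error}),
\qquad
\text{CQL-style (3):}\quad \min_{Q}\ \max_{\mu'}\ \alpha\big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu'}[Q(s,a)]-\mathbb{E}_{(s,a)\sim\mathcal{D}}[Q(s,a)]\big) + (\text{Bellman error}).
$$

In the first case the policy is the outer player and has to do well against the worst-case compatible critic, so with $\beta=0$ its best response is to stay on the data actions (the imitation anchor); in the second case the critic is the outer player and the inner maximization is only a device for making $Q$ conservative, so that anchor is lost.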

CQL doesn't estimate the behavior policy either; I made a mistake in typing the previous response.
