
Difference with Conservative-Q Learning (CQL) #3

Closed
emailweixu opened this issue Jul 27, 2022 · 5 comments

@emailweixu

The relative pessimism objective (Eqs. (1) and (2)) proposed in ATAC seems exactly the same as the learning objective (3) in [1], and Algorithm 2 in ATAC looks remarkably similar to Algorithm 1 in [1], apart from some implementation caveats. Could you explain what the major difference between ATAC and CQL is?

[1] Kumar et al. Conservative Q-Learning for Offline Reinforcement Learning. NeurIPS 2020.

@tengyangxie

Hi @emailweixu, thanks for reaching out!

One notable difference is that CQL includes an inner maximization over a policy in its critic objective, whereas ATAC does not. This difference enables ATAC to recover imitation learning (when the Bellman term is turned off) and further leads to the robust policy improvement property, which CQL does not have.
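
Schematically (a rough sketch in notation loosely following the paper, not the exact equations), ATAC solves

$$
\hat{\pi} \in \operatorname*{argmax}_{\pi \in \Pi} \; \mathcal{L}_{\mu}(\pi, f^{\pi}),
\qquad
f^{\pi} \in \operatorname*{argmin}_{f \in \mathcal{F}} \; \mathcal{L}_{\mu}(\pi, f) + \beta \, \mathcal{E}_{\mu}(\pi, f),
$$

where $\mathcal{L}_{\mu}(\pi, f) = \mathbb{E}_{\mu}[f(s, \pi) - f(s, a)]$ is the relative pessimism term and $\mathcal{E}_{\mu}(\pi, f)$ is a Bellman error term. There is no inner maximization over a separate policy inside the critic objective, and with $\beta = 0$ the Bellman term disappears, so the remaining adversarial game pushes $\pi$ toward the data actions, i.e., imitation learning.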

For a detailed comparison between ATAC and CQL, please refer to Appendix D of our paper.

@emailweixu
Author

@tengyangxie, thanks for the response. Appendix D is very helpful in explaining some theoretical differences such as minmax vs. maxmin. But as a practitioner, I am more interested in the differences between the practical algorithms (i.e., Algorithm 2 in ATAC and Algorithm 1 in CQL). As shown in the attached pictures, steps 4 and 5 of ATAC correspond to steps 3 and 4 of CQL, and it seems to me that they are doing the same thing. I understand that there are some differences in implementation details, such as using only $f_1$ for the actor update, different learning rates, and the use of the DQRA loss.
[Screenshots: Algorithm 2 from the ATAC paper and Algorithm 1 from the CQL paper]

@chinganc
Collaborator

chinganc commented Aug 2, 2022

Hi @emailweixu, you're right about those implementation differences. Among them, using a two-timescale stepsize and a single critic in the actor update is crucial to achieving proper adversarial training.
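
For concreteness, here is a minimal PyTorch-style sketch of the two-timescale / single-critic structure (toy networks and placeholder losses with the Bellman term omitted; this is not the released ATAC implementation):

```python
import torch
import torch.nn as nn

# Toy sketch: two-timescale stepsizes and a single critic in the actor update.
torch.manual_seed(0)
state_dim, act_dim = 4, 2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

critic1, critic2 = mlp(state_dim + act_dim, 1), mlp(state_dim + act_dim, 1)
actor = mlp(state_dim, act_dim)

# Two-timescale: the critic (adversary) uses a much larger stepsize than the actor.
critic_opt = torch.optim.Adam(list(critic1.parameters()) + list(critic2.parameters()), lr=5e-4)
actor_opt = torch.optim.Adam(actor.parameters(), lr=5e-7)

def q(critic, s, a):
    return critic(torch.cat([s, a], dim=-1)).squeeze(-1)

for _ in range(3):  # stand-in for minibatch iterations over the offline dataset
    s = torch.randn(32, state_dim)        # states from the dataset
    a_data = torch.randn(32, act_dim)     # actions from the dataset
    a_pi = torch.tanh(actor(s)).detach()  # actions from the learner policy

    # Critic step (fast): both critics minimize the relative-pessimism loss
    # (the Bellman error term is omitted in this toy sketch).
    critic_loss = sum((q(c, s, a_pi) - q(c, s, a_data)).mean() for c in (critic1, critic2))
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor step (slow): only critic1 appears, so the actor plays against a single
    # adversarially trained critic rather than a min over two critics.
    actor_loss = -q(critic1, s, torch.tanh(actor(s))).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```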

For the comparison of algorithm pseudocodes you mentioned, I would like to point out that line 3 of CQL is actually quite different from line 4 of ATAC. CQL uses Eq 4 in the CQL paper to update the critic, whereas the closest analogy of ATAC's critic objective would be Eq 2 in the CQL paper if we take \mu there to be \pi, though in ATAC we don't estimate the behavior policy. This leads to a large difference in both theory and implementation. This objective difference (together with the difference in the stepsize and single critic choice above) leads to the difference between minmax vs. maxmin we discussed in theory, so you can expect that they learn different critics.

For the code implementation, the log-sum-exp term in CQL requires summing over the action space, which is intractable in high dimensional problems (in their implementation, it heuristically takes samples from the learner policy and a bunch of other random actions to approximate the sum). On the other hand, ATAC just takes samples from the learner and the data to directly compute the critic objective, which is straightforward to implement.
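
To make that implementation point concrete, here is a toy snippet contrasting the two penalty terms (hypothetical shapes and networks, just for illustration; not either paper's code):

```python
import torch
import torch.nn as nn

# Toy illustration of the two critic penalty terms (hypothetical shapes/networks).
torch.manual_seed(0)
state_dim, act_dim, batch = 4, 2, 32
f = nn.Sequential(nn.Linear(state_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))

def q(s, a):
    return f(torch.cat([s, a], dim=-1)).squeeze(-1)

s = torch.randn(batch, state_dim)               # states from the offline dataset
a_data = torch.randn(batch, act_dim)            # actions from the offline dataset
a_pi = torch.tanh(torch.randn(batch, act_dim))  # actions sampled from the learner policy

# ATAC-style relative pessimism: only learner samples and data samples are needed.
atac_term = (q(s, a_pi) - q(s, a_data)).mean()

# CQL(H)-style log-sum-exp over actions, approximated here by sampling a handful of
# actions per state (a rough caricature of the heuristic used in their code).
n = 10
a_rand = torch.rand(batch, n, act_dim) * 2 - 1  # uniform actions in [-1, 1]
s_rep = s.unsqueeze(1).expand(-1, n, -1).reshape(-1, state_dim)
q_rand = q(s_rep, a_rand.reshape(-1, act_dim)).reshape(batch, n)
cql_term = torch.logsumexp(q_rand, dim=1).mean() - q(s, a_data).mean()

print(float(atac_term), float(cql_term))
```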

@emailweixu
Author

@chinganc Thanks for the detailed explanation. It's really helpful.

> For the comparison of algorithm pseudocodes you mentioned, I would like to point out that line 3 of CQL is actually quite different from line 4 of ATAC. CQL uses Eq 4 in the CQL paper to update the critic, whereas the closest analogy of ATAC's critic objective would be Eq 2 in the CQL paper if we take \mu there to be \pi,

I agree that (4) in the CQL paper is different from ATAC. But (3) in CQL seems quite similar to ATAC's objective, except for the order of max and min.

> though in ATAC we don't estimate the behavior policy.

My understanding is that CQL does not estimate the behavior policy either?

> This leads to a large difference in both theory and implementation.
> This objective difference (together with the difference in the stepsize and single critic choice above) leads to the difference between minmax vs. maxmin we discussed in theory, so you can expect that they learn different critics.

> For the code implementation, the log-sum-exp term in CQL requires summing over the action space, which is intractable in high dimensional problems (in their implementation, it heuristically takes samples from the learner policy and a bunch of other random actions to approximate the sum).

I agree with you that the log-sum-exp in (4) of CQL is problematic.

> On the other hand, ATAC just takes samples from the learner and the data to directly compute the critic objective, which is straightforward to implement.

It seems that (3) of CQL can be implemented in a similar way.

@chinganc
Collaborator

Sorry I missed the previous response.

RE: (3) vs ATAC. Yes, they look very similar, as we pointed out in the paper too, but the order of min and max matters a lot here. If we exchange the order (e.g., using (3) in CQL), we no longer get the important robust policy improvement property of ATAC. Intuitively, this is similar to running a GAN with a weak, slowly updated discriminator, which would not enable learning a good generator of the data. (The analogy here is that after exchanging the order, the policy would no longer prefer to imitate the data, which is the source of robust policy improvement.)
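
Schematically (a rough sketch of the structure only, not the exact equations of either paper):

$$
\text{ATAC:}\quad \max_{\pi}\ \min_{f}\ \mathbb{E}_{\mu}\big[f(s,\pi)-f(s,a)\big] + \beta\,(\text{Bellman error}),
\qquad
\text{CQL-style (3):}\quad \min_{Q}\ \max_{\mu'}\ \alpha\big(\mathbb{E}_{s\sim\mathcal{D},\,a\sim\mu'}[Q(s,a)]-\mathbb{E}_{(s,a)\sim\mathcal{D}}[Q(s,a)]\big) + (\text{Bellman error}).
$$

In the first case the policy is the outer player and has to do well against the worst-case compatible critic, so with $\beta=0$ its best response is to stay on the data actions (the imitation anchor); in the second case the critic is the outer player and the inner maximization is only a device for making $Q$ conservative, so that anchor is lost.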

CQL doesn't estimate the behavior policy either; I made a mistake in typing the previous response.
