
maximization bias #29

Open
mikelty opened this issue Aug 17, 2020 · 1 comment

Comments


mikelty commented Aug 17, 2020

Hello, I'm not sure whether this is actually an issue, but I've been looking at your implementation for half an hour, and I think there might be a maximization bias in it. Specifically, you use the same set of experience to update both Q-tables, whereas the paper says that two independently trained Q-tables benefit training.
I've tested this idea on a similar code base and its owner has agreed with my view so far. I've also opened a Stack Overflow question here. Could you comment on this? I plan to test this implementation as well.
Thanks in advance.
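
For reference, here is a minimal sketch of the tabular double Q-learning update (van Hasselt, 2010) that I have in mind; the names and the random 50/50 split are illustrative, not taken from this repo:

```python
import random
from collections import defaultdict

def double_q_update(q_a, q_b, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    # Each transition updates only one of the two tables, chosen at random,
    # so the tables are trained on (roughly) independent sets of experience.
    if random.random() < 0.5:
        q_upd, q_eval = q_a, q_b
    else:
        q_upd, q_eval = q_b, q_a
    # Pick the greedy next action under the table being updated...
    a_star = max(actions, key=lambda act: q_upd[(s_next, act)])
    # ...but evaluate its value with the *other* table, which decorrelates
    # the action selection from the value estimate and removes the bias.
    target = r + gamma * q_eval[(s_next, a_star)]
    q_upd[(s, a)] += alpha * (target - q_upd[(s, a)])

# Usage: q_a = defaultdict(float); q_b = defaultdict(float)
```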

haarnoja (Owner) commented

Hi,

We indeed use the same data to update both of the Q-functions. I haven't tested splitting the data and using different sets for the different Q's, but my guess is that it wouldn't make much difference in terms of maximization bias. My reasoning is that we evaluate the Q-functions (both for the TD target and the policy target) at actions that are not part of the data but are instead sampled from the current policy. For those actions, given a seen state, the Q-values are less correlated, since the Q's were never trained on those particular actions, which reduces the maximization bias. We've observed that this can make a big difference in practice, especially in higher-dimensional tasks.
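
To make that concrete, here is a rough sketch of how the shared batch and the policy-sampled actions enter the TD target. This is an illustration rather than the exact code in this repo; `q1_target`, `q2_target`, `policy_sample`, and `log_prob` are hypothetical stand-ins, and `alpha` is the entropy temperature:

```python
import numpy as np

def soft_td_targets(batch, q1_target, q2_target, policy_sample, log_prob,
                    gamma=0.99, alpha=0.2):
    s, a, r, s_next, done = batch      # the same replay batch feeds both Q updates
    a_next = policy_sample(s_next)     # fresh actions from the *current* policy,
                                       # not actions stored in the replay buffer
    # Both Q's are evaluated at these policy-sampled actions, where their
    # errors are less correlated, and the minimum is taken to be conservative.
    q_min = np.minimum(q1_target(s_next, a_next),
                       q2_target(s_next, a_next))
    # Entropy-regularized (soft) value of s'; both Q's regress to this target.
    v_next = q_min - alpha * log_prob(s_next, a_next)
    return r + gamma * (1.0 - done) * v_next
```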

I hope this answers your question!

Tuomas
