
action distribution for estimating V #8

Open
immars opened this issue Feb 26, 2018 · 8 comments
@immars commented Feb 26, 2018

Hi,
First of all, thanks for this inspiring work!

In

next_value = tf.reduce_logsumexp(q_value_targets, axis=1)

it seems to me that actions are sampled from a uniform distribution when estimating V_{soft}.

In Sec. 3.2 of your original paper, it is stated that:

For q_{a'} we have more options. A convenient choice is a uniform distribution. However, this choice can scale poorly to high dimensions. A better choice is to use the current policy, which produces an unbiased estimate of the soft value as can be confirmed by substitution.

Have you experimented with sampling from the current policy to estimate V? Or, how well does the uniform distribution do in practice, especially in higher-dimensional cases?
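
For reference, here is roughly how I read that uniform-sample estimate, written out as a standalone NumPy sketch (this is my own illustration, not the repo's code; the alpha = 1 convention, the [-1, 1]^d action bounds, and the q_fn / n_particles names are assumptions on my part):

import numpy as np

def soft_value_uniform(q_fn, state, action_dim, n_particles=16, rng=None):
    # Estimate V_soft(s) = log E_{a ~ Unif}[ exp(Q(s, a)) / p_unif(a) ]
    # with a logsumexp over uniformly sampled action particles.
    rng = np.random.default_rng() if rng is None else rng
    actions = rng.uniform(-1.0, 1.0, size=(n_particles, action_dim))
    q_values = np.array([q_fn(state, a) for a in actions])
    # Numerically stable logsumexp over the particles.
    m = q_values.max()
    log_sum = m + np.log(np.sum(np.exp(q_values - m)))
    # Importance-sampling corrections for the uniform proposal: average over
    # the particles (- log M) and divide by the uniform density (+ d * log 2).
    return log_sum - np.log(n_particles) + action_dim * np.log(2.0)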

thanks,

@haarnoja (Owner) commented Mar 1, 2018

Thanks for your question. We use uniform sampling because there is no direct way to evaluate the log-probabilities of actions under SVGD policies, which would be needed for the importance weights. Using some other tractable policy representation could fix this issue.

You're right that uniform samples do not necessarily scale well to higher dimensions. I haven't really studied how accurate the uniform value estimator is, but from my experience, using more samples to estimate the value improves the performance only marginally.
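
For concreteness, the policy-based estimator from the paper would need the policy's log-density for the importance weights, roughly like this (an illustrative sketch only, not code from this repo; log_pi_fn and sample_pi_fn are hypothetical names):

import numpy as np

def soft_value_from_policy(q_fn, log_pi_fn, sample_pi_fn, state, n_particles=16):
    # V_soft(s) ~= log (1/M) sum_i exp( Q(s, a_i) - log pi(a_i | s) ),  a_i ~ pi(.|s).
    actions = [sample_pi_fn(state) for _ in range(n_particles)]
    log_w = np.array([q_fn(state, a) - log_pi_fn(state, a) for a in actions])
    # Numerically stable log-mean-exp of the importance weights.
    m = log_w.max()
    return m + np.log(np.mean(np.exp(log_w - m)))

It is exactly the log_pi_fn term that SVGD policies do not give us directly.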

@immars (Author) commented Mar 5, 2018

OK, I see.
Thanks for the reply!

@ghost commented Mar 16, 2018

I could be totally misunderstanding, but doesn't appendix C.2 talk about how one can use the sampling network for q_{a'} and derive the corresponding densities, so long as the Jacobian ∂a'/∂ε' is non-singular?
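
(To spell out what I mean, here is a rough NumPy sketch of that change-of-variables computation; the notation and function names are mine, not from the paper or the repo:)

import numpy as np

def log_prob_via_change_of_variables(eps, jacobian):
    # For a' = f(eps'; s) with eps' ~ N(0, I) and f invertible in eps',
    # log pi(a' | s) = log p(eps') - log |det d a' / d eps'|.
    d = eps.shape[0]
    log_p_eps = -0.5 * (d * np.log(2.0 * np.pi) + float(eps @ eps))
    sign, logabsdet = np.linalg.slogdet(jacobian)
    if sign == 0:
        raise ValueError("Jacobian is singular: f is not invertible at this point.")
    return log_p_eps - logabsdet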

@haarnoja (Owner) commented

I see, that's indeed confusing. You are right in that we could compute the log probs if the sampling network is invertible. My feeling is that, in our case, the network does not remain invertible, and that the log probs we would obtain that way are wrong. We initially experimented with this trick (and that's why we discuss it in the appendix), but in the end, uniform samples worked better. We'll fix this in the next version of the paper, thanks for pointing it out!

@ghost commented Mar 18, 2018

My pleasure! Glad I was sort of on the right track. That's very interesting, especially since singular weight matrices or the choice of activation function are the only things, off the top of my head, that might make a feedforward net non-invertible. I might play around with that.

@SJTUGuofei commented

Also in "softqlearning/softqlearning/algorithms/sql.py"
ys = tf.stop_gradient(self._reward_scale * self._rewards_pl + (
1 - self._terminals_pl) * self._discount * next_value)
I just wonder is it sufficient that only one sample for computing the Expectation in $\hat Q$.
Thanks a lot!

@haarnoja (Owner) commented

Do you mean the expectation over states and actions in Eq. (11)? It is OK, since the corresponding gradient estimator is unbiased, though it can have high variance.
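
To illustrate (a minimal sketch, not the actual training code; the function and argument names are assumptions):

def td_target(reward, terminal, next_value, reward_scale=1.0, discount=0.99):
    # One-sample target y = scale * r + (1 - done) * gamma * V_soft(s').
    # Its expectation over (r, s') equals the true target, so with the target
    # held fixed (as tf.stop_gradient does) the gradient of the squared TD
    # error is unbiased; averaging more samples would only reduce its variance.
    return reward_scale * reward + (1.0 - terminal) * discount * next_value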

@SJTUGuofei commented

I see.
Thank you so much!
