Temperature parameter is not handled properly #5

Closed

haarnoja opened this issue Nov 8, 2017 · 5 comments
haarnoja (Owner) commented Nov 8, 2017

The temperature parameter (alpha) is missing from the TD updates. For example,

v_next = tf.squeeze(tf.reduce_logsumexp(q_next, axis=1)) # N

should have alpha in it, as in Eq. (10) in the paper; the code is correct only if alpha = 1.

As a quick fix to change the temperature, you can set scale_reward = 1 / temperature and alpha = 1, which has an equivalent effect, as discussed on page 2 of the paper.
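For reference, a minimal sketch of what the corrected line could look like with an explicit temperature (alpha is a hypothetical scalar here; tensor names follow the snippet above); with alpha = 1 it reduces to the line as written:

alpha = 0.1  # hypothetical temperature value
# Soft value per Eq. (10): V = alpha * logsumexp(Q / alpha) over actions
v_next = tf.squeeze(alpha * tf.reduce_logsumexp(q_next / alpha, axis=1))  # N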

haarnoja (Owner, Author) commented:

The latest refactor removes the temperature coefficient (alpha). To adjust the temperature, you can change reward_scale instead.
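A toy numerical check of that equivalence (illustrative NumPy, not code from this repo): with temperature alpha the soft value is alpha * logsumexp(Q / alpha) and the policy is softmax(Q / alpha); under rewards scaled by 1 / alpha and temperature 1, the corresponding Q-function is the original Q divided by alpha, so the induced policy is identical and the value is only rescaled.

import numpy as np

alpha = 0.2                      # hypothetical temperature
q = np.array([1.0, 2.0, 3.0])    # toy Q-values for three actions

def boltzmann(x):
    # Softmax policy over a 1-D array of (temperature-scaled) Q-values.
    z = np.exp(x - x.max())
    return z / z.sum()

v_temp = alpha * np.log(np.sum(np.exp(q / alpha)))   # soft value with explicit temperature
q_scaled = q / alpha                                 # Q-function under rewards scaled by 1/alpha
v_scaled = np.log(np.sum(np.exp(q_scaled)))          # soft value with temperature fixed at 1

print(np.allclose(v_temp, alpha * v_scaled))                   # True: values differ only by the factor alpha
print(np.allclose(boltzmann(q / alpha), boltzmann(q_scaled)))  # True: identical Boltzmann policy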

immars commented Mar 21, 2018

Sorry to re-raise this issue.
I have implemented a version of soft Q-learning and find it convenient to have a separate alpha. I think alpha effectively acts as the entropy coefficient in policy gradient methods, which can be annealed without affecting the value iteration of the critic.

haarnoja (Owner, Author) commented:

In my experience, annealing alpha to zero was a little problematic because of how it enters the value function (V = alpha * log sum exp(Q / alpha)): a naive implementation of this obviously fails as alpha -> 0 (see the sketch below). How did you fix this? If you'd like to share your code, I'd be happy to merge your PR :).
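To illustrate (a toy NumPy sketch, not code from this repo): the naive transcription of the value overflows once alpha is small relative to max(Q), while the usual max-subtraction rewrite stays finite but simply collapses to a hard max, so annealing all the way to zero still degenerates.

import numpy as np

def soft_value_naive(q, alpha):
    # Direct transcription of V = alpha * log sum exp(Q / alpha);
    # exp(Q / alpha) overflows once alpha is small relative to max(Q).
    return alpha * np.log(np.sum(np.exp(np.asarray(q) / alpha)))

def soft_value_stable(q, alpha):
    # Max-subtraction rewrite: V = max(Q) + alpha * log sum exp((Q - max(Q)) / alpha).
    # Stays finite, but tends to max(Q) as alpha -> 0 (the entropy bonus vanishes).
    q = np.asarray(q, dtype=np.float64)
    q_max = q.max()
    return q_max + alpha * np.log(np.sum(np.exp((q - q_max) / alpha)))

q = [1.0, 2.0, 3.0]
print(soft_value_naive(q, 1.0))    # ~3.41
print(soft_value_naive(q, 1e-3))   # inf (overflow warning): the naive form blows up
print(soft_value_stable(q, 1e-3))  # ~3.0: finite, but effectively a hard max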

immars commented Apr 2, 2018

That's right; in my experiments, if alpha is annealed below a threshold (around 0.08 or so), training becomes numerically unstable. But it can take a much larger value at the beginning of training to encourage exploration.

The code is now in a private branch of hobotrl. The code structure is quite different from this repo, I think, and would be hard to merge, so I created a gist with the relevant code pieces.
alpha_exploration can be an object of a subclass of Python float whose value changes each time it is evaluated as a float (see the sketch below).
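Roughly the idea, as a hypothetical sketch (the class name and the linear schedule are made up here; this is not the actual hobotrl code):

class AnnealedAlpha(float):
    # Float-like temperature whose float(...) value follows an anneal schedule.
    # Hypothetical sketch, not the hobotrl implementation.
    def __new__(cls, start, end, decay_steps):
        obj = super().__new__(cls, start)       # the static float value is the starting alpha
        obj.start, obj.end, obj.decay_steps = start, end, decay_steps
        obj.step = 0                            # advanced by the training loop
        return obj

    def __float__(self):
        # Linear anneal from start to end; picked up whenever code calls float(alpha).
        frac = min(self.step / float(self.decay_steps), 1.0)
        return self.start + frac * (self.end - self.start)

alpha = AnnealedAlpha(start=1.0, end=0.1, decay_steps=100000)
alpha.step = 50000
print(float(alpha))    # 0.55, halfway through the anneal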

immars commented Apr 2, 2018

I've pushed to hobotrl for your reference.
