
Potential errors in the loss funktion #12

Closed

stokasto opened this issue Jun 20, 2016 · 5 comments

Comments

@stokasto

Hey,
first off: great work. I just re-implemented the paper myself using TensorFlow, and your code provided great "side-information" for doing so :).
In the process I also realized that there may be two subtle bugs in your implementation (although I have never used Chainer before, so I might be misunderstanding things):

  1. You use the log probabilities directly when computing the loss for the actor (a3c.py#L103):
    pi_loss -= log_prob * float(advantage.data)
    I believe this is incorrect, as you should multiply log_prob with a one-hot encoding of the action (since only one of the actions was selected).
  2. To compute the advantage you use the form (a3c.py#L97):
    advantage = R - v
    where v comes from self.past_values[i], which, in turn, is the output of the value network. As I wrote, I am no expert regarding Chainer, but you need to make sure that no gradient flows through v here (the value function should only be updated according to the v_loss in your code). In Theano/TensorFlow this would be handled with a disconnected_grad() or stop_gradient() operation, respectively; see the sketch after this list.
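
A rough sketch of the stop_gradient pattern I have in mind (toy values and names, not taken from a3c.py; I am using TensorFlow's GradientTape here just for illustration):

    import tensorflow as tf

    R = tf.constant(1.5)          # bootstrapped return for one time step (already a constant)
    v = tf.Variable(0.7)          # output of the value head
    log_prob = tf.Variable(-1.2)  # log pi(a_t | s_t) of the sampled action

    with tf.GradientTape() as tape:
        advantage = R - tf.stop_gradient(v)  # no gradient flows into v from the policy term
        pi_loss = -log_prob * advantage
        v_loss = 0.5 * tf.square(R - v)      # the value head is trained only through this term
        loss = pi_loss + v_loss

    d_log_prob, d_v = tape.gradient(loss, [log_prob, v])
    # d_v comes only from v_loss; pi_loss contributes nothing because of stop_gradient.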

I will push my implementation to GitHub sometime this week, as soon as I have tested it more thoroughly, and will then reference it here for you to compare.

@stokasto
Author

And I win for misspelling function in the issue title ;).

@etienne87

etienne87 commented Jun 21, 2016

@stokasto, for 1. that is probably not the case:

Look closely at policy_output.py#L54:

    def sampled_actions_log_probs(self):
        return F.select_item(
            self.log_probs,
            chainer.Variable(np.asarray(self.action_indices, dtype=np.int32)))

I assume this function is equivalent to multiplying by a one-hot encoded target.
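
For example, here is a quick numpy check of that equivalence (my own toy example, not code from this repository):

    import numpy as np

    log_probs = np.log(np.array([[0.2, 0.5, 0.3],
                                 [0.6, 0.1, 0.3]]))
    action_indices = np.array([1, 2])

    # select_item-style gather: one log probability per sampled action
    selected = log_probs[np.arange(len(action_indices)), action_indices]

    # equivalent one-hot formulation
    one_hot = np.eye(log_probs.shape[1])[action_indices]
    assert np.allclose(selected, (log_probs * one_hot).sum(axis=1))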

@stokasto
Author

@etienne87 Ah yes, that is fine then. As I wrote, I did not study the code very carefully during my re-implementation, but thought I would bring these two issues up since, in case they were real, they would be hard to detect and could lurk in the code-base for a while.
Thank you for falsifying the first issue.

@muupan
Owner

muupan commented Jun 23, 2016

Thank you for your comments.

  1. As @etienne87 wrote, F.select_item returns only the log probability of the selected action (which is stored in self.action_indices of SoftmaxPolicyOutput).
  2. float(advantage.data) is what makes sure the gradients won't flow through v. advantage is a chainer.Variable, which can back-propagate gradients, but float(advantage.data) is literally just a Python float, and no gradients will be back-propagated through it. A toy illustration follows below.
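
To illustrate point 2 (a toy sketch, not the actual code path in a3c.py; the values are made up):

    import chainer
    import numpy as np

    R = 1.5                                                   # pretend this is the return
    v = chainer.Variable(np.array([0.7], dtype=np.float32))   # pretend this is the value head output
    advantage = R - v              # still a chainer.Variable, connected to v

    coeff = float(advantage.data)  # plain Python float, detached from the computational graph
    # Multiplying log_prob by coeff therefore cannot send any gradient back into v;
    # v is updated only through v_loss.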

@stokasto
Author

Hey, great, that explains it. I was wondering how the implementation could have converged to a reasonable value function if that error were indeed there.
As I said, I am not experienced with Chainer and so did not know about this little interaction. My results also look fairly similar to the graphs you have posted. I'll upload mine later and link to your implementation.
Feel free to close this issue.

@muupan closed this as completed Jun 24, 2016