
Potential errors in the loss funktion #12

Closed

stokasto opened this issue Jun 20, 2016 · 5 comments

Comments

@stokasto

Hey,
first off: great work. I just re-implemented the paper myself using TensorFlow, and your code provided great "side-information" for doing so :).
In the process I also realized that there may be two subtle bugs in your implementation (although I have never used Chainer before, so I might be misunderstanding things):

  1. You use the log probabilities directly when computing the loss for the actor (a3c.py#L103):
    pi_loss -= log_prob * float(advantage.data)
    I believe this is incorrect, as you should multiply log_prob with a one-hot encoding of the action (since only one of the actions was selected).
  2. To compute the advantage you use the form (a3c.py#L97):
    advantage = R - v
    where v comes from self.past_values[i], which, in turn, is the output of the value network. As I wrote, I am no expert regarding Chainer, but you need to make sure that no gradient flows through v here (the value function should only be updated according to the v_loss in your code). In Theano/TensorFlow this would be handled with a disconnected_grad() or stop_gradient() operation, respectively; see the sketch after this list.
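
A rough sketch of the stop_gradient pattern I have in mind (toy values and names, not taken from a3c.py; I am using TensorFlow's GradientTape here just for illustration):

    import tensorflow as tf

    R = tf.constant(1.5)          # bootstrapped return for one time step (already a constant)
    v = tf.Variable(0.7)          # output of the value head
    log_prob = tf.Variable(-1.2)  # log pi(a_t | s_t) of the sampled action

    with tf.GradientTape() as tape:
        advantage = R - tf.stop_gradient(v)  # no gradient flows into v from the policy term
        pi_loss = -log_prob * advantage
        v_loss = 0.5 * tf.square(R - v)      # the value head is trained only through this term
        loss = pi_loss + v_loss

    d_log_prob, d_v = tape.gradient(loss, [log_prob, v])
    # d_v comes only from v_loss; pi_loss contributes nothing because of stop_gradient.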

I will push my implementation to GitHub sometime this week, as soon as I have tested it more thoroughly, and will then reference it here for you to compare.

@stokasto
Author

And I win for misspelling function in the issue title ;).

@etienne87

etienne87 commented Jun 21, 2016

@stokasto, for 1. that is probably not the case:

Look closely at policy_output.py#L54:

    def sampled_actions_log_probs(self):
        return F.select_item(
            self.log_probs,
            chainer.Variable(np.asarray(self.action_indices, dtype=np.int32)))

I assume this function is equivalent to multiplying by a one-hot encoded target.
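
For example, here is a quick numpy check of that equivalence (my own toy example, not code from this repository):

    import numpy as np

    log_probs = np.log(np.array([[0.2, 0.5, 0.3],
                                 [0.6, 0.1, 0.3]]))
    action_indices = np.array([1, 2])

    # select_item-style gather: one log probability per sampled action
    selected = log_probs[np.arange(len(action_indices)), action_indices]

    # equivalent one-hot formulation
    one_hot = np.eye(log_probs.shape[1])[action_indices]
    assert np.allclose(selected, (log_probs * one_hot).sum(axis=1))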

@stokasto
Author

@etienne87 Ah yes, that is fine then. As I wrote, I did not study the code very carefully during my re-implementation, but thought I would bring these two issues up since, in case they were real, they would be hard to detect and could lurk in the code-base for a while.
Thank you for falsifying the first issue.

@muupan
Owner

muupan commented Jun 23, 2016

Thank you for your comments.

  1. As @etienne87 wrote, F.select_item returns only the log probability of the selected action (which is stored in self.action_indices of SoftmaxPolicyOutput).
  2. float(advantage.data) is what makes sure the gradients won't flow through v. advantage is a chainer.Variable, which can back-propagate gradients, but float(advantage.data) is literally just a Python float, and no gradients will be back-propagated through it. A toy illustration follows below.
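
To illustrate point 2 (a toy sketch, not the actual code path in a3c.py; the values are made up):

    import chainer
    import numpy as np

    R = 1.5                                                   # pretend this is the return
    v = chainer.Variable(np.array([0.7], dtype=np.float32))   # pretend this is the value head output
    advantage = R - v              # still a chainer.Variable, connected to v

    coeff = float(advantage.data)  # plain Python float, detached from the computational graph
    # Multiplying log_prob by coeff therefore cannot send any gradient back into v;
    # v is updated only through v_loss.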

@stokasto
Author

Hey, great, that explains it. I was wondering how the implementation could have converged to a reasonable value function if that error were indeed there.
As I said, I am not experienced with Chainer and so did not know about this little interaction. My results also look fairly similar to the graphs you have posted. I'll upload mine later and link to your implementation.
Feel free to close this issue.

@muupan closed this as completed Jun 24, 2016