I do understand backpropagation in policy gradient networks, but I am not sure how your code works with Keras's auto-differentiation.
That is, how you transform it into a supervised learning problem.
For example, the code below:
Y = self.probs + self.learning_rate * np.squeeze(np.vstack([gradients]))
Why is Y not a one-hot vector for the action taken?
As I understand it, you compute the gradient as if the action taken were correct (i.e. as if Y were a one-hot vector), then multiply it by the reward for the corresponding time-step, and during training you feed the result to Keras as the correction.
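To check my reading, here is a minimal sketch of how I think Y ends up being constructed for a single time-step (the names `one_hot`, `probs`, and `discounted_reward` are my own placeholders, not necessarily the variables in pg.py):

```python
import numpy as np

# Toy numbers for one time-step with 2 actions.
probs = np.array([0.6, 0.4])      # network output pi(a|s)
action = 1                        # action actually sampled
discounted_reward = 2.0           # discounted return for this step
learning_rate = 0.01

# "Gradient assuming the action taken was correct": one-hot minus the
# predicted probabilities, scaled by the discounted reward.
one_hot = np.zeros_like(probs)
one_hot[action] = 1.0
gradient = (one_hot - probs) * discounted_reward

# Pseudo-label fed to Keras: the old probabilities nudged in that direction.
Y = probs + learning_rate * gradient
print(Y)  # [0.588, 0.412] -- clearly not a one-hot vector
```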
I think one could instead multiply the rewards by the one-hot vector and feed that in directly.
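For example, something like this sketch of what I mean, assuming a softmax output trained with a categorical cross-entropy loss (`make_targets` and the toy numbers are just my own illustration):

```python
import numpy as np

def make_targets(actions, discounted_rewards, action_size):
    """Build REINFORCE-style pseudo-labels: one-hot actions scaled by returns.

    With a cross-entropy loss, -sum(Y * log(probs)) then reduces to
    -reward * log pi(a|s), which is the policy-gradient objective.
    """
    Y = np.zeros((len(actions), action_size), dtype=np.float32)
    Y[np.arange(len(actions)), actions] = discounted_rewards
    return Y

# Toy usage: 3 time-steps, 2 actions.
actions = np.array([1, 0, 1])
discounted_rewards = np.array([2.0, 1.5, -0.5], dtype=np.float32)
print(make_targets(actions, discounted_rewards, action_size=2))
# then something like: model.train_on_batch(states, Y)
```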
If possible, please clarify my doubt. :)
https://github.com/keon/policy-gradient/blob/master/pg.py#L67