Question on the meaning of 'advantage' #11

windsorwho · 2023-09-14T21:58:16Z

Hi,
Thank you for open sourcing the repo. I am reading the code and want to understand how the loss is computed.

It looks like in the final loss,

Line 125 in f0b6ca7

loss = jnp.mean(jnp.maximum(unclipped_loss, clipped_loss))

the 'ratio' is just $p_{\theta}/p_{\theta_{\text{old}}}$, meaning that if I want to compute the loss corresponding to the gradient in Eqn. (3), I only need the variable 'advantages' in

ddpo/ddpo/training/policy_gradient.py

Line 123 in f0b6ca7

unclipped_loss = -advantages * ratio

which is essentially a Gaussian-normalized score (subtract the mean, divide by the standard deviation) of the original reward value?
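To make "Gaussian-normalized score" concrete, here is a minimal sketch of that kind of normalization. The helper name `normalize_rewards`, the `eps` term, and the batch-wide statistics are assumptions for illustration; the repo may compute its statistics differently (for example per prompt).

```python
import jax.numpy as jnp


def normalize_rewards(rewards, eps=1e-8):
    """Hypothetical helper: z-score a batch of scalar rewards."""
    # Subtract the batch mean and divide by the batch standard deviation.
    # eps guards against division by zero when all rewards are equal.
    return (rewards - jnp.mean(rewards)) / (jnp.std(rewards) + eps)


rewards = jnp.array([1.0, 2.0, 3.0, 4.0])
advantages = normalize_rewards(rewards)
```

The result has roughly zero mean and unit standard deviation, and it is plain data: nothing in it depends on the policy parameters.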

I guess this loss would then be non-differentiable if the reward is, say, the JPEG encoding length?

I must be missing something, am I?

Thanks!

windsorwho · 2023-09-15T00:06:56Z

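Never mind, now I see that $\nabla_\theta$ of the ratio gives $\nabla_\theta \log p_\theta$: since $\nabla_\theta (p_\theta / p_{\theta_{\text{old}}}) = (p_\theta / p_{\theta_{\text{old}}}) \nabla_\theta \log p_\theta$, the advantage only enters as a constant weight, so the reward itself does not need to be differentiable.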
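For anyone landing here with the same question, a small self-contained check of this point is sketched below. It is not the repo's code: the 1-D Gaussian "policy", the rounding-based reward (a stand-in for something like JPEG size), and all function names are assumptions made up for the example. It compares the gradient of `-advantages * ratio` with the gradient of `-advantages * log_prob`, both evaluated at $\theta = \theta_{\text{old}}$.

```python
import jax
import jax.numpy as jnp


def log_prob(theta, x):
    # Toy policy (assumption): unit-variance Gaussian with mean theta.
    return -0.5 * (x - theta) ** 2 - 0.5 * jnp.log(2.0 * jnp.pi)


def black_box_reward(x):
    # Stand-in for a non-differentiable reward such as JPEG size:
    # rounding has zero gradient almost everywhere.
    return jnp.round(jnp.abs(x) * 10.0)


def unclipped_loss(theta, theta_old, x, advantages):
    # ratio = p_theta(x) / p_theta_old(x), computed in log space.
    ratio = jnp.exp(log_prob(theta, x) - log_prob(theta_old, x))
    return jnp.mean(-advantages * ratio)


def score_function_loss(theta, x, advantages):
    # Plain policy-gradient surrogate: -advantage * log p_theta(x).
    return jnp.mean(-advantages * log_prob(theta, x))


theta_old = jnp.array(0.3)
x = jnp.array([-1.2, 0.1, 0.8, 2.0])  # samples, treated as fixed data

# Advantages are computed from the samples only, so they are constants
# with respect to theta; no gradient flows through the reward.
rewards = black_box_reward(x)
advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

# Differentiate with respect to the first argument (theta) only.
g_ratio = jax.grad(unclipped_loss)(theta_old, theta_old, x, advantages)
g_logp = jax.grad(score_function_loss)(theta_old, x, advantages)
```

At $\theta = \theta_{\text{old}}$ the ratio equals 1, so `g_ratio` and `g_logp` should agree up to floating-point error, even though the reward here cannot be differentiated in any useful way.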

windsorwho closed this as completed Sep 15, 2023
