What is get_kl() doing in main.py? #2
Comments
KL is used to compute the Hessian (as in the original code).
Hi, I've tested the get_kl() function in various environments, but the return value is always 0. The following is an example of get_kl() outputs in BipedalWalker-v2:
Does this make sense?
@jtoyama4 Hi, have you figured out how get_kl() works?
@pxlong I think what get_kl() does is provide the KL term whose gradient is taken (for the Hessian computation), and the KL-constraining part somehow works together with the ratio in
@jtoyama4, thanks for the quick reply.
@pxlong a simple example: in this case we have something like f(x) = x_0^2 - x^2, so f(x_0) = 0 but f'(x_0) = -2x_0. The function is not identically zero; it just takes the value 0 at one specific point.
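The point above can be checked numerically. A minimal sketch in plain Python, with a central finite difference standing in for autograd:

```python
x0 = 3.0

def f(x):
    # f(x) = x0^2 - x^2 vanishes *at* x = x0, but is not identically zero
    return x0 ** 2 - x ** 2

print(f(x0))  # 0.0 -- the value is zero at this one point

# central finite difference approximating f'(x0);
# analytically f'(x0) = -2 * x0 = -6
eps = 1e-6
fprime = (f(x0 + eps) - f(x0 - eps)) / (2 * eps)
print(fprime)  # approximately -6.0, clearly non-zero
```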
@ikostrikov, thanks for your explanation. To debug it, I added some print statements, as you can see below:
but all the tensors are zero (as shown in the second message of this issue). I am wondering what causes the bad performance on these environments?
Check the hyperparameters from the original implementation (modular rl), and also its estimates of how long convergence takes. The default hyperparameters of this code are tuned specifically for Reacher-v1.
ok, thanks. |
This part is used to compute the Hessian of the KL. At the current parameters the KL itself == 0 and the derivative of the KL == 0, but the Hessian is not. This is the reason why we have to compute a second-order approximation of the KL term: its first-order approximation is equal to zero.
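What this comment describes can be reproduced with PyTorch's double backward. The sketch below is only illustrative (the diagonal-Gaussian KL with the "old" parameters detached mirrors the get_kl() pattern, not the repository's exact code): the KL value and its gradient both vanish at the current parameters, but a Hessian-vector product does not.

```python
import torch

# Current policy parameters (diagonal Gaussian over 3 actions)
mean = torch.zeros(3, requires_grad=True)
log_std = torch.zeros(3, requires_grad=True)

# "Old" parameters: a detached copy of the current ones, as in get_kl()
mean0, log_std0 = mean.detach(), log_std.detach()
std, std0 = log_std.exp(), log_std0.exp()

# KL( old || new ) for diagonal Gaussians, summed over dimensions
kl = (log_std - log_std0
      + (std0.pow(2) + (mean0 - mean).pow(2)) / (2.0 * std.pow(2))
      - 0.5).sum()

# First derivative, keeping the graph so we can differentiate again
grads = torch.autograd.grad(kl, (mean, log_std), create_graph=True)
flat_grad = torch.cat([g.view(-1) for g in grads])

# Hessian-vector product in an arbitrary direction v
v = torch.ones_like(flat_grad)
hvp = torch.autograd.grad(flat_grad @ v, (mean, log_std))
flat_hvp = torch.cat([h.view(-1) for h in hvp])

print(kl.item())                 # 0.0 -- value vanishes
print(flat_grad.norm().item())   # 0.0 -- gradient vanishes too
print(flat_hvp)                  # non-zero: curvature information survives
```

This is exactly why TRPO needs the second-order term: the conjugate-gradient step works with these Hessian-vector products, not with the (identically zero) KL value.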
@pxlong Sorry that it took me so long to fix the bug. It didn't work because default argument values for some functions changed in PyTorch recently.
@pxlong And anyone else, I found the
It still feels non-intuitive to me, but I guess the goal is autodiff / Hessian calculation rather than getting an actual value out of get_kl(). Also, the section "For two univariate normal distributions p and q the above simplifies to" has the math that looks directly related to the code. The paper cites Numerical Optimization, so I guess I have some reading to do :) *edit:
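The closed-form expression being quoted there (the KL divergence between two univariate normals) can be written out directly. A small sketch, assuming p = N(mu1, sigma1^2) and q = N(mu2, sigma2^2); note the value is exactly 0 when the two distributions coincide, which matches what get_kl() returns when the "old" parameters are a detached copy of the current ones:

```python
import math

def kl_univariate_normal(mu1, sigma1, mu2, sigma2):
    """KL(p || q) for p = N(mu1, sigma1^2) and q = N(mu2, sigma2^2)."""
    return (math.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

print(kl_univariate_normal(0.0, 1.0, 0.0, 1.0))  # 0.0 -- identical distributions
print(kl_univariate_normal(0.0, 1.0, 1.0, 1.0))  # 0.5 -- means differ by 1
```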
Hi. Thanks for publishing an implementation of TRPO.
I have a question about get_kl().
I thought what get_kl() is supposed to do is to calculate the KL divergence between the old policy and the new policy, but this get_kl() seems to always return 0.
Also, I do not see the KL-constraining part in the parameter-updating process.
Is this code a modification of TRPO, or do I have some misunderstanding?
Thanks,