How is the derivative wrt. representations computed in a single backward pass? #20
First, thank you for the nice work. If you don't mind, I have a question:
You mention right after defining the upper bound (Eq. 6) that the derivative wrt. the representations, i.e. \nabla_Z L^t, can be computed in a single backward pass for all tasks. But all tasks share the same representation nodes, so how is this possible? Wouldn't the same argument then also apply to the original derivative wrt. the shared parameters?
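[For context, the relation at issue, written out as I understand the paper's upper bound (my paraphrase, in the thread's notation). Gradients wrt. the shared parameters factor through Z by the chain rule,

\nabla_{\theta^{sh}} L^t = \left(\frac{\partial Z}{\partial \theta^{sh}}\right)^{\top} \nabla_Z L^t,

so the min-norm objective \left\| \sum_t \alpha^t \nabla_{\theta^{sh}} L^t \right\|_2^2 is upper-bounded by

\left\| \frac{\partial Z}{\partial \theta^{sh}} \right\|_2^2 \, \left\| \sum_t \alpha^t \nabla_Z L^t \right\|_2^2,

and the shared Jacobian factor \partial Z / \partial \theta^{sh} can be dropped when solving for the weights \alpha^t.]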
Comments

I think you have two questions:
Thank you for the clarification. After going through the code more carefully and rereading your reply, I think I understand what you meant. The confusion was that by 'single pass' you meant 'a single pass over the shared parameters', which are typically orders of magnitude more numerous than the task-specific parameters. In your code, you still call backward() for each task, but only backpropagate as far as Z. So the saving comes from not having to backpropagate through \theta^sh T times, PLUS not having to run the forward pass over them T times. Is my understanding correct?
Yes, your understanding is correct. Also, keep in mind that the parameters between the shared representation and the task losses are disjoint across tasks, following the encoder-decoder assumption. Hence, we compute the gradient over any parameter only once.
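[A minimal PyTorch sketch of the scheme described above, assuming an encoder-decoder setup. `encoder`, `heads`, and `min_norm_solver` are hypothetical placeholders for illustration, not this repository's actual API:]

```python
import torch

def mgda_ub_backward(encoder, heads, x, targets, min_norm_solver):
    # Single forward pass through the shared encoder (the \theta^sh part).
    z = encoder(x)

    # Per-task backward passes on a detached copy of Z: each one traverses
    # only the small task-specific head, never the shared encoder, and it
    # also fills in that head's parameter gradients. Every task-specific
    # parameter is therefore touched exactly once.
    z_det = z.detach().requires_grad_(True)
    grads_z = []
    for head, y in zip(heads, targets):
        loss = head(z_det, y)   # hypothetical: head returns the task loss
        loss.backward()         # stops at z_det; encoder graph untouched
        grads_z.append(z_det.grad.clone())
        z_det.grad = None

    # Solve the min-norm problem over {\nabla_Z L^t} for the weights \alpha^t
    # (Frank-Wolfe in the paper; min_norm_solver is a placeholder here).
    alphas = min_norm_solver(grads_z)

    # One backward pass through the shared encoder, seeded with the weighted
    # gradient \sum_t \alpha^t \nabla_Z L^t. Each shared parameter's gradient
    # is thus also computed exactly once.
    z.backward(sum(a * g for a, g in zip(alphas, grads_z)))
```

[An optimizer step over the encoder and head parameters can then follow as usual; the per-task backward calls in the loop cost only head-sized work, which is the saving discussed above.]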