How is the derivative wrt. representations computed in a single backward pass? #20

Closed
chanshing opened this issue Mar 3, 2020 · 3 comments

Comments

@chanshing

First, thank you for the nice work. If you don't mind, I have a question:

You mention, right after defining the upper bound (Eq. 6), that the derivative w.r.t. the representations, i.e. \nabla_Z L^t, can be computed in a single backward pass for all tasks. But all tasks share the same representation nodes, so how is this possible? And wouldn't the same then apply to the original derivative w.r.t. the shared parameters?

@ozansener
Collaborator

I think you have two questions:

  • How can \nabla_Z L^t be computed in a single backward pass? We assume an encoder-decoder structure, as explained in that section. Hence the path from Z to L^t is independent across tasks. Since these paths are disjoint, computing all of the \nabla_Z L^t amounts to a single backward pass through the decoders (see the sketch after this list).

  • Wouldn't this apply to \nabla_{\theta^{sh}} L^t as well? No, because by the chain rule \nabla_{\theta^{sh}} L^t = \nabla_{\theta^{sh}} Z \nabla_Z L^t. You can compute all the \nabla_Z L^t values in a single pass. You could also compute \nabla_{\theta^{sh}} Z explicitly in a single pass, but that computation would be much more costly than computing \nabla_{\theta^{sh}} L^t directly, since automatic differentiation never explicitly forms \nabla_{\theta^{sh}} Z (reverse mode only evaluates vector-Jacobian products). This is hard to explain without getting into the details of AD, but walking through the computation graph explicitly shows the difference. In other words, it is not a single-pass computation in the sense that a single pass of AD cannot produce all the per-task gradients \nabla_{\theta^{sh}} L^t.
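To make the first point concrete, here is a minimal PyTorch-style sketch (toy modules with made-up shapes, not the repository's code) of how the per-task gradients \nabla_Z L^t can be taken by treating Z as a leaf, so that the backward passes only traverse the decoders:

```python
import torch
import torch.nn as nn

# Toy stand-ins for the shared encoder and two task decoders
# (illustrative shapes only, not the repository's models).
encoder = nn.Linear(16, 8)
decoders = [nn.Linear(8, 1), nn.Linear(8, 3)]
loss_fns = [nn.MSELoss(), nn.CrossEntropyLoss()]
inputs = torch.randn(4, 16)
labels = [torch.randn(4, 1), torch.randint(0, 3, (4,))]

# One forward pass through the shared parameters theta^sh.
z = encoder(inputs)
# Detach Z and mark it as requiring grad: backward passes started from the
# task losses stop here and never reach the encoder.
z_rep = z.detach().requires_grad_()

grads_z = []
for decoder, loss_fn, y in zip(decoders, loss_fns, labels):
    loss_t = loss_fn(decoder(z_rep), y)
    # Only this decoder's subgraph is traversed; because the decoders are
    # disjoint, the whole loop amounts to a single backward pass through
    # all the decoders.
    grads_z.append(torch.autograd.grad(loss_t, z_rep)[0])  # \nabla_Z L^t
```

Each decoder parameter is visited exactly once across the loop, which is why the total cost matches a single backward pass through the decoders.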

@chanshing
Author

Thank you for the clarification. After going through the code more carefully and reading your reply, I think I understand what you meant. The confusion was that by 'single pass' you meant a single pass over the shared parameters, which are typically orders of magnitude more numerous than the task-specific parameters. In your code you still call backward() for each task, but only backpropagate up to Z. So the saving comes from not having to backprop all the way to \theta^{sh} for every task, plus not having to forward over the shared parameters T times. Is my understanding correct?

@ozansener
Collaborator

Yes, your understanding is correct. Also keep in mind that, under the encoder-decoder assumption, the parameters on the paths from the shared representation to the task losses are disjoint. Hence we compute the gradient over any parameter only once.
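For completeness, here is a hedged, self-contained sketch of the overall pattern discussed in this thread (toy modules and a placeholder solve_min_norm helper; none of these names come from the repository): T cheap backward passes through the decoders only, followed by a single backward pass through the shared encoder on the weighted objective.

```python
import torch
import torch.nn as nn

# Toy stand-ins, not the repository's models or API.
encoder = nn.Linear(16, 8)
decoders = [nn.Linear(8, 1), nn.Linear(8, 3)]
loss_fns = [nn.MSELoss(), nn.CrossEntropyLoss()]
inputs = torch.randn(4, 16)
labels = [torch.randn(4, 1), torch.randint(0, 3, (4,))]

def solve_min_norm(grads):
    # Placeholder for the paper's min-norm (Frank-Wolfe) solver; uniform
    # weights are used here only so the sketch runs end to end.
    return [1.0 / len(grads)] * len(grads)

# One forward pass through the shared parameters theta^sh.
z = encoder(inputs)
z_rep = z.detach().requires_grad_()  # cut the graph at Z

# 1) Cheap per-task gradients at the representation (decoders only).
grads_z = [
    torch.autograd.grad(loss_fn(decoder(z_rep), y), z_rep)[0]
    for decoder, loss_fn, y in zip(decoders, loss_fns, labels)
]

# 2) Task weights alpha^t from the (placeholder) min-norm solver.
alphas = solve_min_norm([g.flatten() for g in grads_z])

# 3) One backward pass through the shared encoder: rebuild the losses on the
#    non-detached z so the gradients reach theta^sh, and backprop the
#    alpha-weighted sum once. Because the decoders are disjoint, every
#    parameter (shared or task-specific) is visited only once here.
weighted_loss = sum(
    a * loss_fn(decoder(z), y)
    for a, decoder, loss_fn, y in zip(alphas, decoders, loss_fns, labels)
)
weighted_loss.backward()
```

Note that the decoders are forwarded a second time on the non-detached z in step 3; the saving discussed above is on the shared parameters, which are forwarded and backpropagated only once regardless of the number of tasks T.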
