Broyden defeats the purpose of DEQs? #28
Hello @polo5,

Thanks for your interest in our repo and DEQ! To begin with, we want to caution that "constant memory cost" means constant w.r.t. the number of layers. That is, we have only one layer (e.g., one Transformer layer), and the memory consumption is not that of 2, 3, etc. layers. That said, you are absolutely right that both Broyden and Anderson need to store some past fixed-point estimates. In fact, we analyzed this "hidden cost" in the Jacobian regularization paper (see Sec. 3.4). However, we do want to point out that this memory cost is very minimal, because all we store per estimate is the activation tensor itself. For example, in the DEQ-Transformer case (which is a pretty large-scale model), the hidden units are a single activation tensor of modest size. Conventional neural networks, by contrast, are costly because the layers themselves are complex (each layer can cost hundreds or thousands of MBs). For example, in a Transformer layer we have to memorize not only the output (which is all a DEQ needs), but also everything that happens within the layer, e.g., the activation after self-attention, the one after LayerNorm, etc.

On your second question, my Broyden's method implementation is based directly on its Wikipedia page: https://en.wikipedia.org/wiki/Broyden%27s_method (see "good Broyden's method"). However, to make it GPU-efficient, there are two things that may make the resemblance a bit harder to see, and I'm happy to explain:
1. The "good Broyden" update changes the inverse Jacobian by a rank-one term at each step, i.e., J_n^{-1} = J_{n-1}^{-1} + u_n v_n^T. This means that instead of actually keeping a large matrix J^{-1} in memory, I can simply store the u vectors and the v vectors. This amounts to keeping a big U matrix and a big V matrix whose columns are u_1, u_2, ... and v_1, v_2, ...; in other words, we can write J_n^{-1} = -I + U V^T. At each Broyden iteration, I append the new u and v as columns of these matrices (this is the appending step in the solver code). So after L steps of Broyden iterations, U has L columns, each the size of one activation vector (and likewise for V).

2. Since U has dimension d x L (with d the activation size), we never form the d x d matrix J_n^{-1}: a product J_n^{-1} x is computed as -x + U (V^T x), which costs O(dL) rather than O(d^2). Similarly, the update rule itself contains terms like J_{n-1}^{-1} \delta f_n, which can be computed in the same fashion by first forming V^T \delta f_n and then U (V^T \delta f_n).

I hope this answers your question!
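To make the low-rank bookkeeping above concrete, here is a minimal, unbatched sketch of "good" Broyden's method that keeps J_n^{-1} = -I + U V^T implicitly (names and shapes are simplified placeholders; the repo's solver works on batched activation tensors and adds line search and other safeguards):

```python
import torch

def jinv_matvec(U, VT, x):
    # J^{-1} x  with  J^{-1} = -I + U V^T, without forming the d x d matrix.
    return -x + U @ (VT @ x)

def jinvT_matvec(U, VT, x):
    # J^{-T} x  with  J^{-T} = -I + V U^T.
    return -x + VT.t() @ (U.t() @ x)

def broyden_good_sketch(g, x0, max_iters=50, tol=1e-6):
    """Find a root of g (e.g., g(z) = f(z) - z for a fixed point of f) using
    'good' Broyden updates stored as rank-one pairs (u_n, v_n)."""
    d = x0.numel()
    U = torch.zeros(d, max_iters)    # columns u_1, ..., u_L
    VT = torch.zeros(max_iters, d)   # rows    v_1^T, ..., v_L^T
    x, gx = x0.clone(), g(x0)
    for n in range(max_iters):
        if gx.norm() < tol:
            break
        dx = -jinv_matvec(U[:, :n], VT[:n], gx)        # quasi-Newton step
        x_new = x + dx
        g_new = g(x_new)
        dg = g_new - gx
        # Rank-one update of the inverse Jacobian ("good" Broyden):
        # J_{n+1}^{-1} = J_n^{-1} + (dx - J_n^{-1} dg)(dx^T J_n^{-1}) / (dx^T J_n^{-1} dg)
        v = jinvT_matvec(U[:, :n], VT[:n], dx)         # v^T = dx^T J_n^{-1}
        u = (dx - jinv_matvec(U[:, :n], VT[:n], dg)) / (v @ dg)
        U[:, n] = u                                    # append the new pair
        VT[n] = v
        x, gx = x_new, g_new
    return x
```

For a fixed point of f one would call, e.g., `broyden_good_sketch(lambda z: f(z) - z, z0)`. The per-iteration storage is one new (u, v) pair of activation-sized vectors, and every product with J^{-1} goes through V^T first and then U, exactly as described above.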
Also, I want to add that in Anderson we usually keep only a small, fixed window of m past estimates, each again just an activation-sized tensor, so its extra memory stays bounded no matter how many iterations the solver runs.
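To make that memory picture concrete, below is a minimal, unbatched sketch of Anderson acceleration with a window of m past estimates (function name, hyperparameters, and defaults are illustrative, not the repo's exact implementation); the only persistent storage is the two m-by-d buffers:

```python
import torch

def anderson_sketch(f, x0, m=6, max_iters=50, tol=1e-6, lam=1e-4, beta=1.0):
    """Anderson acceleration for the fixed point x = f(x), keeping only the last
    m iterates and their f-values. Unbatched sketch with illustrative defaults."""
    d = x0.numel()
    X = torch.zeros(m, d)            # ring buffer of the last m iterates
    F = torch.zeros(m, d)            # ring buffer of the last m values of f
    X[0], F[0] = x0, f(x0)
    X[1], F[1] = F[0], f(F[0])
    x_new = X[1]
    for k in range(2, max_iters):
        n = min(k, m)
        G = F[:n] - X[:n]            # residuals g_i = f(x_i) - x_i
        # Small (n+1)x(n+1) system for mixing weights alpha with sum(alpha) = 1
        # (regularized least squares on the residuals).
        H = torch.zeros(n + 1, n + 1)
        H[0, 1:] = 1.0
        H[1:, 0] = 1.0
        H[1:, 1:] = G @ G.t() + lam * torch.eye(n)
        y = torch.zeros(n + 1)
        y[0] = 1.0
        alpha = torch.linalg.solve(H, y)[1:]
        # Mix the stored iterates/values; beta is an optional damping factor.
        x_new = beta * (alpha @ F[:n]) + (1.0 - beta) * (alpha @ X[:n])
        f_new = f(x_new)
        X[k % m], F[k % m] = x_new, f_new
        if (f_new - x_new).norm() < tol:
            break
    return x_new
```

The footprint is just X and F (2·m activation-sized tensors) plus a tiny (m+1)×(m+1) linear system per iteration.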
Thanks a lot @jerrybai1995! Your explanations are very clear and I understand your Broyden code much better :)
Apologies for abusing your kindness :)

P
Hi @polo5,
Hope this helps.
Hi @polo5, in addition to @jerrybai1995's reply:

I wonder how large the activation would have to be before the Broyden solver becomes impractical. In some cases, after properly normalizing the DEQ function, even the naive forward solver (plain fixed-point iteration) converges well. If you encounter stability issues with IFT + the naive forward solver, consider inexact gradients, Jacobian regularization, fixed point correction, or a combination of them. DEQ-Flow might be a helpful reference here: we successfully train the DEQ to satisfying forward-pass convergence and quite strong performance using the naive solver, phantom grad, and fixed point correction, compared with models trained solely by IFT or by very expensive dense BPTT. The method also scales well to larger models; see the latest implementation.

Hope this might help.

Zhengyang
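As a concrete illustration of the "naive solver + phantom grad" recipe, here is a minimal sketch (the function name, the damping factor tau, and the default k are placeholders of mine, not the DEQ-Flow code):

```python
import torch

def deq_phantom_grad(f, z0, solver, k=3, tau=1.0, **solver_kwargs):
    """Solve the fixed point without autograd, then unroll k damped steps from
    the solution with gradient tracking ("phantom"/inexact gradient). Backward
    memory grows with k (small), not with the forward solver's iteration count.
    `solver` can be plain fixed-point iteration, Broyden, Anderson, etc."""
    with torch.no_grad():
        z_star = solver(f, z0, **solver_kwargs)   # untracked forward solve
    z = z_star
    for _ in range(k):                            # tracked, damped refinement steps
        z = (1.0 - tau) * z + tau * f(z)
    return z
```

Fixed point correction and Jacobian regularization would be layered on top of this; the point of the sketch is only that the backward cost depends on k rather than on the forward solver.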
Great points @jerrybai1995 @Gsunshine!

Thanks a lot for the discussion! I think DEQs are very promising and don't get the attention they deserve ;)
Heya,
Thanks for your continued work in building better DEQs.
The main selling point of DEQs is that the solver can take as many steps as required to converge without increasing the memory. This isn't true for your implementation of Broyden, which starts off by pre-allocating buffers for all `max_iters` iterations (see the sketch after this comment), and therefore has a memory cost linear in `max_iters`, even though the ops aren't tracked. Anderson also keeps the previous `m` states in memory, where `m` is usually larger than the number of solver iterations needed anyway. Don't those solvers contradict the claim of constant memory cost?

On a related note, I've found it quite hard to modify these solvers even after going over the theory. Are there any notes or resources you could point to that would help people understand your implementation? Thanks!
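For reference, the kind of pre-allocation being described looks roughly like this (shapes and names are illustrative, patterned after a batched sequence model, not copied from the repo):

```python
import torch

# Illustrative sizes for a batched sequence model (not the repo's actual values).
bsz, d, seq_len, max_iters = 16, 512, 100, 40
dev = "cuda" if torch.cuda.is_available() else "cpu"

# Low-rank inverse-Jacobian factors, allocated up front for ALL max_iters steps.
Us = torch.zeros(bsz, d, seq_len, max_iters, device=dev)    # columns u_1, ..., u_L
VTs = torch.zeros(bsz, max_iters, d, seq_len, device=dev)   # rows v_1^T, ..., v_L^T

# Memory grows linearly with max_iters, even though none of this is tracked by
# autograd: each buffer here holds 16*512*100*40 floats, about 125 MiB in fp32.
print(Us.numel() * 4 / 2**20, "MiB per buffer")
```

This is exactly the `max_iters`-proportional cost (a stack of activation-sized tensors) that the replies above quantify.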