# Understanding the moving parts for training #76
Hi, I apologize for taking a while to respond. These are excellent questions, and I am actually trying to prepare an FAQ document for the upcoming release that includes these and more. That will take a few weeks longer, so I'll answer these directly here first.

## GLU activation

I personally view GLU activations as a minor architectural detail that empirically often improves performance for free and makes training smoother (see https://arxiv.org/abs/2002.05202). I'm not sure if there are any modern explanations for this phenomenon. Since adding this trick sometime between V1 and V2 (between the original paper and the Sashimi follow-up), I pretty much always use GLU as a safe default, because the slight overhead in parameters/computation is generally worth the increased performance. I wouldn't expect the performance gains to be dramatic, though. What are you seeing happen on ListOps? The original version of S4 had good results without using GLU.

## Dropout2d

First, I'll remark that Dropout2d is a bad name for this version of dropout; for sequences it should be called Dropout1d. For some more context, Dropout2d comes from Tompson et al., *Efficient Object Localization Using Convolutional Networks*: the idea is that when inputs vary "smoothly" (e.g. think of a very high-resolution image), the features will generally vary smoothly too, and vanilla Dropout won't really be doing much regularization. So Dropout1d (for sequences) ties the dropout mask across the length of the sequence, while Dropout2d (for images) ties it across the height/width of an image. The reason it's implemented as "Dropout2d" (instead of 1d) is that PyTorch didn't support Dropout1d until version 1.12, and Dropout2d emulated it. But the semantics were changed so that Dropout2d was completely broken in PyTorch 1.11 (pytorch/pytorch#77081). It's a whole mess, so this repo implements its own version called DropoutNd. I don't like using it because it's non-standard and horribly broken between different versions of PyTorch.
But for whatever reason, it helps on some small-scale tasks (especially sCIFAR).
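For intuition, the tied-mask idea can be sketched in a few lines. This is an illustration of the Dropout1d semantics, not this repo's DropoutNd implementation, and it assumes inputs shaped `(batch, channels, length)`:

```python
import torch
import torch.nn as nn

class TiedDropout1d(nn.Module):
    """Dropout with the mask tied (shared) across the sequence length.

    A sketch of the idea behind Dropout1d/DropoutNd, not this repo's
    implementation. Expects inputs of shape (batch, channels, length).
    """
    def __init__(self, p: float = 0.5):
        super().__init__()
        self.p = p

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if not self.training or self.p == 0.0:
            return x
        # One Bernoulli mask per (batch, channel), broadcast over length,
        # so a dropped feature is zeroed at every time step
        mask = torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p
        return x * mask / (1.0 - self.p)  # inverted-dropout rescaling
```

On PyTorch >= 1.12 this matches the semantics of `nn.Dropout1d` for 3-D inputs.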
**TL;DR**: My general recommendation is to use standard Dropout, unless overfitting is the main challenge on your problem and you just want to tune more things.

## Learning rate

The way I think of it is that the SSM parameters are more sensitive than the rest of the model. What I've found works in general is capping the learning rate on these SSM parameters, so I set it to a small value by default. Note that this is also called the max LR; the learning rate schedule should have the same shape as for the rest of the parameters, just scaled down proportionally at all times. (A mistake I've seen made is that people fix the LR of the special params to a constant instead of scaling it along with the schedule.)

The scheduler I've always used since V1 is a cosine scheduler with warmup, which seems quite standard. Another alternative is linear decay with warmup. Older experiments used a decay-on-plateau scheduler, which is also reasonable, but I found it harder to do controlled ablations or to control the training time. Finally, I recently found that increasing the number of warmup steps can help on hard problems (particularly PathX). I haven't really explored this on other tasks.

See also the response to questions about trainability here: srush/annotated-s4#67

## Parameter sharing

I'm a little confused what "sharing depth-wise" means. Does this mean to share them across the feature dimension, or across layers?

*Sharing across features*: This is an option supported out of the box in the current implementation.

*Sharing across depth (layers)*: I've never experimented with this, and I'm not sure what the motivation would be. There have been works trying similar things with other models (e.g. Universal Transformer), but I haven't really seen people do this commonly.

## NPLR parameterization (Hurwitz, etc.)

It's always recommended to use the improvements explained in the Sashimi paper, which are now defaults in the current implementation. In particular, the two main changes are the way the parameters are tied and the Hurwitz (stability) constraint on the state matrix.

## Which variant to use? S4, S4D, DSS, etc.

**S4 vs S4D**

S4 advantages:
S4 disadvantages:
S4D advantages:
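To give a sense of the difference in complexity, an S4D-style kernel fits in a few lines. Below is a hedged sketch of a diagonal SSM convolution kernel under zero-order-hold discretization; the shapes and the initialization at the bottom are illustrative assumptions, not this repo's exact API:

```python
import math
import torch

def diagonal_ssm_kernel(A, B, C, dt, L):
    """Materialize the length-L convolution kernel of a diagonal SSM.

    A, B, C: complex tensors of shape (N,); A should have negative real
    part for stability (the Hurwitz condition mentioned above).
    dt: scalar step size. Returns a real kernel of shape (L,).
    """
    # Zero-order-hold discretization of x' = Ax + Bu
    dA = torch.exp(dt * A)                                  # (N,)
    dB = (dA - 1) / A * B                                   # (N,)
    # K[l] = sum_n C_n * dA_n**l * dB_n  -- a Vandermonde product,
    # computed stably via exp(l * log(dA))
    log_dA = torch.log(dA)                                  # (N,)
    powers = torch.exp(log_dA[:, None] * torch.arange(L))   # (N, L)
    K = (C * dB) @ powers                                   # (L,)
    return 2 * K.real  # real part, accounting for conjugate pairs

# Example with assumed sizes: N = 32 states, kernel length 64
N, L = 32, 64
A = torch.complex(torch.full((N,), -0.5),
                  math.pi * torch.arange(N, dtype=torch.float32))
B = torch.ones(N, dtype=torch.cfloat)
C = torch.randn(N, dtype=torch.cfloat)
K = diagonal_ssm_kernel(A, B, C, dt=0.01, L=L)
```

The resulting `K` can be applied to an input sequence with an ordinary (FFT-based) convolution, which is the whole forward pass of a diagonal SSM layer.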
**TL;DR**: If you are going to use only one variant as a black box, and your problem does not have some of the characteristics where S4 might run into problems (in particular, sequences of changing lengths), S4 may be better. Otherwise, S4D is easier to set up and use, but you may need to play around with the initialization options.

**S4D vs DSS**

*Differences*: As for the actual differences, they can be fairly nuanced and are described in more detail in the S4D paper. The current implementation of S4D essentially supports all the parameterization choices that DSS made as options, so it is more general.

*Implementation*: S4D is the one I actively maintain, and I may even deprecate some of the DSS options in V4 that I think shouldn't be used. For historical completeness, I am planning to include a separate DSS implementation in the upcoming V4 release, but without reproducing all its experiments.

**TL;DR**: Use S4D, but credit DSS!

## Bidirectional vs. unidirectional

I generally stick to bidirectional if it makes sense for the problem. E.g. most classification tasks have no requirement to be causal, so it makes sense to propagate signal both ways. For tasks that do require unidirectionality (i.e. causality, such as autoregressive modeling, or perhaps settings involving "online inference"), you'd want a unidirectional model.

Note: the description of bidirectionality in Sashimi doesn't actually match what is done in this codebase. The implementation here is more parameter- and computation-efficient. V4 will also add a "more standard" option, which is the one described in Sashimi.

## Timescale / Delta

A final important hyperparameter that may affect training is the timescale (Delta) of the SSM. For some examples of how I've set it:
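As a generic illustration of the kind of setting involved: the timescale is commonly initialized log-uniformly between two bounds. The `dt_min`/`dt_max` names and values below are illustrative assumptions about the interface, not exact defaults:

```python
import math
import torch

def init_log_dt(H, dt_min=0.001, dt_max=0.1):
    """Sample per-feature timescales log-uniformly in [dt_min, dt_max).

    Rule of thumb: 1/dt should roughly match the length of dependencies
    you expect in the data. Names and default values here are
    illustrative assumptions, not this repo's exact configuration.
    """
    log_dt = (torch.rand(H) * (math.log(dt_max) - math.log(dt_min))
              + math.log(dt_min))
    # Stored in log-space so that dt = exp(log_dt) stays positive
    # even when log_dt is trained with gradient descent
    return log_dt

dt = torch.exp(init_log_dt(H=256))  # one timescale per feature
```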
## Other model parameters
Hi team!
The paper and the accompanying codebase are really great! We are trying to use S4 for a different problem and there are a lot of engineering details that seem to be affecting the training:
...
From your experience, could you share any ideas on how to choose from these options? Also, could you list any other important details that might affect training?