Understanding the moving parts for training #76

Open
nbgundavarapu opened this issue Nov 11, 2022 · 1 comment
@nbgundavarapu

Hi team!

The paper and the accompanying codebase are really great! We are trying to use S4 for a different problem, and there are a lot of engineering details that seem to affect training:

  1. Adding/removing GLU activation => Also affects ListOps a lot
  2. Dropout vs Dropout2D
  3. Learning rate of the state space parameters and their schedule in relation to rest of parameters
  4. Whether to share the diagonal and low-rank params depth-wise or not
  5. Whether to share log_step depth-wise or not => Leads to NaN loss at times
  6. Whether to use the NPLR formulation from S4 paper or Sashimi
  7. Whether to use S4 or S4D or DSS
  8. Bidirectional or unidirectional
    ...

From your experience, could you share any ideas on how to choose from these options? Also, could you list any other important details that might affect training?

@albertfgu
Contributor

albertfgu commented Dec 4, 2022

Hi, I apologize for taking a while to respond. These are excellent questions, and I am actually preparing a FAQ document for the upcoming release that includes these and more. It will take a few more weeks, so I'll answer these directly here first.

GLU activation

I personally view GLU activations as a minor architectural detail that empirically often improves performance for free and makes training smoother (see https://arxiv.org/abs/2002.05202). I'm not sure if there are any modern explanations for this phenomenon. Since adding this trick sometime between V1 and V2 (between the original paper and the Sashimi follow-up), I've pretty much always used GLU as a safe default, because the slight overhead in parameters/computation is generally worth the increased performance. I wouldn't expect the performance gains to be dramatic, though. What are you seeing happen on ListOps? The original version of S4 had good results without using GLU.
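
For concreteness, here is a minimal sketch of the kind of position-wise GLU block that typically sits on the SSM output (module and argument names are illustrative, not the repo's exact code); the non-GLU variant would roughly be a plain linear layer plus activation in the same place:

```python
import torch.nn as nn
import torch.nn.functional as F

class GLUOutput(nn.Module):
    """Position-wise output block: project to 2*d_model, then gate with GLU."""
    def __init__(self, d_model, dropout=0.0):
        super().__init__()
        self.dropout = nn.Dropout(dropout)
        # doubling the width lets F.glu split the output into a value half and a gate half
        self.linear = nn.Linear(d_model, 2 * d_model)

    def forward(self, x):  # x: (batch, length, d_model), e.g. the SSM layer's output
        x = self.dropout(x)
        return F.glu(self.linear(x), dim=-1)  # value * sigmoid(gate) -> (batch, length, d_model)
```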

Dropout2d

First, I'll remark that Dropout2d is a bad name for this version of dropout; for sequences it should be called Dropout1d. For some more context, Dropout2d comes from an earlier paper: the idea is that when inputs vary "smoothly" (e.g. think of a very high-resolution image), the features will generally vary smoothly too, and vanilla Dropout won't really be doing much regularization. So Dropout1d (for sequences) ties the dropout mask across the length of the sequence, while Dropout2d (for images) ties it across the height/width of an image.

The reason it's implemented as "Dropout2d" (instead of 1d) is that PyTorch didn't support Dropout1d until version 1.12, and Dropout2d emulated it. But the semantics were changed so that Dropout2d was completely broken in PyTorch 1.11 (pytorch/pytorch#77081). It's a whole mess, so this repo implements its own version called DropoutNd.
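
As a rough illustration of the idea behind DropoutNd (a sketch of the concept, not the repo's exact code), a sequence-wise tied dropout samples one mask per (batch, channel) and broadcasts it across the length dimension:

```python
import torch
import torch.nn as nn

class TiedDropout(nn.Module):
    """Dropout with one mask per (batch, channel), broadcast across the length dimension."""
    def __init__(self, p=0.25):
        super().__init__()
        self.p = p

    def forward(self, x):  # x: (batch, d_model, length)
        if not self.training or self.p == 0.0:
            return x
        # sample the mask once per feature, so entire channels are dropped for the whole sequence
        mask = (torch.rand(x.shape[0], x.shape[1], 1, device=x.device) > self.p).to(x.dtype)
        return x * mask / (1.0 - self.p)  # inverted scaling, as in standard dropout
```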

I don't like using it because it's non-standard and horribly broken between different versions of PyTorch. But for whatever reason it helps on some small-scale tasks (especially sCIFAR) ¯\_(ツ)_/¯ so I've left it in to get better performance in the small examples. My general experience is that

  • the differences are usually not too noticeable on most tasks
  • the differences seem to go away somewhat in different settings (e.g. larger models, small variations of architectures, etc.), although I haven't done careful ablations
  • there are plenty of other ways to get regularization, like just increasing the LR and WD

TL;DR: My general recommendation is to use standard Dropout, unless overfitting is the main challenge on your problem and you just want to tune more things.

Learning rate

The way I think of it is that the SSM parameters $\Delta, A, B$ involved in the differential equation $x'(t) = A x(t) + B u(t)$ are special, and get a lower learning rate. Another way to think about it is that in RNN form, the recurrence $x_{k+1} = \overline{A}x_k + \overline{B}u_k$ involves repeatedly multiplying by the same matrix $\overline{A}$, so it is very sensitive to this parameter, and a high LR makes it unstable. This phenomenon has also been observed in RNNs more generally.

What I've found works in general is capping the learning rate on these SSM parameters, so I set it by default to $\min(lr, 0.001)$ where $lr$ is the global learning rate (of the rest of the parameters). Another option is just to keep it equal to $0.1$ times the global LR. I don't really play around with it much, but I do think small changes can make a difference here.

Note that this is also called the max LR; the learning rate schedule should have the same shape as for the rest of the parameters, just scaled down proportionally at all times. (A mistake I've seen is fixing the LR of the special params to $0.001$; they should instead follow the same warmup and decay.)
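
In PyTorch this amounts to two optimizer parameter groups. A rough sketch under assumptions (the placeholder model and the name-based filtering below are hypothetical stand-ins for however the special $\Delta, A, B$ parameters are actually identified in the repo):

```python
import torch
import torch.nn as nn

# placeholder model so the sketch runs; in practice this would be the S4 model
model = nn.Sequential(nn.Linear(128, 128), nn.GELU(), nn.Linear(128, 128))

lr, wd = 4e-3, 0.05  # global learning rate and weight decay
ssm_params, other_params = [], []
for name, p in model.named_parameters():
    # hypothetical tagging by name; the real code identifies the special parameters differently
    is_ssm = any(key in name for key in ("log_dt", "A_", "B_"))
    (ssm_params if is_ssm else other_params).append(p)

optimizer = torch.optim.AdamW([
    {"params": other_params, "lr": lr, "weight_decay": wd},
    {"params": ssm_params, "lr": min(lr, 1e-3), "weight_decay": 0.0},  # capped LR, no weight decay
])
# Any scheduler (e.g. cosine with warmup) scales each group's LR by the same factor,
# so the SSM parameters keep the same warmup/decay shape, just at a lower magnitude.
```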

The scheduler I've always used since V1 is a cosine scheduler with warmup, which seems quite standard. Another alternative is linear decay with warmup. Older experiments used a decay-on-plateau scheduler, which is also reasonable but I found it harder to do controlled ablations or control the training time.

Finally, I recently found that increasing the number of warmup steps can help on hard problems (particularly PathX). I haven't really explored this on other tasks.

See also the response to questions about trainability here: srush/annotated-s4#67

Parameter sharing

I'm a little confused about what "sharing depth-wise" means. Does this mean sharing them across the $H$ features inside a given layer, or sharing them across every layer of a deep neural network?

Sharing across features: This is supported out-of-the-box with the n_ssm option. I generally don't find that it makes a big difference. On many tasks, I've noticed that sharing the parameters helps a little bit, but these were all low-data tasks; in other words, I interpret this as the standard "overfitting" phenomenon where having fewer parameters generalizes slightly better. The one notable exception is that on the LRA-PathX task, keeping all the parameters independent across the $H$ features is noticeably better (https://github.com/HazyResearch/state-spaces/blob/06dbbdfd0876501a7f12bf3262121badbc7658af/configs/experiment/lra/s4-lra-pathx.yaml#L25).
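
Conceptually, sharing across features just means keeping n_ssm independent copies of the SSM parameters and broadcasting them across the $H$ features; a rough sketch of the idea (illustrative shapes, not the exact repo code):

```python
import torch

H, N, n_ssm = 256, 64, 2  # features, state size, number of independent SSM copies
assert H % n_ssm == 0

# n_ssm independent copies of a (diagonal) state matrix
A = torch.randn(n_ssm, N, dtype=torch.cfloat)
# broadcast so that groups of H // n_ssm features share the same copy
A_per_feature = A.repeat_interleave(H // n_ssm, dim=0)  # shape (H, N)

# n_ssm = 1 -> fully tied across features; n_ssm = H -> fully independent (as in the PathX config)
```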

Sharing across depth (layers): I've never experimented with this and I'm not sure what the motivation would be. There have been works trying similar things with other models (e.g. Universal Transformer) but I haven't really seen people do this commonly.

Sharing $\Delta$: I'm also not sure what this question means. Sharing $\Delta$ across depth (layers) was never tried, as per the above. Sharing $\Delta$ across the $H$ features is not supported and was never tried, because the original intuition was that this parameter lets each feature learn a different timescale of the data.

NPLR Parameterization (Hurwitz etc)

It's always recommended to use the improvements explained in the Sashimi paper, which are now defaults in the current implementation. In particular, the two main changes are tying the $P, Q$ parameters and ensuring that the real part of the diagonal term of $A$ is negative.
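
A rough sketch of what those two changes amount to (illustrative names and shapes, not the repo's internals): parameterize the real part of the diagonal through an exponential so it stays negative (Hurwitz/stable), and use a single low-rank factor $P$ so that $Q$ is tied to $P$:

```python
import torch
import torch.nn as nn

H, N = 4, 64  # illustrative: number of SSM copies and state size

# diagonal part: real part forced negative via -exp(.)
log_A_real = nn.Parameter(torch.zeros(H, N))
A_imag = nn.Parameter(torch.arange(N, dtype=torch.float).repeat(H, 1))
# single low-rank factor; tying Q = P turns the low-rank correction into P P^*
P = nn.Parameter(torch.randn(H, N, dtype=torch.cfloat))

Lambda = -torch.exp(log_A_real) + 1j * A_imag  # (H, N), Re(Lambda) < 0
A = torch.diag_embed(Lambda) - P.unsqueeze(-1) * P.conj().unsqueeze(-2)  # NPLR: diag - P P^*
```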

Which variant to use? S4, S4D, DSS, etc.

S4 vs S4D

S4 advantages:
S4 is usually the best out-of-the-box and most robust to hyperparameters.

  • There are multiple variants of S4D initialization, and I've found that there are tasks where each of them fails:
    • S4D-Lin doesn't perform well on LRA-PathX, or on an (unpublished) EEG classification task that someone tried
    • S4D-LegS/S4D-Inv have shown instability at times, and in particular completely failed at the SC09 audio generation task from Sashimi.
    • S4 generally matches any variant of S4D on all tasks I've tried
  • In earlier hyperparameter settings for LRA-PathX (those reported in the initial version of the S4D paper), S4 performed noticeably better than the best S4D. These hyperparameters turned out to be suboptimal and S4D can catch up, but it seems like S4 is more robust.

S4 disadvantages:

  • The main disadvantage is its complexity. If you want to get into the internals and modify things or add other features, it will be much more complicated.
  • Another big disadvantage that can crop up is that S4 is "attuned" to a particular sequence length. If the sequence lengths in your data can be wildly different or change over time, it is not recommended to use S4.
  • S4D currently supports more exotic features, for example "state forwarding", because they were easier to implement.

S4D advantages:

  • Much better for varying sequence lengths
  • Easier to understand and modify
  • Slightly faster (maybe 5-15% depending on the exact data and model)

TL;DR: If you are going to use only 1 variant as a black box, and your problem does not have some of the characteristics where S4 might run into problems (in particular, sequences of changing lengths), S4 may be better. Otherwise, S4D is easier to set up and use, but you may need to play around with the initialization options.

S4D vs DSS

History: S4 described a diagonal SSM model but did not experiment with it; DSS empirically found that (a variant of) this performed well; and S4D elaborates on the original model theoretically and empirically. S4D and DSS are extremely close, and for all intents and purposes the names can be used interchangeably. Although I usually use the term S4D, full credit should go to DSS for being the first to experiment with this type of diagonal SSM.

Differences: As for the actual differences, they can be fairly nuanced and are described in more detail in the S4D paper. The current implementation of S4D essentially supports all the parameterization choices that DSS made as options, so it is more general.

Implementation: S4D is the one I actively maintain, and I may even deprecate some of the DSS options in V4 that I think shouldn't be used. For historical completeness, I am planning to include a separate DSS implementation for the upcoming V4 release, but without reproducing all its experiments.

TL;DR: Use S4D, but credit DSS!

Bidirectional vs unidirectional

I generally stick to bidirectional if it makes sense for the problem. E.g. most classification tasks have no requirement to be causal, so it makes sense to propagate signal both ways. For tasks that require unidirectionality (i.e. causality, such as for autoregressive modeling tasks, or perhaps settings involving "online inference"), then you'd want a unidirectional model.

Note: the description of bidirectional in Sashimi doesn't actually match what is done in this codebase. The implementation here is more parameter- and computation-efficient. V4 will also add a "more standard" option, which is the one described in Sashimi.
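
As a conceptual sketch only (this is the generic two-direction pattern, not this codebase's more efficient implementation), bidirectional processing can be thought of as running one causal layer forward and another on the reversed sequence, then combining; `layer_fn` below is a hypothetical factory for any causal sequence layer:

```python
import torch
import torch.nn as nn

class Bidirectional(nn.Module):
    """Run one causal layer on the sequence and another on its reverse, then combine."""
    def __init__(self, layer_fn, d_model):
        super().__init__()
        self.fwd = layer_fn()
        self.bwd = layer_fn()
        self.out = nn.Linear(2 * d_model, d_model)

    def forward(self, x):  # x: (batch, length, d_model)
        y_fwd = self.fwd(x)
        y_bwd = self.bwd(x.flip(1)).flip(1)  # process the reversed sequence, then un-reverse
        return self.out(torch.cat([y_fwd, y_bwd], dim=-1))

# e.g. Bidirectional(lambda: nn.Linear(128, 128), 128) with a stand-in layer
```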

Timescale / Delta

Final important hyperparameters that may affect training are the dt_min and dt_max flags. These initialize $\Delta$ randomly for each feature within the specified range (see the sketch after the examples below). $\Delta$ represents what I call a "timescale" and is roughly inverse to the range of dependencies the model is focusing on (HTTYH).

For some examples of how I've set it:

  • Generally, it's been set to $\Delta_- = 0.001$ and $\Delta_+ = 0.1$, which was found to be an intuitive and sensible default for sequences of length ~100 to a few thousand. Most tasks that S4 has been evaluated on are in this range.
  • PathX sets it to $(\Delta_-, \Delta_+) = (0.0001, 0.01)$ or $(\Delta_-, \Delta_+) = (0.0001, 0.1)$, which performs much better than the default.
  • For short sequences (tens to ~100 long) you would want to set it much higher. For example, the S4ND paper set it to something like $(\Delta_-, \Delta_+) = (0.01, 1.0)$ or $(0.1, 1.0)$ or even higher.
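
For reference, the timescale is initialized log-uniformly between the two endpoints, roughly like this (a sketch, one value per feature):

```python
import math
import torch

H = 256                      # number of features
dt_min, dt_max = 1e-3, 1e-1  # the usual defaults discussed above

# sample log(dt) uniformly, so dt is spread geometrically across [dt_min, dt_max]
log_dt = torch.rand(H) * (math.log(dt_max) - math.log(dt_min)) + math.log(dt_min)
dt = torch.exp(log_dt)       # each feature gets its own timescale
```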

Other model parameters

$N$ or d_state: I almost always fix it to $N=64$, and changing it usually doesn't make a huge difference. But it can make some difference; I think larger $N$ might have been helpful on some time series problems, whereas I heard that smaller $N$ helps on ListOps. Intuitively, small $N$ makes smoother convolution kernels and is closer to an exponential moving average (EMA), whereas large $N$ can make more complex and higher-frequency convolution kernels.

$H$ or d_model: This is analogous to the number of channels in a CNN or width of a Transformer. Usually something like $128$ or $256$ for small models and $1024$ for large models seems pretty standard.

prenorm: This option is generally considered to make optimization easier, but might hurt generalization. I think pre-norm is generally favored, but for small-data tasks where overfitting is the issue post-norm can be worth trying.
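
For clarity, the two orderings differ only in where the normalization sits relative to the residual connection (a minimal sketch; `layer` is any block such as S4 + GLU, `norm` e.g. a LayerNorm):

```python
def prenorm_block(x, layer, norm):
    # normalize the input to the layer, keep the residual path clean: easier to optimize
    return x + layer(norm(x))

def postnorm_block(x, layer, norm):
    # normalize after adding the residual: the classic ordering, sometimes generalizes better
    return norm(x + layer(x))
```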
