Unused parameters in training #26
Comments
I am not sure how to reproduce this issue since we don't use DDP directly, but through PyTorch Lightning.
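(For context, a minimal sketch of how that indirection looks, assuming a recent PyTorch Lightning version where `DDPStrategy` forwards extra keyword arguments such as `find_unused_parameters` to `DistributedDataParallel`; the trainer settings here are illustrative only.)

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Illustrative only: Lightning constructs the DistributedDataParallel
# wrapper internally; extra kwargs passed to DDPStrategy (such as
# find_unused_parameters) are forwarded to DDP.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=DDPStrategy(find_unused_parameters=True),
)
```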
I was able to track down the issue and fix it locally. It is a small fix; I'll send you a PR addressing it in the next few days.
Thank you! What was the general issue?
Hi! Apologies for the late fix. I sent you a PR fixing the issue. The problem was variable naming. In particular,
Thanks for pointing this out! I think we never turned off residuals, so we didn't notice.
Hi! I'm running some experiments using your code. For my use case, I'm using `torch.nn.DistributedDataParallel`, which automatically detects unused parameters, i.e., parameters that get no gradients. The unused parameters are:

- `D` (from the `S4` module)
- `output_linear.weight` and `output_linear.bias` (from the `S4` module); these are instances of the `TransposedLinear` layer
- `kernel.C` (from `SSKernelNPLR`)

I have manually confirmed these parameters don't get gradients by running the following code after computing the loss:
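(The snippet itself wasn't preserved in this thread; a minimal check of this kind might look like the following, assuming `model` is the S4-based module and `loss.backward()` has already been called.)

```python
# Sketch of the kind of check described above (the original snippet
# is not shown in this thread). Assumes `model` is the S4-based module
# and that loss.backward() has already run, so gradients exist.
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        print(f"parameter received no gradient: {name}")
```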
Usually, the above means the parameters are instantiated but not used. In this case, surprisingly, all the parameters get used in the `forward` method. However, none of them get used in "vanilla" PyTorch ops: `D`, `output_linear.weight`, and `output_linear.bias` get used through `opt_einsum.contract`, and `kernel.C` gets used through your Cauchy GPU op.

Can you confirm the issue on your end? These parameters all look important for the model.
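(For what it's worth, `opt_einsum.contract` is normally autograd-compatible when given torch tensors, since it dispatches to torch ops under the hood, so the contraction alone shouldn't detach gradients. A toy check, with illustrative shapes rather than the model's actual einsum:)

```python
import torch
from opt_einsum import contract

# Toy check (illustrative shapes): with torch tensors, opt_einsum
# dispatches to torch ops, so autograd tracks the contraction.
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 3, requires_grad=True)
y = contract("bi,io->bo", x, w)
y.sum().backward()
print(w.grad is not None)  # expected: True
```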