
Unused parameters in training #26

Closed
telmop opened this issue Apr 28, 2022 · 5 comments · Fixed by #30

Comments

@telmop
Contributor

telmop commented Apr 28, 2022

Hi! I'm running some experiments using your code. For my use case, I'm wrapping the model in torch.nn.parallel.DistributedDataParallel (DDP), which automatically detects unused parameters, i.e., parameters that receive no gradients.
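For reference, this is roughly how the model is wrapped in my setup (a minimal sketch; build_model and local_rank are placeholders, not names from this repo):

import torch
from torch.nn.parallel import DistributedDataParallel as DDP

# Assumes torch.distributed.init_process_group(...) has already been called.
model = build_model().to(local_rank)  # placeholder for constructing the S4-based model
model = DDP(
    model,
    device_ids=[local_rank],
    find_unused_parameters=True,  # lets DDP detect parameters that receive no gradient
)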

The unused parameters are:

  • D (from the S4 module)
  • output_linear.weight and output_linear.bias (from the S4 module). These belong to a TransposedLinear layer.
  • kernel.C (from SSKernelNPLR).

I have manually confirmed these parameters get no gradients by running the following code after the backward pass:

# Any parameter whose .grad is still None after backward() never received a gradient.
for name, param in model.named_parameters():
    if param.grad is None:
        print(name)

Usually, this means the parameters are instantiated but never used. In this case, surprisingly, all of these parameters are used in the forward method; however, none of them go through "vanilla" PyTorch ops. D, output_linear.weight, and output_linear.bias are used through opt_einsum.contract, and kernel.C is used through your Cauchy GPU op.

Can you confirm the issue on your end? These parameters all look important for the model.

@albertfgu
Contributor

I am not sure how to reproduce this issue, since we don't use DDP directly but rather through PyTorch Lightning.

  • If you want us to check this issue specifically, it would help to provide a training script showing how you're using the model.
  • It sounds like the issue involves the interaction of opt_einsum.contract with DDP, so it's not specific to this repo. Is that correct?

@telmop
Contributor Author

telmop commented May 5, 2022

I was able to track down the issue and fix it locally. It's a small fix; I'll send you a PR addressing it in the next few days.

@albertfgu
Contributor

Thank you! What was the general issue?

@telmop
Contributor Author

telmop commented May 12, 2022

Hi! Apologies for the late fix. I sent you a PR fixing the issue. The problem was variable naming: self.norm and self.pool in SequenceResidualBlock were applied to x rather than y. When self.residual is present everything works fine, but when it is None, x is still the block input, so self.layer's output never reaches the loss and its parameters receive no gradients.
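To illustrate the pattern (a rough stand-in mirroring the description above, not the actual code from the repo or the PR):

import torch.nn as nn

class ResidualBlockSketch(nn.Module):
    """Illustrative stand-in for SequenceResidualBlock; not the repo's code."""

    def __init__(self, layer, residual=None, norm=None, pool=None):
        super().__init__()
        self.layer = layer
        self.residual = residual
        self.norm = norm
        self.pool = pool

    def forward(self, x):
        y = self.layer(x)
        # Combine with the input when a residual is configured; otherwise keep the layer output.
        y = self.residual(x, y) if self.residual is not None else y
        # The bug was applying norm/pool to x here instead of y: with residual=None,
        # x is still the block input, so self.layer's output (and its parameters)
        # never reached the loss.
        if self.norm is not None:
            y = self.norm(y)
        if self.pool is not None:
            y = self.pool(y)
        return y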

@albertfgu
Contributor

Thanks for pointing this out! I think we never turned off residuals, so we didn't notice.
