Unused parameters in training #26
Comments
I am not sure how to reproduce this issue since we don't use DDP directly, but through PyTorch Lightning.
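(For context, a minimal sketch of how that indirection looks, assuming a recent PyTorch Lightning version where `DDPStrategy` forwards extra keyword arguments such as `find_unused_parameters` to `DistributedDataParallel`; the trainer settings here are illustrative only.)

```python
import pytorch_lightning as pl
from pytorch_lightning.strategies import DDPStrategy

# Illustrative only: Lightning constructs the DistributedDataParallel
# wrapper internally; extra kwargs passed to DDPStrategy (such as
# find_unused_parameters) are forwarded to DDP.
trainer = pl.Trainer(
    accelerator="gpu",
    devices=2,
    strategy=DDPStrategy(find_unused_parameters=True),
)
```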
I was able to track down the issue and fix it locally. It is a small fix; I'll send you a PR addressing it in the next few days.
Thank you! What was the general issue?
Hi! Apologies for the late fix. I sent you a PR fixing the issue. The problem was variable naming. In particular,
Thanks for pointing this out! I think we never turned off residuals, so we didn't notice.
Hi! I'm running some experiments using your code. For my use case, I'm using `torch.nn.DistributedDataParallel`, which automatically detects unused parameters, i.e., parameters that get no gradients. The unused parameters are:

- `D` (from the `S4` module)
- `output_linear.weight` and `output_linear.bias` (from the `S4` module); these are instances of the `TransposedLinear` layer
- `kernel.C` (from `SSKernelNPLR`)

I have manually confirmed these parameters don't get gradients by running the following code after computing the loss:
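(The snippet itself wasn't preserved in this thread; a minimal check of this kind might look like the following, assuming `model` is the S4-based module and `loss.backward()` has already been called.)

```python
# Sketch of the kind of check described above (the original snippet
# is not shown in this thread). Assumes `model` is the S4-based module
# and that loss.backward() has already run, so gradients exist.
for name, param in model.named_parameters():
    if param.requires_grad and param.grad is None:
        print(f"parameter received no gradient: {name}")
```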
Usually, the above means the parameters are instantiated but not used. In this case, surprisingly, all the parameters get used in the `forward` method. However, none of them get used in "vanilla" PyTorch ops: `D`, `output_linear.weight`, and `output_linear.bias` get used through `opt_einsum.contract`, and `kernel.C` gets used through your Cauchy GPU op.

Can you confirm the issue on your end? These parameters all look important for the model.
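(For what it's worth, `opt_einsum.contract` is normally autograd-compatible when given torch tensors, since it dispatches to torch ops under the hood, so the contraction alone shouldn't detach gradients. A toy check, with illustrative shapes rather than the model's actual einsum:)

```python
import torch
from opt_einsum import contract

# Toy check (illustrative shapes): with torch tensors, opt_einsum
# dispatches to torch ops, so autograd tracks the contraction.
x = torch.randn(4, 8, requires_grad=True)
w = torch.randn(8, 3, requires_grad=True)
y = contract("bi,io->bo", x, w)
y.sum().backward()
print(w.grad is not None)  # expected: True
```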