
MuP Coord Check not Working with Electra Style Model #27

Closed
zanussbaum opened this issue Nov 6, 2022 · 8 comments

@zanussbaum
Contributor

I'm trying to use an Electra-style model with µP, but I am not able to get the coord check plots to work correctly. Currently, I have Readout layers on both the Discriminator and the Generator.

Coord checks for the Discriminator and the Generator alone look correct, but when the two are combined, the µP plot does not look as expected.
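Roughly, the two heads are wired like this (a minimal sketch with made-up module names and widths, not my actual model; it assumes mup's MuReadout as the drop-in for the final nn.Linear of each head):

```python
import torch.nn as nn
from mup import MuReadout

class TinyElectra(nn.Module):
    """Toy stand-in for the real generator + discriminator pair."""
    def __init__(self, width=256, vocab_size=30522):
        super().__init__()
        # Placeholders for the real Electra encoder stacks.
        self.generator = nn.Linear(width, width)
        self.discriminator = nn.Linear(width, width)
        # Readout layers on both heads: MLM head on the generator,
        # replaced-token-detection head on the discriminator.
        self.mlm_head = MuReadout(width, vocab_size)
        self.rtd_head = MuReadout(width, 1)

    def forward(self, gen_hidden, disc_hidden):
        mlm_logits = self.mlm_head(self.generator(gen_hidden))
        rtd_logits = self.rtd_head(self.discriminator(disc_hidden))
        return mlm_logits, rtd_logits
```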

Generator coord checks:
[plots: μp_electra_generator_adam_lr0.001_nseeds5_coord, sp_electra_generator_adam_lr0.001_nseeds5_coord]

Discriminator coord checks:
[plots: μp_electra_adam_lr0.001_nseeds5_coord, sp_electra_adam_lr0.001_nseeds5_coord]

Electra model coord checks:
[plots: sp_electra_model_adam_lr0.001_nseeds5_coord, μp_electra_model_adam_lr0.001_nseeds5_coord]

Will µP not work for "multi-task" losses like this one, where the overall loss is a weighted sum of mlm_loss and disc_loss?
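For concreteness, the combined objective is just a weighted sum (a sketch with dummy tensors; disc_weight is a placeholder name for the weighting coefficient):

```python
import torch

# Dummy stand-ins for the two losses the model actually produces.
mlm_loss = torch.tensor(2.3, requires_grad=True)
disc_loss = torch.tensor(0.7, requires_grad=True)

disc_weight = 50.0  # placeholder value for the discriminator loss weight
loss = mlm_loss + disc_weight * disc_loss
loss.backward()
```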

@thegregyang
Contributor

thegregyang commented Nov 6, 2022 via email

@zanussbaum
Contributor Author

Similar to this, we use the logits from the Generator to sample replacement tokens for the input to the Discriminator. The Discriminator then tries to predict which tokens have been replaced.
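Roughly like this (a sketch with made-up function and argument names, not our actual code):

```python
import torch

def build_discriminator_inputs(input_ids, masked_positions, gen_logits):
    """Sketch of the ELECTRA-style replacement step described above.

    input_ids:        (batch, seq_len) original token ids
    masked_positions: (batch, seq_len) bool mask of positions fed to the generator as [MASK]
    gen_logits:       (batch, seq_len, vocab) generator output logits
    """
    # Sample replacement tokens from the generator's distribution at every position.
    probs = torch.softmax(gen_logits, dim=-1)
    sampled = torch.multinomial(probs.flatten(0, 1), num_samples=1).view(input_ids.shape)

    # Replace only the masked positions; everything else keeps the original token.
    disc_input_ids = torch.where(masked_positions, sampled, input_ids)

    # Discriminator labels: 1 where the token differs from the original (i.e. was replaced).
    disc_labels = (disc_input_ids != input_ids).long()
    return disc_input_ids, disc_labels
```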

@thegregyang
Contributor

thegregyang commented Nov 6, 2022 via email

@zanussbaum
Contributor Author

It looks like the attention layers, such as
discriminator.electra.encoder.layer.0.attention.output.dense,
discriminator.electra.encoder.layer.0.attention.output.dropout,
discriminator.electra.encoder.layer.0.attention.self.key,
discriminator.electra.encoder.layer.0.attention.self.query, and
discriminator.electra.encoder.layer.0.attention.self.value,
are the ones blowing up. The same pattern shows up in the generator layers. However, it seems a bit odd to me that the individual layers blow up while the full layer stays roughly constant?

https://docs.google.com/spreadsheets/d/1vd_cVkNAbr0jSLax_IrH4sjjIcanxfOXar6cbYVS3DE/edit?usp=sharing
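(For reference, the per-layer statistic in the sheet is the usual coord-check one: the average absolute value of each module's output at each step. A hand-rolled way to record it, independent of mup's coord-check helpers and with a made-up function name, is sketched below.)

```python
import torch
import torch.nn as nn

def record_l1_norms(model: nn.Module, store: dict):
    """Attach forward hooks that record the mean |output| of every leaf module.

    Under muP this statistic should stay roughly constant as width grows;
    blowing up with width indicates a scaling bug.
    """
    handles = []
    for name, module in model.named_modules():
        if len(list(module.children())) > 0:
            continue  # only leaf modules (Linear, LayerNorm, Dropout, ...)

        def hook(mod, inputs, output, name=name):
            if isinstance(output, torch.Tensor):
                store.setdefault(name, []).append(output.abs().mean().item())

        handles.append(module.register_forward_hook(hook))
    return handles  # call h.remove() on each handle when done
```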

@thegregyang
Contributor

thegregyang commented Nov 7, 2022 via email

@zanussbaum
Contributor Author

zanussbaum commented Nov 7, 2022

For inputs, the data fed into the model is roughly the same. The main difference I see is which tokens are masked, but I don't expect that to have a large impact on what is fed into the generator. The only change from before is that masking is now handled within the Electra class instead of within the DataCollator. I'll do some more debugging, since the generator layers should have the same L1 norms.

Another difference is that instead of backpropagating directly with respect to the generator loss, we backprop with respect to the weighted sum of the generator and discriminator losses, but again I agree that this shouldn't affect the coord checks, at least at t == 1 for the generator.

@zanussbaum
Contributor Author

@thegregyang sorry for the brief previous comment; I updated it with some (hopefully) more useful info.

@zanussbaum
Contributor Author

OK, I found the bug in my code! I was calling _init_weights before set_base_shapes, and then came across a comment saying that it needs to be called after set_base_shapes. Thanks for the help pointing me in the right direction 😄 @thegregyang
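For anyone who hits the same thing, the fix is purely the order of the two calls. A minimal sketch (ToyModel and its _init_weights are stand-ins, not the actual Electra wrapper):

```python
import torch.nn as nn
from mup import MuReadout, set_base_shapes

class ToyModel(nn.Module):
    """Stand-in for the real Electra wrapper."""
    def __init__(self, width):
        super().__init__()
        self.body = nn.Linear(128, width)
        self.head = MuReadout(width, 2)

    def _init_weights(self, module):
        # HF-style per-module init, applied via model.apply(...).
        if isinstance(module, nn.Linear):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()

model = ToyModel(width=1024)   # model to train
base = ToyModel(width=256)     # base-width model, used only for its shapes

set_base_shapes(model, base)      # 1) register base shapes first...
model.apply(model._init_weights)  # 2) ...then (re-)initialize the weights
```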
