
Why are LayerNorm layers frozen? #1

Open
iliaschalkidis opened this issue Feb 23, 2021 · 6 comments

Comments

@iliaschalkidis

Hi @hosein-m,

I read your code and the paper (https://arxiv.org/pdf/1902.00751.pdf). According to the paper, the LayerNorm layers should be trainable. Am I missing something?

https://github.com/hosein-m/TF-Adapter-BERT/blob/8ddad140dc8c61b5db4db50d47fc258b0e9868cb/run_tf_glue_adapter_bert.py#L110
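For context, a minimal sketch (not this repo's exact code) of how training could be restricted to the adapters, the LayerNorm parameters, and the task head while the rest of BERT stays frozen; the variable-name substrings used for filtering and the assumption that the model returns logits are mine:

```python
import tensorflow as tf

def adapter_train_vars(model, keys=("adapter", "LayerNorm", "pooler", "classifier")):
    """Variables the optimizer is allowed to update; everything else stays frozen."""
    return [v for v in model.trainable_variables if any(k in v.name for k in keys)]

def make_train_step(model, optimizer, loss_fn):
    train_vars = adapter_train_vars(model)

    @tf.function
    def train_step(x, y):
        with tf.GradientTape() as tape:
            logits = model(x, training=True)   # assumed to return logits
            loss = loss_fn(y, logits)
        grads = tape.gradient(loss, train_vars)
        optimizer.apply_gradients(zip(grads, train_vars))
        return loss

    return train_step
```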

@hmohebbi
Owner

Hi @iliaschalkidis, thank you for the correction. I agree with you. As I recall, freezing the LayerNorm layers' parameters yielded a better GLUE score in my experiments.

@iliaschalkidis
Author

Great, thanks!

I would also recommend making the pooler trainable, too. It's a randomly initialized layer that should be fine-tuned; otherwise, it will remain as it is (randomly initialized). You could also skip this layer and use bert(inputs)[0][:, 0], the final [CLS] token, as a document representation instead of the pooler output, bert(inputs)[1].
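For illustration, a short sketch of the two representations side by side, assuming a HuggingFace TFBertModel (model name and example input are just placeholders):

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example document.", return_tensors="tf")
outputs = bert(inputs)

cls_repr = outputs[0][:, 0]   # final hidden state of the [CLS] token
pooled_repr = outputs[1]      # pooler output: dense + tanh applied to [CLS]
```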

I was also wondering whether the TruncatedNormal initialization with stddev=0.02 is aligned with this part of the paper:

"With the skip-connection, if the parameters of the projection layers are initialized to near-zero, the module is initialized to an approximate identity function."

In the official implementation, stddev=1e-3 is used: https://github.com/google-research/adapter-bert/blob/1a31fc6e92b1b89a6530f48eb0f9e1f04cc4b750/modeling.py#L321, which seems a better approximation of "near-zero", since BERT's own weights are already quite close to zero anyway. 😄
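For illustration, a minimal sketch of such a near-zero initialization for the adapter projections; the bottleneck size, hidden size, and activation below are placeholders, not this repo's exact settings:

```python
import tensorflow as tf

adapter_init = tf.keras.initializers.TruncatedNormal(stddev=1e-3)

down_project = tf.keras.layers.Dense(
    64,                     # bottleneck size m
    activation="gelu",
    kernel_initializer=adapter_init,
    bias_initializer="zeros",
)
up_project = tf.keras.layers.Dense(
    768,                    # hidden size d
    kernel_initializer=adapter_init,
    bias_initializer="zeros",
)
```

With the skip-connection around the adapter, these near-zero projections leave the module close to an identity function at the start of training.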

Thanks for your implementation!

@hmohebbi
Owner

Thanks for the detailed comments, @iliaschalkidis.

It is worth mentioning that after calling the .from_pretrained method in line 74, the pooler layer is initialized with its pre-trained weights, which were trained on the NSP task (see huggingface/transformers#300).
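A quick sanity check of this (attribute names assume the HuggingFace TF BERT implementation; from_pretrained would also warn if the pooler weights were missing from the checkpoint):

```python
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")
pooler_kernel = bert.bert.pooler.dense.kernel
print(pooler_kernel.shape)   # (768, 768), loaded from the checkpoint, not re-initialized
```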

hmohebbi added a commit that referenced this issue Feb 25, 2021
near-zero initialization
@iliaschalkidis
Author

Sorry @hosein-m, one last question:

https://github.com/hosein-m/TF-Adapter-BERT/blob/8ddad140dc8c61b5db4db50d47fc258b0e9868cb/modeling_tf_adapter_bert.py#L10

Does this mean that you use the very same AdapterModule for the two sub-layers, back to back?

@hmohebbi
Owner

Sorry @iliaschalkidis for closing the issue!

According to Houlsby’s architecture, there must be two Adapter modules in each Transformer layer: one in the TFBertSelfOutput component and the other in TFBertOutput, and both of them must share weights. So, this line fulfills that purpose and is aligned with this part of the paper:
"In each layer, the total number of parameters added per layer, including biases, is 2md + d + m."

Please let me know if I miss something :)
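For illustration, a small sketch of what tying the two insertion points can look like in Keras: the same layer instance is called in both sub-layers, so its 2md + d + m parameters (e.g. 2·64·768 + 768 + 64 = 99,136 for d=768, m=64) are created once per Transformer layer and shared. The Adapter class below is illustrative, not this repo's exact module:

```python
import tensorflow as tf

class Adapter(tf.keras.layers.Layer):
    def __init__(self, hidden_size=768, bottleneck=64, **kwargs):
        super().__init__(**kwargs)
        init = tf.keras.initializers.TruncatedNormal(stddev=1e-3)
        self.down = tf.keras.layers.Dense(bottleneck, activation="gelu",
                                          kernel_initializer=init)
        self.up = tf.keras.layers.Dense(hidden_size, kernel_initializer=init)

    def call(self, hidden_states):
        # bottleneck with a residual connection: near-identity at initialization
        return hidden_states + self.up(self.down(hidden_states))

shared_adapter = Adapter()
# Inside one Transformer layer, both sub-layers would call the same instance:
#   attention_out = shared_adapter(attention_out)   # in TFBertSelfOutput
#   ffn_out       = shared_adapter(ffn_out)         # in TFBertOutput
```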

hmohebbi reopened this Feb 26, 2021
@iliaschalkidis
Author


Yeah, I saw this line in the paper and suspected this was your motivation. They should have phrased it more clearly, e.g. "The two adapter layers are tied." or something similar. I cannot validate this against the original implementation.

Thanks again!
