
Why are LayerNorm layers frozen? #1

Open
iliaschalkidis opened this issue Feb 23, 2021 · 6 comments

Comments

@iliaschalkidis

Hi @hosein-m,

I read your code and the paper (https://arxiv.org/pdf/1902.00751.pdf). According to the paper, the LayerNorm layers should be trainable. Am I missing something?

https://github.com/hosein-m/TF-Adapter-BERT/blob/8ddad140dc8c61b5db4db50d47fc258b0e9868cb/run_tf_glue_adapter_bert.py#L110
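For context, a minimal sketch (not this repo's exact code) of how training could be restricted to the adapters, the LayerNorm parameters, and the task head while the rest of BERT stays frozen; the variable-name substrings used for filtering and the assumption that the model returns logits are mine:

```python
import tensorflow as tf

def adapter_train_vars(model, keys=("adapter", "LayerNorm", "pooler", "classifier")):
    """Variables the optimizer is allowed to update; everything else stays frozen."""
    return [v for v in model.trainable_variables if any(k in v.name for k in keys)]

def make_train_step(model, optimizer, loss_fn):
    train_vars = adapter_train_vars(model)

    @tf.function
    def train_step(x, y):
        with tf.GradientTape() as tape:
            logits = model(x, training=True)   # assumed to return logits
            loss = loss_fn(y, logits)
        grads = tape.gradient(loss, train_vars)
        optimizer.apply_gradients(zip(grads, train_vars))
        return loss

    return train_step
```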

@hmohebbi
Owner

Hi @iliaschalkidis, thank you for the correction. I agree with you. As I recall, freezing the LayerNorm layers' parameters yielded a better GLUE score in my experiments.

@iliaschalkidis
Author

Great, thanks!

I would also recommend making the pooler trainable, too. It's a randomly initialized layer that should be fine-tuned; otherwise, it will remain as it is (randomly initialized). You could also skip this layer and use bert(inputs)[0][:, 0], the final [CLS] token, as a document representation instead of the pooler output, bert(inputs)[1].
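For illustration, a short sketch of the two representations side by side, assuming a HuggingFace TFBertModel (model name and example input are just placeholders):

```python
from transformers import BertTokenizer, TFBertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = TFBertModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("An example document.", return_tensors="tf")
outputs = bert(inputs)

cls_repr = outputs[0][:, 0]   # final hidden state of the [CLS] token
pooled_repr = outputs[1]      # pooler output: dense + tanh applied to [CLS]
```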

I was also wondering whether the TruncatedNormal initialization with stddev=0.02 is aligned with this part of the paper:

"With the skip-connection, if the parameters of the projection layers are initialized to near-zero, the module is initialized to an approximate identity function."

In the official implementation, stddev=1e-3 is used: https://github.com/google-research/adapter-bert/blob/1a31fc6e92b1b89a6530f48eb0f9e1f04cc4b750/modeling.py#L321, which seems a better approximation of "near-zero", since BERT's own weights are already quite close to zero anyway. 😄
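For illustration, a minimal sketch of such a near-zero initialization for the adapter projections; the bottleneck size, hidden size, and activation below are placeholders, not this repo's exact settings:

```python
import tensorflow as tf

adapter_init = tf.keras.initializers.TruncatedNormal(stddev=1e-3)

down_project = tf.keras.layers.Dense(
    64,                     # bottleneck size m
    activation="gelu",
    kernel_initializer=adapter_init,
    bias_initializer="zeros",
)
up_project = tf.keras.layers.Dense(
    768,                    # hidden size d
    kernel_initializer=adapter_init,
    bias_initializer="zeros",
)
```

With the skip-connection around the adapter, these near-zero projections leave the module close to an identity function at the start of training.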

Thanks for your implementation!

@hmohebbi
Owner

Thanks for the detailed comments, @iliaschalkidis.

It is worth mentioning that after calling the .from_pretrained method in line 74, the pooler layer is initialized with its pre-trained weights, which were trained on the NSP task (see huggingface/transformers#300).
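A quick sanity check of this (attribute names assume the HuggingFace TF BERT implementation; from_pretrained would also warn if the pooler weights were missing from the checkpoint):

```python
from transformers import TFBertModel

bert = TFBertModel.from_pretrained("bert-base-uncased")
pooler_kernel = bert.bert.pooler.dense.kernel
print(pooler_kernel.shape)   # (768, 768), loaded from the checkpoint, not re-initialized
```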

hmohebbi added a commit that referenced this issue Feb 25, 2021
near-zero initialization
@iliaschalkidis
Author

Sorry @hosein-m, one last question:

https://github.com/hosein-m/TF-Adapter-BERT/blob/8ddad140dc8c61b5db4db50d47fc258b0e9868cb/modeling_tf_adapter_bert.py#L10

Does this mean that you use the very same AdapterModule for the two sub-layers, back to back?

@hmohebbi
Owner

Sorry @iliaschalkidis for closing the issue!

According to Houlsby’s architecture, there must be two Adapter modules in each Transformer layer: one in the TFBertSelfOutput component and the other in TFBertOutput, and both of them must share weights. So, this line fulfills that purpose and is aligned with this part of the paper:
"In each layer, the total number of parameters added per layer, including biases, is 2md + d + m."

Please let me know if I miss something :)
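For illustration, a small sketch of what tying the two insertion points can look like in Keras: the same layer instance is called in both sub-layers, so its 2md + d + m parameters (e.g. 2·64·768 + 768 + 64 = 99,136 for d=768, m=64) are created once per Transformer layer and shared. The Adapter class below is illustrative, not this repo's exact module:

```python
import tensorflow as tf

class Adapter(tf.keras.layers.Layer):
    def __init__(self, hidden_size=768, bottleneck=64, **kwargs):
        super().__init__(**kwargs)
        init = tf.keras.initializers.TruncatedNormal(stddev=1e-3)
        self.down = tf.keras.layers.Dense(bottleneck, activation="gelu",
                                          kernel_initializer=init)
        self.up = tf.keras.layers.Dense(hidden_size, kernel_initializer=init)

    def call(self, hidden_states):
        # bottleneck with a residual connection: near-identity at initialization
        return hidden_states + self.up(self.down(hidden_states))

shared_adapter = Adapter()
# Inside one Transformer layer, both sub-layers would call the same instance:
#   attention_out = shared_adapter(attention_out)   # in TFBertSelfOutput
#   ffn_out       = shared_adapter(ffn_out)         # in TFBertOutput
```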

hmohebbi reopened this Feb 26, 2021
@iliaschalkidis
Author


Yeah, I saw this line in the paper and suspected this was your motivation. They should have phrased it more clearly, e.g. "The two adapter layers are tied." or something similar. I cannot validate this against the original implementation.

Thanks again!
