
initialization of qkv #68

Closed

XintianHan opened this issue Sep 20, 2023 · 3 comments

@XintianHan
In the paper, the authors mention that the initialization follows DeepNet, but the code below looks different. Why is there a mismatch?

def reset_parameters(self):
    # q/k/v/gate projections are scaled down with gain 2 ** -2.5,
    # while the output projection keeps the default Xavier gain of 1
    nn.init.xavier_uniform_(self.q_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.k_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.v_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.g_proj.weight, gain=2 ** -2.5)
    nn.init.xavier_uniform_(self.out_proj.weight)
    nn.init.constant_(self.out_proj.bias, 0.0)
@shumingma
Contributor
RetNet uses DeepNet's derivation methods to obtain the initialization for better training stability, instead of directly re-using its derived initialization (on Post-LN transformers), because the initialization depends on the model architecture according to the theory in DeepNet.
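To make the effect of that gain concrete, here is a minimal sketch of what `gain` does in PyTorch's `xavier_uniform_`: weights are drawn from U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out)), so passing gain = 2 ** -2.5 shrinks every weight's standard deviation by that same factor relative to the default. This only illustrates the mechanics of the gain parameter, not RetNet's actual derivation of the value; the stdlib re-implementation of the sampling rule is an assumption standing in for the real PyTorch call.

```python
import math
import random

def xavier_uniform(fan_in, fan_out, gain=1.0, n=200_000, seed=0):
    """Sample weights the way nn.init.xavier_uniform_ does:
    U(-a, a) with a = gain * sqrt(6 / (fan_in + fan_out))."""
    a = gain * math.sqrt(6.0 / (fan_in + fan_out))
    rng = random.Random(seed)
    return [rng.uniform(-a, a) for _ in range(n)]

def std(xs):
    m = sum(xs) / len(xs)
    return math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))

# The std of U(-a, a) is a / sqrt(3) = gain * sqrt(2 / (fan_in + fan_out)),
# so the ratio of scaled to default std is just the gain itself.
baseline = std(xavier_uniform(512, 512, gain=1.0))
scaled = std(xavier_uniform(512, 512, gain=2 ** -2.5, seed=1))
print(scaled / baseline)  # ≈ 2 ** -2.5 ≈ 0.177
```

In other words, the q/k/v/g projections start roughly 5.7x smaller than a plain Xavier init, which damps the residual branch early in training; the specific exponent -2.5 comes from DeepNet-style stability analysis applied to RetNet's own architecture, as described above.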

@XintianHan
Author

Thanks for the quick reply!

"because the initialization depends on the model architecture according to the theory in DeepNet"

Could you elaborate on the derivation method a bit more? How do you get the number 2 ** -2.5 here? Thanks!

@radarFudan

radarFudan commented Nov 27, 2023

I am also interested in this initialisation scheme. It seems that recurrent models such as S4 and S5 use different schemes. Do you have any particular explanation or heuristic for this scale?
