
Questions about the implementation of deepnorm #16

Closed

jiaohuix opened this issue Feb 22, 2023 · 2 comments

jiaohuix commented Feb 22, 2023

I have a doubt about deepnorm. In the paper, the deepnorm_init function uses xavier_normal_(x, gain=beta) for "ffn", "v_proj", and "out_proj":

[image: deepnorm_init pseudocode from the paper]

However, the source code of torchscale uses xavier_normal_(x, gain=1) / beta:

```python
for name, p in self.named_parameters():
    if (
        "fc1" in name
        or "fc2" in name
        or "out_proj" in name
        or "v_proj" in name
    ):
        p.data.mul_(init_scale)
```
Although I know that X ~ N(0, std^2) implies aX ~ N(0, (a·std)^2), I plotted the distributions produced by both methods as a histogram, and the results show some differences between the two:

[image: histogram comparing the two initializations]

```python
import torch
import matplotlib.pyplot as plt
from torch.nn.init import xavier_normal_

torch.manual_seed(1)

init_scale = 0.343
linear1 = torch.nn.Linear(4096, 512)  # 1: xavier_normal_(x, gain=beta)
linear2 = torch.nn.Linear(4096, 512)  # 2: xavier_normal_(x, gain=1) / beta
xavier_normal_(linear1.weight, gain=init_scale)
xavier_normal_(linear2.weight, gain=1)

linear1_weight = linear1.weight.detach().numpy().reshape((-1,))
linear2_weight = linear2.weight.detach().numpy().reshape((-1,)) / init_scale

plt.figure(figsize=(10, 6))
plt.hist([linear1_weight, linear2_weight], bins=100, rwidth=0.8, histtype="step")
plt.xlabel("value")
plt.ylabel("count")
# legend labels must be a list (a set has no stable order)
plt.legend(["1: xavier_normal_(x, gain=beta)", "2: xavier_normal_(x, gain=1) / beta"])
plt.show()
```

Is my implementation wrong? Which method should I use? I hope someone can enlighten me, thank you!

@shumingma
Contributor

Hi @MiuGod0126

$\beta$ is a multiplier, so it should be:

```python
linear2_weight = linear2.weight.detach().numpy().reshape((-1,)) * init_scale
```

instead of

```python
linear2_weight = linear2.weight.detach().numpy().reshape((-1,)) / init_scale
```
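As a sanity check (not part of the original thread), the equivalence of the two initializations can also be verified numerically: both should yield weights with standard deviation beta · sqrt(2 / (fan_in + fan_out)). This sketch assumes the same 4096×512 shape as the example above:

```python
import torch
from torch.nn.init import xavier_normal_

torch.manual_seed(0)
beta = 0.343

# Paper-style: initialize directly with gain = beta
w1 = torch.empty(4096, 512)
xavier_normal_(w1, gain=beta)

# torchscale-style: initialize with gain = 1, then multiply by beta
w2 = torch.empty(4096, 512)
xavier_normal_(w2, gain=1)
w2.mul_(beta)

# Both are samples from N(0, (beta * sqrt(2 / (fan_in + fan_out)))^2)
expected_std = beta * (2.0 / (4096 + 512)) ** 0.5
print(w1.std().item(), w2.std().item(), expected_std)
```

With ~2M samples per tensor, the two empirical standard deviations agree with the analytical value to several decimal places, confirming the two methods draw from the same distribution.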

@jiaohuix
Author

@shumingma Ooooh! Sorry, I carelessly read the mul as a division, thank you for the correction! I now understand deepnorm_init more deeply, and the corrected distribution is as follows:

[image: corrected histogram, the two distributions now overlap]
