
freeze_bn seems to be an invalid option #35

Open · wangruohui opened this issue Jun 19, 2022 · 3 comments

@wangruohui commented Jun 19, 2022

Dear author,

I am trying to read and reproduce your code, but I found a possible issue with batch normalization.

In the current code, you define a freeze_bn() function that switches all batch normalization layers to eval mode, like:

self.freeze_bn()

But you neither override the train() method of nn.Module nor call this function again before each training cycle.

This means that when the training loop calls net.train(), these BN layers switch back to training mode, so freeze_bn() actually takes no effect and all training is conducted with BN statistics being updated. Is this right?
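A minimal sketch of the behavior being described, assuming a freeze_bn() roughly like the one in the repository (the model definition here is illustrative, not the actual code):

```python
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()
        # Illustrative stand-in for the real backbone.
        self.layer0 = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.BatchNorm2d(32))
        self.freeze_bn()  # called once at construction time, as in the issue

    def freeze_bn(self):
        # Put every BN layer into eval mode so its running stats stop updating.
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()

    def forward(self, x):
        return self.layer0(x)

net = Net()
print(net.layer0[1].training)  # False: frozen by freeze_bn()
net.train()                    # nn.Module.train() recursively sets training=True on all children
print(net.layer0[1].training)  # True: the BN layer is back in training mode
```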

@poppinace (Owner)

Hi, net.train() does not affect the eval mode of the BN layers if they are set using the freeze_bn() function in the code. You can check the status of the BN layers during training.
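One way to do such a check inside the training loop, sketched here (`net` stands for the model, as in the earlier sketch):

```python
import torch.nn as nn

# Print the mode and running statistics of the first BN layer each iteration.
for name, m in net.named_modules():
    if isinstance(m, nn.BatchNorm2d):
        print(name, "training:", m.training)          # True means the running stats are being updated
        print(name, "running_mean:", m.running_mean)  # should stay constant if BN is truly frozen
        break
```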

@wangruohui (Author) commented Jun 21, 2022

Hello,

Thanks for your quick response. I did some checking, but the results show that BN is in training mode.

I made a fork of your repository and added some code to check the running mean/var of BN during training, as here.

However, with the current implementation, the results show that the running mean is still being updated during training, which means BN is in training mode.

module.layer0.1.running_mean tensor([ 0.0036, -0.0038,  0.0019,  0.0096, -0.0123,  0.0645, -0.0039, -0.0033,
         0.0043,  0.0067,  0.0018,  0.0469, -0.0557, -0.0310,  0.0132, -0.0124,
         0.0022,  0.0046, -0.0369, -0.0028, -0.0050, -0.0080,  0.0019,  0.0060,
         0.0052, -0.0040, -0.0138, -0.0289, -0.0096,  0.0213, -0.0068,  0.0069],
       device='cuda:0')
epoch: 1, train: 1/10775, loss: 0.65694, frame: 1.21Hz/1.21Hz
module.layer0.1.running_mean tensor([ 0.0068, -0.0150,  0.0079,  0.0414, -0.0621,  0.1761, -0.0140, -0.0060,
         0.0165,  0.0259,  0.0073,  0.1927, -0.2286, -0.0602,  0.0484, -0.0455,
         0.0065,  0.0189, -0.1496, -0.0099, -0.0214, -0.0488,  0.0089,  0.0518,
         0.0236, -0.0177, -0.0556, -0.1153, -0.0569,  0.0842, -0.0276,  0.0240],
       device='cuda:0')
epoch: 1, train: 2/10775, loss: 0.46729, frame: 2.69Hz/1.95Hz
module.layer0.1.running_mean tensor([ 0.0101, -0.0301,  0.0155,  0.0893, -0.1588,  0.2140, -0.0270, -0.0085,
         0.0359,  0.0477,  0.0153,  0.3849, -0.4547, -0.0492,  0.0886, -0.0884,
         0.0118,  0.0393, -0.2979, -0.0194, -0.0458, -0.0328,  0.0181,  0.0438,
         0.0500, -0.0376, -0.1109, -0.2188, -0.1717,  0.1630, -0.0560,  0.0464],
       device='cuda:0')
epoch: 1, train: 3/10775, loss: 0.38356, frame: 2.57Hz/2.16Hz
module.layer0.1.running_mean tensor([ 0.0122, -0.0392,  0.0198,  0.1183, -0.2181,  0.2257, -0.0345, -0.0102,
         0.0471,  0.0608,  0.0198,  0.5007, -0.5906, -0.0325,  0.1124, -0.1140,
         0.0146,  0.0517, -0.3863, -0.0254, -0.0606, -0.0288,  0.0238,  0.0453,
         0.0662, -0.0498, -0.1439, -0.2812, -0.2416,  0.2102, -0.0730,  0.0595],
       device='cuda:0')
epoch: 1, train: 4/10775, loss: 0.33935, frame: 2.75Hz/2.31Hz

If I comment out net.train(), these variables remain constant:

alchemy start...
module.layer0.1.running_mean tensor([ 0.0036, -0.0038,  0.0019,  0.0096, -0.0123,  0.0645, -0.0039, -0.0033,
         0.0043,  0.0067,  0.0018,  0.0469, -0.0557, -0.0310,  0.0132, -0.0124,
         0.0022,  0.0046, -0.0369, -0.0028, -0.0050, -0.0080,  0.0019,  0.0060,
         0.0052, -0.0040, -0.0138, -0.0289, -0.0096,  0.0213, -0.0068,  0.0069],
       device='cuda:0')
epoch: 1, train: 1/10775, loss: 0.64172, frame: 1.20Hz/1.20Hz
module.layer0.1.running_mean tensor([ 0.0036, -0.0038,  0.0019,  0.0096, -0.0123,  0.0645, -0.0039, -0.0033,
         0.0043,  0.0067,  0.0018,  0.0469, -0.0557, -0.0310,  0.0132, -0.0124,
         0.0022,  0.0046, -0.0369, -0.0028, -0.0050, -0.0080,  0.0019,  0.0060,
         0.0052, -0.0040, -0.0138, -0.0289, -0.0096,  0.0213, -0.0068,  0.0069],
       device='cuda:0')
epoch: 1, train: 2/10775, loss: 0.45836, frame: 2.70Hz/1.95Hz

Would you please have a look at that? Or have I missed something?
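For reference, the two remedies implied above (overriding nn.Module's train() method, or re-calling freeze_bn() after every net.train()) would look roughly like this; this is a sketch under those assumptions, not the repository's code:

```python
import torch.nn as nn

class NetWithFrozenBN(nn.Module):
    def freeze_bn(self):
        for m in self.modules():
            if isinstance(m, nn.BatchNorm2d):
                m.eval()

    # Option 1: override train() so BN is re-frozen every time net.train() is called.
    def train(self, mode=True):
        super().train(mode)
        if mode:
            self.freeze_bn()
        return self

# Option 2: keep train() as-is and call freeze_bn() right after net.train()
# at the start of every epoch in the training loop.
```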

@poppinace (Owner)

OK, have you fixed the issue?
Do you see improved performance with frozen BN?
