Change BN layer to use moving mean/var if frozen #9965
Conversation
Thanks for the effort. You are misunderstanding the meaning of the "trainable" property of layers. Historically, it has meant "the weights of this layer should not be updated during backprop".

What you want is a BN layer in inference mode. There is an argument to control training/inference mode in BN (and other layers): the `training` argument in `call`. What you want is:

x = BatchNormalization()(y, training=False)

For fine-tuning, you could do something like:

# Set up inference-mode base
K.set_learning_phase(0)
inputs = Input(...)
x = layer1(...)(inputs)
x = layer2(...)(x)
...
x = layerN(...)(x)

# Add training-mode layers
K.set_learning_phase(1)
x = layerNp1(...)(x)
x = layerNp2(...)(x)
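The two modes can be illustrated in plain numpy (a hypothetical sketch; `batch_norm` below is a stand-in helper, not the Keras implementation): in training mode BN normalises with the current mini-batch statistics, while `training=False` selects the stored moving statistics.

```python
import numpy as np

def batch_norm(x, moving_mean, moving_var, gamma, beta, training, eps=1e-3):
    # Training mode: normalise with the statistics of the current batch.
    # Inference mode (training=False): normalise with the moving statistics.
    if training:
        mean, var = x.mean(axis=0), x.var(axis=0)
    else:
        mean, var = moving_mean, moving_var
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Statistics the layer learned on the original dataset:
moving_mean, moving_var = np.array([0.0]), np.array([1.0])
gamma, beta = np.array([1.0]), np.array([0.0])

# A batch from a new dataset with a very different distribution:
x = np.random.RandomState(0).normal(loc=5.0, scale=3.0, size=(256, 1))

train_out = batch_norm(x, moving_mean, moving_var, gamma, beta, training=True)
infer_out = batch_norm(x, moving_mean, moving_var, gamma, beta, training=False)

# Training mode re-standardises the batch to roughly zero mean;
# inference mode preserves the shift of the new data.
print(train_out.mean(), infer_out.mean())
```

The gap between those two means is the scale mismatch the rest of this thread argues about.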
Hi @fchollet, first of all thanks for taking the time to review and respond. I was aware that this is a significant change in the default behaviour and that there would be debate. :)

I understand that your main concern is around the semantic meaning of the trainable property and how it is being used in this PR. I agree that semantically the training parameter you proposed is closer to what I do; nevertheless, this parameter can't change after the network definition. For instance, when you use one of the pre-trained models of Keras, or when you load a persisted model, you have no control over this variable. Would you be open to discussing a solution that would make the training variable changeable after the network definition (or perhaps another property)? If you are open to this, I could update my PR to reflect the agreed behaviour. Concerning your second recommendation of updating the learning_phase as the network is defined, I see two limitations:

I'm not sure if you had a look at the blog post (it is understandably a bit long), but you can see how significant a performance boost you get by making it possible to set the BN in inference mode. Without this, the trainable layers after the BNs adjust their weights based on input that has a different scale (compared to inference). I hope we can re-open this PR; I'm happy to update it until it satisfies the semantic definitions. Cheers!
Again, there is an existing API that does exactly what you want: the `training` argument. Additionally, your proposed PR adds a computational overhead (which might amount to a ~5% slowdown for a BN-heavy model like InceptionV3) to every single convnet that uses BN, fine-tuning or not. This is a heavy price to pay for supporting an incrementally simpler UX (disputable) for a very specific use case.

Typically, if you want to heavily modify an existing model, rather than merely use it in inference mode, you should have access to the code for the model. But even if you don't, you can still do your style of fine-tuning in this case:
@fchollet My main point is that the training argument can't be changed after model definition, so the existing API does not cover this valid case. I don't dispute that there are workarounds, but they are hacky and inelegant, and the default behaviour leads to much confusion for users. The 5% slowdown you mention is interesting; I would love to see the benchmarks, as perhaps it can be resolved. Finally, something you don't address here is whether this discrepancy in the scaling makes sense (theoretically or otherwise) and whether the accuracy decrease is worth it. In any case, let's agree to disagree. I do hope, though, that you will revise your decision in the future, as happened with the update of the mini-batch statistics in the BN.
This is based on something I've observed in the past for InceptionV3 with a static learning phase vs. a dynamic learning phase. The only difference between the two settings is
Thanks for clarifying that you are referring to a different benchmark and not to something you ran on this PR. I can't comment on the results without seeing them, but when I ran comparisons on CIFAR10 the time difference was negligible (current branch: 4216 secs vs patched: 4251 secs); both ran on GPUs on the same server. Note that the snippet I used (and listed in my article) comes from Keras' documentation on how to fine-tune a network. Admittedly the above measurements are single-point estimates, but especially the 5-point accuracy increase I report is consistent with what I've been observing for almost a year while applying workarounds (the first time I reported this is in #7177).

I don't know if speed is currently your main concern for reopening this, but I would say that this is unlikely to affect the majority of the users of Keras. This is because by default the Learning Phase is dynamic and the

In any case, I don't insist that it should be me who changes this or that my current solution is the one we should use. I'm just raising a valid use case that is taken directly from Keras' documentation on how fine-tuning is performed. Currently there is no straightforward way to do what I describe (the current API doesn't cover it); nevertheless, if you provide specific guidelines on what tickboxes the update should check, it would be useful. Or perhaps some other longtime contributor to the BatchNormalization layer has an opinion or can offer a more elegant solution? @ozabluda @taehoonlee @farizrahman4u @Dref360
Sorry for the late reply; I'm still trying to understand the issues. For example, I am trying to understand whether this is related at all to #9214.
What sort of batch sizes were you using in your linked experiments? Some datasets are only viable with very small batch sizes of 1-4, for example image segmentation on a GPU with 8GB of memory.

After briefly skimming this diff, I think the documentation would need to be updated to clearly delineate the different modes and when/why each should typically be chosen. In my case the current frozen behavior improved performance quite a lot over the previous behavior in which mean/var could shift when trainable=False, so I'm a bit hesitant about this, though I'll reiterate that I haven't reviewed what's happening in full detail. Here is a PR with some past discussion on BN: #8616
@ozabluda First of all, thank you for spending time on this. I wish I had provided in my PR the example that you posted on issue #9214; perhaps this would have built a stronger case for this patch. What you showed in your post is exactly what I've been observing on real-world non-opensource datasets for the last year (close to 100% accuracy in training mode and 50% during inference on the same dataset and on similar validation sets). As @fchollet said, there are lots of hacks that can help you avoid it, but none of them should have been necessary.

Based on the code you provided, I'm 100% certain you are being bitten by the behaviour of the BN layer that I'm trying to fix in this PR. In a nutshell, during training mode the frozen BN layers are scaled with different statistics than in inference mode. There is absolutely no theoretical foundation to support this behaviour. As a result, this can have devastating effects when you try to deploy the model or when you try to validate its accuracy. I am certain that the majority of people who face this believe they have overfitted the model, while in reality it is just a side-effect of how Keras implements the Batch Normalization layer. So let's test your example on my branch of Keras where the BN layer is patched.

Below I run your code for ResNet50. As you can see, the problem that you report is fixed once the BN behaviour is changed:

I would love to know if you can reproduce my results and whether you can observe any of the speed degradation that @fchollet suspects.
@ahundt Thanks for your comment! In this very specific experiment I used a fixed batch size of 32. Nevertheless, in this dummy example I try to reproduce a behaviour we've been facing for over a year now on real-world datasets and problems. In those cases a large number of different batch sizes were tested and the results were comparable.

Please note that this PR DOES NOT undo the recent change where the mean/var no longer shifts when trainable=False. I 100% agree with you that that change is very beneficial. This PR actually takes it a step further and makes sure that the moving mean/var are used instead of the mini-batch statistics when trainable=False. This ensures that the non-frozen layers will be trained on data scaled the same way as in inference mode.

BTW, thanks for sending me the discussion on #8616. Give me some time to read all the details and see how this is related.
@ahundt I've read the discussion on #8616. I understand it focuses on the previous change to BN that correctly stopped the update of the moving mean/var when trainable=False. I totally agree with that change. As I said in my previous comment, this PR takes it a step further to ensure that the data after a frozen BN are scaled the same way during training as during inference.

What I find interesting is that during the original discussion on #8616, @fchollet raised similar concerns about the semantic meaning of trainable as in this PR. Nevertheless, in that discussion he proposed introducing another property to extend the API. I also see he tried to implement a property called "updatable", which was reverted due to the increased complexity (and in the end we settled on extending the semantics of trainable). I wonder if in this case it makes sense to extend the API to cover this valid case, OR update the semantics of trainable (my preferred solution), OR update the documentation/examples. I would love to have an opinion from @lukedeo on this, since he reviewed the code on the other PR.
@datumbox Ok, I think I see what you are saying; I might try this out on my dataset. Do I need to change any settings like trainable in my training code, or can I just pull this in? In my example I use frozen VGG16 ImageNet pretrained weights as a feature extractor, with additional trainable layers afterwards.

One thing that might help with getting this through is a few improvements to the PR, variable names, and description. If you better separate the concepts and clarify the conditions under which different data is fixed vs. changing, the reasons this improves performance may be more obvious.
Ok, so based on the test_batchnorm_trainable() changes, this should be active by default in all cases except when both

Correct?
@ahundt Thanks for looking into this. My PR affects only networks that use Batch Normalization layers, so VGG will not be affected. No additional configuration is required other than setting trainable=False on the BN layers. Pulling this in should work fine.

Sure thing, send me your comments and I'll make the changes. :-)
Oh yeah, sorry: my first run definitely wasn't configured correctly, since VGG makes no sense for this case and the BN layers I had were trained from scratch. I did have other models, including ResNet and DenseNet, that use pretrained weights and didn't perform as well as VGG, and the fix in this PR might be why. I will try them out, but can you confirm the following steps will make use of your changes?

Should I expect the above sequence to change the performance when running with this PR applied? edit: fixed typo mentioned in the next post
ali-masoudi commented Jul 28, 2019
@ozabluda @taehoonlee @farizrahman4u @Dref360 @fchollet |
aethersis commented Aug 5, 2019
@ozabluda @taehoonlee @farizrahman4u @Dref360 @fchollet Are there any plans to merge this patch? It would make my work on semantic segmentation so much easier!
Toukenize commented Aug 8, 2019
Hi, may I know the correct way to fine-tune pre-trained models with BN layers on a new dataset? From what I have read so far, it seems the Inception V3 fine-tuning example in the Keras documentation is not correct, due to the way BN layers behave when frozen.
geometrikal commented Aug 8, 2019
@Toukenize I saw there is a note in https://github.com/tensorflow/tensorflow/blob/master/tensorflow/python/keras/layers/normalization_v2.py that suggests that in tensorflow 2.0, setting

In my case, I just want to do some simple transfer learning (output of ResNet50 plus a few trainable dense layers). My workaround was to extract the output vectors of the base model first and train on those. E.g.

Note that if you want to freeze the resulting graph for production etc., there is a bug in tf.keras (1.13.1, 1.14.0 and 1.15.0 nightly) where the conversion to a frozen graph either doesn't work (1.14.0) or works but creates a frozen graph that gives errors on import (1.13.1 and 1.15.0.dev20190807 nightly). See my workaround here: tensorflow/tensorflow#31331 (comment)
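The two-step workaround described above (run the frozen base once in inference mode, cache the output vectors, then train only a small head on them) can be sketched end to end in numpy. Everything here is a stand-in assumption: `frozen_extract` plays the role of `base_model.predict`, the head is a tiny logistic regression, and the synthetic labels are constructed to be learnable from the cached features.

```python
import numpy as np

rng = np.random.RandomState(42)

def frozen_extract(images):
    # Stand-in for base_model.predict(images): a fixed, never-updated
    # transformation, so its outputs can be computed once and cached.
    w = np.random.RandomState(0).randn(images.shape[1], 8)
    return np.maximum(images @ w, 0.0)

# Step 1: run the frozen extractor over the whole training set ONCE.
images = rng.randn(200, 16)
feats = frozen_extract(images)
true_w = rng.randn(8)
labels = (feats @ true_w > 0).astype(float)  # synthetic, linearly decodable

# Step 2: train only the small head on the cached vectors.
w, b = np.zeros(feats.shape[1]), 0.0
for _ in range(500):
    p = 1.0 / (1.0 + np.exp(-(feats @ w + b)))  # sigmoid head
    grad = p - labels                            # logistic-loss gradient
    w -= 0.1 * feats.T @ grad / len(labels)
    b -= 0.1 * grad.mean()

acc = (((feats @ w + b) > 0) == (labels > 0.5)).mean()
print("train accuracy:", acc)
```

The price of this pattern is data augmentation: augmented images would have to be pushed through the extractor again instead of reusing the cache.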
Toukenize commented Aug 8, 2019
@geometrikal Thanks for your prompt response. Does that mean I have to extract the train and test vectors every epoch if I'm doing data augmentation? My use case is slightly more complicated, meant for a sequence of images (video classification); the actual model is as such:

Do you have any suggestions for other workarounds?
geometrikal commented Aug 8, 2019
@Toukenize I was just talking about this with a colleague, who said that they use the normal keras (not tensorflow.keras) ResNet50 for transfer learning and have no problem. I have not tried that myself, however. Edit: Apologies, I thought I was commenting on the tensorflow.keras GitHub; it seems this issue exists in normal keras too, and the problems remain (#9214).

As for suggestions, maybe you could pre-compute a bunch of augmented vectors as well? Or it might work to create a generator (to train with

I just looked at a friend's code and they just leave the base model trainable. Not sure what impact that makes.
sandeepjana commented Aug 19, 2019
Does the issue affect models that are saved and loaded using
ybsave commented Aug 22, 2019
I cannot believe that this patch hasn't been merged after 1.5 years! The Keras people seem too arrogant to accept others' careful solutions: they ignore a 50% performance drop but cite a 5% extra cost without benchmark evidence. Really angry. This stupid Keras issue cost me two weeks. Does anyone know how I should apply @datumbox's patch to TensorFlow's Keras? Thank you.
TheGuywithTheHat commented Aug 22, 2019
@ybsave, datumbox has previously said to use the following command to install his fork:

pip install -U --force-reinstall --no-dependencies git+https://github.com/datumbox/keras@fork/keras2.2.4
ybsave commented Aug 22, 2019
@TheGuywithTheHat Thank you for your help. After installing his fork, I tested the code from @datumbox's blog, but the results are still not fixed... I tested TensorFlow's batch normalization: as long as is_training is set to False during testing, the performance is the same as during training. Keras sucks; I will switch to pure TensorFlow instead. Thanks to @datumbox for the excellent solution and to @TheGuywithTheHat for the help. Shame on the Keras people for damaging TensorFlow. I feel sad that TensorFlow uses Keras as its default mode.
@ybsave Check whether you define the learning phase in your implementation. You should not define it, or the behaviour of the patch won't be what you expect.

Also, I understand the frustration and your point of view, but if you think about it, Keras is a good solution, and at the end of the day it is an open-source project that anyone can fork and adjust to their needs. That's what we do here. It's up to the owners to decide whether PRs meet their criteria and should be merged. Maybe one day someone will provide a more elegant solution and it will become part of Keras. :)
ybsave commented Aug 22, 2019
@datumbox Thank you for your kind comments. My previous code used tensorflow.keras, not the patched keras. After fixing that, I got the same results as shown in your blog.

Thank you for your kind and warm words. I have just lost confidence in the Keras people. "After training, testing on the same training data should produce the same results as in the last training round": this is common sense to me, but I think the Keras people lack it. You give very detailed explanations, experiments, fixes, and discussions, but they do not listen. I cannot understand their ignoring the huge performance drop (>30% in my own work) while attacking your solution over a tiny extra cost (<5%, with no experimental evidence). If you produce a new Keras version, I will definitely use it; but without your patch, I will never use Keras again. Thank you so much for saving my world!
ybsave commented Aug 22, 2019
@datumbox Do you have a plan to also produce a fix for tensorflow.keras? Thank you.
geometrikal commented Aug 22, 2019
@ybsave Try tensorflow 2.0; I think the behaviour has been changed there.
ybsave commented Aug 24, 2019
@geometrikal Thanks for the suggestion. Have you tried @datumbox's example code from his blog, changing keras to tf.keras? In my test, the problem still exists in Tensorflow 2.0. Does the problem disappear in your test?
geometrikal commented Aug 25, 2019
@ybsave I haven't done any more tests, so I'm not sure of the status of this. I just pre-calculate the vectors as per my workaround.
ofgurcan commented Aug 27, 2019
@ybsave Can you share the Keras code please? I haven't seen complete code and still don't know where I should apply this: "pip install -U --force-reinstall --no-dependencies git+https://github.com/datumbox/keras@fork/keras2.2.4". Thank you.
ybsave commented Aug 27, 2019
@ofgurcan You can just run the pip installation command above in the terminal. Just check your Keras version: if it is not 2.2.4, datumbox also has patches for other versions; choose the correct one. You can see the revisions in the "Files Changed" tab.
ybsave commented Aug 27, 2019
I tested the newest Tensorflow 2.0.0-rc0, and @datumbox's test code now gives correct results. It seems we can safely use tf.keras from the newest Tensorflow instead of waiting for the arrogant Keras people to fix this.
lovejing0306 commented Sep 19, 2019
I used Tensorflow 2.0.0-rc0; when fine-tuning ResNet I have the same problem.
off99555 commented Oct 6, 2019
In tensorflow2 on GPU this problem still occurs; I have to use @faustomorales's code in order to fix the issue. I checked and followed this tensorflow guide on transfer learning, and it didn't once mention the issue with batch norm: https://www.tensorflow.org/tutorials/images/transfer_learning It also doesn't exhibit the problem I have with my coordinate regression task. It seems classifying dogs and cats is easy enough that the batch norm issue isn't a big deal. It's already 2019, and almost 2020. Does an elegant solution exist yet?

Update: It seems the increase in prediction time is caused by a weird issue that happens after you call
captainst commented Oct 8, 2019
After some trying out, I think that a feasible yet simple solution is this (pseudo code, taking InceptionV3 as an example):
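The snippet itself did not survive the page extraction. Based on the surrounding discussion (and gjy1992's follow-up about learning_phase=0), the described approach presumably looked something like the following pseudo code; all names here (baseModel, finalModel, the head layers) are assumptions for illustration:

```
# Build the pretrained base in inference mode, so its BN layers
# always normalise with the moving mean/variance:
K.set_learning_phase(0)
baseModel = InceptionV3(weights='imagenet', include_top=False)
for layer in baseModel.layers:
    layer.trainable = False

# Switch to training mode before adding the new, trainable head:
K.set_learning_phase(1)
x = GlobalAveragePooling2D()(baseModel.output)
x = Dense(256, activation='relu')(x)
outputs = Dense(num_classes, activation='softmax')(x)
finalModel = Model(inputs=baseModel.input, outputs=outputs)
```

Note the limitation raised in the reply: the learning-phase split is baked in at graph-construction time and is not preserved when the model is saved and reloaded.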
gjy1992 commented Oct 21, 2019
@captainst Hello, this approach can work. But after I save the finalModel and later want to continue training by loading the saved model, I cannot set part of the finalModel to learning_phase=0. (╥╯^╰╥)
ec1841 commented Nov 2, 2019
@captainst Your approach doesn't work when you want to fine-tune the top-k layers of your base model, which may contain BN layers. Yet another workaround :), using @faustomorales's suggestion as the base to solve the top-k layer fine-tuning.

Will this work?
rpeloff commented Nov 3, 2019
For those using TensorFlow 2.0 and trying to fine-tune resnet, inceptionv3, etc., the problem seems to persist due to the injection of tensorflow.python.keras.layers. This references the TF 1.x-behaviour batch normalisation in keras_applications when calling

Similar to what @faustomorales suggested, I found that simply injecting
Tauranis commented Nov 25, 2019
@datumbox, For all the others: what worked in my case was to do transfer learning as a 2-step process. Extract the embeddings first (into tfrecord shards or not, it is up to you), then train the classifier on them. @rpeloff, your workaround worked for me on TF 1.15.0 but not on TF 2.0.0; thanks anyway.
sedghi commented Dec 3, 2019
I can't believe this is not fixed yet.
sameervk commented Jan 17, 2020
@Tauranis Would you mind elaborating on your 2-step process, please? Thanks.
RomainSabathe commented Jan 25, 2020
Thank you so much!!! It looks like that solved it for me. It is certainly strange behaviour, though: one would think that they're using TF 2.x components when using the official TF 2.x release. Anyway, thanks for your reply and the explanation.
datumbox commented Apr 17, 2018 (edited)
During fine-tuning, if a Batch Normalization layer is frozen it uses the mini-batch statistics. I believe this is incorrect and can lead to reduced accuracy, especially when we use transfer learning. A better approach in this case would be to use the values of the moving mean and variance.
Changes on this PR:
In this PR I update the Batch Normalization layer to use the learned statistics if frozen during training. This is achieved by making the trainable flag part of the computational graph and by making the behavior of the BN depend not only on the learning_phase but also on the value of the trainable property.
Brief explanation:
Assume we use one of the pre-trained CNNs of Keras and we want to fine-tune it. Unfortunately, we get no guarantees that the mean and variance of our new dataset inside the BN layers will be similar to the ones of the original dataset. As a result, if we fine-tune the top layers, their weights will be adjusted to the mean/variance of the new dataset. Nevertheless, during inference the top layers will receive data which are scaled using the mean/variance of the original dataset. This discrepancy can lead to reduced accuracy.
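The discrepancy described above can be made concrete with a small numpy simulation (a hedged sketch of the effect, not the Keras code). A frozen BN whose moving statistics come from the original dataset hands the top layers differently scaled data depending on the phase; the proposed change uses the moving statistics in both phases, so the scales match:

```python
import numpy as np

eps = 1e-3
# Moving statistics learned on the ORIGINAL dataset:
moving_mean, moving_var = 0.0, 1.0

# Activations of the NEW dataset arriving at the frozen BN layer:
x = np.random.RandomState(1).normal(loc=4.0, scale=2.0, size=10_000)

# Current behaviour: in training mode the frozen BN uses mini-batch stats,
train_seen = (x - x.mean()) / np.sqrt(x.var() + eps)
# but at inference it switches to the moving stats of the original dataset:
infer_seen = (x - moving_mean) / np.sqrt(moving_var + eps)
# Proposed behaviour: use the moving statistics in both phases.
patched_train_seen = (x - moving_mean) / np.sqrt(moving_var + eps)

# The top layers were fitted on data centred near 0 but are evaluated
# on data centred near 4; with the patch the two phases agree exactly.
print(train_seen.mean(), infer_seen.mean())
print(np.allclose(patched_train_seen, infer_seen))
```

The shift between the two printed means is exactly the train/inference mismatch that the fine-tuned top layers would have to absorb.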
I understand that this is a significant change that requires thorough review. To facilitate the review, I've documented why making such a change is important and provided detailed comparisons before and after applying the patch on my blog.
EDIT: Since the fix was not merged on master, I maintain unofficial patches available for Keras 2.1.6, Keras 2.2.2 and Keras 2.2.4.