Bad weight init dependent on a processor import #30374
Ok, this is a wild ride. I think the initialisation of the CLIP model is independent of the processor: I can reproduce NaNs and big/small numbers when loading with the processor loaded before, after, both and without. NaNs might be the bigger issue, whereas big/small numbers can be mitigated with gradient clipping and the like. I followed your Colab code with a bit more adjusting to test multiple inits:

```python
import torch
from transformers import AutoConfig, CLIPForImageClassification, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"
config = AutoConfig.from_pretrained(model_name, cache_dir='/datadisk1/av11/downloads/huggingface')
config.problem_type = "single_label_classification"
config.label2id = {
    'apple': '0',
    'banana': '1',
}
config.id2label = {
    '0': 'apple',
    '1': 'banana',
}
config.num_labels = 2

# loading flags to test processor relation
init_processor_before, init_processor_after = False, False

def init_model_and_get_cl_weights(init_processor_before=False, init_processor_after=False):
    if init_processor_before: CLIPProcessor.from_pretrained(model_name)
    model = CLIPForImageClassification.from_pretrained(
        model_name, config=config
    )
    if init_processor_after: CLIPProcessor.from_pretrained(model_name)
    return model.classifier.weight

prev_tensor, current_tensor = init_model_and_get_cl_weights(init_processor_before, init_processor_after), None
print()
for i in range(100):
    print(f'Current classifier weights:\n\t{prev_tensor}')
    print(f'NaNs in tensor: {torch.isnan(prev_tensor).any()}')
    if torch.isnan(prev_tensor).any():
        print('here')
    current_tensor = init_model_and_get_cl_weights(init_processor_before, init_processor_after)
    allclose = torch.allclose(prev_tensor, current_tensor)
    prev_tensor = current_tensor.clone().detach()
    print(f'Initial weights are the same as previous init: {allclose}\n')
    torch.cuda.empty_cache()
```

So what happens under the hood: …
Funnily enough, when testing … P.S. I tried passing …
I'll try to reproduce without the … The problem is unlikely linked to PyTorch. Now that you pointed out the torch.empty and Kaiming init, I might have an idea. While exploring transformers/src/transformers/modeling_utils.py (line 1693 at 8c12690) …

And looking into CLIP, …
I have a PR to correct …
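As background on the torch.empty point above: torch.empty only allocates memory without setting the values, so whatever bytes the allocator hands back show up as "weights". A minimal standalone sketch (pure PyTorch, nothing transformers-specific assumed):

```python
import torch

# torch.empty allocates storage but does not initialise it; the contents are whatever
# happened to be in that memory, which can look like zeros, huge values, or even NaN/inf.
x = torch.empty(2, 512)
print(x.abs().max().item(), torch.isfinite(x).all().item())

# By contrast, an explicit init always gives well-behaved values.
y = torch.empty(2, 512)
torch.nn.init.normal_(y, mean=0.0, std=0.02)
print(y.abs().max().item(), torch.isfinite(y).all().item())
```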
A possible easy way to fix the init imo is to integrate a different init for the linear weights in the hook (which is done for a lot of other models), see transformers/src/transformers/models/clip/modeling_clip.py, lines 457 to 459 at 8c12690.
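For context (the inline permalink above was rendered without its snippet), the referenced region is the tail of `CLIPPreTrainedModel._init_weights`; roughly, from memory rather than verified against that exact commit, it looks like:

```python
# Approximate reconstruction of the referenced hook region (not copied from the commit):
if isinstance(module, nn.LayerNorm):
    module.bias.data.zero_()
    module.weight.data.fill_(1.0)
if isinstance(module, nn.Linear) and module.bias is not None:
    module.bias.data.zero_()
```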
We would change it to something like:

```python
if isinstance(module, nn.Linear):
    module.weight.data.normal_(mean=0.0, std=self.config.initializer_factor)
    if module.bias is not None:
        module.bias.data.zero_()
```

Doesn't explain to me how the initial init behaves like this, but at least we would have somewhat more consistent classifier heads.
I see, but what about a line like this (that has less impact):

```python
elif isinstance(module, CLIPForImageClassification):
    nn.init.normal_(
        module.classifier.weight,
        std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
    )
```
LGTM, except for one small thing: … Not sure what their standards are regarding this, I just followed what …
Just noticed that I might have misunderstood you the first time when you explained the idea, whoops. But maybe you should open a separate PR for this to keep it "separate issue = separate PR". Doesn't look related to me.
Hi @Xmaster6y, thanks for raising this issue and @vasqu for digging into this! Yes, it's in …
Both init suggestions work. As @vasqu mentions, we'd need to use … Happy to review a PR with this change! Any update that's made in CLIP should also be reflected in SigLIP.
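For the SigLIP side, the analogous hook change would presumably mirror the CLIP fragment above; the sketch below is an assumption about placement and naming, not verified against modeling_siglip.py:

```python
# Hypothetical analogue inside SigLIP's _init_weights (class/attribute names are assumptions):
elif isinstance(module, SiglipForImageClassification):
    nn.init.normal_(
        module.classifier.weight,
        std=module.classifier.in_features**-0.5,
    )
```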
There is still one problem that remains: what does the processor have to do with this bug? Is there some cache corruption or a config switch under the hood? I wrote 4 tests to check:

```python
def test_weight_init(self):
    config, _ = self.model_tester.prepare_config_and_inputs()
    config = CLIPConfig.from_pretrained("openai/clip-vit-base-patch32")
    model = CLIPForImageClassification(config=config)
    assert (model.classifier.weight <= 1e3).all()
    assert (model.classifier.weight != 0.).any()

def test_weight_init_from_pretrained_1(self):
    model_name = "openai/clip-vit-base-patch32"
    config = CLIPConfig.from_pretrained(model_name)
    CLIPProcessor.from_pretrained(model_name)
    model = CLIPForImageClassification.from_pretrained(model_name, config=config)
    assert (model.classifier.weight <= 1e3).all()
    assert (model.classifier.weight != 0.).any()

def test_weight_init_from_pretrained_2(self):
    model_name = "openai/clip-vit-base-patch32"
    config = CLIPConfig.from_pretrained(model_name)
    model = CLIPForImageClassification.from_pretrained(model_name, config=config)
    CLIPProcessor.from_pretrained(model_name)
    assert (model.classifier.weight <= 1e3).all()
    assert (model.classifier.weight != 0.).any()

def test_weight_init_from_pretrained_3(self):
    model_name = "openai/clip-vit-base-patch32"
    config = CLIPConfig.from_pretrained(model_name)
    model = CLIPForImageClassification.from_pretrained(model_name, config=config)
    assert (model.classifier.weight <= 1e3).all()
    assert (model.classifier.weight != 0.).any()
```

My findings: …
My conclusion: the processor impacts whether you get a torch.empty or a torch.zeros 🤯. @vasqu I couldn't reproduce the error without involving the processor (your code works in Colab), but I wrote the following test that fails (the classifier weights are all 0) when run individually:

```python
def test_weight_init_from_pretrained_custom(self):
    model_name = "openai/clip-vit-base-patch32"
    config = CLIPConfig.from_pretrained(model_name)
    config.problem_type = "single_label_classification"
    config.label2id = {
        'apple': '0',
        'banana': '1',
    }
    config.id2label = {
        '0': 'apple',
        '1': 'banana',
    }
    config.num_labels = 2
    model = CLIPForImageClassification.from_pretrained(model_name, config=config)
    assert (model.classifier.weight <= 1e1).all()
    assert (model.classifier.weight != 0.).any()
```
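To make the zeros-vs-garbage distinction concrete, here is a small standalone check (a sketch using the same public checkpoint; not part of the test suite) that reports which state the classifier head ends up in:

```python
import torch
from transformers import CLIPConfig, CLIPForImageClassification

model_name = "openai/clip-vit-base-patch32"
config = CLIPConfig.from_pretrained(model_name)
config.num_labels = 2
model = CLIPForImageClassification.from_pretrained(model_name, config=config)

w = model.classifier.weight.detach()
print("all zeros: ", torch.count_nonzero(w).item() == 0)   # zero-initialised head
print("non-finite:", (~torch.isfinite(w)).any().item())    # NaN/inf from uninitialised memory
print("abs max:   ", w.abs().max().item())                 # huge values also indicate garbage
```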
First, I assume you tried those tests without any of the aforementioned fixes. But yea, I can't tell you what the reason is tbh. I still wouldn't attribute it to the processor entirely tho. Running the whole test class already gives me mixed results: oftentimes all of the `_x` variants fail, but there are also cases where only one or a couple fail. Tl;dr: you might be right, but I still get the feeling it's more complicated and we should be happy with the post hook fixing this. Do you want to open a PR for this or should I?
I see. I can take care of that this evening (in 8 hours) and tag you for review, or, conversely, you can open it if you get to it sooner.
Hi @Xmaster6y, thanks for sharing these examples! The processor doesn't have anything to do with the weight initialization. The differences in the example tests are happening because of the two different ways the model is being created. Adapting your examples:

```python
from transformers import CLIPConfig, CLIPForImageClassification, CLIPProcessor, CLIPModel

CHECKPOINT = "openai/clip-vit-base-patch32"

def _test_weights(model):
    assert (model.classifier.weight <= 1e3).all()
    assert (model.classifier.weight != 0.).any()

# passes
def test_weight_init_from_config():
    config = CLIPConfig.from_pretrained(CHECKPOINT)
    model = CLIPForImageClassification(config=config)
    _test_weights(model)

# fails
def test_weight_init_pretrained_and_config():
    config = CLIPConfig.from_pretrained(CHECKPOINT)
    model = CLIPForImageClassification.from_pretrained(CHECKPOINT, config=config)
    _test_weights(model)
```

What's causing the discrepancy here is two things:
I actually don't know why we have this discrepancy, cc'ing in @younesbelkada who knows more about the weight initialization of our models. It would be good to resolve this, as it could be causing issues for other models. However, the fix proposed above will resolve this:

```python
elif isinstance(module, CLIPForImageClassification):
    nn.init.normal_(
        module.classifier.weight,
        std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
    )
```
@vasqu Yep! Sorry :)
I get it, but sometimes the weights are all 0, and sometimes the weights are from the …
Are we sure it's not always … This would also somewhat explain the variance in initialisation. It's complicated, but I wouldn't be surprised if it was something entirely different.
@vasqu If you inspect the weights in the two cases:

Init from config: …

Init using from_pretrained: torch.empty is used.

I managed to track this down to the different ways the models are created. When creating from the config, all of the weights are initialized using torch's defaults, then re-initialized based on settings in `_init_weights`. However, when loading from a checkpoint, the from_pretrained method is used. Within this method, the layers are created but weight initialization is deliberately disabled. This enables faster weight loading: there's no point in initializing if most of the weights are just going to be replaced by the checkpoint weights.

Now, this is actually a problem, as highlighted by this issue, as we can silently have empty tensors which are never properly initialized if not specified in `_init_weights`. I'm going to check with the team to see how best to approach this. Technically we have tests for initialization, but as these test creation from config, this wasn't caught.
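A minimal sketch of the two creation paths being compared here (assuming the same public checkpoint and num_labels=2; purely illustrative, not the proposed fix):

```python
import torch
from transformers import CLIPConfig, CLIPForImageClassification

checkpoint = "openai/clip-vit-base-patch32"
config = CLIPConfig.from_pretrained(checkpoint)
config.num_labels = 2

# __init__ from config: every module goes through the init hook, so the head is well-behaved.
from_config = CLIPForImageClassification(config)
# from_pretrained: init is skipped for speed and checkpoint weights are loaded; the classifier
# head is not in the checkpoint, so it can remain uninitialized.
from_ckpt = CLIPForImageClassification.from_pretrained(checkpoint, config=config)

for name, model in [("from config", from_config), ("from_pretrained", from_ckpt)]:
    w = model.classifier.weight.detach()
    print(f"{name}: std={w.std().item():.4f} abs-max={w.abs().max().item():.4f} "
          f"all-finite={torch.isfinite(w).all().item()}")
```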
That totally makes sense, to just allocate memory and let the checkpoint do the rest.
Yea, that explains the discrepancy. Thanks for looking into this!
System Info
```
2024-04-21 16:25:02.913641: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-21 16:25:02.913701: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-21 16:25:02.915420: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-21 16:25:04.770200: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/transformers/commands/env.py:100: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.config.list_physical_devices('GPU') instead.
2024-04-21 16:25:08.126490: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

- transformers version: 4.38.2
```

Who can help?
@amyeroberts
Information

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)

Reproduction
Colab demo: https://colab.research.google.com/drive/1eQMGHFvw7GJpYxtyInJjG60t0sIXXwiJ?usp=sharing
The task: a custom classification problem using CLIPForImageClassification, partially loading from pretrained.

The problem (weights initialised with huge numbers and NaNs (invert somewhere?)) arises when loading the model as: …

While loading the processor after the model doesn't initialise the weights weirdly (all 0, which is still weird but workable).
Expected behavior
Good weight init, and definitely not dependent on when the processor is loaded.