
Bad weight init dependent on a processor import #30374

Closed
2 of 4 tasks
Xmaster6y opened this issue Apr 21, 2024 · 17 comments · Fixed by #30435

Comments

@Xmaster6y

Xmaster6y commented Apr 21, 2024

System Info

2024-04-21 16:25:02.913641: E external/local_xla/xla/stream_executor/cuda/cuda_dnn.cc:9261] Unable to register cuDNN factory: Attempting to register factory for plugin cuDNN when one has already been registered
2024-04-21 16:25:02.913701: E external/local_xla/xla/stream_executor/cuda/cuda_fft.cc:607] Unable to register cuFFT factory: Attempting to register factory for plugin cuFFT when one has already been registered
2024-04-21 16:25:02.915420: E external/local_xla/xla/stream_executor/cuda/cuda_blas.cc:1515] Unable to register cuBLAS factory: Attempting to register factory for plugin cuBLAS when one has already been registered
2024-04-21 16:25:04.770200: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Could not find TensorRT
WARNING:tensorflow:From /usr/local/lib/python3.10/dist-packages/transformers/commands/env.py:100: is_gpu_available (from tensorflow.python.framework.test_util) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.config.list_physical_devices('GPU') instead.
2024-04-21 16:25:08.126490: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:47] Overriding orig_value setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.

Copy-and-paste the text below in your GitHub issue and FILL OUT the two last points.

  • transformers version: 4.38.2
  • Platform: Linux-6.1.58+-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 0.20.3
  • Safetensors version: 0.4.3
  • Accelerate version: not installed
  • Accelerate config: not found
  • PyTorch version (GPU?): 2.2.1+cu121 (True)
  • Tensorflow version (GPU?): 2.15.0 (True)
  • Flax version (CPU?/GPU?/TPU?): 0.8.2 (gpu)
  • Jax version: 0.4.26
  • JaxLib version: 0.4.26
  • Using GPU in script?:
  • Using distributed or parallel set-up in script?:

Who can help?

@amyeroberts

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Colab demo: https://colab.research.google.com/drive/1eQMGHFvw7GJpYxtyInJjG60t0sIXXwiJ?usp=sharing

The task

A custom classification problem using CLIPForImageClassification partially loading from pretrained.

import torch
from transformers import AutoConfig, CLIPForImageClassification, CLIPProcessor


model_name = "openai/clip-vit-base-patch32"

config = AutoConfig.from_pretrained(model_name)

The problem (weights initialised with huge numbers and NaNs; something inverted somewhere?) arises when loading the model as:

processor = CLIPProcessor.from_pretrained(model_name)
model = CLIPForImageClassification.from_pretrained(
    model_name, config=config
)
print(model.classifier.weight)

Loading the processor after the model, on the other hand, doesn't initialise the weights as strangely (they are all 0, which is still odd but workable).

Expected behavior

Proper weight init, and definitely not dependent on when the processor is loaded.

@vasqu
Contributor

vasqu commented Apr 21, 2024

Ok, this is a wild ride. I think the initialisation of the processor is independent of the CLIP model. I can reproduce NaNs and big/small numbers when loading the processor before, after, both, or not at all. NaNs might be the bigger issue, whereas big/small numbers can be mitigated with gradient clipping and the like (see the sketch right below).
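
A rough sketch of the kind of mitigation I mean, purely illustrative (the bare linear layer just stands in for the classifier head):

import torch
from torch import nn

# Stand-in for the classifier head; in practice you'd clip over model.parameters().
layer = nn.Linear(768, 2)
loss = layer(torch.randn(4, 768)).sum()
loss.backward()

# Clip gradient norms so extreme weight/gradient values don't blow up the updates.
torch.nn.utils.clip_grad_norm_(layer.parameters(), max_norm=1.0)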

I followed your Colab code with a bit more adjusting to test multiple inits:

import torch
from transformers import AutoConfig, CLIPForImageClassification, CLIPProcessor


model_name = "openai/clip-vit-base-patch32"
config = AutoConfig.from_pretrained(model_name, cache_dir='/datadisk1/av11/downloads/huggingface')
config.problem_type = "single_label_classification"
config.label2id = {
    'apple': '0',
    'banana': '1',
}
config.id2label = {
    '0': 'apple',
    '1': 'banana',
}
config.num_labels = 2


# loading flags to test processor relation
init_processor_before, init_processor_after = False, False

def init_model_and_get_cl_weights(init_processor_before=False, init_processor_after=False):
    if init_processor_before: CLIPProcessor.from_pretrained(model_name)
    model = CLIPForImageClassification.from_pretrained(
        model_name, config=config
    )
    if init_processor_after: CLIPProcessor.from_pretrained(model_name)
    return model.classifier.weight

prev_tensor, current_tensor = init_model_and_get_cl_weights(init_processor_before, init_processor_after), None
print()
for i in range(100):
    print(f'Current classifier weights:\n\t{prev_tensor}')
    print(f'NaNs in tensor: {torch.isnan(prev_tensor).any()}')
    if torch.isnan(prev_tensor).any():
        print('here')

    current_tensor = init_model_and_get_cl_weights(init_processor_before, init_processor_after)
    allclose = torch.allclose(prev_tensor, current_tensor)
    prev_tensor = current_tensor.clone().detach()

    print(f'Initial weights are the same as previous init: {allclose}\n')

    torch.cuda.empty_cache()

So what happens under the hood:

  • CLIPForImageClassification initialises its classifier head according to the given values in the config. The head is a simple linear layer.
  • (post_init has no influence, so I'll leave it out. Might be wrong here though, not sure.)
  • So torch is responsible for the randomly initialised weights, which can be seen explicitly here.
  • Torch initially uses torch.empty() for the first values, which can contain anything; see this forum post.
  • Torch then resets the weight parameters with Kaiming uniform init (roughly sketched below).
  • Kaiming should produce valid numbers.
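
For context, this is roughly what torch's default linear init does (paraphrasing nn.Linear.reset_parameters; shapes chosen to match a 2-label classifier head):

import math
import torch
from torch import nn

# Allocate uninitialised memory first ...
weight = torch.empty(2, 768)
bias = torch.empty(2)

# ... then overwrite it: kaiming-uniform for the weight, fan-in-based uniform for the bias.
nn.init.kaiming_uniform_(weight, a=math.sqrt(5))
fan_in = weight.size(1)
bound = 1 / math.sqrt(fan_in)
nn.init.uniform_(bias, -bound, bound)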

Funnily enough, when testing torch.nn.Linear in isolation I cannot reproduce this (see the standalone check sketched below), so I'm wondering why these values show up when the layer is used as the classifier head. It does occur right after the initialisation of the linear classifier. Not sure what the root of this issue is, but it does seem like a pretty weird bug that might have to do with torch, or with some hooks and wrappers I'm missing.
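
A minimal version of the kind of standalone check I mean (illustrative only, same shape as a 2-label classifier head):

import torch
from torch import nn

# Construct a bare linear layer of the classifier head's shape over and over and
# inspect its freshly initialised weights; in isolation this never shows NaNs or
# huge values for me.
for _ in range(100):
    layer = nn.Linear(768, 2)
    weights = layer.weight.detach()
    assert not torch.isnan(weights).any()
    assert (weights.abs() < 1.0).all()  # the kaiming-uniform bound here is ~0.036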

P.S. I tried passing self.device and self.dtype when initialising the classifier head. Fp16 also doesn't help.

@Xmaster6y
Author

I'll try to reproduce without the processor then. I also thought it was super weird ^^.

The problem is unlikely to be linked to PyTorch. Now that you've pointed out the torch.empty and Kaiming init, I might have an idea. While exploring post_init I noticed this line:

def _init_weights(self, module):

And looking into CLIP, CLIPForImageClassification seems to be missing from it:

def _init_weights(self, module):

I have a PR to correct CLIPForImageClassification #30373, and I'll integrate this.

@vasqu
Contributor

vasqu commented Apr 22, 2024

CLIPForImageClassification does inherit from CLIPPreTrainedModel, which in turn allows those init hooks to be called. So I don't see an issue there; it's standard practice in many modelling files. It gives me an idea though.

A possible easy way to fix the init imo is to integrate a different init for the linear weights in the hook (which is done for a lot of other models including altclip and chinese_clip). I am not sure if this is wanted though. On the other hand, clip is also pretty "old" and maybe this has slipped through. So from this line:

if isinstance(module, nn.Linear) and module.bias is not None:
    module.bias.data.zero_()

We would change it to something like:

if isinstance(module, nn.Linear):
    module.weight.data.normal_(mean=0.0, std=self.config.initializer_factor)
    if module.bias is not None:
        module.bias.data.zero_()

This doesn't explain to me why the initial init behaves like this, but at least we would have somewhat more consistent classifier heads.

@Xmaster6y
Author

I see, but what about a line like this (which has less impact):

elif isinstance(module, CLIPForImageClassification):
    nn.init.normal_(
        module.classifier.weight,
        std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
    )

@vasqu
Contributor

vasqu commented Apr 22, 2024

LGTM, except for one small thing: self.config.hidden_size doesn't exist in this config setting; use self.config.vision_config.hidden_size instead, which is what's used to create the (first) projection dimension.

Not sure what their standards are regarding this; I just followed what altclip and chinese_clip did :D So I'd rather let a maintainer decide what the right approach is in this case.

@vasqu
Contributor

vasqu commented Apr 22, 2024

Just noticed that I might have misunderstood you the first time you explained the idea, whoops. But maybe you should open a separate PR for this to keep it "separate issue = separate PR"; it doesn't look related to me.

@amyeroberts
Collaborator

Hi @Xmaster6y, thanks for raising this issue and @vasqu for digging into this!

Yes, it's in _init_weights where we'd want to address this. As the class CLIPForImageClassification was added many months (years!) after the initial CLIP, this was likely just an oversight.

> I see but what about a line like (that has less impact):

elif isinstance(module, CLIPForImageClassification):
    nn.init.normal_(
        module.classifier.weight,
        std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
    )

Both init suggestions work. As @vasqu mentions, we'd need to use config.vision_config.hidden_size as the config class for CLIPForImageClassification is the composite one. It's safe to assume this exists if we've entered into this branch of the if/elif statements.

Happy to review a PR with this change! Any update that's made in CLIP should also be reflected in SigLip.

@Xmaster6y
Author

One problem still remains: what does the processor have to do with this bug? Is there some cache corruption or a config switch under the hood?

I wrote 4 tests to check:

    def test_weight_init(self):
        config, _ = self.model_tester.prepare_config_and_inputs()
        config = CLIPConfig.from_pretrained("openai/clip-vit-base-patch32")
        model = CLIPForImageClassification(config=config)
        assert(model.classifier.weight <= 1e3).all()
        assert(model.classifier.weight != 0.).any()

    def test_weight_init_from_pretrained_1(self):
        model_name = "openai/clip-vit-base-patch32"
        config = CLIPConfig.from_pretrained(model_name)
        CLIPProcessor.from_pretrained(model_name)
        model = CLIPForImageClassification.from_pretrained(model_name, config=config)
        assert(model.classifier.weight <= 1e3).all()
        assert(model.classifier.weight != 0.).any()

    def test_weight_init_from_pretrained_2(self):
        model_name = "openai/clip-vit-base-patch32"
        config = CLIPConfig.from_pretrained(model_name)
        model = CLIPForImageClassification.from_pretrained(model_name, config=config)
        CLIPProcessor.from_pretrained(model_name)
        assert(model.classifier.weight <= 1e3).all()
        assert(model.classifier.weight != 0.).any()

    def test_weight_init_from_pretrained_3(self):
        model_name = "openai/clip-vit-base-patch32"
        config = CLIPConfig.from_pretrained(model_name)
        model = CLIPForImageClassification.from_pretrained(model_name, config=config)
        assert(model.classifier.weight <= 1e3).all()
        assert(model.classifier.weight != 0.).any()

My findings:

  • test_weight_init never fails.
  • The test_weight_init_from_pretrained_x tests all fail (because of arbitrary numbers, i.e. >= 1e3) when running the whole test file.
  • test_weight_init_from_pretrained_1 fails because of arbitrary numbers when run individually.
  • test_weight_init_from_pretrained_2/3 fail because the weight tensor contains only 0s.

My conclusion: the processor impacts whether you get a torch.empty or torch.zeros 🤯.

@vasqu I couldn't reproduce the error without involving the processor (your code works in Colab), but I wrote the following test, which fails because of all-zero weights when run individually:

    def test_weight_init_from_pretrained_custom(self):
        model_name = "openai/clip-vit-base-patch32"
        config = CLIPConfig.from_pretrained(model_name)
        config.problem_type = "single_label_classification"
        config.label2id = {
            'apple': '0',
            'banana': '1',
        }
        config.id2label = {
            '0': 'apple',
            '1': 'banana',
        }
        config.num_labels = 2
        model = CLIPForImageClassification.from_pretrained(model_name, config=config)
        assert(model.classifier.weight <= 1e1).all()
        assert(model.classifier.weight != 0.).any()

@vasqu
Contributor

vasqu commented Apr 23, 2024

First, I assume you tried those tests without any of the aforementioned fixes.

But yeah, I can't tell you what the reason is tbh. I still wouldn't attribute it entirely to the processor though. Running the whole test class already gives me mixed results: oftentimes all the _x tests fail, but there are also cases where only one or a couple fail. from_pretrained seems to have an influence, so until you check completely what happens in there, it's just a guessing game; especially since I could produce NaNs in any of the configurations in #30374 (comment), some more, some less frequently. Running tests multiple times sucks imo, which is why I wrote the "for loop ish" style that lets us see that the instantiation can fail arbitrarily.

Tl;dr: you might be right, but I still get the feeling it's more complicated, and we should be happy with the post hook fixing this. Do you want to open a PR for this or should I?

@Xmaster6y
Author

I see.

I can take care of that this evening (in 8 hours) and I'll tag you for review; or, conversely, you can do it if you get to it sooner.

@amyeroberts
Collaborator

amyeroberts commented Apr 23, 2024

Hi @Xmaster6y, thanks for sharing these examples!

The processor doesn't have anything to do with the weight initialization. The differences in the example tests are happening because of the two different ways the model is being created.

In test_weight_init, the model is created straight from the config, whereas in the other examples, the model is created using from_pretrained.

Adapting your examples:

from transformers import CLIPConfig, CLIPForImageClassification, CLIPProcessor, CLIPModel

CHECKPOINT = "openai/clip-vit-base-patch32"

def _test_weights(model):
    assert(model.classifier.weight <= 1e3).all()
    assert(model.classifier.weight != 0.).any()

# passes
def test_weight_init_from_config():
    config = CLIPConfig.from_pretrained(CHECKPOINT)
    model = CLIPForImageClassification(config=config)
    _test_weights(model)

# fails
def test_weight_init_pretrained_and_config():
    config = CLIPConfig.from_pretrained(CHECKPOINT)
    model = CLIPForImageClassification.from_pretrained(CHECKPOINT, config=config)
    _test_weights(model)

What's causing the discrepancy here is two things:

  • The _init_weights method only specifies how to initialize the weights for the bias of the linear layer, not the weight
  • When initializing from the config, by the time we hit this line, the weights of the linear layer are already initialized with what looks like normally distributed values. However, when we init with from_pretrained, the layer weights are all 0. This means that, in the from_pretrained case, the 0s are never overwritten.

I actually don't know why we have this discrepancy; cc'ing in @younesbelkada, who knows more about the weight initialization of our models.

It would be good to resolve this, as it could be causing issues for other models. In the meantime, the fix proposed above will address it:

elif isinstance(module, CLIPForImageClassification):
    nn.init.normal_(
        module.classifier.weight,
        std=self.config.hidden_size**-0.5 * self.config.initializer_factor,
    )

@vasqu
Contributor

vasqu commented Apr 23, 2024

*std=self.config.vision_config.hidden_size**-0.5 * self.config.initializer_factor, just as a nit :p (i.e. the branch would then read roughly as sketched below)
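
Same snippet as above, just with the composite config's vision_config (roughly what the fix would look like):

elif isinstance(module, CLIPForImageClassification):
    nn.init.normal_(
        module.classifier.weight,
        std=self.config.vision_config.hidden_size**-0.5 * self.config.initializer_factor,
    )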

@amyeroberts
Collaborator

@vasqu Yep! sorry :)

@Xmaster6y
Author

I get it, but sometimes the weights are all 0 and sometimes they come straight from torch.empty. It surely doesn't matter for the fix, but I don't see an answer to this yet.

@vasqu
Contributor

vasqu commented Apr 23, 2024

Are we sure it's not always torch.empty? torch.empty has huge variance since it picks up whatever torch finds in the memory it allocates (see this). So it's entirely dependent on your memory layout, architecture, call stack, etc.
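
Quick illustration (purely to show what torch.empty gives you; the values are whatever happens to be in memory):

import torch

# torch.empty only reserves memory and never writes to it, so the "values" are
# whatever bytes already sit there: sometimes zeros, sometimes huge numbers or NaNs.
t = torch.empty(2, 768)
print(t.min(), t.max(), torch.isnan(t).any())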

This would also somewhat explain the variance in initialisation. It's complicated, but I wouldn't be surprised if it was something entirely different.

@amyeroberts
Collaborator

@vasqu If you inspect the weights in the _init_weights method, you'll see that initializing from the config and using from_pretrained produce different init values. This difference has been consistent every time I've tried creating the model in the two different ways:

Init from config:

Linear(in_features=768, out_features=2, bias=True)
Weight:  Parameter containing:
tensor([[-0.0168, -0.0015,  0.0166,  ..., -0.0178,  0.0135, -0.0136],
        [-0.0056,  0.0146, -0.0150,  ..., -0.0039, -0.0332,  0.0017]],
       requires_grad=True)
Bias:  Parameter containing:
tensor([ 0.0094, -0.0012], requires_grad=True)

Init using from_pretrained

Linear(in_features=768, out_features=2, bias=True)
Weight:  Parameter containing:
tensor([[0.0000e+00, 0.0000e+00, 5.1189e-42,  ..., 0.0000e+00, 0.0000e+00,
         0.0000e+00],
        [9.7671e-43, 2.2421e-44, 9.7671e-43,  ..., 6.7262e-44, 1.5835e-42,
         0.0000e+00]], requires_grad=True)
Bias:  Parameter containing:
tensor([0., 0.], requires_grad=True)

torch.empty is used for CLIPForImageClassification.from_pretrained(checkpoint, config=config), but not for CLIPForImageClassification(config=config).

I managed to track this down to the different ways the models are created. When creating from the config, all of the weights are initialized using torch's defaults, then re-initialized based on settings in _init_weights.

However, when loading from a checkpoint, the from_pretrained method is used. Within this method, the layers are created but weight initialization is deliberately disabled. This enables faster weight loading: there's no point in initializing if most of the weights are just going to be replaced by the checkpoint weights.
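
Conceptually (a plain-PyTorch sketch of the idea, not the exact transformers code path), it's the difference between:

import torch
from torch import nn

# Normal construction: reset_parameters() runs, so the weights get torch's
# default (kaiming-uniform) init.
initialised = nn.Linear(768, 2)
print(initialised.weight.abs().max())  # small, sensible values

# Construction with initialisation skipped, similar in spirit to what
# from_pretrained does for speed: memory is allocated but never filled with
# proper values, so it must be overwritten later (by checkpoint weights or _init_weights).
uninitialised = nn.utils.skip_init(nn.Linear, 768, 2)
print(uninitialised.weight)  # arbitrary memory contents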

Now, this is actually a problem, as highlighted by this issue: we can silently end up with empty tensors that are never properly initialized if they're not specified in _init_weights, AND there's a difference in init behaviour between initializing from the config and from a checkpoint.

I'm going to check with the team to see how best to approach this. Technically we have tests for initialization, but as these test creation from the config, this wasn't caught.

@vasqu
Contributor

vasqu commented Apr 23, 2024

> However, when loading from a checkpoint, the from_pretrained method is used. Within this method, the layers are created but weight initialization is deliberately disabled. This enables faster weight loading: there's no point in initializing if most of the weights are just going to be replaced by the checkpoint weights.

That totally makes sense, just allocating memory and letting the checkpoint do the rest.

> When creating from the config, all of the weights are initialized using torch's defaults, then re-initialized based on settings in _init_weights.

Yea, that explains the discrepancy.

Thanks for looking into this!
