
Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 #43574

Merged

vasqu merged 21 commits into huggingface:main from jp1924:skip_siglip_init on Feb 4, 2026

Conversation


@jp1924 jp1924 commented Jan 29, 2026

What does this PR do?

In the latest version of transformers, when initializing siglip with ZeRO3 applied, the following error occurs:

Fan in and fan out can not be computed for tensor with fewer than 2 dimensions
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/init.py", line 419, in _calculate_fan_in_and_fan_out
    raise ValueError(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/siglip/modeling_siglip.py", line 45, in variance_scaling_
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/siglip/modeling_siglip.py", line 67, in lecun_normal_
    variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/siglip/modeling_siglip.py", line 462, in _init_weights
    lecun_normal_(module.weight)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2344, in _initialize_weights
    self._init_weights(module)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2367, in smart_apply
    fn(self)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2366, in smart_apply
    module.smart_apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2366, in smart_apply
    module.smart_apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2364, in smart_apply
    module.smart_apply(module._initialize_weights)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2364, in smart_apply
    module.smart_apply(module._initialize_weights)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2364, in smart_apply
    module.smart_apply(module._initialize_weights)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2373, in initialize_weights
    self.smart_apply(self._initialize_weights)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4548, in _initialize_missing_keys
    self.initialize_weights()
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4274, in _finalize_load_state_dict
    model._initialize_missing_keys(load_config.is_quantized)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4110, in from_pretrained
    load_info = cls._finalize_load_state_dict(model, load_config, load_info)
  File "/root/workspace/src/sft/main.py", line 331, in main
    model = architecture.from_pretrained(train_args.model_name_or_path, **model_kwargs).train()
  File "/root/workspace/src/sft/main.py", line 437, in <module>
    main(train_args)
ValueError: Fan in and fan out can not be computed for tensor with fewer than 2 dimensions

The error is raised by torch's _calculate_fan_in_and_fan_out, which siglip's variance_scaling_ calls during weight initialization.

_calculate_fan_in_and_fan_out requires a tensor with at least 2 dimensions.

However, under ZeRO-3 the parameters are already sharded, so each rank sees an empty placeholder tensor (size 0), which triggers the error.
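For illustration, a minimal repro of the failing call (torch.empty(0) here is my stand-in for the empty placeholder a ZeRO-3 rank sees, not code from the PR):

import torch

# Stand-in for a ZeRO-3 sharded parameter: an empty placeholder tensor.
empty = torch.empty(0)

# Raises ValueError: Fan in and fan out can not be computed for tensor
# with fewer than 2 dimensions
torch.nn.init._calculate_fan_in_and_fan_out(empty)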

Therefore, following the pattern of normal_ in transformers > initialization.py, this PR adds a check of the _is_hf_initialized flag to decide whether to run the initialization.
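A minimal sketch of the resulting guard, assuming the flag is read off the tensor the way normal_ does in initialization.py (the exact code merged in this PR may differ):

import torch

def lecun_normal_(tensor: torch.Tensor) -> None:
    # Skip tensors that are already initialized, or that ZeRO-3 has sharded
    # away on this rank; mirrors the _is_hf_initialized check in normal_.
    if getattr(tensor, "_is_hf_initialized", False):
        return
    # variance_scaling_ is siglip's existing helper in modeling_siglip.py.
    variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")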


Environment:

  • transformers version: 5.0.0.dev0
  • Platform: Linux-5.15.0-153-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 1.3.3
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: 0.18.4
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 NVL

I'm using the dev version via pip install git+https://github.com/huggingface/transformers.git@dfe30827b8ebdd974eb7ce69c7d5d8cf8e6cf852

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@yonigozlan

@jp1924 jp1924 changed the title preventing duplicate initialization in siglip's lecun_normal_, default_flax_embed_init Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 Jan 29, 2026
@vasqu left a comment:

The changes look good to me. Since a few models already use this initialization, I'd rather move this to the init file, so we don't repeat it for other models in the future.

@vasqu left a comment:

LGTM, let's add a small test please.

vasqu commented Jan 30, 2026

Also, you can run make fix-repo for our CI. You might have modified only the modeling code, while we rely on the modular file and propagate changes to the other files (modeling, config, etc.).

jp1924 (Author) commented Jan 30, 2026

@vasqu
Oh, great info — thanks. I was just trying to figure out why the test kept failing and had no details, so that was a pain lol.
I'll write the test and send another review request, yep.

@vasqu left a comment:

Perfect, let's add a small test to deepspeed and we are ready to go.

jp1924 (Author) commented Feb 3, 2026

@vasqu
Um... I spent a lot of time wondering how to write the test.

I originally planned to import initialization.py directly and test it, but that didn't seem very meaningful.

Since this issue ultimately occurs when loading the model, I instead load the SigLIP model directly to check whether the latest version of transformers raises the error.

For now, it works fine this way, but I'm not sure what you think.

Let me know if there's anything you'd like me to change.

@jp1924 jp1924 requested a review from vasqu February 3, 2026 06:37
@vasqu left a comment:

Just some smaller things, but this looks pretty good! Thanks a lot for taking the time on this, I appreciate it.

with mockenv_context(**self.dist_env_1_gpu):
    logger = logging.get_logger("transformers.modeling_utils")
    with CaptureLogger(logger) as cl:
        model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
@vasqu commented on this snippet:
The test is looking good, I just have one nit here: we do not need a full model; we can create a dummy model, save it to a temporary directory, and load from that. No hub calls needed, and it should run much faster.

You can also look at the siglip2 modeling test file for some dummy values. It's just there to have something small.
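For reference, a sketch of the suggested pattern (the config values below are illustrative dummies, not the ones used in the merged test):

import tempfile

from transformers import AutoModel, SiglipConfig, SiglipModel

# Tiny illustrative config so the model stays small and the test runs fast.
tiny = dict(hidden_size=32, num_hidden_layers=2, num_attention_heads=4, intermediate_size=37)
config = SiglipConfig(
    text_config={**tiny, "vocab_size": 99},
    vision_config={**tiny, "image_size": 30, "patch_size": 2},
)

with tempfile.TemporaryDirectory() as tmp_dir:
    SiglipModel(config).save_pretrained(tmp_dir)  # create and save the dummy model
    model = AutoModel.from_pretrained(tmp_dir)    # load it back; no hub calls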

jp1924 and others added 6 commits February 4, 2026 08:50
jp1924 (Author) commented Feb 4, 2026

@vasqu
Oh, thanks for the very detailed feedback!
I created a dummy model and load from it, as you asked.

@jp1924 jp1924 requested a review from vasqu February 4, 2026 01:27
@vasqu left a comment:

LGTM, just changed the comment slightly. Merging now.

@vasqu vasqu enabled auto-merge (squash) February 4, 2026 13:31
github-actions bot commented Feb 4, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: phi4_multimodal, siglip, siglip2

@vasqu vasqu merged commit 225254c into huggingface:main Feb 4, 2026
20 checks passed
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

jp1924 (Author) commented Feb 5, 2026

thank you!
