
Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 #43574

Merged

vasqu merged 21 commits into huggingface:main from jp1924:skip_siglip_init on Feb 4, 2026

Conversation


@jp1924 jp1924 commented Jan 29, 2026

What does this PR do?

In the latest version of transformers, when initializing siglip with ZeRO3 applied, the following error occurs:

Fan in and fan out can not be computed for tensor with fewer than 2 dimensions
  File "/usr/local/lib/python3.10/dist-packages/torch/nn/init.py", line 419, in _calculate_fan_in_and_fan_out
    raise ValueError(
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/siglip/modeling_siglip.py", line 45, in variance_scaling_
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/siglip/modeling_siglip.py", line 67, in lecun_normal_
    variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")
  File "/usr/local/lib/python3.10/dist-packages/transformers/models/siglip/modeling_siglip.py", line 462, in _init_weights
    lecun_normal_(module.weight)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2344, in _initialize_weights
    self._init_weights(module)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2367, in smart_apply
    fn(self)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2366, in smart_apply
    module.smart_apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2366, in smart_apply
    module.smart_apply(fn)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2364, in smart_apply
    module.smart_apply(module._initialize_weights)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2364, in smart_apply
    module.smart_apply(module._initialize_weights)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2364, in smart_apply
    module.smart_apply(module._initialize_weights)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 2373, in initialize_weights
    self.smart_apply(self._initialize_weights)
  File "/usr/lib/python3.10/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.10/dist-packages/torch/utils/_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4548, in _initialize_missing_keys
    self.initialize_weights()
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4274, in _finalize_load_state_dict
    model._initialize_missing_keys(load_config.is_quantized)
  File "/usr/local/lib/python3.10/dist-packages/transformers/modeling_utils.py", line 4110, in from_pretrained
    load_info = cls._finalize_load_state_dict(model, load_config, load_info)
  File "/root/workspace/src/sft/main.py", line 331, in main
    model = architecture.from_pretrained(train_args.model_name_or_path, **model_kwargs).train()
  File "/root/workspace/src/sft/main.py", line 437, in <module>
    main(train_args)
ValueError: Fan in and fan out can not be computed for tensor with fewer than 2 dimensions

The error is raised by torch's _calculate_fan_in_and_fan_out, which siglip's variance_scaling_ calls during weight initialization.

_calculate_fan_in_and_fan_out requires a tensor with at least 2 dimensions.

However, under ZeRO-3 the parameters are already sharded, so each rank sees an empty placeholder tensor (size 0), which triggers the error.
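For illustration, a minimal repro of the failing call (torch.empty(0) here is my stand-in for the empty placeholder a ZeRO-3 rank sees, not code from the PR):

import torch

# Stand-in for a ZeRO-3 sharded parameter: an empty placeholder tensor.
empty = torch.empty(0)

# Raises ValueError: Fan in and fan out can not be computed for tensor
# with fewer than 2 dimensions
torch.nn.init._calculate_fan_in_and_fan_out(empty)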

Therefore, following the pattern of normal_ in transformers > initialization.py, this PR adds a check of the _is_hf_initialized flag to decide whether to run the initialization.
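A minimal sketch of the resulting guard, assuming the flag is read off the tensor the way normal_ does in initialization.py (the exact code merged in this PR may differ):

import torch

def lecun_normal_(tensor: torch.Tensor) -> None:
    # Skip tensors that are already initialized, or that ZeRO-3 has sharded
    # away on this rank; mirrors the _is_hf_initialized check in normal_.
    if getattr(tensor, "_is_hf_initialized", False):
        return
    # variance_scaling_ is siglip's existing helper in modeling_siglip.py.
    variance_scaling_(tensor, mode="fan_in", distribution="truncated_normal")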


Environment:

  • transformers version: 5.0.0.dev0
  • Platform: Linux-5.15.0-153-generic-x86_64-with-glibc2.35
  • Python version: 3.10.12
  • Huggingface_hub version: 1.3.3
  • Safetensors version: 0.7.0
  • Accelerate version: 1.12.0
  • Accelerate config: not found
  • DeepSpeed version: 0.18.4
  • PyTorch version (accelerator?): 2.9.1+cu128 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA H100 NVL

I'm using the dev version via pip install git+https://github.com/huggingface/transformers.git@dfe30827b8ebdd974eb7ce69c7d5d8cf8e6cf852

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@yonigozlan

@jp1924 jp1924 changed the title preventing duplicate initialization in siglip's lecun_normal_, default_flax_embed_init Preventing initialization of siglip's lecun_normal_, default_flax_embed_init in ZeRO3 Jan 29, 2026
@vasqu left a comment:

The changes look good to me. Since a few models already use this initialization, I'd rather move this to the init file, so we don't repeat it for other models in the future.

@vasqu left a comment:

LGTM, let's add a small test please.

vasqu commented Jan 30, 2026

Also, you can run make fix-repo for our CI. You might have modified only the modeling code, while we rely on the modular file and propagate changes to the other files (modeling, config, etc.).

jp1924 (Author) commented Jan 30, 2026

@vasqu
Oh, great info — thanks. I was just trying to figure out why the test kept failing and had no details, so that was a pain lol.
I'll write the test and send another review request, yep.

@vasqu left a comment:

Perfect, let's add a small test to deepspeed and we are ready to go.

jp1924 (Author) commented Feb 3, 2026

@vasqu
Um... I spent a lot of time wondering how to write the test.

I originally planned to import initialization.py directly and test it, but that didn't seem very meaningful.

Since this issue ultimately occurs when loading the model, I instead load the SigLIP model directly to check whether the latest version of transformers raises the error.

For now, it works fine this way, but I'm not sure what you think.

Let me know if there's anything you'd like me to change.

@jp1924 jp1924 requested a review from vasqu February 3, 2026 06:37
@vasqu left a comment:

Just some smaller things, but this looks pretty good! Thanks a lot for taking the time on this, I appreciate it.

with mockenv_context(**self.dist_env_1_gpu):
    logger = logging.get_logger("transformers.modeling_utils")
    with CaptureLogger(logger) as cl:
        model = AutoModel.from_pretrained("google/siglip-base-patch16-224")
@vasqu commented on this snippet:
The test is looking good, I just have one nit here: we do not need a full model; we can create a dummy model, save it to a temporary directory, and load from that. No hub calls needed, and it should run much faster.

You can also look at the siglip2 modeling test file for some dummy values. It's just there to have something small.
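For reference, a sketch of the suggested pattern (the config values below are illustrative dummies, not the ones used in the merged test):

import tempfile

from transformers import AutoModel, SiglipConfig, SiglipModel

# Tiny illustrative config so the model stays small and the test runs fast.
tiny = dict(hidden_size=32, num_hidden_layers=2, num_attention_heads=4, intermediate_size=37)
config = SiglipConfig(
    text_config={**tiny, "vocab_size": 99},
    vision_config={**tiny, "image_size": 30, "patch_size": 2},
)

with tempfile.TemporaryDirectory() as tmp_dir:
    SiglipModel(config).save_pretrained(tmp_dir)  # create and save the dummy model
    model = AutoModel.from_pretrained(tmp_dir)    # load it back; no hub calls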

jp1924 and others added 6 commits February 4, 2026 08:50
jp1924 (Author) commented Feb 4, 2026

@vasqu
Oh, thanks for the very detailed feedback!
I created a dummy model and load from it, as you asked.

@jp1924 jp1924 requested a review from vasqu February 4, 2026 01:27
@vasqu left a comment:

LGTM, just changed the comment slightly. Merging now.

@vasqu vasqu enabled auto-merge (squash) February 4, 2026 13:31
github-actions bot commented Feb 4, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: phi4_multimodal, siglip, siglip2

@vasqu vasqu merged commit 225254c into huggingface:main Feb 4, 2026
20 checks passed
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

jp1924 (Author) commented Feb 5, 2026

thank you!
