
Correct Tapas initialization #44575

Merged
Rocketknight1 merged 1 commit into main from correct_tapas_initialization on Mar 10, 2026

Conversation

@Rocketknight1 (Member)

Some parameters in Tapas are initialized in __init__() and never reinitialized in _init_weights(), which means that when the model is created on the meta device, those parameters never receive a weight initialization. This causes a crash later if the uninitialized memory happens to contain NaN values! This is what made the test_all_tensors_are_parameter_or_buffer test flaky.

This PR leaves tensor creation in __init__() but moves initialization to _init_weights().

cc @vasqu
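The pattern the PR describes can be sketched roughly as follows. This is a minimal illustration, not the actual Tapas source: the class name `TapasLikeHead`, the parameter name, and the standalone `_init_weights` function are all hypothetical, and in transformers `_init_weights` is a method on the pretrained-model class rather than a free function.

```python
import torch
import torch.nn as nn


class TapasLikeHead(nn.Module):
    """Illustrative module (hypothetical name, not the real Tapas class)."""

    def __init__(self, hidden_size: int):
        super().__init__()
        # Tensor *creation* stays in __init__; no values are written here,
        # so this is safe to run on the meta device.
        self.column_output_weights = nn.Parameter(torch.empty(hidden_size))


def _init_weights(module: nn.Module) -> None:
    # Initialization lives here, so it also runs for models that were
    # first created on the meta device and materialized afterwards.
    if isinstance(module, TapasLikeHead):
        nn.init.normal_(module.column_output_weights, mean=0.0, std=0.02)


# Create on the meta device (no real storage), then materialize on CPU:
with torch.device("meta"):
    head = TapasLikeHead(16)
head = head.to_empty(device="cpu")  # storage allocated but uninitialized
_init_weights(head)                 # parameter now holds real values
```

If the init call stayed only in __init__(), the to_empty() step above would leave the parameter as whatever bytes happened to be in the freshly allocated memory, which is exactly the bug being fixed.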

…e doesn't leave us with uninitialized tensors
@Rocketknight1 marked this pull request as ready for review March 10, 2026 14:43
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: tapas

@Rocketknight1 (Member, Author)

run-slow: tapas

@github-actions (Contributor)

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/tapas"]
quantizations: []

@vasqu (Contributor) left a comment


Thanks for the fix, looks good to me. Makes me wonder how we can detect these misses more easily 😓

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@Rocketknight1 (Member, Author)

@vasqu we could expand test_all_tensors_are_parameter_or_buffer to assert that no model parameters contain NaNs. Because the test just runs a forward pass and doesn't check outputs, most operations complete fine even with NaNs; Tapas only surfaces the failure because those parameters feed into torch.distributions code, which breaks on NaN values.

This wouldn't catch every failure (because the NaNs only appear sometimes) but I think (?) it should create a flaky failure for any uninitialized model parameter. Not sure how to make that flaky failure more reliable, though!
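The check proposed above could look something like this. This is a hypothetical sketch, not code from the PR: the helper name `assert_no_nan_parameters` is invented, and the real change would live inside the existing test rather than as a standalone function.

```python
import torch
import torch.nn as nn


def assert_no_nan_parameters(model: nn.Module) -> None:
    """Hypothetical helper: a check like this could be folded into
    test_all_tensors_are_parameter_or_buffer to flag uninitialized
    parameters directly, instead of relying on a downstream crash."""
    for name, param in model.named_parameters():
        assert not torch.isnan(param).any(), f"NaN values in parameter {name!r}"
```

As the comment above notes, uninitialized memory only sometimes contains NaNs, so a check like this would still produce a flaky rather than deterministic failure for a missed initialization.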

@github-actions (Contributor)

CI Results

Workflow Run ⚙️

Commit Info

Context  Commit    Description
RUN      041416bc  workflow commit (merge commit)
PR       b1574744  branch commit (from PR)
main     1cbb9c2e  base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

@Rocketknight1 (Member, Author)

Merging this and will think about a follow-up PR to surface this class of bug more reliably and with less flakiness.

@Rocketknight1 added this pull request to the merge queue Mar 10, 2026
Merged via the queue into main with commit 3b6c11d Mar 10, 2026
25 checks passed
@Rocketknight1 deleted the correct_tapas_initialization branch March 10, 2026 15:14

3 participants