
Refactor tiny-model generation scripts #5637

Open
qgallouedec wants to merge 24 commits into main from new-tiny-model-generation

Conversation

@qgallouedec (Member) commented Apr 24, 2026

Split tiny-model generation into per-model scripts

Replace the monolithic scripts/generate_tiny_models.py (a single 437-line file: one big tuple-driven loop plus a growing if issubclass(...) ladder for VLMs) with a per-model layout:

scripts/generate_tiny_models/
├── _common.py                      # shared helpers: push_to_hub, smoke_test, print_config_diff, ...
├── for_causal_lm/                  # *ForCausalLM + GPT-2 LM head + PEFT + "small" vLLM variants
├── for_sequence_classification/    # reward models
└── for_conditional_generation/     # VLMs + T5 + Bart (encoder-decoder)

One script per tiny model. Each script is fully self-contained. Shared logic (push, smoke test, diff, weight init) stays in _common.py.
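For orientation, here is a minimal sketch of the shape a per-model script takes. The MODEL_ID, shrink kwargs, and helper signatures below are illustrative assumptions, not the exact code from this PR:

# Hypothetical per-model script; helper signatures and shrink kwargs are
# assumptions, not the actual _common.py API.
from transformers import AutoConfig, AutoTokenizer, Qwen3ForCausalLM

from .._common import (
    check_dtype_pattern,
    check_transformers_version,
    print_config_diff,
    push_to_hub,
    smoke_test,
)

MODEL_ID = "Qwen/Qwen3-0.6B"     # assumed reference checkpoint
TRANSFORMERS_VERSION = "4.56.2"  # max(introduction_version, trl_floor)

check_transformers_version(TRANSFORMERS_VERSION)

# Shrink the reference config down to a tiny footprint.
config = AutoConfig.from_pretrained(
    MODEL_ID,
    hidden_size=16,
    intermediate_size=32,
    num_hidden_layers=2,
    num_attention_heads=4,
    num_key_value_heads=2,
    head_dim=8,
)
model = Qwen3ForCausalLM(config)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

smoke_test(model, tokenizer)          # minimal forward pass before pushing
print_config_diff(MODEL_ID, model)    # flat-key diff vs the reference config
check_dtype_pattern(MODEL_ID, model)  # dtype parity vs reference safetensors
push_to_hub(model, tokenizer)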

Why

  • The old file was bloated: 437 lines of tuple-driven loops stacked with one-off special cases for nearly every VLM (Qwen2.5-VL out_hidden_size, Qwen3-VL layer_types deletion, Qwen3.5 linear-attn fp32 cast, Gemma4 in-place mutation, llava-v1.6 dtype hotfix, …).
  • To re-push a single model you had to manually comment out all the others.
  • It didn't pin the transformers version.
  • It didn't run any smoke test — a broken config only failed later when CI tried to load the tiny model.

With one file per model, each script reads top-to-bottom in 20–50 lines. Model-specific quirks stay scoped to the model that needs them.

New features added to every script

  • Smoke test. Before pushing, each script runs a minimal forward pass on a tiny dummy input. This catches config misspecification at generation time rather than when CI first imports the tiny model.
  • Config diff vs reference. print_config_diff(MODEL_ID, model) prints every flat-key difference between the reference Hub config and the tiny model's config before push. Makes it obvious when a shrink kwarg was silently ignored or when an unexpected field drifted.
  • Dtype pattern check. check_dtype_pattern(MODEL_ID, model) reads the reference safetensors header via the Hub API (no weight download) and flags any tensor whose dtype diverges from the reference — catches cases like models with mixed-precision weights (e.g. fp32 norms inside a bf16 checkpoint).
  • Exact transformers version pin. Each script declares TRANSFORMERS_VERSION = "X.Y.Z" and calls check_transformers_version(...), which raises an error unless the installed version matches exactly. The pinned value is max(introduction_version, trl_floor=4.56.2). Rationale: transformers is backward-compatible (a checkpoint saved by X loads on any ≥ X) but not forward-compatible; TRL CI runs against the floor, so tiny models must be saved with the oldest version that supports them to avoid config-field drift. Exact match prevents accidental regenerations with a newer transformers from silently breaking min-version CI. (Both this check and the dtype check above are sketched after this list.)
  • --create-pr flag. When a tiny model already exists on the Hub, the default is to skip the push. Passing --create-pr opens a single PR against the existing repo instead (all artifacts bundled into one commit via HfApi.create_commit), so updates can be reviewed before landing on main.
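For concreteness, a hedged sketch of the version pin and the dtype pattern check. The helper names mirror those above, but the bodies and signatures are assumptions, not the actual _common.py code:

# Sketch only: the real _common.py implementations may differ in names and details.
import torch
import transformers
from huggingface_hub import get_safetensors_metadata

_SAFETENSORS_TO_TORCH = {"F32": torch.float32, "F16": torch.float16, "BF16": torch.bfloat16}

def check_transformers_version(expected: str) -> None:
    # Raise unless the installed transformers version matches the pin exactly.
    installed = transformers.__version__
    if installed != expected:
        raise RuntimeError(
            f"Run this script with transformers=={expected} (found {installed}); "
            "regenerating with a newer version could silently break min-version CI."
        )

def check_dtype_pattern(model_id: str, model: torch.nn.Module) -> None:
    # Read only the reference safetensors headers via the Hub API (no weight
    # download) and flag tensors whose dtype diverges from the reference,
    # e.g. fp32 norms inside a bf16 checkpoint.
    metadata = get_safetensors_metadata(model_id)
    ref_dtypes = {
        name: info.dtype
        for file_meta in metadata.files_metadata.values()
        for name, info in file_meta.tensors.items()
    }
    for name, param in model.state_dict().items():
        ref = ref_dtypes.get(name)
        if ref in _SAFETENSORS_TO_TORCH and param.dtype != _SAFETENSORS_TO_TORCH[ref]:
            print(f"dtype mismatch for {name}: tiny={param.dtype}, reference={ref}")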

How to run

# Regenerate a single tiny model (from repo root)
python -m scripts.generate_tiny_models.for_causal_lm.qwen3_for_causal_lm

# Open a PR against the existing Hub repo instead of skipping
python -m scripts.generate_tiny_models.for_causal_lm.qwen3_for_causal_lm --create-pr

See scripts/generate_tiny_models/README.md for full documentation.

Scope

This PR is refactor-only — the Python logic for each tiny model is preserved exactly. No tiny model on the Hub is regenerated by this PR; the existing Hub repos remain the source of truth for CI.

Follow-up PRs will use the new scripts to regenerate and push individual tiny models where the existing Hub checkpoint drifts from the reference (e.g. non-size config fields defaulting to wrong values, missing upstream-added fields, quantization parity, etc.). Each regeneration is one PR per tiny model, with a refs/pr/N override in tests/conftest.py until merged on the Hub.


Note

Low Risk
Primarily a tooling refactor under scripts/ that doesn’t affect library runtime, but it changes the Hub upload path and adds new validation steps that could alter how future tiny models are regenerated.

Overview
Replaces the monolithic scripts/generate_tiny_models.py with a scripts/generate_tiny_models/ package containing one script per tiny model grouped by task (for_causal_lm, for_sequence_classification, for_conditional_generation).

Adds shared utilities in _common.py to standardize generation: exact transformers version pin checks, a minimal smoke_test forward pass, check_dtype_pattern against the reference safetensors metadata, and print_config_diff against the reference config.

Updates Hub publishing to support --create-pr and to upload all artifacts (model, tokenizer/processor, generation config, model card) in a single commit via HfApi.create_commit, with new documentation in scripts/generate_tiny_models/README.md.

Reviewed by Cursor Bugbot for commit 4730fec. Bugbot is set up for automated code reviews on this repo.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.


@chatgpt-codex-connector (Bot) left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 730c87629a


Comment on lines +73 to +75
inputs = processor.apply_chat_template(
conversation=messages,
add_generation_prompt=True,

P2: Use non-chat fallback for VLM smoke inputs

smoke_test unconditionally calls processor.apply_chat_template(...) for every processor with an image_processor, but some VLM checkpoints in this refactor are not chat models (for example google/paligemma-3b-pt-224, whose script calls smoke_test(model, processor) in paligemma_for_conditional_generation.py). In that case apply_chat_template can fail due to missing/unsupported chat templates, so the script aborts before push even though the model itself is valid with regular processor(text=..., images=...) inputs.


@qgallouedec (Member, Author) replied:
I think we could address this in the future; google/paligemma-3b-pt-224 is the only VLM that doesn't have a chat template.

return_dict=True,
return_tensors="pt",
padding=True,
).to(device)

smoke_test crashes for PaliGemma lacking chat template

Medium Severity

The smoke_test VLM branch unconditionally calls processor.apply_chat_template(...) for any ProcessorMixin. The PaliGemma script passes its processor to smoke_test, but google/paligemma-3b-pt-224 has no chat template defined, so apply_chat_template will raise at runtime, making the PaliGemma generation script unusable. The PR discussion confirms PaliGemma is the only VLM without a chat template.


Reviewed by Cursor Bugbot for commit 5cc7fc8.


@cursor (Bot) left a comment


Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).


Reviewed by Cursor Bugbot for commit 4730fec.


config = AutoConfig.from_pretrained(MODEL_ID, text_config=text_config, vision_config=vision_config)
model = PaliGemmaForConditionalGeneration(config).to(dtype=torch.float32)
smoke_test(model, processor)

PaliGemma smoke_test crashes due to missing chat template

Medium Severity

smoke_test(model, processor) passes a ProcessorMixin to smoke_test, which takes the VLM branch and calls processor.apply_chat_template(...). PaliGemma (google/paligemma-3b-pt-224) has no chat template, so this call will crash at runtime. The PR discussion confirms this: "google/paligemma-3b-pt-224 is the only VLM that doesn't have a chat template." The smoke_test function needs a fallback path for VLM processors that lack a chat template.
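One possible shape for that fallback, sketched here as an assumption rather than the PR's actual fix: try the chat template first, and fall back to plain processor(text=..., images=...) inputs when none is defined.

# Hypothetical fallback for VLM processors without a chat template; the dummy
# prompt and image are illustrative, not the PR's actual smoke inputs.
from PIL import Image

def build_vlm_smoke_inputs(processor, messages, device):
    if getattr(processor, "chat_template", None) is not None:
        return processor.apply_chat_template(
            conversation=messages,
            add_generation_prompt=True,
            return_dict=True,
            return_tensors="pt",
            padding=True,
        ).to(device)
    # Non-chat VLMs (e.g. google/paligemma-3b-pt-224) take plain text + image inputs.
    dummy_image = Image.new("RGB", (224, 224))
    return processor(text="caption en", images=dummy_image, return_tensors="pt").to(device)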


Reviewed by Cursor Bugbot for commit 4730fec.

@albertvillanova (Member) left a comment

Thanks.

You replaced 450 code lines with 2,765. Are you sure this is the right direction?

On the other hand, what if we want to regenerate all models? Or some specific family of models? After this PR there is no simple way other than running each model individually.

The transformers pin to the lowest supported version is repeated in every script. Why not default to the lowest supported version if no explicit version is set for a specific model?

Also, I think create_pr should be the default.
