Support Modular (!!) + Configs in check_auto_docstrings #44803
yonigozlan merged 12 commits into huggingface:main
ArthurZucker left a comment

Ty, IDK if it's best to fix the generated file then modular, it looks weird and slow!
| """After fixing docstrings in a generated file, propagate the same fixes to the | ||
| corresponding modular_*.py source file. | ||
|
|
is it not simpler to fix modular first?

or is it because it's too complicated since modular does not have full scope?

> or is it because it's too complicated since modular does not have full scope?

Yes exactly, with super kwargs, inherited docstrings, and inherited attributes (in configs and other dataclasses), it's very difficult to fix modular first; it's much simpler and correct that way.
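The generated-first flow discussed in this thread could look roughly like this. This is a hypothetical sketch, not the actual check_auto_docstrings code; `propagate_docstring` and its regex matching are illustrative assumptions:

```python
import re

def propagate_docstring(modular_source: str, class_name: str, fixed_doc: str) -> str:
    """Copy a docstring that was fixed in the generated file into the
    corresponding class of the modular_*.py source, matched by class name.
    Returns the source unchanged if the class or its docstring is not found."""
    pattern = re.compile(
        rf'(class {class_name}\b[^\n]*:\n\s*)("""(?:.|\n)*?""")'
    )
    # use a lambda so backslashes in fixed_doc are not treated as group references
    return pattern.sub(lambda m: m.group(1) + f'"""{fixed_doc}"""', modular_source, count=1)
```

The real tool has to handle far more (methods, decorators, partially overridden docstrings), but the direction of propagation — generated file first, modular second — is the point.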
    output_docstring_indent = 4
    source_args_doc = [ModelOutputArgs]
elif item.is_config:
    # Config class (PreTrainedConfig subclass) - args are class-level type annotations,
For the slow part at least, it's really not a problem IMO since this is a tool for contributors to run a few times before merging, not something executed at every import like auto_docstring. Also it's not that slow 😁
zucchini-nlp left a comment

Super nice, thanks a lot for working on modular as well. I had it in mind for a long time but got caught up with model reviews 😓
I checked the first few models to see if the auto-generation is correct. I think it is deleting Example code in some cases, and I am not sure if we should explicitly document those args that are in auto_docstring's ConfigArgs.
from transformers import logging
from transformers.utils import direct_transformers_import
from transformers.utils.auto_docstring import (
    ConfigArgs,
oh nice, I noticed this too later and assumed all checks happen in the auto_docstring.py file
utils/check_docstrings.py (outdated)
| """Return the modular_*.py path for any generated model file, or None. | ||
|
|
||
| Handles modeling_*, configuration_*, processing_*, image_processing_* (including | ||
| the _fast suffix variant). |
ultra nit: _pil suffix after your refactor 😄 Also in code
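A minimal sketch of the mapping that docstring describes. The function name and prefix list here are my assumptions, not the real utility:

```python
from pathlib import Path

# prefixes of generated model files that may originate from a modular_*.py source
GENERATED_PREFIXES = ("image_processing_", "configuration_", "processing_", "modeling_")

def modular_path_for(generated_file: str):
    """Return the modular_*.py path for a generated model file, or None."""
    path = Path(generated_file)
    for prefix in GENERATED_PREFIXES:
        if path.name.startswith(prefix):
            # strip the prefix, the extension, and a possible _fast suffix
            model = path.name[len(prefix):].removesuffix(".py").removesuffix("_fast")
            return path.with_name(f"modular_{model}.py")
    return None
```

For example, `image_processing_bit_fast.py` and `configuration_bit.py` would both map to `modular_bit.py` in the same directory.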
| r""" | ||
| ```python | ||
| >>> from transformers import BitNetModel, BitNetConfig | ||
|
|
||
| >>> # Initializing a BitNet style configuration |
I think that was accidentally deleted during one of the earlier commits adding config support in check_auto_docstrings, which didn't fully work. We shouldn't have this problem with the latest version, but I'll restore all the deleted examples and run it again to see if everything works as expected. Thanks for catching that!
text_config (`dict`, *optional*):
    Dictionary of configuration options used to initialize [`CLIPTextConfig`].
vision_config (`dict`, *optional*):
    Dictionary of configuration options used to initialize [`CLIPVisionConfig`].
btw, does the check delete args that are in auto_docstring? IIRC all three args here are auto-documented

It should, if the docstrings are exactly the same. If the docstrings of args existing in auto_docstring are overridden with a different description, they won't be deleted by check_auto_docstrings.
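The rule in that reply can be sketched like so. This is a hypothetical helper that normalizes whitespace before comparing, not the real implementation:

```python
def prune_duplicated_args(manual_args: dict, auto_args: dict) -> dict:
    """Drop manually documented args whose description is identical to the
    auto_docstring default; overridden descriptions are kept as real overrides."""
    def norm(text: str) -> str:
        # collapse whitespace so reflowed-but-identical descriptions still match
        return " ".join(text.split())
    return {
        name: desc
        for name, desc in manual_args.items()
        if name not in auto_args or norm(desc) != norm(auto_args[name])
    }
```

Anything not matching the auto-documented text survives, which is exactly why an overridden `text_config` description would not be deleted.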
| r""" | ||
| ```python | ||
| >>> from transformers import CwmModel, CwmConfig | ||
|
|
||
| >>> # Initializing a Cwm cwm-7b style configuration | ||
| >>> configuration = CwmConfig() | ||
|
|
||
| >>> # Initializing a model from the cwm-7b style configuration | ||
| >>> model = CwmModel(configuration) | ||
|
|
||
| >>> # Accessing the model configuration | ||
| >>> configuration = model.config |
same here, I think the check is not working correctly with bare "Example" docs.
edit: Actually no, I also saw the whole doc deleted with args and the Example, so no idea what might have gone wrong
readout_type (`str`, *optional*, defaults to `"project"`):
    The readout type to use when processing the readout token (CLS token) of the intermediate hidden states of
    the ViT backbone. Can be one of [`"ignore"`, `"add"`, `"project"`].
    - "ignore" simply ignores the CLS token.
    - "add" passes the information from the CLS token to all other tokens by adding the representations.
    - "project" passes information to the other tokens by concatenating the readout to all other tokens before
      projecting the representation to the original feature dimension D using a linear layer followed by a GELU non-linearity.
I am not sure if the checker handles huge blocks of descriptions well. I had issues when they used a colon inside a description 😅
In this case, readout_type is part of the attrs and is deleted. Can you check if something went wrong when auto-generating?

Looks like it was just moved above to fit the order of the attributes in the config, no?
| r""" | ||
| Example: | ||
|
|
||
| ```python | ||
| >>> from transformers import GlmImageVisionConfig, GlmImageVisionModel | ||
|
|
||
| >>> # Initializing a GlmImageVisionConfig GLM-4.1V-9B style configuration | ||
| >>> configuration = GlmImageVisionConfig() | ||
|
|
||
| >>> # Initializing a model (with random weights) from the GLM-4.1V-9B configuration | ||
| >>> model = GlmImageVisionModel(configuration) | ||
|
|
nice, if we can auto-add examples based on what we have. I wonder where the checkpoint comes from though, because the auto_docstring has "zai-org/GLM-Image"

I copied this from Glm4vVisionConfig, as this is a bit of an edge case where the inherited docstring contains an arg that we remove here (out_hidden_size). But agreed, we can automatically add examples (we already do for models mapping to a pipeline)
```python
>>> from transformers import GraniteModel, GraniteConfig

>>> # Initializing a Granite granite-3b style configuration
>>> configuration = GraniteConfig()
I'd prefer to keep the docs for "Example" after deleting the unnecessary args
  num_hidden_layers: int = 2
  hidden_act: str = "gelu"
- image_token_embed_dim = 2048
+ image_token_embed_dim: int = 2048
is this auto-generated based on the default value's type? 🤯

No unfortunately 😅, I fixed this manually
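For what it's worth, inferring the annotation from a simple scalar default would be straightforward. This is a hypothetical helper showing how such annotations *could* be generated, not something the tool does (the PR fixed them by hand):

```python
def annotation_for_default(value) -> str:
    """Guess a type annotation string from a default value."""
    if value is None:
        return "Optional"  # the real type cannot be inferred from None alone
    if isinstance(value, bool):
        return "bool"  # check bool before int: bool is an int subclass
    return type(value).__name__
```

It breaks down for `None` defaults and container types whose element type matters, which is presumably why a manual fix was the safer choice.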
Oh, and we could add one new line in https://huggingface.co/docs/transformers/en/auto_docstring at the end, saying that config subclasses are also to be decorated.
@zucchini-nlp Thanks for the review! Everything should be good now :)
zucchini-nlp left a comment

Niice, I just looked at the auto-generated config, and for the rest I'll trust you.
Btw, can you rebase on main? Yesterday I merged a PR changing the annotations for architectures from ClassVar to a normal field. I had to manually skip auto-doc complaining about it and left a TODO.
docs/source/en/auto_docstring.md (outdated)
</hfoption>
<hfoption id="config classes">

Place `@auto_docstring` directly above a `PreTrainedConfig` subclass, alongside `@strict(accept_kwargs=True)` from `huggingface_hub.dataclasses`. Config parameters are declared as **class-level annotations** (not as `__init__` arguments) — the `@strict` dataclass pattern used throughout Transformers. The class docstring documents model-specific parameters and optionally a usage example.
Thanks for adding the strict one in the docs as well.
btw, we are removing the (accept_kwargs=True) and instead always wrapping subclasses. I'm merging the PR today, so let's remove it from the docs.
docs/source/en/auto_docstring.md (outdated)
from ...utils import auto_docstring

@auto_docstring(checkpoint="org/my-model-checkpoint")
@strict(accept_kwargs=True)
| <hfoption id="processor classes"> | ||
|
|
||
| **Multimodal processors** (`ProcessorMixin` subclasses, `processing_*.py`) always use bare `@auto_docstring`. The class intro is auto-generated. Only document `__init__` parameters not already covered by `ProcessorArgs` (`image_processor`, `tokenizer`, `chat_template`, etc.) — omit the docstring entirely if all parameters are standard. Decorate `__call__` with `@auto_docstring` as well; its body docstring contains only a `Returns:` section plus any extra model-specific call arguments. `return_tensors` is automatically appended. |
docs/source/en/auto_docstring.md (outdated)
| attributes = ["image_processor", "tokenizer"] | ||
| image_processor_class = "AutoImageProcessor" | ||
| tokenizer_class = "AutoTokenizer" |
do we want to define these? iirc you deprecated it after v5 🤔

Ah no, thanks for catching that, I was too quick in rereading the Claude-generated docs :)
text_config (`dict`, *optional*):
    Dictionary of configuration options used to initialize [`CLIPTextConfig`].
vision_config (`dict`, *optional*):
    Dictionary of configuration options used to initialize [`CLIPVisionConfig`].
I think it's already up to date with this PR, but now the logic is to go through the MRO to get the attributes to document (while still excluding ClassVar), and stop when we reach PreTrainedConfig, so it should fix this issue as well (I'm not getting any warning or error with auto_docstring).
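The MRO walk described above can be sketched as follows. Names here are illustrative stand-ins, not the real transformers internals:

```python
import typing

class PreTrainedConfig:  # stand-in for the real shared base class
    model_type: typing.ClassVar[str] = ""

def documentable_attributes(config_cls):
    """Collect class-level annotated attributes along the MRO, skipping
    ClassVar annotations and stopping before PreTrainedConfig."""
    attrs = []
    for klass in config_cls.__mro__:
        if klass is PreTrainedConfig:
            break  # attributes shared by every config are not documented per model
        for name, annotation in getattr(klass, "__annotations__", {}).items():
            if typing.get_origin(annotation) is typing.ClassVar:
                continue  # e.g. `architectures` before it became a normal field
            if name not in attrs:
                attrs.append(name)
    return attrs

# illustrative config hierarchy
class BaseModelConfig(PreTrainedConfig):
    hidden_size: int = 2048
    architectures: typing.ClassVar[list] = []

class MyModelConfig(BaseModelConfig):
    num_hidden_layers: int = 2
```

Walking `MyModelConfig.__mro__` this way yields the model's own attribute first, then inherited ones, and never anything defined on `PreTrainedConfig` itself.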
[For maintainers] Suggested jobs to run (before merge): run-slow: afmoe, albert, align, bamba, bigbird_pegasus, bit, bitnet, blip, blt, bridgetower, chameleon, chmv2, clap, clip, clvp, cohere
What does this PR do?
(Finally) add support for checking and fixing both generated files and modular files in check_auto_docstrings.
auto_docstring was also recently added to configs, and this PR updates check_auto_docstrings to support configs. Currently, auto_docstring documents all config attributes in the docs, which results in illegible docs. This PR also excludes attributes inherited from PreTrainedConfig from the docs (as it was without auto_docstring).
This means that we shouldn't have PRs merged with docstring-related errors/warnings anymore, nor inconsistencies between modular and check_auto_docstrings overwriting each other.
Cc @zucchini-nlp for configs