10 changes: 5 additions & 5 deletions src/transformers/models/qwen3_vl/configuration_qwen3_vl.py
@@ -110,7 +110,7 @@ class Qwen3VLTextConfig(PreTrainedConfig):
             Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain
             a value for `rope_theta` and optionally parameters used for scaling in case you want to use RoPE
             with longer `max_position_embeddings`.
-        attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`):
+        attention_bias (`bool`, *optional*, defaults to `False`):
             Whether to use a bias in the query, key, value and output projection layers during self-attention.
         attention_dropout (`float`, *optional*, defaults to 0.0):
             The dropout ratio for the attention probabilities.
@@ -197,13 +197,13 @@ class Qwen3VLConfig(PreTrainedConfig):
         vision_config (`Union[PreTrainedConfig, dict]`, *optional*, defaults to `Qwen3VLVisionConfig`):
             The config object or dictionary of the vision backbone.
         image_token_id (`int`, *optional*, defaults to 151655):
-            The image token index to encode the image prompt.
+            The token id used as the placeholder for image inputs.
         video_token_id (`int`, *optional*, defaults to 151656):
-            The video token index to encode the image prompt.
+            The token id used as the placeholder for video inputs.
         vision_start_token_id (`int`, *optional*, defaults to 151652):
-            The start token index to encode the image prompt.
+            The token id that marks the start of a vision segment (image or video).
         vision_end_token_id (`int`, *optional*, defaults to 151653):
-            The end token index to encode the image prompt.
+            The token id that marks the end of a vision segment (image or video).
         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
             Whether to tie the word embeddings.

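For reference, the four token ids documented above can be read straight off the config. A minimal sketch, assuming `Qwen3VLConfig` instantiates with the documented defaults (not taken from this diff):

from transformers.models.qwen3_vl.configuration_qwen3_vl import Qwen3VLConfig

config = Qwen3VLConfig()
print(config.image_token_id)         # 151655: placeholder for image inputs
print(config.video_token_id)         # 151656: placeholder for video inputs
print(config.vision_start_token_id)  # 151652: marks the start of a vision segment
print(config.vision_end_token_id)    # 151653: marks the end of a vision segment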
16 changes: 8 additions & 8 deletions src/transformers/models/qwen3_vl/modular_qwen3_vl.py
@@ -152,7 +152,7 @@ class Qwen3VLTextConfig(PreTrainedConfig):
             Dictionary containing the configuration parameters for the RoPE embeddings. The dictionary should contain
             a value for `rope_theta` and optionally parameters used for scaling in case you want to use RoPE
             with longer `max_position_embeddings`.
-        attention_bias (`bool`, defaults to `False`, *optional*, defaults to `False`):
+        attention_bias (`bool`, *optional*, defaults to `False`):
             Whether to use a bias in the query, key, value and output projection layers during self-attention.
         attention_dropout (`float`, *optional*, defaults to 0.0):
             The dropout ratio for the attention probabilities.
@@ -239,13 +239,13 @@ class Qwen3VLConfig(PreTrainedConfig):
         vision_config (`Union[PreTrainedConfig, dict]`, *optional*, defaults to `Qwen3VLVisionConfig`):
             The config object or dictionary of the vision backbone.
         image_token_id (`int`, *optional*, defaults to 151655):
-            The image token index to encode the image prompt.
+            The token id used as the placeholder for image inputs.
         video_token_id (`int`, *optional*, defaults to 151656):
-            The video token index to encode the image prompt.
+            The token id used as the placeholder for video inputs.
         vision_start_token_id (`int`, *optional*, defaults to 151652):
-            The start token index to encode the image prompt.
+            The token id that marks the start of a vision segment (image or video).
         vision_end_token_id (`int`, *optional*, defaults to 151653):
-            The end token index to encode the image prompt.
+            The token id that marks the end of a vision segment (image or video).
         tie_word_embeddings (`bool`, *optional*, defaults to `False`):
             Whether to tie the word embeddings.

@@ -1404,9 +1404,9 @@ def __call__(
         **kwargs: Unpack[Qwen3VLProcessorKwargs],
     ) -> BatchFeature:
         """
-        Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
+        Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
         and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
-        the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwrags` arguments to
+        the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwargs` arguments to
         Qwen2VLImageProcessor's [`~Qwen2VLImageProcessor.__call__`] if `vision_infos` is not `None`.

         Args:
@@ -1418,7 +1418,7 @@ def __call__(
                 (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                 `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
             videos (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`):
-                The image or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
+                The video or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
                 tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors of a particular framework. Acceptable values are:
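The docstring describes the text/vision dispatch: text goes to the tokenizer, images to the image processor. A minimal usage sketch follows; the checkpoint id and the Qwen2-VL-style special tokens (`<|vision_start|>`, `<|image_pad|>`, `<|vision_end|>`) are assumptions for illustration, not taken from this diff:

from PIL import Image
from transformers import AutoProcessor

# Hypothetical checkpoint id; substitute a real Qwen3-VL checkpoint.
processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")

image = Image.new("RGB", (224, 224), color="gray")
text = "<|vision_start|><|image_pad|><|vision_end|>Describe the image."

# `text` is forwarded to the tokenizer, `images` to the image processor.
inputs = processor(text=text, images=image, return_tensors="pt")
print(inputs.input_ids.shape, inputs.pixel_values.shape)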
6 changes: 3 additions & 3 deletions src/transformers/models/qwen3_vl/processing_qwen3_vl.py
@@ -99,9 +99,9 @@ def __call__(
         **kwargs: Unpack[Qwen3VLProcessorKwargs],
     ) -> BatchFeature:
         """
-        Main method to prepare for the model one or several sequences(s) and image(s). This method forwards the `text`
+        Main method to prepare for the model one or several sequence(s) and image(s). This method forwards the `text`
        and `kwargs` arguments to Qwen2TokenizerFast's [`~Qwen2TokenizerFast.__call__`] if `text` is not `None` to encode
-        the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwrags` arguments to
+        the text. To prepare the vision inputs, this method forwards the `vision_infos` and `kwargs` arguments to
         Qwen2VLImageProcessor's [`~Qwen2VLImageProcessor.__call__`] if `vision_infos` is not `None`.

         Args:
@@ -113,7 +113,7 @@ def __call__(
                 (pretokenized string). If the sequences are provided as list of strings (pretokenized), you must set
                 `is_split_into_words=True` (to lift the ambiguity with a batch of sequences).
             videos (`np.ndarray`, `torch.Tensor`, `list[np.ndarray]`, `list[torch.Tensor]`):
-                The image or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
+                The video or batch of videos to be prepared. Each video can be a 4D NumPy array or PyTorch
                 tensor, or a nested list of 3D frames. Both channels-first and channels-last formats are supported.
             return_tensors (`str` or [`~utils.TensorType`], *optional*):
                 If set, will return tensors of a particular framework. Acceptable values are:
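Per the corrected `videos` description, a single clip can be passed as a 4D array. A sketch assuming a channels-last NumPy clip, the same hypothetical checkpoint, and a Qwen2-VL-style `<|video_pad|>` token:

import numpy as np
from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("Qwen/Qwen3-VL")  # hypothetical checkpoint id

# One clip as a 4D array: (frames, height, width, channels), channels-last.
video = np.zeros((8, 224, 224, 3), dtype=np.uint8)
text = "<|vision_start|><|video_pad|><|vision_end|>Summarize the clip."

inputs = processor(text=text, videos=video, return_tensors="pt")
# Output key name assumed from the Qwen2-VL processor family.
print(inputs.input_ids.shape, inputs.pixel_values_videos.shape)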