
[Feat] Reduces redundant tokenization of <pad> tags to accelerate Qwen3VL.#43297

Open
ZLkanyo009 wants to merge 2 commits into huggingface:main from ZLkanyo009:main

Conversation


@ZLkanyo009 ZLkanyo009 commented Jan 15, 2026

By tokenizing only a single `<image_pad>` or `<video_pad>` into the input_ids and then expanding it to n * grid_thw tokens afterwards, the tokenization of Qwen3VL is accelerated.
In SGLang, we discovered that the tokenization process for the Qwen3VL model takes a long time. Applying the method above reduces this time consumption.
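The idea can be sketched in plain Python. This is a toy illustration, not the actual Qwen3VL processor code: `IMAGE_PAD_ID`, the helper name, and the pad counts are all assumptions made for the example.

```python
# Sketch of the PR's approach (hypothetical toy code, not the real processor):
# the tokenizer emits ONE placeholder id per image, and we expand it to N
# copies afterwards, so the tokenizer never has to process thousands of
# repeated <image_pad> strings.

IMAGE_PAD_ID = 151655  # assumed placeholder id; illustrative only


def expand_image_pads(input_ids, pads_per_image):
    """Replace each single image-pad id with N copies, in order of appearance."""
    out, img_idx = [], 0
    for tok in input_ids:
        if tok == IMAGE_PAD_ID and img_idx < len(pads_per_image):
            out.extend([IMAGE_PAD_ID] * pads_per_image[img_idx])
            img_idx += 1
        else:
            out.append(tok)
    return out


# One image whose grid requires 4 pad tokens:
ids = [101, IMAGE_PAD_ID, 102]
print(expand_image_pads(ids, [4]))
# [101, 151655, 151655, 151655, 151655, 102]
```

The expansion is a single O(n) pass over already-tokenized ids, which avoids feeding very long repeated-placeholder strings through the tokenizer itself.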

profile

before: (profiling screenshot)
after: (profiling screenshot)

acc

server:

```shell
model_path=/models/nvme3/data/models/Qwen3-VL-235B-A22B-Instruct-FP8-dynamic/
model_name=Qwen3-VL-235B-A22B-Instruct-FP8-dynamic

python3 -m sglang.launch_server \
        --model-path $model_path \
        --served-model-name ${model_name} \
        --host 0.0.0.0 \
        --port 8011 \
        --tp-size 8 \
        --trust-remote-code \
        --chunked-prefill-size 32768 \
        --mem-fraction-static 0.80 \
        --disable-radix-cache \
        --cuda-graph-max-bs 128 \
        --max-prefill-tokens 32768 \
        --max-running-requests 128 \
        --mm-attention-backend aiter_attn \
        --mm-enable-dp-encoder \
        --enable-aiter-allreduce-fusion \
        --disable-overlap-schedule
```

client:

```shell
python3 benchmark/mmmu/bench_sglang.py --port 9000 --concurrency 16
```

acc before and after: 0.594

@Rocketknight1 (Member)

cc @itazap @zucchini-nlp

Comment on lines +233 to +240
```python
for token_id in input_ids:
    if token_id == self.video_token_id and global_video_token_idx < len(video_token_counts):
        # Expand 1 video_token to N tokens
        num_tokens = video_token_counts[global_video_token_idx]
        new_input_ids.extend([self.video_token_id] * num_tokens)
        global_video_token_idx += 1
    else:
        new_input_ids.append(token_id)
```
(Member)
hmm, I wonder if Qwen-VL uses rust tokenizer backend, which is supposed to be super fast/optimized

(Author)
Yes, I have already used it, but it's not very effective for multimodal inputs with large images. For example, a large image can produce around 6000 `<image_pad>` tokens, which significantly slows down the tokenization process.
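To get a feel for the numbers: the pad count grows with the image's patch grid. The formula below is a rough sketch only; `patch_size=14` and `merge_size=2` are assumed Qwen2-VL-style values used purely for illustration, not confirmed Qwen3VL constants.

```python
# Hedged back-of-the-envelope: how many <image_pad> placeholders a single
# image can produce, assuming patch_size=14 and spatial merge 2 (illustrative
# values, not taken from the actual Qwen3VL config).
def num_image_pads(height, width, patch_size=14, merge_size=2):
    grid_h = height // patch_size
    grid_w = width // patch_size
    return (grid_h * grid_w) // (merge_size ** 2)


print(num_image_pads(2156, 2156))  # 5929 pads for a ~2156x2156 image
```

So a single large image can indeed push the placeholder count into the thousands, which is what makes tokenizing the repeated string expensive.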

(Member)
hmm, and the wins are quite big as per your benchmarks. I see that the current code assumes bs==1 and thus no padding. Expanding manually is a bit more involved than the current code; we have done it manually in the past during the model's forward and it has cost us several bugs/issues. The main issue is taking padding into account and padding tokens to the correct side when needed. The same applies to the attention mask: we can't simply set everything to 1.

I'd like to hear what @itazap thinks. If we decide to expand those after tokenizing, it would need a utility (inside a tokenizer or processor) that any multimodal model can call
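The batching pitfall being discussed can be made concrete with a toy sketch (assumed ids and a hypothetical helper, not proposed library code): after expansion each sequence grows by a different amount, so a batch that was already padded to equal length stops being rectangular, and both the padding and the attention mask have to be rebuilt rather than "set all to 1".

```python
# Toy illustration of post-tokenization expansion under batching with LEFT
# padding. PAD_ID / IMG_ID are made-up ids for the example.
PAD_ID, IMG_ID = 0, 9


def expand_and_repad_left(batch_ids, counts_per_seq):
    # Expand each sequence independently, dropping the old padding.
    expanded = []
    for seq, counts in zip(batch_ids, counts_per_seq):
        out, i = [], 0
        for tok in seq:
            if tok == IMG_ID and i < len(counts):
                out.extend([IMG_ID] * counts[i])
                i += 1
            else:
                out.append(tok)
        expanded.append([t for t in out if t != PAD_ID])
    # Re-pad on the left to the new max length and rebuild the mask.
    max_len = max(len(s) for s in expanded)
    ids = [[PAD_ID] * (max_len - len(s)) + s for s in expanded]
    mask = [[0] * (max_len - len(s)) + [1] * len(s) for s in expanded]
    return ids, mask


ids, mask = expand_and_repad_left([[0, 0, 1, 9], [1, 9, 2, 3]], [[3], [2]])
print(ids)   # [[0, 1, 9, 9, 9], [1, 9, 9, 2, 3]]
print(mask)  # [[0, 1, 1, 1, 1], [1, 1, 1, 1, 1]]
```

Note that the two sequences end up with different lengths after expansion, which is exactly why a naive "expand in place and keep the old mask" approach breaks for bs>1.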

(Author)

@itazap could you please review this PR? Thanks!

(Collaborator)

thanks for working on this! Agreed that the performance wins are attractive. `bs==1` and no padding support does make me question the complexity of a tokenizers utility for this. I'm wondering if maybe we can do a hybrid approach where we use your post-tokenization expansion when `bs==1` and `padding=False`. Interested to know if you think that would be useful @zucchini-nlp? Also, on having done it manually in the past: is the performance of this a commonly reported issue?

(Member)

> is the performance of this a commonly reported issue?

no, this is the first issue mentioning it. I don't think it causes more overhead than other parts of processing such as image processing. I agree that this is hard to implement with bs>1 unless tokenizers has a prebuilt utility for it. I don't think we want to keep the code if it's only going to run under `bs==1, padding=False`.

(Collaborator)

yes, understood. I would say let's set this aside for now, as it would add maintenance overhead for a solution that currently covers only one case (`bs==1, padding=False`). Happy to revisit if it gets more demand from the community.

@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: qwen3_vl

@github-actions (Contributor)

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43297&sha=6d7358
