[Feat] Reduces redundant tokenization of <pad> tags to accelerate Qwen3VL.#43297
ZLkanyo009 wants to merge 2 commits into huggingface:main
Conversation
```python
for token_id in input_ids:
    if token_id == self.video_token_id and global_video_token_idx < len(video_token_counts):
        # Expand 1 video_token to N tokens
        num_tokens = video_token_counts[global_video_token_idx]
        new_input_ids.extend([self.video_token_id] * num_tokens)
        global_video_token_idx += 1
    else:
        new_input_ids.append(token_id)
```
Hmm, I wonder if Qwen-VL uses the Rust tokenizer backend, which is supposed to be super fast/optimized.
Yes, I have already tried it, but it's not very effective for multimodal inputs with large images. For example, a large image can produce around 6000 <image_pad> tokens, which significantly slows down tokenization.
Hmm, and the wins are quite big per your benchmarks. I see that the current code assumes bs==1 and thus no padding. Expanding manually is more involved than the current code suggests; we have done it manually in the past during the model's forward pass, and it has cost us several bugs/issues. The main issue is taking padding into account and putting pad tokens on the correct side when needed. The same applies to the attention mask: we can't simply set it all to 1.
I'd like to hear what @itazap thinks. If we decide to expand those after tokenizing, it would need a utility (inside a tokenizer or processor) that any multimodal model can call
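To make the padding concern concrete, a batch-aware version of the expansion would have to strip the old padding, expand each sequence, then re-pad on the configured side and rebuild the attention mask. A rough sketch (the function name, signature, and behavior are hypothetical, not the PR's or the library's):

```python
def expand_batch(input_ids, attention_mask, media_token_id, counts_per_seq,
                 pad_token_id, padding_side="left"):
    """Expand media placeholder tokens in a padded batch, then re-pad so all
    rows have equal length and the attention mask stays valid.
    Hypothetical sketch of the batch-aware utility discussed above."""
    expanded, masks = [], []
    for row, mask, counts in zip(input_ids, attention_mask, counts_per_seq):
        it = iter(counts)
        new_row = []
        for tok, m in zip(row, mask):
            if m == 0:
                continue  # drop old padding; fresh padding is added below
            if tok == media_token_id:
                new_row.extend([media_token_id] * next(it))
            else:
                new_row.append(tok)
        expanded.append(new_row)
    max_len = max(len(r) for r in expanded)
    for r in expanded:
        pad = [pad_token_id] * (max_len - len(r))
        if padding_side == "left":
            masks.append([0] * len(pad) + [1] * len(r))
            r[:0] = pad  # prepend padding
        else:
            masks.append([1] * len(r) + [0] * len(pad))
            r.extend(pad)  # append padding
    return expanded, masks
```

Usage: with left padding, a row that grows from expansion forces the shorter rows to be re-padded on the left, and the mask must track the new offsets; getting this wrong silently corrupts position ids, which is exactly the class of bug mentioned above.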
Thanks for working on this! Agreed that the performance wins are attractive. Supporting only bs==1 with no padding does make me question the complexity of a tokenizers utility for this. I'm wondering if maybe we can do a hybrid approach where we use your post-tokenization expansion when bs==1 and padding=False. Interested to know if you think that would be useful @zucchini-nlp? Also, on having done it manually in the past: is the performance of this a commonly reported issue?
> is the performance of this a commonly reported issue?
No, this is the first issue mentioning it. I don't think it causes more overhead than other parts of processing such as image processing. I agree that this is hard to implement with bs>1 unless tokenizers gets a prebuilt utility for it. I don't think we want to keep the code if it's only going to run under bs==1, padding=False.
Yes, understood. I would say let's set this aside for now, as it would add maintenance overhead for a solution that currently covers only one case (bs==1, padding=False). Happy to revisit this if it gets more demand from the community.
[For maintainers] Suggested jobs to run (before merge): run-slow: qwen3_vl
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43297&sha=6d7358
By tokenizing only a single <image_pad> or <video_pad> into the input_ids and inserting them in the form of n * grid_thw at the end, the tokenization of Qwen3VL is accelerated.
In SGLang, we discovered that the tokenization process for the Qwen3VL model takes a long time. By applying the above method, we can reduce this time consumption.
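For reference, the per-image/video count n in the description is determined by the patch grid: with grid_thw = (t, h, w), the number of placeholder tokens is t*h*w divided by the square of the spatial merge size. A minimal sketch (spatial_merge_size=2 is an assumption based on the Qwen-VL family's usual configuration, not stated in this PR):

```python
from math import prod

def num_placeholder_tokens(grid_thw, spatial_merge_size=2):
    """How many <image_pad>/<video_pad> tokens one image/video expands to.

    grid_thw is the (temporal, height, width) patch grid of the input;
    spatial_merge_size=2 is assumed from the Qwen-VL family defaults.
    """
    return prod(grid_thw) // (spatial_merge_size ** 2)

print(num_placeholder_tokens((1, 4, 4)))  # single image, 4x4 patch grid → 4
```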
profile

before: (profiling screenshots omitted)

after: (profiling screenshot omitted)
acc

server: (command not shown in the original)

client: `python3 benchmark/mmmu/bench_sglang.py --port 9000 --concurrency 16`

before and after acc: 0.594