[Feat] Reduces redundant tokenization of <pad> tags to accelerate Qwen3VL.#43297
ZLkanyo009 wants to merge 2 commits into huggingface:main
Conversation
```python
for token_id in input_ids:
    if token_id == self.video_token_id and global_video_token_idx < len(video_token_counts):
        # Expand 1 video_token to N tokens
        num_tokens = video_token_counts[global_video_token_idx]
        new_input_ids.extend([self.video_token_id] * num_tokens)
        global_video_token_idx += 1
    else:
        new_input_ids.append(token_id)
```
Hmm, I wonder if Qwen-VL uses the Rust tokenizer backend, which is supposed to be super fast/optimized.
Yes, I have already tried it, but it's not very effective for multimodal inputs with large images. For example, a large image can produce around 6000 <image_pad> tokens, which significantly slows down tokenization.
Hmm, and the wins are quite big per your benchmarks. I see that the current code assumes bs==1 and thus no padding. Expanding manually is more involved than the current code suggests; we have done it manually in the past during the model's forward pass, and it has cost us several bugs/issues. The main issue is taking padding into account and putting pad tokens on the correct side when needed. The same applies to the attention mask: we can't simply set it all to 1.
I'd like to hear what @itazap thinks. If we decide to expand those after tokenizing, it would need a utility (inside a tokenizer or processor) that any multimodal model can call
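To make the padding concern concrete, a batch-aware version of the expansion would have to strip the old padding, expand each sequence, then re-pad on the configured side and rebuild the attention mask. A rough sketch (the function name, signature, and behavior are hypothetical, not the PR's or the library's):

```python
def expand_batch(input_ids, attention_mask, media_token_id, counts_per_seq,
                 pad_token_id, padding_side="left"):
    """Expand media placeholder tokens in a padded batch, then re-pad so all
    rows have equal length and the attention mask stays valid.
    Hypothetical sketch of the batch-aware utility discussed above."""
    expanded, masks = [], []
    for row, mask, counts in zip(input_ids, attention_mask, counts_per_seq):
        it = iter(counts)
        new_row = []
        for tok, m in zip(row, mask):
            if m == 0:
                continue  # drop old padding; fresh padding is added below
            if tok == media_token_id:
                new_row.extend([media_token_id] * next(it))
            else:
                new_row.append(tok)
        expanded.append(new_row)
    max_len = max(len(r) for r in expanded)
    for r in expanded:
        pad = [pad_token_id] * (max_len - len(r))
        if padding_side == "left":
            masks.append([0] * len(pad) + [1] * len(r))
            r[:0] = pad  # prepend padding
        else:
            masks.append([1] * len(r) + [0] * len(pad))
            r.extend(pad)  # append padding
    return expanded, masks
```

Usage: with left padding, a row that grows from expansion forces the shorter rows to be re-padded on the left, and the mask must track the new offsets; getting this wrong silently corrupts position ids, which is exactly the class of bug mentioned above.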
Thanks for working on this! Agreed that the performance wins are attractive. Supporting only bs==1 with no padding does make me question the complexity of a tokenizers utility for this. I'm wondering if maybe we can do a hybrid approach where we use your post-tokenization expansion when bs==1 and padding=False. Interested to know if you think that would be useful @zucchini-nlp? Also, on having done it manually in the past: is the performance of this a commonly reported issue?
> is the performance of this a commonly reported issue?
No, this is the first issue mentioning it. I don't think it causes more overhead than other parts of processing such as image processing. I agree that this is hard to implement with bs>1 unless tokenizers gets a prebuilt utility for it. I don't think we want to keep the code if it's only going to run under bs==1, padding=False.
Yes, understood. I would say let's set this aside for now, as it would add maintenance overhead for a solution that currently covers only one case (bs==1, padding=False). Happy to revisit this if it gets more demand from the community.
[For maintainers] Suggested jobs to run (before merge): run-slow: qwen3_vl
View the CircleCI Test Summary for this PR: https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=43297&sha=6d7358
By tokenizing only a single <image_pad> or <video_pad> into the input_ids and inserting them in the form of n * grid_thw at the end, the tokenization of Qwen3VL is accelerated.
In SGLang, we discovered that the tokenization process for the Qwen3VL model takes a long time. By applying the above method, we can reduce this time consumption.
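For reference, the per-image/video count n in the description is determined by the patch grid: with grid_thw = (t, h, w), the number of placeholder tokens is t*h*w divided by the square of the spatial merge size. A minimal sketch (spatial_merge_size=2 is an assumption based on the Qwen-VL family's usual configuration, not stated in this PR):

```python
from math import prod

def num_placeholder_tokens(grid_thw, spatial_merge_size=2):
    """How many <image_pad>/<video_pad> tokens one image/video expands to.

    grid_thw is the (temporal, height, width) patch grid of the input;
    spatial_merge_size=2 is assumed from the Qwen-VL family defaults.
    """
    return prod(grid_thw) // (spatial_merge_size ** 2)

print(num_placeholder_tokens((1, 4, 4)))  # single image, 4x4 patch grid → 4
```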
profile

before: (profiling screenshots omitted)

after: (profiling screenshot omitted)
acc

server: (command not shown in the original)

client: `python3 benchmark/mmmu/bench_sglang.py --port 9000 --concurrency 16`

before and after acc: 0.594