
Qwen3-VL Processor combines images in the same batch into one message #41709

@weathon

Description


System Info

  • transformers version: 5.0.0.dev0
  • Platform: Linux-6.8.0-85-generic-x86_64-with-glibc2.35
  • Python version: 3.10.17
  • Huggingface_hub version: 1.0.0.rc6
  • Safetensors version: 0.5.3
  • Accelerate version: 1.10.1
  • Accelerate config: not found
  • DeepSpeed version: 0.18.0
  • PyTorch version (accelerator?): 2.8.0+cu126 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA RTX A6000

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The model is a simple custom head on top of Qwen3-VL-4B.

import json
import random

import torch

# `dims` (the list of annotation dimensions) and `processor`
# (the Qwen3-VL processor) are defined elsewhere in the script.

with open("prompts.json", "r") as f:
    prompt_dict = json.load(f)

def format_data(sample):
    images = list(sample["image"])
    dims_selected = []
    for idx in range(len(images)):
        try:
            if random.random() > 0.5:
                # sample a dimension whose score is >= 0
                dims_selected.append(random.choice([d for d in dims if sample["annotation"][idx][d] >= 0]))
            else:
                # sample a dimension whose score is < 0
                dims_selected.append(random.choice([d for d in dims if sample["annotation"][idx][d] < 0]))
        except IndexError:
            # no annotation for this image; fall back to any dimension
            dims_selected.append(random.choice(dims))

    prompts = [prompt_dict[dim] for dim in dims_selected]
    messages = [[
        {
            "role": "user",
            "content": [
                {"type": "image", "image": images[i]},
                {"type": "text", "text": prompt},
            ],
        }
    ] for i, prompt in enumerate(prompts)]

    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    )
    inputs = processor(
        images=[[image] for image in images],  # one sub-list per sample
        text=text,
        return_tensors="pt",
        padding=True,
    )
    print(inputs["pixel_values"].shape)  # torch.Size([45312, 1536])
    inputs["text"] = text
    # score < 0 -> label 1, score > 0 -> label 0, score == 0 -> label 0.5
    answers = [1 if ann[dim] < 0 else (0 if ann[dim] > 0 else 0.5)
               for ann, dim in zip(sample["annotation"], dims_selected)]
    inputs["labels"] = torch.tensor(answers)
    inputs["dim"] = [dims.index(dim) for dim in dims_selected]
    return inputs
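For clarity, the `answers` comprehension above maps each annotation score to a soft target. A stand-alone sketch of that rule (pure Python; `score_to_label` is an illustrative name, not part of the script above):

```python
# Label rule: negative score -> 1, positive score -> 0, zero -> 0.5.
def score_to_label(score: float) -> float:
    if score < 0:
        return 1.0
    if score > 0:
        return 0.0
    return 0.5

labels = [score_to_label(s) for s in (-2.0, 3.0, 0.0)]
print(labels)  # [1.0, 0.0, 0.5]
```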

Expected behavior

The shape of `pixel_values` should be B x L x D, but I got (B*L) x D. I also tried putting the images inside `messages` directly; the result is the same.
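If the Qwen3-VL processor behaves like Qwen2-VL's (an assumption on my part), this flattening may be by design: every image's patches are concatenated along dim 0, and the returned `image_grid_thw` (one (t, h, w) row per image) is what recovers the per-image boundaries. A minimal sketch of that arithmetic, with made-up grid values:

```python
from itertools import accumulate

# Hypothetical image_grid_thw values for a 2-image batch:
# each row is (temporal, height, width) in patch units.
grid_thw = [(1, 32, 48), (1, 16, 16)]

# The processor concatenates every image's patches along dim 0, so
# pixel_values ends up with shape (sum of t*h*w, patch_dim) rather
# than (batch, patches_per_image, patch_dim).
counts = [t * h * w for t, h, w in grid_thw]   # patches per image
offsets = [0, *accumulate(counts)]             # split points into the flat tensor

# per_image[i] would then be pixel_values[offsets[i]:offsets[i + 1]]
print(counts)   # [1536, 256]
print(offsets)  # [0, 1536, 1792]
```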
