
Qwen3-VL Processor combines images in the same batch into one message #41709

@weathon

Description


System Info

  • transformers version: 5.0.0.dev0
  • Platform: Linux-6.8.0-85-generic-x86_64-with-glibc2.35
  • Python version: 3.10.17
  • Huggingface_hub version: 1.0.0.rc6
  • Safetensors version: 0.5.3
  • Accelerate version: 1.10.1
  • Accelerate config: not found
  • DeepSpeed version: 0.18.0
  • PyTorch version (accelerator?): 2.8.0+cu126 (CUDA)
  • Using distributed or parallel set-up in script?:
  • Using GPU in script?:
  • GPU type: NVIDIA RTX A6000

Who can help?

@ArthurZucker and @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

The model is a simple custom head on top of Qwen3-VL-4B.

import json
import random

import torch

# `dims` (the list of annotation dimensions) and `processor`
# (the Qwen3-VL processor) are defined elsewhere in the script.

with open("prompts.json", "r") as f:
    prompt_dict = json.load(f)

def format_data(sample):
    images = list(sample["image"])
    dims_selected = []
    for idx in range(len(images)):
        try:
            if random.random() > 0.5:
                # sample a dimension whose score is >= 0
                dims_selected.append(random.choice([d for d in dims if sample["annotation"][idx][d] >= 0]))
            else:
                # sample a dimension whose score is < 0
                dims_selected.append(random.choice([d for d in dims if sample["annotation"][idx][d] < 0]))
        except IndexError:
            # no annotation for this image; fall back to any dimension
            dims_selected.append(random.choice(dims))

    prompts = [prompt_dict[dim] for dim in dims_selected]
    messages = [[
        {
            "role": "user",
            "content": [
                {"type": "image", "image": images[i]},
                {"type": "text", "text": prompt},
            ],
        }
    ] for i, prompt in enumerate(prompts)]

    text = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        return_dict=True,
        return_tensors="pt",
        padding=True,
    )
    inputs = processor(
        images=[[image] for image in images],  # one sub-list per sample
        text=text,
        return_tensors="pt",
        padding=True,
    )
    print(inputs["pixel_values"].shape)  # torch.Size([45312, 1536])
    inputs["text"] = text
    # score < 0 -> label 1, score > 0 -> label 0, score == 0 -> label 0.5
    answers = [1 if ann[dim] < 0 else (0 if ann[dim] > 0 else 0.5)
               for ann, dim in zip(sample["annotation"], dims_selected)]
    inputs["labels"] = torch.tensor(answers)
    inputs["dim"] = [dims.index(dim) for dim in dims_selected]
    return inputs
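For clarity, the `answers` comprehension above maps each annotation score to a soft target. A stand-alone sketch of that rule (pure Python; `score_to_label` is an illustrative name, not part of the script above):

```python
# Label rule: negative score -> 1, positive score -> 0, zero -> 0.5.
def score_to_label(score: float) -> float:
    if score < 0:
        return 1.0
    if score > 0:
        return 0.0
    return 0.5

labels = [score_to_label(s) for s in (-2.0, 3.0, 0.0)]
print(labels)  # [1.0, 0.0, 0.5]
```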

Expected behavior

The shape of `pixel_values` should be B x L x D, but I got (B*L) x D. I also tried putting the images inside `messages` directly; the result is the same.
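If the Qwen3-VL processor behaves like Qwen2-VL's (an assumption on my part), this flattening may be by design: every image's patches are concatenated along dim 0, and the returned `image_grid_thw` (one (t, h, w) row per image) is what recovers the per-image boundaries. A minimal sketch of that arithmetic, with made-up grid values:

```python
from itertools import accumulate

# Hypothetical image_grid_thw values for a 2-image batch:
# each row is (temporal, height, width) in patch units.
grid_thw = [(1, 32, 48), (1, 16, 16)]

# The processor concatenates every image's patches along dim 0, so
# pixel_values ends up with shape (sum of t*h*w, patch_dim) rather
# than (batch, patches_per_image, patch_dim).
counts = [t * h * w for t, h, w in grid_thw]   # patches per image
offsets = [0, *accumulate(counts)]             # split points into the flat tensor

# per_image[i] would then be pixel_values[offsets[i]:offsets[i + 1]]
print(counts)   # [1536, 256]
print(offsets)  # [0, 1536, 1792]
```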
