# Multi-Image Generation

In this example, you will learn how to generate text from multiple images using the supported models: `Qwen2-VL` (loaded below from a Qwen2.5-VL checkpoint), `Pixtral`, and `llava-interleaved`.

Multi-image generation allows you to pass a list of images to the model and generate text conditioned on all the images.
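
Each model section in this notebook follows the same load, template, generate sequence. As a rough sketch, that sequence could be wrapped in a single helper. `describe_images` and its default arguments are hypothetical names introduced here for illustration; the `mlx_vlm` calls simply mirror the ones used in the cells of this notebook.

```python
def describe_images(model_path, images, question, max_tokens=1000, temperature=0.7):
    """Hypothetical helper: run one multi-image prompt through an mlx_vlm model."""
    if not images:
        raise ValueError("pass at least one image path")

    # Imported lazily so the helper can be defined before mlx_vlm is set up.
    from mlx_vlm import load, apply_chat_template, generate

    model, processor = load(model_path)
    messages = [{"role": "user", "content": question}]
    # num_images is set to the number of images passed to generate.
    prompt = apply_chat_template(
        processor, model.config, messages, num_images=len(images)
    )
    return generate(
        model,
        processor,
        prompt,
        images,
        max_tokens=max_tokens,
        temperature=temperature,
        verbose=False,
    )
```

Under these assumptions, a call such as `describe_images("mlx-community/llava-interleave-qwen-0.5b-bf16", images, "Describe what you see in the images.")` would stand in for one of the model sections.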


In [1]:
from mlx_vlm import load, apply_chat_template, generate
from mlx_vlm.utils import load_image, process_image


In [2]:
images = ["images/cats.jpg", "images/desktop_setup.png"]

messages = [
    {"role": "user", "content": "Describe what you see in the images."}
]

## Qwen2-VL

In [3]:
# Load model and processor
qwen_vl_model, qwen_vl_processor = load("mlx-community/Qwen2.5-VL-7B-Instruct-4bit")
# qwen_vl_model, qwen_vl_processor = load("mlx-community/Qwen2.5-VL-3B-Instruct-3bit")
qwen_vl_config = qwen_vl_model.config



In [4]:
prompt = apply_chat_template(qwen_vl_processor, qwen_vl_config, messages, num_images=len(images))

In [None]:
qwen_vl_output = generate(
    qwen_vl_model,
    qwen_vl_processor,
    prompt,
    images,
    max_tokens=1000,
    temperature=0.7,
    verbose=True
)

Files: ['images/cats.jpg', 'images/desktop_setup.png'] 

Prompt: <|im_start|>system
You are a helpful assistant.<|im_end|>
<|im_start|>user
<|vision_start|><|image_pad|><|vision_end|><|vision_start|><|image_pad|><|vision_end|>Describe what you see in the images.<|im_end|>
<|im_start|>assistant
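
The printed prompt contains one `<|vision_start|><|image_pad|><|vision_end|>` group per image. A quick sanity check, sketched here with a hypothetical helper named `count_image_slots`, is to count those markers and compare the result with `len(images)`. The marker string is taken from the Qwen output above; other models use different placeholder tokens, so treat the default as an assumption.

```python
def count_image_slots(prompt: str, marker: str = "<|image_pad|>") -> int:
    """Hypothetical helper: count image placeholder markers in a templated prompt."""
    return prompt.count(marker)


# Prompt text copied from the Qwen output above.
example_prompt = (
    "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n"
    "<|im_start|>user\n"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "<|vision_start|><|image_pad|><|vision_end|>"
    "Describe what you see in the images.<|im_end|>\n"
    "<|im_start|>assistant\n"
)
images = ["images/cats.jpg", "images/desktop_setup.png"]

# One placeholder per image is expected.
assert count_image_slots(example_prompt) == len(images)
```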



## Pixtral

In [None]:
# Load model and processor
pixtral_model, pixtral_processor = load("mlx-community/pixtral-12b-4bit")
pixtral_config = pixtral_model.config

In [None]:
prompt = apply_chat_template(pixtral_processor, pixtral_config, messages, num_images=len(images))

In [None]:
# Pixtral requires all images to have the same shape in multi-image generation,
# so load each image and resize it to 560x560 before passing it to generate
resized_images = [process_image(load_image(image), (560, 560), None) for image in images]
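
Since Pixtral expects every image in the list to share one shape, it can help to confirm that before calling `generate`. The sketch below uses a hypothetical helper, `all_same_size`, and assumes each resized image exposes a PIL-style `.size` tuple; that is an assumption about what `process_image` returns, not something this notebook confirms.

```python
from collections import namedtuple


def all_same_size(images) -> bool:
    """Hypothetical check: True when every image reports the same .size."""
    return len({tuple(img.size) for img in images}) <= 1


# Stand-in objects with a PIL-style .size attribute, for illustration only.
FakeImage = namedtuple("FakeImage", "size")

assert all_same_size([FakeImage((560, 560)), FakeImage((560, 560))])
assert not all_same_size([FakeImage((640, 480)), FakeImage((560, 560))])
```

With the real data, the equivalent call would be `all_same_size(resized_images)`.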

In [None]:
pixtral_output = generate(
    pixtral_model,
    pixtral_processor,
    prompt,
    resized_images,
    max_tokens=1000,
    temperature=0.7,
    verbose=True
)

## Llava-Interleaved

In [None]:
# Load model and processor
llava_model, llava_processor = load("mlx-community/llava-interleave-qwen-0.5b-bf16")
llava_config = llava_model.config

In [None]:
prompt = apply_chat_template(llava_processor, llava_config, messages, num_images=len(images))

In [None]:
llava_output = generate(
    llava_model,
    llava_processor,
    prompt,
    images,
    max_tokens=1000,
    temperature=0.7,
    verbose=True
)