# Python VLM (Vision Language Model) Tutorial

This tutorial demonstrates how to use the VLM Python API for running Vision Language Models on Hailo hardware.

The VLM API accepts text and image inputs in a single prompt. It supports streaming and non-streaming text generation, context management, and configurable generation parameters.

**Key Features:**
- Multi-modal input processing (text + images)
- Streaming and non-streaming text generation
- Support for multiple images per prompt
- Automatic frame validation and conversion
- Context management for multi-turn conversations

**Best Practice: Structured Prompts**
This tutorial uses **structured prompts** (a list of JSON-style messages) in every section except one, which contrasts them with raw prompts. Structured prompts give better control and consistency, and they make effective use of the model's chat template.

**Best Practice: Context Manager**
This tutorial does not use context managers, so that the same VDevice and VLM objects can be shared across notebook cells. In your own code, create VDevice and VLM with `with` statements whenever possible. When not using `with`, call `VLM.release()` and `VDevice.release()` to clean up resources.
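
A minimal sketch of the cleanup pattern for objects created outside a `with` statement. It assumes only that each object exposes a `release()` method, as described above. The helper name `released_on_exit` is hypothetical and not part of the Hailo API.

```python
from contextlib import contextmanager


@contextmanager
def released_on_exit(*resources):
    """Yield the resources, then call release() on each in reverse order.

    Hypothetical helper: it only assumes every resource has a release()
    method, as VDevice and VLM do according to this tutorial.
    """
    try:
        yield resources
    finally:
        # Release in reverse creation order (VLM before VDevice), and keep
        # going if one release() raises so nothing is left allocated.
        errors = []
        for resource in reversed(resources):
            try:
                resource.release()
            except Exception as exc:  # collected and re-raised after cleanup
                errors.append(exc)
        if errors:
            raise errors[0]
```

Used as `with released_on_exit(vdevice, vlm): ...`, this would release the VLM first and the VDevice last, even if generation raises inside the block.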

**Requirements**

* Run the notebook inside the Python virtual environment: ```source hailo_virtualenv/bin/activate```
* A VLM HEF file (Hailo Executable Format for Vision Language Models)
* Image files for testing (JPEG/PNG format)
* OpenCV for image processing: ```pip install opencv-python```

**Memory Optimization (Optional):**

* For large models that may exceed device memory, enable client-side tokenization
* Requires libhailort to be compiled with `HAILO_BUILD_CLIENT_TOKENIZER=ON`
* Requires Rust toolchain (cargo, rustup) to be installed on the build machine
* Set `OPTIMIZE_MEMORY_ON_DEVICE = True` in the configuration section below

**Tutorial Structure:**

* Basic VLM initialization and image frame requirements
* Image conversion using OpenCV (JPEG/PNG to numpy arrays)
* Single image processing with structured prompts
* Text-only generation (VLM without frames)
* Multiple image processing
* Generation parameters and context management
* Advanced features: templates, tokenization, stop tokens

From inside the virtualenv, run `jupyter-notebook <tutorial-dir>` to start a Jupyter server that serves the tutorials (default folder on GitHub: `hailort/libhailort/bindings/python/platform/hailo_tutorials/notebooks/`).


## VLM Tutorial: Setup and Configuration


In [None]:
import numpy as np
import cv2
import os
from hailo_platform import VDevice
from hailo_platform.genai import VLM

# Configuration - Update these paths for your setup
MODEL_PATH = "/your/hef/path/vlm.hef"  # Update this path
SAMPLE_IMAGE_PATH = "/path/to/your/image.jpg"  # Update this path
# Memory Optimization: Enable client-side tokenization for large models
# This reduces device memory usage by moving tokenization to the host
# Requires libhailort to be compiled with HAILO_BUILD_CLIENT_TOKENIZER=ON
OPTIMIZE_MEMORY_ON_DEVICE = False  # Set to True for memory optimization

print("Model path: {}".format(MODEL_PATH))
print("Sample image: {}".format(SAMPLE_IMAGE_PATH))
vdevice = VDevice()
print("Initializing VLM... this may take a moment...")
vlm = VLM(vdevice, MODEL_PATH, OPTIMIZE_MEMORY_ON_DEVICE)
print("VLM initialized successfully!")


## Understanding VLM Frame Requirements


In [None]:
# Get frame requirements from the model
frame_shape = vlm.input_frame_shape()
frame_dtype = vlm.input_frame_format_type()
frame_size = vlm.input_frame_size()
frame_order = vlm.input_frame_format_order()

print("Frame requirements:")
print("  Shape: {} (height, width, channels)".format(frame_shape))
print("  Data type: {}".format(frame_dtype))
print("  Size in bytes: {}".format(frame_size))
print("  Format order: {}".format(frame_order))

height, width, channels = frame_shape
print("  Expected frame format: {}x{}x{} {}".format(height, width, channels, frame_dtype))
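
The requirements printed above can be checked on the host before a frame is sent to the model. The sketch below is a hypothetical helper, not part of the Hailo API. It assumes the expected shape comes from `vlm.input_frame_shape()` and the expected byte count from `vlm.input_frame_size()`; the 4x4 example values are made up.

```python
import numpy as np


def check_frame(frame, expected_shape, expected_nbytes):
    """Raise ValueError if a frame does not match the model's input layout."""
    if not isinstance(frame, np.ndarray):
        raise ValueError("frame must be a numpy array, got {}".format(type(frame).__name__))
    if tuple(frame.shape) != tuple(expected_shape):
        raise ValueError("frame shape {} != expected {}".format(frame.shape, tuple(expected_shape)))
    # A matching shape with a different byte count usually means a wrong dtype.
    if frame.nbytes != expected_nbytes:
        raise ValueError("frame is {} bytes, expected {}".format(frame.nbytes, expected_nbytes))


# Made-up layout: 4x4 RGB uint8 is 48 bytes. Real values come from the model.
good_frame = np.zeros((4, 4, 3), dtype=np.uint8)
check_frame(good_frame, (4, 4, 3), 48)  # passes silently
```

Failing early on the host gives a clearer error message than a mismatch discovered at generation time.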


## Image Conversion Functions Using OpenCV


In [None]:
def convert_image_to_numpy(image_path, target_shape, target_dtype=np.uint8):
    """
    Convert JPEG/PNG image to numpy array with required shape and dtype.
    
    Args:
        image_path (str): Path to the image file
        target_shape (tuple): Target shape (height, width, channels)
        target_dtype: Target numpy data type
        
    Returns:
        numpy.ndarray: Converted image array
    """
    if not os.path.exists(image_path):
        raise FileNotFoundError("Image file not found: {}".format(image_path))
    
    target_height, target_width, target_channels = target_shape
    
    # Read image using OpenCV
    img = cv2.imread(image_path, cv2.IMREAD_COLOR)
    if img is None:
        raise ValueError("Failed to load image: {}".format(image_path))
    
    # Convert BGR to RGB (OpenCV uses BGR by default)
    img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
    
    # Resize if needed
    if img.shape[0] != target_height or img.shape[1] != target_width:
        img = cv2.resize(img, (target_width, target_height), interpolation=cv2.INTER_LINEAR)
        print("Image resized to: {}x{}".format(target_width, target_height))
    
    # Handle channel conversion if needed
    if target_channels == 1 and img.shape[2] == 3:
        img = cv2.cvtColor(img, cv2.COLOR_RGB2GRAY)
        img = np.expand_dims(img, axis=2)
        print("Converted to grayscale")
    elif target_channels == 3 and len(img.shape) == 2:
        img = cv2.cvtColor(img, cv2.COLOR_GRAY2RGB)
        print("Converted to RGB")
    
    # Convert to target data type
    img = img.astype(target_dtype)
    
    print("Final image shape: {}, dtype: {}".format(img.shape, img.dtype))
    return img

def create_sample_image(shape, dtype=np.uint8):
    """
    Create a sample image for testing when no image file is available.
    
    Args:
        shape (tuple): Image shape (height, width, channels)
        dtype: Data type for the image
        
    Returns:
        numpy.ndarray: Generated sample image
    """
    height, width, channels = shape
    
    # Create a gradient pattern
    img = np.zeros(shape, dtype=dtype)
    for i in range(height):
        for j in range(width):
            if channels == 3:
                img[i, j, 0] = int((i / height) * 255)  # Red gradient
                img[i, j, 1] = int((j / width) * 255)   # Green gradient
                img[i, j, 2] = 128                      # Blue constant
            else:
                img[i, j] = int(((i + j) / (height + width)) * 255)
    
    print("Created sample image with shape: {}, dtype: {}".format(img.shape, img.dtype))
    return img

# Test the conversion function
print("Testing image conversion functions...")
if os.path.exists(SAMPLE_IMAGE_PATH):
    test_frame = convert_image_to_numpy(SAMPLE_IMAGE_PATH, frame_shape, frame_dtype)
else:
    print("Sample image not found, creating synthetic image...")
    test_frame = create_sample_image(frame_shape, frame_dtype)
    
print("Test frame ready for VLM processing!")


## Single Image Processing with Streaming Generation

**Important:** The number of images in the 'frames' list must match the number of image entries in the structured prompt content.
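
The matching rule can be checked before calling the model. This is a minimal sketch with hypothetical helper names; it relies only on the structured prompt layout used in this tutorial.

```python
def count_image_placeholders(prompt):
    """Count {"type": "image"} entries across all messages of a structured prompt."""
    return sum(
        1
        for message in prompt
        for part in message.get("content", [])
        if isinstance(part, dict) and part.get("type") == "image"
    )


def check_frame_count(prompt, frames):
    """Raise ValueError when the frames list and the image placeholders disagree."""
    expected = count_image_placeholders(prompt)
    if len(frames) != expected:
        raise ValueError(
            "prompt has {} image placeholder(s) but {} frame(s) were given".format(
                expected, len(frames)
            )
        )


example_prompt = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": "Describe this image in detail."},
    ]}
]
print(count_image_placeholders(example_prompt))  # 1
```

Calling `check_frame_count(prompt, frames)` just before `vlm.generate(...)` turns a count mismatch into an explicit host-side error.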


In [None]:
# Single image structured prompt
single_image_prompt = [
    {"role": "user", "content": [
        {"type": "image"},  # One image placeholder
        {"type": "text", "text": "Describe this image in detail."}
    ]}
]

# Streaming generation with one image
with vlm.generate(prompt=single_image_prompt, frames=[test_frame], max_generated_tokens=40) as generation:
    for token in generation:
        print(token, end="", flush=True)
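
When the streamed text is also needed afterwards, the tokens can be printed and collected in one pass. The sketch below uses a hypothetical helper that accepts any iterable of strings; a plain list stands in for the generation object.

```python
import sys


def stream_and_collect(token_iter, out=None):
    """Print each token as it arrives and return the full response text."""
    out = sys.stdout if out is None else out
    pieces = []
    for token in token_iter:
        out.write(token)
        out.flush()  # show partial output immediately
        pieces.append(token)
    out.write("\n")
    return "".join(pieces)


# A list of strings stands in for the streaming generation object.
text = stream_and_collect(["A ", "red ", "gradient."])
```

Inside the `with vlm.generate(...) as generation:` block, the call would be `stream_and_collect(generation)`.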



## Single Image Processing with Non-Streaming Generation


In [None]:
# Clear context for fresh conversation
vlm.clear_context()

# Non-streaming generation with one image
analysis_prompt = [
    {"role": "user", "content": [
        {"type": "text", "text": "What colors do you see in this image? List the main objects."}, 
        {"type": "image"}  # One image placeholder
    ]}
]

print(vlm.generate_all(prompt=analysis_prompt, frames=[test_frame], max_generated_tokens=40))


## Text-Only Generation (VLM without Images)

A VLM can run without any images if you pass an empty frames list. In this case, it behaves like a regular LLM.


In [None]:
# Text-only structured prompt (no images)
text_only_prompt = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful AI assistant."}]},
    {"role": "user", "content": [{"type": "text", "text": "Explain what a vision language model is in 2 sentences."}]}
]

# Generate with empty frames list
print(vlm.generate_all(prompt=text_only_prompt, frames=[], max_generated_tokens=50))


## Multiple Image Processing

When using multiple images, ensure the frames list contains exactly the same number of images as image placeholders in the prompt.
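
One way to keep the two counts in sync is to derive the placeholders from the frames list. The sketch below is a hypothetical helper, and the strings in `frames` are stand-ins for real numpy frames.

```python
def build_image_prompt(text, num_images, system_text=None):
    """Build a structured prompt with num_images placeholders followed by text."""
    if num_images < 0:
        raise ValueError("num_images must be non-negative")
    messages = []
    if system_text is not None:
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_text}]}
        )
    content = [{"type": "image"} for _ in range(num_images)]
    content.append({"type": "text", "text": text})
    messages.append({"role": "user", "content": content})
    return messages


frames = ["frame_a", "frame_b"]  # stand-ins for numpy frames
prompt = build_image_prompt("Compare these two images.", num_images=len(frames))
```

Because `num_images` is taken from `len(frames)`, adding or removing a frame updates the prompt automatically.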


In [None]:
# Create a second test image (different pattern)
second_test_frame = create_sample_image(frame_shape, frame_dtype)
# Modify the second image to make it different
second_test_frame = second_test_frame * 0.7  # Darker version
second_test_frame = second_test_frame.astype(frame_dtype)

# Multiple images structured prompt
multi_image_prompt = [
    {"role": "user", "content": [
        {"type": "image"},  # First image placeholder
        {"type": "image"},  # Second image placeholder  
        {"type": "text", "text": "Compare these two images. What are the main differences?"}
    ]}
]

# Clear context for fresh conversation
vlm.clear_context()

# Generate with two images - IMPORTANT: frames list must match image count in prompt
with vlm.generate(prompt=multi_image_prompt,
    frames=[test_frame, second_test_frame],  # Two frames for two image placeholders
    max_generated_tokens=100) as gen:
    print("".join(gen))


## Multi-Turn Conversations with Images


In [None]:
# Clear context for fresh conversation
vlm.clear_context()

# Turn 1: Initial image analysis
turn1_prompt = [
    {"role": "system", "content": [{"type": "text", "text": "You are an expert image analyst."}]},
    {"role": "user", "content": [
        {"type": "image"}, 
        {"type": "text", "text": "What's the dominant color in this image?"}
    ]}
]

with vlm.generate(prompt=turn1_prompt, frames=[test_frame], max_generated_tokens=50) as gen:
    print("".join(gen))

# Turn 2: Follow-up question (context maintained automatically)
turn2_prompt = [
    {"role": "user", "content": [{"type": "text", "text": "Can you suggest what type of scene this might be?"}]}
]

print(vlm.generate_all(prompt=turn2_prompt, frames=[], max_generated_tokens=50))
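
The two turns above follow the same pattern, which can be wrapped in a small function. This is a sketch with a hypothetical helper name. It uses only the `generate_all(prompt=..., frames=..., max_generated_tokens=...)` call shown in this tutorial and assumes the model keeps the conversation context between calls.

```python
def ask(vlm, text, frames=(), max_generated_tokens=50):
    """Send one user turn and return the reply.

    One image placeholder is added per frame, so the counts always match.
    """
    content = [{"type": "image"} for _ in frames]
    content.append({"type": "text", "text": text})
    prompt = [{"role": "user", "content": content}]
    return vlm.generate_all(
        prompt=prompt,
        frames=list(frames),
        max_generated_tokens=max_generated_tokens,
    )
```

A conversation would then be `vlm.clear_context()`, followed by `ask(vlm, "What's the dominant color in this image?", frames=[test_frame])` and a text-only follow-up such as `ask(vlm, "Can you suggest what type of scene this might be?")`.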


## Generation Parameters Example
The `temperature` parameter controls how random the sampling is: lower values make the output more deterministic, and higher values make it more varied.
Other configurable parameters are listed in the API documentation.
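
The effect of temperature can be illustrated with the standard temperature-scaled softmax. This is a generic sketch of the technique, not the Hailo implementation, and the logits are made-up values.

```python
import math


def softmax_with_temperature(logits, temperature):
    """Return sampling probabilities for logits divided by temperature."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [2.0, 1.0, 0.5]  # made-up scores for three candidate tokens
low = softmax_with_temperature(logits, 0.1)
high = softmax_with_temperature(logits, 0.9)
print("T=0.1:", [round(p, 3) for p in low])
print("T=0.9:", [round(p, 3) for p in high])
```

At the low temperature nearly all probability mass sits on the top-scoring token, while at the higher temperature the alternatives keep a meaningful share.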



In [None]:
test_prompt = [
    {"role": "user", "content": [
        {"type": "image"}, 
        {"type": "text", "text": "Describe this image creatively."}
    ]}
]

# Test different temperature settings

# Low temperature (more deterministic)
vlm.clear_context()
response_low = vlm.generate_all(
    prompt=test_prompt, 
    frames=[test_frame], 
    temperature=0.1, 
    seed=42, 
    max_generated_tokens=40
)
print("Low temperature (0.1): {}".format(response_low))

# High temperature (more creative)
vlm.clear_context()
response_high = vlm.generate_all(
    prompt=test_prompt, 
    frames=[test_frame], 
    temperature=0.9, 
    seed=42, 
    max_generated_tokens=40
)
print("High temperature (0.9): {}".format(response_high))


## Raw Prompts vs Structured Prompts Example

VLM supports both structured prompts (recommended) and raw text prompts with special tokens.
The example below uses the special tokens of the Qwen model family. The special tokens and prompt structure for the loaded model can be obtained with `vlm.prompt_template()`.
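
To show how the two forms relate, the sketch below renders a structured prompt into a Qwen-style raw string. It is an illustration only: the function name is hypothetical, the token layout is taken from the Qwen-style tokens used in this section, and the real template for a given model should be read from `vlm.prompt_template()`.

```python
IMAGE_TOKENS = "<|vision_start|><|image_pad|><|vision_end|>"


def render_qwen_style(messages):
    """Render structured messages into a Qwen-style raw prompt string."""
    parts = []
    for message in messages:
        # Each content item becomes either the image tokens or its text.
        body = "".join(
            IMAGE_TOKENS if item["type"] == "image" else item["text"]
            for item in message["content"]
        )
        parts.append("<|im_start|>{}\n{}<|im_end|>\n".format(message["role"], body))
    parts.append("<|im_start|>assistant\n")  # cue the model to answer
    return "".join(parts)


messages = [
    {"role": "system", "content": [{"type": "text", "text": "You are a helpful assistant."}]},
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image"},
        {"type": "image"},
    ]},
]
print(repr(render_qwen_style(messages)))
```

Structured prompts are recommended because the API performs this rendering with the model's own template, so the code does not have to change when the model family does.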


In [None]:
vlm.clear_context()

# Raw prompt with vision tokens (model-specific format)
raw_prompt = "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\nDescribe this image<|vision_start|><|image_pad|><|vision_end|><|im_end|>\n<|im_start|>assistant\n"

print(vlm.generate_all(prompt=raw_prompt, frames=[test_frame], max_generated_tokens=40, seed=100))
print()

vlm.clear_context()

# Structured prompt (recommended)
structured_prompt = [
    {"role": "user", "content": [
        {"type": "text", "text": "Describe this image"}, 
        {"type": "image"}
    ]}
]

print(vlm.generate_all(prompt=structured_prompt, frames=[test_frame], max_generated_tokens=40, seed=100))
print()

print("Model prompt template:")
print(vlm.prompt_template())


## Tokenization Example


In [None]:
test_texts = [
    "Describe this image",
    "Vision language model with Hailo",
    "What colors do you see in this image?"
]

print("Tokenization examples:")
for text in test_texts:
    tokens = vlm.tokenize(text)
    print("'{}' -> {} tokens: {}".format(text, len(tokens), tokens))
    
print()


## Stop Tokens and Recovery Sequence Example


In [None]:
# Get current stop tokens
original_stop_tokens = vlm.get_stop_tokens()
print("Original stop tokens: {}".format(original_stop_tokens))

# Test with custom stop tokens
custom_stop_tokens = [".", "END"]
vlm.set_stop_tokens(custom_stop_tokens)
print("Custom stop tokens: {}".format(vlm.get_stop_tokens()))

test_prompt = [
    {"role": "user", "content": [
        {"type": "image"}, 
        {"type": "text", "text": "List three things you see. 1. Color patterns 2. Shapes 3. Textures."}
    ]}
]

vlm.clear_context()
response = vlm.generate_all(prompt=test_prompt, frames=[test_frame], max_generated_tokens=50)
print("Response with custom stop tokens: {}".format(response))

# Reset stop tokens
vlm.set_stop_tokens(original_stop_tokens)
print("Reset stop tokens: {}".format(vlm.get_stop_tokens()))
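
The save, set, and restore steps above can be wrapped so that the original stop tokens come back even if generation raises. This is a sketch with a hypothetical helper name; it uses only the `get_stop_tokens()` and `set_stop_tokens()` calls shown in this tutorial.

```python
from contextlib import contextmanager


@contextmanager
def temporary_stop_tokens(vlm, stop_tokens):
    """Set stop tokens for the duration of a with-block, then restore the originals."""
    original = vlm.get_stop_tokens()
    vlm.set_stop_tokens(stop_tokens)
    try:
        yield vlm
    finally:
        # Runs even if generation raises, so custom tokens never leak out.
        vlm.set_stop_tokens(original)
```

Usage would look like `with temporary_stop_tokens(vlm, [".", "END"]): response = vlm.generate_all(...)`.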


## Summary and Best Practices

**Key Points to Remember:**

1. **Frame Count Matching**: Always ensure the number of frames matches the number of image placeholders in your structured prompt
2. **Image Format**: Use `vlm.input_frame_shape()`, `vlm.input_frame_format_type()` to get correct format requirements
3. **OpenCV Conversion**: Use BGR→RGB conversion when loading images with OpenCV
4. **Text-Only Mode**: VLM can work without images if you pass an empty frames list `[]`
5. **Context Management**: Use `vlm.clear_context()` to start fresh conversations
6. **Resource Cleanup**: Always call `vlm.release()` and `vdevice.release()` when done

**Structured Prompt Format for VLM:**
```python
prompt = [
    {"role": "user", "content": [
        {"type": "text", "text": "Your text here"},
        {"type": "image"},  # One per image in frames list
        {"type": "text", "text": "More text if needed"}
    ]}
]
```


In [None]:
# Clean up resources (best practice: use context managers when possible)
vlm.release()
vdevice.release()
print("Resources released successfully")
print("VLM tutorial completed!")
