UVLM is an open-source Google Colab framework for reproducible benchmarking of Vision-Language Models (VLMs). It provides a unified interface for loading, configuring, and evaluating multiple VLM architectures on custom image analysis tasks — without writing model-specific inference code.
UVLM currently supports two major model families — LLaVA-NeXT and Qwen2.5-VL — which differ fundamentally in their vision encoding, tokenization, and decoding strategies. The framework abstracts these differences behind a single inference function, enabling researchers to compare models using identical prompts and evaluation protocols.
💡 Unified. Reproducible. Accessible. No coding required.
UVLM combines model loading, prompt engineering, and batch evaluation into a single notebook:
- ✅ 11 VLM checkpoints — 7 LLaVA-NeXT + 4 Qwen2.5-VL models, from 3B to 110B parameters
- 🔧 Dual-backend abstraction — automatically routes inference to the correct pipeline (LLaVA or Qwen)
- 📝 Multi-task prompt builder — configure up to 10 analysis tasks per run with a widget-based UI
- 🔁 Consensus validation — majority voting across 2–5 repeated inferences for improved reliability
- 🧠 Flexible reasoning support — adjustable token budget (up to 1,500) for custom chain-of-thought prompts, plus a built-in CoT reference mode for benchmarking
- 🚨 Truncation detection — exact token counting flags responses that hit the generation limit, with per-task CSV diagnostics
- 📊 Batch execution — process entire image folders with resume capability and CSV output
- ⚡ Quantization support — FP16, 8-bit, and 4-bit precision via BitsAndBytes
It requires no local hardware — everything runs on Google Colab with free-tier GPU resources.
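UVLM is organized into three sequential blocks, each handling a distinct stage of the benchmarking workflow: model loading (Block 1), task and prompt definition (Block 2), and batch execution over an image folder (Block 3). The supported checkpoints are: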
| Family | Model | Parameters | Checkpoint ID |
|---|---|---|---|
| LLaVA-NeXT | Mistral 7B | 7B | `llava-hf/llava-v1.6-mistral-7b-hf` |
| LLaVA-NeXT | Vicuna 7B | 7B | `llava-hf/llava-v1.6-vicuna-7b-hf` |
| LLaVA-NeXT | Vicuna 13B | 13B | `llava-hf/llava-v1.6-vicuna-13b-hf` |
| LLaVA-NeXT | 34B | 34B | `llava-hf/llava-v1.6-34b-hf` |
| LLaVA-NeXT | LLaMA3 8B | 8B | `llava-hf/llama3-llava-next-8b-hf` |
| LLaVA-NeXT | 72B | 72B | `llava-hf/llava-next-72b-hf` |
| LLaVA-NeXT | 110B | 110B | `llava-hf/llava-next-110b-hf` |
| Qwen2.5-VL | 3B Instruct | 3B | `Qwen/Qwen2.5-VL-3B-Instruct` |
| Qwen2.5-VL | 7B Instruct | 7B | `Qwen/Qwen2.5-VL-7B-Instruct` |
| Qwen2.5-VL | 32B Instruct | 32B | `Qwen/Qwen2.5-VL-32B-Instruct` |
| Qwen2.5-VL | 72B Instruct | 72B | `Qwen/Qwen2.5-VL-72B-Instruct` |
⚠️ Note: Models with 72B+ parameters exceed single-GPU memory even with 4-bit quantization and require multi-GPU environments. In practice, models up to 34B can be loaded on a single Colab GPU (T4 or A100) with 4-bit quantization.
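As a rough guide to the memory note above, the weight footprint of a checkpoint can be estimated from its parameter count and precision. The sketch below is a back-of-the-envelope calculation, not UVLM code: the function name is hypothetical, and it counts weights only, ignoring activations, the KV cache, the vision encoder's image features, and quantization overhead, so real usage will be higher.

```python
def weight_memory_gb(n_params_billion, bits):
    """Approximate weight storage in GB (1 GB = 1e9 bytes).

    Assumption: every parameter is stored at the same precision.
    """
    # billions of parameters * bits per parameter / 8 bits per byte
    return n_params_billion * bits / 8


if __name__ == "__main__":
    for size in (7, 34, 72, 110):
        fp16 = weight_memory_gb(size, 16)
        int4 = weight_memory_gb(size, 4)
        print(f"{size}B: FP16 ~{fp16:.0f} GB, 4-bit ~{int4:.0f} GB")
```

Comparing these figures with the memory of the GPU assigned by Colab gives a first indication of whether a checkpoint is worth attempting at a given precision.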
| Type | Description | Parser |
|---|---|---|
| `numeric` | Integer/float extraction | Extracts last number via regex |
| `category` | Classification labels | Strips common prefixes, returns cleaned text |
| `boolean` | Yes/no answers | Normalizes to 1/0 |
| `text` | Free-form responses | Returns cleaned text |
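The four parsers in the table above could look roughly like the following. This is a minimal sketch under stated assumptions, not the notebook's actual code: the function names, the regex, the list of prefixes, and the use of `None` for a failed parse are all illustrative choices.

```python
import re


def parse_numeric(text):
    """Return the last non-negative number in the response, or None."""
    matches = re.findall(r"\d+(?:\.\d+)?", text)
    if not matches:
        return None  # failed parse (assumed NA representation)
    value = float(matches[-1])
    return int(value) if value.is_integer() else value


def parse_boolean(text):
    """Map the first yes/no style word to 1/0, or None if none is found."""
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in {"yes", "true"}:
            return 1
        if word in {"no", "false"}:
            return 0
    return None


def parse_category(text, prefixes=("answer:", "the answer is", "category:")):
    """Strip common lead-in phrases (hypothetical list) and punctuation."""
    cleaned = text.strip()
    for prefix in prefixes:
        if cleaned.lower().startswith(prefix):
            cleaned = cleaned[len(prefix):].strip()
    return cleaned.strip(" .\"'")


def parse_text(text):
    """Collapse runs of whitespace and trim the ends."""
    return " ".join(text.split())
```

Taking the last number rather than the first suits responses that reason before answering, such as "I count 3 cars and 2 vans, so 5 in total."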
UVLM runs entirely in Google Colab — no local installation needed.
1. Open the notebook in Google Colab.
2. Select a GPU runtime: Runtime → Change runtime type → T4 GPU.
3. Run Block 1: Select a model from the dropdown, choose a precision mode (4-bit recommended), and click "Load model".
4. Run Block 2: Define your analysis tasks using the prompt builder form. Specify column names, prompts, and task types, and optionally enable consensus validation. Adjust the max-token slider (up to 1,500) if your prompts require longer outputs.
5. Run Block 3: Point to an image folder on Google Drive and execute. Results are saved as CSV.
⚠️ Hugging Face token: Some models (e.g., LLaMA3-based) require authentication. Enable the "Use Hugging Face token" checkbox in Block 1 and paste your token.
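The task definitions entered in Block 2 amount to a small configuration per output column. The sketch below shows one way to represent and validate such a configuration using the limits stated in this README (up to 10 tasks, 2–5 consensus runs, up to 1,500 tokens). The `Task` class, its field names, and `validate_tasks` are hypothetical; the notebook collects these values through widgets and may store them differently.

```python
from dataclasses import dataclass

TASK_TYPES = ("numeric", "category", "boolean", "text")
MAX_TASKS = 10          # tasks per run
MAX_NEW_TOKENS = 1500   # upper end of the max-token slider


@dataclass
class Task:
    column: str              # name of the CSV output column
    prompt: str              # question sent to the model
    task_type: str = "text"
    consensus_runs: int = 1  # assumption: 1 means consensus is disabled


def validate_tasks(tasks, max_new_tokens):
    """Return a list of human-readable problems (empty if the run is valid)."""
    problems = []
    if not 1 <= len(tasks) <= MAX_TASKS:
        problems.append(f"expected 1-{MAX_TASKS} tasks, got {len(tasks)}")
    if not 1 <= max_new_tokens <= MAX_NEW_TOKENS:
        problems.append(f"max_new_tokens must be in 1-{MAX_NEW_TOKENS}")
    seen = set()
    for task in tasks:
        if task.column in seen:
            problems.append(f"duplicate column name: {task.column}")
        seen.add(task.column)
        if task.task_type not in TASK_TYPES:
            problems.append(f"{task.column}: unknown type {task.task_type!r}")
        if task.consensus_runs != 1 and not 2 <= task.consensus_runs <= 5:
            problems.append(f"{task.column}: consensus runs must be 2-5")
    return problems
```

For example, `validate_tasks([Task("sidewalk", "Is there a sidewalk?", "boolean", 3)], 256)` returns an empty list.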
UVLM automatically detects the model family and routes to the correct pipeline:
- LLaVA: `LlavaNextProcessor` → joint tokenization → `model.generate()` → full decode → string-based response cleaning
- Qwen: `AutoProcessor` + `process_vision_info()` → separate vision preprocessing → `model.generate(GenerationConfig)` → token trimming → batch decode
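The routing idea can be illustrated with a small dispatcher. This is a sketch only: detecting the family from the checkpoint ID string is an assumption about how detection might work, and the two pipeline functions are placeholders that return the step names listed above rather than running a model.

```python
def detect_family(checkpoint_id):
    """Guess the model family from a Hugging Face checkpoint ID (assumed rule)."""
    name = checkpoint_id.lower()
    if "qwen2.5-vl" in name:
        return "qwen"
    if "llava" in name:
        return "llava"
    raise ValueError(f"Unsupported checkpoint: {checkpoint_id}")


def run_llava_pipeline(image_path, prompt):
    # Placeholder for: joint tokenization -> generate -> full decode -> cleaning
    return {"family": "llava", "image": image_path, "prompt": prompt}


def run_qwen_pipeline(image_path, prompt):
    # Placeholder for: vision preprocessing -> generate -> trimming -> decode
    return {"family": "qwen", "image": image_path, "prompt": prompt}


PIPELINES = {"llava": run_llava_pipeline, "qwen": run_qwen_pipeline}


def run_inference(checkpoint_id, image_path, prompt):
    """Single entry point: callers never touch family-specific code."""
    family = detect_family(checkpoint_id)
    return PIPELINES[family](image_path, prompt)
```

Keeping the dispatch table separate from the detection rule means a new model family would only need one new pipeline function and one new table entry.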
Each task can be run 2–5 times per image, with majority voting determining the final answer. NA values from failed parses are filtered out before voting. An agreement ratio tracks reliability across all runs.
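A minimal sketch of this voting step is shown below. Two details are assumptions rather than documented behavior: failed parses are represented as `None`, and the agreement ratio is computed as the winning vote count divided by the total number of runs, including failed ones.

```python
from collections import Counter


def consensus(answers):
    """Majority vote over repeated answers for one image and one task.

    Returns (final_answer, agreement_ratio). None entries are failed parses.
    """
    valid = [a for a in answers if a is not None]
    if not valid:
        return None, 0.0
    # On a tie, Counter keeps the answer that was seen first.
    winner, votes = Counter(valid).most_common(1)[0]
    return winner, votes / len(answers)


if __name__ == "__main__":
    final, ratio = consensus([3, 3, None, 4, 3])
    print(final, ratio)
```

A low ratio on a given image is a useful signal that the prompt is ambiguous or that the model is unstable on that input.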
UVLM supports two approaches to chain-of-thought reasoning:
- User-defined: Write task prompts that request step-by-step explanations and use the max-token slider (up to 1,500) to provide adequate generation budget. This gives full control over reasoning structure.
- Built-in reference mode: Enable it per task to apply a standardized CoT template, with the token budget automatically set to 1,024. This mode is primarily intended for benchmarking; in practice, users are encouraged to design their own reasoning prompts tailored to their specific tasks.
Both approaches store the reasoning trace in a dedicated {column}_reasoning CSV column for inspection.
After every inference call, the exact number of generated tokens (counted directly from the model output tensor) is compared against the token limit. Truncated responses are flagged in per-task {column}_truncated CSV columns and trigger console warnings, allowing users to identify insufficient token budgets without post-hoc analysis.
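The check described above can be sketched as follows. Plain Python lists stand in for the model's output tensor, and the function names are hypothetical; only the `{column}_truncated` column name comes from this README. Note that under this simple rule a response that ends naturally on exactly the last allowed token would also be flagged.

```python
def count_generated_tokens(output_ids, prompt_length):
    """Number of new tokens, given the full output sequence and prompt length."""
    return max(len(output_ids) - prompt_length, 0)


def truncation_flag(column, output_ids, prompt_length, max_new_tokens):
    """Return the per-task CSV field that marks a truncated response."""
    n_generated = count_generated_tokens(output_ids, prompt_length)
    truncated = n_generated >= max_new_tokens
    if truncated:
        print(f"Warning: '{column}' hit the {max_new_tokens}-token limit")
    return {f"{column}_truncated": truncated}


if __name__ == "__main__":
    # 20 prompt tokens followed by 10 generated tokens, with a limit of 10
    ids = list(range(30))
    print(truncation_flag("vehicles", ids, prompt_length=20, max_new_tokens=10))
```

Counting token IDs avoids the inaccuracy of estimating length from the decoded string, where one word may span several tokens.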
Block 3 detects already-processed images and skips them. New tasks added between runs trigger automatic CSV schema upgrading. Checkpoints are saved every 5 images.
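The resume logic can be approximated with the standard `csv` module. This is a sketch under assumptions, not the notebook's code: the `image` key column, the function names, and the `analyze` callback are hypothetical, and the schema upgrade here only adds empty columns for new tasks without re-running images that were already processed.

```python
import csv
import os


def load_results(csv_path):
    """Read existing rows and column names; both are empty if no file exists."""
    if not os.path.exists(csv_path):
        return [], []
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        rows = list(reader)
        return rows, list(reader.fieldnames or [])


def save_results(csv_path, rows, columns):
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=columns, restval="")
        writer.writeheader()
        writer.writerows(rows)


def run_batch(image_names, task_columns, csv_path, analyze, checkpoint_every=5):
    """Process unseen images only; returns how many were processed."""
    rows, columns = load_results(csv_path)
    columns = columns or ["image"]
    # Schema upgrade: append columns for tasks added since the last run.
    for col in task_columns:
        if col not in columns:
            columns.append(col)
    done = {row["image"] for row in rows}
    processed = 0
    for name in image_names:
        if name in done:
            continue  # already in the CSV from a previous run
        rows.append({"image": name, **analyze(name)})
        processed += 1
        if processed % checkpoint_every == 0:
            save_results(csv_path, rows, columns)  # periodic checkpoint
    save_results(csv_path, rows, columns)
    return processed
```

Here `analyze` is expected to return a dict keyed by the task column names, for example `{"vehicles": 4}`. Rewriting the whole file at each checkpoint keeps the sketch simple at the cost of extra I/O on large folders.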
UVLM has been benchmarked on 120 French streetscape images across 8 models × 2 inference modes (16 configurations), covering five urban analysis tasks: sidewalk detection, motor vehicle counting, pedestrian entrance counting, street frontage length estimation, and vegetation type classification.
Key findings: Qwen2.5-VL-32B with reasoning achieves the best overall performance (88.0% proximity score), while LLaVA Vicuna 7B in standard mode offers a competitive alternative (83.1%) at a fraction of the computational cost. Model size does not predict performance: LLaVA 34B ranks last (62.2%).
📄 Full benchmark details, dataset, and supplementary materials: [link to arXiv paper — to be added upon publication]
| File | Description |
|---|---|
| `UVLM.ipynb` | Main notebook (all three blocks) |
| `figure1_architecture.svg` | Architecture diagram (Figure 1) |
| `figure2_prompt_form.svg` | Prompt builder example (Figure 2) |
| `UVLM_Project_Complete_Documentation.md` | Full technical documentation |
| `VERSIONS.txt` | Version history |
| `LICENSE` | Apache License 2.0 |
| `README.md` | This file |
If you use UVLM in your research, please cite:
Perez, J. and Fusco, G. (2026). UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking. arXiv:2603.13893. Available at: https://arxiv.org/abs/2603.13893
Perez, J. and Fusco, G. (2025). Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes. Geomatica, 77(2), 100063. Available at: https://www.sciencedirect.com/science/article/pii/S1195103625000199
UVLM is released under the Apache License 2.0. This allows use, modification, and redistribution in academic, commercial, and open-source contexts.
Third-party components used in UVLM:
- LLaVA-NeXT — Visual instruction tuning models (Apache 2.0)
- Qwen2.5-VL — Vision-language models (Apache 2.0)
- Hugging Face Transformers — Model loading and inference (Apache 2.0)
- BitsAndBytes — Quantization library (MIT)
- CLIP — Vision encoder used in LLaVA (MIT)
See NOTICE.md for complete attribution details.
This research is supported by the emc2 project co-funded by ANR (France), FFG (Austria), MUR (Italy), and Vinnova (Sweden) under the Driving Urban Transition Partnership, which has been co-funded by the European Commission.
UVLM is developed by Joan Perez, founder of Urban Geo Analytics — an independent research and consulting practice focused on geospatial modeling, AI for cities, and open-source urban analytics. 🌐 urbangeoanalytics.com
Feel free to open an issue or pull request. Contributions and forks are welcome!
🔗 GitHub Discussions — Share use cases, ideas, and extensions.