UVLM: Universal Vision-Language Model Loader

UVLM is an open-source Google Colab framework for reproducible benchmarking of Vision-Language Models (VLMs). It provides a unified interface for loading, configuring, and evaluating multiple VLM architectures on custom image analysis tasks — without writing model-specific inference code.

UVLM currently supports two major model families — LLaVA-NeXT and Qwen2.5-VL — which differ fundamentally in their vision encoding, tokenization, and decoding strategies. The framework abstracts these differences behind a single inference function, enabling researchers to compare models using identical prompts and evaluation protocols.

💡 Unified. Reproducible. Accessible. No coding required.

🧠 What does UVLM do?

UVLM combines model loading, prompt engineering, and batch evaluation into a single notebook:

✅ 11 VLM checkpoints — 7 LLaVA-NeXT + 4 Qwen2.5-VL models, from 3B to 110B parameters
🔧 Dual-backend abstraction — automatically routes inference to the correct pipeline (LLaVA or Qwen)
📝 Multi-task prompt builder — configure up to 10 analysis tasks per run with a widget-based UI
🔁 Consensus validation — majority voting across 2–5 repeated inferences for improved reliability
🧠 Flexible reasoning support — adjustable token budget (up to 1,500) for custom chain-of-thought prompts, plus a built-in CoT reference mode for benchmarking
🚨 Truncation detection — exact token counting flags responses that hit the generation limit, with per-task CSV diagnostics
📊 Batch execution — process entire image folders with resume capability and CSV output
⚡ Quantization support — FP16, 8-bit, and 4-bit precision via BitsAndBytes

It requires no local hardware — everything runs on Google Colab with free-tier GPU resources.

📐 Architecture

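UVLM is organized into three sequential blocks, each handling a distinct stage of the benchmarking workflow: Block 1 loads the model, Block 2 defines the analysis tasks and prompts, and Block 3 runs batch inference over an image folder and writes the results to CSV. The architecture diagram is provided in `figure1_architecture.svg`.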

Supported Models

Family	Model	Parameters	Checkpoint ID
LLaVA-NeXT	Mistral 7B	7B	`llava-hf/llava-v1.6-mistral-7b-hf`
	Vicuna 7B	7B	`llava-hf/llava-v1.6-vicuna-7b-hf`
	Vicuna 13B	13B	`llava-hf/llava-v1.6-vicuna-13b-hf`
	34B	34B	`llava-hf/llava-v1.6-34b-hf`
	LLaMA3 8B	8B	`llava-hf/llama3-llava-next-8b-hf`
	72B	72B	`llava-hf/llava-next-72b-hf`
	110B	110B	`llava-hf/llava-next-110b-hf`
Qwen2.5-VL	3B Instruct	3B	`Qwen/Qwen2.5-VL-3B-Instruct`
	7B Instruct	7B	`Qwen/Qwen2.5-VL-7B-Instruct`
	32B Instruct	32B	`Qwen/Qwen2.5-VL-32B-Instruct`
	72B Instruct	72B	`Qwen/Qwen2.5-VL-72B-Instruct`

⚠️ Note: Models with 72B+ parameters exceed single-GPU memory even with 4-bit quantization and require multi-GPU environments. In practice, models up to 34B can be loaded on a single Colab GPU (T4 or A100) with 4-bit quantization.

Task Types

Type	Description	Parser
`numeric`	Integer/float extraction	Extracts last number via regex
`category`	Classification labels	Strips common prefixes, returns cleaned text
`boolean`	Yes/no answers	Normalizes to 1/0
`text`	Free-form responses	Returns cleaned text
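
The four parsers in the table can be pictured roughly as follows. This is a minimal sketch rather than UVLM's actual code: the function names, the `NA` placeholder, and the exact prefix list are assumptions made for illustration.

```python
import re

NA = "NA"  # hypothetical placeholder returned when parsing fails

_NUMBER = re.compile(r"-?\d+(?:\.\d+)?")
# Assumed list of "common prefixes"; the real list may differ.
_PREFIX = re.compile(r"^(?:answer|the answer is|category|label)\s*[:\-]?\s*", re.IGNORECASE)


def clean_text(raw):
    """Trim whitespace and surrounding quotes."""
    return raw.strip().strip("\"'").strip()


def parse_numeric(raw):
    """Return the last number found in the response, as int when possible."""
    matches = _NUMBER.findall(raw)
    if not matches:
        return NA
    value = float(matches[-1])
    return int(value) if value.is_integer() else value


def parse_category(raw):
    """Strip a leading prefix such as 'Answer:' and a trailing period."""
    text = _PREFIX.sub("", clean_text(raw))
    return text.rstrip(".").strip() or NA


def parse_boolean(raw):
    """Normalize yes/no style answers to 1/0."""
    text = clean_text(raw).lower()
    if re.match(r"^(yes|true|1)\b", text):
        return 1
    if re.match(r"^(no|false|0)\b", text):
        return 0
    return NA


def parse_text(raw):
    """Return the cleaned free-form response."""
    return clean_text(raw) or NA


PARSERS = {
    "numeric": parse_numeric,
    "category": parse_category,
    "boolean": parse_boolean,
    "text": parse_text,
}


def parse_response(raw, task_type):
    """Dispatch to the parser registered for the task type."""
    return PARSERS[task_type](raw)
```

For example, `parse_response("There are 3 cars and 12 bikes", "numeric")` picks the last number in the sentence, which is why numeric prompts work best when they ask the model to end with the final count.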

🚀 Quick Start

UVLM runs entirely in Google Colab — no local installation needed.

1. Open the notebook (`UVLM.ipynb`) in Google Colab.
2. Select a GPU runtime: Runtime → Change runtime type → T4 GPU.
3. Run Block 1: Select a model from the dropdown, choose a precision mode (4-bit recommended), and click "Load model".
4. Run Block 2: Define your analysis tasks using the prompt builder form: specify column names, prompts, and task types, and optionally enable consensus validation. Adjust the max-token slider (up to 1,500) if your prompts require longer outputs.
5. Run Block 3: Point to an image folder on Google Drive and execute. Results are saved as CSV.

⚠️ Hugging Face token: Some models (e.g., LLaMA3-based) require authentication. Enable the "Use Hugging Face token" checkbox in Block 1 and paste your token.
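
The settings collected by the Block 2 form can be thought of as a small list of task records. The sketch below is only an illustration of that idea with the limits stated in this README (up to 10 tasks, 2–5 consensus runs, up to 1,500 tokens); the `Task` class, its field names, and `validate_run` are hypothetical and do not come from the notebook.

```python
from dataclasses import dataclass

TASK_TYPES = {"numeric", "category", "boolean", "text"}
MAX_TASKS = 10      # tasks per run, as stated in this README
MAX_TOKENS = 1500   # upper bound of the max-token slider


@dataclass
class Task:
    column: str              # CSV column that receives the parsed answer
    prompt: str              # question sent to the model
    task_type: str           # one of TASK_TYPES
    consensus_runs: int = 1  # 1 means consensus is off; otherwise 2 to 5
    reasoning: bool = False  # built-in CoT reference mode for this task

    def __post_init__(self):
        if self.task_type not in TASK_TYPES:
            raise ValueError(f"Unknown task type: {self.task_type}")
        if self.consensus_runs != 1 and not 2 <= self.consensus_runs <= 5:
            raise ValueError("consensus_runs must be 1 or between 2 and 5")


def validate_run(tasks, max_new_tokens):
    """Check run-level limits and return the output column names."""
    if not 1 <= len(tasks) <= MAX_TASKS:
        raise ValueError(f"A run needs between 1 and {MAX_TASKS} tasks")
    if not 1 <= max_new_tokens <= MAX_TOKENS:
        raise ValueError(f"max_new_tokens must be between 1 and {MAX_TOKENS}")
    columns = [task.column for task in tasks]
    if len(set(columns)) != len(columns):
        raise ValueError("Column names must be unique")
    return columns
```

A run with two tasks might then look like `validate_run([Task("sidewalk", "Is there a sidewalk?", "boolean"), Task("vehicles", "How many motor vehicles are visible?", "numeric", consensus_runs=3)], 256)`.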

🔑 Key Features

Dual-Backend Inference

UVLM automatically detects the model family and routes to the correct pipeline:

LLaVA: LlavaNextProcessor → joint tokenization → model.generate() → full decode → string-based response cleaning
Qwen: AutoProcessor + process_vision_info() → separate vision preprocessing → model.generate(GenerationConfig) → token trimming → batch decode
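
The routing step can be sketched as a lookup from checkpoint ID to backend function. This is a simplified illustration, not the notebook's code: the prefix rule in `detect_backend` is an assumption based on the checkpoint IDs listed above, and `run_llava` / `run_qwen` are stand-in stubs for the two real pipelines.

```python
from typing import Callable, Dict


def detect_backend(checkpoint_id: str) -> str:
    """Map a Hugging Face checkpoint ID to a backend name (assumed prefix rule)."""
    name = checkpoint_id.lower()
    if name.startswith("llava-hf/"):
        return "llava"
    if name.startswith("qwen/qwen2.5-vl"):
        return "qwen"
    raise ValueError(f"Unsupported checkpoint: {checkpoint_id}")


def run_llava(image_path: str, prompt: str) -> str:
    # Stub standing in for: processor -> generate -> full decode -> string cleaning.
    return f"[llava] {prompt} :: {image_path}"


def run_qwen(image_path: str, prompt: str) -> str:
    # Stub standing in for: vision preprocessing -> generate -> token trimming -> batch decode.
    return f"[qwen] {prompt} :: {image_path}"


BACKENDS: Dict[str, Callable[[str, str], str]] = {
    "llava": run_llava,
    "qwen": run_qwen,
}


def infer(checkpoint_id: str, image_path: str, prompt: str) -> str:
    """Single entry point: callers never choose a pipeline themselves."""
    return BACKENDS[detect_backend(checkpoint_id)](image_path, prompt)
```

Keeping the family check in one place means the prompt builder and batch runner can call `infer` with identical arguments for every model.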

Consensus Validation

Run each task 2–5 times per image and take a majority vote to determine the final answer. NA values from failed parses are filtered out before voting. An agreement ratio tracks reliability across all runs.
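
A minimal sketch of the voting step, under two assumptions that this README does not specify: the agreement ratio is taken as the winning vote count divided by the total number of runs (including failed parses), and ties go to the answer seen first. The `consensus` function and the `"NA"` marker are illustrative names.

```python
from collections import Counter

NA = "NA"  # hypothetical marker for a failed parse


def consensus(answers, na_value=NA):
    """Return (majority answer, agreement ratio) for repeated runs of one task."""
    # Failed parses do not take part in the vote.
    valid = [answer for answer in answers if answer != na_value]
    if not valid:
        return na_value, 0.0
    # most_common(1) returns the first-seen answer when counts are tied.
    winner, votes = Counter(valid).most_common(1)[0]
    # Assumed definition: share of all runs that agree with the winner.
    return winner, votes / len(answers)
```

With this definition, `consensus([3, 3, "NA", 4])` yields the answer `3` with an agreement ratio of `0.5`, so a failed parse lowers the ratio even though it cannot win the vote.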

Reasoning Support

UVLM supports two approaches to chain-of-thought reasoning:

User-defined: Write task prompts that request step-by-step explanations and use the max-token slider (up to 1,500) to provide adequate generation budget. This gives full control over reasoning structure.
Built-in reference mode: Enable it per task to trigger a standardized CoT template. The token budget is automatically set to 1,024. This mode is primarily intended for benchmarking; in practice, users are encouraged to design their own reasoning prompts tailored to their specific tasks.

Both approaches store the reasoning trace in a dedicated {column}_reasoning CSV column for inspection.

Truncation Detection

After every inference call, the exact number of generated tokens (counted directly from the model output tensor) is compared against the token limit. Truncated responses are flagged in per-task {column}_truncated CSV columns and trigger console warnings, allowing users to identify insufficient token budgets without post-hoc analysis.
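
The check can be sketched as follows, assuming the output sequence contains the prompt tokens followed by the generated tokens (as is typical for decoder-only generation), so the generated count is the output length minus the prompt length. The function names are hypothetical; only the `{column}_truncated` column name comes from this README.

```python
def count_generated_tokens(output_ids, prompt_length):
    """Count new tokens in a 1-D sequence holding prompt + generation."""
    return max(len(output_ids) - prompt_length, 0)


def truncation_fields(column, output_ids, prompt_length, max_new_tokens):
    """Build the per-task truncation entry for one CSV row."""
    generated = count_generated_tokens(output_ids, prompt_length)
    # Reaching the limit means generation was cut off rather than ended naturally.
    truncated = generated >= max_new_tokens
    if truncated:
        print(f"WARNING: '{column}' hit the {max_new_tokens}-token limit")
    return {f"{column}_truncated": int(truncated)}
```

The sketch accepts any 1-D sequence with a length, so a plain list of token IDs or a single row of the model's output tensor both work.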

Resume-Safe Batch Processing

Block 3 detects already-processed images and skips them. New tasks added between runs trigger automatic CSV schema upgrading. Checkpoints are saved every 5 images.
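
The resume logic can be approximated with the standard `csv` module. This is a simplified sketch, not the notebook's implementation: the `image` key column, the function names, and the choice to fill new task columns with `"NA"` for previously processed images are all assumptions.

```python
import csv
from pathlib import Path


def load_results(csv_path):
    """Return (rows, fieldnames) from an existing results file, or empty lists."""
    path = Path(csv_path)
    if not path.exists():
        return [], []
    with path.open(newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(reader)
        return rows, list(reader.fieldnames or [])


def upgrade_schema(rows, fieldnames, task_columns, fill="NA"):
    """Append columns for tasks added since the last run."""
    new_fields = list(fieldnames) or ["image"]
    for column in task_columns:
        if column not in new_fields:
            new_fields.append(column)
            for row in rows:
                row[column] = fill
    return new_fields


def save_results(csv_path, rows, fieldnames):
    with Path(csv_path).open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def run_batch(image_names, csv_path, task_columns, analyze, checkpoint_every=5):
    """Process only unseen images; `analyze` maps an image name to {column: value}."""
    rows, fieldnames = load_results(csv_path)
    fieldnames = upgrade_schema(rows, fieldnames, task_columns)
    done = {row["image"] for row in rows}
    pending = [name for name in image_names if name not in done]
    for index, name in enumerate(pending, start=1):
        rows.append({"image": name, **analyze(name)})
        if index % checkpoint_every == 0:
            save_results(csv_path, rows, fieldnames)  # periodic checkpoint
    save_results(csv_path, rows, fieldnames)
    return len(pending)
```

Because the file is rewritten at each checkpoint, an interrupted Colab session loses at most the images processed since the last save.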

🧪 Benchmark

UVLM has been benchmarked on 120 French streetscape images across 8 models × 2 inference modes (16 configurations), covering five urban analysis tasks: sidewalk detection, motor vehicle counting, pedestrian entrance counting, street frontage length estimation, and vegetation type classification.

Key findings: Qwen2.5-VL-32B with reasoning achieves the best overall performance (88.0% proximity score), while LLaVA Vicuna 7B in standard mode offers a competitive alternative (83.1%) at a fraction of the computational cost. Model size does not predict performance: LLaVA 34B ranks last (62.2%).

📄 Full benchmark details, dataset, and supplementary materials: see the arXiv paper listed under Citation (arXiv:2603.13893).

📦 Repository Contents

File	Description
`UVLM.ipynb`	Main notebook (all three blocks)
`figure1_architecture.svg`	Architecture diagram (Figure 1)
`figure2_prompt_form.svg`	Prompt builder example (Figure 2)
`UVLM_Project_Complete_Documentation.md`	Full technical documentation
`VERSIONS.txt`	Version history
`LICENSE`	Apache License 2.0
`README.md`	This file

📚 Citation

If you use UVLM in your research, please cite:

Perez, J. and Fusco, G. (2026). UVLM: A Universal Vision-Language Model Loader for Reproducible Multimodal Benchmarking. arXiv:2603.13893. Available at: https://arxiv.org/abs/2603.13893

Related Publications

Perez, J. and Fusco, G. (2025). Streetscape Analysis with Generative AI (SAGAI): Vision-Language Assessment and Mapping of Urban Scenes. Geomatica, 77(2), 100063. Available at: https://www.sciencedirect.com/science/article/pii/S1195103625000199

🪪 License and Attribution

UVLM is released under the Apache License 2.0. This allows use, modification, and redistribution in academic, commercial, and open-source contexts.

Third-party components used in UVLM:

LLaVA-NeXT — Visual instruction tuning models (Apache 2.0)
Qwen2.5-VL — Vision-language models (Apache 2.0)
Hugging Face Transformers — Model loading and inference (Apache 2.0)
BitsAndBytes — Quantization library (MIT)
CLIP — Vision encoder used in LLaVA (MIT)

See NOTICE.md for complete attribution details.

✨ Acknowledgments

This research is supported by the emc2 project co-funded by ANR (France), FFG (Austria), MUR (Italy), and Vinnova (Sweden) under the Driving Urban Transition Partnership, which has been co-funded by the European Commission.

🏢 Developer

UVLM is developed by Joan Perez, founder of Urban Geo Analytics — an independent research and consulting practice focused on geospatial modeling, AI for cities, and open-source urban analytics. 🌐 urbangeoanalytics.com

📫 Feedback and Contributions

Feel free to open an issue or pull request. Contributions and forks are welcome!

🔗 GitHub Discussions — Share use cases, ideas, and extensions.
