MM-WebAgent is a hierarchical agentic framework for multimodal webpage generation. Instead of treating webpage creation as pure HTML/CSS coding, it turns the task into an automated design process that can natively coordinate AIGC tools for generating images, videos, charts, and webpage layouts.
Compared with traditional code-based agents, MM-WebAgent produces webpages with better multimodal integration, stronger style consistency, and more visually harmonious designs. Rather than generating code and assets in isolation, it explicitly models webpage creation as a structured process: first planning the page globally, then generating local multimodal elements under contextual constraints, and finally refining the full page through hierarchical multi-level reflection.
This hierarchical design is the key difference. MM-WebAgent jointly improves:
- global layout coherence at the page level,
- local asset quality for images, videos, and charts,
- and cross-element consistency between generated content and surrounding HTML/CSS.
As a result, the framework can make substantially better use of AIGC tools than standard code-centric agents, leading to webpages that are not only functional, but also more polished, more coherent, and more aesthetically pleasing.
To support research in this direction, we also introduce MM-WebGEN-Bench, a multi-level benchmark for evaluating multimodal webpage generation across diverse scenes, styles, and element compositions.
conda create -n webgen python=3.12 -y
conda activate webgen
pip install -r requirements.txt
python -m playwright install chromium

The generation pipeline uses npx http-server to serve local files for screenshots. Install Node.js if it is not available:
# macOS (Homebrew)
brew install node
# Ubuntu / Debian
sudo apt install nodejs npm
# Or use conda
conda install -c conda-forge nodejs

Verify installation:
node --version
npx --version

Charts are rendered via ECharts and load the runtime from a CDN (jsDelivr), so chart rendering and evaluation require network access.
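The same Node.js check can be scripted before a long generation run. The following is a minimal sketch, not part of the repository: the helper names `check_tool` and `missing_tools` are hypothetical, and it only assumes that `node` and `npx` should be discoverable on `PATH`.

```python
import shutil
import subprocess
from typing import List, Optional, Sequence


def check_tool(name: str) -> Optional[str]:
    """Return the tool's reported version string, or None if it is not on PATH."""
    path = shutil.which(name)
    if path is None:
        return None
    # Equivalent to running `<tool> --version` in a shell.
    result = subprocess.run(
        [path, "--version"], capture_output=True, text=True, check=False, timeout=30
    )
    return result.stdout.strip() or result.stderr.strip()


def missing_tools(names: Sequence[str] = ("node", "npx")) -> List[str]:
    """List the tools from `names` that cannot be found on PATH."""
    return [name for name in names if shutil.which(name) is None]


if __name__ == "__main__":
    for tool in ("node", "npx"):
        version = check_tool(tool)
        print(f"{tool}: {version if version is not None else 'not found'}")
```

If `missing_tools()` returns a non-empty list, install Node.js with one of the commands above before running the pipeline.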
Set the following environment variables:
# Required
export OPENAI_API_KEY="<your-openai-api-key>"
# Optional: override base URL and model names
export OPENAI_BASE_URL="https://api.openai.com/v1"
export OPENAI_MODEL_GPT52="gpt-5.2"
export OPENAI_MODEL_GPT51="gpt-5.1"
export OPENAI_MODEL_GPT41="gpt-4.1"
export OPENAI_MODEL_GPT4O="gpt-4o"
export OPENAI_IMAGE_MODEL="gpt-image-1"
export OPENAI_IMAGE_EDIT_MODEL="gpt-image-1"
# Optional: video generation (requires --enable-video flag)
export OPENAI_VIDEO_API_KEY="$OPENAI_API_KEY"
export OPENAI_VIDEO_BASE_URL="https://api.openai.com/v1"
export OPENAI_VIDEO_MODEL="sora-2"

python workflow/run_generation.py \
    --data-path datasets/evaluation_dataset.jsonl \
    --limit 1 \
    --planner-model gpt-5.2 \
    --save-dir outputs/mm_webagent/gpt-5.2

Key options:
| Flag | Default | Description |
|---|---|---|
| --data-path | datasets/evaluation_dataset.jsonl | Input JSONL dataset |
| --save-dir | outputs/workflow_v3 | Output directory |
| --planner-model | gpt-5.2 | Model for planning |
| --limit | all | Number of samples to generate |
| --start_idx | 0 | Start index |
| --enable-video | off | Enable video generation (mp4) |
| --planner-workers | 8 | Parallel planner calls |
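One plausible reading of --data-path, --start_idx, and --limit is that they select a contiguous slice of the JSONL dataset. The sketch below illustrates that reading only; `load_samples` is a hypothetical helper, and the actual selection logic in workflow/run_generation.py may differ.

```python
import json
from pathlib import Path
from typing import Dict, List, Optional


def load_samples(
    data_path: str, start_idx: int = 0, limit: Optional[int] = None
) -> List[Dict]:
    """Read a JSONL file and return records in [start_idx, start_idx + limit).

    limit=None mirrors the documented default of "all" (assumed semantics).
    """
    samples = []
    with Path(data_path).open(encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:  # skip blank lines
                samples.append(json.loads(line))
    end = None if limit is None else start_idx + limit
    return samples[start_idx:end]
```

Under this assumption, `load_samples("datasets/evaluation_dataset.jsonl", limit=1)` would return only the first prompt, matching the example command above.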
Output structure (outputs/<exp_name>/<model_name>/<case_id>/):
├── main.html # Assembled webpage
├── planner_output.json # Structured plan
├── run_summary.json # Generation log
├── *.png / *.mp4 # Generated images/videos
└── *.html # Chart subpages
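After a run, it can be useful to confirm that every case directory contains the files listed above. This is a minimal sketch under the assumption that each case directory sits directly under the model directory, as in the path pattern shown; `find_incomplete_cases` is a hypothetical helper, not part of the repository.

```python
from pathlib import Path
from typing import Dict, List

# File names taken from the output structure described above.
EXPECTED_FILES = ("main.html", "planner_output.json", "run_summary.json")


def find_incomplete_cases(model_dir: str) -> Dict[str, List[str]]:
    """Map each case directory name to the expected files it is missing."""
    incomplete: Dict[str, List[str]] = {}
    for case_dir in sorted(Path(model_dir).iterdir()):
        if not case_dir.is_dir():
            continue
        missing = [name for name in EXPECTED_FILES if not (case_dir / name).is_file()]
        if missing:
            incomplete[case_dir.name] = missing
    return incomplete
```

For example, `find_incomplete_cases("outputs/mm_webagent/gpt-5.2")` returns an empty dictionary when every case has all three files.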
python benchmark/run_benchmark_eval.py \
    --exp_dir outputs/mm_webagent

Behavior is configurable via benchmark/configs/experiment.yaml. Defaults:
- Multi-level evaluation (global, image, video, chart)
- Chart reflection — 3 rounds
- Image reflection — 3 rounds
- Global reflection — 3 rounds
- Outputs eval_result_final.json / eval_best.json
MM-WebGEN-Bench (datasets/evaluation_dataset.jsonl) — 120 curated webpage design prompts covering 11 scene categories, 11 visual styles, and diverse multimodal compositions (4 video types, 8 image types, 17 chart types).
JSONL format:
{"file_id": "001", "input": "Create a modern landing page for a robotics startup..."}

If you find this work useful, please cite:
@article{li2026mmwebagent,
title={MM-WebAgent: A Hierarchical Multimodal Web Agent for Webpage Generation},
author={Yan Li and Zezi Zeng and Yifan Yang and Yuqing Yang and Ning Liao and Weiwei Guo and Lili Qiu and Mingxi Cheng and Qi Dai and Zhendong Wang and Zhengyuan Yang and Xue Yang and Ji Li and Lijuan Wang and Chong Luo},
journal={arXiv preprint arXiv:2604.15309},
year={2026}
}

This project is licensed under the MIT License.

