Accepted at the 2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - FINDINGS Track (CVPRF).
SPHINX is a synthetic environment for visual perception and reasoning. It combines procedurally generated motifs, tilings, charts, icons, and geometric primitives into 25 benchmark tasks with verifiable answers, enabling both precise evaluation and large-scale training data generation for multimodal models.
- 25 procedurally generated visual reasoning tasks with verifiable answers
- 32,000 training examples and 2,500 evaluation examples
- A large human-model gap on controlled multimodal reasoning tasks
- Released generator code, dataset, project page, interactive demo, and model checkpoints
| Resource | Link |
|---|---|
| Project page | https://maveryn.github.io/sphinx/ |
| Interactive demo | https://maveryn.github.io/sphinx/demo/ |
| Paper | https://arxiv.org/abs/2511.20814 |
| Dataset | https://huggingface.co/datasets/maveryn/sphinx |
| Models collection | https://huggingface.co/collections/maveryn/sphinx-models |
| Qwen3 4B model | https://huggingface.co/maveryn/sphinx-qwen3-4b |
| Qwen3 8B model | https://huggingface.co/maveryn/sphinx-qwen3-8b |

| Path | Description |
|---|---|
| src/sphinx/ | Generator library under the sphinx namespace |
| benchmark/table1/ | Summary TSVs for the main benchmark table |
| benchmark/table3/ | Summary TSV for the RLVR benchmark table |
| demo/ | Published 200-example static subset and demo build utilities |
| tests/ | Registry and generation smoke tests |
| docs/ | Generated project page and interactive demo for GitHub Pages |

This repo is intentionally scoped to the 25 SPHINX benchmark tasks. Scratch
tasks, unused benchmark scripts, and extra source notes from the original
mmr_gym workspace are not included.

Use a clean Python environment: this package installs under the sphinx import
namespace, which is also the import name of the Sphinx documentation generator.

    git clone https://github.com/maveryn/sphinx.git
    cd sphinx
    pip install -e .

List the 25 supported task names:

    python -m sphinx.generate --list-tasks

Generate task-specific examples:

    python -m sphinx.generate \
      --out generated/tasks \
      --samples-per-task 10 \
      --tasks charts_pie shape_count transform_pair_infer

Generate a random sample:

    python -m sphinx.engine --n 100 --out generated/random --workers 4

Generate smoke-test samples for all 25 tasks:

    python -m sphinx.smoke --out smoke_outputs --samples-per-task 25 --seed 42

This writes:

- `smoke_outputs/<task>/sample_XXXX.png`
- `smoke_outputs/<task>/metadata.jsonl`
- `smoke_outputs/<task>/contact_sheet.png`
- `smoke_outputs/summary.json`

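
The output layout listed above can be sanity-checked with a short script. This is a minimal sketch, assuming each task directory holds one metadata.jsonl line per sample_*.png file; the README states neither that correspondence nor the fields inside each record, so the script only counts lines. The names summarize_smoke_outputs and find_mismatches are hypothetical helpers, not part of the package.

```python
import json
from pathlib import Path


def summarize_smoke_outputs(root):
    """Count PNG samples and metadata records in each task directory under root."""
    report = {}
    for task_dir in sorted(p for p in Path(root).iterdir() if p.is_dir()):
        samples = list(task_dir.glob("sample_*.png"))
        metadata_path = task_dir / "metadata.jsonl"
        records = []
        if metadata_path.exists():
            with metadata_path.open(encoding="utf-8") as handle:
                # One JSON object per non-empty line; record fields are not inspected.
                records = [json.loads(line) for line in handle if line.strip()]
        report[task_dir.name] = {
            "samples": len(samples),
            "metadata_records": len(records),
            "has_contact_sheet": (task_dir / "contact_sheet.png").exists(),
        }
    return report


def find_mismatches(report):
    """Return task names whose sample count differs from the metadata record count.

    Assumption: one metadata record per sample image.
    """
    return [
        name
        for name, row in report.items()
        if row["samples"] != row["metadata_records"]
    ]
```

Calling `find_mismatches(summarize_smoke_outputs("smoke_outputs"))` after a smoke run would then list any task directory where the two counts disagree, under the assumption stated above.
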
Generate a subset of tasks:

    python -m sphinx.smoke \
      --out smoke_outputs_subset \
      --samples-per-task 25 \
      --tasks charts_pie shape_count transform_pair_infer

By default, sphinx.generate uses a faster retry policy for expensive tasks.
To restore the original full retry behavior, add --full-retries.
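
When generation is scripted, the same flags can be assembled from Python. The following is a minimal sketch that uses only the options shown in this README (--out, --samples-per-task, --tasks, --full-retries) and invokes the documented module entry point through a subprocess. The helper names build_generate_command and run_generate are hypothetical and not part of the package, and the flag ordering is an assumption about what the CLI accepts.

```python
import subprocess
import sys


def build_generate_command(out_dir, samples_per_task, tasks, full_retries=False):
    """Assemble the argv list for `python -m sphinx.generate`."""
    if not tasks:
        # The README only shows sphinx.generate with explicit task names,
        # so this sketch does not guess what happens when --tasks is omitted.
        raise ValueError("pass at least one task name")
    command = [
        sys.executable, "-m", "sphinx.generate",
        "--out", str(out_dir),
        "--samples-per-task", str(samples_per_task),
    ]
    if full_retries:
        # Opt back in to the original full retry behavior.
        command.append("--full-retries")
    command += ["--tasks", *tasks]
    return command


def run_generate(out_dir, samples_per_task, tasks, full_retries=False):
    """Run the generator; raises CalledProcessError on a non-zero exit code."""
    command = build_generate_command(out_dir, samples_per_task, tasks, full_retries)
    return subprocess.run(command, check=True)
```

For example, `run_generate("generated/tasks", 10, ["charts_pie", "shape_count"], full_retries=True)` corresponds to the task-specific command above with --full-retries added. Using `sys.executable` keeps the call inside the environment where the package was installed.
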
Installed console scripts are also available after pip install -e .:

    sphinx-generate --list-tasks
    sphinx-smoke --out smoke_outputs

Run the lightweight test suite:

    pytest

The tests cover:
- exact registration of the 25 SPHINX tasks
- consistency between the icon manifest and the shipped SVG subset
- one-example generation smoke checks for representative tasks
For a full 25-task generation check, run:

    python -m sphinx.smoke --out smoke_outputs --samples-per-task 25

If you use SPHINX, please cite:

    @inproceedings{alam2026sphinx,
      author = {Md Tanvirul Alam and Saksham Aggarwal and Justin Yang Chae and Nidhi Rastogi},
      title = {SPHINX: A Synthetic Environment for Visual Perception and Reasoning},
      booktitle = {2026 IEEE/CVF Conference on Computer Vision and Pattern Recognition - FINDINGS Track (CVPRF)},
      year = {2026}
    }
