PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

✨ Overview

PRISM is a benchmark for studying programmatic spatial-temporal reasoning in Manim code generation. This repository bundles both the released dataset and an evaluation-first toolkit for rendering, deterministic spatial audit, PADVC / TD scoring, and text-expansion analysis.

The released PRISM dataset is bundled under data/. It contains 10,372 examples in total, with 5,199 English prompts and 5,173 Chinese prompts.

Note

PRISM bundles both the released bilingual benchmark and the full evaluation toolkit. If you only want to score an existing Manim output directory, you can skip the optional generation components entirely.

🧪 Evaluation

PRISM evaluates outputs through a progressive funnel:

Execution: code-level reliability, including API misuse, syntax errors, and LaTeX failures.
Spatial: geometric reasoning quality, including overlap, leak, and out-of-bounds issues.
PADVC & TD: visual richness and temporal density once basic correctness is satisfied.

Warning

PADVC_center and TD_center are not plug-and-play defaults. To obtain meaningful comparisons, prepare the required local OCR / embedding assets and fit reference parameters on your own curated reference set first.

🧭 Workflow

Generated Manim scripts (.py)
        |
        v
render_directory.py  ->  videos/*.mp4
        |
        v
audit_batch.py       ->  audit/results.json
        |
        +--> score_padvc.py  ->  padvc_final/padvc_scores.jsonl
        |
        +--> score_td.py     ->  td_final/td_center_scores.jsonl
        |
        +--> compute_text_expansion.py  ->  info_expand_final/

The main end-to-end entrypoint is:

scripts/run_evaluation_pipeline.sh \
  your_model_run/cleaned_scripts \
  results/eval_your_model \
  your_model_run/task_manifest.json \
  data/your_prompts.jsonl \
  results/reference_padvc/padvc_norm_params.json \
  results/reference_td/td_center_params.json \
  your_model_name

📦 Repository Layout

Path	Purpose
`scripts/`	Command-line tools for evaluation, metric fitting, and optional generation
`manim_bench/llm_call/`	Minimal LLM client wrapper used by the optional generation flow
`docs/`	Technical notes for data format, metrics, and spatial audit semantics
`examples/`	Toy inputs and example configs for smoke tests
`data/`	Released PRISM dataset plus local dataset workspace
`results/`	Default local workspace for generated outputs and summaries

🚀 Quickstart

1. Install Python dependencies

python3 -m venv .venv
source .venv/bin/activate
python -m pip install --upgrade pip
python -m pip install -r requirements.txt

2. Check the runtime

python scripts/check_environment.py

3. Prepare OCR and embedding models for PADVC

Default OCR backend:

rapidocr-onnxruntime
optional fallback: paddleocr

Embedding models used by scripts/padvc.py:

Chinese: shibing624/text2vec-base-chinese
English / multilingual: sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2

The PADVC pipeline defaults to offline Hugging Face mode:

export PADVC_HF_CACHE=/path/to/huggingface/hub
# or point directly to local snapshots
export PADVC_ZH_MODEL=/path/to/text2vec-base-chinese
export PADVC_EN_MODEL=/path/to/paraphrase-multilingual-MiniLM-L12-v2

export PADVC_OCR_BACKEND=rapidocr
export PADVC_OCR_CACHE_DIR=results/ocr_cache
export PADVC_DEBUG=0

4. Evaluate an existing model run

Expected inputs:

your_model_run/cleaned_scripts/*.py
your_model_run/task_manifest.json
a prompt JSONL such as data/your_prompts.jsonl
fitted reference parameters for PADVC and TD

Then run:

scripts/run_evaluation_pipeline.sh \
  your_model_run/cleaned_scripts \
  results/eval_your_model \
  your_model_run/task_manifest.json \
  data/your_prompts.jsonl \
  results/reference_padvc/padvc_norm_params.json \
  results/reference_td/td_center_params.json \
  your_model_name

If your outputs were not produced by scripts/generate_code.py, prepare the minimal manifest and prompt JSONL formats described in docs/data_format.md.

🧪 Fit Reference Parameters

PADVC_center and TD_center require reference statistics fitted on your own curated reference set:

python scripts/fit_reference_padvc.py \
  --dataset-jsonl data/your_reference_dataset.jsonl \
  --video-root results/reference_videos \
  --output-dir results/reference_padvc

python scripts/fit_reference_td.py \
  --dataset-jsonl results/reference_padvc/padvc_reference_raw_scores.jsonl \
  --output-dir results/reference_td

The example files under examples/params/ are placeholders for smoke tests only.

🤖 Optional Generation

If you also want to generate model outputs inside this repository:

cp manim_bench/llm_call/config.example.json manim_bench/llm_call/config.json

You can also choose a different config path:

export MANIM_BENCH_LLM_CONFIG=/path/to/config.json

Then run:

python scripts/generate_code.py \
  --input-jsonl examples/sample_prompts.jsonl \
  --instruction-field instruction \
  --model your-model-name \
  --workers 2 \
  --temperature 0.7 \
  --output-dir results/example_generation

🛠️ System Requirements

Recommended environment:

Linux or macOS
Python 3.10+
Manim Community Edition 0.19.0
FFmpeg
Cairo / Pango / pkg-config
LaTeX toolchain for Tex and MathTex
CJK-capable fonts if you render Chinese text

Platform package examples

Ubuntu:

sudo apt-get update
sudo apt-get install -y \
  ffmpeg pkg-config libcairo2-dev libpango1.0-dev \
  texlive texlive-latex-extra texlive-fonts-recommended \
  texlive-xetex dvisvgm ghostscript \
  fonts-noto-cjk fontconfig

macOS:

brew install ffmpeg cairo pango pkg-config mactex-no-gui font-noto-sans-cjk

📚 Documentation

docs/data_format.md: expected JSON / JSONL layouts
docs/spatial_audit.md: spatial-audit semantics
docs/metrics.md: PADVC, TD, and text-expansion overview
docs/code_error_taxonomy.md: code-failure categories
scripts/README.md: script inventory and command examples

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

✨ Overview

🧪 Evaluation

🧭 Workflow

📦 Repository Layout

🚀 Quickstart

1. Install Python dependencies

2. Check the runtime

3. Prepare OCR and embedding models for PADVC

4. Evaluate an existing model run

🧪 Fit Reference Parameters

🤖 Optional Generation

🛠️ System Requirements

📚 Documentation

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 6 Commits
data		data
docs		docs
examples		examples
figs		figs
manim_bench		manim_bench
results		results
scripts		scripts
.gitignore		.gitignore
README.md		README.md
README.zh.md		README.zh.md
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

PRISM: A Benchmark for Programmatic Spatial-Temporal Reasoning

✨ Overview

🧪 Evaluation

🧭 Workflow

📦 Repository Layout

🚀 Quickstart

1. Install Python dependencies

2. Check the runtime

3. Prepare OCR and embedding models for PADVC

4. Evaluate an existing model run

🧪 Fit Reference Parameters

🤖 Optional Generation

🛠️ System Requirements

📚 Documentation

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages