
Evaluation

Dipkumar Patel edited this page Feb 4, 2026 · 1 revision

PaperBanana includes a VLM-as-a-Judge evaluation system for comparing generated diagrams against human references. This follows the evaluation methodology described in the original paper.

How It Works

The evaluator uses Gemini as a judge to score a generated diagram on four dimensions by comparing it to a human-drawn reference diagram, the source methodology text, and the figure caption.

Scoring uses a hierarchical aggregation scheme:

Primary dimensions (higher weight):

  • Faithfulness: Does the diagram accurately represent the methodology described in the text? Are all components present? Are connections correct?
  • Readability: Can a reader understand the diagram without referring back to the text? Are labels clear? Is the visual hierarchy logical?

Secondary dimensions (lower weight):

  • Conciseness: Does the diagram avoid unnecessary elements? Is it free of visual clutter?
  • Aesthetics: Are the color choices, layout balance, and typography well executed? Does the diagram look polished overall?

The overall score is a weighted combination, with primary dimensions weighted more heavily than secondary ones.
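
As a minimal sketch of this kind of two-tier weighting: the weights below are illustrative assumptions, not PaperBanana's actual values, and `aggregate_scores` is a hypothetical helper name. It only shows the shape of a weighted average in which each primary dimension counts more than each secondary one.

```python
# Hypothetical weights for illustration only; the real values may differ.
PRIMARY_WEIGHT = 0.3    # assumed weight per primary dimension
SECONDARY_WEIGHT = 0.2  # assumed weight per secondary dimension

WEIGHTS = {
    "faithfulness": PRIMARY_WEIGHT,
    "readability": PRIMARY_WEIGHT,
    "conciseness": SECONDARY_WEIGHT,
    "aesthetics": SECONDARY_WEIGHT,
}


def aggregate_scores(scores: dict[str, float]) -> float:
    """Return the weighted average of the four dimension scores."""
    missing = WEIGHTS.keys() - scores.keys()
    if missing:
        raise KeyError(f"missing dimension scores: {sorted(missing)}")
    total_weight = sum(WEIGHTS.values())
    weighted_sum = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    # Dividing by the total keeps the result on the same scale as the inputs.
    return weighted_sum / total_weight
```

Normalizing by the total weight means the assumed weights do not have to sum to exactly one.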

Using the Evaluator

CLI

paperbanana evaluate \
  --generated output.png \
  --reference human_diagram.png \
  --context method.txt \
  --caption "Overview of our framework"

Python API

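import asyncio

from paperbanana.evaluation import evaluate_diagram
from paperbanana.core.config import Settings

# asyncio.run drives the evaluate_diagram call to completion.
scores = asyncio.run(evaluate_diagram(
    generated_path="output.png",           # diagram to be scored
    reference_path="human_diagram.png",    # human-drawn reference
    source_context="Our framework consists of...",  # methodology text
    caption="Overview of our method",      # figure caption
    settings=Settings(),
))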

# scores = {
#   "faithfulness": 0.82,
#   "readability": 0.75,
#   "conciseness": 0.88,
#   "aesthetics": 0.71,
#   "overall": 0.79
# }
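
When several diagrams are evaluated, the per-diagram dictionaries can be averaged dimension by dimension. The sketch below assumes each result has the shape shown in the comment above; `mean_scores` is a hypothetical helper, not part of the PaperBanana API.

```python
from statistics import mean


def mean_scores(results: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension across a list of score dictionaries."""
    if not results:
        raise ValueError("results must contain at least one score dictionary")
    # Use the keys of the first result; every result is assumed to share them.
    return {key: mean(r[key] for r in results) for key in results[0]}
```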

Prompt Templates

The evaluation prompts live in prompts/evaluation/. Each dimension has its own prompt template that instructs the VLM judge on scoring criteria:

prompts/evaluation/
├── faithfulness.txt
├── readability.txt
├── conciseness.txt
└── aesthetics.txt

These can be modified to adjust scoring behavior or emphasis.

Limitations

VLM-as-a-Judge evaluation is approximate. It correlates reasonably with human preferences but is not a substitute for human review, especially for domain-specific diagrams where Gemini may not have the context to judge faithfulness accurately.

The evaluator requires a human reference image. If you don't have one, you can still assess output quality manually, but the automated scoring won't be available.

Scores from different runs are comparable only when the runs use the same VLM and the same prompt templates. Changing either one changes the scoring baseline.
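
One way to guard against accidental baseline drift is to store a fingerprint of the judge configuration next to each score. The sketch below is an assumption about how that could be done, not a PaperBanana feature: `evaluation_fingerprint` is a hypothetical helper, and the model name is whatever identifier the caller uses for the judge.

```python
import hashlib
from pathlib import Path


def evaluation_fingerprint(model_name: str, prompt_dir: str) -> str:
    """Hash the judge model name together with every .txt prompt template."""
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    # Sort the paths so the fingerprint does not depend on listing order.
    for path in sorted(Path(prompt_dir).glob("*.txt")):
        digest.update(b"\0" + path.name.encode("utf-8") + b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Two runs whose fingerprints match used the same model name and byte-identical templates; any difference signals that their scores should not be compared directly.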
