Evaluation
PaperBanana includes a VLM-as-a-Judge evaluation system for comparing generated diagrams against human references. This follows the evaluation methodology described in the original paper.
The evaluator uses Gemini as a judge to score a generated diagram on four dimensions by comparing it to a human-drawn reference diagram, the source methodology text, and the figure caption.
Scoring uses a hierarchical aggregation scheme:
Primary dimensions (higher weight):
- Faithfulness: Does the diagram accurately represent the methodology described in the text? Are all components present? Are connections correct?
- Readability: Can a reader understand the diagram without referring back to the text? Are labels clear? Is the visual hierarchy logical?
Secondary dimensions (lower weight):
- Conciseness: Does the diagram avoid unnecessary elements? Is it free of visual clutter?
- Aesthetics: Are the color choices, layout balance, and typography well executed? Does the diagram look visually polished overall?
The overall score is a weighted combination, with primary dimensions weighted more heavily than secondary ones.
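As a rough illustration of the weighting idea above, here is a minimal sketch of a weighted average. The weight values and the `overall_score` helper are assumptions made for this example only; the source states that primary dimensions outweigh secondary ones but does not give the actual weights or aggregation code.

```python
# Hypothetical weights: chosen only so that primary > secondary.
# They are NOT taken from PaperBanana.
WEIGHTS = {
    "faithfulness": 0.3,  # primary
    "readability": 0.3,   # primary
    "conciseness": 0.2,   # secondary
    "aesthetics": 0.2,    # secondary
}


def overall_score(scores, weights=WEIGHTS):
    """Return the weighted average of per-dimension scores."""
    missing = set(weights) - set(scores)
    if missing:
        raise KeyError(f"missing dimensions: {sorted(missing)}")
    total_weight = sum(weights.values())
    # Normalize by the total weight so the result stays on the same scale
    # as the inputs even if the weights do not sum to exactly 1.
    return sum(weights[d] * scores[d] for d in weights) / total_weight


example = {
    "faithfulness": 0.8,
    "readability": 0.6,
    "conciseness": 0.9,
    "aesthetics": 0.5,
}
print(round(overall_score(example), 2))
```

With these illustrative weights, a change in a primary dimension moves the overall score 1.5 times as much as the same change in a secondary dimension.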
From the command line:

paperbanana evaluate \
  --generated output.png \
  --reference human_diagram.png \
  --context method.txt \
  --caption "Overview of our framework"

From Python:

import asyncio

from paperbanana.evaluation import evaluate_diagram
from paperbanana.core.config import Settings

# evaluate_diagram is a coroutine, so run it inside an event loop.
scores = asyncio.run(evaluate_diagram(
    generated_path="output.png",
    reference_path="human_diagram.png",
    source_context="Our framework consists of...",
    caption="Overview of our method",
    settings=Settings(),
))
# scores = {
# "faithfulness": 0.82,
# "readability": 0.75,
# "conciseness": 0.88,
# "aesthetics": 0.71,
# "overall": 0.79
# }

The evaluation prompts live in prompts/evaluation/. Each dimension has its own prompt template that instructs the VLM judge on scoring criteria:
prompts/evaluation/
├── faithfulness.txt
├── readability.txt
├── conciseness.txt
└── aesthetics.txt
These can be modified to adjust scoring behavior or emphasis.
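Before editing the templates, it can help to inspect what the judge is currently told. The sketch below reads the four files listed above into a dictionary. The `load_prompt_templates` helper is hypothetical and not part of PaperBanana, and it assumes the directory is reachable at the path you pass in (by default, relative to the current working directory).

```python
from pathlib import Path

DIMENSIONS = ("faithfulness", "readability", "conciseness", "aesthetics")


def load_prompt_templates(prompt_dir=Path("prompts/evaluation")):
    """Read one template per dimension; hypothetical helper for inspection."""
    prompt_dir = Path(prompt_dir)
    templates = {}
    for dimension in DIMENSIONS:
        path = prompt_dir / f"{dimension}.txt"
        if not path.is_file():
            raise FileNotFoundError(f"missing prompt template: {path}")
        templates[dimension] = path.read_text(encoding="utf-8")
    return templates
```

Calling `load_prompt_templates()` from the repository root and printing `templates["faithfulness"]` would show the criteria text before you change it.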
VLM-as-a-Judge evaluation is approximate. It correlates reasonably with human preferences but is not a substitute for human review, especially for domain-specific diagrams where Gemini may not have the context to judge faithfulness accurately.
The evaluator requires a human reference image. If you don't have one, you can still assess output quality manually, but the automated scoring won't be available.
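Since a reference image is required, a script that batches evaluations may want to check its inputs before calling the judge. A minimal sketch follows; `missing_inputs` is a hypothetical helper written for this example, not a PaperBanana function.

```python
from pathlib import Path


def missing_inputs(generated_path, reference_path):
    """Return the input paths that do not exist as files (hypothetical helper)."""
    candidates = (generated_path, reference_path)
    return [str(p) for p in candidates if not Path(p).is_file()]
```

If the returned list is non-empty, the script can skip automated scoring for that diagram and flag it for manual review instead.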
Scores across different runs are comparable only when using the same VLM model and prompt templates. Changing models or prompts changes the scoring baseline.
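One way to keep track of comparability is to store a fingerprint of the scoring setup next to each set of scores. The sketch below hashes the judge model name together with the prompt template contents. The `scoring_fingerprint` helper and the idea of storing it alongside results are assumptions for illustration, not an existing PaperBanana feature.

```python
import hashlib
from pathlib import Path


def scoring_fingerprint(model_name, prompt_dir):
    """Hash the model name and every *.txt template into one hex digest.

    Hypothetical helper: two runs are comparable only if this value matches.
    """
    digest = hashlib.sha256()
    digest.update(model_name.encode("utf-8"))
    # Sort the files so the digest does not depend on directory listing order.
    for path in sorted(Path(prompt_dir).glob("*.txt")):
        digest.update(b"\0" + path.name.encode("utf-8") + b"\0")
        digest.update(path.read_bytes())
    return digest.hexdigest()
```

Two score sets whose fingerprints differ were produced under different models or prompts and should not be compared directly.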