Research directions for improving the diagram quality #4
After looking through the codebase, here’s the order I’d tackle these items, based on what’s already implemented and what will deliver the biggest payoff.

1) Start with #1 - Eval Harness
This needs to come first because it makes every other improvement measurable. The judge system, 4-dimension scoring, and the ablation runner are already there. The main missing piece is a batch runner that can run across a full test set and summarize results. Without that, it’s hard to separate real progress from gut feel.

2) #3 (Reference Expansion) and #4 (Multi-Venue Styles)
These are high-impact and mostly content work, not deep engineering.

3) #7 - Prompt Optimization
The prompt system is already template-driven and parameterized, so running prompt variants and comparing them is a natural next step. This is also where a lot of quality tends to come from.

4) #2 - Open-Source Backends
This fits well once evaluation is in place. The provider architecture is clean, and the OpenAI provider already supports custom base URLs, which gets you close to supporting setups like vLLM or Ollama.

5) #5 (Adaptive Routing) and #8 (Hybrid Generation)
These are solid upgrades, but they’ll require more refactoring. The current structure can support them, but they’re not “drop-in” changes.

6) Last, #6 - Hierarchical Decomposition
This is the biggest lift. Right now the planner outputs flat text with no component model, so doing real hierarchical decomposition would mean reworking a significant chunk of the pipeline. It’s worth doing, but only once the rest is stable.

In short: measure first (#1), then low-effort quality gains (#3, #4, #7), then infrastructure (#2), and finally the deeper architectural work (#5, #6, #8).

I can handle this. I’ll open separate PRs for each task, step by step, starting with a PR for the first task (Eval Harness). For tasks #5 and #6, we can discuss more detail later, after we’ve completed up to #4.
Now that the core pipeline is functional, I want to open a thread to collect ideas for where this implementation could go next, both as a practical tool and as a research vehicle.
A few directions I've been thinking about:
1. Evaluation and benchmarking
The paper introduces PaperBananaBench (292 test cases from NeurIPS 2025), but we don't yet have an automated way to run it end-to-end against our implementation. Building a reproducible eval harness, possibly with VLM-as-judge scoring across faithfulness, conciseness, readability, and aesthetics, would let us measure the effect of any changes we make rather than eyeballing outputs.
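To make the harness idea concrete, here is a rough sketch of the batch-and-summarize step. It assumes two callables that are not taken from the codebase: a hypothetical `generate` that runs the pipeline on one test case and returns an image path, and a hypothetical `judge` that returns one score per dimension. Only the four dimension names come from the description above.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Iterable, List

# The four scoring dimensions named above; everything else is hypothetical.
DIMENSIONS = ("faithfulness", "conciseness", "readability", "aesthetics")


@dataclass
class BenchCase:
    case_id: str
    source_text: str  # e.g. the methodology text fed to the pipeline


# Assumed signatures, not the project's real interfaces.
Generate = Callable[[BenchCase], str]               # returns path to rendered image
Judge = Callable[[BenchCase, str], Dict[str, float]]  # returns score per dimension


def run_benchmark(
    cases: Iterable[BenchCase], generate: Generate, judge: Judge
) -> Dict[str, float]:
    """Run every case through generate + judge and return the mean score per dimension."""
    per_case: List[Dict[str, float]] = []
    for case in cases:
        image_path = generate(case)
        scores = judge(case, image_path)
        missing = [d for d in DIMENSIONS if d not in scores]
        if missing:
            raise ValueError(f"{case.case_id}: judge omitted {missing}")
        per_case.append(scores)
    if not per_case:
        raise ValueError("no test cases were provided")
    return {d: mean(s[d] for s in per_case) for d in DIMENSIONS}
```

Passing the pipeline and the judge in as plain callables keeps the runner independent of which backend or prompt variant is under test, so the same loop could serve ablations as well.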
2. Open-source model backends
Right now the pipeline runs on Gemini. Supporting open-weight VLMs (Qwen-VL, LLaVA, etc.) for the planning and critique agents would make it more accessible and let us study how much output quality depends on the backbone model vs. the orchestration and prompts.
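One possible shape for this, as a sketch only: local servers such as vLLM and Ollama can expose an OpenAI-compatible HTTP API, so a per-agent backend config may be enough to swap backbones. The class, registry, and function names below are hypothetical, the URLs are the usual local defaults for those servers and may differ in your setup, and the model names are placeholders.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class BackendConfig:
    name: str
    base_url: str                # root of an OpenAI-compatible API
    model: str                   # placeholder model identifier
    api_key: str = "not-needed"  # local servers typically ignore the key


# Hypothetical registry; adjust URLs and model names to your own deployment.
BACKENDS: Dict[str, BackendConfig] = {
    "vllm": BackendConfig("vllm", "http://localhost:8000/v1", "Qwen/Qwen2.5-VL-7B-Instruct"),
    "ollama": BackendConfig("ollama", "http://localhost:11434/v1", "llava"),
}


def resolve_backends(roles: Dict[str, str]) -> Dict[str, BackendConfig]:
    """Map each agent role (e.g. 'planner', 'critic') to its backend config."""
    resolved = {}
    for role, backend_name in roles.items():
        if backend_name not in BACKENDS:
            raise KeyError(f"unknown backend '{backend_name}' for role '{role}'")
        resolved[role] = BACKENDS[backend_name]
    return resolved


def chat_completions_url(cfg: BackendConfig) -> str:
    """Build the chat-completions endpoint used by OpenAI-compatible servers."""
    return cfg.base_url.rstrip("/") + "/chat/completions"
```

Configuring the backend per role would also make the backbone-vs-orchestration question testable: hold the prompts fixed and vary only the mapping passed to `resolve_backends`.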
3. Reference set expansion
The Retriever currently works with 13 curated methodology diagrams. There's likely room to grow this, both in quantity and in domain coverage, to improve retrieval quality for papers outside the current categories.
4. Multi-venue style support
The Stylist agent uses NeurIPS-style guidelines. Extending this to other venues (ICML, ACL, EMNLP, IEEE, etc.) or letting users define custom style profiles would broaden the practical applicability.
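A custom style profile could be as small as the sketch below. All field names and the registry are hypothetical, and the example values in the usage note are illustrative placeholders, not any venue's actual formatting rules.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class StyleProfile:
    venue: str
    font_family: str
    palette: List[str]
    notes: List[str] = field(default_factory=list)

    def to_guidelines(self) -> str:
        """Render the profile as plain text that could be appended to a Stylist prompt."""
        lines = [
            f"Target venue: {self.venue}",
            f"Font family: {self.font_family}",
            "Palette: " + ", ".join(self.palette),
        ]
        lines += [f"- {note}" for note in self.notes]
        return "\n".join(lines)


_PROFILES: Dict[str, StyleProfile] = {}


def register_profile(profile: StyleProfile) -> None:
    """Store a profile under its lower-cased venue name."""
    _PROFILES[profile.venue.lower()] = profile


def get_profile(venue: str, default: str = "neurips") -> StyleProfile:
    """Look up a profile by venue, falling back to the default venue's profile."""
    key = venue.lower()
    if key in _PROFILES:
        return _PROFILES[key]
    return _PROFILES[default]  # raises KeyError if the default was never registered
```

For example, `register_profile(StyleProfile("MyLab", "sans-serif", ["#1f77b4", "#ff7f0e"], ["avoid gradients"]))` followed by `get_profile("mylab").to_guidelines()` yields a short text block with one line per setting.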
5. Adaptive critique routing
The current refinement loop is fixed: Critic always feeds back to Visualizer. But sometimes the problem isn't rendering, it's that the Planner's description was ambiguous or the Stylist made a poor layout choice. Letting the Critic route feedback to the appropriate upstream agent (Planner, Stylist, or Visualizer) based on the failure mode could reduce wasted iterations and improve convergence. This is a non-trivial control flow problem in multi-agent systems.
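As a starting point for discussion, the routing decision itself can be sketched as a lookup from failure mode to agent. The failure-mode taxonomy here is hypothetical, and the stage order (Planner, then Stylist, then Visualizer) is an assumption about the pipeline rather than something read from the code.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional


class Agent(Enum):
    PLANNER = "planner"
    STYLIST = "stylist"
    VISUALIZER = "visualizer"


# Assumed upstream-to-downstream order of the stages the Critic can target.
PIPELINE_ORDER = [Agent.PLANNER, Agent.STYLIST, Agent.VISUALIZER]


class FailureMode(Enum):  # hypothetical taxonomy
    AMBIGUOUS_DESCRIPTION = "ambiguous_description"
    POOR_LAYOUT = "poor_layout"
    RENDERING_ERROR = "rendering_error"


ROUTES = {
    FailureMode.AMBIGUOUS_DESCRIPTION: Agent.PLANNER,
    FailureMode.POOR_LAYOUT: Agent.STYLIST,
    FailureMode.RENDERING_ERROR: Agent.VISUALIZER,
}


@dataclass
class Critique:
    feedback: str
    failure_mode: Optional[FailureMode] = None  # None means the diagram was accepted


def route(critique: Critique) -> Optional[Agent]:
    """Return the agent that should receive the feedback, or None if accepted."""
    if critique.failure_mode is None:
        return None
    return ROUTES[critique.failure_mode]


def stages_to_rerun(target: Agent) -> List[Agent]:
    """The targeted agent and everything downstream of it must run again."""
    return PIPELINE_ORDER[PIPELINE_ORDER.index(target):]
```

The hard part this sketch leaves open is getting the Critic to classify the failure mode reliably; a misrouted critique sent to the Planner costs more than one sent to the Visualizer, since more stages rerun.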
6. Hierarchical decomposition for complex diagrams
Some diagrams contain multiple logical substructures (e.g., a training pipeline next to an inference pipeline, or nested modules within a larger architecture). Planning these as a flat description hits a ceiling. Decomposing the diagram into sub-components, planning and rendering each independently, then composing them spatially could handle more complex illustrations than the current single-pass planner supports.
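A minimal sketch of what a component model might look like, assuming a simple tree whose leaves are rendered independently and then placed side by side. The `Component` type, `leaves`, and `compose_row` are all hypothetical names, and a real composer would need more than a single left-to-right row.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Component:
    name: str
    description: str
    children: List["Component"] = field(default_factory=list)


def leaves(component: Component) -> List[Component]:
    """Collect the sub-components that would be planned and rendered independently."""
    if not component.children:
        return [component]
    found: List[Component] = []
    for child in component.children:
        found.extend(leaves(child))
    return found


def compose_row(
    sizes: Dict[str, Tuple[int, int]], gap: int = 20
) -> Dict[str, Tuple[int, int]]:
    """Place rendered sub-diagrams left to right; returns each one's top-left offset."""
    offsets: Dict[str, Tuple[int, int]] = {}
    x = 0
    for name, (width, _height) in sizes.items():
        offsets[name] = (x, 0)
        x += width + gap
    return offsets
```

For the training-next-to-inference case, the root would have two children, each rendered on its own, and `compose_row` would assign the horizontal offsets used to paste them onto a shared canvas.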
7. Planner and Critic prompt optimization
From building this, the Planner and Critic prompts turned out to be where most of the output quality lives. Systematic prompt optimization (DSPy, few-shot tuning, etc.) could yield measurable gains here.
8. Image generation vs. code generation tradeoffs
The paper notes that image generation models produce more visually appealing plots but are prone to numerical hallucination, while code generation (Matplotlib) is faithful but less polished. Finding a better balance here, or a hybrid approach, is an open problem.
If you have ideas, related work, or want to pick up any of these directions, drop them here. Contributions and discussion welcome.