Research directions for improving the diagram quality #4

dippatel1994 · 2026-02-06T15:17:58Z

dippatel1994
Feb 6, 2026
Maintainer

Now that the core pipeline is functional, I want to open a thread to collect ideas for where this implementation could go next, both as a practical tool and as a research vehicle.

A few directions I've been thinking about:

1. Evaluation and benchmarking
The paper introduces PaperBananaBench (292 test cases from NeurIPS 2025), but we don't yet have an automated way to run it end-to-end against our implementation. Building a reproducible eval harness, possibly with VLM-as-judge scoring across faithfulness, conciseness, readability, and aesthetics, would let us measure the effect of any changes we make rather than eyeballing outputs.

2. Open-source model backends
Right now the pipeline runs on Gemini. Supporting open-weight VLMs (Qwen-VL, LLaVA, etc.) for the planning and critique agents would make it more accessible and let us study how much output quality depends on the backbone model vs. the orchestration and prompts.

3. Reference set expansion
The Retriever currently works with 13 curated methodology diagrams. There's likely room to grow this, both in quantity and in domain coverage, to improve retrieval quality for papers outside the current categories.

4. Multi-venue style support
The Stylist agent uses NeurIPS-style guidelines. Extending this to other venues (ICML, ACL, EMNLP, IEEE, etc.) or letting users define custom style profiles would broaden the practical applicability.

5. Adaptive critique routing
The current refinement loop is fixed: Critic always feeds back to Visualizer. But sometimes the problem isn't rendering, it's that the Planner's description was ambiguous or the Stylist made a poor layout choice. Letting the Critic route feedback to the appropriate upstream agent (Planner, Stylist, or Visualizer) based on the failure mode could reduce wasted iterations and improve convergence. This is a non-trivial control flow problem in multi-agent systems.

6. Hierarchical decomposition for complex diagrams
Some diagrams contain multiple logical substructures (e.g., a training pipeline next to an inference pipeline, or nested modules within a larger architecture). Planning these as a flat description hits a ceiling. Decomposing the diagram into sub-components, planning and rendering each independently, then composing them spatially could handle more complex illustrations than the current single-pass planner supports.

7. Planner and Critic prompt optimization
From building this, the Planner and Critic prompts turned out to be where most of the output quality lives. Systematic prompt optimization (DSPy, few-shot tuning, etc.) could yield measurable gains here.

8. Image generation vs. code generation tradeoffs
The paper notes that image generation models produce more visually appealing plots but are prone to numerical hallucination, while code generation (Matplotlib) is faithful but less polished. Finding a better balance here, or a hybrid approach, is an open problem.

If you have ideas, related work, or want to pick up any of these directions, drop them here. Contributions and discussion welcome.

statxc · 2026-03-06T03:19:57Z

statxc
Mar 6, 2026

After looking through the codebase, here’s the order I’d tackle these items based on what’s already implemented and what will deliver the biggest payoff.

1) Start with #1 - Eval Harness

This needs to come first because it makes every other improvement measurable. The judge system, 4-dimension scoring, and the ablation runner are already there. The main missing piece is a batch runner that can run across a full test set and summarize results. Without that, it’s hard to separate real progress from gut feel.

2) #3 (Reference Expansion) and #4 (Multi‑Venue Styles)

These are high-impact and mostly content work, not deep engineering.

The reference store is already built to scale (JSON index + images), so adding more references is straightforward.
Venue styles are also easy to extend since the guideline loader already supports custom file paths.
Both should noticeably improve output quality with relatively low effort.

3) #7 - Prompt Optimization

The prompt system is already template-driven and parameterized, so running prompt variants and comparing them is a natural next step. This is also where a lot of quality tends to come from.

4) #2 - Open‑Source Backends

This fits well once evaluation is in place. The provider architecture is clean, and the OpenAI provider already supports custom base URLs, which gets you close to supporting setups like vLLM or Ollama.

5) #5 (Adaptive Routing) and #8 (Hybrid Generation)

These are solid upgrades, but they’ll require more refactoring. The current structure can support them, but they’re not “drop-in” changes.

6) Last, #6 - Hierarchical Decomposition

This is the biggest lift. Right now the planner outputs flat text with no component model, so doing real hierarchical decomposition would mean reworking a significant chunk of the pipeline. It’s worth doing, but only once the rest is stable.

In short: measure first (#1), then low-effort quality gains (#3, #4, #7), then infrastructure (#2), and finally the deeper architectural work (#5, #6, #8).

I can handle this. I’ll open separate PRs for each task step by step. I’ll start by raising a PR for the first task (Eval Harness). For tasks #5 and #6, we can discuss more detail later after we’ve completed up to #4.
What do you think? Is it okay for me to get started contributing? Thanks

1 reply

dippatel1994 Mar 22, 2026
Maintainer Author

Thanks a lot @statxc you set a strong direction, looking forward for your contribution. Thanks again!

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Research directions for improving the diagram quality #4

Uh oh!

{{title}}

Uh oh!

Replies: 1 comment 1 reply

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Research directions for improving the diagram quality #4

Uh oh!

dippatel1994 Feb 6, 2026 Maintainer

Replies: 1 comment · 1 reply

Uh oh!

statxc Mar 6, 2026

1) Start with #1 - Eval Harness

2) #3 (Reference Expansion) and #4 (Multi‑Venue Styles)

3) #7 - Prompt Optimization

4) #2 - Open‑Source Backends

5) #5 (Adaptive Routing) and #8 (Hybrid Generation)

6) Last, #6 - Hierarchical Decomposition

Uh oh!

dippatel1994 Mar 22, 2026 Maintainer Author

dippatel1994
Feb 6, 2026
Maintainer

Replies: 1 comment 1 reply

statxc
Mar 6, 2026

dippatel1994 Mar 22, 2026
Maintainer Author