💡 Pre-Agentic Data Gathering Pipeline for 60%+ Token Savings #563

2026-06-11T00:59:09Z

github-actions[bot]
Bot Jun 11, 2026

Summary

Restructure agent workflows to move all deterministic data fetching (git diff, CI status, label checks, file listing, PR metadata) into shell-based preflight steps before invoking the LLM. Feed pre-gathered context as structured input rather than making the agent discover it via tool calls, eliminating tool-invocation token overhead and reasoning-loop waste.

Market Signal

GitHub's own engineering team reports a 62% Effective Token (ET) reduction in its agentic workflows from moving data gathering into workflow setup steps (documented in the May 2026 blog post "Improving token efficiency in GitHub Agentic Workflows"). The Auto-Triage workflow sustained 62% ET savings across 109 post-fix runs; Security Guard achieved 43% and Smoke Claude 59%. The pattern is now considered a production best practice for LLM-based CI/CD. Key insight: each MCP tool schema adds 8-12KB of context to every request, so unused tools consume context budget without providing value.

User Signal

The Token Cost Observatory discussion (#332) already identified pre-agentic optimization as a goal. Weekly token reports (issue #464) provide ongoing cost visibility. The project already has dev-lead-preflight.sh, which proves the pattern works at small scale, but the pattern has not been systematically applied to the PR review pipeline, the highest-volume agent workflow.

Technical Opportunity

The review pipeline currently has the LLM fetch diffs, CI status, and PR metadata via tool calls inside the reasoning loop. Each tool call adds:

Schema overhead (8-12KB per MCP tool definition, sent with every API request)
Reasoning tokens for the agent to decide which tool to use
Round-trip latency for tool execution
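
To make the schema overhead concrete, here is a small arithmetic sketch. It assumes the 8-12KB range above applies to every tool definition on every request; the function name schema_overhead_kb and the counts in the usage note are hypothetical, not measurements from this project.

```shell
# Illustrative arithmetic only; tool and request counts are hypothetical inputs.
# Prints the low-high range of schema overhead in KB for one agent run,
# assuming 8-12KB per tool definition resent on every request.
# Usage: schema_overhead_kb <tool-count> <request-count>
schema_overhead_kb() {
  local tools="$1" requests="$2"
  local low=$(( tools * requests * 8 ))
  local high=$(( tools * requests * 12 ))
  printf '%d-%d KB\n' "$low" "$high"
}
```

For example, `schema_overhead_kb 5 10` prints `400-600 KB` for five tool definitions across ten requests.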

Moving these fetches into shell steps in review-one-pr.sh and passing the results to the LLM as structured context would eliminate this overhead. The existing dev-lead-preflight.sh pattern provides a proven template. GitHub's approach:

Run gh CLI commands in a setup step to gather all deterministic data
Format as structured markdown/JSON context
Pass to the LLM as pre-populated context, not as discoverable tools
Remove unused MCP tool definitions from agent configurations
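
The first three steps above could be sketched roughly as follows. This is a minimal illustration, not the contents of dev-lead-preflight.sh or review-one-pr.sh: the function name gather_pr_context and the section layout of the output file are assumptions, and the JSON field list is only one plausible selection.

```shell
# Hypothetical preflight sketch: gather deterministic PR data with the gh CLI
# and write it to a single markdown context file for the LLM.
# Usage: gather_pr_context <pr-number> <output-file>
gather_pr_context() {
  local pr="$1" out="$2"
  {
    printf '## PR metadata (JSON)\n\n'
    gh pr view "$pr" --json number,title,labels,files,headRefOid
    printf '\n## CI status\n\n'
    # gh pr checks can exit non-zero when checks are failing or pending;
    # that is still useful context, so do not abort on it.
    gh pr checks "$pr" || true
    printf '\n## Diff\n\n'
    gh pr diff "$pr"
  } > "$out"
}
```

A caller might run `gather_pr_context "$PR_NUMBER" context.md` in a setup step and then hand context.md to the agent as input, so the agent does not need tools for these reads.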

Assessment

Dimension	Score	Rationale
Feasibility	high	dev-lead-preflight.sh already proves the pattern; review pipeline refactoring is bounded engineering work
Impact	high	62% ET savings demonstrated by GitHub on comparable workloads; directly reduces the largest cost driver
Urgency	high	Every day without this optimization burns roughly 2-3x more tokens than necessary on deterministic reads (a 62% reduction means current spend is about 1/0.38 ≈ 2.6x the optimized spend)

Adversarial Review

Strongest objection: Pre-gathered data could become stale if a force push occurs between the preflight step and the agent's analysis. Refactoring existing review scripts risks introducing bugs in a production pipeline.

Rebuttal: Staleness is mitigable: gather data at the start of each agent turn, not minutes before. The triage tier (cheap, fast) can validate freshness by checking that the PR's HEAD SHA still matches the pre-fetched one before the expensive deep review runs. Engineering risk is bounded because dev-lead-preflight.sh already proves the pattern, and the review pipeline's multi-tier architecture means changes can be validated at the triage tier before rolling out to deeper tiers. Start with the highest-volume, most deterministic reads (PR metadata, labels, file list) and expand incrementally.
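
The HEAD SHA freshness check could look something like the sketch below. The function name context_is_fresh and its exit-code convention are assumptions for illustration; it relies only on `gh pr view` with the `--json` and `--jq` flags.

```shell
# Hypothetical helper: succeed only if the PR head still matches the SHA
# recorded during preflight.
# Returns 0 if fresh, 1 if the head moved, 2 if the lookup itself failed.
# Usage: context_is_fresh <pr-number> <prefetched-head-sha>
context_is_fresh() {
  local pr="$1" expected="$2" current
  current="$(gh pr view "$pr" --json headRefOid --jq .headRefOid)" || return 2
  [ "$current" = "$expected" ]
}
```

A pipeline might call `context_is_fresh "$PR_NUMBER" "$PREFETCHED_SHA"` before the deep review and re-run the preflight step when it returns 1.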

Suggested Next Step

Audit review-one-pr.sh and review-batch.sh for all LLM-initiated data fetches. Categorize each as deterministic (can preflight) vs. dynamic (must stay in-loop). Implement preflight gathering for deterministic reads and measure ET savings via the Token Cost Observatory.
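
For the measurement step, a before/after comparison can be as simple as the sketch below. The helper name token_savings_pct is hypothetical, and the two inputs are whatever token totals the Token Cost Observatory reports for comparable runs; this does not reproduce GitHub's ET formula.

```shell
# Hypothetical helper: percent reduction between two token totals.
# Usage: token_savings_pct <tokens-before> <tokens-after>
token_savings_pct() {
  awk -v before="$1" -v after="$2" 'BEGIN {
    if (before <= 0) {
      print "tokens-before must be positive" > "/dev/stderr"
      exit 1
    }
    # Savings as a percentage of the original spend, one decimal place.
    printf "%.1f\n", (before - after) / before * 100
  }'
}
```

For example, `token_savings_pct 1000 380` prints `62.0`.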
