
fix(triage-panel): paginate scheduled sweep oldest-first via MCP #1193

Merged: danielmeppiel merged 1 commit into main from fix/triage-sweep-pagination on May 7, 2026

Conversation

@danielmeppiel
Collaborator

TL;DR

The triage-panel scheduled sweep has been processing one issue per daily cron run instead of the documented MAX_ISSUES_PER_RUN=10, silently letting the untriaged backlog grow. Root cause is a prompt bug in triage-panel.md, not anything in the engine. Fix is a .md-only edit (lockfile imports the prompt at runtime).

Problem (WHY)

Maintainer flagged: "I have not seen [the panel] labelling issues lately."

Investigation:

From the agent stdio trace of the 2026-05-06 run:

list_issues (MCP: github) · owner: "microsoft", repo: "apm", state: "OPEN",
  perPage: 100, orderBy...
  Output too large to read at once (140.7 KB). Saved to: /tmp/...

Then:

Now I have all context to run the triage panel. BATCH_ALLOW_LIST = [1161].

So the agent:

  1. Made one MCP list_issues call (~140 KB response).
  2. Couldn't read the response in one tool turn.
  3. Picked BATCH_ALLOW_LIST = [1161] from the head of whatever slice it did see.
  4. Never paginated. Never adjusted perPage. Never re-ordered oldest-first.

Why the agent went off-script: the SCHEDULED_SWEEP prompt example showed gh issue list --limit 200, but shared/apm.md tells the agent that shell gh is unauthenticated and to use the GitHub MCP list_issues tool. The MCP tool defaults to newest-first and lacks "exclude label" semantics, neither of which the prompt addresses.
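
Because the MCP tool has no "exclude label" filter, the eligibility check has to happen on the caller's side after each page is returned. The sketch below illustrates that idea only. The issue dictionary shape (`labels`, `author`, `type`) and the helper names `is_eligible` and `filter_eligible` are assumptions for illustration, not the actual MCP response schema or anything in the prompt.

```python
TRIAGED_LABEL = "status/triaged"  # label name taken from the PR text


def is_eligible(issue: dict) -> bool:
    """Return True if the issue lacks the triaged label and is not bot-authored.

    The dictionary layout is hypothetical; adapt the lookups to the real
    list_issues response.
    """
    labels = {label["name"] for label in issue.get("labels", [])}
    author = issue.get("author", {})
    is_bot = author.get("type") == "Bot" or author.get("login", "").endswith("[bot]")
    return TRIAGED_LABEL not in labels and not is_bot


def filter_eligible(issues: list[dict]) -> list[int]:
    """Keep the issue numbers that still need triage, preserving input order."""
    return [issue["number"] for issue in issues if is_eligible(issue)]
```

Preserving input order matters here: if the page arrives oldest-first, the filtered list stays oldest-first.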

Approach (WHAT)

Rewrite the SCHEDULED_SWEEP "Step 1: Gather candidates" subsection in .github/workflows/triage-panel.md to:

  • Drop the misleading gh issue list example.
  • Specify the actual MCP call: list_issues with orderBy: CREATED_AT, direction: ASC, perPage: 30.
  • Mandate pagination via after: <endCursor> until 10 eligible issues are found, hasNextPage is false, or 5 pages have been scanned, whichever comes first (a healthy sweep is allowed to be small).
  • Tell the agent to drop perPage to 15 if a single page can't be read in one tool response.

The lockfile uses {{#runtime-import .github/workflows/triage-panel.md}} to pull the prompt at runtime, so gh aw compile produces zero lockfile diff. Verified locally.
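
The three stop conditions above can be sketched as a loop. This is a minimal illustration, not the prompt text itself: `fetch_page` is a hypothetical stand-in for the MCP `list_issues` call (with `orderBy: CREATED_AT`, `direction: ASC`, and `after: <cursor>`), and the response keys `issues`, `hasNextPage`, and `endCursor` are assumed from the field names mentioned in this PR.

```python
from typing import Callable, Optional

MAX_ISSUES_PER_RUN = 10  # documented per-run ceiling
MAX_PAGES = 5            # defensive cap on pagination
PER_PAGE = 30            # page size named in the PR


def gather_candidates(
    fetch_page: Callable[[Optional[str], int], dict],
    is_eligible: Callable[[dict], bool],
) -> list[int]:
    """Collect up to MAX_ISSUES_PER_RUN eligible issue numbers, oldest first."""
    eligible: list[int] = []
    cursor: Optional[str] = None
    for _ in range(MAX_PAGES):
        page = fetch_page(cursor, PER_PAGE)
        for issue in page["issues"]:
            if is_eligible(issue):
                eligible.append(issue["number"])
                if len(eligible) == MAX_ISSUES_PER_RUN:
                    return eligible  # stop condition 1: batch is full
        if not page["hasNextPage"]:
            break  # stop condition 2: no more pages
        cursor = page["endCursor"]
    # Falling out of the loop covers stop condition 3: page cap reached.
    return eligible
```

Returning a short or empty list is a valid outcome under this design; the loop never pads the batch.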

Validation

Trade-offs

  • The 5-page (150-issue) cap is a defensive ceiling against runaway pagination if the queue is huge. With a real backlog around 13, this is comfortably above what's needed.
  • We could have added a hard "exit non-zero if BATCH_ALLOW_LIST has fewer than min(N_eligibles, 10) entries" check, but that would risk noisy CI failures from edge cases. Prompt clarity is the cheaper fix.
  • This does not change the cost ceiling: the cron cap is still 10 runs/day, and safe-output-tools still cap at add_labels(max:70), which is ~10 issues at 7 labels each.

How to test

After merge, watch the next daily scheduled run of .github/workflows/triage-panel.lock.yml. Expected: the agent processes up to 10 untriaged issues (oldest first) in a single sweep, draining the current 13-issue queue across two cron ticks.
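
The figures quoted in the trade-offs and the test plan can be checked with a few lines of arithmetic. The constant names below are illustrative, and the drain estimate assumes no new untriaged issues arrive between ticks and that every backlog issue is eligible.

```python
import math

PER_PAGE = 30            # page size from the new prompt
MAX_PAGES = 5            # pagination cap
MAX_ISSUES_PER_RUN = 10  # documented per-run ceiling
ADD_LABELS_MAX = 70      # safe-output cap on add_labels
LABELS_PER_ISSUE = 7     # rough per-issue figure quoted in the trade-offs
BACKLOG = 13             # untriaged queue size reported in the PR

scan_ceiling = PER_PAGE * MAX_PAGES                      # issues scanned at most
label_budget_issues = ADD_LABELS_MAX // LABELS_PER_ISSUE  # issues the label cap allows
ticks_to_drain = math.ceil(BACKLOG / MAX_ISSUES_PER_RUN)  # cron runs to clear the queue

print(scan_ceiling, label_budget_issues, ticks_to_drain)  # prints "150 10 2"
```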

The scheduled sweep agent was processing only one issue per daily cron
run despite a documented MAX_ISSUES_PER_RUN=10 ceiling and a 13-deep
human-authored untriaged backlog. The 2026-05-06 sweep
(run 25437842915) labelled exactly one issue (#1161) and exited with
'All other open issues were already status/triaged or authored by bots
(skipped)' even though most older issues lacked status/triaged.

Root cause from the agent stdio trace:

- The prompt example showed 'gh issue list --limit 200' but
  shared/apm.md tells the agent shell gh is unauthenticated, so the
  agent calls the MCP list_issues tool instead.
- That single MCP call returned ~140 KB ('Output too large to read at
  once') and the agent set BATCH_ALLOW_LIST = [<head of list>] from
  the truncated slice without paginating.
- The MCP tool defaults sort newest-first, contradicting the prompt's
  oldest-first requirement, and the labels filter only matches
  positive labels (no 'lacks status/triaged' semantics).

Update the SCHEDULED_SWEEP step to:

- Call list_issues with orderBy=CREATED_AT, direction=ASC, perPage=30.
- Mandate pagination via 'after: endCursor' until 10 eligibles found,
  hasNextPage=false, or 5 pages scanned.
- Tell the agent to drop perPage to 15 if a single page can't be read
  in one tool response.
- Drop the misleading shell gh example.
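
The perPage fallback in the list above amounts to a single retry with a smaller page. The sketch below is one way to picture it. `OutputTooLarge` and `fetch_page` are hypothetical names; in the real workflow the agent sees an "Output too large to read at once" message rather than an exception.

```python
from typing import Callable, Optional


class OutputTooLarge(Exception):
    """Hypothetical signal that a tool response could not be read in one turn."""


def fetch_with_fallback(
    fetch_page: Callable[[Optional[str], int], dict],
    cursor: Optional[str],
    per_page: int = 30,
    fallback_per_page: int = 15,
) -> dict:
    """Fetch one page, retrying once with a smaller page size if it is too large."""
    try:
        return fetch_page(cursor, per_page)
    except OutputTooLarge:
        # Same cursor, smaller page: no issues are skipped by the retry.
        return fetch_page(cursor, fallback_per_page)
```

Reusing the same cursor on the retry keeps the oldest-first scan contiguous.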

Lockfile imports the prompt at runtime
({{#runtime-import .github/workflows/triage-panel.md}}) so this is a
.md-only change; gh aw compile produces no lockfile diff.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 7, 2026 18:32
@danielmeppiel danielmeppiel merged commit 801b501 into main May 7, 2026
17 checks passed
@danielmeppiel danielmeppiel deleted the fix/triage-sweep-pagination branch May 7, 2026 18:35
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates the triage-panel scheduled sweep prompt so the agent uses GitHub MCP pagination (oldest-first) to reliably find up to 10 eligible untriaged issues per run, preventing silent backlog growth.

Changes:

  • Rewrites the SCHEDULED_SWEEP candidate-gathering instructions to use GitHub MCP list_issues with oldest-first ordering and mandatory pagination.
  • Adds guardrails for tool-output size by reducing perPage when responses are too large.
  • Documents the behavior change in the changelog.

| File | Description |
| --- | --- |
| CHANGELOG.md | Notes the scheduled sweep pagination/ordering behavior change for release visibility. |
| .github/workflows/triage-panel.md | Updates the prompt to use MCP list_issues with pagination/ordering and clarifies filtering steps. |

Copilot's findings

  • Files reviewed: 2/2 changed files
  • Comments generated: 3

Comment thread CHANGELOG.md
- `shared/apm.md` no longer wraps the `target` input in a `|| 'all'` fallback. The defensive expression broke gh-aw's bare-expression substitution regex, causing consumer-supplied `target:` values to be silently dropped; the `import-schema` default already covers the omitted-input case. (#1185)
- `apm install --target all` no longer enumerates the experimental `copilot-cowork` target, which was crashing project-scope installs with a "requires --global" error and made `gh aw` workflows that pin `target: all` unusable. (#1191)
- Stabilized `test_install_over_defer_threshold_starts_live_once` on slow CI runners by joining the deferred-start timer thread instead of relying on a 100ms grace window. (#1191)
- `triage-panel` scheduled sweep now paginates the candidate query oldest-first via the GitHub MCP `list_issues` tool instead of a single 200-issue page, so daily runs actually drain the untriaged backlog rather than processing one issue per cron tick.
Comment on lines +281 to +288
list_issues(
  owner: "microsoft",
  repo: "apm",
  state: "OPEN",
  orderBy: "CREATED_AT",
  direction: "ASC",
  perPage: 30,
)