[Bug] LoadSkillResourceTool retries RESOURCE_NOT_FOUND indefinitely; default max_llm_calls=500 is the only backstop #5652

## 🔴 Required Information

**Describe the Bug:**

`LoadSkillResourceTool.run_async` returns `RESOURCE_NOT_FOUND` as a structured soft-error string when a path passed by the LLM does not exist inside the skill's bundled resources. Because the response is a normal tool result (not an exception or terminal signal), the LLM treats it as a transient/recoverable failure and retries the same path. Nothing in `SkillToolset` distinguishes the first failure from the Nth, so the loop continues until `RunConfig.max_llm_calls` is exhausted.

`max_llm_calls` defaults to **500** (`src/google/adk/agents/run_config.py:314`). A single hallucinated path can therefore silently consume the entire per-invocation call budget on one failing tool call before the framework intervenes. `max_llm_calls` is a global cap on legitimate reasoning, not a defense against a repeated-failure loop on one specific tool.

The loop is reachable through ordinary use of the Skills feature, not adversarial inputs:

1. The L2 `load_skill` response intentionally omits a manifest of available files (the agentskills.io progressive-disclosure design — correct for token economy). The LLM must therefore *infer* paths from the prose inside `SKILL.md`, and inferred paths are routinely wrong.
2. `RESOURCE_NOT_FOUND` is structurally indistinguishable from a transient error to the model, so retry is its default response.
3. The default system instruction does not draw a scope boundary between *skill-bundled files* (the legitimate target of `load_skill_resource`) and *runtime user inputs* (e.g., a PDF the user is processing), so the model sometimes routes runtime documents through `load_skill_resource`, hits `RESOURCE_NOT_FOUND`, and loops on a path that was never a skill resource to begin with.

**Steps to Reproduce:**

1. Install `google-adk` (any version that ships `SkillToolset` — verified on `1.31.0`).
2. Construct an agent with a `SkillToolset` that contains at least one skill whose `SKILL.md` references files in `references/` or `assets/`.
3. Issue a query that prompts the model to read one of those resources, but craft the `SKILL.md` so the prose strongly implies a path that does not literally exist (e.g., the file is named `references/guide.md` but the prose says "see the user guide" without specifying the filename, which is common for human-authored skills).
4. Observe in the trace that the model calls `load_skill_resource` with a hallucinated path (`references/user_guide.md`, `references/userguide.md`, etc.), receives `RESOURCE_NOT_FOUND`, and retries with another plausible variant. The loop continues until the `max_llm_calls` cap is hit.

A simpler synthetic repro at the unit-test level: call `LoadSkillResourceTool.run_async` twice with the same nonexistent path under the same `tool_context`. On `main`, both calls return identical `RESOURCE_NOT_FOUND` responses; nothing escalates.

**Expected Behavior:**

Repeated identical failures within a single invocation should be terminal. The framework should signal to the LLM — both via the tool response and via the system prompt — that the path will not start working and the model should stop retrying it. The agent's overall reasoning budget (`max_llm_calls`) should not be the only thing standing between an imperfect prompt and a runaway invocation.

**Observed Behavior:**

The same `RESOURCE_NOT_FOUND` soft error is returned on every attempt regardless of how many times the same path has already failed in the same invocation. There is no escalation, no terminal error code, and no instruction to the model to stop. The loop terminates only when `max_llm_calls` is exceeded, by which point ~500 LLM calls have been spent on one wrong path.

```text
load_skill_resource(skill_name="writer", file_path="references/style_guide.md")
  → {"error": "Resource 'references/style_guide.md' not found in skill 'writer'.", "error_code": "RESOURCE_NOT_FOUND"}
load_skill_resource(skill_name="writer", file_path="references/style_guide.md")
  → {"error": "Resource 'references/style_guide.md' not found in skill 'writer'.", "error_code": "RESOURCE_NOT_FOUND"}
load_skill_resource(skill_name="writer", file_path="references/style_guide.md")
  → {"error": "Resource 'references/style_guide.md' not found in skill 'writer'.", "error_code": "RESOURCE_NOT_FOUND"}
... (continues until max_llm_calls=500 is hit)
Error: Number of llm calls limit `500` exceeded
```

**Environment Details:**

- ADK Library Version (`pip show google-adk`): `1.31.0` (issue exists on `main` as of commit `2d61cb69`)
- Desktop OS: Linux (reproducible cross-platform — defect is in framework logic, not OS-specific)
- Python Version: `3.12.3`

**Model Information:**

- Are you using LiteLLM: N/A (defect is provider-agnostic; reproducible with any model that follows tool-use semantics)
- Which model is being used: N/A — observed across Gemini and Claude families. The behavior depends on the LLM treating soft errors as retryable, which is the default for every modern function-calling model.

---

## 🟡 Optional Information

**Regression:**

Not a regression. The defect has existed since `SkillToolset` was introduced — `LoadSkillResourceTool.run_async` has never had any retry-guard logic. The risk surface grew as the Skills feature became more widely used.

**Additional Context:**

The four compounding factors — no resource manifest at L2, soft-string error code, no terminal signal, no scope boundary in the default prompt — are individually defensible design decisions but combine into a loop reachable by ordinary use. A defensive framework should not depend on a perfect upstream system prompt to avoid unbounded loops on a known error path.

Considered and rejected during design discussion:

| Alternative | Why not |
|---|---|
| Tighten or default-lower `max_llm_calls` | Caps the agent's overall reasoning budget; punishes legitimate long-running tasks; doesn't address the specific defect |
| User-side `after_tool_callback` workaround | Symptomatic; pushes the fix onto every user of `SkillToolset`; the framework still ships with the loop |
| Add `available_resources` manifest to the L2 `load_skill` response | Defeats the lazy-loading / token-saving design that the Skills spec is built around |
| Introduce a new `list_skill_resources` tool | Violates the L1→L2→L3 progressive disclosure contract from agentskills.io |
| Include available paths in the fatal response | Re-introduces the manifest cost; contradicts the "stop" semantic the fatal code is meant to enforce |
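
For readers who need a stopgap before a framework fix lands, the rejected `after_tool_callback` workaround could look roughly like the sketch below. It is not part of the linked PR. It assumes the callback is invoked as `(tool, args, tool_context, tool_response)` and that returning a dict replaces the tool result; verify both against the ADK version in use. The function name, the `temp:not_found_counts_` key, and the threshold of 2 are hypothetical.

```python
from typing import Any, Optional

# Hypothetical threshold: the second identical failure gets a "stop" message.
_MAX_IDENTICAL_FAILURES = 2


def stop_repeated_not_found(
    tool: Any,
    args: dict[str, Any],
    tool_context: Any,
    tool_response: Any,
) -> Optional[dict[str, Any]]:
    """Rewrite a repeated RESOURCE_NOT_FOUND into an explicit stop instruction."""
    # Only act on the skill resource tool's soft error; pass everything else through.
    if getattr(tool, "name", None) != "load_skill_resource":
        return None
    if not isinstance(tool_response, dict):
        return None
    if tool_response.get("error_code") != "RESOURCE_NOT_FOUND":
        return None

    # Hypothetical state key, scoped to the current invocation.
    key = f"temp:not_found_counts_{tool_context.invocation_id}"
    counts = dict(tool_context.state.get(key, {}))
    marker = f"{args.get('skill_name')}::{args.get('file_path')}"
    counts[marker] = counts.get(marker, 0) + 1
    tool_context.state[key] = counts

    if counts[marker] < _MAX_IDENTICAL_FAILURES:
        return None  # first failure: keep the original response

    # Repeated failure: keep the payload but append an explicit instruction.
    message = tool_response.get("error", "")
    return {
        **tool_response,
        "error": f"{message} This path has already failed. Do not retry it.",
    }
```

Such a function would be passed as the agent's `after_tool_callback`. As the table notes, this only treats the symptom: every `SkillToolset` user would have to carry it.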

**Minimal Reproduction Code:**

```python
import asyncio
from unittest import mock

from google.adk.skills import models
from google.adk.tools import skill_toolset, tool_context

# A mocked skill whose reference lookups always miss.
skill = mock.create_autospec(models.Skill, instance=True)
skill.name = "demo"
skill.resources = mock.MagicMock()
skill.resources.get_reference.return_value = None  # every reference path "missing"

# A minimal mocked ToolContext; every call below shares one invocation.
ctx = mock.MagicMock(spec=tool_context.ToolContext)
ctx.state = {}
ctx.invocation_id = "inv1"
ctx._invocation_context = mock.MagicMock()
ctx.agent_name = "agent"

toolset = skill_toolset.SkillToolset([skill])
tool = skill_toolset.LoadSkillResourceTool(toolset)

async def main():
    # Request the same nonexistent path five times in a row.
    for i in range(5):
        r = await tool.run_async(
            args={"skill_name": "demo", "file_path": "references/missing.md"},
            tool_context=ctx,
        )
        # On main, all 5 iterations print RESOURCE_NOT_FOUND; the LLM has no reason to stop.
        print(i, r["error_code"])

asyncio.run(main())
```

**How often has this issue occurred?:**

- **Always (100%)** — deterministic given (a) any skill whose `SKILL.md` lets the model infer plausible-looking paths that don't literally exist, or (b) any prompt that doesn't explicitly forbid retrying after `RESOURCE_NOT_FOUND`.

---

## Proposed Fix

A two-layer fix is proposed in the linked PR (#5651): an invocation-scoped retry guard inside `LoadSkillResourceTool.run_async` that escalates a repeated `(skill, path)` failure to a new `RESOURCE_NOT_FOUND_FATAL` terminal code, plus two additions to `_DEFAULT_SKILL_SYSTEM_INSTRUCTION` (a no-retry rule and a scope boundary clarifying that `load_skill_resource` is for skill-bundled files only). This is defense in depth: code-only termination produces confusing downstream behavior, while prompt-only termination relies on the LLM following the rule. Both layers are required.

The retry-guard state is keyed under `temp:_adk_skill_resource_failed_paths_<invocation_id>`. The `temp:` prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage. The `<invocation_id>` suffix ensures correctness on **in-memory** session backends as well, where `temp:` keys are added to `session.state` and are not auto-cleared between invocations — without the suffix, a path that legitimately failed in invocation A would spuriously hit the fatal path on its first attempt in invocation B.
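
The guard and its key scoping can be sketched as below. This is a minimal illustration built on a plain dict, not the code in the PR. The helper name `not_found_response`, the fatal error wording, and the choice to store failures as a list of `[skill, path]` pairs are assumptions made for the example. Only the key prefix and the two error codes come from the description above.

```python
from typing import Any, MutableMapping

NOT_FOUND = "RESOURCE_NOT_FOUND"
NOT_FOUND_FATAL = "RESOURCE_NOT_FOUND_FATAL"
_KEY_PREFIX = "temp:_adk_skill_resource_failed_paths_"


def not_found_response(
    state: MutableMapping[str, Any],
    invocation_id: str,
    skill_name: str,
    file_path: str,
) -> dict[str, str]:
    """Build the error payload for a missing resource, escalating on repeats."""
    # The invocation id is part of the key, so failures recorded in one
    # invocation are invisible to the next even if the state object persists.
    key = f"{_KEY_PREFIX}{invocation_id}"
    failed = list(state.get(key, []))
    marker = [skill_name, file_path]

    if marker in failed:
        # Same (skill, path) already failed in this invocation: terminal code.
        return {
            "error": (
                f"Resource '{file_path}' not found in skill '{skill_name}'. "
                "This path already failed in this invocation. Do not retry it."
            ),
            "error_code": NOT_FOUND_FATAL,
        }

    # First failure for this (skill, path): record it, return the soft error.
    failed.append(marker)
    state[key] = failed
    return {
        "error": f"Resource '{file_path}' not found in skill '{skill_name}'.",
        "error_code": NOT_FOUND,
    }
```

With a single shared `state` dict standing in for an in-memory session, two calls under `"inv1"` for the same path return `RESOURCE_NOT_FOUND` and then `RESOURCE_NOT_FOUND_FATAL`, while the first call under `"inv2"` for that same path returns `RESOURCE_NOT_FOUND` again.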

Linked PR: google/adk-python#5651
