🔴 Required Information
Describe the Bug:
LoadSkillResourceTool.run_async returns RESOURCE_NOT_FOUND as a structured soft-error string when a path passed by the LLM does not exist inside the skill's bundled resources. Because the response is a normal tool result (not an exception or terminal signal), the LLM treats it as a transient/recoverable failure and retries the same path. Nothing in SkillToolset distinguishes the first failure from the Nth, so the loop continues until RunConfig.max_llm_calls is exhausted.
max_llm_calls defaults to 500 (src/google/adk/agents/run_config.py:314). This means a single hallucinated path can silently consume the entire per-invocation call budget on a single failing tool name before the framework intervenes — and max_llm_calls is a global cap on legitimate reasoning, not a defense against a repeated-failure loop on one specific tool.
The loop is reachable through ordinary use of the Skills feature, not adversarial inputs:
- The L2 load_skill response intentionally omits a manifest of available files (the agentskills.io progressive-disclosure design, which is correct for token economy). The LLM must therefore infer paths from the prose inside SKILL.md, and inferred paths are routinely wrong.
- RESOURCE_NOT_FOUND is structurally indistinguishable from a transient error to the model, so retry is its default response.
- The default system instruction does not draw a scope boundary between skill-bundled files (the legitimate target of load_skill_resource) and runtime user inputs (e.g., a PDF the user is processing), so the model sometimes routes runtime documents through load_skill_resource, hits RESOURCE_NOT_FOUND, and loops on a path that was never a skill resource to begin with.
Steps to Reproduce:
- Install google-adk (any version that ships SkillToolset; verified on 1.31.0).
- Construct an agent with a SkillToolset that contains at least one skill whose SKILL.md references files in references/ or assets/.
- Issue a query that prompts the model to read one of those resources, but craft the SKILL.md so the prose strongly implies a path that does not literally exist (e.g., the file is named references/guide.md but the prose says "see the user guide" without specifying the filename, which is common for human-authored skills).
- Observe in the trace that the model calls load_skill_resource with a hallucinated path (references/user_guide.md, references/userguide.md, etc.), receives RESOURCE_NOT_FOUND, and retries with another plausible variant. The loop continues until the max_llm_calls cap is hit.
A simpler synthetic repro at the unit-test level: call LoadSkillResourceTool.run_async twice with the same nonexistent path under the same tool_context. On main, both calls return identical RESOURCE_NOT_FOUND responses; nothing escalates.
Expected Behavior:
Repeated identical failures within a single invocation should be terminal. The framework should signal to the LLM — both via the tool response and via the system prompt — that the path will not start working and the model should stop retrying it. The agent's overall reasoning budget (max_llm_calls) should not be the only thing standing between an imperfect prompt and a runaway invocation.
Observed Behavior:
The same RESOURCE_NOT_FOUND soft error is returned on every attempt regardless of how many times the same path has already failed in the same invocation. There is no escalation, no terminal error code, and no instruction to the model to stop. The loop terminates only when max_llm_calls is exceeded, by which point ~500 LLM calls have been spent on one wrong path.
load_skill_resource(skill_name="writer", file_path="references/style_guide.md")
→ {"error": "Resource 'references/style_guide.md' not found in skill 'writer'.", "error_code": "RESOURCE_NOT_FOUND"}
load_skill_resource(skill_name="writer", file_path="references/style_guide.md")
→ {"error": "Resource 'references/style_guide.md' not found in skill 'writer'.", "error_code": "RESOURCE_NOT_FOUND"}
load_skill_resource(skill_name="writer", file_path="references/style_guide.md")
→ {"error": "Resource 'references/style_guide.md' not found in skill 'writer'.", "error_code": "RESOURCE_NOT_FOUND"}
... (continues until max_llm_calls=500 is hit)
Error: Number of llm calls limit `500` exceeded
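The shape of this trace can be illustrated with a toy simulation. This is only a sketch that does not use ADK: load_skill_resource_stub and simulate_retry_loop are hypothetical stand-ins, and the "model" is assumed to retry every soft error, which is the behavior described above.

```python
from typing import Any

MAX_LLM_CALLS = 500  # default cited above for RunConfig.max_llm_calls


def load_skill_resource_stub(skill_name: str, file_path: str) -> dict[str, Any]:
    """Hypothetical stand-in for the tool: stateless, always the same soft error."""
    return {
        "error": f"Resource '{file_path}' not found in skill '{skill_name}'.",
        "error_code": "RESOURCE_NOT_FOUND",
    }


def simulate_retry_loop(max_llm_calls: int = MAX_LLM_CALLS) -> int:
    """Counts the LLM calls spent by a model that retries every soft error."""
    calls = 0
    while calls < max_llm_calls:
        calls += 1  # one LLM turn produces one tool call
        response = load_skill_resource_stub("writer", "references/style_guide.md")
        if response.get("error_code") != "RESOURCE_NOT_FOUND":
            break  # never reached: the stub has no memory of earlier failures
    return calls


calls_spent = simulate_retry_loop()
print(f"LLM calls spent on one wrong path: {calls_spent}")
```

Because the stub keeps no record of earlier failures, the only exit condition that ever fires is the global cap.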
Environment Details:
- ADK Library Version (pip show google-adk): 1.31.0 (issue exists on main as of commit 2d61cb69)
- Desktop OS: Linux (reproducible cross-platform — defect is in framework logic, not OS-specific)
- Python Version: 3.12.3
Model Information:
- Are you using LiteLLM: N/A (defect is provider-agnostic; reproducible with any model that follows tool-use semantics)
- Which model is being used: N/A — observed across Gemini and Claude families. The behavior depends on the LLM treating soft errors as retryable, which is the default for every modern function-calling model.
🟡 Optional Information
Regression:
Not a regression. The defect has existed since SkillToolset was introduced — LoadSkillResourceTool.run_async has never had any retry-guard logic. The risk surface grew as the Skills feature became more widely used.
Additional Context:
The four compounding factors — no resource manifest at L2, soft-string error code, no terminal signal, no scope boundary in the default prompt — are individually defensible design decisions but combine into a loop reachable by ordinary use. A defensive framework should not depend on a perfect upstream system prompt to avoid unbounded loops on a known error path.
Considered and rejected during design discussion:
| Alternative | Why not |
| --- | --- |
| Tighten or default-lower max_llm_calls | Caps the agent's overall reasoning budget; punishes legitimate long-running tasks; doesn't address the specific defect |
| User-side after_tool_callback workaround | Symptomatic; pushes the fix onto every user of SkillToolset; the framework still ships with the loop |
| Add available_resources manifest to the L2 load_skill response | Defeats the lazy-loading / token-saving design that the Skills spec is built around |
| Introduce a new list_skill_resources tool | Violates the L1→L2→L3 progressive disclosure contract from agentskills.io |
| Include available paths in the fatal response | Re-introduces the manifest cost; contradicts the "stop" semantic the fatal code is meant to enforce |
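For context, the user-side after_tool_callback workaround listed in the table might look roughly like the sketch below. It is an illustration, not a recommendation. The parameter order (tool, args, tool_context, tool_response) and the "return None to keep the original response" convention reflect my understanding of ADK's after-tool callback hook and should be treated as assumptions. The function name stop_repeated_not_found, the state key, and the message wording are all hypothetical.

```python
from types import SimpleNamespace
from typing import Any, Optional


def stop_repeated_not_found(
    tool: Any, args: dict[str, Any], tool_context: Any, tool_response: Any
) -> Optional[dict[str, str]]:
    """Hypothetical callback: replaces a repeated RESOURCE_NOT_FOUND with a stop message."""
    if getattr(tool, "name", None) != "load_skill_resource":
        return None  # None leaves the original tool response unchanged
    if not isinstance(tool_response, dict):
        return None
    if tool_response.get("error_code") != "RESOURCE_NOT_FOUND":
        return None

    # Hypothetical state key, scoped to the current invocation.
    key = f"temp:seen_missing_skill_paths_{tool_context.invocation_id}"
    entry = f"{args.get('skill_name')}:{args.get('file_path')}"
    seen = list(tool_context.state.get(key, []))
    if entry in seen:
        return {
            "error": f"{entry} already failed in this invocation. Do not retry it.",
            "error_code": "RESOURCE_NOT_FOUND",
        }
    seen.append(entry)
    tool_context.state[key] = seen
    return None


# Duck-typed stand-ins so the sketch runs without google-adk installed.
tool = SimpleNamespace(name="load_skill_resource")
ctx = SimpleNamespace(state={}, invocation_id="inv1")
args = {"skill_name": "writer", "file_path": "references/style_guide.md"}
soft_error = {"error": "not found", "error_code": "RESOURCE_NOT_FOUND"}

first = stop_repeated_not_found(tool, args, ctx, soft_error)
second = stop_repeated_not_found(tool, args, ctx, soft_error)
print(first)
print(second["error"])
```

Every SkillToolset user would have to carry a callback like this, which is the reason the table gives for rejecting the approach.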
Minimal Reproduction Code:
import asyncio
from unittest import mock

from google.adk.skills import models
from google.adk.tools import skill_toolset, tool_context

skill = mock.create_autospec(models.Skill, instance=True)
skill.name = "demo"
skill.resources = mock.MagicMock()
skill.resources.get_reference.return_value = None  # every reference path "missing"

ctx = mock.MagicMock(spec=tool_context.ToolContext)
ctx.state = {}
ctx.invocation_id = "inv1"
ctx._invocation_context = mock.MagicMock()
ctx.agent_name = "agent"

toolset = skill_toolset.SkillToolset([skill])
tool = skill_toolset.LoadSkillResourceTool(toolset)


async def main():
    for i in range(5):
        r = await tool.run_async(
            args={"skill_name": "demo", "file_path": "references/missing.md"},
            tool_context=ctx,
        )
        # All 5 print RESOURCE_NOT_FOUND on main; the LLM has no reason to stop.
        print(i, r["error_code"])


asyncio.run(main())
How often has this issue occurred?:
- Always (100%): deterministic given (a) any skill whose SKILL.md lets the model infer plausible-looking paths that don't literally exist, or (b) any prompt that doesn't explicitly forbid retrying after RESOURCE_NOT_FOUND.
Proposed Fix
A two-layer fix is proposed in the linked PR (#5651): an invocation-scoped retry guard inside LoadSkillResourceTool.run_async that escalates a repeated (skill, path) failure to a new RESOURCE_NOT_FOUND_FATAL terminal code, plus two additions to _DEFAULT_SKILL_SYSTEM_INSTRUCTION (a no-retry rule and a scope boundary clarifying that load_skill_resource is for skill-bundled files only). This is defense in depth: code-only termination produces confusing downstream behavior, while prompt-only termination relies on the LLM following the rule, so both layers are required.
The retry-guard state is keyed under temp:_adk_skill_resource_failed_paths_<invocation_id>. The temp: prefix uses ADK's existing convention so the value is trimmed from the persisted event delta and never reaches durable session storage. The <invocation_id> suffix ensures correctness on in-memory session backends as well, where temp: keys are added to session.state and are not auto-cleared between invocations — without the suffix, a path that legitimately failed in invocation A would spuriously hit the fatal path on its first attempt in invocation B.
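The keying scheme can be sketched as follows. This is a minimal illustration under stated assumptions, not the code in the PR: the helper names failed_paths_key and not_found_response, the "skill:path" entry format, the fatal message wording, and escalation on the second identical attempt are all hypothetical. Only the key prefix and the two error codes come from the text above. A plain dict stands in for session state.

```python
from typing import Any

# Key prefix named in the text above; everything else here is illustrative.
_KEY_PREFIX = "temp:_adk_skill_resource_failed_paths_"


def failed_paths_key(invocation_id: str) -> str:
    """Builds the invocation-scoped state key."""
    return f"{_KEY_PREFIX}{invocation_id}"


def not_found_response(
    state: dict[str, Any], invocation_id: str, skill_name: str, file_path: str
) -> dict[str, str]:
    """Builds the error for a missing resource, escalating on a repeat."""
    key = failed_paths_key(invocation_id)
    failed = list(state.get(key, []))
    entry = f"{skill_name}:{file_path}"  # hypothetical entry format
    if entry in failed:
        return {
            "error": (
                f"Resource '{file_path}' not found in skill '{skill_name}'. "
                "This path already failed in this invocation; do not retry it."
            ),
            "error_code": "RESOURCE_NOT_FOUND_FATAL",
        }
    failed.append(entry)
    state[key] = failed  # reassign so state backends that track writes see it
    return {
        "error": f"Resource '{file_path}' not found in skill '{skill_name}'.",
        "error_code": "RESOURCE_NOT_FOUND",
    }


state: dict[str, Any] = {}
first = not_found_response(state, "inv-A", "writer", "references/style_guide.md")
second = not_found_response(state, "inv-A", "writer", "references/style_guide.md")
next_invocation = not_found_response(state, "inv-B", "writer", "references/style_guide.md")
print(first["error_code"], second["error_code"], next_invocation["error_code"])
```

The third call shows the purpose of the suffix: the state dict still holds the entry recorded under inv-A, yet the first attempt in inv-B gets the ordinary soft error because it reads a different key.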
Linked PR: #5651