Guardrails via before_model_callback — implementation patterns, error handling & agentic evaluation datasets #4963

ecanlar · 2026-03-23T14:14:13Z

ecanlar
Mar 23, 2026

Context

I'm implementing guardrails in an ADK Python agent using before_model_callback to intercept requests before they reach the model — blocking off-topic, harmful, or policy-violating inputs by returning early with a refusal Content object.

Questions

1. Implementation

Has anyone implemented guardrails this way in production or a serious project?

What backs your classifier inside the callback — a lightweight LLM call, a rule-based filter, an embedding similarity check, a third-party library (Llama Guard, Guardrails AI…)?
Do you return a structured Content object, raise an exception, or something else to signal the block?
Any gotchas with the callback signature or the way ADK handles the early return?

2. Testing that the guardrail actually fires

This is where I'm most uncertain. How do you reliably verify that the callback:

Blocks what it should (returns a refusal, not a partial model response)?
Doesn't over-block legitimate queries?

Concretely:

Do you unit-test the callback in isolation (mocking callback_context and llm_request), or run end-to-end through the full agent?
What do you assert on — the Content object returned, an exception type, a specific field?
Are you using pytest + mocks, the ADK eval framework (adk eval), or something else?

3. Agentic evaluation dataset

Most safety benchmarks (AdvBench, HarmBench, WildGuard…) are designed for single-turn chat and don't capture the attack surface that is specific to agents. I'm looking for datasets or test suites that target agentic weaknesses specifically — things like:

Multi-turn prompt injection — the adversarial payload arrives across several turns, not in a single message
Tool-mediated attacks — malicious instructions embedded in tool outputs or retrieved documents (indirect prompt injection)
Goal hijacking — inputs that subtly redirect the agent's objective without triggering obvious keyword filters
Context manipulation — leveraging the agent's memory or session state to bypass a guardrail that would fire on turn 1

Has anyone found a dataset — or built one internally — that specifically stresses an ADK agent in these ways? And does before_model_callback actually catch these, or are some of them invisible at that interception point?

That last question feels important: are there attack patterns where before_model_callback is structurally blind (e.g. injections that arrive via tool results, which never pass through the callback)?

Any code snippets, patterns, or links to related examples are very welcome. Thanks!

adk-bot · 2026-03-23T14:16:09Z

adk-bot
Mar 23, 2026
Maintainer

Response from ADK Answering Agent (experimental, answer may be inaccurate)

TLDR: Use before_model_callback to return an LlmResponse for input guardrails. Test this with a mix of pytest unit tests (mocking context) and adk eval for end-to-end validation. Be aware that before_model_callback is blind to tool-mediated attacks; you'll need before_tool_callback or other mechanisms to inspect tool arguments and outputs.

Hello! These are excellent and very important questions about implementing robust guardrails for agents. Here is a breakdown of patterns and best practices based on ADK's design.

1. Implementation

Mechanism: You are on the right track. The idiomatic way to implement an input guardrail in ADK is to use before_model_callback. [1, 4] Inside this callback, your logic can inspect the llm_request. If you decide to block the request, you should construct and return an LlmResponse object. [4, 8] This cleanly bypasses the actual model call and sends your predefined refusal back through the agent's execution flow. Returning None allows the request to proceed normally. [4, 8]
Classifier Backend: The choice of classifier is a trade-off.
- Rule-Based: Fastest and cheapest. Good for simple keyword filtering or regex matching.
- Lightweight LLM Call ("Judge" Model): More powerful for semantic understanding. The ADK documentation suggests using a fast and inexpensive model (like Gemini Flash Lite) as a "judge" to screen inputs and outputs. [3] This can be implemented within a callback.
- Plugins: For policies that apply across multiple agents, ADK recommends encapsulating guardrail logic within a Plugin. [3, 6] Plugins are registered at the runner level and can implement before_model_callback, ensuring consistent enforcement without repetitive code. [6, 16]
Gotchas: The main thing to watch for is performance. Since this callback runs before every model call, any significant latency will slow down your agent. Ensure your validation logic is as efficient as possible.

2. Testing Guardrails

A multi-layered testing strategy is recommended. [27]

Unit Testing: This is ideal for testing the guardrail logic in isolation.
- Use a framework like pytest. [2, 9]
- In your test, you can create mock CallbackContext and LlmRequest objects to simulate different scenarios (e.g., harmful input, safe input). [2, 25, 26]
- You would then call your callback function directly and assert that it either returns the expected LlmResponse (for a block) or None (for an allow). [22, 36]
End-to-End Evaluation: This verifies that the guardrail works correctly within the full agent.
- The adk eval framework is designed for this. [9, 21] You can create an evalset file containing test cases with both malicious and benign prompts. [21]
- For the malicious prompts, you can define the final_response to be your agent's refusal message. The response_match_score metric can then automatically score whether the agent's output matches the expected refusal. [27, 33]
- You can run evaluations via the command line (adk eval) or the interactive adk web UI. [13, 21]

3. Agentic Evaluation and Blind Spots

This is a critical point and you've correctly identified a major challenge in agent security.

Agentic Datasets: While general safety benchmarks exist, datasets specifically targeting multi-turn and tool-mediated attacks on agents are still an emerging area of research. Your intuition here is correct.
before_model_callback Blind Spots: You are absolutely right to question this. The before_model_callback is structurally blind to attacks that do not appear in the prompt sent to the model.
- Tool-Mediated Attacks: If a tool returns malicious content (e.g., a document retrieved from a compromised source that contains a prompt injection), this content arrives after the before_model_callback has already run for that turn.
- Sanitizing Tool Inputs & Outputs: To mitigate this, you need to use other callbacks.
  - before_tool_callback: Use this to inspect and validate the arguments the LLM is trying to send to a tool. [1] You can block suspicious tool calls here by returning a dictionary that will be treated as the tool's output. [1, 4]
  - after_tool_callback: Use this to inspect and sanitize the result coming back from a tool before it is processed by the agent and potentially included in the context for the next LLM call. [4]

By combining before_model_callback for user input with before_tool_callback and after_tool_callback for tool interactions, you can create a much more comprehensive, multi-layered defense.

I hope this helps provide some clarity and patterns for your implementation!

0 replies

KOKOSde · 2026-03-24T05:40:52Z

KOKOSde
Mar 24, 2026

before_model_callback is the right place if you want a hard block before any tokens are generated. I would keep the guardrail stack simple. Run cheap deterministic checks first, then only call a smaller classifier model for borderline cases. When you block, return a normal refusal Content object instead of raising, unless you want the whole run treated as an error. For testing, build a fixed set of allowed and disallowed prompts and assert on the callback result directly. That catches regressions much better than manual spot checks.

0 replies

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Guardrails via before_model_callback — implementation patterns, error handling & agentic evaluation datasets #4963

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Guardrails via before_model_callback — implementation patterns, error handling & agentic evaluation datasets #4963

Uh oh!

ecanlar Mar 23, 2026

Context

Questions

1. Implementation

2. Testing that the guardrail actually fires

3. Agentic evaluation dataset

Replies: 2 comments

Uh oh!

adk-bot Mar 23, 2026 Maintainer

1. Implementation

2. Testing Guardrails

3. Agentic Evaluation and Blind Spots

Uh oh!

KOKOSde Mar 24, 2026

ecanlar
Mar 23, 2026

adk-bot
Mar 23, 2026
Maintainer

KOKOSde
Mar 24, 2026