fix(core): filter unsupported multimodal types from tool responses#26352
aishaneeshah merged 2 commits into main
Conversation
Force-pushed b4c5a79 to 77c7d7a
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request resolves a critical issue where the Gemini API rejects tool responses containing binary audio or video data, leading to infinite retry loops in autonomous mode. The solution introduces a filtering mechanism for these unsupported MIME types and provides explicit instructions to the agent on how to correctly reference such files for multimodal analysis, thereby improving API compatibility and agent behavior.

Highlights
Code Review
This pull request introduces logic to filter unsupported audio and video MIME types from tool responses, replacing them with a steering message that instructs the model to request the files using standard multimodal syntax. Feedback includes a critical security recommendation to sanitize MIME types to prevent prompt injection, a suggestion to use case-insensitive matching for MIME type filtering, and a request to revert extensive unrelated changes in package-lock.json to comply with the repository's style guide regarding PR focus.
packages/core/src/utils/generateContentResponseUtilities.ts (109)
The convertToFunctionResponse function extracts the mimeType from tool responses and includes it directly in a steeringMessage that is sent to the LLM. If a tool returns a malicious MIME type containing newline characters and instructions (e.g., audio/mpeg\n\n[SYSTEM INSTRUCTION: ...]), these instructions will be injected into the prompt. This allows for prompt injection attacks where a tool can manipulate the LLM's behavior. To mitigate this, sanitize the MIME types to ensure they only contain valid characters and no newlines.
```ts
const uniqueMimes = Array.from(new Set(unsupportedMimeTypes))
  .map((m) => m.replace(/[^\w/+. -]/g, ''))
  .join(', ');
```
References
- Sanitize data from LLM-driven tools before injecting it into a system prompt to prevent prompt injection. At a minimum, remove newlines and context-breaking characters (e.g., ']').
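A minimal sketch of the suggested sanitization, pulled out into helpers for clarity. The names `sanitizeMimeType` and `formatUniqueMimes` are hypothetical, not identifiers from the codebase:

```typescript
// Hypothetical helper sketching the reviewer's suggestion: strip any
// character that is not plausibly part of a MIME type before the value
// is interpolated into a steering message. Newlines, brackets, and
// colons are all removed by the character class below.
export function sanitizeMimeType(raw: string): string {
  return raw.replace(/[^\w/+. -]/g, '');
}

// De-duplicate the reported MIME types, sanitize each entry, and join
// them for inclusion in the steering message.
export function formatUniqueMimes(unsupportedMimeTypes: string[]): string {
  return Array.from(new Set(unsupportedMimeTypes))
    .map(sanitizeMimeType)
    .join(', ');
}
```

With this shape, a payload such as `audio/mpeg\n\n[SYSTEM INSTRUCTION: ...]` collapses to inert text with no newlines or brackets before it ever reaches the prompt.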
package-lock.json (452-453)
The package-lock.json file contains extensive unrelated changes, specifically adding "peer": true to numerous packages across the monorepo. This violates the repository style guide's requirement to keep pull requests focused and small. Please revert these environmental changes to ensure the PR only contains the necessary fix for multimodal type filtering.
References
- Pull Requests: Keep PRs small, focused, and linked to an existing issue. (link)
packages/core/src/utils/generateContentResponseUtilities.ts (97-101)
MIME types should be checked case-insensitively to ensure that all variations (e.g., AUDIO/MPEG) are correctly filtered. This prevents potential 400 Bad Request errors from the Gemini API if a tool returns non-lowercase MIME types.
```ts
const mimeType = part.inlineData?.mimeType;
const lowerMime = mimeType?.toLowerCase();
if (lowerMime?.startsWith('audio/') || lowerMime?.startsWith('video/')) {
```
Size Change: +2.56 kB (+0.01%) Total Size: 33.9 MB
Force-pushed 861bc90 to b8ca3be
This change prevents 400 Bad Request errors from the Gemini API when a tool returns binary audio or video data. It implements a 'Smart Redirect' that filters these types and instructs the agent to use the @ syntax instead. Permanent unit and integration tests have been added and verified.
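The 'Smart Redirect' described above could look roughly like the following sketch. The types and the `redirectUnsupportedParts` helper are simplified illustrations under assumed names, not the actual implementation:

```typescript
// Illustrative sketch: walk the parts of a tool response and swap any
// audio/video inlineData part for a steering message that points the
// model at the file via @ syntax. Text and image parts pass through.
interface InlineData { mimeType: string; data: string; }
interface Part { text?: string; inlineData?: InlineData; }

export function redirectUnsupportedParts(parts: Part[], filePath: string): Part[] {
  return parts.map((part) => {
    const mime = part.inlineData?.mimeType.toLowerCase();
    if (mime?.startsWith('audio/') || mime?.startsWith('video/')) {
      return {
        text:
          `Binary content of type ${mime} was omitted because the API ` +
          `rejects it inside a functionResponse. To analyze this file, ` +
          `reference it directly with @${filePath} in your next message.`,
      };
    }
    return part; // unsupported-type filter only targets audio/video
  });
}
```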
Force-pushed b8ca3be to 9083553
Code Review
This pull request implements a 'Synthetic Turn Exchange' mechanism to handle binary data (audio/video) from tools like read_file. When binary content is detected, the history is expanded with a cleaned tool response, a synthetic model acknowledgment, and a new user turn containing the binary data. Feedback highlights critical security concerns regarding prompt injection, as the current implementation does not validate the types of injected parts or sanitize tool data. Furthermore, the synthetic turns bypass the ChatRecordingService, leading to incomplete audit logs. There are also suggestions to improve type safety and remove unnecessary eslint-disable comments.
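A simplified sketch of the history expansion this mechanism performs. The types and the `expandBinaryTurn` helper are illustrative assumptions, not the code under review:

```typescript
// Illustrative sketch of the 'Synthetic Turn Exchange': when a tool
// returns binary parts, the history grows by three entries -- the
// cleaned tool response, a synthetic model acknowledgment, and a new
// user turn carrying the binary data, where the API accepts it.
interface Part { text?: string; inlineData?: { mimeType: string; data: string } }
interface Content { role: 'user' | 'model'; parts: Part[] }

export function expandBinaryTurn(
  history: Content[],
  cleanedToolResponse: Part[],
  binaryParts: Part[],
): Content[] {
  return [
    ...history,
    { role: 'user', parts: cleanedToolResponse },  // tool response minus binary data
    { role: 'model', parts: [{ text: 'Acknowledged. Providing the file content next.' }] },
    { role: 'user', parts: binaryParts },          // binary data in a user turn
  ];
}
```

The review's point is that all three synthetic entries should also flow through `ChatRecordingService.recordMessage`, not only the real turns.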
packages/core/src/core/geminiChat.ts (523-547)
The extractBinaryInjections method is vulnerable to prompt injection as it extracts parts from the __binary_injection__ key without validating the content. Per repository guidelines (Rule 2), data from LLM-driven tools must be sanitized (e.g., removing newlines and context-breaking characters like ']') before injection into prompts. Additionally, the response object being typed as object leads to compilation errors and an eslint-disable on line 536, which contradicts the PR's claim of 'Zero eslint-disable'. To mitigate this, validate the content of __binary_injection__ and cast response to Record<string, unknown> for type-safe access.
```ts
if (response && BINARY_INJECTION_KEY in response) {
  const responseObj = response as Record<string, unknown>;
  const binaryParts = responseObj[BINARY_INJECTION_KEY] as Part[];
  delete responseObj[BINARY_INJECTION_KEY];
```

References
- Sanitize data from LLM-driven tools before injecting it into a system prompt to prevent prompt injection. At a minimum, remove newlines and context-breaking characters (e.g., ']').
packages/core/src/core/geminiChat.ts (355-377)
The implementation of the 'Synthetic Turn Exchange' mechanism for binary data injection is vulnerable to prompt injection and audit log bypass.
- Prompt Injection: The `extractBinaryInjections` method (lines 523-547) extracts any `Part` objects associated with the `__binary_injection__` key without validating their type. Since `sendMessageStream` accepts the `message` parameter from untrusted sources, an attacker can craft a message containing a `functionResponse` with this key to inject arbitrary `text` parts into the conversation history. These injected parts are then pushed as a new `user` turn (lines 373-376), which the model will treat as a fresh instruction. Per Rule 2, tool data must be sanitized before injection.
- Logging Bypass: The synthetic turns (the model acknowledgment and the injected user data) are pushed directly to `agentHistory` (lines 358, 361, 379) but bypass the `ChatRecordingService.recordMessage` call (lines 331-352). This allows an attacker to inject prompts that do not appear in the audit logs, facilitating stealthy attacks.
Remediation:
- In `extractBinaryInjections`, strictly validate that extracted parts are only multimodal types (e.g., `inlineData` or `fileData`) and explicitly reject `text` parts.
- Ensure that data from LLM-driven tools is sanitized (e.g., removing newlines and context-breaking characters) before injection.
- Ensure that all turns in the synthetic exchange are properly recorded by the `ChatRecordingService` to maintain audit integrity.
- Verify that the `__binary_injection__` key is only processed when it originates from a trusted tool execution flow, handled at the tool or subagent level (Rule 1).
References
- Prompt injection sanitization should be handled at the tool or subagent level, not on a per-tool basis within the agent executor.
- Sanitize data from LLM-driven tools before injecting it into a system prompt to prevent prompt injection. At a minimum, remove newlines and context-breaking characters (e.g., ']').
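The first remediation point could be sketched as follows. `validateBinaryInjectionParts` is a hypothetical helper illustrating the strict allow-list, not code from this PR:

```typescript
// Sketch of the suggested validation: accept only multimodal parts
// (inlineData or fileData) from the injection key and reject anything
// carrying text, which could smuggle instructions into the user turn.
interface Part {
  text?: string;
  inlineData?: { mimeType: string; data: string };
  fileData?: { mimeType: string; fileUri: string };
}

export function validateBinaryInjectionParts(parts: unknown): Part[] {
  if (!Array.isArray(parts)) return [];
  return parts.filter((p): p is Part => {
    if (typeof p !== 'object' || p === null) return false;
    const part = p as Part;
    // Explicitly reject text parts; allow only multimodal payloads.
    if (part.text !== undefined) return false;
    return part.inlineData !== undefined || part.fileData !== undefined;
  });
}
```

Using an allow-list of part shapes (rather than a deny-list of bad strings) means a crafted `functionResponse` cannot push fresh instructions into the synthetic user turn.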
Summary
This PR addresses a critical protocol limitation where the Gemini API returns a `400 Bad Request` when binary audio or video data (e.g., `audio/mpeg`, `video/mp4`) is included in a `functionResponse` part.

The fix implements an automated "One-Go" Synthetic Turn Exchange for the `read_file` and `read_many_files` tools. This allows the agent to analyze multimodal content in a single interaction without user intervention or protocol violations.

Details
Problem
When a tool (like `read_file`) returns binary audio/video content, the CLI currently attempts to pass that data directly into the `functionResponse` part of the next turn. The Gemini API explicitly rejects these types in this specific protocol context, causing a 400 error and triggering infinite retry loops in autonomous mode.

Solution
- Scope (`read_file`/`read_many_files`): binary content is relocated into a `user` turn where it is fully supported by the API.
- Zero `eslint-disable`: The final implementation satisfies all strict linting rules without suppressions.
- Tests: Covered in `generateContentResponseUtilities.test.ts` and `geminiChat.test.ts`.

Related Issues
Fixes #25214
How to Validate
```sh
npm run test -w @google/gemini-cli-core -- src/utils/generateContentResponseUtilities.test.ts src/core/geminiChat.test.ts
```

Pre-Merge Checklist