fix(llm): convert per-turn instructions on the very first turn too #5828
`convert_mid_conversation_instructions` previously only rewrote system/developer messages to a user-role `<instructions>` turn AFTER the first user or assistant turn had landed (`seen_non_system`). On the very first `generate_reply(instructions=...)` of a session, the chat context contains only the agent's base prompt plus the freshly appended per-turn instructions, both system, with no user/assistant content yet. The condition fails, both stay `system`, and providers that require the conversation to end on a `user`/`tool` turn fall back to `inject_dummy_user_message`, a literal `"."` user message the model frequently answers with "you didn't say anything". Simpler rule: the first system message is the preamble, every subsequent system message is per-turn instructions and gets converted. Same template, same target role.
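To make the difference concrete, here is a minimal sketch of the old and new rules applied to a first-turn context. It is not the library's implementation: `Msg`, `TEMPLATE`, `convert_old`, and `convert_new` are hypothetical stand-ins, and the real chat items and the real `<instructions>` template may differ.

```python
from dataclasses import dataclass


# Hypothetical stand-in for a chat item; real ChatContext items carry more fields.
@dataclass(frozen=True)
class Msg:
    role: str
    text: str


# Assumed wrapper text; the actual template lives in the library and may differ.
TEMPLATE = "<instructions>\n{text}\n</instructions>"
SYSTEM_ROLES = ("system", "developer")


def _to_user(msg: Msg) -> Msg:
    """Rewrite a system/developer message as a user-role instructions turn."""
    return Msg(role="user", text=TEMPLATE.format(text=msg.text))


def convert_old(items: list[Msg]) -> list[Msg]:
    """Old rule: convert only after a non-system item has been seen."""
    out, seen_non_system = [], False
    for m in items:
        if m.role in SYSTEM_ROLES:
            out.append(_to_user(m) if seen_non_system else m)
        else:
            seen_non_system = True
            out.append(m)
    return out


def convert_new(items: list[Msg]) -> list[Msg]:
    """New rule: the first system/developer message is the preamble, the rest convert."""
    out, first_system_seen = [], False
    for m in items:
        if m.role in SYSTEM_ROLES and first_system_seen:
            out.append(_to_user(m))
        else:
            if m.role in SYSTEM_ROLES:
                first_system_seen = True
            out.append(m)
    return out


# First generate_reply(instructions=...) of a session: base prompt + per-turn instructions.
first_turn = [Msg("system", "You are a helpful agent."), Msg("system", "Greet the caller.")]
old_roles = [m.role for m in convert_old(first_turn)]
new_roles = [m.role for m in convert_new(first_turn)]
print(old_roles, new_roles)
```

Under the old rule both messages stay `system`, so the context has no trailing `user` turn; under the new rule the second message becomes a `user` turn and no dummy message is needed.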
`if item.type == "message" and item.role in ("system", "developer"):`
`first_system_seen = True`
Mid-conversation system message not converted when no system message precedes it
The new first_system_seen flag only starts converting system messages after encountering the first system/developer message. If a ChatContext has no leading system message but has one mid-conversation (e.g., [user, system, assistant]), the old code would convert that system message to a user-role message (preserving its positional context), but the new code treats it as the "first" system message and keeps it as-is.
Trace through the regression scenario
With [user("hello"), system("be concise"), assistant("hi")]:
- Old code: user → seen_non_system=True, system → converted to user role (position preserved), assistant → kept.
- New code: user → kept (not system, first_system_seen stays False), system → falls to else branch since first_system_seen is False → sets first_system_seen=True, kept as-is, assistant → kept.
Downstream formatters (google.py:33-35, anthropic.py:36-38) extract all remaining system messages into a preamble system_messages list, so the mid-conversation system message loses its positional context entirely: it gets hoisted to the preamble instead of staying inline as a user-role message.
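The hoisting effect can be illustrated with a small sketch. `split_preamble` is a hypothetical simplification, not the code in google.py or anthropic.py; it only assumes that a formatter separates system messages from the turn list.

```python
SYSTEM_ROLES = ("system", "developer")


def split_preamble(items):
    """Hypothetical formatter step: pull every system message out of the turn list."""
    system_messages = [text for role, text in items if role in SYSTEM_ROLES]
    turns = [(role, text) for role, text in items if role not in SYSTEM_ROLES]
    return system_messages, turns


# The unconverted mid-conversation system message from the regression scenario.
ctx = [("user", "hello"), ("system", "be concise"), ("assistant", "hi")]
system_messages, turns = split_preamble(ctx)
print(system_messages)
print(turns)
```

After the split, "be concise" sits in the preamble and nothing in `turns` records that it arrived between the user and assistant messages.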
This affects any caller that passes a ChatContext without a leading system/developer message to Google, Anthropic, or AWS LLM plugins (which all call convert_mid_conversation_instructions). In the standard Agent flow this is unlikely because update_instructions(add_if_missing=True) always inserts a system message at index 0, but direct LLM usage with custom ChatContext objects can trigger it.
Prompt for agents
The root issue is that `first_system_seen` conflates two distinct concerns: (1) identifying the preamble system message to preserve, and (2) detecting that we are past the preamble. The old code used seen_non_system which correctly handled mid-conversation system messages even when no system message appeared at the start. A possible fix is to combine both signals: keep a system message as-is only if it is both the first system message AND no non-system item has been seen yet. For example, track both `first_system_seen` and `seen_non_system`, and only skip conversion when `not first_system_seen and not seen_non_system`. Alternatively, explicitly check that a system/developer message is at position 0 or part of the leading system block. The relevant function is `convert_mid_conversation_instructions` in `livekit-agents/livekit/agents/llm/_provider_format/utils.py`.
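A minimal sketch of the combined-signal variant described above, under simplifying assumptions: items are plain `(role, text)` tuples, and `TEMPLATE` and `convert_combined` are hypothetical names rather than the library's API.

```python
# Assumed wrapper text; the actual template lives in the library and may differ.
TEMPLATE = "<instructions>\n{text}\n</instructions>"
SYSTEM_ROLES = ("system", "developer")


def convert_combined(items):
    """Keep a system message only if it is the first one AND nothing non-system precedes it."""
    out = []
    first_system_seen = False
    seen_non_system = False
    for role, text in items:
        if role in SYSTEM_ROLES:
            if not first_system_seen and not seen_non_system:
                out.append((role, text))  # leading preamble, kept as-is
            else:
                out.append(("user", TEMPLATE.format(text=text)))  # converted in place
            first_system_seen = True
        else:
            seen_non_system = True
            out.append((role, text))
    return out


# Regression scenario: no leading system message.
regression = [("user", "hello"), ("system", "be concise"), ("assistant", "hi")]
# First-turn scenario the PR targets: base prompt plus per-turn instructions.
first_turn = [("system", "base prompt"), ("system", "greet the caller")]

regression_roles = [role for role, _ in convert_combined(regression)]
first_turn_roles = [role for role, _ in convert_combined(first_turn)]
print(regression_roles, first_turn_roles)
```

With both flags tracked, the mid-conversation system message is converted in place, and the first-turn case still ends on a `user` turn.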
Summary
`convert_mid_conversation_instructions` only rewrote system messages to a user-role `<instructions>` turn after the conversation had already seen a user/assistant turn (the `seen_non_system` flag).
On the very first `generate_reply(instructions=...)` of a session, the chat context contains:
- `system`: agent's base prompt
- `system`: per-turn instructions just appended by `generate_reply`
…and no user/assistant turn yet. The condition fails, both stay `system`, and the providers that require the conversation to end on a `user`/`tool` turn (Gemini, Anthropic, AWS) fall back to `inject_dummy_user_message`, which appends a literal `{"role": "user", "parts": [{"text": "."}]}`. The model, Gemini especially, sees the lone `.` and responds with things like "You haven't said anything yet, did you mean to ask something?"
Fix
Drop the `seen_non_system` tracking and apply a single rule: the first system/developer message is the preamble, every subsequent system/developer message is per-turn instructions and gets converted using the existing role + template.
This keeps:
".", which is the intentional behaviour when there is genuinely nothing for the model to respond to);
…and adds the new case: base + per-turn before any user turn now also gets converted, so providers never have to inject the dummy.
Smoke test
Benefits Anthropic, AWS, and Google providers (all three call `convert_mid_conversation_instructions`). OpenAI handles mid-conversation system messages natively and isn't affected.
Test plan
`session.generate_reply(instructions=...)` as its first action no longer responds with "you didn't say anything" under Gemini.