Python: fix reasoning model workflow handoff and history serialization#4083

Merged
TaoChenOSU merged 11 commits into microsoft:main from eavanvalkenburg:fix_4047
Feb 19, 2026
Conversation

@eavanvalkenburg (Member) commented Feb 19, 2026

Summary

Fixes multiple related failures when using reasoning models (gpt-5-mini, gpt-5.2) in multi-agent workflows. The root causes all concern how reasoning items from the Responses API are emitted, serialized, and carried into subsequent agent runs.

Closes #4047


Problems Fixed

1. "reasoning was provided without its required following item"

The Responses API accepts a reasoning item in the input only when it directly precedes a function_call. Sending a reasoning item that preceded a text response (no tool call) causes an API error.

Fix: _prepare_message_for_openai now checks whether the message contains a function_call. text_reasoning content is only serialized as a reasoning input item when a function_call is also present in the same message.
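The guard can be sketched in isolation. This is a minimal illustration, not the actual `_prepare_message_for_openai` code; the content dicts and type strings are assumptions based on the PR description:

```python
def serializable_input_items(contents: list[dict]) -> list[dict]:
    """Keep text_reasoning items only when the same message also carries a function_call.

    Illustrative sketch: a reasoning item sent without a following
    function_call is rejected by the Responses API with
    "reasoning was provided without its required following item".
    """
    has_function_call = any(c["type"] == "function_call" for c in contents)
    items = []
    for c in contents:
        if c["type"] == "text_reasoning" and not has_function_call:
            # No accompanying tool call in this message: skip the
            # reasoning item instead of triggering the API error.
            continue
        items.append(c)
    return items
```

With a text-only message the reasoning item is dropped; with a tool-calling message it is preserved unchanged.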

2. Reasoning items never emitted for encrypted/hidden reasoning

When a reasoning model produces encrypted or hidden reasoning, the output_item.added event fires with an empty content list and no reasoning_text.delta events follow. Previously, no text_reasoning Content was emitted — making it invisible to downstream serialization logic.

Fix: Both _parse_response_from_openai (non-streaming) and the output_item.added handler (streaming) now always emit at least one text_reasoning Content, even when the text is empty. The reasoning_id and encrypted_content (if present) are stored in additional_properties.
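A sketch of the always-emit behavior (not the framework's actual parser; the item/content shapes are assumptions inferred from the description):

```python
def parse_reasoning_item(item: dict) -> list[dict]:
    """Always emit at least one text_reasoning content, even for hidden reasoning."""
    contents = [
        {"type": "text_reasoning", "text": part.get("text", "")}
        for part in item.get("content", [])
    ]
    if not contents:
        # Encrypted/hidden reasoning: the content list is empty and no
        # reasoning_text.delta events follow, but downstream serialization
        # still needs a marker carrying the reasoning_id.
        contents = [{"type": "text_reasoning", "text": ""}]
    for c in contents:
        props = {"reasoning_id": item["id"]}
        if item.get("encrypted_content"):
            props["encrypted_content"] = item["encrypted_content"]
        c["additional_properties"] = props
    return contents
```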

3. summary field must be an array, not an object

The summary field on a reasoning input item must be an array of objects ([{"type": "summary_text", "text": ...}]), not a single object. This caused a 400 invalid_type error.

Fix: _prepare_content_for_openai now wraps summary in a list. summary is omitted entirely when there is no visible text (e.g. encrypted reasoning, where only encrypted_content is sent).


Files Changed

packages/core/agent_framework/openai/_responses_client.py: Always emit text_reasoning on reasoning output items; fix summary to be an array; skip reasoning serialization when there is no function_call in the same message.
packages/core/agent_framework/_workflows/_agent_executor.py: Clear service_session_id in run and from_response handlers; remove the no-op _prepare_handoff_messages.
packages/core/tests/workflow/test_full_conversation.py: Add test_run_request_with_full_history_clears_service_session_id and test_from_response_clears_service_session_id (TDD: fail without the fix, pass with it).

Copilot AI review requested due to automatic review settings February 19, 2026 13:04
@github-actions bot changed the title from "fix(python): reasoning model workflow handoff and history serialization" to "Python: fix(python): reasoning model workflow handoff and history serialization" Feb 19, 2026
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes workflow + function-calling failures when using reasoning-capable models with the OpenAI/Azure Responses API by tightening how reasoning items are emitted/serialized and by preventing duplicate history replay across agent handoffs.

Changes:

  • Adjusts Responses API parsing/serialization to (a) only include reasoning input items when paired with a function_call, (b) always emit a text_reasoning marker (even empty) for hidden/encrypted reasoning, and (c) serialize summary as an array.
  • Updates workflow execution to clear service_session_id when explicitly replaying full history to avoid “Duplicate item found” errors.
  • Improves function-invocation behavior across multi-message responses and adds/expands tests (unit + integration) covering these scenarios.

Reviewed changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 6 comments.

python/packages/core/agent_framework/openai/_responses_client.py: Updates reasoning item parsing and input serialization rules for the Responses API.
python/packages/core/agent_framework/_workflows/_agent_executor.py: Clears service_session_id when replaying explicit history into an executor.
python/packages/core/agent_framework/_tools.py: Improves function-call extraction across multiple messages and adjusts stop-path handling.
python/packages/core/tests/workflow/test_full_conversation.py: Adds workflow tests for handoff history and service_session_id clearing.
python/packages/core/tests/core/test_function_invocation_logic.py: Adds tests for multi-message function calls and stop-path conversation_id behavior.
python/packages/core/tests/azure/test_azure_responses_client.py: Adds an integration test that validates minimal workflow handoff across reasoning and non-reasoning deployments.
python/samples/05-end-to-end/workflow_evaluation/run_evaluation.py: Updates the default workflow deployment name to a reasoning model for the evaluation sample.
python/samples/02-agents/conversations/redis_chat_message_store_session.py: Makes the Redis URL configurable via the REDIS_URL env var and updates sample messaging.

@markwallace-microsoft (Member) commented Feb 19, 2026

Python Test Coverage

Python Test Coverage Report

| File | Stmts | Miss | Cover | Missing |
| --- | --- | --- | --- | --- |
| packages/core/agent_framework/_tools.py | 846 | 91 | 89% | 165–166, 303, 305, 323–325, 332, 350, 364, 371, 378, 394, 396, 403, 440, 465, 469, 486–488, 535–537, 600, 622, 685–691, 727, 738–749, 771–773, 778, 782, 796–798, 837, 906, 916, 926, 982, 1013, 1032, 1294, 1351, 1371, 1442–1446, 1568, 1572, 1596, 1622, 1624, 1640, 1642, 1727, 1757, 1777, 1779, 1830, 1893, 2077–2078, 2115, 2128, 2138–2139, 2174–2175, 2235 |
| packages/core/agent_framework/_types.py | 998 | 87 | 91% | 49, 58–59, 113, 118, 137, 139, 143, 147, 149, 151, 153, 171, 175, 201, 223, 228, 233, 237, 263, 267, 615–616, 987, 1049, 1066, 1084, 1089, 1107, 1117, 1134–1135, 1137, 1155–1156, 1158, 1165–1166, 1168, 1203, 1214–1215, 1217, 1255, 1482, 1534, 1625–1630, 1652, 1657, 1823, 1835, 2078, 2087, 2108, 2203, 2428, 2635, 2705, 2717, 2735, 2933–2935, 2938–2940, 2944, 2949, 2953, 3037–3039, 3068, 3122, 3141–3142, 3145–3149, 3155 |
| packages/core/agent_framework/_workflows/_agent_executor.py | 200 | 26 | 87% | 97, 113, 168–169, 221–222, 224–225, 255–257, 265–267, 275–277, 279, 283, 287, 291–292, 391–392, 438, 456 |
| packages/core/agent_framework/openai/_responses_client.py | 639 | 87 | 86% | 290–293, 297–298, 301–302, 308–309, 314, 327–333, 354, 362, 385, 548, 551, 606, 610, 612, 614, 616, 692, 702, 707, 750, 829, 846, 859, 920, 1011, 1016, 1020–1022, 1026–1027, 1050, 1119, 1141–1142, 1157–1158, 1176–1177, 1218–1221, 1330–1331, 1347, 1349, 1428–1436, 1555, 1610, 1625, 1668–1671, 1679–1680, 1682–1684, 1698–1700, 1710–1711, 1717, 1732 |
| TOTAL | 21261 | 3314 | 84% | |

Python Unit Test Overview

| Tests | Skipped | Failures | Errors | Time |
| --- | --- | --- | --- | --- |
| 4189 | 240 💤 | 0 ❌ | 0 🔥 | 1m 14s ⏱️ |

@markwallace-microsoft markwallace-microsoft added the lab Agent Framework Lab label Feb 19, 2026
@eavanvalkenburg changed the title from "Python: fix(python): reasoning model workflow handoff and history serialization" to "Python: fix reasoning model workflow handoff and history serialization" Feb 19, 2026
giles17 and others added 8 commits February 19, 2026 20:23
… handoff

When a reasoning model (e.g. gpt-5-mini) runs as Agent 1 in a workflow, its
response includes text_reasoning items (with server-scoped IDs like rs_XXXX)
and function_call items. Forwarding these to Agent 2 in a fresh conversation
caused API errors because the reasoning/call IDs are scoped to the original
stored response context.

Changes:
- Strip 'function_call', 'text_reasoning', 'function_approval_request', and
  'function_approval_response' from handoff messages in _agent_executor.py
- Keep 'function_result' so the actual tool output content is preserved for
  the next agent's context
- Update unit tests to reflect that function_result messages survive handoff
  (messages grow from 2→3: user, tool(result), assistant(summary))
- Fix incorrect test assertions in test_function_invocation_stop_clears_*
  that assumed the client layer updates session.service_session_id
- Also fixed _extract_function_calls to search all messages with call_id
  deduplication, and the error-limit stop path to submit function_call_output
  items before halting (via tool_choice=none cleanup call)
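The call_id-deduplicating extraction described above can be sketched as follows (a simplified stand-in for `_extract_function_calls`, not the actual code; the message/content shapes are assumptions):

```python
def extract_function_calls(messages: list[dict]) -> list[dict]:
    """Collect function_call contents across ALL messages, deduplicating by call_id.

    Previously only a single message was searched; a response spanning
    multiple messages could drop or duplicate tool calls.
    """
    seen: set[str] = set()
    calls = []
    for msg in messages:
        for content in msg.get("contents", []):
            if content.get("type") != "function_call":
                continue
            call_id = content["call_id"]
            if call_id in seen:
                # Same call surfaced twice (e.g. replayed history): keep one.
                continue
            seen.add(call_id)
            calls.append(content)
    return calls
```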

Relates to: microsoft#4047

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Fixes multiple related issues when using reasoning models (gpt-5-mini,
gpt-5.2) in multi-agent workflows that chain agents via from_response
or replay full conversation history via AgentExecutorRequest.

## Reasoning items always emitted on output_item.added

When a reasoning model produces encrypted or hidden reasoning (no
visible text), the Responses API still fires a reasoning output item
without any reasoning_text.delta events. Previously no text_reasoning
Content was emitted in that case, making it invisible to downstream
logic. Both the non-streaming (_parse_response_from_openai) and
streaming (output_item.added) paths now always emit at least one
text_reasoning Content — with empty text if no content is available —
so co-occurrence detection and serialization guards work reliably.

## Reasoning items only serialized when paired with a function_call

The Responses API only accepts reasoning items in input when they
directly preceded a function_call in the original response. Sending a
reasoning item that preceded a text response (no tool call) causes:
  "reasoning was provided without its required following item"
_prepare_message_for_openai now checks has_function_call per message
and skips text_reasoning serialization when there is no accompanying
function_call.

## summary field is an array, not an object

The reasoning item summary field sent to the Responses API must be an
array of objects ([{"type": "summary_text", "text": ...}]), not a
single object. Fixed _prepare_content_for_openai accordingly.

## service_session_id cleared when explicit history is provided

When a workflow coordinator replays a full conversation (including
function calls from a previous agent run) back to an executor via
AgentExecutorRequest or from_response, the executor's session still
held a service_session_id (previous_response_id) from the prior run.
The API then received the same function-call items twice — once from
previous_response_id (server-stored) and once from the explicit input —
causing: "Duplicate item found with id fc_...".

AgentExecutor.run (when should_respond=True) and from_response now
reset self._session.service_session_id = None before running so that
explicit input is the sole source of conversation context.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…cit history replay

Replace the implicit 'always clear service_session_id when should_respond=True'
with an explicit opt-in field on AgentExecutorRequest.

The old approach used should_respond=True as a proxy for 'full history replay',
but that conflates two distinct intents:
- Orchestrations group chat sends should_respond=True with an empty/single-message
  list (not a full replay) — unnecessarily clearing service_session_id.
- HITL / feedback coordinators send the full prior conversation and truly need
  a fresh service session ID to avoid duplicate-item API errors.

Changes:
- Add AgentExecutorRequest.reset_service_session: bool = False
- AgentExecutor.run only clears service_session_id when this flag is True
- AgentExecutor.from_response unchanged (always clears; always full conversation)
- Set reset_service_session=True in all full-history-replay call sites:
  agents_with_HITL.py, azure_chat_agents_tool_calls_with_feedback.py,
  autogen-migration round-robin coordinator, tau2 runner
- Update _FullHistoryReplayCoordinator test helper to pass the flag

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
eavanvalkenburg and others added 3 commits February 19, 2026 20:25
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@eavanvalkenburg eavanvalkenburg added this pull request to the merge queue Feb 19, 2026
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Feb 19, 2026
@TaoChenOSU TaoChenOSU added this pull request to the merge queue Feb 19, 2026
Merged via the queue into microsoft:main with commit 67ce1ba Feb 19, 2026
25 checks passed

Labels

lab (Agent Framework Lab), python

Development

Successfully merging this pull request may close these issues.

Python: Workflow evaluation fails with reasoning models but succeeds with non-reasoning models

5 participants