From 2d05a63b5088e1995731603ccdaaa9b77253178f Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Mon, 2 Mar 2026 04:18:37 +0000
Subject: [PATCH 01/11] Initial plan

From 578668b577af669efd51974a61820a18af95af4c Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Mon, 2 Mar 2026 04:26:57 +0000
Subject: [PATCH 02/11] Update schema spec to align with final multimodal
 proposal from issue #416

Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com>
---
 specs/activity/protocol-activity.md | 570 +++++++++++++++++++++++++++-
 1 file changed, 563 insertions(+), 7 deletions(-)

diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md
index 2ed2e701..ce6a35b7 100644
--- a/specs/activity/protocol-activity.md
+++ b/specs/activity/protocol-activity.md
@@ -1,6 +1,6 @@
 # Activity Protocol -- Activity

-Version: Provisional 3.3
+Version: Provisional 3.4

 ## Abstract

@@ -638,6 +638,226 @@ Possible values for `contentType` are audio, video, text, screen, all or any oth
 }
 ```

+### Reserved Events for Media Streaming
+
+Media streaming events are used to facilitate real-time multimodal interactions, particularly for voice and audio streaming. These events use the `Media.*` prefix and work in conjunction with the [`streaminfo`](#streaminfo) entity for stream metadata and sequencing.
+
+`A5210`: Media streaming events MUST use the `Media.*` prefix for their `name` field.
+
+`A5211`: Media streaming events SHOULD include a [`streaminfo`](#streaminfo) entity to convey stream metadata.
+
+`A5212`: Media streaming events MAY use the `value` and `valueType` fields to carry modality-specific content.
+
+#### Media.Start
+
+The `Media.Start` event initiates a media streaming session. It establishes the stream context and the media type that will be transmitted.
+
+| Field       | Type   | Required | Description                                      |
+|-------------|--------|----------|--------------------------------------------------|
+| `type`      | string | Yes      | Must be `"event"`                                |
+| `name`      | string | Yes      | Must be `"Media.Start"`                          |
+| `valueType` | string | No       | Identifies the schema of the `value` object, e.g., `"application/vnd.microsoft.activity.mediastart+json"` |
+| `value`     | object | No       | Contains media type and content type information |
+| `entities`  | array  | Yes      | Must include a [`streaminfo`](#streaminfo) entity with `streamType` of `"streaming"` |
+
+Example:
+```json
+{
+  "type": "event",
+  "name": "Media.Start",
+  "valueType": "application/vnd.microsoft.activity.mediastart+json",
+  "value": {
+    "mediaType": "audio",
+    "contentType": "audio/webm"
+  },
+  "entities": [
+    {
+      "type": "streaminfo",
+      "streamId": "abc123",
+      "streamType": "streaming",
+      "streamSequence": 1
+    }
+  ]
+}
+```
+
+`A5220`: Senders MUST include a [`streaminfo`](#streaminfo) entity in `Media.Start` events with a valid `streamId`.
+
+`A5221`: The `streamSequence` in `Media.Start` SHOULD be `1`, as it initiates the stream.
+
+#### Media.Chunk
+
+The `Media.Chunk` event sends a chunk of media data during an active streaming session. Chunks are sequenced using the `streamSequence` field in the [`streaminfo`](#streaminfo) entity.
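As a non-normative illustration of the sequencing rule above, a sender might derive the chunk sequence like this — the helper name, the 32 kB chunk size, and the use of inline data URIs are illustrative choices, not protocol requirements:

```python
import base64

def make_media_chunks(audio, stream_id, content_type="audio/webm", chunk_bytes=32_000):
    """Split raw audio bytes into sequenced Media.Chunk activities.

    streamSequence 1 belongs to the Media.Start event, so chunks start at 2.
    """
    activities = []
    for index, offset in enumerate(range(0, len(audio), chunk_bytes)):
        payload = base64.b64encode(audio[offset:offset + chunk_bytes]).decode("ascii")
        activities.append({
            "type": "event",
            "name": "Media.Chunk",
            "valueType": "application/vnd.microsoft.activity.audiochunk+json",
            "value": {
                "contentType": content_type,
                # Inline data URI carrying the Base64-encoded chunk
                "contentUrl": f"data:{content_type};base64,{payload}",
            },
            "entities": [{
                "type": "streaminfo",
                "streamId": stream_id,
                "streamType": "streaming",
                "streamSequence": index + 2,  # monotonically increasing per chunk
            }],
        })
    return activities
```

For example, 70 kB of audio split at 32 kB per chunk yields three `Media.Chunk` activities with `streamSequence` 2, 3, and 4.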
+
+| Field       | Type   | Required | Description                                      |
+|-------------|--------|----------|--------------------------------------------------|
+| `type`      | string | Yes      | Must be `"event"`                                |
+| `name`      | string | Yes      | Must be `"Media.Chunk"`                          |
+| `valueType` | string | No       | Identifies the schema of the `value` object, e.g., `"application/vnd.microsoft.activity.audiochunk+json"` |
+| `value`     | object | Yes      | Contains the media chunk data                    |
+| `entities`  | array  | Yes      | Must include a [`streaminfo`](#streaminfo) entity |
+
+The `value` object for audio chunks typically includes:
+
+| Property       | Type    | Required | Description                                    |
+|----------------|---------|----------|------------------------------------------------|
+| `contentType`  | string  | Yes      | MIME type of the media, e.g., `"audio/webm"`   |
+| `contentUrl`   | string  | Yes      | Data URI containing Base64-encoded media data  |
+| `durationMs`   | integer | No       | Duration of the chunk in milliseconds          |
+| `timestamp`    | string  | No       | ISO 8601 timestamp of the chunk                |
+| `transcription`| string  | No       | Optional real-time transcription of the audio  |
+
+Example:
+```json
+{
+  "type": "event",
+  "name": "Media.Chunk",
+  "valueType": "application/vnd.microsoft.activity.audiochunk+json",
+  "value": {
+    "contentType": "audio/webm",
+    "contentUrl": "data:audio/webm;base64,...",
+    "durationMs": 2500,
+    "timestamp": "2025-10-07T10:30:05Z",
+    "transcription": "Your destination?"
+  },
+  "entities": [
+    {
+      "type": "streaminfo",
+      "streamId": "abc123",
+      "streamType": "streaming",
+      "streamSequence": 2
+    }
+  ]
+}
+```
+
+`A5230`: Senders MUST include a [`streaminfo`](#streaminfo) entity in `Media.Chunk` events with the same `streamId` as the corresponding `Media.Start`.
+
+`A5231`: The `streamSequence` MUST increase monotonically for each chunk within the same stream.
+
+`A5232`: Receivers SHOULD use `streamSequence` to order chunks and detect missing chunks.
+
+#### Media.End
+
+The `Media.End` event signals the end of a media streaming session.
+
+| Field       | Type   | Required | Description                                      |
+|-------------|--------|----------|--------------------------------------------------|
+| `type`      | string | Yes      | Must be `"event"`                                |
+| `name`      | string | Yes      | Must be `"Media.End"`                            |
+| `valueType` | string | No       | Identifies the schema, e.g., `"application/vnd.microsoft.activity.mediaend+json"` |
+| `entities`  | array  | Yes      | Must include a [`streaminfo`](#streaminfo) entity with `streamType` of `"final"` |
+
+Example:
+```json
+{
+  "type": "event",
+  "name": "Media.End",
+  "valueType": "application/vnd.microsoft.activity.mediaend+json",
+  "entities": [
+    {
+      "type": "streaminfo",
+      "streamId": "abc123",
+      "streamType": "final",
+      "streamSequence": 3
+    }
+  ]
+}
+```
+
+`A5240`: Senders MUST include a [`streaminfo`](#streaminfo) entity in `Media.End` events with `streamType` set to `"final"`.
+
+`A5241`: Receivers SHOULD clean up stream resources upon receiving `Media.End`.
+
+#### Request State Events (Optional)
+
+In addition to media streaming events, implementations MAY emit **event‑based runtime or request state updates** to convey observational state during an interaction (for example: speech detected, processing, response available, or barge-in detected).
+
+These events:
+- Represent **observational runtime state**
+- Do **not** change session configuration
+- Do **not** instruct the receiver to take action
+- Are **not** modeled as commands
+
+The Activity Protocol does **not** standardize request or audio state machines, nor does it mandate specific event names. Event names shown below are **illustrative only** and are not introduced as protocol primitives.
+
+##### Example (Illustrative Only)
+```json
+{
+  "type": "event",
+  "name": "request.update",
+  "valueType": "application/vnd.microsoft.activity.audio.state+json",
+  "value": {
+    "state": "processing",
+    "message": "Request is being processed"
+  }
+}
+```
+
+##### Example: Barge‑In Detected (Illustrative Only)
+```json
+{
+  "type": "event",
+  "name": "request.bargeIn",
+  "valueType": "application/vnd.microsoft.activity.audio.state+json",
+  "value": {
+    "signal": "bargeIn",
+    "origin": "user"
+  }
+}
+```
+
+#### Voice Message
+
+A voice message is sent as a `message` activity using the `valueType` and `value` fields to carry the audio content. This is the final user-visible output for a voice modality interaction.
+
+| Field       | Type   | Required | Description                                      |
+|-------------|--------|----------|--------------------------------------------------|
+| `type`      | string | Yes      | Must be `"message"`                              |
+| `valueType` | string | Yes      | Must be `"application/vnd.microsoft.activity.voice+json"` |
+| `value`     | object | Yes      | Contains the voice message content               |
+
+The `value` object for voice messages includes:
+
+| Property       | Type    | Required | Description                                    |
+|----------------|---------|----------|------------------------------------------------|
+| `contentType`  | string  | Yes      | MIME type of the audio, e.g., `"audio/webm"`   |
+| `contentUrl`   | string  | Yes      | Data URI or URL containing the audio data      |
+| `transcription`| string  | No       | Text transcription of the audio                |
+| `durationMs`   | integer | No       | Duration in milliseconds                       |
+| `timestamp`    | string  | No       | ISO 8601 timestamp                             |
+| `locale`       | string  | No       | Language/locale of the audio, e.g., `"en-US"`  |
+
+Example:
+```json
+{
+  "type": "message",
+  "valueType": "application/vnd.microsoft.activity.voice+json",
+  "value": {
+    "contentType": "audio/webm",
+    "contentUrl": "data:audio/webm;base64,...",
+    "transcription": "Book a flight to Paris",
+    "durationMs": 3400,
+    "timestamp": "2025-10-07T10:30:00Z",
+    "locale": "en-US"
+  }
+}
+``` + +`A5250`: Voice message activities MUST use a `type` of `"message"` and include a `valueType` of `"application/vnd.microsoft.activity.voice+json"`. + +`A5251`: The `value` object MUST include `contentType` and `contentUrl` fields. + +`A5252`: Senders SHOULD include a `transcription` field to support accessibility and text-based processing. + +#### Error Handling + +`A5260`: If a `Media.Chunk` event is received without a corresponding `Media.Start`, receivers MAY ignore it or MAY process it if the `streamId` is known from a prior session. + +`A5261`: If a stream error occurs, senders SHOULD send a `Media.End` event with `streamResult` set to `"error"` in the `streaminfo` entity. + +`A5262`: Receivers SHOULD be resilient to missing chunks and SHOULD use `streamSequence` to detect gaps. + ## Invoke activity @@ -1594,6 +1814,14 @@ The `error` field contains the reason the original [command activity](#command-a # Appendix I - Changes +# 2025-02-05 - guhiriya@microsoft.com +* Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`) +* Documented usage of existing `streaminfo` entity for media streaming (no schema changes) +* Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal interactions +* Added normative requirements A5210-A5252 for media streaming events +* Added normative requirements A9260-A9262 for media streaming in streaminfo +* Added normative requirements A9400-A9442 for session lifecycle commands + # 2025-09-30 - mattb-msft * Updated Channel Account definition to reflect current rules and usages. @@ -1764,16 +1992,20 @@ Note that on channels with a persistent chat feed, `platform` is typically usefu ### streaminfo -The `streaminfo` entity conveys metadata supporting chunked streaming of text messages, typically sent as a sequence of `typing` Activities, followed by a final `message` Activity containing the complete text. 
+The `streaminfo` entity conveys metadata supporting chunked streaming of messages. It is used for:
+- **Text streaming**: Sent as a sequence of `typing` Activities, followed by a final `message` Activity containing the complete text.
+- **Media streaming**: Used with [Media.* events](#reserved-events-for-media-streaming) (`Media.Start`, `Media.Chunk`, `Media.End`) for real-time voice/audio streaming.

 | Property         | Type    | Required | Description                                                                     |
 |------------------|---------|----------|---------------------------------------------------------------------------------|
 | `type`           | string  | Yes      | Must be `"streaminfo"`                                                          |
 | `streamId`       | string  | Yes      | Unique identifier for the streaming session                                     |
 | `streamSequence` | integer | Yes      | Incrementing sequence number for each chunk for non-final messages              |
-| `streamType`     | string  | No       | One of `"informative"`, `"streaming"`, or `"final"`. Defaults to `"streaming"`` |
+| `streamType`     | string  | No       | One of `"informative"`, `"streaming"`, or `"final"`. Defaults to `"streaming"`  |
 | `streamResult`   | string  | No       | Present only on final message; one of `"success"`, `"timeout"`, or `"error"`    |

+#### Text Streaming
+
 `A9240`: Streaming text is sent via a sequence of `typing` Activities containing `streaminfo` entities.

 `A9241`: The final message is sent as a `message` Activity with `streamType` set to `"final"`.

@@ -1790,11 +2022,24 @@ The `streaminfo` entity conveys metadata supporting chunked streaming of text me

 `A9247`: Channels that do not support streaming SHOULD buffer all chunks and deliver a single `message` when complete.

+#### Media Streaming
+
+When used with [Media.* events](#reserved-events-for-media-streaming), the `streaminfo` entity serves as the single place for stream identification and sequencing, independent of the activity type. The existing `streamType` values (`"streaming"`, `"final"`) are used to indicate stream lifecycle, while the `valueType` field on the event activity identifies the media type.
+`A9260`: For media streaming, the `streamType` field uses existing values: `"streaming"` for active chunks, `"final"` for stream end.
+
+`A9261`: The `streamId` MUST be consistent across all activities in a streaming session (`Media.Start`, `Media.Chunk`, `Media.End`).
+
+`A9262`: Receivers SHOULD use `streamSequence` to detect out-of-order or missing chunks in media streams.
+
 ---

-Example:
+#### Example: Text Streaming
+
+Text streaming uses `typing` activities for incremental chunks, followed by a final `message` activity:
+
+**Informative message** - Show processing status:
 ```json
-// Sending an informative message chunk
 {
   "type": "typing",
   "text": "Getting the answer...",
   "entities": [
     {
       "type": "streaminfo",
       "streamId": "a-00001",
       "streamSequence": 1,
       "streamType": "informative"
     }
   ]
 }
+```

-// Sending a streaming text chunk
+**Streaming text chunk** - Incremental content:
+```json
 {
   "type": "typing",
   "text": "A quick brown fox jumped over the",
   "entities": [
     {
       "type": "streaminfo",
       "streamId": "a-00001",
       "streamSequence": 2,
       "streamType": "streaming"
     }
   ]
 }
+```

-// Sending the final complete message
+**Final complete message** - Full response:
+```json
 {
   "type": "message",
   "text": "A quick brown fox jumped over the lazy dog.",
   "entities": [
     {
       "type": "streaminfo",
       "streamId": "a-00001",
       "streamSequence": 3,
       "streamType": "final",
       "streamResult": "success"
     }
   ]
 }
 ```

+#### Example: Voice/Media Streaming
+
+Voice streaming uses `event` activities with [Media.* events](#reserved-events-for-media-streaming). The `valueType` identifies the media type, while `streaminfo` handles sequencing:
+
+**Media.Start** - Initiate audio streaming session:
+```json
+{
+  "type": "event",
+  "name": "Media.Start",
+  "valueType": "application/vnd.microsoft.activity.mediastart+json",
+  "value": {
+    "mediaType": "audio",
+    "contentType": "audio/webm"
+  },
+  "entities": [
+    {
+      "type": "streaminfo",
+      "streamId": "v-00001",
+      "streamType": "streaming",
+      "streamSequence": 1
+    }
+  ]
+}
+```
+
+**Media.Chunk** - Send audio chunk with optional transcription:
+```json
+{
+  "type": "event",
+  "name": "Media.Chunk",
+  "valueType": "application/vnd.microsoft.activity.audiochunk+json",
+  "value": {
+    "contentType": "audio/webm",
+    "contentUrl": "data:audio/webm;base64,GkXfo59ChoEBQveBAU...",
+    "durationMs": 2500,
+    "timestamp": "2025-10-07T10:30:05Z",
+    "transcription": "Book a flight to"
+  },
+  "entities": [
+    {
+      "type": "streaminfo",
+      "streamId": "v-00001",
+      "streamType": "streaming",
+      "streamSequence": 2
+    }
+  ]
+}
+```
+
+**Media.Chunk** - Continue streaming (additional chunks):
+```json
+{
+  "type": "event",
+  "name": "Media.Chunk",
+  "valueType": "application/vnd.microsoft.activity.audiochunk+json",
+  "value": {
+    "contentType": "audio/webm",
+    "contentUrl": "data:audio/webm;base64,R0lGODlhAQABAIAA...",
+    "durationMs": 1800,
+    "timestamp": "2025-10-07T10:30:07Z",
+    "transcription": "Paris please"
+  },
+  "entities": [
+    {
+      "type": "streaminfo",
+      "streamId": "v-00001",
+      "streamType": "streaming",
+      "streamSequence": 3
+    }
+  ]
+}
+```
+
+**Media.End** - Signal end of audio stream:
+```json
+{
+  "type": "event",
+  "name": "Media.End",
+  "valueType": "application/vnd.microsoft.activity.mediaend+json",
+  "entities": [
+    {
+      "type": "streaminfo",
+      "streamId": "v-00001",
+      "streamType": "final",
+      "streamSequence": 4
+    }
+  ]
+}
+```
+
+**Voice Message** - Final complete voice response (Server to Client):
+```json
+{
+  "type": "message",
+  "valueType": "application/vnd.microsoft.activity.voice+json",
+  "value": {
+    "contentType": "audio/webm",
+    "contentUrl": "data:audio/webm;base64,UklGRiQAAABXQVZF...",
+    "transcription": "I found flights to Paris. The next available is tomorrow at 8:05am.",
+    "durationMs": 4200,
+    "timestamp": "2025-10-07T10:30:12Z",
+    "locale": "en-US"
+  }
+}
+```
+
 # Appendix III - Protocols using the Invoke activity

 The [invoke activity](#invoke-activity) is designed for use only within protocols supported by Activity Protocol channels (i.e., it is not a generic extensibility mechanism). This appendix contains a list of all protocols using this activity.

@@ -1923,6 +2278,207 @@ The authenticity of a call from an Agent can be established by inspecting its JS

 The Microsoft Telephony channel defines channel command activities in the namespace `channel/vnd.microsoft.telephony.`.

+## Session Lifecycle Commands
+
+Session lifecycle commands are used to manage multimodal streaming sessions, particularly for voice interactions. These commands follow request/response semantics with acknowledgments via `commandResult` activities.
+
+> **Note:** The `session.*` command names are reserved Activity Protocol commands for multimodal session management. Unlike application-defined commands (which must use the `application/*` namespace per A6301), these are protocol-level commands similar to other reserved event names.
+
+### session.init
+
+The `session.init` command initializes a new streaming session. It establishes the session context and is acknowledged with a `commandResult` containing the session state.
+
+**Request:**
+```json
+{
+  "type": "command",
+  "id": "cmd1",
+  "name": "session.init",
+  "value": {
+    "sessionId": "sess_123"
+  }
+}
+```
+
+**Response (commandResult):**
+```json
+{
+  "type": "commandResult",
+  "replyToId": "cmd1",
+  "value": {
+    "status": "success",
+    "sessionId": "sess_123",
+    "state": "listening"
+  }
+}
+```
+
+`A9400`: The `session.init` command MUST include a `sessionId` in the `value` object.
+
+`A9401`: Receivers MUST respond with a `commandResult` activity indicating success or failure.
+
+`A9402`: A successful `session.init` response MAY include an initial `state` (e.g., `"listening"`), eliminating the need for a separate `session.update`.
+
+### session.update
+
+The `session.update` command updates the state of an active session. It is used to signal state transitions during multimodal interactions.
+
+**Request:**
+```json
+{
+  "type": "command",
+  "id": "cmd2",
+  "name": "session.update",
+  "value": {
+    "state": "speaking"
+  }
+}
+```
+
+**Response (commandResult):**
+```json
+{
+  "type": "commandResult",
+  "replyToId": "cmd2",
+  "value": {
+    "status": "acknowledged"
+  }
+}
+```
+
+Defined session states:
+
+| State       | Description                                                |
+|-------------|------------------------------------------------------------|
+| `listening` | Bot is awaiting user input (input.expected)                |
+| `thinking`  | Bot is processing the input                                |
+| `speaking`  | Bot is generating or delivering output (output.generating) |
+| `idle`      | Bot is not currently in an active state                    |
+| `error`     | An error has occurred during the interaction               |
+
+`A9410`: The `session.update` command SHOULD include a `state` field in the `value` object.
+
+`A9411`: Receivers SHOULD respond with a `commandResult` activity acknowledging the state change.
+
+`A9412`: Session state updates are optional and threshold-based; clients may safely ignore them.
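A receiving client might acknowledge these commands as follows — a non-normative sketch in which the state set mirrors the table above; the helper name is illustrative:

```python
KNOWN_STATES = {"listening", "thinking", "speaking", "idle", "error"}

def handle_session_update(command, current_state):
    """Acknowledge a session.update and apply the state if recognized.

    Per A9412 state updates are advisory, so an unrecognized or absent
    state is ignored rather than rejected.
    """
    requested = command.get("value", {}).get("state")
    next_state = requested if requested in KNOWN_STATES else current_state
    ack = {
        "type": "commandResult",
        "replyToId": command.get("id"),
        "value": {"status": "acknowledged"},
    }
    return ack, next_state
```

Note that even an unrecognized state is acknowledged (per A9411) rather than producing an error; the client simply keeps its current state.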
+
+### session.update (Barge-In)
+
+The `session.update` command can also signal a barge-in event, where the user or system interrupts the current output.
+
+```json
+{
+  "type": "command",
+  "name": "session.update",
+  "value": {
+    "signal": "bargeIn",
+    "origin": "user"
+  }
+}
+```
+
+`A9420`: A barge-in signal SHOULD include `origin` indicating whether it was triggered by `"user"` or `"system"`.
+
+`A9421`: Upon receiving a barge-in, the server SHOULD return to the `"listening"` state.
+
+### session.end
+
+The `session.end` command terminates an active session.
+
+```json
+{
+  "type": "command",
+  "name": "session.end",
+  "value": {
+    "reason": "completed"
+  }
+}
+```
+
+Defined end reasons:
+
+| Reason      | Description                              |
+|-------------|------------------------------------------|
+| `completed` | Session ended normally                   |
+| `cancelled` | Session was cancelled                    |
+| `error`     | Session ended due to an error            |
+| `timeout`   | Session ended due to inactivity timeout  |
+
+`A9430`: The `session.end` command SHOULD include a `reason` field in the `value` object.
+
+`A9431`: Receivers SHOULD clean up session resources upon receiving `session.end`.
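Putting the barge-in and end rules together, a minimal, non-normative server-side state machine might look like this — the function name and the choice of `"idle"` as the post-end state are illustrative:

```python
def apply_session_command(command, state):
    """Advance a minimal server-side session state for one incoming command."""
    name = command.get("name")
    value = command.get("value", {})
    if name == "session.init":
        return "listening"  # ready for input (the state echoed in the commandResult)
    if name == "session.update" and value.get("signal") == "bargeIn":
        return "listening"  # A9421: barge-in returns the server to listening
    if name == "session.end":
        return "idle"       # session over; A9431 says to release resources here
    if name == "session.update":
        return value.get("state", state)
    return state            # unknown commands leave the state untouched
```

For example, a barge-in received while `speaking` drops the server back to `listening`, and `session.end` (whatever its `reason`) leaves it `idle`.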
+
+### Multimodal Interaction Flow
+
+The typical flow for a voice streaming session:
+
+```text
+Client → Server:
+  session.init → commandResult (listening) → Media.Start → Media.Chunk x N → Media.End → bargeIn (optional)
+
+Server → Client:
+  Optional session.update (thinking) → Optional session.update (speaking) → Final voice message
+
+Barge-In:
+  Client sends bargeIn → Server returns to listening
+```
+
+#### Round-Trip Flow Example: Client and Server Interaction
+
+The following example illustrates a complete voice streaming interaction:
+
+**Step 1: Session Handshake**
+```text
+client → command: session.init
+server → commandResult: { "status": "success", "sessionId": "SESS-123", "state": "listening" }
+```
+> Because readiness (`listening`) is embedded in the response above, a separate `session.update(state="listening")` call is NOT required.
+
+**Step 2: Readiness Signal (Optional)**
+
+This step is required only if the channel or runtime explicitly requires a readiness signal:
+```text
+server → command: session.update { "state": "listening", "sessionId": "SESS-123" }
+client → commandResult: { "status": "acknowledged" }
+```
+
+**Step 3: Stream Media (Fire-and-Forget Events)**
+```text
+client → event: Media.Start { streamId: "STR-1", contentType: "audio/webm" }
+client → event: Media.Chunk { streamId: "STR-1", seq: 1, ... }
+client → event: Media.Chunk { streamId: "STR-1", seq: 2, ... }
+  ... (more Media.Chunk events)
+client → event: Media.End { streamId: "STR-1" }
+```
+
+**Step 4: Processing State Updates (Optional)**
+
+These updates are optional and rate-limited; clients may safely ignore them. They fire only when thresholds are crossed (e.g., >200ms of "thinking"):
+```text
+server → command: session.update { "state": "thinking", "sessionId": "SESS-123" }
+client → commandResult: { "status": "acknowledged" }
+
+server → command: session.update { "state": "speaking", "sessionId": "SESS-123" }
+client → commandResult: { "status": "acknowledged" }
+```
+
+**Step 5: Final Voice Response**
+```text
+server → message: valueType: "application/vnd.microsoft.activity.voice+json"
+                  value: { "contentType": "audio/webm", "contentUrl": "...", "transcription": "..." }
+```
+
+> **Notes:**
+> - `listening` is NOT needed as a separate step if included in the `session.init` commandResult.
+> - `thinking` and `speaking` session.update messages are optional and threshold-based.
+> - Media streaming events are fire-and-forget (no acknowledgment required).
+
+`A9440`: Session lifecycle commands follow request/response semantics; receivers SHOULD send acknowledgments via `commandResult`.
+
+`A9441`: Session lifecycle commands are required only for real-time streaming modalities (voice, video).
+
+`A9442`: The `listening` state MAY be embedded in the `session.init` response, making a separate `session.update(listening)` optional.
+
 ## Patterns for rejecting commands

 ### General pattern for rejecting commands

From 15a1fb6b36b6baa209cd84a09be2a4c254dc5c75 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Mon, 2 Mar 2026 05:15:02 +0000
Subject: [PATCH 03/11] Move Voice message from Event activity to Message
 activity section

Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com>
---
 specs/activity/protocol-activity.md | 89 +++++++++++++++--------------
 1 file changed, 46 insertions(+), 43 deletions(-)

diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md
index ce6a35b7..d55728bf 100644
--- a/specs/activity/protocol-activity.md
+++ b/specs/activity/protocol-activity.md
@@ -488,6 +488,51 @@ Semantic actions are sometimes used to indicate a change in which participant co

 `A3136`: Agents MAY use semantic action and [handoff activity](#handoff-activity) internally to coordinate conversational focus between components of the Agent.

+### Voice message
+
+Voice messages carry the final audio output of a voice modality interaction. They use the standard `message` activity type with the `valueType` and `value` fields to carry encoded audio content.
+
+Voice messages are identified by a `type` value of `message` and a `valueType` of `application/vnd.microsoft.activity.voice+json`.
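As a non-normative aside, the identification rule above (together with A5250 and A5251, which this commit relocates here) lends itself to a simple validity check — the helper name and the returned message strings are illustrative:

```python
VOICE_VALUE_TYPE = "application/vnd.microsoft.activity.voice+json"

def validate_voice_message(activity):
    """Collect A5250/A5251 violations for a candidate voice message activity."""
    problems = []
    if activity.get("type") != "message":
        problems.append("A5250: type must be 'message'")
    if activity.get("valueType") != VOICE_VALUE_TYPE:
        problems.append("A5250: valueType must be " + VOICE_VALUE_TYPE)
    value = activity.get("value", {})
    for field in ("contentType", "contentUrl"):
        if field not in value:
            problems.append(f"A5251: value.{field} is required")
    return problems  # empty list means the activity is a well-formed voice message
```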
+
+| Field       | Type   | Required | Description                                      |
+|-------------|--------|----------|--------------------------------------------------|
+| `type`      | string | Yes      | Must be `"message"`                              |
+| `valueType` | string | Yes      | Must be `"application/vnd.microsoft.activity.voice+json"` |
+| `value`     | object | Yes      | Contains the voice message content               |
+
+The `value` object for voice messages includes:
+
+| Property       | Type    | Required | Description                                    |
+|----------------|---------|----------|------------------------------------------------|
+| `contentType`  | string  | Yes      | MIME type of the audio, e.g., `"audio/webm"`   |
+| `contentUrl`   | string  | Yes      | Data URI or URL containing the audio data      |
+| `transcription`| string  | No       | Text transcription of the audio                |
+| `durationMs`   | integer | No       | Duration in milliseconds                       |
+| `timestamp`    | string  | No       | ISO 8601 timestamp                             |
+| `locale`       | string  | No       | Language/locale of the audio, e.g., `"en-US"`  |
+
+Example:
+```json
+{
+  "type": "message",
+  "valueType": "application/vnd.microsoft.activity.voice+json",
+  "value": {
+    "contentType": "audio/webm",
+    "contentUrl": "data:audio/webm;base64,...",
+    "transcription": "Book a flight to Paris",
+    "durationMs": 3400,
+    "timestamp": "2025-10-07T10:30:00Z",
+    "locale": "en-US"
+  }
+}
+```
+
+`A5250`: Voice message activities MUST use a `type` of `"message"` and include a `valueType` of `"application/vnd.microsoft.activity.voice+json"`.
+
+`A5251`: The `value` object MUST include `contentType` and `contentUrl` fields.
+
+`A5252`: Senders SHOULD include a `transcription` field to support accessibility and text-based processing.
+
 ## Contact relation update activity

 Contact relation update activities signal a change in the relationship between the recipient and a user within the channel. Contact relation update activities generally do not contain user-generated content. The relationship update described by a contact relation update activity exists between the user in the `from` field (often, but not always, the user initiating the update) and the user or Agent in the `recipient` field.

@@ -807,49 +852,6 @@ The Activity Protocol does **not** standardize request or audio state machines,
 }
 ```

-#### Voice Message
-
-A voice message is sent as a `message` activity using the `valueType` and `value` fields to carry the audio content. This is the final user-visible output for a voice modality interaction.
-
-| Field       | Type   | Required | Description                                      |
-|-------------|--------|----------|--------------------------------------------------|
-| `type`      | string | Yes      | Must be `"message"`                              |
-| `valueType` | string | Yes      | Must be `"application/vnd.microsoft.activity.voice+json"` |
-| `value`     | object | Yes      | Contains the voice message content               |
-
-The `value` object for voice messages includes:
-
-| Property       | Type    | Required | Description                                    |
-|----------------|---------|----------|------------------------------------------------|
-| `contentType`  | string  | Yes      | MIME type of the audio, e.g., `"audio/webm"`   |
-| `contentUrl`   | string  | Yes      | Data URI or URL containing the audio data      |
-| `transcription`| string  | No       | Text transcription of the audio                |
-| `durationMs`   | integer | No       | Duration in milliseconds                       |
-| `timestamp`    | string  | No       | ISO 8601 timestamp                             |
-| `locale`       | string  | No       | Language/locale of the audio, e.g., `"en-US"`  |
-
-Example:
-```json
-{
-  "type": "message",
-  "valueType": "application/vnd.microsoft.activity.voice+json",
-  "value": {
-    "contentType": "audio/webm",
-    "contentUrl": "data:audio/webm;base64,...",
-    "transcription": "Book a flight to Paris",
-    "durationMs": 3400,
-    "timestamp": "2025-10-07T10:30:00Z",
-    "locale": "en-US"
-  }
-}
-```
-
-`A5250`: Voice message activities MUST use a `type` of `"message"` and include a `valueType` of `"application/vnd.microsoft.activity.voice+json"`.
-
-`A5251`: The `value` object MUST include `contentType` and `contentUrl` fields.
-
-`A5252`: Senders SHOULD include a `transcription` field to support accessibility and text-based processing.
-
 #### Error Handling

 `A5260`: If a `Media.Chunk` event is received without a corresponding `Media.Start`, receivers MAY ignore it or MAY process it if the `streamId` is known from a prior session.

@@ -1821,6 +1823,7 @@ The `error` field contains the reason the original [command activity](#command-a
 * Added normative requirements A5210-A5252 for media streaming events
 * Added normative requirements A9260-A9262 for media streaming in streaminfo
 * Added normative requirements A9400-A9442 for session lifecycle commands
+* Moved `Voice message` section from Event activity to Message activity (voice messages use `type: "message"`, not `type: "event"`)

 # 2025-09-30 - mattb-msft
 * Updated Channel Account definition to reflect current rules and usages.

From 0161f2c6483e80ffd53b475e29e65f47df381334 Mon Sep 17 00:00:00 2001
From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com>
Date: Mon, 2 Mar 2026 05:23:10 +0000
Subject: [PATCH 04/11] Cross-link multimodal sections; clarify voice message
 context and full interaction flow

Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com>
---
 specs/activity/protocol-activity.md | 14 +++++++++-----
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md
index d55728bf..fc6b75f9 100644
--- a/specs/activity/protocol-activity.md
+++ b/specs/activity/protocol-activity.md
@@ -334,7 +334,7 @@

 ## Message activity

-Message activities represent content intended to be shown within a conversational interface. Message activities may contain text, speech, interactive cards, and binary or unknown attachments; typically channels require at most one of these for the message activity to be well-formed.
+Message activities represent content intended to be shown within a conversational interface. Message activities may contain text, speech, interactive cards, and binary or unknown attachments; typically channels require at most one of these for the message activity to be well-formed. For modalities such as voice and audio, the `valueType` and `value` fields carry modality-specific content (see [Voice message](#voice-message)).

 Message activities are identified by a `type` value of `message`.

@@ -490,7 +490,9 @@ Semantic actions are sometimes used to indicate a change in which participant co

 ### Voice message

-Voice messages carry the final audio output of a voice modality interaction. They use the standard `message` activity type with the `valueType` and `value` fields to carry encoded audio content.
+Voice messages carry the final audio output of a voice modality interaction. They are sent by the Agent (server) to the client as the final response after processing a voice input stream. Voice input is delivered to the Agent as a sequence of [Media streaming events](#reserved-events-for-media-streaming) (`Media.Start`, `Media.Chunk`, `Media.End`). The session is managed via [session lifecycle commands](#session-lifecycle-commands). For the complete end-to-end interaction flow, see [Multimodal Interaction Flow](#multimodal-interaction-flow).
+
+Voice messages use the standard `message` activity type with the `valueType` and `value` fields to carry encoded audio content.

 Voice messages are identified by a `type` value of `message` and a `valueType` of `application/vnd.microsoft.activity.voice+json`.
@@ -610,7 +612,7 @@ The `value` field contains parameters specific to this event, as defined by the ## Event activity -Event activities communicate programmatic information from a client or channel to an Agent. The meaning of an event activity is defined by the `name` field, which is meaningful within the scope of a channel. Event activities are designed to carry both interactive information (such as button clicks) and non-interactive information (such as a notification of a client automatically updating an embedded speech model). +Event activities communicate programmatic information from a client or channel to an Agent. The meaning of an event activity is defined by the `name` field, which is meaningful within the scope of a channel. Event activities are designed to carry both interactive information (such as button clicks) and non-interactive information (such as a notification of a client automatically updating an embedded speech model). For real-time multimodal streaming (voice/audio), event activities with names `Media.Start`, `Media.Chunk`, and `Media.End` are used to stream audio from the client to the Agent (see [Reserved Events for Media Streaming](#reserved-events-for-media-streaming)). Event activities are the asynchronous counterpart to [invoke activities](#invoke-activity). Unlike invoke, event is designed to be extended by client application extensions. @@ -685,7 +687,9 @@ Possible values for `contentType` are audio, video, text, screen, all or any oth ### Reserved Events for Media Streaming -Media streaming events are used to facilitate real-time multimodal interactions, particularly for voice and audio streaming. These events use the `Media.*` prefix and work in conjunction with the [`streamInfo`](#streaminfo) entity for stream metadata and sequencing. +Media streaming events are used to facilitate real-time multimodal interactions, particularly for voice and audio streaming. 
The client streams audio input to the Agent using a sequence of `Media.Start`, `Media.Chunk`, and `Media.End` events. After processing, the Agent sends the final voice response as a [Voice message](#voice-message) (a `message` activity). For session lifecycle management and the complete end-to-end flow, see [Session Lifecycle Commands](#session-lifecycle-commands) and [Multimodal Interaction Flow](#multimodal-interaction-flow). + +These events use the `Media.*` prefix and work in conjunction with the [`streamInfo`](#streaminfo) entity for stream metadata and sequencing. `A5210`: Media streaming events MUST use the `Media.*` prefix for their `name` field. @@ -2283,7 +2287,7 @@ The Microsoft Telephony channel defines channel command activities in the namesp ## Session Lifecycle Commands -Session lifecycle commands are used to manage multimodal streaming sessions, particularly for voice interactions. These commands follow request/response semantics with acknowledgments via `commandResult` activities. +Session lifecycle commands are used to manage multimodal streaming sessions, particularly for voice interactions. These commands follow request/response semantics with acknowledgments via `commandResult` activities. They work together with [Media streaming events](#reserved-events-for-media-streaming) (audio input) and [Voice messages](#voice-message) (audio output) to enable a complete multimodal interaction. > **Note:** The `session.*` command names are reserved Activity Protocol commands for multimodal session management. Unlike application-defined commands (which must use the `application/*` namespace per A6301), these are protocol-level commands similar to other reserved event names. 
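Taken together, the `Media.Start` / `Media.Chunk` / `Media.End` framing and the `streamInfo` sequencing rules added by this patch can be sketched as a small builder. This is an illustrative sketch, not part of the spec: the chunk payload field name `data` is a hypothetical placeholder, since the normative chunk `value` shape is defined elsewhere in the spec.

```python
# Illustrative sketch of the Media.* event framing (A5210-A5212, A5220-A5221).
# The streamInfo entity fields follow the spec text; the chunk payload field
# name "data" is a hypothetical placeholder, not a normative field.

def media_event(name, stream_id, sequence, stream_type="streaming",
                value=None, value_type=None):
    """Build an event activity carrying a streamInfo entity."""
    activity = {
        "type": "event",          # Media.* events are always event activities
        "name": name,             # must use the Media.* prefix (A5210)
        "entities": [{
            "type": "streamInfo",
            "streamId": stream_id,
            "streamType": stream_type,
            "streamSequence": sequence,
        }],
    }
    if value is not None:
        activity["value"] = value
    if value_type is not None:
        activity["valueType"] = value_type
    return activity

# A minimal three-event stream: Start (sequence 1, per A5221), one chunk, End.
start = media_event("Media.Start", "abc123", 1,
                    value={"mediaType": "audio", "contentType": "audio/webm"},
                    value_type="application/vnd.microsoft.activity.mediastart+json")
chunk = media_event("Media.Chunk", "abc123", 2, value={"data": "<base64 audio>"})
end = media_event("Media.End", "abc123", 3, stream_type="final")
```

The builder only fixes the envelope; senders remain responsible for incrementing `streamSequence` and closing the stream with `streamType` of `"final"`.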
From 7b14dc13d62af8b07e35c23d206b652195db3470 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 2 Mar 2026 05:33:55 +0000 Subject: [PATCH 05/11] Fix bidirectionality: Voice message and Media streaming events work in both directions Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com> --- specs/activity/protocol-activity.md | 8 +++++--- 1 file changed, 5 insertions(+), 3 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index fc6b75f9..b9eda499 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -490,7 +490,7 @@ Semantic actions are sometimes used to indicate a change in which participant co ### Voice message -Voice messages carry the final audio output of a voice modality interaction. They are sent by the Agent (server) to the client as the final response after processing a voice input stream. Voice input is delivered to the Agent as a sequence of [Media streaming events](#reserved-events-for-media-streaming) (`Media.Start`, `Media.Chunk`, `Media.End`). The session is managed via [session lifecycle commands](#session-lifecycle-commands). For the complete end-to-end interaction flow, see [Multimodal Interaction Flow](#multimodal-interaction-flow). +Voice messages carry a complete voice payload within a single activity. They are **bidirectional**: the client can send a voice message to the Agent (for example, a user sending a short voice query that fits in one message), and the Agent can send a voice message to the client (for example, a complete spoken response). Use a voice message whenever the entire audio content fits in a single activity. When audio must be sent in real time or in multiple pieces, use [Media streaming events](#reserved-events-for-media-streaming) (`Media.Start`, `Media.Chunk`, `Media.End`) instead. 
The session is managed via [session lifecycle commands](#session-lifecycle-commands). For a complete end-to-end interaction flow, see [Multimodal Interaction Flow](#multimodal-interaction-flow). Voice messages use the standard `message` activity type with the `valueType` and `value` fields to carry encoded audio content. @@ -687,7 +687,7 @@ Possible values for `contentType` are audio, video, text, screen, all or any oth ### Reserved Events for Media Streaming -Media streaming events are used to facilitate real-time multimodal interactions, particularly for voice and audio streaming. The client streams audio input to the Agent using a sequence of `Media.Start`, `Media.Chunk`, and `Media.End` events. After processing, the Agent sends the final voice response as a [Voice message](#voice-message) (a `message` activity). For session lifecycle management and the complete end-to-end flow, see [Session Lifecycle Commands](#session-lifecycle-commands) and [Multimodal Interaction Flow](#multimodal-interaction-flow). +Media streaming events are used to facilitate real-time multimodal interactions, particularly for voice and audio streaming. These events are **bidirectional**: the client can stream audio input to the Agent (for example, a user speaking), and the Agent can stream audio output back to the client (for example, a spoken response delivered in chunks). When the entire audio fits in a single activity, use a [Voice message](#voice-message) instead. For session lifecycle management and the complete end-to-end flow, see [Session Lifecycle Commands](#session-lifecycle-commands) and [Multimodal Interaction Flow](#multimodal-interaction-flow). These events use the `Media.*` prefix and work in conjunction with the [`streamInfo`](#streaminfo) entity for stream metadata and sequencing. 
@@ -2184,7 +2184,7 @@ Voice streaming uses `event` activities with [Media.* events](#reserved-events-f } ``` -**Voice Message** - Final complete voice response (Server to Client): +**Voice Message** - Complete voice response (bidirectional; Agent → Client shown here): ```json { "type": "message", @@ -2430,6 +2430,8 @@ Barge-In: Client sends bargeIn → Server returns to listening ``` +> **Note on direction:** Both `Media.*` streaming events and `Voice message` activities are bidirectional. The flow above shows a typical pattern (client streams audio input; Agent responds with a voice message), but either party may stream media events or send a voice message in a single activity in either direction. + #### Round-Trip Flow Example: Client and Server Interaction The following example illustrates a complete voice streaming interaction: From fdb46338b577740049cc1e32be01b57f53a39cf9 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 2 Mar 2026 05:50:13 +0000 Subject: [PATCH 06/11] Fix Event activity intro: Media streaming events are bidirectional, not client-to-Agent only Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com> --- specs/activity/protocol-activity.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index b9eda499..9da407d3 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -612,7 +612,7 @@ The `value` field contains parameters specific to this event, as defined by the ## Event activity -Event activities communicate programmatic information from a client or channel to an Agent. The meaning of an event activity is defined by the `name` field, which is meaningful within the scope of a channel. 
Event activities are designed to carry both interactive information (such as button clicks) and non-interactive information (such as a notification of a client automatically updating an embedded speech model). For real-time multimodal streaming (voice/audio), event activities with names `Media.Start`, `Media.Chunk`, and `Media.End` are used to stream audio from the client to the Agent (see [Reserved Events for Media Streaming](#reserved-events-for-media-streaming)). +Event activities communicate programmatic information from a client or channel to an Agent. The meaning of an event activity is defined by the `name` field, which is meaningful within the scope of a channel. Event activities are designed to carry both interactive information (such as button clicks) and non-interactive information (such as a notification of a client automatically updating an embedded speech model). For real-time multimodal streaming (voice/audio), event activities with names `Media.Start`, `Media.Chunk`, and `Media.End` are used to stream audio in either direction — from the client to the Agent or from the Agent to the client (see [Reserved Events for Media Streaming](#reserved-events-for-media-streaming)). Event activities are the asynchronous counterpart to [invoke activities](#invoke-activity). Unlike invoke, event is designed to be extended by client application extensions. 
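The rule clarified by these patches — a single voice message when the audio fits in one activity, `Media.*` streaming otherwise, in either direction — can be illustrated with a sender-side sketch. The size threshold, chunk size, and the base64 `data` field are hypothetical assumptions for illustration; only the voice-message `value` shape (`contentType`/`contentUrl`, per A5251) and the event names come from the spec.

```python
import base64

MAX_INLINE_BYTES = 64 * 1024   # hypothetical channel limit, not normative
CHUNK_BYTES = 16 * 1024        # hypothetical chunk size, not normative

def voice_activities(audio: bytes, content_type: str, stream_id: str):
    """Yield one voice message when the audio fits in a single activity,
    otherwise a Media.Start / Media.Chunk... / Media.End sequence."""
    if len(audio) <= MAX_INLINE_BYTES:
        # Whole payload fits: send a single voice message (type "message").
        yield {
            "type": "message",
            "valueType": "application/vnd.microsoft.activity.voice+json",
            "value": {
                "contentType": content_type,   # required per A5251
                "contentUrl": "data:%s;base64,%s" % (
                    content_type, base64.b64encode(audio).decode("ascii")),
            },
        }
        return

    def stream_info(seq, stream_type="streaming"):
        return [{"type": "streamInfo", "streamId": stream_id,
                 "streamType": stream_type, "streamSequence": seq}]

    # Payload is too large for one activity: stream it as Media.* events.
    yield {"type": "event", "name": "Media.Start",
           "value": {"mediaType": "audio", "contentType": content_type},
           "entities": stream_info(1)}
    seq = 2
    for off in range(0, len(audio), CHUNK_BYTES):
        yield {"type": "event", "name": "Media.Chunk",
               "value": {"data": base64.b64encode(
                   audio[off:off + CHUNK_BYTES]).decode("ascii")},
               "entities": stream_info(seq)}
        seq += 1
    yield {"type": "event", "name": "Media.End",
           "entities": stream_info(seq, "final")}
```

Because both branches apply in either direction, the same helper could serve a client sending a voice query or an Agent returning a spoken response.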
From 5ce64f3fb4732d9b7442aa1bc6911b869413f527 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 2 Mar 2026 05:56:52 +0000 Subject: [PATCH 07/11] spec: enrich Appendix I with PR-#423-style summaries for multimodal changes Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com> --- specs/activity/protocol-activity.md | 43 +++++++++++++++++++++++++---- 1 file changed, 38 insertions(+), 5 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index 9da407d3..d16bcc1f 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -1820,14 +1820,47 @@ The `error` field contains the reason the original [command activity](#command-a # Appendix I - Changes +# 2026-03-02 - guhiriya@microsoft.com + +Clarified bidirectionality and structure of multimodal voice/audio support, and improved cross-linking across all multimodal sections. Implements the review feedback from [issue #416](https://github.com/microsoft/Agents/issues/416). 
+ +**Changes:** +* Corrected `## Event activity` intro — `Media.*` streaming events are bidirectional (not client→Agent only) +* Corrected `### Voice message` intro — voice messages are bidirectional (client→Agent short query **or** Agent→client spoken response) +* Corrected `### Reserved Events for Media Streaming` intro — streaming events flow in either direction +* Added cross-references so readers navigating any one multimodal section can find the complete picture: Message activity ↔ Voice message ↔ Media streaming events ↔ Session Lifecycle Commands ↔ Multimodal Interaction Flow +* Added `Note on direction` callout to the Multimodal Interaction Flow diagram + +**Key clarification:** +* Use **Voice message** (`type: "message"`) when the entire audio fits in a single activity (either direction) +* Use **Media streaming events** (`Media.Start`, `Media.Chunk`, `Media.End`) when audio is sent in real time or in multiple pieces (either direction) + +Related: [#416](https://github.com/microsoft/Agents/issues/416) + +--- + # 2025-02-05 - guhiriya@microsoft.com + +This entry implements the approved proposal from [issue #416](https://github.com/microsoft/Agents/issues/416) to extend the Activity Protocol schema for multimodal interactions with streaming support for voice/audio. 
+ +**Changes:** * Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`) -* Documented usage of existing `streaminfo` entity for media streaming (no schema changes) +* Added `Voice message` — a `message` activity carrying a complete voice payload via `valueType`/`value` +* Extended `streaminfo` entity to support media streaming (stream identification and sequencing for `Media.*` events) * Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal interactions -* Added normative requirements A5210-A5252 for media streaming events -* Added normative requirements A9260-A9262 for media streaming in streaminfo -* Added normative requirements A9400-A9442 for session lifecycle commands -* Moved `Voice message` section from Event activity to Message activity (voice messages use `type: "message"`, not `type: "event"`) +* Added normative requirements A5210–A5252 for media streaming events and voice messages +* Added normative requirements A9260–A9262 for media streaming in `streaminfo` +* Added normative requirements A9400–A9442 for session lifecycle commands +* Bumped version to Provisional 3.4 + +**Key design decisions (per AP Core Committee):** +* No new activity types — uses existing `event`, `command`, `commandResult`, `message` +* No new schema fields — uses existing `value`, `valueType`, `entities` +* 100% backward compatible +* Uses `streamInfo` entity for stream metadata and sequencing +* Uses `Media.*` prefix for media streaming events + +Related: [#416](https://github.com/microsoft/Agents/issues/416) # 2025-09-30 - mattb-msft * Updated Channel Account definition to reflect current rules and usages. 
From aef7d2f55f88a6e47ff3049a21e8cf7835e0d511 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 2 Mar 2026 06:00:16 +0000 Subject: [PATCH 08/11] revert: restore Appendix I to original content (no spec file changes) Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com> --- specs/activity/protocol-activity.md | 43 ++++------------------------- 1 file changed, 5 insertions(+), 38 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index d16bcc1f..9da407d3 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -1820,47 +1820,14 @@ The `error` field contains the reason the original [command activity](#command-a # Appendix I - Changes -# 2026-03-02 - guhiriya@microsoft.com - -Clarified bidirectionality and structure of multimodal voice/audio support, and improved cross-linking across all multimodal sections. Implements the review feedback from [issue #416](https://github.com/microsoft/Agents/issues/416). 
- -**Changes:** -* Corrected `## Event activity` intro — `Media.*` streaming events are bidirectional (not client→Agent only) -* Corrected `### Voice message` intro — voice messages are bidirectional (client→Agent short query **or** Agent→client spoken response) -* Corrected `### Reserved Events for Media Streaming` intro — streaming events flow in either direction -* Added cross-references so readers navigating any one multimodal section can find the complete picture: Message activity ↔ Voice message ↔ Media streaming events ↔ Session Lifecycle Commands ↔ Multimodal Interaction Flow -* Added `Note on direction` callout to the Multimodal Interaction Flow diagram - -**Key clarification:** -* Use **Voice message** (`type: "message"`) when the entire audio fits in a single activity (either direction) -* Use **Media streaming events** (`Media.Start`, `Media.Chunk`, `Media.End`) when audio is sent in real time or in multiple pieces (either direction) - -Related: [#416](https://github.com/microsoft/Agents/issues/416) - ---- - # 2025-02-05 - guhiriya@microsoft.com - -This entry implements the approved proposal from [issue #416](https://github.com/microsoft/Agents/issues/416) to extend the Activity Protocol schema for multimodal interactions with streaming support for voice/audio. 
- -**Changes:** * Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`) -* Added `Voice message` — a `message` activity carrying a complete voice payload via `valueType`/`value` -* Extended `streaminfo` entity to support media streaming (stream identification and sequencing for `Media.*` events) +* Documented usage of existing `streaminfo` entity for media streaming (no schema changes) * Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal interactions -* Added normative requirements A5210–A5252 for media streaming events and voice messages -* Added normative requirements A9260–A9262 for media streaming in `streaminfo` -* Added normative requirements A9400–A9442 for session lifecycle commands -* Bumped version to Provisional 3.4 - -**Key design decisions (per AP Core Committee):** -* No new activity types — uses existing `event`, `command`, `commandResult`, `message` -* No new schema fields — uses existing `value`, `valueType`, `entities` -* 100% backward compatible -* Uses `streamInfo` entity for stream metadata and sequencing -* Uses `Media.*` prefix for media streaming events - -Related: [#416](https://github.com/microsoft/Agents/issues/416) +* Added normative requirements A5210-A5252 for media streaming events +* Added normative requirements A9260-A9262 for media streaming in streaminfo +* Added normative requirements A9400-A9442 for session lifecycle commands +* Moved `Voice message` section from Event activity to Message activity (voice messages use `type: "message"`, not `type: "event"`) # 2025-09-30 - mattb-msft * Updated Channel Account definition to reflect current rules and usages. 
From bd75bacf679916cb9fcf8e230c6b6c12581e4347 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 2 Mar 2026 06:10:16 +0000 Subject: [PATCH 09/11] spec: rewrite Appendix I entry for PR #468 in structured format (intro + Changes + Key design decisions + Related) Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com> --- specs/activity/protocol-activity.md | 35 ++++++++++++++++++++++------- 1 file changed, 27 insertions(+), 8 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index 9da407d3..80af9445 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -1820,14 +1820,33 @@ The `error` field contains the reason the original [command activity](#command-a # Appendix I - Changes -# 2025-02-05 - guhiriya@microsoft.com -* Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`) -* Documented usage of existing `streaminfo` entity for media streaming (no schema changes) -* Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal interactions -* Added normative requirements A5210-A5252 for media streaming events -* Added normative requirements A9260-A9262 for media streaming in streaminfo -* Added normative requirements A9400-A9442 for session lifecycle commands -* Moved `Voice message` section from Event activity to Message activity (voice messages use `type: "message"`, not `type: "event"`) +# 2026-03-02 - guhiriya@microsoft.com + +This PR extends and finalizes the Activity Protocol schema for multimodal voice/audio interactions with streaming support, based on the approved proposal from [issue #416](https://github.com/microsoft/Agents/issues/416). 
+ +**Changes:** +* Moved `Voice message` section from Event activity to Message activity — voice messages use `type: "message"`, not `type: "event"` +* Added `Voice message` — a `message` activity carrying a complete voice payload via `valueType: "application/vnd.microsoft.activity.voice+json"` and `value` +* Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`) for real-time, multi-piece audio streams in either direction +* Documented usage of existing `streamInfo` entity for media streaming stream identification and sequencing (no schema changes) +* Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal session management +* Added Multimodal Interaction Flow section with annotated round-trip example +* Added cross-references between all multimodal sections (Message activity ↔ Voice message ↔ Media streaming events ↔ Session Lifecycle Commands ↔ Multimodal Interaction Flow) +* Added normative requirements A5210–A5252 for media streaming events and voice messages +* Added normative requirements A9260–A9262 for media streaming in `streamInfo` +* Added normative requirements A9400–A9442 for session lifecycle commands +* Bumped version to Provisional 3.4 + +**Key design decisions (per AP Core Committee):** +* No new activity types — uses existing `event`, `command`, `commandResult`, `message` +* No new schema fields — uses existing `value`, `valueType`, `entities` +* 100% backward compatible +* Uses `streamInfo` entity for stream metadata and sequencing +* Uses `Media.*` prefix for media streaming events +* `Media.*` streaming events and `Voice message` are bidirectional (client→Agent or Agent→client) +* `session.update(listening)` is optional when `listening` state is embedded in the `session.init` response + +Related: [#416](https://github.com/microsoft/Agents/issues/416) # 2025-09-30 - mattb-msft * Updated Channel Account definition to reflect current rules and usages. 
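The request/response shape of the session lifecycle commands summarized in this entry can be sketched as plain activities. The `value` payloads shown (`modalities`, `state`) are illustrative assumptions; the spec reserves the `session.*` command names and the `commandResult` acknowledgment, not these exact fields.

```python
# Illustrative sketch of the session lifecycle handshake: session.* commands
# are acknowledged via commandResult activities. The value payloads shown
# ("modalities", "state") are hypothetical, not normative fields.

def command(name, value=None):
    return {"type": "command", "name": name,
            **({"value": value} if value is not None else {})}

def command_result(name, value=None):
    return {"type": "commandResult", "name": name,
            **({"value": value} if value is not None else {})}

# Client opens the multimodal session...
init = command("session.init", {"modalities": ["audio"]})
# ...and the Agent acknowledges. When the acknowledgment already carries the
# "listening" state, a separate session.update(listening) round-trip can be
# skipped, per the design decision above.
ack = command_result("session.init", {"state": "listening"})
# Either party tears the session down when the interaction is complete.
teardown = command("session.end")
```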
From 8e84ddee6cf7156bf52070b9ca6351d311ab58e7 Mon Sep 17 00:00:00 2001 From: "copilot-swe-agent[bot]" <198982749+Copilot@users.noreply.github.com> Date: Mon, 2 Mar 2026 06:14:58 +0000 Subject: [PATCH 10/11] spec: restore Appendix I to original content and update PR description to match PR #423 format for PR #468 changes Co-authored-by: gurubhg <89311725+gurubhg@users.noreply.github.com> --- specs/activity/protocol-activity.md | 35 +++++++---------------------- 1 file changed, 8 insertions(+), 27 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index 80af9445..9da407d3 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -1820,33 +1820,14 @@ The `error` field contains the reason the original [command activity](#command-a # Appendix I - Changes -# 2026-03-02 - guhiriya@microsoft.com - -This PR extends and finalizes the Activity Protocol schema for multimodal voice/audio interactions with streaming support, based on the approved proposal from [issue #416](https://github.com/microsoft/Agents/issues/416). 
- -**Changes:** -* Moved `Voice message` section from Event activity to Message activity — voice messages use `type: "message"`, not `type: "event"` -* Added `Voice message` — a `message` activity carrying a complete voice payload via `valueType: "application/vnd.microsoft.activity.voice+json"` and `value` -* Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, `Media.End`) for real-time, multi-piece audio streams in either direction -* Documented usage of existing `streamInfo` entity for media streaming stream identification and sequencing (no schema changes) -* Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal session management -* Added Multimodal Interaction Flow section with annotated round-trip example -* Added cross-references between all multimodal sections (Message activity ↔ Voice message ↔ Media streaming events ↔ Session Lifecycle Commands ↔ Multimodal Interaction Flow) -* Added normative requirements A5210–A5252 for media streaming events and voice messages -* Added normative requirements A9260–A9262 for media streaming in `streamInfo` -* Added normative requirements A9400–A9442 for session lifecycle commands -* Bumped version to Provisional 3.4 - -**Key design decisions (per AP Core Committee):** -* No new activity types — uses existing `event`, `command`, `commandResult`, `message` -* No new schema fields — uses existing `value`, `valueType`, `entities` -* 100% backward compatible -* Uses `streamInfo` entity for stream metadata and sequencing -* Uses `Media.*` prefix for media streaming events -* `Media.*` streaming events and `Voice message` are bidirectional (client→Agent or Agent→client) -* `session.update(listening)` is optional when `listening` state is embedded in the `session.init` response - -Related: [#416](https://github.com/microsoft/Agents/issues/416) +# 2025-02-05 - guhiriya@microsoft.com +* Added Reserved Events for Media Streaming (`Media.Start`, `Media.Chunk`, 
`Media.End`) +* Documented usage of existing `streaminfo` entity for media streaming (no schema changes) +* Added Session Lifecycle Commands (`session.init`, `session.update`, `session.end`) for multimodal interactions +* Added normative requirements A5210-A5252 for media streaming events +* Added normative requirements A9260-A9262 for media streaming in streaminfo +* Added normative requirements A9400-A9442 for session lifecycle commands +* Moved `Voice message` section from Event activity to Message activity (voice messages use `type: "message"`, not `type: "event"`) # 2025-09-30 - mattb-msft * Updated Channel Account definition to reflect current rules and usages. From 8d2712e5a74867bc4b6fcd6b1cf406050a5f6215 Mon Sep 17 00:00:00 2001 From: Guruprasad B H <89311725+gurubhg@users.noreply.github.com> Date: Tue, 3 Mar 2026 08:41:31 +0530 Subject: [PATCH 11/11] Fix casing of 'streamInfo' to 'streaminfo' --- specs/activity/protocol-activity.md | 24 ++++++++++++------------ 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/specs/activity/protocol-activity.md b/specs/activity/protocol-activity.md index 9da407d3..80b6dcee 100644 --- a/specs/activity/protocol-activity.md +++ b/specs/activity/protocol-activity.md @@ -689,11 +689,11 @@ Possible values for `contentType` are audio, video, text, screen, all or any oth Media streaming events are used to facilitate real-time multimodal interactions, particularly for voice and audio streaming. These events are **bidirectional**: the client can stream audio input to the Agent (for example, a user speaking), and the Agent can stream audio output back to the client (for example, a spoken response delivered in chunks). When the entire audio fits in a single activity, use a [Voice message](#voice-message) instead. For session lifecycle management and the complete end-to-end flow, see [Session Lifecycle Commands](#session-lifecycle-commands) and [Multimodal Interaction Flow](#multimodal-interaction-flow). 
-These events use the `Media.*` prefix and work in conjunction with the [`streamInfo`](#streaminfo) entity for stream metadata and sequencing. +These events use the `Media.*` prefix and work in conjunction with the [`streaminfo`](#streaminfo) entity for stream metadata and sequencing. `A5210`: Media streaming events MUST use the `Media.*` prefix for their `name` field. -`A5211`: Media streaming events SHOULD include a [`streamInfo`](#streaminfo) entity to convey stream metadata. +`A5211`: Media streaming events SHOULD include a [`streaminfo`](#streaminfo) entity to convey stream metadata. `A5212`: Media streaming events MAY use the `value` and `valueType` fields to carry modality-specific content. @@ -707,7 +707,7 @@ The `Media.Start` event initiates a media streaming session. It establishes the | `name` | string | Yes | Must be `"Media.Start"` | | `valueType` | string | No | Identifies the schema of the `value` object, e.g., `"application/vnd.microsoft.activity.mediastart+json"` | | `value` | object | No | Contains media type and content type information | -| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamType` of `"streaming"` | +| `entities` | array | Yes | Must include a [`streaminfo`](#streaminfo) entity with `streamType` of `"streaming"` | Example: ```json @@ -721,7 +721,7 @@ Example: }, "entities": [ { - "type": "streamInfo", + "type": "streaminfo", "streamId": "abc123", "streamType": "streaming", "streamSequence": 1 @@ -730,13 +730,13 @@ Example: } ``` -`A5220`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.Start` events with a valid `streamId`. +`A5220`: Senders MUST include a [`streaminfo`](#streaminfo) entity in `Media.Start` events with a valid `streamId`. `A5221`: The `streamSequence` in `Media.Start` SHOULD be `1` as it initiates the stream. #### Media.Chunk -The `Media.Chunk` event sends a chunk of media data during an active streaming session. 
Chunks are sequenced using the `streamSequence` field in the [`streamInfo`](#streaminfo) entity. +The `Media.Chunk` event sends a chunk of media data during an active streaming session. Chunks are sequenced using the `streamSequence` field in the [`streaminfo`](#streaminfo) entity. | Field | Type | Required | Description | |-------------|--------|----------|--------------------------------------------------| @@ -744,7 +744,7 @@ The `Media.Chunk` event sends a chunk of media data during an active streaming s | `name` | string | Yes | Must be `"Media.Chunk"` | | `valueType` | string | No | Identifies the schema of the `value` object, e.g., `"application/vnd.microsoft.activity.audiochunk+json"` | | `value` | object | Yes | Contains the media chunk data | -| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity | +| `entities` | array | Yes | Must include a [`streaminfo`](#streaminfo) entity | The `value` object for audio chunks typically includes: @@ -771,7 +771,7 @@ Example: }, "entities": [ { - "type": "streamInfo", + "type": "streaminfo", "streamId": "abc123", "streamType": "streaming", "streamSequence": 2 @@ -780,7 +780,7 @@ Example: } ``` -`A5230`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.Chunk` events with the same `streamId` as the corresponding `Media.Start`. +`A5230`: Senders MUST include a [`streaminfo`](#streaminfo) entity in `Media.Chunk` events with the same `streamId` as the corresponding `Media.Start`. `A5231`: The `streamSequence` MUST be incrementing for each chunk within the same stream. @@ -795,7 +795,7 @@ The `Media.End` event signals the end of a media streaming session. 
| `type` | string | Yes | Must be `"event"` | | `name` | string | Yes | Must be `"Media.End"` | | `valueType` | string | No | Identifies the schema, e.g., `"application/vnd.microsoft.activity.mediaend+json"` | -| `entities` | array | Yes | Must include a [`streamInfo`](#streaminfo) entity with `streamType` of `"final"` | +| `entities` | array | Yes | Must include a [`streaminfo`](#streaminfo) entity with `streamType` of `"final"` | Example: ```json @@ -805,7 +805,7 @@ Example: "valueType": "application/vnd.microsoft.activity.mediaend+json", "entities": [ { - "type": "streamInfo", + "type": "streaminfo", "streamId": "abc123", "streamType": "final", "streamSequence": 3 @@ -814,7 +814,7 @@ Example: } ``` -`A5240`: Senders MUST include a [`streamInfo`](#streaminfo) entity in `Media.End` events with `streamType` set to `"final"`. +`A5240`: Senders MUST include a [`streaminfo`](#streaminfo) entity in `Media.End` events with `streamType` set to `"final"`. `A5241`: Receivers SHOULD clean up stream resources upon receiving `Media.End`.
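The sequencing and teardown rules above (A5230, A5231, A5240, A5241, plus the A5260 unknown-stream policy) can be sketched from the receiver's side. This is an illustrative sketch under one allowed policy choice — ignoring events for unknown streams — and the entity lookup tolerates both the `streamInfo` and `streaminfo` spellings touched by this patch.

```python
# Receiver-side sketch of the streaming rules: chunks must carry the streamId
# from Media.Start (A5230), streamSequence must increase within a stream
# (A5231), Media.End must carry streamType "final" (A5240), and stream state
# is cleaned up on Media.End (A5241). Per A5260, an event for an unknown
# stream MAY be ignored; that is the policy chosen here.

class StreamReceiver:
    def __init__(self):
        self.last_seq = {}  # streamId -> last accepted streamSequence

    @staticmethod
    def _stream_info(activity):
        # Tolerate both "streamInfo" and "streaminfo" spellings of the type.
        return next(e for e in activity.get("entities", [])
                    if e.get("type", "").lower() == "streaminfo")

    def accept(self, activity):
        info = self._stream_info(activity)
        sid, seq = info["streamId"], info["streamSequence"]
        name = activity["name"]
        if name == "Media.Start":
            self.last_seq[sid] = seq
            return True
        if sid not in self.last_seq:        # unknown stream: ignore (A5260)
            return False
        if seq <= self.last_seq[sid]:       # out-of-order or replayed (A5231)
            return False
        if name == "Media.End":
            if info.get("streamType") != "final":  # A5240
                return False
            del self.last_seq[sid]          # A5241: clean up stream state
            return True
        self.last_seq[sid] = seq
        return True
```

A chunk replayed with an old `streamSequence` is rejected, and once `Media.End` is accepted the `streamId` becomes unknown again, so any late chunks fall under the A5260 policy.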