Merged
16 changes: 10 additions & 6 deletions guides/fundamentals/metrics.mdx
@@ -26,14 +26,15 @@ task = PipelineTask(

Once enabled, Pipecat logs the following metrics:

- | Metric           | Description                                                     |
- | ---------------- | --------------------------------------------------------------- |
- | TTFB             | Time To First Byte in seconds                                   |
- | Processing Time  | Time taken by the service to respond in seconds                 |
- | Text Aggregation | Time from first LLM token to first complete sentence in seconds |
+ | Metric           | Description                                                                      |
+ | ---------------- | -------------------------------------------------------------------------------- |
+ | TTFB             | Time To First Byte in seconds                                                    |
+ | Processing Time  | Time taken by the service to respond in seconds                                  |
+ | Text Aggregation | Time from the first LLM token to the first complete sentence (TTS services only) |
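As a rough mental model of how TTFB relates to processing time (a plain-Python sketch, not Pipecat's actual instrumentation):

```python
import time

def timed_stream(stream):
    """Toy illustration of the TTFB vs. processing-time metrics:
    TTFB is the delay until the first chunk arrives; processing time
    covers the whole response."""
    start = time.monotonic()
    first_chunk_at = None
    chunks = []
    for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = time.monotonic()  # first byte observed
        chunks.append(chunk)
    return {
        "ttfb": first_chunk_at - start,
        "processing_time": time.monotonic() - start,
        "chunks": chunks,
    }

def fake_service(text):
    """Stand-in for a streaming service response."""
    for word in text.split():
        time.sleep(0.01)  # simulated per-chunk latency
        yield word

metrics = timed_stream(fake_service("hello world"))
assert metrics["ttfb"] <= metrics["processing_time"]
```

TTFB can never exceed processing time, which is why the two are reported together.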

```console Sample output
AnthropicLLMService#0 TTFB: 0.8378312587738037
CartesiaTTSService#0 text aggregation time: 0.2134
CartesiaTTSService#0 processing time: 0.0005071163177490234
CartesiaTTSService#0 TTFB: 0.17177796363830566
AnthropicLLMService#0 processing time: 2.4927797317504883
```

@@ -114,7 +115,7 @@ When metrics are enabled, Pipecat emits a `MetricsFrame` for each interaction.
- `ProcessingMetricsData` — Processing time
- `LLMUsageMetricsData` — LLM token usage
- `TTSUsageMetricsData` — TTS character usage
- - `TextAggregationMetricsData` — Sentence aggregation latency
+ - `TextAggregationMetricsData` — Sentence aggregation latency (TTS)
- `TurnMetricsData` — Turn completion predictions

You can access the metrics data by either adding a custom [FrameProcessor](/guides/fundamentals/custom-frame-processor) to your pipeline or adding an [observer](/server/utilities/observers/observer-pattern) to monitor `MetricsFrame`s.
@@ -155,6 +156,7 @@ from pipecat.frames.frames import MetricsFrame
from pipecat.metrics.metrics import (
LLMUsageMetricsData,
ProcessingMetricsData,
+ TextAggregationMetricsData,
TTFBMetricsData,
TTSUsageMetricsData,
)
@@ -175,6 +177,8 @@ class MetricsLogger(FrameProcessor):
print(
f"!!! MetricsFrame: {frame}, prompt_tokens: {tokens.prompt_tokens}, completion_tokens: {tokens.completion_tokens}"
)
+ elif isinstance(d, TextAggregationMetricsData):
+     print(f"!!! MetricsFrame: {frame}, text aggregation: {d.value}")
elif isinstance(d, TTSUsageMetricsData):
print(f"!!! MetricsFrame: {frame}, characters: {d.value}")
await self.push_frame(frame, direction)
4 changes: 2 additions & 2 deletions guides/learn/text-to-speech.mdx
@@ -32,8 +32,8 @@ pipeline = Pipeline([
**TTS generates speech through two primary mechanisms:**

1. **Streamed LLM tokens** via `LLMTextFrame`s:
- - TTS aggregates streaming tokens into complete sentences
- - Sentences are sent to TTS service for audio generation
+ - By default, TTS aggregates streaming tokens into complete sentences before synthesis (`TextAggregationMode.SENTENCE`)
+ - Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency
- Audio bytes stream back and play immediately
- End-to-end latency often under 200ms

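The aggregation behavior described above can be sketched in plain Python (a toy model of SENTENCE mode, not Pipecat's implementation):

```python
import re

def aggregate_sentences(tokens):
    """Toy model of SENTENCE-mode aggregation: buffer streamed LLM
    tokens and emit a chunk each time a sentence boundary appears."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A sentence boundary here: ., !, or ? followed by whitespace.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of turn

tokens = ["Hi", " there", ".", " How", " are", " you", "?"]
print(list(aggregate_sentences(tokens)))  # ['Hi there.', 'How are you?']

# TOKEN mode, by contrast, would forward each token to the TTS service
# immediately, trading sentence-level prosody for lower latency.
```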
11 changes: 9 additions & 2 deletions server/services/tts/asyncai.mdx
@@ -93,8 +93,15 @@ Before using Async TTS services, you need:
Audio container format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Whether to aggregate sentences before sending to the TTS service.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
13 changes: 10 additions & 3 deletions server/services/tts/azure.mdx
@@ -83,8 +83,15 @@ Before using Azure TTS services, you need:
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Whether to aggregate sentences before synthesis.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -94,7 +101,7 @@ Before using Azure TTS services, you need:

### AzureHttpTTSService

The HTTP service accepts the same parameters as the streaming service except `aggregate_sentences`:
The HTTP service accepts the same parameters as the streaming service except `text_aggregation_mode` and `aggregate_sentences`:

<ParamField path="api_key" type="str" required>
Azure Cognitive Services subscription key.
16 changes: 11 additions & 5 deletions server/services/tts/cartesia.mdx
@@ -97,9 +97,15 @@ Before using Cartesia TTS services, you need:
Audio container format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to Cartesia. Produces
-   more natural-sounding speech.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -119,7 +125,7 @@ The HTTP service accepts similar parameters to the WebSocket service, with these
API version for HTTP service.
</ParamField>

- The HTTP service does not accept `aggregate_sentences`.
+ The HTTP service does not accept `text_aggregation_mode` or `aggregate_sentences`.

### InputParams

@@ -291,7 +297,7 @@ tts = CartesiaTTSService(
## Notes

- **WebSocket vs HTTP**: The WebSocket service supports word-level timestamps, audio context management, and interruption handling, making it better for interactive conversations. The HTTP service is simpler but lacks these features.
- - **Sentence aggregation**: Enabled by default. Buffering until sentence boundaries produces more natural speech with minimal latency impact. Disable with `aggregate_sentences=False` if you need word-by-word streaming.
+ - **Text aggregation**: Sentence aggregation is enabled by default (`text_aggregation_mode=TextAggregationMode.SENTENCE`). Buffering until sentence boundaries produces more natural speech. Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency. Cartesia handles token streaming well.
- **Connection timeout**: Cartesia WebSocket connections time out after 5 minutes of inactivity (no keepalive mechanism is available). The service automatically reconnects when needed.
- **CJK language support**: For Chinese, Japanese, and Korean, the service combines individual characters from timestamp messages into meaningful word units.
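The automatic-reconnect behavior mentioned in the timeout note can be sketched generically (an illustrative plain-Python sketch under assumed semantics, not the service's actual logic):

```python
def with_reconnect(connect, do_work, max_attempts=3):
    """Generic auto-reconnect sketch: if a call fails because the
    connection went stale, re-establish it and retry."""
    conn = connect()
    for _ in range(max_attempts):
        if conn["closed"]:
            conn = connect()  # transparently reconnect
        try:
            return do_work(conn)
        except ConnectionError:
            conn["closed"] = True  # mark stale, retry on next pass
    raise ConnectionError("gave up after retries")

connections = {"count": 0}

def connect():
    connections["count"] += 1
    return {"closed": False, "id": connections["count"]}

def send(conn):
    if conn["id"] == 1:
        raise ConnectionError  # first connection timed out
    return f"sent on connection {conn['id']}"

result = with_reconnect(connect, send)
print(result)  # sent on connection 2
```

The caller never sees the stale connection; only the retry cost is visible as latency.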

15 changes: 10 additions & 5 deletions server/services/tts/elevenlabs.mdx
@@ -85,10 +85,15 @@ export ELEVENLABS_API_KEY=your_api_key
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to ElevenLabs. Produces
-   more natural-sounding speech at the cost of a small latency increase (~15ms)
-   for the first word of each sentence.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -204,7 +209,7 @@ async with aiohttp.ClientSession() as session:

- **Multilingual models required for `language`**: Setting `language` with a non-multilingual model (e.g. `eleven_turbo_v2_5`) has no effect. Use `eleven_multilingual_v2` or similar.
- **WebSocket vs HTTP**: The WebSocket service supports word-level timestamps and interruption handling, making it significantly better for interactive conversations. The HTTP service is simpler but lacks these features.
- - **Sentence aggregation**: Enabled by default. Buffering until sentence boundaries produces more natural speech with minimal latency impact. Disable with `aggregate_sentences=False` if you need word-by-word streaming.
+ - **Text aggregation**: Sentence aggregation is enabled by default (`text_aggregation_mode=TextAggregationMode.SENTENCE`). Buffering until sentence boundaries produces more natural speech. Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency, but you must also set `auto_mode=False` in `InputParams` when using TOKEN mode.

## Event Handlers

13 changes: 10 additions & 3 deletions server/services/tts/inworld.mdx
@@ -95,8 +95,15 @@ WebSocket-based service for lowest latency streaming.
Audio encoding format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Whether to aggregate sentences before synthesis.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="append_trailing_space" type="bool" default="True">
@@ -212,7 +219,7 @@ async with aiohttp.ClientSession() as session:

- **WebSocket vs HTTP**: The WebSocket service (`InworldTTSService`) provides the lowest latency with bidirectional streaming and supports multiple independent audio contexts per connection (max 5). The HTTP service supports both streaming and non-streaming modes via the `streaming` parameter.
- **Word timestamps**: Both services provide word-level timestamps for synchronized text display. Timestamps are tracked cumulatively across utterances within a turn.
- - **Auto mode**: When `auto_mode=True` (default), the server controls flushing of buffered text for optimal latency and quality. This is recommended when text is sent in full sentences or phrases (i.e., when `aggregate_sentences=True`).
+ - **Auto mode**: When `auto_mode=True` (default), the server controls flushing of buffered text for optimal latency and quality. This is recommended when text is sent in full sentences or phrases (i.e., when using `text_aggregation_mode=TextAggregationMode.SENTENCE`).
- **Keepalive**: The WebSocket service sends periodic keepalive messages every 60 seconds to maintain the connection.

## Event Handlers
11 changes: 9 additions & 2 deletions server/services/tts/neuphonic.mdx
@@ -84,8 +84,15 @@ Before using Neuphonic TTS services, you need:
Audio encoding format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to Neuphonic.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
21 changes: 17 additions & 4 deletions server/services/tts/rime.mdx
@@ -81,8 +81,15 @@ Before using Rime TTS services, you need:
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to Rime.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -147,8 +154,8 @@ A non-JSON WebSocket service for models like Arcana that use plain text messages
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries. `TOKEN` streams tokens directly for
+   lower latency. Import from `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
11 changes: 9 additions & 2 deletions server/services/tts/sarvam.mdx
@@ -83,8 +83,15 @@ Sarvam offers two service implementations: `SarvamTTSService` (WebSocket) for re
WebSocket URL for the TTS backend.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="sample_rate" type="int" default="None">
Expand Down