Merged
16 changes: 10 additions & 6 deletions guides/fundamentals/metrics.mdx
@@ -26,14 +26,15 @@ task = PipelineTask(

Once enabled, Pipecat logs the following metrics:

- | Metric           | Description                                                     |
- | ---------------- | --------------------------------------------------------------- |
- | TTFB             | Time To First Byte in seconds                                   |
- | Processing Time  | Time taken by the service to respond in seconds                 |
- | Text Aggregation | Time from first LLM token to first complete sentence in seconds |
+ | Metric           | Description                                                                      |
+ | ---------------- | -------------------------------------------------------------------------------- |
+ | TTFB             | Time To First Byte in seconds                                                    |
+ | Processing Time  | Time taken by the service to respond in seconds                                  |
+ | Text Aggregation | Time from the first LLM token to the first complete sentence (TTS services only) |
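As a rough mental model of how TTFB relates to processing time (a plain-Python sketch, not Pipecat's actual instrumentation):

```python
import time

def timed_stream(stream):
    """Toy illustration of the TTFB vs. processing-time metrics:
    TTFB is the delay until the first chunk arrives; processing time
    covers the whole response."""
    start = time.monotonic()
    first_chunk_at = None
    chunks = []
    for chunk in stream:
        if first_chunk_at is None:
            first_chunk_at = time.monotonic()  # first byte observed
        chunks.append(chunk)
    return {
        "ttfb": first_chunk_at - start,
        "processing_time": time.monotonic() - start,
        "chunks": chunks,
    }

def fake_service(text):
    """Stand-in for a streaming service response."""
    for word in text.split():
        time.sleep(0.01)  # simulated per-chunk latency
        yield word

metrics = timed_stream(fake_service("hello world"))
assert metrics["ttfb"] <= metrics["processing_time"]
```

TTFB can never exceed processing time, which is why the two are reported together.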

```console Sample output
AnthropicLLMService#0 TTFB: 0.8378312587738037
CartesiaTTSService#0 text aggregation time: 0.2134
CartesiaTTSService#0 processing time: 0.0005071163177490234
CartesiaTTSService#0 TTFB: 0.17177796363830566
AnthropicLLMService#0 processing time: 2.4927797317504883
```

@@ -114,7 +115,7 @@ When metrics are enabled, Pipecat emits a `MetricsFrame` for each interaction.
- `ProcessingMetricsData` — Processing time
- `LLMUsageMetricsData` — LLM token usage
- `TTSUsageMetricsData` — TTS character usage
- - `TextAggregationMetricsData` — Sentence aggregation latency
+ - `TextAggregationMetricsData` — Sentence aggregation latency (TTS)
- `TurnMetricsData` — Turn completion predictions

You can access the metrics data by either adding a custom [FrameProcessor](/guides/fundamentals/custom-frame-processor) to your pipeline or adding an [observer](/server/utilities/observers/observer-pattern) to monitor `MetricsFrame`s.
@@ -155,6 +156,7 @@ from pipecat.frames.frames import MetricsFrame
from pipecat.metrics.metrics import (
LLMUsageMetricsData,
ProcessingMetricsData,
+ TextAggregationMetricsData,
TTFBMetricsData,
TTSUsageMetricsData,
)
@@ -175,6 +177,8 @@ class MetricsLogger(FrameProcessor):
print(
f"!!! MetricsFrame: {frame}, prompt_tokens: {tokens.prompt_tokens}, completion_tokens: {tokens.completion_tokens}"
)
+ elif isinstance(d, TextAggregationMetricsData):
+     print(f"!!! MetricsFrame: {frame}, text aggregation: {d.value}")
elif isinstance(d, TTSUsageMetricsData):
print(f"!!! MetricsFrame: {frame}, characters: {d.value}")
await self.push_frame(frame, direction)
4 changes: 2 additions & 2 deletions guides/learn/text-to-speech.mdx
@@ -32,8 +32,8 @@ pipeline = Pipeline([
**TTS generates speech through two primary mechanisms:**

1. **Streamed LLM tokens** via `LLMTextFrame`s:
- - TTS aggregates streaming tokens into complete sentences
- - Sentences are sent to TTS service for audio generation
+ - By default, TTS aggregates streaming tokens into complete sentences before synthesis (`TextAggregationMode.SENTENCE`)
+ - Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency
- Audio bytes stream back and play immediately
- End-to-end latency often under 200ms

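The aggregation behavior described above can be sketched in plain Python (a toy model of SENTENCE mode, not Pipecat's implementation):

```python
import re

def aggregate_sentences(tokens):
    """Toy model of SENTENCE-mode aggregation: buffer streamed LLM
    tokens and emit a chunk each time a sentence boundary appears."""
    buffer = ""
    for token in tokens:
        buffer += token
        # A sentence boundary here: ., !, or ? followed by whitespace.
        while (match := re.search(r"[.!?]\s", buffer)):
            yield buffer[: match.end()].strip()
            buffer = buffer[match.end():]
    if buffer.strip():
        yield buffer.strip()  # flush whatever remains at end of turn

tokens = ["Hi", " there", ".", " How", " are", " you", "?"]
print(list(aggregate_sentences(tokens)))  # ['Hi there.', 'How are you?']

# TOKEN mode, by contrast, would forward each token to the TTS service
# immediately, trading sentence-level prosody for lower latency.
```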
11 changes: 9 additions & 2 deletions server/services/tts/asyncai.mdx
@@ -93,8 +93,15 @@ Before using Async TTS services, you need:
Audio container format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Whether to aggregate sentences before sending to the TTS service.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
13 changes: 10 additions & 3 deletions server/services/tts/azure.mdx
@@ -83,8 +83,15 @@ Before using Azure TTS services, you need:
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Whether to aggregate sentences before synthesis.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -94,7 +101,7 @@ Before using Azure TTS services, you need:

### AzureHttpTTSService

The HTTP service accepts the same parameters as the streaming service except `aggregate_sentences`:
The HTTP service accepts the same parameters as the streaming service except `text_aggregation_mode` and `aggregate_sentences`:

<ParamField path="api_key" type="str" required>
Azure Cognitive Services subscription key.
16 changes: 11 additions & 5 deletions server/services/tts/cartesia.mdx
@@ -97,9 +97,15 @@ Before using Cartesia TTS services, you need:
Audio container format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to Cartesia. Produces
-   more natural-sounding speech.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -119,7 +125,7 @@ The HTTP service accepts similar parameters to the WebSocket service, with these
API version for HTTP service.
</ParamField>

- The HTTP service does not accept `aggregate_sentences`.
+ The HTTP service does not accept `text_aggregation_mode` or `aggregate_sentences`.

### InputParams

@@ -291,7 +297,7 @@ tts = CartesiaTTSService(
## Notes

- **WebSocket vs HTTP**: The WebSocket service supports word-level timestamps, audio context management, and interruption handling, making it better for interactive conversations. The HTTP service is simpler but lacks these features.
- - **Sentence aggregation**: Enabled by default. Buffering until sentence boundaries produces more natural speech with minimal latency impact. Disable with `aggregate_sentences=False` if you need word-by-word streaming.
+ - **Text aggregation**: Sentence aggregation is enabled by default (`text_aggregation_mode=TextAggregationMode.SENTENCE`). Buffering until sentence boundaries produces more natural speech. Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency. Cartesia handles token streaming well.
- **Connection timeout**: Cartesia WebSocket connections time out after 5 minutes of inactivity (no keepalive mechanism is available). The service automatically reconnects when needed.
- **CJK language support**: For Chinese, Japanese, and Korean, the service combines individual characters from timestamp messages into meaningful word units.
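The automatic-reconnect behavior mentioned in the timeout note can be sketched generically (an illustrative plain-Python sketch under assumed semantics, not the service's actual logic):

```python
def with_reconnect(connect, do_work, max_attempts=3):
    """Generic auto-reconnect sketch: if a call fails because the
    connection went stale, re-establish it and retry."""
    conn = connect()
    for _ in range(max_attempts):
        if conn["closed"]:
            conn = connect()  # transparently reconnect
        try:
            return do_work(conn)
        except ConnectionError:
            conn["closed"] = True  # mark stale, retry on next pass
    raise ConnectionError("gave up after retries")

connections = {"count": 0}

def connect():
    connections["count"] += 1
    return {"closed": False, "id": connections["count"]}

def send(conn):
    if conn["id"] == 1:
        raise ConnectionError  # first connection timed out
    return f"sent on connection {conn['id']}"

result = with_reconnect(connect, send)
print(result)  # sent on connection 2
```

The caller never sees the stale connection; only the retry cost is visible as latency.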

15 changes: 10 additions & 5 deletions server/services/tts/elevenlabs.mdx
@@ -85,10 +85,15 @@ export ELEVENLABS_API_KEY=your_api_key
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to ElevenLabs. Produces
-   more natural-sounding speech at the cost of a small latency increase (~15ms)
-   for the first word of each sentence.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -204,7 +209,7 @@ async with aiohttp.ClientSession() as session:

- **Multilingual models required for `language`**: Setting `language` with a non-multilingual model (e.g. `eleven_turbo_v2_5`) has no effect. Use `eleven_multilingual_v2` or similar.
- **WebSocket vs HTTP**: The WebSocket service supports word-level timestamps and interruption handling, making it significantly better for interactive conversations. The HTTP service is simpler but lacks these features.
- - **Sentence aggregation**: Enabled by default. Buffering until sentence boundaries produces more natural speech with minimal latency impact. Disable with `aggregate_sentences=False` if you need word-by-word streaming.
+ - **Text aggregation**: Sentence aggregation is enabled by default (`text_aggregation_mode=TextAggregationMode.SENTENCE`). Buffering until sentence boundaries produces more natural speech. Set `text_aggregation_mode=TextAggregationMode.TOKEN` to stream tokens directly for lower latency, but you must also set `auto_mode=False` in `InputParams` when using TOKEN mode.

## Event Handlers

13 changes: 10 additions & 3 deletions server/services/tts/inworld.mdx
@@ -95,8 +95,15 @@ WebSocket-based service for lowest latency streaming.
Audio encoding format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Whether to aggregate sentences before synthesis.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="append_trailing_space" type="bool" default="True">
@@ -212,7 +219,7 @@ async with aiohttp.ClientSession() as session:

- **WebSocket vs HTTP**: The WebSocket service (`InworldTTSService`) provides the lowest latency with bidirectional streaming and supports multiple independent audio contexts per connection (max 5). The HTTP service supports both streaming and non-streaming modes via the `streaming` parameter.
- **Word timestamps**: Both services provide word-level timestamps for synchronized text display. Timestamps are tracked cumulatively across utterances within a turn.
- - **Auto mode**: When `auto_mode=True` (default), the server controls flushing of buffered text for optimal latency and quality. This is recommended when text is sent in full sentences or phrases (i.e., when `aggregate_sentences=True`).
+ - **Auto mode**: When `auto_mode=True` (default), the server controls flushing of buffered text for optimal latency and quality. This is recommended when text is sent in full sentences or phrases (i.e., when using `text_aggregation_mode=TextAggregationMode.SENTENCE`).
- **Keepalive**: The WebSocket service sends periodic keepalive messages every 60 seconds to maintain the connection.

## Event Handlers
11 changes: 9 additions & 2 deletions server/services/tts/neuphonic.mdx
@@ -84,8 +84,15 @@ Before using Neuphonic TTS services, you need:
Audio encoding format.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to Neuphonic.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
21 changes: 17 additions & 4 deletions server/services/tts/rime.mdx
@@ -81,8 +81,15 @@ Before using Rime TTS services, you need:
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending to Rime.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
@@ -147,8 +154,8 @@ A non-JSON WebSocket service for models like Arcana that use plain text messages
sample rate.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries. `TOKEN` streams tokens directly for
+   lower latency. Import from `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="params" type="InputParams" default="None">
11 changes: 9 additions & 2 deletions server/services/tts/sarvam.mdx
@@ -83,8 +83,15 @@ Sarvam offers two service implementations: `SarvamTTSService` (WebSocket) for re
WebSocket URL for the TTS backend.
</ParamField>

- <ParamField path="aggregate_sentences" type="bool" default="True">
-   Buffer text until sentence boundaries before sending.
+ <ParamField path="text_aggregation_mode" type="TextAggregationMode" default="TextAggregationMode.SENTENCE">
+   Controls how incoming text is aggregated before synthesis. `SENTENCE` (default)
+   buffers text until sentence boundaries, producing more natural speech. `TOKEN`
+   streams tokens directly for lower latency. Import from
+   `pipecat.services.tts_service`.
</ParamField>

+ <ParamField path="aggregate_sentences" type="bool" default="None">
+   _Deprecated in v0.0.104._ Use `text_aggregation_mode` instead.
+ </ParamField>

<ParamField path="sample_rate" type="int" default="None">
Expand Down