feat(tts): add Azure Speech TTS provider #51776
leonchui wants to merge 1 commit into openclaw:main from
Conversation
- Add Azure TTS provider with SSML synthesis
- Support for 400+ neural voices including Cantonese (zh-HK-HiuMaanNeural)
- Config: apiKey, region, voice, lang, outputFormat
- Environment variables: AZURE_SPEECH_API_KEY, AZURE_SPEECH_REGION
- Provider ID: 'azure' with alias 'azure-tts'
- Added azure to TTS_PROVIDERS and auto-selection
- Added azure_voice directive support in parseTtsDirectives
- Added tests for Azure TTS voice listing
- Fixed file extension mapping for non-MP3 formats
- Resolves issue openclaw#4021
Greptile Summary
This PR adds Azure Speech as a new TTS provider, supporting 400+ neural voices via the Azure Cognitive Services REST API with SSML synthesis and region/baseUrl configuration. Key issues found:
Confidence Score: 2/5
Path: src/tts/providers/azure.ts
Line: 79
Comment:
**SSML injection via unescaped `voice` and `lang`**
`buildAzureSSML` escapes the user-provided `text` body correctly, but both `voice` and `lang` are interpolated directly into the XML template without any escaping.
The `voice` parameter is populated from the `azure_voice` directive override (`overrides.azure.voice`), which accepts any non-empty string. An attacker who can influence a TTS directive (e.g. via message content reaching `parseTtsDirectives`) could inject arbitrary SSML attributes or elements:
- Input: `azure_voice=foo' xml:lang='evil`
- Resulting SSML: `<voice name='foo' xml:lang='evil'>...</voice>`
Similarly `lang` (which uses single-quote delimiters in the `xml:lang` attribute) would be broken by any value containing a single quote.
At minimum both values should be single-quote-escaped before insertion; ideally a proper XML attribute escaper should be applied:
```suggestion
return `<speak version='1.0' xmlns='http://www.w3.org/2001/10/synthesis' xml:lang='${escapeXmlAttr(lang || "en-US")}'><voice name='${escapeXmlAttr(voice)}'>${escapedText}</voice></speak>`;
```
Where `escapeXmlAttr` replaces at least `&`, `<`, `>`, `"`, and `'` (i.e. the same set applied to `escapedText`).
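A minimal sketch of such an escaper, using the `escapeXmlAttr` name from the suggestion above (this is an illustration, not the project's actual code):

```typescript
// Hypothetical helper: escape the five XML-significant characters in an
// attribute value. `&` must be replaced first so the other entities are
// not double-escaped.
function escapeXmlAttr(value: string): string {
  return value
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// The injection payload from the comment becomes inert attribute text:
console.log(escapeXmlAttr("foo' xml:lang='evil"));
// → foo&apos; xml:lang=&apos;evil
```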
---
Path: src/tts/providers/azure.ts
Line: 99-103
Comment:
**`isConfigured` check diverges from auto-selection logic in `getTtsProvider`**
`isConfigured` returns `false` when no `voice` or `lang` is configured (API key alone is not enough). However, `getTtsProvider` in `tts.ts` auto-selects `azure` as soon as `resolveTtsApiKey` finds an `AZURE_SPEECH_API_KEY` — it does **not** consult `isConfigured`.
The practical result: a user who sets only `AZURE_SPEECH_API_KEY` (no voice) will have azure auto-selected, which then hard-fails at `synthesize` time with:
> Azure voice not configured. Set voice in config or use [[tts:voice=…]] directive
The error message itself references `[[tts:voice=…]]` (the OpenAI voice directive) rather than the Azure-specific `[[tts:azure_voice=…]]`, adding to the confusion.
Consider either:
1. Aligning `getTtsProvider` to also require a configured voice before auto-selecting azure, or
2. Updating the error message to reference the correct directive:
```suggestion
"Azure voice not configured. Set voice in config or use [[tts:azure_voice=zh-HK-HiuMaanNeural]] directive",
```
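Option 1 could look roughly like this; the config shape and function name are assumptions for illustration, not the project's actual signatures:

```typescript
// Hypothetical predicate mirroring isConfigured, so auto-selection and the
// runtime check cannot diverge: azure is only auto-picked when both an API
// key AND an explicit voice are present.
type AzureTtsConfig = { azure?: { apiKey?: string; voice?: string } };

function shouldAutoSelectAzure(
  config: AzureTtsConfig,
  env: Record<string, string | undefined>,
): boolean {
  const apiKey = config.azure?.apiKey || env.AZURE_SPEECH_API_KEY;
  return Boolean(apiKey && config.azure?.voice);
}
```

With this, a host that sets only `AZURE_SPEECH_API_KEY` is simply skipped by auto-selection instead of failing later at `synthesize` time.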
---
Path: src/tts/providers/azure.ts
Line: 4
Comment:
**Duplicate constant across modules**
`DEFAULT_AZURE_OUTPUT_FORMAT` is defined identically in both `src/tts/providers/azure.ts` (line 4) and `src/tts/tts.ts`. If the default ever changes it must be updated in two places. Consider exporting it from one location (e.g. `azure.ts`) and importing it in `tts.ts`.
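A sketch of the single-source approach (the import path is an assumption based on the file layout described above):

```typescript
// azure.ts — define and export the default once
export const DEFAULT_AZURE_OUTPUT_FORMAT = "audio-24khz-48kbitrate-mono-mp3";

// tts.ts — import instead of redefining (path assumed):
// import { DEFAULT_AZURE_OUTPUT_FORMAT } from "./providers/azure.js";
```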
---
Path: src/tts/providers/azure.ts
Line: 45-49
Comment:
**No request timeout on `listAzureVoices`**
The `synthesize` path correctly uses `AbortSignal.timeout(timeoutMs)`, but the `fetch` call inside `listAzureVoices` has no timeout. A slow or unresponsive Azure endpoint could stall a voice-listing request indefinitely. Consider passing a timeout signal here as well:
```suggestion
const response = await fetch(url, {
headers: {
"Ocp-Apim-Subscription-Key": params.apiKey,
},
signal: AbortSignal.timeout(params.timeoutMs ?? DEFAULT_TIMEOUT_MS),
});
```
You would need to add an optional `timeoutMs` field to the params type accordingly.
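The params extension could be sketched as follows; every field name other than `timeoutMs` is inferred from the snippet above and should be checked against the actual type:

```typescript
const DEFAULT_TIMEOUT_MS = 30_000; // assumed default, verify the real constant

interface ListAzureVoicesParams {
  apiKey: string;
  region?: string;
  baseUrl?: string;
  timeoutMs?: number; // new optional field, fed to AbortSignal.timeout(...)
}

// A caller-supplied value wins; otherwise fall back to the default.
const params: ListAzureVoicesParams = { apiKey: "key", timeoutMs: 5_000 };
const effectiveTimeout = params.timeoutMs ?? DEFAULT_TIMEOUT_MS;
```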
Last reviewed commit: "feat(tts): add Azure..."
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 996c529913
```ts
isConfigured: ({ config }) =>
  Boolean(
    ((config as any)?.azure?.apiKey || process.env.AZURE_SPEECH_API_KEY) &&
      ((config as any)?.azure?.voice || (config as any)?.azure?.lang),
  ),
```

**Require a configured Azure voice before advertising readiness**

`isConfigured()` currently checks `azure.voice || azure.lang`, but `resolveTtsConfig()` always fills `config.azure.lang` with `"en-US"` (src/tts/tts.ts:345-355). In practice, any host with only `AZURE_SPEECH_API_KEY` set is now reported as Azure-ready, `getTtsProvider()` can auto-pick Azure as the primary provider (src/tts/tts.ts:503-510), and the first synthesis then fails with "Azure voice not configured". That adds a guaranteed failure to every fallback path and hard-fails `disableFallback` callers until a voice is explicitly configured.
```ts
return listAzureVoices({
  apiKey,
  region: (req.config as any)?.azure?.region || process.env.AZURE_SPEECH_REGION,
  baseUrl: (req.config as any)?.azure?.baseUrl,
```

**Thread `req.baseUrl` through Azure voice listing**

`listSpeechVoices()` passes a caller-supplied `baseUrl` into every provider (src/tts/tts.ts:848-852), but this Azure adapter ignores it and only forwards `config.azure.baseUrl`. Any setup that uses a custom Azure endpoint (private link, sovereign cloud, proxy, etc.) can still synthesize with the custom URL, yet `runtime.tts.listVoices({ baseUrl })` will query the default public endpoint instead and fail or return the wrong catalog.
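One way to express the intended precedence; the helper name and the regional-endpoint default are assumptions to be verified against the actual adapter:

```typescript
// Hypothetical resolver: an explicit request baseUrl beats the config
// value, which beats the regional default endpoint.
function resolveAzureBaseUrl(
  reqBaseUrl: string | undefined,
  configBaseUrl: string | undefined,
  region: string | undefined,
): string {
  return (
    reqBaseUrl ??
    configBaseUrl ??
    `https://${region ?? "eastus"}.tts.speech.microsoft.com`
  );
}
```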
```ts
// Use timeout from config, directive, or default
const timeoutMs = (req.config as any)?.azure?.timeoutMs ?? DEFAULT_TIMEOUT_MS;
```

**Honor the global TTS timeout for Azure requests**

The top-level `messages.tts.timeoutMs` is the generic request timeout (src/config/types.tts.ts:110-111), and the existing providers all respect `req.config.timeoutMs`. This Azure implementation skips that fallback and jumps straight to `DEFAULT_TIMEOUT_MS`, so deployments that shorten the global timeout to keep auto-replies responsive will still wait ~30s on Azure unless they discover and duplicate a provider-specific override.
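The suggested fallback chain can be sketched like this, with an assumed config shape: provider-specific override first, then the global timeout, then the default.

```typescript
const DEFAULT_TIMEOUT_MS = 30_000; // assumed default

// Provider override → global messages.tts timeout → built-in default.
function resolveAzureTimeout(config: {
  timeoutMs?: number;
  azure?: { timeoutMs?: number };
}): number {
  return config.azure?.timeoutMs ?? config.timeoutMs ?? DEFAULT_TIMEOUT_MS;
}
```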
Summary
Add Azure Speech TTS provider to OpenClaw with SSML synthesis support.
Problem
What Changed
Features
Related Issues