From bb0bc583bcd1664d20f8ac96c6c33e07bdbe5aaf Mon Sep 17 00:00:00 2001 From: Lazare Rossillon Date: Tue, 7 Apr 2026 16:53:40 +0200 Subject: [PATCH 1/5] Refine benchmarking documentation --- chapters/introduction/index.mdx | 4 +- chapters/pre-recorded-stt/benchmarking.mdx | 167 +++++++++++++++++++++ snippets/get-transcription-result.mdx | 2 +- snippets/getting-started-playground.mdx | 8 +- snippets/setup-account.mdx | 4 +- style.css | 16 +- 6 files changed, 192 insertions(+), 9 deletions(-) create mode 100644 chapters/pre-recorded-stt/benchmarking.mdx diff --git a/chapters/introduction/index.mdx b/chapters/introduction/index.mdx index e73570f..afd48fc 100644 --- a/chapters/introduction/index.mdx +++ b/chapters/introduction/index.mdx @@ -10,11 +10,13 @@ import IntegrationsCards from '/snippets/integrations-cards.mdx'; className="block dark:hidden rounded-lg" noZoom src="/assets/dark-banner.png" + alt="Gladia documentation hero banner in the light theme." /> Gladia documentation hero banner in the dark theme. @@ -61,4 +63,4 @@ in both Real-time and asynchronous ways, with audio intelligence tools to extrac # Our integration partners ------ \ No newline at end of file +----- diff --git a/chapters/pre-recorded-stt/benchmarking.mdx b/chapters/pre-recorded-stt/benchmarking.mdx new file mode 100644 index 0000000..d23b9d0 --- /dev/null +++ b/chapters/pre-recorded-stt/benchmarking.mdx @@ -0,0 +1,167 @@ +--- +title: Benchmarking +description: How to benchmark speech-to-text systems correctly with transcript normalization, WER, and representative datasets. +--- + +Benchmarking speech-to-text systems is easy to get wrong. Small methodology changes can produce large swings in reported quality, which makes comparisons misleading. + + + Although this page currently lives in the pre-recorded section, the same methodology also applies to live transcription if you benchmark finalized outputs or _final_ utterances rather than partials. 
+ + +## Benchmarking at a glance + + + + Normalize both references and predictions before computing WER. + + + Measure substitutions, deletions, and insertions on normalized text. + + + Benchmark on audio that matches your real traffic and target users. + + + Look beyond one average score and inspect meaningful slices. + + + +## 1. Define your evaluation goal + +Before comparing providers, define what "good" means for your product. + +For example: + +- For a contact center, priority often goes to narrowband and dual-channel telephony robustness, channel-level speaker separation, and high recall on [named entities](/chapters/audio-intelligence/named-entity-recognition), especially [persons](/chapters/audio-intelligence/named-entity-recognition#supported-entities), [organizations](/chapters/audio-intelligence/named-entity-recognition#supported-entities), and [phone numbers](/chapters/audio-intelligence/named-entity-recognition#supported-entities). +- For a meeting assistant, the benchmark usually needs to emphasize [speaker diarization](/chapters/audio-intelligence/speaker-diarization), diarization stability over long recordings, turn attribution accuracy, punctuation and casing quality, and robustness on overlapping multi-speaker conversations. +- For a media workflow, the focus is often [subtitle](/chapters/audio-intelligence/subtitles) segmentation quality, subtitle timestamp precision, proper-noun fidelity, and acoustic robustness on noisy real-world recordings, so captions stay correctly aligned on screen. + +If your benchmark does not reflect your real traffic, the result will not tell you much about production performance. + +## 2. Normalize transcripts before computing WER + +Normalization defines which surface-form differences should be ignored before scoring. The goal is to remove formatting variance without erasing information that still matters for the benchmark. 
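To make the failure mode concrete, here is a deliberately naive normalizer — a minimal sketch for illustration only, not the `gladia-normalization` pipeline recommended later in this section — that just lowercases, strips punctuation, and collapses whitespace:

```python
import re

def naive_normalize(text: str) -> str:
    """Deliberately minimal normalization: lowercase, strip most
    punctuation, collapse whitespace. Illustration only."""
    text = text.lower()
    # Replace anything that is not a word character, whitespace, or `$`
    # with a space (so `3:00` becomes `3 00`, `Mr.` becomes `mr`).
    text = re.sub(r"[^\w\s$]", " ", text)
    # Collapse runs of whitespace into single spaces.
    return " ".join(text.split())

naive_normalize("Mr. Smith joined at 3:00 PM")  # -> "mr smith joined at 3 00 pm"
```

Note what this naive version still gets wrong: it cannot reconcile `Mr.` with `mister` or `fourteen` with `14`, which is exactly the entity-level variance a benchmark-grade normalization pipeline has to handle.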
+ +In practice, a good normalization pipeline should reconcile equivalent forms while preserving entity-level meaning such as currencies, numbered labels, honorifics, and timestamps. + +| Reference | Prediction | Why raw WER is wrong | +|-----------|------------|----------------------| +| `It's $50` | `it is fifty dollars` | Contraction and currency formatting differ, but the semantic content is the same. | +| `Meet at Point 14` | `meet at point fourteen` | The normalization should preserve the numbered entity instead of collapsing it into an unrelated form. | +| `Mr. Smith joined at 3:00 PM` | `mister smith joined at 3 pm` | Honorific and timestamp formatting differ, but the transcript content is equivalent. | + +This is where normalization quality matters. A benchmark-friendly normalizer should make few assumptions by default: it should standardize casing, abbreviations, symbols, and numeric rendering, but avoid discarding cues that may affect downstream evaluation. + +That is one of the main limitations of Whisper-style normalization. It is useful for generic lexical comparability, but it applies stronger canonicalization assumptions by default, such as aggressive lowercasing, punctuation stripping, and verbalization. That can be acceptable for broad WER reporting, yet it is less appropriate when entity fidelity matters, for example when distinguishing `Point 14`, preserving explicit currency markers, or keeping structured time expressions stable across systems. + +Gladia's recommended approach is to normalize both the ground-truth transcript and the model output with [`gladia-normalization`](https://github.com/gladiaio/normalization), our open-source normalization library designed for transcript evaluation. + +For example, a Gladia-oriented pipeline can reconcile formatting differences such as: + +- `It's $50` -> `it is 50 dollars` +- `Meet at Point 14` -> `meet at point 14` +- `Mr. 
Smith joined at 3:00 PM` -> `mister smith joined at 3 pm` + + + Open-source transcript normalization library used before WER computation. + + + + Be careful with generic Whisper-style normalizers such as [`whisper-normalizer`](https://pypi.org/project/whisper-normalizer/): beyond stronger normalization assumptions, its own package description notes known issues with Indic and other low-resource languages, so it may not be an appropriate evaluation baseline for multilingual production traffic. + + +```python +from normalization import load_pipeline + +pipeline = load_pipeline("gladia-3", language="en") + +reference = "Meet at Point 14. It's $50 at 3:00 PM." +prediction = "meet at point fourteen it is fifty dollars at 3 pm" + +normalized_reference = pipeline.normalize(reference) +normalized_prediction = pipeline.normalize(prediction) +``` + + + Always apply the same normalization pipeline to every system you compare. Changing the normalization rules between vendors invalidates the benchmark. + + +## 3. Compute WER correctly + +Word Error Rate measures the edit distance between a reference transcript and a predicted transcript at the word level. + +The standard formula is: + +```text +WER = (S + D + I) / N +``` + +Where: + +- `S` = substitutions +- `D` = deletions +- `I` = insertions +- `N` = number of words in the reference transcript + +Lower is better. In practice: + +1. Prepare a reference transcript for each audio sample. +2. Run each provider on the exact same audio. +3. Normalize both the reference and each prediction with the same pipeline. +4. Compute WER on the normalized outputs. +5. Aggregate results across the full dataset. + + + Do not compute WER on raw transcripts if providers format numbers, punctuation, abbreviations, or casing differently. That mostly measures formatting conventions, not recognition quality. + + +## 4. 
Choose a representative dataset + +Start from your [evaluation goal](#1-define-your-evaluation-goal): the right dataset depends on the use case and traffic shape you want to measure. + +A representative benchmark dataset is not just a random audio collection. It should be curated to reflect production traffic, annotated with reliable ground truth, and structured well enough to support consistent quality and latency measurement. + +Selection criteria: + +- Match the channel conditions you expect in production: telephony, browser microphone, meeting capture, studio audio, or field recording. +- Match the linguistic distribution: language mix, accents, code-switching frequency, and domain terminology. +- Match the acoustic conditions: compression artifacts, background noise, reverberation, cross-talk, and overlapping speech. +- Match the interaction pattern: single-speaker dictation, dialogue, multi-speaker meetings, or long-form continuous recordings. +- Match the output requirements: entity fidelity, diarization, subtitle timing, readability, or downstream extraction accuracy. +- Use transcripts that are strong enough to serve as ground truth, with timing annotations when latency is part of the benchmark. +- Keep the sample size large and diverse enough to keep rankings stable across domains, speakers, and recording conditions. +- Prefer a mix of public datasets for external comparability and private in-domain datasets for business relevance. + +Typical failure cases: + +- Benchmarking call-center audio with clean podcast recordings overestimates real-world performance. +- Benchmarking English-only speech does not capture code-switching traffic. +- Benchmarking short clips can hide failures that appear on long recordings with multiple speakers. + +For a broader methodology view, see [this benchmark guide](/chapters/pre-recorded-stt/benchmarking), especially the evaluation-goal section above when mapping use cases to dataset types. + +## 5. 
Interpret results carefully + +Do not stop at a single WER number. Review: + +- overall average WER +- median WER and spread across files +- breakdowns by language, domain, or audio condition +- failure modes on proper nouns, acronyms, and numbers +- whether differences are consistent or concentrated in a few hard samples + +Two systems can post similar average WER while failing on different error classes. Separate statistically meaningful gaps from noise introduced by dataset composition or normalization choices. + +If two systems are close, inspect actual transcript examples before drawing strong conclusions. + +## Common pitfalls + +- Comparing providers on different datasets +- Using low-quality or inconsistent ground truth +- Treating punctuation and formatting differences as recognition errors +- Drawing conclusions from too few samples +- Reporting one average score without any slice analysis diff --git a/snippets/get-transcription-result.mdx b/snippets/get-transcription-result.mdx index 5cfc609..53f360f 100644 --- a/snippets/get-transcription-result.mdx +++ b/snippets/get-transcription-result.mdx @@ -12,7 +12,7 @@ You can get your transcription results in **3 different ways**: You can configure webhooks at https://app.gladia.io/webhooks to be notified when your transcriptions are done. - + Gladia dashboard webhook settings page for configuring transcription notifications. Once a transcription is done, a `POST` request will be made to the endpoint you configured. The request body is a JSON object containing the transcription `id` that you can use to retrieve your result with [our API](/api-reference/v2/pre-recorded/get). For the full body definition, check [our API definition](/api-reference/v2/pre-recorded/webhook/success). 
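As a sketch, the webhook notification described above can be handled like this. The payload field name (`id`) follows the description above; the result URL built here is illustrative, not the canonical endpoint path — check the linked API reference for the real one:

```python
import json

def webhook_result_url(raw_body: str) -> str:
    """Extract the transcription `id` from a webhook notification body
    and build a URL for fetching the full result.

    The base URL below is an assumption for illustration -- see the
    API reference for the canonical retrieval endpoint.
    """
    payload = json.loads(raw_body)
    transcription_id = payload["id"]
    return f"https://api.gladia.io/v2/pre-recorded/{transcription_id}"
```

Your webhook endpoint would call this on the `POST` body, then `GET` the returned URL (with your API key) to retrieve the transcript.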
diff --git a/snippets/getting-started-playground.mdx b/snippets/getting-started-playground.mdx index 6483a30..01263e9 100644 --- a/snippets/getting-started-playground.mdx +++ b/snippets/getting-started-playground.mdx @@ -8,7 +8,7 @@ audio transcription. Choose your audio source (stream from you microphone, or upload a local file) - + Gladia playground step showing audio source selection options. Then proceed to the next step. @@ -25,7 +25,7 @@ audio transcription. - + Gladia playground feature selection screen with transcription options enabled. @@ -35,7 +35,7 @@ audio transcription. Text in italic in the transcription represents [partials transcripts](/chapters/live-stt/features#partial-transcripts). - + Gladia playground live transcription screen after starting capture. @@ -44,7 +44,7 @@ audio transcription. the result in JSON format (the one you'd get with an API call). - + Gladia playground transcription results view with formatted transcript and JSON output. diff --git a/snippets/setup-account.mdx b/snippets/setup-account.mdx index 438842d..62055fc 100644 --- a/snippets/setup-account.mdx +++ b/snippets/setup-account.mdx @@ -10,9 +10,9 @@ Now that you signed up, login to app.gladia.io and go to the [API keys section]( created a default key for you. You can use this one or create your own. - + Gladia dashboard API keys page showing the location of the default key. Gladia offers 10 Hours of free audio transcription per month if you want to test the service! -With your API key, you're now ready to use Gladia APIs. \ No newline at end of file +With your API key, you're now ready to use Gladia APIs. 
diff --git a/style.css b/style.css index f3d4c05..90c4cdb 100644 --- a/style.css +++ b/style.css @@ -28,4 +28,18 @@ .prose pre { max-width: 100%; overflow: auto; -} \ No newline at end of file +} + +.benchmark-status-pill { + display: inline-flex; + align-items: center; + padding: 0.2rem 0.5rem; + border-radius: 999px; + border: 1px solid rgba(46, 52, 62, 0.18); + background: rgba(46, 52, 62, 0.08); + color: #2e343e; + font-size: 0.85rem; + font-weight: 600; + line-height: 1.2; + white-space: nowrap; +} From 8cb88401107ca5ce1b25d89fb81e4b2bdc4621af Mon Sep 17 00:00:00 2001 From: Lazare Rossillon Date: Tue, 7 Apr 2026 17:08:57 +0200 Subject: [PATCH 2/5] Human update wording --- chapters/pre-recorded-stt/benchmarking.mdx | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/chapters/pre-recorded-stt/benchmarking.mdx b/chapters/pre-recorded-stt/benchmarking.mdx index d23b9d0..9ea0070 100644 --- a/chapters/pre-recorded-stt/benchmarking.mdx +++ b/chapters/pre-recorded-stt/benchmarking.mdx @@ -1,13 +1,11 @@ --- title: Benchmarking -description: How to benchmark speech-to-text systems correctly with transcript normalization, WER, and representative datasets. +description: A guide on how we benchmark speech-to-text systems correctly with transcript normalization, WER, and representative datasets. --- -Benchmarking speech-to-text systems is easy to get wrong. Small methodology changes can produce large swings in reported quality, which makes comparisons misleading. +Benchmarking speech-to-text systems is easy to get wrong. - - Although this page currently lives in the pre-recorded section, the same methodology also applies to live transcription if you benchmark finalized outputs or _final_ utterances rather than partials. - +Small methodology changes can produce large swings in reported quality, which makes comparisons misleading. 
## Benchmarking at a glance From 8fa2f7a50332c1a2b037486ab605eb7c6c16cd58 Mon Sep 17 00:00:00 2001 From: Lazare Rossillon Date: Wed, 8 Apr 2026 16:01:43 +0200 Subject: [PATCH 3/5] Adjust benchmarking navigation and steps --- chapters/pre-recorded-stt/benchmarking.mdx | 13 ++++++++----- docs.json | 1 + 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/chapters/pre-recorded-stt/benchmarking.mdx b/chapters/pre-recorded-stt/benchmarking.mdx index 9ea0070..9c05a40 100644 --- a/chapters/pre-recorded-stt/benchmarking.mdx +++ b/chapters/pre-recorded-stt/benchmarking.mdx @@ -10,6 +10,9 @@ Small methodology changes can produce large swings in reported quality, which ma ## Benchmarking at a glance + + Decide what "good" means for your product before comparing systems. + Normalize both references and predictions before computing WER. @@ -24,7 +27,7 @@ Small methodology changes can produce large swings in reported quality, which ma -## 1. Define your evaluation goal +## 0. Define your evaluation goal Before comparing providers, define what "good" means for your product. @@ -36,7 +39,7 @@ For example: If your benchmark does not reflect your real traffic, the result will not tell you much about production performance. -## 2. Normalize transcripts before computing WER +## 1. Normalize transcripts before computing WER Normalization defines which surface-form differences should be ignored before scoring. The goal is to remove formatting variance without erasing information that still matters for the benchmark. @@ -88,7 +91,7 @@ normalized_prediction = pipeline.normalize(prediction) Always apply the same normalization pipeline to every system you compare. Changing the normalization rules between vendors invalidates the benchmark. -## 3. Compute WER correctly +## 2. Compute WER correctly Word Error Rate measures the edit distance between a reference transcript and a predicted transcript at the word level. @@ -117,7 +120,7 @@ Lower is better. 
In practice: Do not compute WER on raw transcripts if providers format numbers, punctuation, abbreviations, or casing differently. That mostly measures formatting conventions, not recognition quality. -## 4. Choose a representative dataset +## 3. Choose a representative dataset Start from your [evaluation goal](#1-define-your-evaluation-goal): the right dataset depends on the use case and traffic shape you want to measure. @@ -142,7 +145,7 @@ Typical failure cases: For a broader methodology view, see [this benchmark guide](/chapters/pre-recorded-stt/benchmarking), especially the evaluation-goal section above when mapping use cases to dataset types. -## 5. Interpret results carefully +## 4. Interpret results carefully Do not stop at a single WER number. Review: diff --git a/docs.json b/docs.json index ca37881..84e2c39 100644 --- a/docs.json +++ b/docs.json @@ -44,6 +44,7 @@ "chapters/pre-recorded-stt/features/sentences" ] }, + "chapters/pre-recorded-stt/benchmarking", { "group": "Live Transcription", "expanded": false, From c07347fcb3334d047b6ba956d9dd464f191b712f Mon Sep 17 00:00:00 2001 From: Lazare Rossillon Date: Wed, 8 Apr 2026 16:07:17 +0200 Subject: [PATCH 4/5] Simplify benchmarking normalization note --- chapters/pre-recorded-stt/benchmarking.mdx | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/chapters/pre-recorded-stt/benchmarking.mdx b/chapters/pre-recorded-stt/benchmarking.mdx index 9c05a40..eba43b9 100644 --- a/chapters/pre-recorded-stt/benchmarking.mdx +++ b/chapters/pre-recorded-stt/benchmarking.mdx @@ -1,9 +1,9 @@ --- title: Benchmarking -description: A guide on how we benchmark speech-to-text systems correctly with transcript normalization, WER, and representative datasets. +description: A guide on how to benchmark speech-to-text systems with transcript normalization, WER, and representative datasets. --- -Benchmarking speech-to-text systems is easy to get wrong. 
+Benchmarking speech-to-text systems is very easy to get wrong. Small methodology changes can produce large swings in reported quality, which makes comparisons misleading. @@ -43,7 +43,9 @@ If your benchmark does not reflect your real traffic, the result will not tell y Normalization defines which surface-form differences should be ignored before scoring. The goal is to remove formatting variance without erasing information that still matters for the benchmark. -In practice, a good normalization pipeline should reconcile equivalent forms while preserving entity-level meaning such as currencies, numbered labels, honorifics, and timestamps. +Simply put, normalization ensures you compare apple to apples when judging transcription output. + +In practice that means a good normalization pipeline reconciles equivalent forms while preserving entity-level meaning such as currencies, numbered labels, honorifics, and timestamps. | Reference | Prediction | Why raw WER is wrong | |-----------|------------|----------------------| @@ -51,9 +53,11 @@ In practice, a good normalization pipeline should reconcile equivalent forms whi | `Meet at Point 14` | `meet at point fourteen` | The normalization should preserve the numbered entity instead of collapsing it into an unrelated form. | | `Mr. Smith joined at 3:00 PM` | `mister smith joined at 3 pm` | Honorific and timestamp formatting differ, but the transcript content is equivalent. | -This is where normalization quality matters. A benchmark-friendly normalizer should make few assumptions by default: it should standardize casing, abbreviations, symbols, and numeric rendering, but avoid discarding cues that may affect downstream evaluation. +A benchmark-friendly normalizer should make few assumptions by default: it should standardize casing, abbreviations, symbols, and numeric rendering, but avoid discarding cues that may affect downstream evaluation. 
+ +And that is one of the main limitations of "Whisper-style normalization" introduced by OpenAI in 2022 and commonly implemented in packages such as [`whisper-normalizer`](https://pypi.org/project/whisper-normalizer/). -That is one of the main limitations of Whisper-style normalization. It is useful for generic lexical comparability, but it applies stronger canonicalization assumptions by default, such as aggressive lowercasing, punctuation stripping, and verbalization. That can be acceptable for broad WER reporting, yet it is less appropriate when entity fidelity matters, for example when distinguishing `Point 14`, preserving explicit currency markers, or keeping structured time expressions stable across systems. +It is useful for generic lexical comparability, but applies stronger canonicalization assumptions by default, such as aggressive lowercasing, punctuation stripping, and verbalization. Gladia's recommended approach is to normalize both the ground-truth transcript and the model output with [`gladia-normalization`](https://github.com/gladiaio/normalization), our open-source normalization library designed for transcript evaluation. @@ -71,10 +75,6 @@ For example, a Gladia-oriented pipeline can reconcile formatting differences suc Open-source transcript normalization library used before WER computation. - - Be careful with generic Whisper-style normalizers such as [`whisper-normalizer`](https://pypi.org/project/whisper-normalizer/): beyond stronger normalization assumptions, its own package description notes known issues with Indic and other low-resource languages, so it may not be an appropriate evaluation baseline for multilingual production traffic. 
- - ```python from normalization import load_pipeline From 8c95e9c25f76cea4f3bfe4bab779711408cd8753 Mon Sep 17 00:00:00 2001 From: Lazare Rossillon Date: Fri, 10 Apr 2026 15:20:22 +0200 Subject: [PATCH 5/5] Refine benchmarking guide wording and evaluation tips --- chapters/pre-recorded-stt/benchmarking.mdx | 134 ++++++++++----------- 1 file changed, 66 insertions(+), 68 deletions(-) diff --git a/chapters/pre-recorded-stt/benchmarking.mdx b/chapters/pre-recorded-stt/benchmarking.mdx index eba43b9..fe42614 100644 --- a/chapters/pre-recorded-stt/benchmarking.mdx +++ b/chapters/pre-recorded-stt/benchmarking.mdx @@ -1,51 +1,74 @@ --- title: Benchmarking -description: A guide on how to benchmark speech-to-text systems with transcript normalization, WER, and representative datasets. +description: A practical guide to benchmarking speech-to-text accuracy — from defining goals to choosing datasets, normalizing transcripts, computing WER, and interpreting results. --- -Benchmarking speech-to-text systems is very easy to get wrong. - +Benchmarking speech-to-text systems is easy to get wrong. Small methodology changes can produce large swings in reported quality, which makes comparisons misleading. ## Benchmarking at a glance - - + + Decide what "good" means for your product before comparing systems. - - - Normalize both references and predictions before computing WER. - - - Measure substitutions, deletions, and insertions on normalized text. - - - Benchmark on audio that matches your real traffic and target users. - - - Look beyond one average score and inspect meaningful slices. - - - -## 0. Define your evaluation goal - -Before comparing providers, define what "good" means for your product. 
- -For example: - -- For a contact center, priority often goes to narrowband and dual-channel telephony robustness, channel-level speaker separation, and high recall on [named entities](/chapters/audio-intelligence/named-entity-recognition), especially [persons](/chapters/audio-intelligence/named-entity-recognition#supported-entities), [organizations](/chapters/audio-intelligence/named-entity-recognition#supported-entities), and [phone numbers](/chapters/audio-intelligence/named-entity-recognition#supported-entities). -- For a meeting assistant, the benchmark usually needs to emphasize [speaker diarization](/chapters/audio-intelligence/speaker-diarization), diarization stability over long recordings, turn attribution accuracy, punctuation and casing quality, and robustness on overlapping multi-speaker conversations. -- For a media workflow, the focus is often [subtitle](/chapters/audio-intelligence/subtitles) segmentation quality, subtitle timestamp precision, proper-noun fidelity, and acoustic robustness on noisy real-world recordings, so captions stay correctly aligned on screen. + + + Benchmark on audio that matches your real traffic and target users. + + + Normalize both references and hypothesis outputs before computing WER. + + + Measure substitutions, deletions, and insertions on normalized text. + + + Look beyond one average score and inspect meaningful slices. + + + +## 1. Define your evaluation goal + +Before comparing providers and models, the first step is to define which aspects of performance matter most for your use case. + +Below are **examples of performance aspects that would be more weighted for domain applications of speech to text**: + +- Accuracy on noisy backgrounds: for contact centers, telephony, and field recordings. +- Speaker diarization quality: for meeting assistants and multi-speaker calls. +- Named entity accuracy: for workflows that extract people, organizations, phone numbers, or addresses. 
+- Domain-specific vocabulary handling: for medical, legal, or financial transcription.
+- Timestamp accuracy: for media workflows that need readable, well-timed captions.
+- Filler-word handling: for agentic workflows.
+
+Those choices shape every downstream decision: which dataset to use, which normalization rules to apply, and which metrics to report.
+
 If your benchmark does not reflect your real traffic, the result will not tell you much about production performance.
 
-## 1. Normalize transcripts before computing WER
+## 2. Choose the right dataset
+
+The right dataset depends on the use case and traffic shape you want to measure.
+You of course wouldn't want to benchmark call-center audio with clean podcast recordings.
+
+So pick audio that matches your real traffic along these axes:
+
+- Language: target language(s), accents, code-switching frequency.
+- Audio quality: noisy field recordings, telephony, studio, or browser microphone.
+- Topics and domain: medical, financial, operational, legal, etc.
+- Typical words that matter: numbers, proper nouns, acronyms, domain-specific terms.
+- Interaction pattern: single-speaker dictation, dialogue, multi-speaker meetings, or long-form recordings.
+
+Use transcripts that are strong enough to serve as ground truth, and prefer a mix of:
+ - public datasets (for comparability and immediate availability)
+ - private in-domain datasets, when available, to ensure no data is "spoiled" by some speech-to-text providers training their models on the very datasets you're benchmarking.
+
+
+  Your favorite LLM with internet access is very effective at finding public datasets that match your use case.
+
-Normalization defines which surface-form differences should be ignored before scoring. The goal is to remove formatting variance without erasing information that still matters for the benchmark.
 
-Simply put, normalization ensures you compare apple to apples when judging transcription output.
+
+## 3. 
Normalize transcripts before computing WER -In practice that means a good normalization pipeline reconciles equivalent forms while preserving entity-level meaning such as currencies, numbered labels, honorifics, and timestamps. +Normalization removes surface-form differences (casing, abbreviations, numeric rendering) so you compare apples to apples when judging transcription output. | Reference | Prediction | Why raw WER is wrong | |-----------|------------|----------------------| @@ -53,15 +76,9 @@ In practice that means a good normalization pipeline reconciles equivalent forms | `Meet at Point 14` | `meet at point fourteen` | The normalization should preserve the numbered entity instead of collapsing it into an unrelated form. | | `Mr. Smith joined at 3:00 PM` | `mister smith joined at 3 pm` | Honorific and timestamp formatting differ, but the transcript content is equivalent. | -A benchmark-friendly normalizer should make few assumptions by default: it should standardize casing, abbreviations, symbols, and numeric rendering, but avoid discarding cues that may affect downstream evaluation. - -And that is one of the main limitations of "Whisper-style normalization" introduced by OpenAI in 2022 and commonly implemented in packages such as [`whisper-normalizer`](https://pypi.org/project/whisper-normalizer/). - -It is useful for generic lexical comparability, but applies stronger canonicalization assumptions by default, such as aggressive lowercasing, punctuation stripping, and verbalization. - -Gladia's recommended approach is to normalize both the ground-truth transcript and the model output with [`gladia-normalization`](https://github.com/gladiaio/normalization), our open-source normalization library designed for transcript evaluation. +One common limitation is "Whisper-style normalization" (OpenAI, 2022): implemented in packages like [`whisper-normalizer`](https://pypi.org/project/whisper-normalizer/). 
It does not affect numbers, and applies aggressive lowercasing and punctuation stripping. -For example, a Gladia-oriented pipeline can reconcile formatting differences such as: +Gladia's recommended approach is [`gladia-normalization`](https://github.com/gladiaio/normalization), our open-source library designed for transcript evaluation: - `It's $50` -> `it is 50 dollars` - `Meet at Point 14` -> `meet at point 14` @@ -88,10 +105,10 @@ normalized_prediction = pipeline.normalize(prediction) ``` - Always apply the same normalization pipeline to every system you compare. Changing the normalization rules between vendors invalidates the benchmark. + Always apply the same normalization pipeline to **both** the reference transcript **and** every hypothesis output you compare. Changing the normalization rules between vendors — or forgetting to normalize one side — invalidates the benchmark. -## 2. Compute WER correctly +## 4. Compute WER correctly Word Error Rate measures the edit distance between a reference transcript and a predicted transcript at the word level. @@ -120,32 +137,11 @@ Lower is better. In practice: Do not compute WER on raw transcripts if providers format numbers, punctuation, abbreviations, or casing differently. That mostly measures formatting conventions, not recognition quality. -## 3. Choose a representative dataset - -Start from your [evaluation goal](#1-define-your-evaluation-goal): the right dataset depends on the use case and traffic shape you want to measure. - -A representative benchmark dataset is not just a random audio collection. It should be curated to reflect production traffic, annotated with reliable ground truth, and structured well enough to support consistent quality and latency measurement. - -Selection criteria: - -- Match the channel conditions you expect in production: telephony, browser microphone, meeting capture, studio audio, or field recording. 
-- Match the linguistic distribution: language mix, accents, code-switching frequency, and domain terminology. -- Match the acoustic conditions: compression artifacts, background noise, reverberation, cross-talk, and overlapping speech. -- Match the interaction pattern: single-speaker dictation, dialogue, multi-speaker meetings, or long-form continuous recordings. -- Match the output requirements: entity fidelity, diarization, subtitle timing, readability, or downstream extraction accuracy. -- Use transcripts that are strong enough to serve as ground truth, with timing annotations when latency is part of the benchmark. -- Keep the sample size large and diverse enough to keep rankings stable across domains, speakers, and recording conditions. -- Prefer a mix of public datasets for external comparability and private in-domain datasets for business relevance. - -Typical failure cases: - -- Benchmarking call-center audio with clean podcast recordings overestimates real-world performance. -- Benchmarking English-only speech does not capture code-switching traffic. -- Benchmarking short clips can hide failures that appear on long recordings with multiple speakers. - -For a broader methodology view, see [this benchmark guide](/chapters/pre-recorded-stt/benchmarking), especially the evaluation-goal section above when mapping use cases to dataset types. + + Inspect your reference transcripts carefully before computing WER. If a reference contains text that is not actually present in the audio, for example an intro such as "this audio is a recording of...", it can make WER look much worse across all providers. + -## 4. Interpret results carefully +## 5. Interpret results carefully Do not stop at a single WER number. 
Review: @@ -166,3 +162,5 @@ If two systems are close, inspect actual transcript examples before drawing stro - Treating punctuation and formatting differences as recognition errors - Drawing conclusions from too few samples - Reporting one average score without any slice analysis +- Not inspecting the reference transcript: if it contains text not present in the audio, for example an intro like "this audio is a recording of...", it will inflate WER across all providers +- Not experimenting with provider configurations: for example, using Gladia's [custom vocabulary](/chapters/audio-intelligence/custom-vocabulary) to improve proper noun accuracy, then comparing against the ground truth
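As a companion to the guide above, the WER formula from the "Compute WER correctly" section can be sketched as a standard word-level edit distance. This is a minimal reference implementation for illustration, not part of Gladia's tooling:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N over whitespace-tokenized words.

    Assumes the reference is non-empty and that both strings were already
    run through the same normalization pipeline.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum number of edits to turn ref[:i] into hyp[:j].
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i  # all deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j  # all insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            substitution = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            deletion = d[i - 1][j] + 1
            insertion = d[i][j - 1] + 1
            d[i][j] = min(substitution, deletion, insertion)
    return d[len(ref)][len(hyp)] / len(ref)
```

On normalized text, `wer("meet at point 14", "meet at point fourteen")` yields `0.25`: one substitution over four reference words.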