diff --git a/chapters/introduction/index.mdx b/chapters/introduction/index.mdx index e73570f..afd48fc 100644 --- a/chapters/introduction/index.mdx +++ b/chapters/introduction/index.mdx @@ -10,11 +10,13 @@ import IntegrationsCards from '/snippets/integrations-cards.mdx'; className="block dark:hidden rounded-lg" noZoom src="/assets/dark-banner.png" + alt="Gladia documentation hero banner in the light theme." /> Gladia documentation hero banner in the dark theme. @@ -61,4 +63,4 @@ in both Real-time and asynchronous ways, with audio intelligence tools to extrac # Our integration partners ------ \ No newline at end of file +----- diff --git a/chapters/pre-recorded-stt/benchmarking.mdx b/chapters/pre-recorded-stt/benchmarking.mdx new file mode 100644 index 0000000..fe42614 --- /dev/null +++ b/chapters/pre-recorded-stt/benchmarking.mdx @@ -0,0 +1,166 @@ +--- +title: Benchmarking +description: A practical guide to benchmarking speech-to-text accuracy — from defining goals to choosing datasets, normalizing transcripts, computing WER, and interpreting results. +--- + +Benchmarking speech-to-text systems is easy to get wrong. +Small methodology changes can produce large swings in reported quality, which makes comparisons misleading. + +## Benchmarking at a glance + + + + Decide what "good" means for your product before comparing systems. + + + Benchmark on audio that matches your real traffic and target users. + + + Normalize both references and hypothesis outputs before computing WER. + + + Measure substitutions, deletions, and insertions on normalized text. + + + Look beyond one average score and inspect meaningful slices. + + + +## 1. Define your evaluation goal + +Before comparing providers and models, the first step is to define which aspects of performance matter most for your use case. 
+
+Below are **examples of performance aspects that carry more weight in specific domain applications of speech-to-text**:
+
+- Accuracy on noisy backgrounds: for contact centers, telephony, and field recordings.
+- Speaker diarization quality: for meeting assistants and multi-speaker calls.
+- Named entity accuracy: for workflows that extract people, organizations, phone numbers, or addresses.
+- Domain-specific vocabulary handling: for medical, legal, or financial transcription.
+- Timestamp accuracy: for media workflows that need readable, well-timed captions.
+- Filler-word handling: for agentic workflows.
+
+Those choices shape every downstream decision: which dataset to use, which normalization rules to apply, and which metrics to report.
+
+
+If your benchmark does not reflect your real traffic, the result will not tell you much about production performance.
+
+## 2. Choose the right dataset
+
+The right dataset depends on the use case and traffic shape you want to measure.
+You wouldn't want to benchmark a call-center product with clean podcast recordings, for example.
+
+So pick audio that matches your real traffic along these axes:
+
+- Language: target language(s), accents, code-switching frequency.
+- Audio quality: noisy field recordings, telephony, studio, or browser microphone.
+- Topics and domain: medical, financial, operational, legal, etc.
+- Typical words that matter: numbers, proper nouns, acronyms, domain-specific terms.
+- Interaction pattern: single-speaker dictation, dialogue, multi-speaker meetings, or long-form recordings.
+
+Use transcripts that are strong enough to serve as ground truth, and prefer a mix of:
+ - public datasets (for comparability and immediate availability)
+ - private in-domain datasets, when available, to ensure no data is "spoiled" by speech-to-text providers training their models on the very datasets you're benchmarking.
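One lightweight way to keep these axes explicit during evaluation is a tagged benchmark manifest. A minimal sketch in Python — all file paths and tag values below are hypothetical, not part of any Gladia API:

```python
from dataclasses import dataclass

@dataclass
class BenchmarkSample:
    """One audio file in the benchmark, tagged along the axes above."""
    audio_path: str       # path to the audio file
    reference_path: str   # path to the ground-truth transcript
    language: str         # e.g. "en", "fr"
    audio_condition: str  # e.g. "telephony", "studio", "field"
    domain: str           # e.g. "medical", "finance", "general"
    interaction: str      # e.g. "dictation", "dialogue", "meeting"

# A small manifest mixing public and private in-domain data (hypothetical files).
manifest = [
    BenchmarkSample("audio/call_001.wav", "refs/call_001.txt",
                    "en", "telephony", "finance", "dialogue"),
    BenchmarkSample("audio/visit_017.wav", "refs/visit_017.txt",
                    "en", "field", "medical", "dictation"),
]

# The tags make it easy to slice results by condition during interpretation.
telephony = [s for s in manifest if s.audio_condition == "telephony"]
```

Keeping these tags alongside each file is what later lets you break results down by language, domain, or audio condition instead of reporting a single average.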
+
+
+ Your favorite LLM with internet access can be very effective at finding public datasets that match your use case.
+
+
+## 3. Normalize transcripts before computing WER
+
+Normalization removes surface-form differences (casing, abbreviations, numeric rendering) so you compare apples to apples when judging transcription output.
+
+| Reference | Prediction | Why raw WER is wrong |
+|-----------|------------|----------------------|
+| `It's $50` | `it is fifty dollars` | Contraction and currency formatting differ, but the semantic content is the same. |
+| `Meet at Point 14` | `meet at point fourteen` | The normalization should preserve the numbered entity instead of collapsing it into an unrelated form. |
+| `Mr. Smith joined at 3:00 PM` | `mister smith joined at 3 pm` | Honorific and timestamp formatting differ, but the transcript content is equivalent. |
+
+A common but limited choice is "Whisper-style normalization" (OpenAI, 2022), implemented in packages like [`whisper-normalizer`](https://pypi.org/project/whisper-normalizer/): it does not affect numbers, and it applies aggressive lowercasing and punctuation stripping.
+
+Gladia's recommended approach is [`gladia-normalization`](https://github.com/gladiaio/normalization), our open-source library designed for transcript evaluation:
+
+- `It's $50` -> `it is 50 dollars`
+- `Meet at Point 14` -> `meet at point 14`
+- `Mr. Smith joined at 3:00 PM` -> `mister smith joined at 3 pm`
+
+
+ Open-source transcript normalization library used before WER computation.
+
+
+```python
+from normalization import load_pipeline
+
+pipeline = load_pipeline("gladia-3", language="en")
+
+reference = "Meet at Point 14. It's $50 at 3:00 PM."
+prediction = "meet at point fourteen it is fifty dollars at 3 pm" + +normalized_reference = pipeline.normalize(reference) +normalized_prediction = pipeline.normalize(prediction) +``` + + + Always apply the same normalization pipeline to **both** the reference transcript **and** every hypothesis output you compare. Changing the normalization rules between vendors — or forgetting to normalize one side — invalidates the benchmark. + + +## 4. Compute WER correctly + +Word Error Rate measures the edit distance between a reference transcript and a predicted transcript at the word level. + +The standard formula is: + +```text +WER = (S + D + I) / N +``` + +Where: + +- `S` = substitutions +- `D` = deletions +- `I` = insertions +- `N` = number of words in the reference transcript + +Lower is better. In practice: + +1. Prepare a reference transcript for each audio sample. +2. Run each provider on the exact same audio. +3. Normalize both the reference and each prediction with the same pipeline. +4. Compute WER on the normalized outputs. +5. Aggregate results across the full dataset. + + + Do not compute WER on raw transcripts if providers format numbers, punctuation, abbreviations, or casing differently. That mostly measures formatting conventions, not recognition quality. + + + + Inspect your reference transcripts carefully before computing WER. If a reference contains text that is not actually present in the audio, for example an intro such as "this audio is a recording of...", it can make WER look much worse across all providers. + + +## 5. Interpret results carefully + +Do not stop at a single WER number. Review: + +- overall average WER +- median WER and spread across files +- breakdowns by language, domain, or audio condition +- failure modes on proper nouns, acronyms, and numbers +- whether differences are consistent or concentrated in a few hard samples + +Two systems can post similar average WER while failing on different error classes. 
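The formula and the per-file aggregation above can be sketched with a small pure-Python WER helper — the file names and transcripts below are illustrative, and normalization is assumed to have already been applied to both sides:

```python
from statistics import mean, median

def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words.
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + sub)  # substitution / match
    return dp[len(ref)][len(hyp)] / len(ref)

# Per-file scores on already-normalized text (values are illustrative).
scores = {
    "call_001.wav": wer("meet at point 14", "meet at point 14"),
    "call_002.wav": wer("it is 50 dollars", "it is 15 dollars"),
    "call_003.wav": wer("mister smith joined at 3 pm",
                        "mister smith joined 3 pm"),
}

print(f"average WER: {mean(scores.values()):.3f}")
print(f"median WER:  {median(scores.values()):.3f}")
```

Reporting both the mean and the median (and ideally the spread) makes it visible when a few hard files dominate the average.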
Separate statistically meaningful gaps from noise introduced by dataset composition or normalization choices. + +If two systems are close, inspect actual transcript examples before drawing strong conclusions. + +## Common pitfalls + +- Comparing providers on different datasets +- Using low-quality or inconsistent ground truth +- Treating punctuation and formatting differences as recognition errors +- Drawing conclusions from too few samples +- Reporting one average score without any slice analysis +- Not inspecting the reference transcript: if it contains text not present in the audio, for example an intro like "this audio is a recording of...", it will inflate WER across all providers +- Not experimenting with provider configurations: for example, using Gladia's [custom vocabulary](/chapters/audio-intelligence/custom-vocabulary) to improve proper noun accuracy, then comparing against the ground truth diff --git a/docs.json b/docs.json index ca37881..84e2c39 100644 --- a/docs.json +++ b/docs.json @@ -44,6 +44,7 @@ "chapters/pre-recorded-stt/features/sentences" ] }, + "chapters/pre-recorded-stt/benchmarking", { "group": "Live Transcription", "expanded": false, diff --git a/snippets/get-transcription-result.mdx b/snippets/get-transcription-result.mdx index 5cfc609..53f360f 100644 --- a/snippets/get-transcription-result.mdx +++ b/snippets/get-transcription-result.mdx @@ -12,7 +12,7 @@ You can get your transcription results in **3 different ways**: You can configure webhooks at https://app.gladia.io/webhooks to be notified when your transcriptions are done. - + Gladia dashboard webhook settings page for configuring transcription notifications. Once a transcription is done, a `POST` request will be made to the endpoint you configured. The request body is a JSON object containing the transcription `id` that you can use to retrieve your result with [our API](/api-reference/v2/pre-recorded/get). 
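A minimal sketch of such a webhook receiver in Python — the payload shape (a JSON object carrying the transcription `id`) comes from the description above, while the port and handler structure are illustrative assumptions:

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def extract_transcription_id(body: bytes) -> str:
    """Pull the transcription `id` out of the webhook's JSON payload."""
    payload = json.loads(body)
    return payload["id"]

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Gladia POSTs a JSON body to the configured endpoint when done.
        length = int(self.headers.get("Content-Length", 0))
        transcription_id = extract_transcription_id(self.rfile.read(length))
        # Use the id with the pre-recorded GET endpoint to fetch the result.
        print(f"transcription ready: {transcription_id}")
        self.send_response(200)
        self.end_headers()

# To run locally (blocking):
# HTTPServer(("0.0.0.0", 8080), WebhookHandler).serve_forever()
```
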
For the full body definition, check [our API definition](/api-reference/v2/pre-recorded/webhook/success). diff --git a/snippets/getting-started-playground.mdx b/snippets/getting-started-playground.mdx index 6483a30..01263e9 100644 --- a/snippets/getting-started-playground.mdx +++ b/snippets/getting-started-playground.mdx @@ -8,7 +8,7 @@ audio transcription. Choose your audio source (stream from your microphone, or upload a local file) - + Gladia playground step showing audio source selection options. Then proceed to the next step. @@ -25,7 +25,7 @@ audio transcription. - + Gladia playground feature selection screen with transcription options enabled. @@ -35,7 +35,7 @@ audio transcription. Text in italic in the transcription represents [partial transcripts](/chapters/live-stt/features#partial-transcripts). - + Gladia playground live transcription screen after starting capture. @@ -44,7 +44,7 @@ audio transcription. the result in JSON format (the one you'd get with an API call). - + Gladia playground transcription results view with formatted transcript and JSON output. diff --git a/snippets/setup-account.mdx b/snippets/setup-account.mdx index 438842d..62055fc 100644 --- a/snippets/setup-account.mdx +++ b/snippets/setup-account.mdx @@ -10,9 +10,9 @@ Now that you've signed up, log in to app.gladia.io and go to the [API keys section]( created a default key for you. You can use this one or create your own. - + Gladia dashboard API keys page showing the location of the default key. Gladia offers 10 hours of free audio transcription per month if you want to test the service! -With your API key, you're now ready to use Gladia APIs. \ No newline at end of file +With your API key, you're now ready to use Gladia APIs. 
diff --git a/style.css b/style.css index f3d4c05..90c4cdb 100644 --- a/style.css +++ b/style.css @@ -28,4 +28,18 @@ .prose pre { max-width: 100%; overflow: auto; -} \ No newline at end of file +} + +.benchmark-status-pill { + display: inline-flex; + align-items: center; + padding: 0.2rem 0.5rem; + border-radius: 999px; + border: 1px solid rgba(46, 52, 62, 0.18); + background: rgba(46, 52, 62, 0.08); + color: #2e343e; + font-size: 0.85rem; + font-weight: 600; + line-height: 1.2; + white-space: nowrap; +}
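The new `.benchmark-status-pill` class can be applied to any inline element; a hypothetical usage (the label text is illustrative):

```html
<span class="benchmark-status-pill">Benchmark: in progress</span>
```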