Add Dutch language normalization #19
Conversation
📝 Walkthrough
Adds Dutch language support: new Dutch normalization modules (operators, number normalizer, word and sentence replacements) plus unit and e2e tests.
Changes
Sequence Diagram
sequenceDiagram
participant Client as Client Code
participant Operators as DutchOperators
participant Normalizer as DutchNumberNormalizer
participant Regex as Regex Engine
Client->>Operators: expand_written_numbers(text)
Operators->>Normalizer: __call__(text)
Normalizer->>Normalizer: _normalize_currency_symbols / _normalize_mixed_numbers
Normalizer->>Normalizer: alpha2digit(text, "nl")
Normalizer->>Regex: _apply_currency_plural_fixes(result)
Regex-->>Normalizer: fixed_text
Normalizer-->>Operators: normalized_text
Operators-->>Client: result
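The flow above can be sketched as a minimal runnable skeleton. Everything here is an illustrative stand-in for the real classes: the method bodies, the regexes, and the one-word `alpha2digit` substitute are assumptions, not the PR's actual implementation.

```python
import re


class DutchNumberNormalizer:
    """Minimal sketch of the pipeline in the diagram above."""

    def __call__(self, text: str) -> str:
        text = self._normalize_currency_symbols(text)  # "€10" -> "10 euros"
        text = self._normalize_mixed_numbers(text)
        text = self._alpha2digit(text)                 # spelled-out -> digits
        return self._apply_currency_plural_fixes(text)

    def _normalize_currency_symbols(self, text: str) -> str:
        # Dutch word order: amount first, then the currency word.
        return re.sub(r"€\s*(\d+(?:,\d+)?)", r"\1 euros", text)

    def _normalize_mixed_numbers(self, text: str) -> str:
        return text  # placeholder for the mixed-number pass

    def _alpha2digit(self, text: str) -> str:
        # Toy stand-in for text2num's alpha2digit(text, "nl").
        return re.sub(r"\btien\b", "10", text)

    def _apply_currency_plural_fixes(self, text: str) -> str:
        return re.sub(r"\beuro\b", "euros", text)


class DutchOperators:
    def __init__(self) -> None:
        self._normalizer = DutchNumberNormalizer()

    def expand_written_numbers(self, text: str) -> str:
        return self._normalizer(text)


ops = DutchOperators()
print(ops.expand_written_numbers("€10"))        # 10 euros
print(ops.expand_written_numbers("tien euro"))  # 10 euros
```

Both entry paths in the diagram (symbolic `€10` and spoken `tien euro`) converge on the same canonical `10 euros` form.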
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🧹 Nitpick comments (7)
tests/unit/languages/dutch_operators_test.py (2)
45-51: Consolidate repeated `get_word_replacements()` calls. Each assertion calls `operators.get_word_replacements()` again. The function is cheap, but it's cleaner and faster to bind once:
Proposed change

```diff
 def test_word_replacements(operators):
-    assert operators.get_word_replacements()["ge"] == "je"
-    assert operators.get_word_replacements()["da"] == "dat"
-    assert operators.get_word_replacements()["u"] == "je"
-    assert operators.get_word_replacements()["uw"] == "je"
-    assert operators.get_word_replacements()["okee"] == "oke"
-    assert operators.get_word_replacements()["euro"] == "euros"
+    replacements = operators.get_word_replacements()
+    assert replacements["ge"] == "je"
+    assert replacements["da"] == "dat"
+    assert replacements["u"] == "je"
+    assert replacements["uw"] == "je"
+    assert replacements["okee"] == "oke"
+    assert replacements["euro"] == "euros"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/dutch_operators_test.py` around lines 45 - 51, The test repeatedly calls operators.get_word_replacements(); change test_word_replacements to call operators.get_word_replacements() once, assign the result to a local variable (e.g., replacements) and use replacements["ge"], replacements["da"], replacements["u"], replacements["uw"], replacements["okee"], replacements["euro"] in the assertions to avoid repeated calls and improve clarity.
54-62: Consider adding a negative/edge test for the compound-number Dutch order. The existing tests cover single-word amounts (`tien euro`) and numeric+symbol forms (`€10`, `$3.50`). Given the Dutch ones + "en" + tens composition is a non-trivial branch in `DutchNumberNormalizer.process_words` (the `pending_ones` state), please add at least one assertion exercising it end-to-end, e.g. `"vijfentwintig euro"` or `"een en twintig euro"` → `"25 euros"`. Without it, regressions in the compound-order logic won't be caught by this file.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/dutch_operators_test.py` around lines 54 - 62, Add a negative/edge unit test exercising the Dutch compound-number branch by asserting that operators.expand_written_numbers correctly normalizes compound words like "vijfentwintig euro" (or the spaced variant "een en twintig euro") to "25 euros"; this targets the DutchNumberNormalizer.process_words logic and its pending_ones handling, so add an assertion in tests/unit/languages/dutch_operators_test.py alongside test_expand_written_numbers_euro_after_amount_dutch_order that calls operators.expand_written_numbers("vijfentwintig euro") and expects "25 euros" (and optionally another for "een en twintig euro").
normalization/languages/dutch/operators.py (1)
108-111: Move the `DUTCH_REPLACEMENTS` import to module top-level. There's no circular-import reason to defer this — `replacements.py` imports nothing from this module. Other language implementations import the replacements dict at module top. A local import here adds per-call overhead for no gain.
Proposed change

```diff
 from normalization.languages.dutch.number_normalizer import DutchNumberNormalizer
+from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
 from normalization.languages.dutch.sentence_replacements import (
     DUTCH_SENTENCE_REPLACEMENTS,
 )
@@
 def get_word_replacements(self) -> dict[str, str]:
-    from normalization.languages.dutch.replacements import DUTCH_REPLACEMENTS
-
     return DUTCH_REPLACEMENTS
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/operators.py` around lines 108 - 111, The get_word_replacements method currently performs a local import of DUTCH_REPLACEMENTS which adds per-call overhead; move the import of DUTCH_REPLACEMENTS from normalization.languages.dutch.replacements to the top of this module and update get_word_replacements (the function name) to simply return the top-level DUTCH_REPLACEMENTS constant so other language modules follow the same pattern and avoid repeated local imports.
normalization/languages/dutch/replacements.py (1)
26-27: `euro → euros` canonical form is non-standard Dutch. The Dutch plural of `euro` is `euro` (official) or colloquially `euro's`; `euros` is the English form. Since the pipeline also maps the currency symbol `€` to `"euros"` in `DUTCH_CONFIG.currency_symbol_to_word`, this is internally consistent as a canonical evaluation token, but it's worth a one-line comment here documenting that `euros` is an internal canonical form (not a Dutch word) so future contributors don't "fix" it.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/replacements.py` around lines 26 - 27, The mapping "euro": "euros" uses a non-standard Dutch canonical token; update replacements.py to add a one-line comment next to the "euro": "euros" entry clarifying that "euros" is an internal canonical form (not the Dutch plural) and noting that DUTCH_CONFIG.currency_symbol_to_word also maps '€' to "euros" for consistency, so future contributors don't "fix" it to a real Dutch word.
normalization/languages/dutch/number_normalizer.py (3)
5-13: Move module docstring to the top of the file. The triple-quoted string at lines 5–13 appears after the imports, so Python does not treat it as the module docstring (`__doc__` will be `None`). Place it above the `import` statements for it to be picked up by tooling (`help()`, IDE hover, Sphinx).
Proposed change

```diff
-import re
-from fractions import Fraction
-from typing import Iterator, Match
-
-"""
-Dutch number normalizer: spelled-out numbers to digits.
-
-- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
-- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
-- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
-- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
-- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
-"""
+"""
+Dutch number normalizer: spelled-out numbers to digits.
+
+- Dutch compound order: ones + "en" + tens (e.g. "een en twintig" -> 21).
+- Vocabulary: nul, een, twee, ..., tien, elf, twaalf, ..., twintig, dertig, ...
+- Multipliers: honderd, duizend, miljoen, miljard, biljoen.
+- Handles currency (euro, dollar, pond, cent), percent (procent), and decimal (komma).
+- Currency output follows Dutch word order: amount then unit (e.g. €10 and "tien euro" -> "10 euros").
+"""
+
+import re
+from fractions import Fraction
+from typing import Iterator, Match
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 5 - 13, The module docstring currently appears after the import statements so __doc__ is None; move the triple-quoted Dutch number normalizer string to the very top of the file before any import lines so it becomes the module docstring available to help()/IDE/Sphinx, ensuring you preserve the exact text and spacing and leave the imports unchanged below it.
210-212: `has_prefix` check can misfire on 1-char numeric tokens like `"0"`.
`self.prefixes` contains the string `"0"` (from `"nul": "0"` in `preceding_prefixers`), so a token of literal `"0"` gives `current[0] in self.prefixes` → `has_prefix = True`, and `current_without_prefix = ""` fails the regex. Control falls through to the `current_lower not in self.words` branch (since `"0"` isn't a Dutch word) and the token is yielded via `output(current)` — so functionally it survives, but the logic is accidentally load-bearing.
Two small cleanups worth considering:
- Exclude numeric digits from `self.prefixes`, or
- Short-circuit `has_prefix` to only trigger on non-digit first characters:

```diff
-    has_prefix = current[0] in self.prefixes
+    has_prefix = not current[0].isdigit() and current[0] in self.prefixes
```

No user-visible bug today as far as I can tell; flagging for defensive clarity.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 210 - 212, The has_prefix check can mis-detect single-digit tokens (e.g., "0") because self.prefixes may contain digit strings; change the predicate so it only treats a leading prefix when the first character is not a digit (e.g., set has_prefix = (not current[0].isdigit()) and current[0] in self.prefixes) or alternatively remove digit entries from self.prefixes; update the logic around has_prefix/current_without_prefix and the regex match in the function that processes tokens (refer to variables has_prefix, current, current_without_prefix, and self.prefixes) so single-digit numeric tokens are not stripped into empty strings before the numeric regex check.
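The second cleanup above can be shown in isolation. The `prefixes` set here is a hypothetical stand-in for `self.prefixes`, not the real normalizer state:

```python
# Sketch of the suggested guard: a literal digit token such as "0" should not
# be treated as carrying a spoken prefix, even if a digit string leaked into
# the prefix set (e.g. via "nul" -> "0" in preceding_prefixers).
prefixes = {"0", "n"}  # hypothetical first-character prefix set


def has_prefix(current: str) -> bool:
    # Only non-digit first characters can be a spoken prefix.
    return not current[0].isdigit() and current[0] in prefixes


print(has_prefix("0"))         # False: digit tokens pass through untouched
print(has_prefix("negentig"))  # True: "n" is in the (hypothetical) prefix set
```

With the guard, `"0"` is never stripped down to an empty `current_without_prefix` before the numeric regex check.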
79-85: `tens_plural` keys like `twintigen` are not standard Dutch plurals. Building plural variants as `"twintig" + "en"` → `"twintigen"`, `"dertig" + "en"` → `"dertigen"`, etc. is not a real Dutch form (Dutch uses `tientallen`, `in de twintig`, or bare `twintig` for "twenties"). The same construction is used for `multipliers_plural`, but there it happens to be right: `honderden`, `duizenden`, `miljoenen`, `miljarden`, `biljoenen` are actually correct and common — those are fine.
For `tens_plural` specifically, the generated keys are unlikely to appear in real transcripts but will still match if an ASR produces them, potentially yielding surprising output. If these were copy-pasted from the English/French normalizer template, consider dropping `tens_plural` entirely for Dutch, or verify at least one of these forms against corpus data before shipping.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 79 - 85, The tens_plural dict is generating nonstandard Dutch keys (e.g., "twintigen") and should be removed to avoid false matches: delete the creation of tens_plural and change tens_suffixed to not include it (keep only tens_ordinal and any genuinely valid suffixed forms), leaving multipliers_plural as-is; update any references that expect tens_plural to instead rely on tens_ordinal or validated corpus-backed forms so tens_suffixed = {**self.tens_ordinal} (or merge only validated entries) and adjust downstream logic that used tens_plural accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/dutch/operators.py`:
- Around line 16-24: The current _RE_CLITIC_S unconditionally matches any
leading non-word then _APOST s and will incorrectly convert genitive/adverbial
constructions like "'s werelds" to "is werelds"; update the replacement regex
logic so we only convert reduced copula instances: replace _RE_CLITIC_S with a
pattern that requires a copula-like preceding token (e.g. use a positive
lookbehind such as (?<=\b(?:dat|hoe|wat|er|hier|daar)\s){_APOST}s\b) or
alternatively add exclusions by extending _TEMPORAL_S_AFTER to include known
genitive words like werelds|lands if you prefer the whitelist approach; adjust
the code that compiles _RE_CLITIC_S to use the chosen narrower pattern so only
true "'s" copula cases are replaced.
In `@normalization/languages/dutch/replacements.py`:
- Around line 21-24: The unconditional mappings "u": "je" and "uw": "je" in
normalization/languages/dutch/replacements.py aggressively conflate formal and
informal pronouns and should be either documented or made opt-in; update the
module/class docstring (e.g. DutchOperators) to clearly state these rules are
tuned for Flemish informal customer-service transcripts and their impact on WER,
or refactor the mappings into a guarded/parameterized option (e.g. a boolean
flag passed to the normalization routine that enables the "u"/"uw" → "je"
replacement) so callers can opt in when appropriate; ensure the change
references the exact mapping keys "u" and "uw" and the normalization entry point
(the Dutch normalization function or DutchOperators) so callers know how to
enable/disable it.
---
Nitpick comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 5-13: The module docstring currently appears after the import
statements so __doc__ is None; move the triple-quoted Dutch number normalizer
string to the very top of the file before any import lines so it becomes the
module docstring available to help()/IDE/Sphinx, ensuring you preserve the exact
text and spacing and leave the imports unchanged below it.
- Around line 210-212: The has_prefix check can mis-detect single-digit tokens
(e.g., "0") because self.prefixes may contain digit strings; change the
predicate so it only treats a leading prefix when the first character is not a
digit (e.g., set has_prefix = (not current[0].isdigit()) and current[0] in
self.prefixes) or alternatively remove digit entries from self.prefixes; update
the logic around has_prefix/current_without_prefix and the regex match in the
function that processes tokens (refer to variables has_prefix, current,
current_without_prefix, and self.prefixes) so single-digit numeric tokens are
not stripped into empty strings before the numeric regex check.
- Around line 79-85: The tens_plural dict is generating nonstandard Dutch keys
(e.g., "twintigen") and should be removed to avoid false matches: delete the
creation of tens_plural and change tens_suffixed to not include it (keep only
tens_ordinal and any genuinely valid suffixed forms), leaving multipliers_plural
as-is; update any references that expect tens_plural to instead rely on
tens_ordinal or validated corpus-backed forms so tens_suffixed =
{**self.tens_ordinal} (or merge only validated entries) and adjust downstream
logic that used tens_plural accordingly.
In `@normalization/languages/dutch/operators.py`:
- Around line 108-111: The get_word_replacements method currently performs a
local import of DUTCH_REPLACEMENTS which adds per-call overhead; move the import
of DUTCH_REPLACEMENTS from normalization.languages.dutch.replacements to the top
of this module and update get_word_replacements (the function name) to simply
return the top-level DUTCH_REPLACEMENTS constant so other language modules
follow the same pattern and avoid repeated local imports.
In `@normalization/languages/dutch/replacements.py`:
- Around line 26-27: The mapping "euro": "euros" uses a non-standard Dutch
canonical token; update replacements.py to add a one-line comment next to the
"euro": "euros" entry clarifying that "euros" is an internal canonical form (not
the Dutch plural) and noting that DUTCH_CONFIG.currency_symbol_to_word also maps
'€' to "euros" for consistency, so future contributors don't "fix" it to a real
Dutch word.
In `@tests/unit/languages/dutch_operators_test.py`:
- Around line 45-51: The test repeatedly calls
operators.get_word_replacements(); change test_word_replacements to call
operators.get_word_replacements() once, assign the result to a local variable
(e.g., replacements) and use replacements["ge"], replacements["da"],
replacements["u"], replacements["uw"], replacements["okee"],
replacements["euro"] in the assertions to avoid repeated calls and improve
clarity.
- Around line 54-62: Add a negative/edge unit test exercising the Dutch
compound-number branch by asserting that operators.expand_written_numbers
correctly normalizes compound words like "vijfentwintig euro" (or the spaced
variant "een en twintig euro") to "25 euros"; this targets the
DutchNumberNormalizer.process_words logic and its pending_ones handling, so add
an assertion in tests/unit/languages/dutch_operators_test.py alongside
test_expand_written_numbers_euro_after_amount_dutch_order that calls
operators.expand_written_numbers("vijfentwintig euro") and expects "25 euros"
(and optionally another for "een en twintig euro").
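The ones + "en" + tens order that the suggested test would exercise can be sketched as follows. The vocabulary is a small illustrative excerpt, and `split_compound`/`compound_to_int` are hypothetical helpers, not the actual `process_words` implementation:

```python
ONES = {"een": 1, "twee": 2, "drie": 3, "vier": 4, "vijf": 5}
TENS = {"twintig": 20, "dertig": 30, "veertig": 40, "vijftig": 50}


def split_compound(word: str) -> list[str]:
    # Naive split of fused compounds: "vijfentwintig" -> ["vijf", "en", "twintig"].
    for ones in ONES:
        if word.startswith(ones + "en") and word[len(ones) + 2:] in TENS:
            return [ones, "en", word[len(ones) + 2:]]
    return [word]


def compound_to_int(tokens: list[str]) -> int:
    # The pending_ones state: hold the ones value until the tens word arrives.
    pending_ones = 0
    total = 0
    for tok in tokens:
        if tok in ONES:
            pending_ones = ONES[tok]
        elif tok in TENS:
            total += TENS[tok] + pending_ones
            pending_ones = 0
        # "en" is a pure connector and contributes nothing.
    return total + pending_ones


print(compound_to_int("een en twintig".split()))         # 21
print(compound_to_int(split_compound("vijfentwintig")))  # 25
```

Both the spaced (`een en twintig`) and fused (`vijfentwintig`) spellings hit the same `pending_ones` branch, which is why the review asks for at least one end-to-end assertion.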
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: b667681a-4f7b-4a0d-b733-df033b4abe17
⛔ Files ignored due to path filters (1)
`tests/e2e/files/gladia-3/nl.csv` is excluded by `!**/*.csv`
📒 Files selected for processing (7)
- normalization/languages/__init__.py
- normalization/languages/dutch/__init__.py
- normalization/languages/dutch/number_normalizer.py
- normalization/languages/dutch/operators.py
- normalization/languages/dutch/replacements.py
- normalization/languages/dutch/sentence_replacements.py
- tests/unit/languages/dutch_operators_test.py
```python
_TEMPORAL_S_AFTER = (
    r"ochtends|morgens|middags|namiddags|avonds|nachts|"
    r"zaterdags|zondags|weekend|weekends"
)
_RE_TEMPORAL_S = re.compile(
    rf"(?<!\w){_APOST}s(\s+)({_TEMPORAL_S_AFTER})\b",
    re.IGNORECASE,
)
_RE_CLITIC_S = re.compile(rf"(?<!\w){_APOST}s\b", re.IGNORECASE)
```
_RE_CLITIC_S may over-expand non-temporal 's to is.
After the temporal pass consumes 's ochtends/avonds/..., any remaining 's preceded by a non-word char (start of string, space, punctuation) is unconditionally replaced by is. That corrupts genuine Dutch constructions where standalone 's is a reduced genitive/adverbial article, e.g.:
- `'s werelds grootste` (= "of the world's largest") → `is werelds grootste`
- `'s lands belangen` → `is lands belangen`
These are less frequent than the informal "dat 's goed" → "dat is goed" case you want to cover, but silently wrong when they do appear. Consider either (a) extending _TEMPORAL_S_AFTER to include werelds|lands|..., or (b) only replacing 's with is when the previous token ends with a vowel/word consistent with a reduced copula (e.g. require (?:dat|hoe|wat|er|hier|daar)\s+ before it).
Also applies to: 99-100
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@normalization/languages/dutch/operators.py` around lines 16 - 24, The current
_RE_CLITIC_S unconditionally matches any leading non-word then _APOST s and will
incorrectly convert genitive/adverbial constructions like "'s werelds" to "is
werelds"; update the replacement regex logic so we only convert reduced copula
instances: replace _RE_CLITIC_S with a pattern that requires a copula-like
preceding token (e.g. use a positive lookbehind such as
(?<=\b(?:dat|hoe|wat|er|hier|daar)\s){_APOST}s\b) or alternatively add
exclusions by extending _TEMPORAL_S_AFTER to include known genitive words like
werelds|lands if you prefer the whitelist approach; adjust the code that
compiles _RE_CLITIC_S to use the chosen narrower pattern so only true "'s"
copula cases are replaced.
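A minimal sketch of option (b) above. Note that Python's `re` rejects variable-width lookbehinds, so the copula alternation is expressed as a capture group rather than the `(?<=...)` form shown in the prompt; `APOST` stands in for the module's `_APOST` character class:

```python
import re

APOST = "'"  # stand-in for the module's apostrophe character class

# Only expand 's to "is" after a copula-friendly token, so genitive forms
# like "'s werelds" are left alone.
RE_CLITIC_S = re.compile(
    rf"\b(dat|hoe|wat|er|hier|daar)\s+{APOST}s\b",
    re.IGNORECASE,
)


def expand_clitic_s(text: str) -> str:
    return RE_CLITIC_S.sub(r"\1 is", text)


print(expand_clitic_s("dat 's goed"))          # dat is goed
print(expand_clitic_s("'s werelds grootste"))  # unchanged
```

The trade-off is recall: reduced-copula cases outside the whitelisted tokens are no longer expanded, which is exactly the conservative behavior the review argues for.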
Tests for steps must be in the corresponding test files in /tests/unit/step/text
Actionable comments posted: 3
🧹 Nitpick comments (2)
tests/unit/languages/dutch_number_normalizer_test.py (1)
17-55: Consider adding coverage for decimal separator and edge cases. The Dutch decimal separator is `","` per `DUTCH_CONFIG.decimal_separator`, yet currency tests only exercise `.` decimals (e.g. `"$3.50"`). Worth adding cases like `"€10,50"` → `"10,50 euros"`, `"¢50"` (the `¢ → cent` mapping where singular == trailing and the postprocess branch is skipped), and a multi-digit+multiplier case (once the `12 miljoen` semantics are decided) to prevent regressions in the config→normalizer coupling.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/languages/dutch_number_normalizer_test.py` around lines 17 - 55, Add unit tests covering Dutch decimal separator and edge cases: extend or add cases in test_currency_symbols_and_plural_trailing_words to include a comma-decimal currency like "€10,50" expecting "10,50 euros" (verify DUTCH_CONFIG.decimal_separator handling), add a cent symbol case such as "¢50" expecting the mapped "50 cent" behavior (covering the postprocess branch where singular==trailing is skipped), and add an extra multi-digit+multiplier case (e.g., similar to test_multi_digit_then_miljoen_not_fully_merged) to assert the decided semantics for "12 miljoen" (ensure DutchNumberNormalizer preserves the digit token plus multiplier as expected).
normalization/languages/dutch/number_normalizer.py (1)
52-63: Plural→singular and euros-specific fix are hardcoded; decouple from config. `_singular_spoken_unit` hardcodes the exact plural forms currently in `DUTCH_CONFIG.currency_symbol_to_word`, and `_currency_plural_fix_patterns` special-cases `"euros"` to also match `euro's`. If the config is ever changed (e.g., adding a new currency, or using a different plural spelling), both preprocessing and postprocessing will silently misbehave while tests still pass against the fixed config. Consider either (a) co-locating the singular/plural mapping with `DUTCH_CONFIG` (e.g., a small `{plural: singular}` map on the config) and passing it in, or (b) deriving the singular heuristically (strip trailing `s`/`en`) with a small explicit-overrides map for irregulars (`ponden` → `pond`). The `euro's` apostrophe-s form is a Dutch plural variant and would then be handled generically by the singular derivation rather than a special euros branch.
Also applies to: 98-106
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@normalization/languages/dutch/number_normalizer.py` around lines 52 - 63, _singular_spoken_unit currently hardcodes plural→singular mappings (and _currency_plural_fix_patterns special-cases "euros") which will break if DUTCH_CONFIG changes; update the logic to derive singulars from DUTCH_CONFIG instead of hardcoding: either add a plural→singular map to DUTCH_CONFIG and use that in _singular_spoken_unit (passed in or imported), or implement a generic heuristic (strip trailing "s"/"en" and remove apostrophe-s) with a small explicit override map for irregulars (e.g., "ponden"→"pond") so both preprocessing and postprocessing use the same source of truth and handle forms like "euro's" without bespoke branches.
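The heuristic variant (b) could look like this sketch; the override map and word list are illustrative, not the real `DUTCH_CONFIG` contents:

```python
# Heuristic singular derivation with an explicit override map for irregulars.
OVERRIDES = {"ponden": "pond"}  # hypothetical irregular-plural overrides


def singular_spoken_unit(plural: str) -> str:
    if plural in OVERRIDES:
        return OVERRIDES[plural]
    if plural.endswith("'s"):   # Dutch apostrophe plural: euro's -> euro
        return plural[:-2]
    if plural.endswith("en"):   # centen -> cent
        return plural[:-2]
    if plural.endswith("s"):    # euros -> euro, dollars -> dollar
        return plural[:-1]
    return plural


for word in ("euros", "euro's", "ponden", "centen", "dollars"):
    print(word, "->", singular_spoken_unit(word))
```

Because the apostrophe-s branch is generic, `euro's` needs no bespoke handling, which is the point of the suggestion.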
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 127-137: The runtime ImportError check for "alpha2digit is None"
inside DutchNumberNormalizer.__init__ is dead because alpha2digit is imported at
module scope; remove the conditional and its raise so __init__ simply assigns
self._alpha2digit = alpha2digit and continues to set
self._currency_symbol_to_word and self._currency_plural_fixes; alternatively, if
you prefer lazy-friendly behavior, move the module import of alpha2digit into a
try/except ImportError at module scope (setting alpha2digit = None on failure)
and keep the existing check — but do not keep both patterns.
- Around line 34-49: The current _normalize_mixed_numbers leaves multi-digit
numerals like "12 miljoen" unchanged so later alpha2digit produces "12 1000000"
(wrong); change _normalize_mixed_numbers (matched via _RE_MIXED_NUMBER) to
detect when match.group(1) is a multi-digit numeric string and match.group(2) is
a large-number multiplier, parse the numeric part to an int, multiply by the
multiplier value (use the same mapping used for single-word multipliers, e.g.
the numeric value for "miljoen"/"miljard"), and replace the match with the
computed integer string (e.g. "12000000") instead of leaving it; keep the
existing single-digit branch that converts digits to words using
_DIGIT_TO_DUTCH, and update the failing test
(test_multi_digit_then_miljoen_not_fully_merged) to expect "12 miljoen" →
"12000000".
In `@tests/unit/languages/dutch_number_normalizer_test.py`:
- Around line 34-36: The test test_multi_digit_then_miljoen_not_fully_merged
currently asserts the old incorrect behavior ("12 miljoen" -> "12 1000000");
update the expectation to the corrected multiplication result ("12000000") and
rename the test to reflect correct multiplication semantics (e.g.,
test_multi_digit_times_miljoen_fully_merged) to match the fix made in
_normalize_mixed_numbers in normalization/languages/dutch/number_normalizer.py;
keep the test using the DutchNumberNormalizer fixture and ensure the assertion
uses the new expected string.
---
Nitpick comments:
In `@normalization/languages/dutch/number_normalizer.py`:
- Around line 52-63: _singular_spoken_unit currently hardcodes plural→singular
mappings (and _currency_plural_fix_patterns special-cases "euros") which will
break if DUTCH_CONFIG changes; update the logic to derive singulars from
DUTCH_CONFIG instead of hardcoding: either add a plural→singular map to
DUTCH_CONFIG and use that in _singular_spoken_unit (passed in or imported), or
implement a generic heuristic (strip trailing "s"/"en" and remove apostrophe-s)
with a small explicit override map for irregulars (e.g., "ponden"→"pond") so
both preprocessing and postprocessing use the same source of truth and handle
forms like "euro's" without bespoke branches.
In `@tests/unit/languages/dutch_number_normalizer_test.py`:
- Around line 17-55: Add unit tests covering Dutch decimal separator and edge
cases: extend or add cases in test_currency_symbols_and_plural_trailing_words to
include a comma-decimal currency like "€10,50" expecting "10,50 euros" (verify
DUTCH_CONFIG.decimal_separator handling), add a cent symbol case such as "¢50"
expecting the mapped "50 cent" behavior (covering the postprocess branch where
singular==trailing is skipped), and add an extra multi-digit+multiplier case
(e.g., similar to test_multi_digit_then_miljoen_not_fully_merged) to assert the
decided semantics for "12 miljoen" (ensure DutchNumberNormalizer preserves the
digit token plus multiplier as expected).
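The `"€10,50"` case hinges on the currency regex accepting the Dutch comma decimal separator. A minimal sketch (the pattern is illustrative, not the project's actual `_normalize_currency_symbols` regex):

```python
import re

# Dutch uses "," as the decimal separator, so the amount group must allow it:
# "€10,50" -> "10,50 euros", with the amount kept before the currency word.
RE_EURO = re.compile(r"€\s*(\d+(?:,\d+)?)")


def normalize_euro(text: str) -> str:
    return RE_EURO.sub(r"\1 euros", text)


print(normalize_euro("€10,50"))  # 10,50 euros
print(normalize_euro("€10"))     # 10 euros
```

A pattern written for `.` decimals would split `"€10,50"` at the comma, which is exactly the regression the suggested test cases are meant to catch.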
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: fc6ce718-a9de-47ff-a18c-7c7d32f4780f
📒 Files selected for processing (5)
- normalization/languages/dutch/number_normalizer.py
- normalization/languages/dutch/operators.py
- tests/unit/languages/dutch_number_normalizer_test.py
- tests/unit/languages/dutch_operators_test.py
- tests/unit/steps/text/conftest.py
✅ Files skipped from review due to trivial changes (2)
- tests/unit/steps/text/conftest.py
- tests/unit/languages/dutch_operators_test.py
🚧 Files skipped from review as they are similar to previous changes (1)
- normalization/languages/dutch/operators.py
What does this PR do?
This change adds Dutch (nl) normalization—operators, number parsing, Flemish clitic expansion, dialect/word and phrase replacements, plus unit and e2e tests—so transcripts normalize consistently for evaluation.
Type of change
Checklist
Only fill in the section(s) that match your change — delete the rest.
New language
- normalization/languages/{lang}/ with operators.py, replacements.py, __init__.py
- replacements.py (not hardcoded in operators.py)
- LanguageConfig is filled in with the language's data (separators, currency words, digit words, …)
- LanguageOperators — only override methods where the logic changes, not just the data
- @register_language and imported in normalization/languages/__init__.py
- tests/unit/languages/
- tests/e2e/files/{preset}/{lang}.csv (e.g. tests/e2e/files/gladia-3/fr.csv)
Edit existing language
- replacements.py, not inline in operators.py
- None: the step reading it still handles None gracefully
New step
- name class attribute set (this is the key used in YAML presets)
- @register_step and imported in steps/text/__init__.py or steps/word/__init__.py
- operators.config.* instead
- steps/text/placeholders.py and pipeline/base.py's validate() is updated
- tests/unit/steps/
- uv run scripts/generate_step_docs.py
- name is unchanged — if the output changes, create a new step name + new preset instead
- uv run scripts/generate_step_docs.py
Preset change
- pipeline.validate() passes (runs automatically via loader.py)
How was this tested?
Summary by CodeRabbit
New Features
Tests