feat: init swedish language basic normalization #23
Made-with: Cursor
📝 Walkthrough
This PR adds Swedish language support to a text normalization library by introducing a complete Swedish language package with number normalization, operators, and word replacements, while also improving multi-character currency symbol handling in existing text-processing steps to avoid partial matches.
Changes
Swedish Language Support
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks | ✅ 4 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
Actionable comments posted: 2
🧹 Nitpick comments (4)
normalization/languages/swedish/number_normalizer.py (2)
113-125: ⚖️ Poor tradeoff
Hardcoded singular mapping couples normalizer to specific currency words.
`_singular_spoken_unit` enumerates euros/dollars/pounds/kronor/yens directly. Any new entry added to `currency_symbol_to_word` (e.g. a new locale) silently falls back to using the trailing word as both singular and plural, defeating the plural-fix patterns built later. Consider either deriving singular forms from a small data table colocated with `LanguageConfig` or, if the canonical-plural strategy is project-wide, lifting this map into a shared module so it can be reused by other language normalizers.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@normalization/languages/swedish/number_normalizer.py` around lines 113 - 125, _singular_spoken_unit currently hardcodes a few currency plurals so new entries in currency_symbol_to_word won't get correct singular forms; replace the ad-hoc mapping by deriving singulars from a small colocated data table (or lift the map into the shared config) and look up trailing_word there: update _singular_spoken_unit to consult the new singulars map (or shared module) instead of hardcoding euros/dollars/pounds/kronor/yens, and ensure the table is kept next to LanguageConfig or exported from the global canonical-plural utility so other language normalizers can reuse it.
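A minimal sketch of the table-driven alternative suggested in this comment, assuming a plain dictionary is enough. The names `SINGULAR_BY_PLURAL`, `singular_spoken_unit`, and `missing_singulars` are hypothetical and are not taken from the PR; the singular forms listed are illustrative guesses, not the project's actual values.

```python
# Hypothetical table: canonical plural word -> singular spoken form.
# The entries are illustrative; the PR's real values may differ.
SINGULAR_BY_PLURAL: dict[str, str] = {
    "euros": "euro",
    "dollars": "dollar",
    "pounds": "pound",
    "kronor": "krona",
    "yens": "yen",
}


def singular_spoken_unit(trailing_word: str) -> str:
    """Return the singular form, or the word itself when none is known."""
    return SINGULAR_BY_PLURAL.get(trailing_word.lower(), trailing_word)


def missing_singulars(currency_symbol_to_word: dict[str, str]) -> list[str]:
    """List configured currency words that have no singular entry."""
    return sorted(
        word
        for word in set(currency_symbol_to_word.values())
        if word.lower() not in SINGULAR_BY_PLURAL
    )
```

A helper like `missing_singulars` could be asserted empty in a unit test, which would turn the silent fallback described above into a visible failure when a new currency word is added without a singular form.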
234-280: 💤 Low value
Inconsistent `en/ett + multiplier` coverage.
The fast-path branches at lines 234, 242, 250 and 266 handle `en/ett tusen`, `en/ett miljon` (singular only), and `en/ett miljard(er)` / `en/ett biljon(er)` (singular and plural). `miljon` does not get its plural form `miljoner` listed alongside it as the others do. While `en miljoner` is not idiomatic Swedish and rarely produced by STT, the asymmetry is easy to miss in maintenance. Consider unifying the multipliers in a single tuple-driven branch.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@normalization/languages/swedish/number_normalizer.py` around lines 234 - 280, The branch handling "en"/"ett" + multiplier is inconsistent: add the missing plural "miljoner" or unify all multiplier checks into a single tuple-driven branch to avoid asymmetry; locate the blocks that check fw in ("en", "ett") and _fold(words[i + 1]) == "miljon" (and the other blocks using "tusen"/"miljard"/"biljon") inside the _parse_number logic and either include "miljoner" alongside "miljon" or replace the repeated if-blocks with one that looks up _fold(words[i + 1]) in a mapping/tuple of multipliers (e.g., {"tusen":1000, "miljon":1_000_000, "miljoner":1_000_000, "miljard":1_000_000_000, "miljarder":1_000_000_000, "biljon":1_000_000_000_000, "biljoner":1_000_000_000_000}), then call self._parse_number(words, i+2, n) and return the combined value as currently done.

tests/unit/steps/text/remove_standalone_currency_symbols_test.py (1)
13-17: ⚡ Quick win
Add a standalone `kr` assertion to prevent no-op false positives.
Lines 13–17 verify boundary safety, but adding a companion case for standalone `kr` removal would ensure the core behavior still works (not just the “do not strip inside words” case).
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/steps/text/remove_standalone_currency_symbols_test.py` around lines 13 - 17, Add a new assertion to the test to ensure standalone "kr" is handled correctly: update the test function test_multi_char_kr_not_stripped_from_kronor (or add a new sibling test) to call RemoveStandaloneCurrencySymbolsStep() with SwedishOperators() and assert that an input like "25 kr" is transformed/removed according to expected behavior (e.g., becomes "25" or "25 " depending on trimming rules) so the suite covers both multi-char "kronor" retention and standalone "kr" handling.

tests/unit/languages/swedish_operators_test.py (1)
4-4: ⚡ Quick win
Add a test that verifies `"sv"` becomes available via package import flow.
The direct import on line 4 triggers the `@register_language` decorator at import time, making the registry tests pass. However, this bypasses verification that the package-level wiring in `normalization/languages/__init__.py` actually exercises the registration. Consider adding a separate test that imports via `from normalization.languages import swedish` (or similar) to ensure the `__init__.py` import chain properly registers Swedish operators.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/unit/languages/swedish_operators_test.py` at line 4, Add a unit test that verifies package-level import triggers the `@register_language` registration for Swedish: instead of directly importing SwedishOperators, import the package module (e.g., "from normalization.languages import swedish" or "import normalization.languages; import normalization.languages.swedish") and then assert the registry contains the "sv" entry or that swedish.SwedishOperators is available; this ensures the __init__.py import chain exercises the register_language decorator rather than relying on a direct import of SwedishOperators.
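The idea behind this comment can be illustrated without the real package. The sketch below builds a throwaway package on disk whose `__init__.py` imports a language module, then checks that importing only the package populates the registry. Everything here (`demo_languages`, `REGISTRY`, the shape of `register_language`) is a hypothetical stand-in; the project's actual decorator and registry may look different.

```python
import importlib
import sys
import tempfile
import textwrap
from pathlib import Path


def build_demo_package(root: Path) -> None:
    """Write a tiny package whose __init__ imports the language module."""
    pkg = root / "demo_languages"
    pkg.mkdir()
    (pkg / "registry.py").write_text(textwrap.dedent('''
        REGISTRY = {}

        def register_language(cls):
            # Registration happens as a side effect of class definition.
            REGISTRY[cls.code] = cls
            return cls
    '''))
    (pkg / "swedish.py").write_text(textwrap.dedent('''
        from .registry import register_language

        @register_language
        class SwedishOperators:
            code = "sv"
    '''))
    # This line is the package-level wiring the review comment wants tested.
    (pkg / "__init__.py").write_text("from . import swedish\n")


def registered_codes_after_package_import() -> set[str]:
    """Import only the package and report which language codes got registered."""
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_package(Path(tmp))
        sys.path.insert(0, tmp)
        try:
            importlib.invalidate_caches()
            importlib.import_module("demo_languages")
            registry = importlib.import_module("demo_languages.registry")
            return set(registry.REGISTRY)
        finally:
            # Undo the path and module-cache changes made for the demo.
            sys.path.remove(tmp)
            for name in [n for n in sys.modules if n.startswith("demo_languages")]:
                del sys.modules[name]
```

In the real test suite the equivalent check would import `normalization.languages` and assert on the project's own registry, without importing `SwedishOperators` directly in that test.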
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@normalization/languages/swedish/number_normalizer.py`:
- Around line 82-110: The regex _RE_MIXED_NUMBER currently matches multi-digit
numerals but replace() only handles single digits (checking len(number) == 1),
so strings like "20 miljoner" are left unchanged; update
_normalize_mixed_numbers (replace) to handle multi-digit numbers by parsing
number = int(match.group(1)) and, if multiplier = match.group(2) is in
_BIG_MULT, compute value = number * _BIG_MULT[multiplier] and return str(value)
(otherwise fall back to the existing single-digit logic that uses
_DIGIT_TO_SWEDISH), or alternatively make _RE_MIXED_NUMBER only match a single
digit if you prefer the original single-digit-only behavior.
In `@normalization/languages/swedish/operators.py`:
- Around line 36-43: The currency mapping in currency_symbol_to_word in
operators.py uses "¢": "cent" which breaks the pluralization convention; change
that entry to "¢": "cents" so it matches the plural canonical form used for
other currencies and the _currency_plural_fix_patterns logic, and also add a
branch in number_normalizer.py's _singular_spoken_unit (e.g., if t == "cents":
return "cent") so the plural-to-singular fix recognizes "cents".
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: c1ea7130-8c85-44f0-a1df-7adf62e56f29
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3/sv.csv is excluded by !**/*.csv
📒 Files selected for processing (14)
- README.md
- docs/contributing-guide.md
- docs/steps.md
- normalization/languages/__init__.py
- normalization/languages/swedish/__init__.py
- normalization/languages/swedish/number_normalizer.py
- normalization/languages/swedish/operators.py
- normalization/languages/swedish/replacements.py
- normalization/steps/text/remove_standalone_currency_symbols.py
- normalization/steps/text/replace_currency.py
- tests/unit/languages/swedish_number_normalizer_test.py
- tests/unit/languages/swedish_operators_test.py
- tests/unit/steps/text/remove_standalone_currency_symbols_test.py
- tests/unit/steps/text/replace_currency_test.py
```python
_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+("
    r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
    r")\b",
    re.IGNORECASE,
)

_BIG_MULT: dict[str, int] = {
    "tusen": 1000,
    "miljon": 1_000_000,
    "miljoner": 1_000_000,
    "miljard": 1_000_000_000,
    "miljarder": 1_000_000_000,
    "biljon": 1_000_000_000_000,
    "biljoner": 1_000_000_000_000,
}


def _normalize_mixed_numbers(text: str) -> str:
    """Convert ``3 miljard`` → ``tre miljard`` so the word parser yields 3e9."""

    def replace(match: re.Match[str]) -> str:
        number = match.group(1)
        multiplier = match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
            return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
        return match.group(0)

    return _RE_MIXED_NUMBER.sub(replace, text)
```
Multi-digit mixed numbers are silently dropped.
_RE_MIXED_NUMBER matches \d+ (any digit count), but replace() only rewrites single-digit cases (len(number) == 1). As a result, "20 miljoner" stays unchanged and never gets composed by the spelled-out parser, so it is not converted to 20000000. Either narrow the regex to a single digit (so the intent is explicit) or extend the rewrite to handle multi-digit values.
♻️ Option A: narrow regex (current behavior, made explicit)
```diff
 _RE_MIXED_NUMBER = re.compile(
-    r"\b(\d+)\s+("
+    r"\b(\d)\s+("
     r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
     r")\b",
     re.IGNORECASE,
 )
```
♻️ Option B: handle multi-digit numbers by multiplying directly
```diff
 def _normalize_mixed_numbers(text: str) -> str:
-    """Convert ``3 miljard`` → ``tre miljard`` so the word parser yields 3e9."""
+    """Convert ``3 miljard`` → ``tre miljard`` (single digit) or ``20 miljoner`` → digits."""
     def replace(match: re.Match[str]) -> str:
         number = match.group(1)
         multiplier = match.group(2)
         if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
             return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
-        return match.group(0)
+        mult = _BIG_MULT.get(multiplier.lower())
+        if mult is None:
+            return match.group(0)
+        return str(int(number) * mult)
     return _RE_MIXED_NUMBER.sub(replace, text)
```
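For reference, Option B can be exercised in isolation with the sketch below. It copies the regex and `_BIG_MULT` from the excerpt above; `_DIGIT_TO_SWEDISH` is not shown in the review, so the table here is an assumed stand-in covering the digits 1 to 9, and the function is renamed without the leading underscore to mark it as a demo rather than the PR's code.

```python
import re

_RE_MIXED_NUMBER = re.compile(
    r"\b(\d+)\s+("
    r"miljon|miljoner|miljard|miljarder|biljon|biljoner|tusen"
    r")\b",
    re.IGNORECASE,
)

_BIG_MULT: dict[str, int] = {
    "tusen": 1000,
    "miljon": 1_000_000,
    "miljoner": 1_000_000,
    "miljard": 1_000_000_000,
    "miljarder": 1_000_000_000,
    "biljon": 1_000_000_000_000,
    "biljoner": 1_000_000_000_000,
}

# Assumed contents: the real table is defined elsewhere in the PR.
_DIGIT_TO_SWEDISH: dict[str, str] = {
    "1": "en", "2": "två", "3": "tre", "4": "fyra", "5": "fem",
    "6": "sex", "7": "sju", "8": "åtta", "9": "nio",
}


def normalize_mixed_numbers(text: str) -> str:
    """Single digits become words; longer numerals are multiplied out."""

    def replace(match: re.Match[str]) -> str:
        number = match.group(1)
        multiplier = match.group(2)
        if len(number) == 1 and number in _DIGIT_TO_SWEDISH:
            return f"{_DIGIT_TO_SWEDISH[number]} {multiplier}"
        # Lower-case the key because the regex matches case-insensitively.
        mult = _BIG_MULT.get(multiplier.lower())
        if mult is None:
            return match.group(0)
        return str(int(number) * mult)

    return _RE_MIXED_NUMBER.sub(replace, text)
```

With this variant `"20 miljoner"` is rewritten to digits while `"3 miljard"` still goes through the word parser path. Decimal inputs such as `1,5 miljoner` are not handled by this sketch and would need separate treatment.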
```python
currency_symbol_to_word={
    "€": "euros",
    "$": "dollars",
    "£": "pounds",
    "¢": "cent",
    "¥": "yens",
    "kr": "kronor",
},
```
"¢": "cent" is inconsistent with the plural canonical used elsewhere.
All other entries map to the plural form (euros, dollars, pounds, yens, kronor), and the number normalizer's plural-fix logic relies on that convention: in _currency_plural_fix_patterns, the entry is skipped when singular.lower() == trailing.lower() (which is the case for cent). Net effect: 5 ¢ becomes 5 cent and 5 euro becomes 5 euros — different singular/plural canonicalization across currencies, which will hurt WER consistency. Likely intended to be "cents".
🩹 Suggested fix
```diff
 currency_symbol_to_word={
     "€": "euros",
     "$": "dollars",
     "£": "pounds",
-    "¢": "cent",
+    "¢": "cents",
     "¥": "yens",
     "kr": "kronor",
 },
```
Note: if `cents` is added, also extend `_singular_spoken_unit` in number_normalizer.py with `if t == "cents": return "cent"` so the plural fix actually matches.
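To make the skip condition concrete, here is a rough reconstruction of how a plural-fix pattern builder could behave. It is a sketch under assumptions: the real `_currency_plural_fix_patterns` is not shown in this review, so `plural_fix_patterns`, `apply_plural_fix`, and the singular table below are hypothetical and only mirror the `singular.lower() == trailing.lower()` check described in the comment.

```python
import re

# Assumed singular table, already including the suggested "cents" entry.
_SINGULARS = {
    "euros": "euro",
    "dollars": "dollar",
    "pounds": "pound",
    "yens": "yen",
    "cents": "cent",
}


def singular_spoken_unit(trailing: str) -> str:
    return _SINGULARS.get(trailing.lower(), trailing)


def plural_fix_patterns(currency_symbol_to_word: dict[str, str]):
    """Build (regex, replacement) pairs that rewrite '<n> singular' to plural."""
    patterns = []
    for trailing in currency_symbol_to_word.values():
        singular = singular_spoken_unit(trailing)
        if singular.lower() == trailing.lower():
            continue  # No distinct singular known, so no fix is generated.
        regex = re.compile(rf"\b(\d+)\s+{re.escape(singular)}\b", re.IGNORECASE)
        patterns.append((regex, rf"\1 {trailing}"))
    return patterns


def apply_plural_fix(text: str, currency_symbol_to_word: dict[str, str]) -> str:
    for regex, replacement in plural_fix_patterns(currency_symbol_to_word):
        text = regex.sub(replacement, text)
    return text
```

Under these assumptions, a config that maps `¢` to `cent` produces no pattern for cents, so `5 cent` is left alone while `5 euro` becomes `5 euros`; mapping `¢` to `cents` makes both currencies follow the same plural convention.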
Adds Swedish (sv) normalization: operators, replacements, number normalizer, registry wiring, and unit and gladia-3 e2e tests.
Type of change
Checklist
Only fill in the section(s) that match your change — delete the rest.
New language
- `normalization/languages/{lang}/` with `operators.py`, `replacements.py`, `__init__.py`
- `replacements.py` (not hardcoded in `operators.py`)
- `LanguageConfig` is filled in with the language's data (separators, currency words, digit words, …)
- `LanguageOperators`: only override methods where the logic changes, not just the data
- `@register_language` and imported in `normalization/languages/__init__.py`
- `tests/unit/languages/`
- `tests/e2e/files/{preset}/{lang}.csv` (e.g. `tests/e2e/files/gladia-3/fr.csv`)

Edit existing language
- `replacements.py`, not inline in `operators.py`
- `None`: the step reading it still handles `None` gracefully

New step
- `name` class attribute set (this is the key used in YAML presets)
- `@register_step` and imported in `steps/text/__init__.py` or `steps/word/__init__.py`
- `operators.config.*` instead
- `steps/text/placeholders.py` and `pipeline/base.py`'s `validate()` is updated
- `tests/unit/steps/`
- `uv run scripts/generate_step_docs.py`

Edit existing step
- `name` is unchanged; if the output changes, create a new step name + new preset instead
- `uv run scripts/generate_step_docs.py`

Preset change
- `pipeline.validate()` passes (runs automatically via `loader.py`)

How was this tested?
Summary by CodeRabbit
New Features
Bug Fixes
Documentation