feat: add german, italian, and spanish text normalization #15
Karamouche merged 11 commits into main from
Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior in the CodeRabbit settings. Use the following commands to manage reviews, or the checkboxes below for quick actions:
No actionable comments were generated in the recent review. 🎉
ℹ️ Recent review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID:
⛔ Files ignored due to path filters (2)
📒 Files selected for processing (1)
✅ Files skipped from review due to trivial changes (1)
📝 Walkthrough
Adds German, Italian, and Spanish language packages (operators, replacements); Spanish includes a written-number normalizer. Updates language exports, refactors e2e tests to per-preset/per-language CSV discovery, and updates docs, PR template, and .gitignore.
Changes
Sequence Diagram(s)
sequenceDiagram
participant Disc as TestDiscovery
participant FS as Filesystem (tests/e2e/files/{preset}/)
participant CSV as CSV Loader
participant TF as Pytest Function Generator
participant NP as Normalizer Pipeline
Disc->>FS: list CSV files for preset
FS-->>Disc: filenames per-language (fr.csv, de.csv, es.csv)
Disc->>CSV: load each `{lang}.csv` (input,expected)
CSV-->>Disc: test cases for lang
Disc->>TF: create test_gladia_3_{lang} bound to lang
TF->>NP: run normalization pipeline with fixed language
NP-->>TF: normalized output
TF->>Disc: assert expected == normalized
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Possibly related PRs
🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Actionable comments posted: 6
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@normalization/languages/german/replacements.py`:
- Line 9: Remove the unconditional mapping '"bis": "-"' from replacements.py and
instead implement a contextual range rule: only convert "bis" to a range dash
when it appears between two tokens that are numeric, date, or unit types. Locate
the '"bis": "-"' entry and delete it, then update the range-handling logic (the
code that normalizes ranges / applies range rules — e.g., functions that check
neighbor tokens when normalizing tokens) to detect token types on both sides and
replace "bis" with "-" only when both neighbors are numeric/date/unit tokens.
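The contextual rule the comment asks for can be sketched roughly as follows; the helper names and the numeric-token regex are illustrative assumptions, not the module's real API:

```python
import re

# Hypothetical sketch: treat "bis" as a range dash only when both neighbours
# look numeric (plain digits, dotted dates like "12.03.", or digit+unit tokens).
_NUMERIC_LIKE = re.compile(r"^\d+([.,:]\d+)*\.?$|^\d+(kg|km|cm|€|%)$")

def _is_numeric_like(token: str) -> bool:
    return bool(_NUMERIC_LIKE.match(token))

def apply_contextual_bis(text: str) -> str:
    tokens = text.split()
    out = []
    for i, tok in enumerate(tokens):
        if (
            tok == "bis"
            and 0 < i < len(tokens) - 1
            and _is_numeric_like(tokens[i - 1])
            and _is_numeric_like(tokens[i + 1])
        ):
            out.append("-")  # both neighbours numeric: rewrite
        else:
            out.append(tok)  # otherwise keep the literal word
    return " ".join(out)
```

With this shape, "von 9 bis 17 Uhr" is rewritten while "von Montag bis Freitag" is left alone, which is exactly the failure mode the unconditional mapping had.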
In `@normalization/languages/italian/operators.py`:
- Around line 105-112: The regex at the end of fix_one_word_in_numeric_contexts
currently converts any sentence-start "uno" followed by a lowercase word (regex
^uno\s+(?=[a-z])) which incorrectly changes non-numeric phrases like "uno
studente"; change it to only match when the following token is a digit or a
known Italian number word. Replace the pattern ^uno\s+(?=[a-z]) with something
like ^uno\s+(?=(?:\d|due|tre|quattro|cinque|sei|sette|otto|nove|dieci)\b) (or
use a centralized list variable for number words) so only numeric contexts are
rewritten; keep the rest of the substitutions intact in
fix_one_word_in_numeric_contexts.
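A minimal sketch of the restricted pattern, assuming a stand-alone helper and a deliberately short, illustrative number-word list (the real module would use its centralized constant):

```python
import re

# Only rewrite a sentence-initial "uno" when the next token is a digit or a
# known Italian number word; "uno studente" must stay untouched.
_ITALIAN_NUMBER_WORDS = r"due|tre|quattro|cinque|sei|sette|otto|nove|dieci"
_UNO_NUMERIC = re.compile(rf"^uno\s+(?=(?:\d|(?:{_ITALIAN_NUMBER_WORDS})\b))")

def rewrite_leading_uno(text: str) -> str:
    return _UNO_NUMERIC.sub("1 ", text)
```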
In `@normalization/languages/spanish/number_normalizer.py`:
- Around line 163-168: The current branch handling fw == "veinte" only looks one
word ahead and misses the pattern "veinte y X"; update the block in
number_normalizer.py (the fw == "veinte" branch using variables words, i, n,
_get and _ONES_1_9) to first check if i+2 < n and words[i+1] == "y" and then
call _get(_ONES_1_9, words[i+2]) returning (i+3, 20 + o) when o is not None;
otherwise fall back to the existing checks (single next-word ones digit or
return (i+1, 20)).
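The proposed branch could look like the sketch below; `_get` and `_ONES_1_9` are stand-ins mirroring the names in the comment, and `parse_veinte` is a hypothetical wrapper, not the normalizer's real function:

```python
# Stand-in tables/helpers mirroring the names mentioned in the review.
_ONES_1_9 = {
    "uno": 1, "dos": 2, "tres": 3, "cuatro": 4, "cinco": 5,
    "seis": 6, "siete": 7, "ocho": 8, "nueve": 9,
}

def _get(table, word):
    return table.get(word)

def parse_veinte(words, i):
    """Return (next_index, value) for a 'veinte' at position i."""
    n = len(words)
    # New case: "veinte y X" -> 20 + X, consuming three tokens.
    if i + 2 < n and words[i + 1] == "y":
        o = _get(_ONES_1_9, words[i + 2])
        if o is not None:
            return (i + 3, 20 + o)
    # Existing fallback: ones word directly after "veinte".
    if i + 1 < n:
        o = _get(_ONES_1_9, words[i + 1])
        if o is not None:
            return (i + 2, 20 + o)
    return (i + 1, 20)
```

Note the "veinte y media"-style guard: when the word after "y" is not a ones digit, the function falls through to returning a bare 20 rather than consuming the "y".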
In `@normalization/languages/spanish/operators.py`:
- Around line 125-132: The final regex in fix_one_word_in_numeric_contexts is
too broad and rewrites ordinary phrases starting with "uno"; replace the pattern
r"^uno\s+(?=[a-z])" with a pattern that only matches when the following token is
a numeric digit or a Spanish number word (e.g. use a lookahead like
(?=\d|\b(uno|dos|tres|cuatro|cinco|seis|siete|ocho|nueve|diez)\b)); update the
substitution accordingly so only true numeric sequences (digits or spelled-out
numbers) are converted.
- Around line 20-24: SPANISH_CONFIG currently sets decimal_word="punto" which
mismatches the decimal_separator=","; update the LanguageConfig instance named
SPANISH_CONFIG to use decimal_word="coma" so numeric tokens like 3,14 normalize
correctly; locate the SPANISH_CONFIG definition (the LanguageConfig(...) call)
and replace the decimal_word value from "punto" to "coma".
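A small sketch of why the two fields must agree, using an illustrative `LanguageConfig` stand-in (the real class and the normalizer's regexes will differ):

```python
import re
from dataclasses import dataclass

@dataclass
class LanguageConfig:
    decimal_separator: str
    decimal_word: str

# With separator "," the spoken decimal word in Spanish is "coma":
# "tres coma catorce" / "3 coma 14" should round-trip to "3,14".
SPANISH_CONFIG = LanguageConfig(decimal_separator=",", decimal_word="coma")

def normalize_decimal_word(text: str, cfg: LanguageConfig) -> str:
    # Collapse "<digit> <decimal_word> <digit>" into "<digit><sep><digit>".
    return re.sub(
        rf"(\d)\s+{re.escape(cfg.decimal_word)}\s+(\d)",
        rf"\1{cfg.decimal_separator}\2",
        text,
    )
```

With decimal_word="punto" this substitution would never fire on real Spanish speech, which is the mismatch the comment flags.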
- Around line 82-87: The Spanish number_words list is missing accented variants
for the "veinti*" forms; update the number_words collection (variable name:
number_words in normalization/languages/spanish/operators.py) to include the
canonical accented strings "veintidós", "veintitrés", and "veintiséis" alongside
the existing unaccented entries (or replace the unaccented ones with accented
forms) so written-number expansion correctly matches common Spanish text.
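An alternative to listing both spellings is to fold accents before lookup; this is a design-choice sketch with hypothetical names, not the module's API:

```python
import unicodedata

def fold_accents(word: str) -> str:
    # NFD-decompose, then drop combining marks (category "Mn").
    return "".join(
        ch for ch in unicodedata.normalize("NFD", word)
        if unicodedata.category(ch) != "Mn"
    )

# Store one canonical (folded) spelling per number word...
NUMBER_WORDS = {fold_accents(w) for w in ["veintidós", "veintitrés", "veintiséis"]}

def is_number_word(word: str) -> bool:
    # ...and fold the input the same way at lookup time.
    return fold_accents(word) in NUMBER_WORDS
```

One caveat for Spanish: naive NFD folding also turns "ñ" into "n", so a real implementation would need to exempt it; explicitly listing both accented and unaccented entries, as the comment suggests, avoids that problem entirely.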
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a286e177-24b0-45b8-b932-41e4ce0565af
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (12)
- normalization/languages/__init__.py
- normalization/languages/german/__init__.py
- normalization/languages/german/operators.py
- normalization/languages/german/replacements.py
- normalization/languages/german/sentence_replacements.py
- normalization/languages/italian/__init__.py
- normalization/languages/italian/operators.py
- normalization/languages/italian/replacements.py
- normalization/languages/spanish/__init__.py
- normalization/languages/spanish/number_normalizer.py
- normalization/languages/spanish/operators.py
- normalization/languages/spanish/replacements.py
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/e2e/normalization_test.py (1)
35-35: Make pytest case IDs unambiguous within the same language group. At line 35, truncating only `input` can produce duplicate IDs, which makes failures harder to triage.
Proposed tweak:
-def _case_ids(cases: list[NormalizationTest]) -> list[str]:
-    return [test.input[:60] for test in cases]
+def _case_ids(cases: list[NormalizationTest]) -> list[str]:
+    return [f"{i}:{test.input[:60]}" for i, test in enumerate(cases)]
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/e2e/normalization_test.py`:
- Around line 80-85: If _GLADIA_3_TESTS is empty the dynamic test generation
loop (where _GLADIA_3_BY_LANGUAGE = _group_by_language(_GLADIA_3_TESTS) and
globals()[f"test_gladia_3_{_language}"] = _make_gladia_3_test(...)) will produce
no tests and silently pass CI; add a guard after computing _GLADIA_3_BY_LANGUAGE
that checks for emptiness (e.g., assert or raise an informative RuntimeError)
mentioning _GLADIA_3_TESTS/_GLADIA_3_BY_LANGUAGE so the test suite fails fast
when no e2e rows are loaded and developers notice missing fixtures.
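The suggested guard can be sketched as below, with `_group_by_language` stubbed and the CSV loader replaced by a plain list argument; the real test module wires this into its module-level discovery instead:

```python
def _group_by_language(tests):
    by_lang = {}
    for test in tests:
        by_lang.setdefault(test["language"], []).append(test)
    return by_lang

def build_language_groups(tests):
    by_language = _group_by_language(tests)
    if not by_language:
        # Fail fast instead of silently generating zero tests.
        raise RuntimeError(
            "No e2e rows loaded: _GLADIA_3_TESTS is empty - "
            "check tests/e2e/files/ for missing CSV fixtures."
        )
    return by_language
```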
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f8e7a938-b3d8-4715-8bca-8ea51ce18fbf
⛔ Files ignored due to path filters (1)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (1)
tests/e2e/normalization_test.py
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@docs/contributing-guide.md`:
- Around line 188-198: Update the docs example to use the actual helper naming
convention from tests/e2e/normalization_test.py: replace the generic
`_make_test(...)` reference with the preset-specific factory name pattern (e.g.,
`_make_gladia_3_test(...)`) and adjust the generated test function name
accordingly so contributors can copy/paste the snippet and match the real
helpers like `_MY_PRESET_DIR`, `_MY_PRESET_BY_LANGUAGE`, and
`_MY_PRESET_PIPELINES`.
- Around line 166-174: Two fenced code blocks in the contributing-guide use
unlabeled backticks causing markdownlint MD040; update the first block (the
directory listing block that shows "tests/e2e/files/ gladia-3/ ... default.csv
...") to use a language label like "text" and update the CSV example block (the
block starting with "input,expected" and containing "$1,000,000,1000000
dollars") to use "csv" as the fence language; locate these unlabeled
triple-backtick blocks and replace their opening fences with ```text and ```csv
respectively so the examples are explicitly labeled.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 7149f753-abd2-4949-8341-6d4c1a4eab19
⛔ Files ignored due to path filters (7)
tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/de.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/default.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/en.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/es.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/fr.csv is excluded by !**/*.csv
tests/e2e/files/gladia-3/it.csv is excluded by !**/*.csv
📒 Files selected for processing (6)
- .github/pull_request_template.md
- .gitignore
- AGENTS.md
- CONTRIBUTING.md
- docs/contributing-guide.md
- tests/e2e/normalization_test.py
✅ Files skipped from review due to trivial changes (4)
- .gitignore
- CONTRIBUTING.md
- AGENTS.md
- .github/pull_request_template.md
🚧 Files skipped from review as they are similar to previous changes (1)
- tests/e2e/normalization_test.py
♻️ Duplicate comments (2)
normalization/languages/spanish/operators.py (2)
132-132: ⚠️ Potential issue | 🟠 Major
Keep the start-of-string `uno` rewrite constrained to numeric followers. Line 132 will also turn plain phrases like `uno de ellos` into `1 de ellos`. Restrict that lookahead to digits or Spanish number words only.
💡 Suggested fix
+_SPANISH_NUMBER_WORDS_PATTERN = "|".join(
+    sorted((re.escape(word) for word in SPANISH_CONFIG.number_words), key=len, reverse=True)
+)
+
 @register_language
 class SpanishOperators(LanguageOperators):
 @@
-    text = re.sub(r"^uno\s+(?=[a-z])", "1 ", text)
+    text = re.sub(
+        rf"^uno\s+(?=\d|\b(?:{_SPANISH_NUMBER_WORDS_PATTERN})\b)",
+        "1 ",
+        text,
+    )
     return text
82-87: ⚠️ Potential issue | 🟠 Major
Add the canonical accented `veinti*` forms. Lines 82-87 only include the unaccented spellings. Any `SPANISH_CONFIG.number_words` lookup will miss common inputs like `veintidós`, `veintitrés`, and `veintiséis`.
💡 Suggested fix
     "veinte",
     "veintiuno",
+    "veintidós",
     "veintidos",
+    "veintitrés",
     "veintitres",
     "veinticuatro",
     "veinticinco",
+    "veintiséis",
     "veintiseis",
     "veintisiete",
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: a362269d-937a-4b88-aa74-61711a555058
📒 Files selected for processing (1)
normalization/languages/spanish/operators.py
What does this PR do?
Extends the text normalization pipeline with support for three new languages:
German — customizable abbreviation and compound word mappings
Italian — abbreviation expansion with numeric context handling
Spanish — comprehensive abbreviation mappings and written number conversion
Type of change
- New language (languages/{lang}/)
- New step (steps/text/ or steps/word/)
- Preset change (presets/)
Checklist
New language
- languages/{lang}/ with operators.py, replacements.py, __init__.py
- replacements.py, not inline in operators.py
- @register_language
- languages/__init__.py
- tests/unit/languages/
- tests/e2e/files/
New step
- name class attribute is unique and matches the YAML key
- @register_step
- steps/text/__init__.py or steps/word/__init__.py
- operators.config.*, no hardcoded language-specific values
- if operators.config.field is None: return text
- steps/text/placeholders.py and pipeline/base.py's validate() is updated
- tests/unit/steps/
- uv run scripts/generate_step_docs.py to regenerate docs/steps.md
Preset change
Tests
Summary by CodeRabbit
New Features
Tests
Documentation
Chores