
feat: add german, italian, and spanish text normalization#15

Merged
Karamouche merged 11 commits into main from feat/add-metrics-api-languages on Apr 14, 2026

Conversation

Contributor

@egenthon-cmd egenthon-cmd commented Apr 10, 2026

What does this PR do?

Extends the text normalization pipeline with support for three new languages:

German — customizable abbreviation and compound word mappings
Italian — abbreviation expansion with numeric context handling
Spanish — comprehensive abbreviation mappings and written number conversion

Type of change

  • New language (languages/{lang}/)
  • New step (steps/text/ or steps/word/)
  • New preset version (presets/)
  • Bug fix
  • Refactor / internal cleanup
  • Docs / CI

Checklist

New language

  • Created languages/{lang}/ with operators.py, replacements.py, __init__.py
  • All word-level substitutions are in replacements.py, not inline in operators.py
  • Decorated operators class with @register_language
  • Added one import line to languages/__init__.py
  • Added unit tests in tests/unit/languages/
  • Added e2e test rows in tests/e2e/files/
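The checklist above implies a small, repeatable package layout. Here is a minimal sketch assuming a hypothetical registry and decorator; the project's real `LanguageOperators` base class and registration internals may differ:

```python
# Hypothetical mini-registry illustrating the layout above. The decorator and
# registry names are stand-ins; the real project wiring may differ.
LANGUAGE_REGISTRY: dict[str, type] = {}

def register_language(cls):
    """Stand-in decorator: register an operators class under its language code."""
    LANGUAGE_REGISTRY[cls.language] = cls
    return cls

# Contents of a hypothetical languages/german/replacements.py
GERMAN_REPLACEMENTS = {"z.b.": "zum beispiel", "bzw.": "beziehungsweise"}

@register_language
class GermanOperators:
    language = "de"

    def apply_replacements(self, text: str) -> str:
        # Word-level substitutions come from replacements.py, never inline here.
        for short, full in GERMAN_REPLACEMENTS.items():
            text = text.replace(short, full)
        return text
```

With the one-line import in languages/__init__.py, the decorator runs at import time and the language becomes discoverable via the registry.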

New step

  • name class attribute is unique and matches the YAML key
  • Decorated with @register_step
  • Added one import line to steps/text/__init__.py or steps/word/__init__.py
  • Algorithm reads data from operators.config.*, no hardcoded language-specific values
  • Optional config fields are guarded with if operators.config.field is None: return text
  • Placeholder protect/restore pairs are both defined in steps/text/placeholders.py, and pipeline/base.py's validate() is updated to cover them
  • Added unit tests in tests/unit/steps/
  • Added step name to relevant preset YAMLs (new preset file if existing presets are affected)
  • If the class docstring was added or changed, ran uv run scripts/generate_step_docs.py to regenerate docs/steps.md

Preset change

  • Existing preset files are not modified — new behavior uses a new preset version file

Tests

Summary by CodeRabbit

  • New Features

    • Added German, Italian, and Spanish language support with locale-aware normalization, abbreviation/replacement maps, and written-number expansion.
  • Tests

    • Refactored e2e normalization tests to use per-preset/per-language CSV fixtures and generate per-language test functions.
  • Documentation

    • Updated contributing and PR guidance to reflect the new per-preset/per-language CSV fixture layout and testing workflow.
  • Chores

    • Added .DS_Store to ignore patterns.

@coderabbitai

coderabbitai bot commented Apr 10, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 344bc10e-f37a-442a-a745-9da2be950795

📥 Commits

Reviewing files that changed from the base of the PR and between f33ddb3 and 8568b7f.

⛔ Files ignored due to path filters (2)
  • tests/e2e/files/gladia-3/de.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/es.csv is excluded by !**/*.csv
📒 Files selected for processing (1)
  • normalization/languages/german/replacements.py
✅ Files skipped from review due to trivial changes (1)
  • normalization/languages/german/replacements.py

📝 Walkthrough


Adds German, Italian, and Spanish language packages (operators, replacements); Spanish includes a written-number normalizer. Updates language exports, refactors e2e tests to per-preset/per-language CSV discovery, and updates docs, PR template, and .gitignore.

Changes

  • Languages package (normalization/languages/__init__.py): Expose german, italian, spanish in __all__ and import the new language packages.
  • German (normalization/languages/german/__init__.py, operators.py, replacements.py, sentence_replacements.py): Add a German LanguageConfig (GERMAN_CONFIG) and a registered GermanOperators, plus token- and sentence-level replacement maps and locale-specific numeric/symbol mappings.
  • Italian (normalization/languages/italian/__init__.py, operators.py, replacements.py): Add an Italian LanguageConfig (ITALIAN_CONFIG) and a registered ItalianOperators with numeric-context fixups (fix_one_word_in_numeric_contexts) and ITALIAN_REPLACEMENTS.
  • Spanish (normalization/languages/spanish/__init__.py, operators.py, replacements.py, number_normalizer.py): Add a Spanish LanguageConfig (SPANISH_CONFIG) and a registered SpanishOperators integrating SpanishNumberNormalizer for written-number expansion, numeric-context fixes, and SPANISH_REPLACEMENTS.
  • End-to-end tests (tests/e2e/normalization_test.py, tests/e2e/files/gladia-3/*): Refactor e2e discovery to a per-preset directory with per-language two-column CSVs (input,expected); remove the per-row language column; generate one pytest function per discovered language.
  • Docs & templates (AGENTS.md, CONTRIBUTING.md, docs/contributing-guide.md, .github/pull_request_template.md): Update the contribution checklist and e2e fixture guidance to require per-preset, per-language CSVs and document the new CSV layout.
  • Misc (.gitignore): Add .DS_Store to ignore patterns.

Sequence Diagram(s)

sequenceDiagram
    participant Disc as TestDiscovery
    participant FS as Filesystem (tests/e2e/files/{preset}/)
    participant CSV as CSV Loader
    participant TF as Pytest Function Generator
    participant NP as Normalizer Pipeline

    Disc->>FS: list CSV files for preset
    FS-->>Disc: filenames per-language (fr.csv, de.csv, es.csv)
    Disc->>CSV: load each `{lang}.csv` (input,expected)
    CSV-->>Disc: test cases for lang
    Disc->>TF: create test_gladia_3_{lang} bound to lang
    TF->>NP: run normalization pipeline with fixed language
    NP-->>TF: normalized output
    TF->>Disc: assert expected == normalized
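In code, the discovery flow sketched in the diagram could look roughly like this; all helper names here are invented stand-ins for the real ones in tests/e2e/normalization_test.py:

```python
import csv
import io

def load_cases(csv_text: str) -> list[tuple[str, str]]:
    """Parse a two-column input,expected fixture (one {lang}.csv per language)."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [(row["input"], row["expected"]) for row in reader]

def make_preset_test(lang: str, cases, normalize):
    """Bind one pytest-style function to a fixed language."""
    def test_fn():
        for input_text, expected in cases:
            assert normalize(input_text, lang) == expected
    test_fn.__name__ = f"test_gladia_3_{lang}"
    return test_fn

# One generated function per discovered {lang}.csv; the normalizer here is a
# trivial placeholder standing in for the real pipeline.
fixture = "input,expected\nz.B. heute,zum Beispiel heute\n"
normalize = lambda text, lang: text.replace("z.B.", "zum Beispiel")
globals()["test_gladia_3_de"] = make_preset_test("de", load_cases(fixture), normalize)
```

Assigning into `globals()` is what makes pytest collect one independently reported test function per language.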

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

Poem

🐇 I hopped through code with nimble paws,

Neue Worte, números, y accénts I saw.
Maps and rules stitched soft and neat,
CSVs per tongue — tests now meet.
A rabbit cheers: languages complete!

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning: Docstring coverage is 11.11%, which is below the required threshold of 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check ✅ Passed: The title clearly and directly summarizes the main change: adding support for three new languages (German, Italian, Spanish) to the text normalization pipeline.
  • Description check ✅ Passed: The PR description is complete and follows the required template structure with all critical sections filled: a summary of changes, the correct type-of-change checkbox selected, and all applicable new-language checklist items marked complete.



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@normalization/languages/german/replacements.py`:
- Line 9: Remove the unconditional mapping '"bis": "-"' from replacements.py and
instead implement a contextual range rule: only convert "bis" to a range dash
when it appears between two tokens that are numeric, date, or unit types. Locate
the '"bis": "-"' entry and delete it, then update the range-handling logic (the
code that normalizes ranges / applies range rules — e.g., functions that check
neighbor tokens when normalizing tokens) to detect token types on both sides and
replace "bis" with "-" only when both neighbors are numeric/date/unit tokens.

In `@normalization/languages/italian/operators.py`:
- Around line 105-112: The regex at the end of fix_one_word_in_numeric_contexts
currently converts any sentence-start "uno" followed by a lowercase word (regex
^uno\s+(?=[a-z])) which incorrectly changes non-numeric phrases like "uno
studente"; change it to only match when the following token is a digit or a
known Italian number word. Replace the pattern ^uno\s+(?=[a-z]) with something
like ^uno\s+(?=(?:\d|due|tre|quattro|cinque|sei|sette|otto|nove|dieci)\b) (or
use a centralized list variable for number words) so only numeric contexts are
rewritten; keep the rest of the substitutions intact in
fix_one_word_in_numeric_contexts.
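A quick standalone demonstration of the narrowed pattern proposed above (the number-word list is the one from the comment; the real implementation would centralize it):

```python
import re

# Italian number words allowed to follow a leading "uno" (abbreviated list).
NUMBER_WORDS = r"due|tre|quattro|cinque|sei|sette|otto|nove|dieci"
_LEADING_UNO = re.compile(rf"^uno\s+(?=(?:\d|{NUMBER_WORDS})\b)")

def fix_leading_uno(text: str) -> str:
    """Rewrite sentence-start "uno" to "1" only in numeric contexts."""
    return _LEADING_UNO.sub("1 ", text)
```

The lookahead keeps "uno due tre" rewritable while leaving ordinary phrases like "uno studente" alone, which is exactly the failure mode the comment flags in the original `^uno\s+(?=[a-z])` pattern.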

In `@normalization/languages/spanish/number_normalizer.py`:
- Around line 163-168: The current branch handling fw == "veinte" only looks one
word ahead and misses the pattern "veinte y X"; update the block in
number_normalizer.py (the fw == "veinte" branch using variables words, i, n,
_get and _ONES_1_9) to first check if i+2 < n and words[i+1] == "y" and then
call _get(_ONES_1_9, words[i+2]) returning (i+3, 20 + o) when o is not None;
otherwise fall back to the existing checks (single next-word ones digit or
return (i+1, 20)).
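The suggested lookahead logic, isolated into a standalone sketch; the surrounding normalizer class and its `_get` helper are assumed, so a plain dict lookup stands in here:

```python
# Ones digits that can follow "veinte y" or "veinte" (stand-in for _ONES_1_9).
_ONES_1_9 = {"uno": 1, "dos": 2, "tres": 3, "cuatro": 4, "cinco": 5,
             "seis": 6, "siete": 7, "ocho": 8, "nueve": 9}

def parse_veinte(words: list[str], i: int) -> tuple[int, int]:
    """Return (next_index, value) for a "veinte" token at position i."""
    n = len(words)
    # New case first: "veinte y X" consumes three tokens.
    if i + 2 < n and words[i + 1] == "y":
        o = _ONES_1_9.get(words[i + 2])
        if o is not None:
            return i + 3, 20 + o
    # Existing fallbacks: single next-word ones digit, then bare "veinte".
    o = _ONES_1_9.get(words[i + 1]) if i + 1 < n else None
    if o is not None:
        return i + 2, 20 + o
    return i + 1, 20
```

Checking the three-token "veinte y X" pattern before the one-token lookahead is what fixes the miss the review describes, without disturbing the existing branches.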

In `@normalization/languages/spanish/operators.py`:
- Around line 125-132: The final regex in fix_one_word_in_numeric_contexts is
too broad and rewrites ordinary phrases starting with "uno"; replace the pattern
r"^uno\s+(?=[a-z])" with a pattern that only matches when the following token is
a numeric digit or a Spanish number word (e.g. use a lookahead like
(?=\d|\b(uno|dos|tres|cuatro|cinco|seis|siete|ocho|nueve|diez)\b)); update the
substitution accordingly so only true numeric sequences (digits or spelled-out
numbers) are converted.
- Around line 20-24: SPANISH_CONFIG currently sets decimal_word="punto" which
mismatches the decimal_separator=","; update the LanguageConfig instance named
SPANISH_CONFIG to use decimal_word="coma" so numeric tokens like 3,14 normalize
correctly; locate the SPANISH_CONFIG definition (the LanguageConfig(...) call)
and replace the decimal_word value from "punto" to "coma".
- Around line 82-87: The Spanish number_words list is missing accented variants
for the "veinti*" forms; update the number_words collection (variable name:
number_words in normalization/languages/spanish/operators.py) to include the
canonical accented strings "veintidós", "veintitrés", and "veintiséis" alongside
the existing unaccented entries (or replace the unaccented ones with accented
forms) so written-number expansion correctly matches common Spanish text.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a286e177-24b0-45b8-b932-41e4ce0565af

📥 Commits

Reviewing files that changed from the base of the PR and between 8974532 and 8318ca2.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (12)
  • normalization/languages/__init__.py
  • normalization/languages/german/__init__.py
  • normalization/languages/german/operators.py
  • normalization/languages/german/replacements.py
  • normalization/languages/german/sentence_replacements.py
  • normalization/languages/italian/__init__.py
  • normalization/languages/italian/operators.py
  • normalization/languages/italian/replacements.py
  • normalization/languages/spanish/__init__.py
  • normalization/languages/spanish/number_normalizer.py
  • normalization/languages/spanish/operators.py
  • normalization/languages/spanish/replacements.py

@egenthon-cmd egenthon-cmd changed the title from "Feat/add metrics api languages" to "feat: add spanish italian and german language" on Apr 10, 2026
@egenthon-cmd egenthon-cmd changed the title from "feat: add spanish italian and german language" to "feat: add german, italian, and spanish text normalization" on Apr 10, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (1)
tests/e2e/normalization_test.py (1)

35-35: Make pytest case IDs unambiguous within the same language group.

At Line 35, truncating only the input string can produce duplicate IDs, which makes failures harder to triage.

Proposed tweak
-def _case_ids(cases: list[NormalizationTest]) -> list[str]:
-    return [test.input[:60] for test in cases]
+def _case_ids(cases: list[NormalizationTest]) -> list[str]:
+    return [f"{i}:{test.input[:60]}" for i, test in enumerate(cases)]
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/e2e/normalization_test.py` at line 35, The current ID generator returns
[test.input[:60] for test in cases] which can produce duplicates; change it to
include an additional unique context such as the case index and/or language (and
optionally shortened expected) so IDs are unambiguous. For example, build IDs
from the comprehension using enumerate and test.language (and/or test.expected)
like f"{i}-{test.language}-{test.input[:60]}" to replace the existing
[test.input[:60] for test in cases].
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/e2e/normalization_test.py`:
- Around line 80-85: If _GLADIA_3_TESTS is empty the dynamic test generation
loop (where _GLADIA_3_BY_LANGUAGE = _group_by_language(_GLADIA_3_TESTS) and
globals()[f"test_gladia_3_{_language}"] = _make_gladia_3_test(...)) will produce
no tests and silently pass CI; add a guard after computing _GLADIA_3_BY_LANGUAGE
that checks for emptiness (e.g., assert or raise an informative RuntimeError)
mentioning _GLADIA_3_TESTS/_GLADIA_3_BY_LANGUAGE so the test suite fails fast
when no e2e rows are loaded and developers notice missing fixtures.

---

Nitpick comments:
In `@tests/e2e/normalization_test.py`:
- Line 35: The current ID generator returns [test.input[:60] for test in cases]
which can produce duplicates; change it to include an additional unique context
such as the case index and/or language (and optionally shortened expected) so
IDs are unambiguous. For example, build IDs from the comprehension using
enumerate and test.language (and/or test.expected) like
f"{i}-{test.language}-{test.input[:60]}" to replace the existing
[test.input[:60] for test in cases].

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: f8e7a938-b3d8-4715-8bca-8ea51ce18fbf

📥 Commits

Reviewing files that changed from the base of the PR and between 8318ca2 and a6029dc.

⛔ Files ignored due to path filters (1)
  • tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
📒 Files selected for processing (1)
  • tests/e2e/normalization_test.py

@gladiaio gladiaio deleted a comment from coderabbitai bot Apr 13, 2026

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@docs/contributing-guide.md`:
- Around line 188-198: Update the docs example to use the actual helper naming
convention from tests/e2e/normalization_test.py: replace the generic
`_make_test(...)` reference with the preset-specific factory name pattern (e.g.,
`_make_gladia_3_test(...)`) and adjust the generated test function name
accordingly so contributors can copy/paste the snippet and match the real
helpers like `_MY_PRESET_DIR`, `_MY_PRESET_BY_LANGUAGE`, and
`_MY_PRESET_PIPELINES`.
- Around line 166-174: Two fenced code blocks in the contributing-guide use
unlabeled backticks causing markdownlint MD040; update the first block (the
directory listing block that shows "tests/e2e/files/  gladia-3/ ... default.csv
...") to use a language label like "text" and update the CSV example block (the
block starting with "input,expected" and containing "$1,000,000,1000000
dollars") to use "csv" as the fence language; locate these unlabeled
triple-backtick blocks and replace their opening fences with ```text and ```csv
respectively so the examples are explicitly labeled.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 7149f753-abd2-4949-8341-6d4c1a4eab19

📥 Commits

Reviewing files that changed from the base of the PR and between a6029dc and 6934992.

⛔ Files ignored due to path filters (7)
  • tests/e2e/files/gladia-3.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/de.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/default.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/en.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/es.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/fr.csv is excluded by !**/*.csv
  • tests/e2e/files/gladia-3/it.csv is excluded by !**/*.csv
📒 Files selected for processing (6)
  • .github/pull_request_template.md
  • .gitignore
  • AGENTS.md
  • CONTRIBUTING.md
  • docs/contributing-guide.md
  • tests/e2e/normalization_test.py
✅ Files skipped from review due to trivial changes (4)
  • .gitignore
  • CONTRIBUTING.md
  • AGENTS.md
  • .github/pull_request_template.md
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/e2e/normalization_test.py


@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
normalization/languages/spanish/operators.py (2)

132-132: ⚠️ Potential issue | 🟠 Major

Keep the start-of-string uno rewrite constrained to numeric followers.

Line 132 will also turn plain phrases like uno de ellos into 1 de ellos. Restrict that lookahead to digits or Spanish number words only.

💡 Suggested fix
+_SPANISH_NUMBER_WORDS_PATTERN = "|".join(
+    sorted((re.escape(word) for word in SPANISH_CONFIG.number_words), key=len, reverse=True)
+)
+
 `@register_language`
 class SpanishOperators(LanguageOperators):
@@
-        text = re.sub(r"^uno\s+(?=[a-z])", "1 ", text)
+        text = re.sub(
+            rf"^uno\s+(?=\d|\b(?:{_SPANISH_NUMBER_WORDS_PATTERN})\b)",
+            "1 ",
+            text,
+        )
         return text
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/spanish/operators.py` at line 132, The current re.sub
call replaces any leading "uno" before letters (text =
re.sub(r"^uno\s+(?=[a-z])", "1 ", text)), which incorrectly converts phrases
like "uno de ellos"; change the lookahead to only match digits or Spanish number
words so only numeric contexts are rewritten (e.g., use a lookahead like
(?=(\d|\b(dos|tres|cuatro|cinco|seis|siete|ocho|nueve|diez)\b))). Update the
regex in the same re.sub invocation operating on variable text so
"^uno\\s+(?=(\\d|\\b(dos|tres|cuatro|cinco|seis|siete|ocho|nueve|diez)\\b))"
maps to "1 " instead of the broader pattern.

82-87: ⚠️ Potential issue | 🟠 Major

Add the canonical accented veinti* forms.

Lines 82-87 only include the unaccented spellings. Any SPANISH_CONFIG.number_words lookup will miss common inputs like veintidós, veintitrés, and veintiséis.

💡 Suggested fix
         "veinte",
         "veintiuno",
+        "veintidós",
         "veintidos",
+        "veintitrés",
         "veintitres",
         "veinticuatro",
         "veinticinco",
+        "veintiséis",
         "veintiseis",
         "veintisiete",
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@normalization/languages/spanish/operators.py` around lines 82 - 87, The
Spanish number-words list in normalization/languages/spanish/operators.py
currently only has unaccented "veintidos", "veintitres", and "veintiseis" which
will miss common inputs; update the entries used by SPANISH_CONFIG.number_words
(the list containing "veintiuno", "veintidos", "veintitres", "veinticuatro",
"veinticinco", "veintiseis") to include the canonical accented forms
"veintidós", "veintitrés", and "veintiséis" (you can keep the unaccented
variants too if you want to accept both); ensure the updated list preserves
ordering and any existing references to that list remain unchanged.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@normalization/languages/spanish/operators.py`:
- Line 132: The current re.sub call replaces any leading "uno" before letters
(text = re.sub(r"^uno\s+(?=[a-z])", "1 ", text)), which incorrectly converts
phrases like "uno de ellos"; change the lookahead to only match digits or
Spanish number words so only numeric contexts are rewritten (e.g., use a
lookahead like (?=(\d|\b(dos|tres|cuatro|cinco|seis|siete|ocho|nueve|diez)\b))).
Update the regex in the same re.sub invocation operating on variable text so
"^uno\\s+(?=(\\d|\\b(dos|tres|cuatro|cinco|seis|siete|ocho|nueve|diez)\\b))"
maps to "1 " instead of the broader pattern.
- Around line 82-87: The Spanish number-words list in
normalization/languages/spanish/operators.py currently only has unaccented
"veintidos", "veintitres", and "veintiseis" which will miss common inputs;
update the entries used by SPANISH_CONFIG.number_words (the list containing
"veintiuno", "veintidos", "veintitres", "veinticuatro", "veinticinco",
"veintiseis") to include the canonical accented forms "veintidós", "veintitrés",
and "veintiséis" (you can keep the unaccented variants too if you want to accept
both); ensure the updated list preserves ordering and any existing references to
that list remain unchanged.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: a362269d-937a-4b88-aa74-61711a555058

📥 Commits

Reviewing files that changed from the base of the PR and between 6934992 and f33ddb3.

📒 Files selected for processing (1)
  • normalization/languages/spanish/operators.py

@Karamouche Karamouche merged commit 667595f into main Apr 14, 2026
10 checks passed
@Karamouche Karamouche deleted the feat/add-metrics-api-languages branch April 14, 2026 18:24