Merged
3 changes: 2 additions & 1 deletion README.md
@@ -118,10 +118,11 @@ Pipelines are defined declaratively in **YAML presets**. Each preset lists the s
| `it` | Italian |
| `es` | Spanish |
| `nl` | Dutch |
| `sv` | Swedish |

Unsupported language codes fall back to a safe default that applies language-independent normalization only.

-Adding a new language is self-contained — create a folder, register it with a decorator, done. See [Contributing](#adding-a-new-language).
+Adding a new language is self-contained — create a folder, register it with a decorator, done. See [Contributing](CONTRIBUTING.md#add-support-for-a-new-language).

## Custom presets

3 changes: 3 additions & 0 deletions docs/contributing-guide.md
@@ -169,8 +169,11 @@ tests/e2e/files/
default.csv
de.csv
en.csv
es.csv
fr.csv
it.csv
nl.csv
sv.csv
```

**CSV format** — two columns (`input,expected`), no quoting needed unless the value contains a comma:
13 changes: 12 additions & 1 deletion docs/steps.md
@@ -86,6 +86,9 @@ operators.config.pm_word, operators.config.oclock_word, and
operators.get_compound_minutes().
No-op when required config is None.

Regex patterns are compiled once per operators config instance and cached
on the step to avoid recompilation on every call.

### `expand_alphanumeric_codes`

**Base class:** `TextStep`
@@ -329,6 +332,10 @@ Handles ¤ markers by processing segments separately.

Remove currency symbols that are not adjacent to numbers.

Single-character symbols use the between/start/end patterns. Each
multi-character key (e.g. ``kr``) is stripped only when it appears as its own
token (``\b...\b``), so it is not confused with a substring inside a word.
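
The own-token rule for multi-character keys can be sketched as follows. The function name and the digit-adjacency lookarounds are assumptions for illustration; the source only states that such keys are matched with `\b...\b` and that symbols next to numbers are left alone.

```python
import re


def strip_standalone_symbol(text, symbol):
    """Remove a multi-character symbol only where it stands as its own token.

    Hypothetical helper: the lookarounds skip occurrences that sit next to
    a number, optionally separated by one whitespace character.
    """
    pattern = re.compile(
        rf"(?<!\d)(?<!\d\s)\b{re.escape(symbol)}\b(?!\s?\d)"
    )
    return pattern.sub("", text)
```

With `symbol="kr"`, a free-standing `kr` is removed, while `kronor` and `100 kr` are left untouched because the word boundary or the digit check fails.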

### `remove_symbols`

**Base class:** `TextStep`
@@ -376,7 +383,11 @@ No-op when either is None.

**Base class:** `TextStep`

-Replace currency symbols with their corresponding words.
+Replace currency symbols with their corresponding words next to amounts.

Reads ``operators.config.currency_symbol_to_word``. Multi-character symbols
(e.g. ``kr``) are matched with word boundaries so amounts already written as
``… kronor`` are not parsed as ``… kr`` + ``onor``.
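
The word-boundary behaviour can be illustrated with a small sketch. The mapping below is a hypothetical stand-in for `operators.config.currency_symbol_to_word`, the function is not the project's step, and for brevity it only handles a symbol that follows the amount.

```python
import re

# Hypothetical stand-in for operators.config.currency_symbol_to_word.
CURRENCY_SYMBOL_TO_WORD = {"kr": "kronor", "$": "dollar"}


def replace_currency_symbols(text, mapping):
    """Replace a currency symbol that trails an amount with its word."""
    for symbol, word in mapping.items():
        escaped = re.escape(symbol)
        if len(symbol) > 1:
            # Multi-character symbols need a trailing word boundary so that
            # "100 kronor" is not read as "100 kr" followed by "onor".
            pattern = rf"(\d)\s?{escaped}\b"
        else:
            pattern = rf"(\d)\s?{escaped}"
        # A function replacement avoids backslash handling in the word.
        text = re.sub(pattern, lambda m, w=word: f"{m.group(1)} {w}", text)
    return text
```

Without the `\b`, the input `100 kronor` would come out as `100 kronoronor`; with it, the already-expanded amount passes through unchanged.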

### `restore_decimal_separator_with_word`

3 changes: 2 additions & 1 deletion normalization/languages/__init__.py
@@ -1,4 +1,4 @@
-from . import dutch, english, finnish, french, german, italian, spanish
+from . import dutch, english, finnish, french, german, italian, spanish, swedish
from .base import LanguageOperators
from .registry import get_language_registry, register_language

@@ -12,5 +12,6 @@
"german",
"italian",
"spanish",
"swedish",
"get_language_registry",
]
7 changes: 7 additions & 0 deletions normalization/languages/swedish/__init__.py
@@ -0,0 +1,7 @@
from .operators import SwedishOperators
from .replacements import SWEDISH_REPLACEMENTS

__all__ = [
"SwedishOperators",
"SWEDISH_REPLACEMENTS",
]