feat(analyzer): add optional country filter to load_predefined_recogn…#2000

Open
ynachiket wants to merge 6 commits into microsoft:main from
ynachiket:feat/filter-recognizers-by-country-1328

@ynachiket
Contributor

Change Description

Adds an optional countries parameter to RecognizerRegistry.load_predefined_recognizers() that lets callers restrict which country-specific predefined recognizers are loaded, alongside the existing languages filter. Locale-agnostic recognizers (credit cards, emails, URLs, crypto, IBAN, etc.) are always preserved regardless of the filter.

Motivation

Today, load_predefined_recognizers() either loads every country-specific recognizer or none. A US-only or EU-only deployment has to either accept the noise of every country's recognizer, or manually enumerate every recognizer class to reconstruct the registry. This change makes the common "I only care about these locales" case a one-liner.

Approach

  • Country is inferred from the recognizer's module path — the segment directly beneath country_specific/ (e.g. us, uk, es, in, pl, fi, sg, au, it). No changes to any individual recognizer class are required; the existing directory layout is the source of truth.
  • Comparison is case-insensitive (US, Us, us all work).
  • countries=[] keeps only the locale-agnostic recognizers.
  • countries=None (the default) preserves exactly today's behavior — fully backwards compatible. No existing call site needs to change.
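
As a rough illustration of the module-path heuristic described above, a minimal sketch follows. The helper name `infer_country_from_module` and the two example module paths are assumptions made for illustration; they are not the PR's actual `_get_recognizer_country()` code.

```python
from typing import Optional


def infer_country_from_module(module_path: str) -> Optional[str]:
    """Return the segment directly beneath ``country_specific``, lower-cased.

    Returns None when the path has no ``country_specific`` segment, which this
    sketch treats as "locale-agnostic".
    """
    parts = module_path.split(".")
    if "country_specific" not in parts:
        return None
    idx = parts.index("country_specific")
    # The country is the package right after country_specific, if there is one.
    if idx + 1 >= len(parts):
        return None
    return parts[idx + 1].lower()


# Illustrative module paths (assumed, not copied from the repository).
us_module = "presidio_analyzer.predefined_recognizers.country_specific.us.us_ssn_recognizer"
generic_module = "presidio_analyzer.predefined_recognizers.generic.email_recognizer"

print(infer_country_from_module(us_module))
print(infer_country_from_module(generic_module))
```

Because the lookup only inspects the dotted path, it cannot see recognizers that live outside a `country_specific` package, which is the limitation raised later in the review.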

Implementation lives in:

  • presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py — new RecognizerListLoader._get_recognizer_country() and filter_by_countries() helper.
  • presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py — the new countries kwarg threaded through to the loader.

Example

from presidio_analyzer import RecognizerRegistry

registry = RecognizerRegistry()
registry.load_predefined_recognizers(languages=["en"], countries=["us", "uk"])
# -> only US + UK country-specific recognizers, plus all locale-agnostic ones

Tests

Added 5 unit tests in presidio-analyzer/tests/test_recognizer_registry.py:

  • default behavior unchanged when countries is not passed
  • single-country filtering keeps that country + all agnostic recognizers
  • case-insensitive matching ("US" == "us")
  • countries=[] keeps only agnostic recognizers
  • multi-country filtering (["us", "uk"])

All tests in tests/test_recognizer_registry.py pass locally.

Issue reference
Fixes #1328

…izers

Allows callers to narrow the predefined recognizers loaded by
RecognizerRegistry.load_predefined_recognizers() to a subset of
countries, alongside the existing language filter.

Country is inferred from the recognizer's module path (the segment
directly under `country_specific/`), so no changes to individual
recognizer classes are required. Locale-agnostic recognizers are
always preserved.

Fixes microsoft#1328
@ynachiket
Contributor Author

@microsoft-github-policy-service agree

@omri374
Collaborator

omri374 commented Apr 24, 2026

Thanks! This is great. One comment, which might not be easily solvable: if we're using the module path, it wouldn't catch any custom recognizer that a user adds, so filtering on those would not work. Do you have any suggestions on how to include custom country recognizers as well?

@ynachiket
Contributor Author

Great callout. You're right that module-path inference is a closed-world heuristic that won't catch user-added custom recognizers. I'd like to propose making country a first-class property of a recognizer rather than keeping it as an inference rule. Three ways to do that, in rough order of invasiveness:

Option A — Opt-in country attribute on EntityRecognizer (recommended)

Add an optional country: Optional[str] = None kwarg to EntityRecognizer.__init__, default None (= locale-agnostic). Predefined country-specific recognizers set it explicitly in their own __init__ (or we can batch-set it for the country_specific/* tree in one commit). Custom recognizers opt in the same way:

class MyBrazilPassportRecognizer(PatternRecognizer):
    def __init__(self):
        super().__init__(
            supported_entity="BR_PASSPORT",
            patterns=[...],
            country="br",
        )

The filter then uses getattr(recognizer, "country", None) — no module-path tricks. Module-path inference in this PR becomes a transitional fallback (or gets removed entirely once the predefined recognizers are migrated).

Pros: explicit, discoverable, works uniformly for built-in and custom, plays well with introspection/serialization. Cons: touches the base class (small additive change, fully backwards compatible); predefined recognizers need a one-time sweep to set the attribute.

Option B — Declarative country in the YAML / dict config
Extend PatternRecognizer.from_dict (and the YAML schema under conf/example_recognizers.yaml) to accept a country field, which sets the attribute from Option A:

recognizers:
  - name: "BR Passport"
    supported_language: en
    supported_entity: BR_PASSPORT
    country: br
    patterns: [...]

Composes naturally with Option A — same attribute, just a declarative path for ops folks who wire recognizers via config.

Option C — Pluggable country resolver callback
Keep the filter but let callers pass a country_of: Callable[[EntityRecognizer], Optional[str]] so teams with custom recognizers can supply their own classifier without touching the base class:

registry.load_predefined_recognizers(countries=["us", "br"])
registry.filter_by_countries(["us", "br"], country_of=my_resolver)

Pros: zero changes to recognizer classes. Cons: less discoverable; every user has to write glue code.
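
To make the glue-code cost of Option C concrete, here is a minimal sketch of a resolver-driven filter. `StubRecognizer`, `filter_with_resolver`, and the name-prefix convention inside `my_resolver` are all hypothetical, and treating a `None` result as "keep" is an assumption of this sketch rather than something Option C specifies.

```python
from typing import Callable, Iterable, List, Optional


class StubRecognizer:
    """Stand-in for a recognizer; only carries a name for this sketch."""

    def __init__(self, name: str) -> None:
        self.name = name


def filter_with_resolver(
    recognizers: Iterable[StubRecognizer],
    countries: Iterable[str],
    country_of: Callable[[StubRecognizer], Optional[str]],
) -> List[StubRecognizer]:
    """Keep recognizers whose resolved country is requested or unknown (None)."""
    wanted = {c.lower() for c in countries}
    kept = []
    for recognizer in recognizers:
        country = country_of(recognizer)
        if country is None or country.lower() in wanted:
            kept.append(recognizer)
    return kept


def my_resolver(recognizer: StubRecognizer) -> Optional[str]:
    # Hypothetical team convention: an "<ISO>_" name prefix marks the country.
    prefix = recognizer.name.split("_", 1)[0].lower()
    return prefix if prefix in {"us", "br", "uk"} else None


recognizers = [StubRecognizer("BR_PASSPORT"), StubRecognizer("UK_NHS"), StubRecognizer("EMAIL")]
kept = filter_with_resolver(recognizers, ["us", "br"], country_of=my_resolver)
kept_names = [r.name for r in kept]
print(kept_names)
```

Every team would have to write and maintain its own `my_resolver`, which is the discoverability cost noted above.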

What do you think?

@omri374
Collaborator

omri374 commented Apr 24, 2026

Thanks. Options A and B fit together (I don't see them as competing alternatives). Once country_code is part of the EntityRecognizer class, you don't have to use getattr(recognizer, "country_code", None); you could simply use recognizer.country_code. Still, there are a few open questions:

  1. How do we handle generic recognizers? (e.g. credit card, date)
  2. There's still the question of discoverability. If someone creates a new br recognizer but doesn't know about the country_code field, they might not get the full list of recognizers they're looking for.
  3. How do we do this in a backward-compatible way? I.e., users already have their own list of custom recognizers but did not populate the country_code field. Do we assume they don't filter by country anyway?

Option C is interesting, but I agree with you that this requires the user to figure out the mechanism (callable is a generic interface) and use it.

I'm not sure there are great answers to those questions, just wanted to get your perspective and make sure we think about this from multiple angles.

@ynachiket
Contributor Author

Some options:

1. Generic recognizers (credit card, date, email, URL, IBAN, crypto)

Proposal: country_code = None means "locale-agnostic, always included regardless of the filter."

So None is the default in EntityRecognizer.__init__, and the filter's rule becomes:

  • country_code is None → always kept (agnostic).
  • country_code is not None → kept only if in the requested set.

This keeps the filter's behavior intuitive ("filter country-specific recognizers; leave the generic ones alone") and matches what this PR already does for agnostic built-ins. It also avoids magic sentinels like "agnostic" / "*", which tend to leak into configs and get misspelled.
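
The rule above fits in a few lines. The sketch below uses a stand-in `StubRecognizer` with a plain `country_code` attribute and a hypothetical `keep_for_countries` helper; it illustrates the proposed semantics and is not the code in this PR.

```python
from typing import Iterable, List, Optional


class StubRecognizer:
    """Stand-in recognizer with the proposed optional country_code attribute."""

    def __init__(self, name: str, country_code: Optional[str] = None) -> None:
        self.name = name
        self.country_code = country_code


def keep_for_countries(
    recognizers: Iterable[StubRecognizer],
    countries: Optional[Iterable[str]],
) -> List[StubRecognizer]:
    """Apply the proposed rule: None is always kept, tagged ones must match."""
    if countries is None:
        # No filter requested: identical to today's behavior.
        return list(recognizers)
    wanted = {c.lower() for c in countries}
    return [
        r
        for r in recognizers
        if r.country_code is None or r.country_code.lower() in wanted
    ]


registry = [
    StubRecognizer("CreditCard"),
    StubRecognizer("UsSsn", country_code="us"),
    StubRecognizer("UkNhs", country_code="uk"),
]

all_names = [r.name for r in keep_for_countries(registry, None)]
us_names = [r.name for r in keep_for_countries(registry, ["US"])]
generic_names = [r.name for r in keep_for_countries(registry, [])]
print(all_names, us_names, generic_names)
```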

Edge case worth flagging: a few built-ins are geographically skewed but not country-bound — IBAN is the usual example (European-ish but not one country), crypto addresses are global-ish. I'd keep those as country_code=None (the status quo) rather than introduce country_code: List[str], because:

  • A list-valued country_code doubles the complexity of the filter logic for one or two edge cases.
  • Users who truly want to drop IBAN in a US-only deployment already have a cleaner lever: the entities=[...] filter at analyze time, or registry.remove_recognizer("IbanRecognizer").

If that trade-off bites enough users later, promoting country_code to Optional[Union[str, List[str]]] is an additive, backwards-compatible change for a future PR.

2. Discoverability

You're right that this is the weakest link of any opt-in metadata field. I don't think it's fully solvable — we can only reduce the failure mode. Concretely, I'd layer four things:

  1. Clear docstring + type hint on EntityRecognizer.__init__ explaining the field and linking to the "filter by country" doc section.
  2. Migrate country_specific/* built-ins to set country_code explicitly, so every predefined country-specific recognizer is a worked example. Anyone reading those for inspiration sees the pattern.
  3. A registry.get_country_codes() -> Set[str] helper so users can print(registry.get_country_codes()) in a REPL and immediately see what's tagged. Cheap to add, big debugging win.
  4. Runtime feedback when the filter "misses": if a user passes countries=["br"] and the filter removes zero recognizers from the BR bucket (i.e. no recognizer had country_code="br"), log a WARNING listing what was found, plus a one-line hint: "If you have custom BR recognizers, set country_code='br' on them to include in filtering."

That combination won't be perfect, but it'll catch the common footgun where someone filters, sees a short list, and silently wonders why their custom recognizer is missing.
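
Items 3 and 4 could look roughly like the sketch below. `loaded_country_codes` and `warn_on_unmatched` are hypothetical helper names, `StubRecognizer` is a stand-in, and the log wording is only an example.

```python
import logging
from typing import Iterable, List, Optional, Set

logger = logging.getLogger("country_filter_sketch")


class StubRecognizer:
    """Stand-in recognizer with an optional country_code attribute."""

    def __init__(self, name: str, country_code: Optional[str] = None) -> None:
        self.name = name
        self.country_code = country_code


def loaded_country_codes(recognizers: Iterable[StubRecognizer]) -> Set[str]:
    """Collect the lower-cased country codes that are actually tagged."""
    return {r.country_code.lower() for r in recognizers if r.country_code is not None}


def warn_on_unmatched(
    recognizers: Iterable[StubRecognizer], countries: Iterable[str]
) -> List[str]:
    """Log a WARNING for each requested country that no recognizer is tagged with."""
    found = loaded_country_codes(list(recognizers))
    missing = sorted({c.lower() for c in countries} - found)
    for code in missing:
        logger.warning(
            "No recognizer is tagged with country_code=%r (tagged countries found: %s). "
            "If you have custom recognizers for it, set country_code on them.",
            code,
            sorted(found),
        )
    return missing


registry = [StubRecognizer("UsSsn", "us"), StubRecognizer("Email")]
codes = loaded_country_codes(registry)
missing = warn_on_unmatched(registry, ["us", "BR"])
print(codes, missing)
```

Returning the list of unmatched codes alongside the log line keeps the check easy to assert on in unit tests.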

3. Backwards compatibility for untagged custom recognizers

This is the most consequential question, and I think the cleanest answer is the one implied by the rule in (1):

An untagged recognizer (country_code is None) is treated as locale-agnostic and is always kept, regardless of the requested countries. Same rule as a built-in generic recognizer.

To your phrasing — "do we assume they don't filter by country anyway?" — yes, exactly. A user who hasn't adopted the field hasn't opted into the filtering contract, and the safest interpretation is "this recognizer applies everywhere." That has the nice property of giving us one rule for both generics and untagged custom recognizers — easy to document, easy to reason about.

The obvious failure mode: a user has a custom BrPassportRecognizer without country_code set, calls countries=["us"], and their BR recognizer incorrectly sticks around. That's a real but visible problem — the recognizer fires on the wrong input, they notice, they grep the docs for "country", and they find the field. It's a much less dangerous failure mode than the alternative (strict: drop all untagged recognizers), which would silently break existing registries the moment someone turns on the filter.

If we want to nudge harder, option 2.4 above (runtime warning when the filter returns zero hits for a requested country) covers the common case where they pass a country they expected to match.

Summary of proposed defaults

| Scenario | country_code | Filter behavior |
| --- | --- | --- |
| Built-in generic (CC, date, email, URL, IBAN, crypto) | None (unchanged) | Always included |
| Built-in country-specific (after migration) | "us", "uk", etc. | Included iff in countries |
| Custom recognizer with no opt-in | None | Always included (same as generic) |
| Custom recognizer that opts in | user-set | Included iff in countries |
| countries=None (default) | any | All included (identical to today) |
| countries=[] | any | Only country_code is None kept |

Net effect:

  • Zero breaking change for anyone not passing countries=.
  • Zero breaking change for anyone passing countries= who hasn't tagged custom recognizers (they just continue to run).
  • Gradual opt-in: as folks tag their custom recognizers, filtering gets more precise.

Happy to do this as a single follow-up PR once this one lands (keeping this PR as the "module-path inference for the built-ins" step, which is then superseded by the attribute in PR #2), or collapse both into this PR if you'd prefer one bigger change. Slight lean toward the two-PR path because PR #2 will touch every file under country_specific/ and I'd rather keep the review surfaces separable. Your call.

@omri374
Collaborator

omri374 commented Apr 25, 2026

Thanks!
So calling with None returns all recognizers, country specific or not, and calling with a country code returns all recognizers with this country value + all recognizers with None. Right?

Please also add a doc section + update the recognizers yaml + contributing.md to make sure we set this rule properly from now on.

@ynachiket
Contributor Author

Confirmed — that's exactly the rule:

  • countries=None (default) → everything is kept (country-specific + agnostic / untagged). Identical to today's behavior.
  • countries=["us"] → recognizers with country_code="us" plus recognizers with country_code=None (i.e. generic built-ins and custom recognizers that haven't opted in).
  • countries=[] → only country_code=None kept.

The one-line invariant is "untagged = always included," which is what keeps the gradual migration backwards-compatible.

On the three asks:

Doc section — a new "Filtering recognizers by country" page (or subsection of analyzer customization). Will cover the country_code attribute on EntityRecognizer, the three countries= cases above, and a short troubleshooting note for "I filtered and my custom recognizer disappeared" (answer: set country_code on it).

Recognizers YAML — annotate the predefined entries under country_specific/* with country_code: <iso>, plus one example of a custom recognizer declaring country_code so the YAML doubles as a worked example.

CONTRIBUTING.md — short subsection under "adding a new recognizer": any new country-specific recognizer must declare country_code (matching its country_specific/<iso>/ directory); generic recognizers leave it as None.

One scope question before I start: the doc / YAML / CONTRIBUTING work only really makes sense once country_code is a first-class attribute on EntityRecognizer (currently this PR is the module-path heuristic only). Two ways to handle that:

  • A — single PR: add the country_code attribute + migrate the predefined recognizers + all three docs into this PR.
  • B — two PRs: land this one as-is, follow-up PR adds the attribute + migration + the three docs.

Slight lean toward B for review hygiene (the migration touches every file under country_specific/, which I'd rather keep separable from the filter logic). Happy with A if you'd rather have everything together. Your call.

@omri374
Collaborator

omri374 commented Apr 27, 2026

Thanks! Sounds like a plan! Let's go with A, so we don't introduce changes and then remove them. It's OK if the PR is slightly big because it touches every predefined country recognizer.

omri374 and others added 2 commits April 27, 2026 12:08
Adopts plan A from PR microsoft#2000 review: rather than inferring country from
module path, country information is now a first-class attribute on
EntityRecognizer, declared once and resolved consistently everywhere.

Changes:

- EntityRecognizer
  - New optional `country_code` constructor argument and instance attribute
    (lower-cased ISO-3166 alpha-2 by convention; None means generic).
  - New `COUNTRY_CODE` ClassVar so subclasses can declare their country
    once at the class level without overriding `__init__`.
  - `to_dict()` now serializes `country_code` when set.
- PatternRecognizer
  - Forwards a new `country_code` constructor argument to the base class so
    YAML-defined custom recognizers can declare their country.
- All 60 predefined country-specific recognizers now declare
  `COUNTRY_CODE` as a class-level attribute (au, ca, de, es, fi, in, it,
  kr, ng, pl, se, sg, th, tr, uk, us). Generic recognizers (credit card,
  email, phone, IP, IBAN, dates, etc.) intentionally remain unset.
- RecognizerRegistry
  - `_get_recognizer_country` now prefers `recognizer.country_code`, with
    legacy module-path inference kept as a fallback for any user-defined
    recognizer that has not yet adopted the attribute.
  - `_prepare_recognizer_kwargs` drops `country_code` from kwargs when the
    target class's `__init__` does not accept it, so existing predefined
    recognizers that have not yet been migrated to accept the kwarg keep
    working unchanged.
  - `filter_by_countries` now keeps generic (None) recognizers and filters
    only country-tagged ones; emits a WARNING when a requested country
    matches zero recognizers, to aid discoverability.
  - New `RecognizerRegistry.get_country_codes()` helper returns the set of
    country codes currently loaded.
- Configuration
  - `example_recognizers.yaml` documents the `country_code` field and adds
    a `BR CPF` example recognizer that declares `country_code: "br"`.
- Documentation
  - New `docs/analyzer/filtering_by_country.md` page covering the country
    filter, how to declare `country_code` on custom recognizers, when to
    leave it unset, backwards compatibility, and debugging tips.
  - `docs/analyzer/index.md` and `mkdocs.yml` link to the new page.
  - `docs/analyzer/adding_recognizers.md` documents the `COUNTRY_CODE`
    convention for predefined recognizers.
  - `CONTRIBUTING.md` adds a "Contributing a New Predefined Recognizer"
    rule requiring `COUNTRY_CODE` for country-specific recognizers.
- Tests
  - Extended `test_recognizer_registry.py` with coverage for attribute-
    based filtering, generic (untagged) recognizers, the warning path,
    and `get_country_codes()`.
- CHANGELOG: entry under [unreleased] / Analyzer / Added.

Behavior:
- `load_predefined_recognizers(countries=None)` -> all recognizers
  (country-tagged + generic), unchanged from before.
- `load_predefined_recognizers(countries=["us"])` -> US-tagged
  recognizers + all generic (None) recognizers.
- Backwards compatible: any recognizer that does not set country_code
  (custom or otherwise) is treated as generic and is always returned.

Closes microsoft#1328 (analyzer side).

Signed-off-by: Nachiket Torwekar <nachiket.torwekar@gmail.com>
Made-with: Cursor
@ynachiket
Contributor Author

Thanks for the steer! Pushed 54d0644 implementing plan A end-to-end.

EntityRecognizer

  • Added optional country_code constructor arg + instance attribute (lower-cased ISO-3166 alpha-2 by convention; None = generic).
  • Added COUNTRY_CODE: ClassVar so subclasses can declare country once, without overriding __init__.
  • to_dict() now serializes country_code when set.

PatternRecognizer

  • Forwards a country_code kwarg to the base class so YAML-defined custom recognizers can declare it too.

Predefined recognizers

  • All 60 country-specific predefined recognizers now declare COUNTRY_CODE as a class-level attribute (au, ca, de, es, fi, in, it, kr, ng, pl, se, sg, th, tr, uk, us). Generic recognizers (credit card, email, phone, IP, IBAN, dates, …) intentionally remain unset.

RecognizerRegistry

  • _get_recognizer_country now prefers recognizer.country_code; legacy module-path inference is kept as a fallback for any user-defined recognizer that hasn't adopted the attribute, so existing custom registries keep working.
  • _prepare_recognizer_kwargs drops country_code from kwargs when the target class's __init__ doesn't accept it (avoids TypeError on predefined recognizers that don't take it through their constructor).
  • filter_by_countries now keeps generic (None) recognizers and filters only country-tagged ones; emits a WARNING log when a requested country matches zero recognizers (discoverability).
  • New RecognizerRegistry.get_country_codes() helper returns the set of country codes currently loaded.
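
The thread does not show the body of `_prepare_recognizer_kwargs`, so the following is only one plausible way to drop a kwarg that a constructor does not accept, using `inspect.signature`. The helper name `drop_unsupported_kwarg` and both recognizer classes are hypothetical.

```python
import inspect
from typing import Any, Dict, Optional


def drop_unsupported_kwarg(
    cls: type, kwargs: Dict[str, Any], name: str = "country_code"
) -> Dict[str, Any]:
    """Return a copy of kwargs without `name` if cls.__init__ cannot accept it."""
    params = inspect.signature(cls.__init__).parameters
    # A **kwargs catch-all means the constructor can accept any keyword.
    accepts_any = any(
        p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()
    )
    if name in kwargs and name not in params and not accepts_any:
        return {k: v for k, v in kwargs.items() if k != name}
    return dict(kwargs)


class LegacyRecognizer:
    """Constructor that predates the country_code kwarg."""

    def __init__(self, supported_language: str = "en") -> None:
        self.supported_language = supported_language


class MigratedRecognizer:
    """Constructor that already accepts country_code."""

    def __init__(
        self, supported_language: str = "en", country_code: Optional[str] = None
    ) -> None:
        self.supported_language = supported_language
        self.country_code = country_code


raw = {"supported_language": "en", "country_code": "us"}
legacy = LegacyRecognizer(**drop_unsupported_kwarg(LegacyRecognizer, raw))
migrated = MigratedRecognizer(**drop_unsupported_kwarg(MigratedRecognizer, raw))
print(legacy.supported_language, migrated.country_code)
```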

Behavior summary

  • load_predefined_recognizers(countries=None) → all recognizers (country-tagged + generic). Unchanged.
  • load_predefined_recognizers(countries=["us"]) → US-tagged recognizers + all generic recognizers.
  • Backwards compatible: any recognizer without country_code is treated as generic and is always returned.

Config / YAML

  • example_recognizers.yaml documents the new country_code field and adds a BR CPF recognizer that sets country_code: "br" to demonstrate custom usage.

Docs

  • New page: docs/analyzer/filtering_by_country.md covering how the filter works, declaring country_code on custom recognizers (class attribute, constructor arg, YAML), when to leave it unset, backwards-compat, and debugging tips.
  • docs/analyzer/index.md and mkdocs.yml link to the new page.
  • docs/analyzer/adding_recognizers.md documents the COUNTRY_CODE convention for predefined recognizers.
  • CONTRIBUTING.md adds a "Contributing a New Predefined Recognizer" rule requiring COUNTRY_CODE for country-specific additions.

Tests

  • Extended tests/test_recognizer_registry.py with coverage for attribute-based filtering, untagged/generic recognizers, the zero-match warning path, and get_country_codes().

CHANGELOG

  • Entry under [unreleased] / Analyzer / Added.

Local verification: 2126 passed, 60 skipped on presidio-analyzer/tests (the skipped/ignored modules are gated on optional dependencies — azure_ai_language, azure_auth_helper, ahds, langextract, nlp_engine_provider, slim_spacy_nlp_engine — and aren't touched by this change). ruff check and ruff format --check are clean on the modified files.

Diff is ~73 files / +695/-26 as you anticipated. Happy to split it (e.g., base attribute + filter logic in one commit, predefined-recognizer migration in a follow-up) if that helps review — just say the word.

Collaborator

@omri374 omri374 left a comment

Added a few comments, thanks!

Comment thread docs/analyzer/filtering_by_country.md

MIN_SCORE = 0
MAX_SCORE = 1.0
COUNTRY_CODE: ClassVar[Optional[str]] = None
Collaborator

Having a class var is definitely the right way here, but how do we make this discoverable? Having half the recognizers implement it and half keep it as None doesn't take us very far. Any suggestions?
I was thinking along these lines, maybe:

from typing import ClassVar

class BaseRecognizer:
    COUNTRY_CODE: ClassVar[str | None] = None

    @classmethod
    def country_code(cls) -> str | None:
        return cls.COUNTRY_CODE

    @classmethod
    def is_country_specific(cls) -> bool:
        return cls.COUNTRY_CODE is not None

WDYT?

@SharonHart, how do you see this?

Contributor Author

  • EntityRecognizer now exposes COUNTRY_CODE: ClassVar[Optional[str]] = None as the single canonical declaration.
  • Added @classmethod country_code(cls) -> Optional[str] (lower-cases the class attr) and @classmethod is_country_specific(cls) -> bool.
  • Removed the country_code constructor kwarg, the self.country_code instance attribute, the country_code field in to_dict(), and the _prepare_recognizer_kwargs shim that swallowed the YAML kwarg.
  • Country tagging is now intentionally a class-level fact: instance overrides are a no-op, so COUNTRY_CODE is hard to miss when reading the registry / docs / a recognizer class definition.

Contributor Author

Will wait for Sharon's POV on the YAML aspect

Contributor Author

Let me know how you want to proceed.

recognizers = RecognizerListLoader.get(**configuration)

if countries is not None:
    recognizers = RecognizerListLoader.filter_by_countries(
Collaborator

instead of introducing this logic here, can we inject the country codes to the configuration and have the RecognizerListLoader do all the work, like we do with languages?
This essentially means that the country filter is now global in the yaml, which I feel is a step in the right direction, similar to how languages are filtering recognizers.

Contributor Author

  • RecognizerListLoader.get(...) gained supported_countries: Optional[Iterable[str]] = None. Filtering happens inline next to _is_language_supported_globally.
  • RecognizerRegistry.load_predefined_recognizers(countries=...) simply forwards into registry_configuration["supported_countries"]. The post-hoc filter_by_countries call site in recognizer_registry.py is gone.
  • The filter is now also driveable via a top-level supported_countries: ["us", "uk"] field in the registry YAML, mirroring supported_languages. Documented in docs/analyzer/filtering_by_country.md and conf/example_recognizers.yaml.

…country filter into RecognizerListLoader

Addresses both review comments on PR microsoft#2000:

1. Discoverability — country tag is now a fact about the recognizer
   *class*, not an instance attribute. ``EntityRecognizer`` exposes the
   ``COUNTRY_CODE`` ClassVar as the canonical declaration, and reads it
   back through two new classmethods so callers / new contributors land
   on a named API:

   - ``EntityRecognizer.country_code() -> Optional[str]`` — lower-cased
     ISO code or ``None``.
   - ``EntityRecognizer.is_country_specific() -> bool`` — named
     predicate; the filter logic uses this so the field is hard to miss
     when reading the registry / docs.

   The ``country_code`` constructor kwarg, the ``self.country_code``
   instance attribute, the ``country_code`` field in ``to_dict()``, and
   the temporary ``_prepare_recognizer_kwargs`` shim that swallowed the
   YAML kwarg are all removed. Setting ``COUNTRY_CODE`` on a subclass is
   now the only supported way to tag a recognizer; instance-level
   overrides are intentionally a no-op so the country tag is a single
   source of truth at the type level.

2. Threading the filter through ``RecognizerListLoader`` — instead of
   running the country filter post-hoc in
   ``RecognizerRegistry.load_predefined_recognizers``, the filter is now
   applied inline inside ``RecognizerListLoader.get(...)``, exactly
   alongside ``_is_language_supported_globally``. Concretely:

   - ``RecognizerListLoader.get`` gains a
     ``supported_countries: Optional[Iterable[str]] = None`` kwarg.
   - Threaded through ``RecognizerConfigurationLoader`` automatically
     (no schema changes needed — the configuration dict simply carries
     the new key when set).
   - ``RecognizerRegistry.load_predefined_recognizers(countries=...)``
     forwards into ``registry_configuration["supported_countries"]``.
     The post-hoc ``filter_by_countries`` call site in
     ``recognizer_registry.py`` is removed.
   - The filter is now also driveable from a top-level
     ``supported_countries: ["us", "uk"]`` field in the registry YAML,
     mirroring how ``supported_languages`` works. Documented in
     ``docs/analyzer/filtering_by_country.md`` and
     ``conf/example_recognizers.yaml``.

Other follow-ups in this commit:

- ``filter_by_countries`` survives as the helper called internally by
  ``RecognizerListLoader.get``; it now reads country via the
  classmethod (``recognizer.country_code()``) rather than poking at an
  instance attribute. Module-path inference is kept as a transitional
  fallback for any custom recognizer that follows the
  ``country_specific/<iso>/`` directory convention but hasn't adopted
  the class attribute.
- ``RecognizerRegistry.get_country_codes()`` reads via the classmethod.
- ``example_recognizers.yaml`` drops the BR CPF custom example that
  relied on the now-removed YAML ``country_code:`` field; that field
  was only ever a thin wrapper over the constructor kwarg, and the
  per-recognizer YAML route is intentionally left out of this commit
  pending follow-up review feedback.
- Docs (``filtering_by_country.md``, ``adding_recognizers.md``,
  ``CONTRIBUTING.md``) updated to point exclusively at the ClassVar +
  classmethod API.
- Tests rewritten against the classmethod API; tests for the removed
  constructor kwarg / instance attribute are dropped, and a new test
  exercises ``RecognizerListLoader.get(supported_countries=...)``
  directly to lock in the loader-level filter contract.

Behavior is unchanged from the prior commit on this branch:
``countries=None`` → all recognizers; ``countries=["us"]`` → US-tagged
+ all locale-agnostic recognizers; untagged recognizers are always
kept.

Signed-off-by: Nachiket Torwekar <nachiket.torwekar@gmail.com>
Made-with: Cursor
@ynachiket
Contributor Author

Thanks for the review @omri374! 1375fbb addresses both threads. Summary:

1. Discoverability — went with the classmethod shape, but made it the only path.

Rather than COUNTRY_CODE ClassVar + a parallel country_code instance attribute (which would have given two competing APIs), COUNTRY_CODE is now the single source of truth at the class level. The country_code constructor kwarg, instance attribute, to_dict field, and the _prepare_recognizer_kwargs shim that swallowed the YAML kwarg are all removed.

EntityRecognizer exposes:

COUNTRY_CODE: ClassVar[Optional[str]] = None

@classmethod
def country_code(cls) -> Optional[str]:
    code = cls.COUNTRY_CODE
    return code.lower() if isinstance(code, str) else code

@classmethod
def is_country_specific(cls) -> bool:
    return cls.country_code() is not None
    
The filter logic now reads recognizer.country_code() everywhere; instance-level overrides are intentionally a no-op so country tagging is unambiguously a fact about the recognizer class. That pushes the field's discoverability up the stack: every code path that cares about country lands on is_country_specific() / country_code(), which makes the missing-attribute case much harder to overlook in review and at runtime.
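
For anyone skimming, here is a self-contained sketch of how the class-level tag behaves. `BaseRecognizerSketch` stands in for EntityRecognizer and the two subclasses are made up; only the two classmethods mirror the snippet above.

```python
from typing import ClassVar, Optional


class BaseRecognizerSketch:
    """Stand-in for EntityRecognizer carrying only the country API."""

    COUNTRY_CODE: ClassVar[Optional[str]] = None

    @classmethod
    def country_code(cls) -> Optional[str]:
        code = cls.COUNTRY_CODE
        return code.lower() if isinstance(code, str) else code

    @classmethod
    def is_country_specific(cls) -> bool:
        return cls.country_code() is not None


class ItFiscalCodeSketch(BaseRecognizerSketch):
    # Declared once at class level; upper case is normalized on read.
    COUNTRY_CODE = "IT"


class EmailSketch(BaseRecognizerSketch):
    # Generic recognizer: leaves COUNTRY_CODE unset.
    pass


instance = ItFiscalCodeSketch()
# Assigning on the instance only shadows the name on that object; the
# classmethod still reads the class attribute, so the tag does not change.
instance.COUNTRY_CODE = "us"

print(instance.country_code(), ItFiscalCodeSketch.is_country_specific())
print(EmailSketch.country_code(), EmailSketch.is_country_specific())
```

Because `country_code()` is a classmethod, calling it on an instance still resolves against the class, which is what makes the instance-level override a no-op.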

2. Filter is now threaded through RecognizerListLoader exactly like supported_languages.

  • RecognizerListLoader.get(...) gained supported_countries: Optional[Iterable[str]] = None. The filter runs inline alongside _is_language_supported_globally, no separate code path.
  • The configuration loader threads it automatically, so the same filter is also driveable from the top-level YAML (supported_countries: ["us", "uk"]) the same way supported_languages is.
  • RecognizerRegistry.load_predefined_recognizers(countries=...) is now just a thin pass-through into registry_configuration["supported_countries"]. The post-hoc filter_by_countries(...) call site in recognizer_registry.py is gone.
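
A toy version of the loader-level flow, to show the language and country checks running in one pass from a single configuration dict. `load_recognizers`, the `LANGUAGE` class attribute, and the `recognizer_classes` key are assumptions for this sketch; only the `supported_languages` and `supported_countries` keys come from the description above.

```python
from typing import ClassVar, Iterable, List, Optional, Type


class SketchRecognizer:
    """Stand-in recognizer class with a language and an optional country tag."""

    COUNTRY_CODE: ClassVar[Optional[str]] = None
    LANGUAGE: ClassVar[str] = "en"


class UsSsnSketch(SketchRecognizer):
    COUNTRY_CODE = "us"


class EsNifSketch(SketchRecognizer):
    COUNTRY_CODE = "es"
    LANGUAGE = "es"


class EmailSketch(SketchRecognizer):
    pass


def load_recognizers(
    recognizer_classes: Iterable[Type[SketchRecognizer]],
    supported_languages: Iterable[str],
    supported_countries: Optional[Iterable[str]] = None,
) -> List[SketchRecognizer]:
    """Instantiate classes that pass the language check and the country check."""
    languages = set(supported_languages)
    countries = (
        None if supported_countries is None else {c.lower() for c in supported_countries}
    )
    loaded = []
    for cls in recognizer_classes:
        if cls.LANGUAGE not in languages:
            continue
        code = cls.COUNTRY_CODE
        # Untagged classes always pass; tagged ones must be requested.
        if countries is not None and code is not None and code.lower() not in countries:
            continue
        loaded.append(cls())
    return loaded


# The same dict could come from Python kwargs or from a parsed YAML file.
configuration = {
    "recognizer_classes": [UsSsnSketch, EsNifSketch, EmailSketch],
    "supported_languages": ["en", "es"],
    "supported_countries": ["us", "uk"],
}
loaded_names = [type(r).__name__ for r in load_recognizers(**configuration)]
print(loaded_names)
```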

@omri374
Collaborator

omri374 commented May 4, 2026

Thanks. Busy week, we'll review ASAP.

# ``supported_languages`` so the filter is applied inside
# ``RecognizerListLoader.get(...)`` and behaves uniformly
# whether driven from Python or from a YAML config file.
registry_configuration["supported_countries"] = countries
Collaborator

Please update default_recognizers.yaml with the country tag too. This would make this more explicit for those using the no-code yaml approach

Collaborator

Also make sure that there is input validation. What if the user inputs two countries? (Actually, in some cases this could be applicable, like a phone number recognizer, but for now let's put it aside.) Worst case, the user could define two recognizers, one for each country.
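
One possible shape for that validation is sketched below. `validate_country_tag` is a hypothetical helper, and the two-letter check is an assumption based on the lower-cased ISO-3166 alpha-2 convention mentioned in the commit message. It deliberately checks shape only, since a code such as "uk" is used in this repository without being the official ISO code.

```python
from typing import Optional


def validate_country_tag(value: object) -> Optional[str]:
    """Return a normalized single country tag, or None for generic recognizers.

    Raises TypeError for non-string values (for example a list of two
    countries) and ValueError for strings that are not two ASCII letters.
    """
    if value is None:
        return None
    if not isinstance(value, str):
        raise TypeError(
            f"country tag must be a single string or None, got {type(value).__name__}"
        )
    code = value.strip().lower()
    if len(code) != 2 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"country tag must be a two-letter code, got {value!r}")
    return code


print(validate_country_tag("IT"))
print(validate_country_tag(None))

try:
    validate_country_tag(["us", "ca"])
except TypeError as err:
    # Two countries are rejected; define one recognizer per country instead.
    print(f"rejected: {err}")
```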

:param supported_entity: The entity this recognizer can detect
"""

COUNTRY_CODE = "it"
Collaborator

@omri374
Collaborator

omri374 commented May 5, 2026

@ynachiket apologies for the delay. I've added a few minor comments but it's mostly ready. Thanks!

Development

Successfully merging this pull request may close these issues.

Filter recognizers based on locale/country

2 participants