feat(analyzer): add optional country filter to load_predefined_recogn…#2000
feat(analyzer): add optional country filter to load_predefined_recogn…#2000ynachiket wants to merge 6 commits intomicrosoft:mainfrom
Conversation
…izers Allows callers to narrow the predefined recognizers loaded by RecognizerRegistry.load_predefined_recognizers() to a subset of countries, alongside the existing language filter. Country is inferred from the recognizer's module path (the segment directly under `country_specific/`), so no changes to individual recognizer classes are required. Locale-agnostic recognizers are always preserved. Fixes microsoft#1328
|
@microsoft-github-policy-service agree |
|
Thanks! this is great. One comment, which might not be easily solvable- if we're using the module path, it wouldn't catch any custom recognizer that a user would add and filtering on those would not work. Do you have any suggestions on how to include custom country recognizers as well? |
|
Great call out. Option A — Opt-in
|
|
Thanks. 1 and 2 fit together (I don't see them as competing alternatives). Once
Option 3 is interesting, but I agree with you that this requires the user to figure out the mechanism (callable is a generic interface) and use it. I'm not sure there are great answers to those questions, just wanted to get your perspective and make sure we think about this from multiple angles. |
|
Some options.. 1. Generic recognizers (credit card, date, email, URL, IBAN, crypto)Proposal: So
This keeps the filter's behavior intuitive ("filter country-specific recognizers; leave the generic ones alone") and matches what this PR already does for agnostic built-ins. It also avoids magic sentinels like Edge case worth flagging: a few built-ins are geographically skewed but not country-bound — IBAN is the usual example (European-ish but not one country), crypto addresses are global-ish. I'd keep those as
If that trade-off bites enough users later, promoting 2. DiscoverabilityYou're right that this is the weakest link of any opt-in metadata field. I don't think it's fully solvable — we can only reduce the failure mode. Concretely, I'd layer four things:
That combination won't be perfect, but it'll catch the common footgun where someone filters, sees a short list, and silently wonders why their custom recognizer is missing. 3. Backwards compatibility for untagged custom recognizersThis is the most consequential question, and I think the cleanest answer is the one implied by the rule in (1): An untagged recognizer ( To your phrasing — "do we assume they don't filter by country anyway?" — yes, exactly. A user who hasn't adopted the field hasn't opted into the filtering contract, and the safest interpretation is "this recognizer applies everywhere." That has the nice property of giving us one rule for both generics and untagged custom recognizers — easy to document, easy to reason about. The obvious failure mode: a user has a custom If we want to nudge harder, option 2.4 above (runtime warning when the filter returns zero hits for a requested country) covers the common case where they pass a country they expected to match. Summary of proposed defaults
Net effect:
Happy to do this as a single follow-up PR once this one lands (keeping this PR as the "module-path inference for the built-ins" step, which is then superseded by the attribute in PR #2), or collapse both into this PR if you'd prefer one bigger change. Slight lean toward the two-PR path because PR #2 will touch every file under |
|
Thanks! Please also add a doc section + update the recognizers yaml + contributing.md to make sure we set this rule properly from now on. |
|
Confirmed — that's exactly the rule:
The one-line invariant is "untagged = always included," which is what keeps the gradual migration backwards-compatible. On the three asks: Doc section — a new "Filtering recognizers by country" page (or subsection of analyzer customization). Will cover the Recognizers YAML — annotate the predefined entries under CONTRIBUTING.md — short subsection under "adding a new recognizer": any new country-specific recognizer must declare One scope question before I start: the doc / YAML / CONTRIBUTING work only really makes sense once
Slight lean toward B for review hygiene (the migration touches every file under |
|
Thanks! Sounds like a plan! let's go with A to not introduce changes and then remove them. It's ok if the PR is slightly big because it touches every predefined country recognizer. |
Adopts plan A from PR microsoft#2000 review: rather than inferring country from module path, country information is now a first-class attribute on EntityRecognizer, declared once and resolved consistently everywhere. Changes: - EntityRecognizer - New optional `country_code` constructor argument and instance attribute (lower-cased ISO-3166 alpha-2 by convention; None means generic). - New `COUNTRY_CODE` ClassVar so subclasses can declare their country once at the class level without overriding `__init__`. - `to_dict()` now serializes `country_code` when set. - PatternRecognizer - Forwards a new `country_code` constructor argument to the base class so YAML-defined custom recognizers can declare their country. - All 60 predefined country-specific recognizers now declare `COUNTRY_CODE` as a class-level attribute (au, ca, de, es, fi, in, it, kr, ng, pl, se, sg, th, tr, uk, us). Generic recognizers (credit card, email, phone, IP, IBAN, dates, etc.) intentionally remain unset. - RecognizerRegistry - `_get_recognizer_country` now prefers `recognizer.country_code`, with legacy module-path inference kept as a fallback for any user-defined recognizer that has not yet adopted the attribute. - `_prepare_recognizer_kwargs` drops `country_code` from kwargs when the target class's `__init__` does not accept it, so existing predefined recognizers that have not yet been migrated to accept the kwarg keep working unchanged. - `filter_by_countries` now keeps generic (None) recognizers and filters only country-tagged ones; emits a WARNING when a requested country matches zero recognizers, to aid discoverability. - New `RecognizerRegistry.get_country_codes()` helper returns the set of country codes currently loaded. - Configuration - `example_recognizers.yaml` documents the `country_code` field and adds a `BR CPF` example recognizer that declares `country_code: "br"`. - Documentation - New `docs/analyzer/filtering_by_country.md` page covering the country filter, how to declare `country_code` on custom recognizers, when to leave it unset, backwards compatibility, and debugging tips. - `docs/analyzer/index.md` and `mkdocs.yml` link to the new page. - `docs/analyzer/adding_recognizers.md` documents the `COUNTRY_CODE` convention for predefined recognizers. - `CONTRIBUTING.md` adds a "Contributing a New Predefined Recognizer" rule requiring `COUNTRY_CODE` for country-specific recognizers. - Tests - Extended `test_recognizer_registry.py` with coverage for attribute- based filtering, generic (untagged) recognizers, the warning path, and `get_country_codes()`. - CHANGELOG: entry under [unreleased] / Analyzer / Added. Behavior: - `load_predefined_recognizers(countries=None)` -> all recognizers (country-tagged + generic), unchanged from before. - `load_predefined_recognizers(countries=["us"])` -> US-tagged recognizers + all generic (None) recognizers. - Backwards compatible: any recognizer that does not set country_code (custom or otherwise) is treated as generic and is always returned. Closes microsoft#1328 (analyzer side). Signed-off-by: Nachiket Torwekar <nachiket.torwekar@gmail.com> Made-with: Cursor
|
Thanks for the steer! Pushed
Predefined recognizers
Behavior summary
Config / YAML
Docs
Tests
CHANGELOG
Local verification: Diff is ~73 files / +695/-26 as you anticipated. Happy to split it (e.g., base attribute + filter logic in one commit, predefined-recognizer migration in a follow-up) if that helps review — just say the word. |
omri374
left a comment
There was a problem hiding this comment.
Added a few comments, thanks!
|
|
||
| MIN_SCORE = 0 | ||
| MAX_SCORE = 1.0 | ||
| COUNTRY_CODE: ClassVar[Optional[str]] = None |
There was a problem hiding this comment.
Having a class var is definitely the right way here, but how do we make this discoverable? having half the recognizers implement it and have keep it as None doesn't take us very far. Any suggestions?
I was thinking along those lines maybe:
from typing import ClassVar
class BaseRecognizer:
COUNTRY_CODE: ClassVar[str | None] = None
@classmethod
def country_code(cls) -> str | None:
return cls.COUNTRY_CODE
@classmethod
def is_country_specific(cls) -> bool:
return cls.COUNTRY_CODE is not NoneWDYT?
@SharonHart, how do you see this?
There was a problem hiding this comment.
- EntityRecognizer now exposes COUNTRY_CODE: ClassVar[Optional[str]] = None as the single canonical declaration.
- Added @classmethod country_code(cls) -> Optional[str] (lower-cases the class attr) and @classmethod is_country_specific(cls) -> bool.
- Removed the country_code constructor kwarg, the self.country_code instance attribute, the country_code field in to_dict(), and the _prepare_recognizer_kwargs shim that swallowed the YAML kwarg.
- Country tagging is now intentionally a class-level fact: instance overrides are a no-op, so COUNTRY_CODE is hard to miss when reading the registry / docs / a recognizer class definition.
There was a problem hiding this comment.
Will wait for Sharon's POV on the YAML aspect
There was a problem hiding this comment.
Let me know how you want to proceed.
| recognizers = RecognizerListLoader.get(**configuration) | ||
|
|
||
| if countries is not None: | ||
| recognizers = RecognizerListLoader.filter_by_countries( |
There was a problem hiding this comment.
instead of introducing this logic here, can we inject the country codes to the configuration and have the RecognizerListLoader do all the work, like we do with languages?
This essentially means that the country filter is now global in the yaml, which I feel is a step in the right direction, similar to how languages are filtering recognizers.
There was a problem hiding this comment.
- RecognizerListLoader.get(...) gained supported_countries: Optional[Iterable[str]] = None. Filtering happens inline next to _is_language_supported_globally.
- RecognizerRegistry.load_predefined_recognizers(countries=...) simply forwards into registry_configuration["supported_countries"]. The post-hoc filter_by_countries call site in recognizer_registry.py is gone.
- The filter is now also driveable via a top-level supported_countries: ["us", "uk"] field in the registry YAML, mirroring supported_languages. Documented in docs/analyzer/filtering_by_country.md and conf/example_recognizers.yaml.
…country filter into RecognizerListLoader Addresses both review comments on PR microsoft#2000: 1. Discoverability — country tag is now a fact about the recognizer *class*, not an instance attribute. ``EntityRecognizer`` exposes the ``COUNTRY_CODE`` ClassVar as the canonical declaration, and reads it back through two new classmethods so callers / new contributors land on a named API: - ``EntityRecognizer.country_code() -> Optional[str]`` — lower-cased ISO code or ``None``. - ``EntityRecognizer.is_country_specific() -> bool`` — named predicate; the filter logic uses this so the field is hard to miss when reading the registry / docs. The ``country_code`` constructor kwarg, the ``self.country_code`` instance attribute, the ``country_code`` field in ``to_dict()``, and the temporary ``_prepare_recognizer_kwargs`` shim that swallowed the YAML kwarg are all removed. Setting ``COUNTRY_CODE`` on a subclass is now the only supported way to tag a recognizer; instance-level overrides are intentionally a no-op so the country tag is a single source of truth at the type level. 2. Threading the filter through ``RecognizerListLoader`` — instead of running the country filter post-hoc in ``RecognizerRegistry.load_predefined_recognizers``, the filter is now applied inline inside ``RecognizerListLoader.get(...)``, exactly alongside ``_is_language_supported_globally``. Concretely: - ``RecognizerListLoader.get`` gains a ``supported_countries: Optional[Iterable[str]] = None`` kwarg. - Threaded through ``RecognizerConfigurationLoader`` automatically (no schema changes needed — the configuration dict simply carries the new key when set). - ``RecognizerRegistry.load_predefined_recognizers(countries=...)`` forwards into ``registry_configuration["supported_countries"]``. The post-hoc ``filter_by_countries`` call site in ``recognizer_registry.py`` is removed. - The filter is now also driveable from a top-level ``supported_countries: ["us", "uk"]`` field in the registry YAML, mirroring how ``supported_languages`` works. Documented in ``docs/analyzer/filtering_by_country.md`` and ``conf/example_recognizers.yaml``. Other follow-ups in this commit: - ``filter_by_countries`` survives as the helper called internally by ``RecognizerListLoader.get``; it now reads country via the classmethod (``recognizer.country_code()``) rather than poking at an instance attribute. Module-path inference is kept as a transitional fallback for any custom recognizer that follows the ``country_specific/<iso>/`` directory convention but hasn't adopted the class attribute. - ``RecognizerRegistry.get_country_codes()`` reads via the classmethod. - ``example_recognizers.yaml`` drops the BR CPF custom example that relied on the now-removed YAML ``country_code:`` field; that field was only ever a thin wrapper over the constructor kwarg, and the per-recognizer YAML route is intentionally left out of this commit pending follow-up review feedback. - Docs (``filtering_by_country.md``, ``adding_recognizers.md``, ``CONTRIBUTING.md``) updated to point exclusively at the ClassVar + classmethod API. - Tests rewritten against the classmethod API; tests for the removed constructor kwarg / instance attribute are dropped, and a new test exercises ``RecognizerListLoader.get(supported_countries=...)`` directly to lock in the loader-level filter contract. Behavior is unchanged from the prior commit on this branch: ``countries=None`` → all recognizers; ``countries=["us"]`` → US-tagged + all locale-agnostic recognizers; untagged recognizers are always kept. Signed-off-by: Nachiket Torwekar <nachiket.torwekar@gmail.com> Made-with: Cursor
|
Thanks for the review @omri374 — 1. Discoverability — went with the classmethod shape, but made it the only path. Rather than
COUNTRY_CODE: ClassVar[Optional[str]] = None
@classmethod
def country_code(cls) -> Optional[str]:
code = cls.COUNTRY_CODE
return code.lower() if isinstance(code, str) else code
@classmethod
def is_country_specific(cls) -> bool:
return cls.country_code() is not None
The filter logic now reads recognizer.country_code() everywhere; instance-level overrides are intentionally a no-op so country tagging is unambiguously a fact about the recognizer class. That pushes the field's discoverability up the stack — every code path that cares about country lands on is_country_specific() / country_code(), which makes the missing-attribute case much harder to overlook in review and at runtime.
2. Filter is now threaded through RecognizerListLoader exactly like supported_languages.
RecognizerListLoader.get(...) gained supported_countries: Optional[Iterable[str]] = None. The filter runs inline alongside _is_language_supported_globally, no separate code path.
The configuration loader threads it automatically, so the same filter is also driveable from the top-level YAML (supported_countries: ["us", "uk"]) the same way supported_languages is.
RecognizerRegistry.load_predefined_recognizers(countries=...) is now just a thin pass-through into registry_configuration["supported_countries"]. The post-hoc filter_by_countries(...) call site in recognizer_registry.py is gone. |
|
Thanks. busy week, we'll review asap |
| # ``supported_languages`` so the filter is applied inside | ||
| # ``RecognizerListLoader.get(...)`` and behaves uniformly | ||
| # whether driven from Python or from a YAML config file. | ||
| registry_configuration["supported_countries"] = countries |
There was a problem hiding this comment.
Please update default_recognizers.yaml with the country tag too. This would make this more explicit for those using the no-code yaml approach
There was a problem hiding this comment.
Also make sure that there is input validation. What if the user inputs two countries? (actually, in some cases this could be applicable, like a phone number recognizer) but for now let's put it aside. Worst case, the user could define two recognizers one for each country.
| :param supported_entity: The entity this recognizer can detect | ||
| """ | ||
|
|
||
| COUNTRY_CODE = "it" |
There was a problem hiding this comment.
please add a country code for IT Vat code too (https://github.com/ynachiket/presidio/blob/main/presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/italy/it_vat_code.py)
|
@ynachiket apologies for the delay. I've added a few minor comments but it's mostly ready. Thanks! |
Change Description
Adds an optional
countriesparameter toRecognizerRegistry.load_predefined_recognizers()that lets callers restrict which country-specific predefined recognizers are loaded, alongside the existinglanguagesfilter. Locale-agnostic recognizers (credit cards, emails, URLs, crypto, IBAN, etc.) are always preserved regardless of the filter.Motivation
Today,
load_predefined_recognizers()either loads every country-specific recognizer or none. A US-only or EU-only deployment has to either accept the noise of every country's recognizer, or manually enumerate every recognizer class to reconstruct the registry. This change makes the common "I only care about these locales" case a one-liner.Approach
country_specific/(e.g.us,uk,es,in,pl,fi,sg,au,it). No changes to any individual recognizer class are required; the existing directory layout is the source of truth.US,Us,usall work).countries=[]keeps only the locale-agnostic recognizers.countries=None(the default) preserves exactly today's behavior — fully backwards compatible. No existing call site needs to change.Implementation lives in:
presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py— newRecognizerListLoader._get_recognizer_country()andfilter_by_countries()helper.presidio-analyzer/presidio_analyzer/recognizer_registry/recognizer_registry.py— the newcountrieskwarg threaded through to the loader.Example
Tests
Added 5 unit tests in presidio-analyzer/tests/test_recognizer_registry.py:
default behavior unchanged when countries is not passed
single-country filtering keeps that country + all agnostic recognizers
case-insensitive matching ("US" == "us")
countries=[] keeps only agnostic recognizers
multi-country filtering (["us", "uk"])
All tests in tests/test_recognizer_registry.py pass locally.
Issue reference
Fixes #1328