feat: Add HuggingFaceNerRecognizer for direct NER model inference #1834
- Bypass spaCy tokenizer alignment issues with agglutinative languages
- Support any HuggingFace token-classification model
@microsoft-github-policy-service agree
- Text chunking with configurable overlap
- Batch inference with fallback for compatibility
- Deduplication keeping highest confidence scores

- Allow ML-specific fields (model_name, device, etc.) in PredefinedRecognizerConfig.
- Implement dynamic argument filtering in loader to match recognizer signatures.
- Enable YAML support for HuggingFaceNerRecognizer while preserving backward compatibility.

- Document HuggingFace NER YAML configuration standard
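
The chunking and deduplication steps listed in the commits above can be sketched roughly as follows. This is a minimal illustration, not the code from the PR: `Span`, `chunk_text`, `deduplicate`, and the default sizes are hypothetical names and values chosen for the example, and the real recognizer may key duplicates differently.

```python
from dataclasses import dataclass
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for a detection (not Presidio's RecognizerResult)."""

    entity_type: str
    start: int
    end: int
    score: float


def chunk_text(
    text: str, chunk_size: int = 400, overlap: int = 40
) -> Iterator[Tuple[int, str]]:
    """Yield (offset, chunk) pairs; consecutive chunks share `overlap` characters."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    for offset in range(0, max(len(text), 1), step):
        yield offset, text[offset : offset + chunk_size]
        # Stop once the current chunk reaches the end of the text.
        if offset + chunk_size >= len(text):
            break


def deduplicate(spans: List[Span]) -> List[Span]:
    """Keep the highest-scoring span per (entity_type, start, end) key."""
    best: Dict[Tuple[str, int, int], Span] = {}
    for span in spans:
        key = (span.entity_type, span.start, span.end)
        if key not in best or span.score > best[key].score:
            best[key] = span
    return sorted(best.values(), key=lambda s: (s.start, s.end))
```

Because chunks overlap, an entity inside the overlap can be detected twice. Adding each chunk's `offset` to the chunk-local positions before calling `deduplicate` makes the two detections share one key, so only the higher score survives.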
I’ll reuse PR #1805’s chunking logic once it’s merged to keep things consistent. Any other feedback is welcome.
… clean up parameters
Thanks! Sure, will do.
@omri374
omri374
left a comment
Thanks! This is a great addition! I tested it on some of the newer OpenMed transformer models (OpenMed/OpenMed-PII-SuperClinical-Base-184M-v1) and it worked well. I did leave some comments on making it more robust to different uses. Happy to help with anything needed!
Pull request overview
This pull request adds HuggingFaceNerRecognizer, a new recognizer that uses HuggingFace Transformers pipeline directly for NER inference, solving spaCy tokenizer alignment issues with agglutinative languages (Korean, Japanese, Turkish, etc.). The PR also implements an inspect-based dynamic parameter passing mechanism in the recognizer loader to support extensible YAML configuration.
Changes:
- Adds HuggingFaceNerRecognizer with direct HuggingFace pipeline integration and character-based text chunking
- Implements inspect-based parameter filtering in RecognizerListLoader._prepare_recognizer_kwargs for dynamic argument passing while maintaining backward compatibility
- Adds ML model configuration fields to PredefinedRecognizerConfig YAML validation model
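
As a rough illustration of the inspect-based filtering described above, the sketch below keeps only the configuration keys that a recognizer's constructor accepts. It is an assumption about the general technique, not the actual body of `RecognizerListLoader._prepare_recognizer_kwargs`; `filter_kwargs_for` and the two toy recognizer classes are hypothetical.

```python
import inspect
from typing import Any, Dict


def filter_kwargs_for(cls: type, config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the subset of `config` that `cls(...)` can accept as keywords."""
    params = inspect.signature(cls).parameters
    # A constructor declaring **kwargs can receive every configured field.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(config)
    accepted = {
        name
        for name, p in params.items()
        if p.kind
        in (inspect.Parameter.POSITIONAL_OR_KEYWORD, inspect.Parameter.KEYWORD_ONLY)
    }
    return {key: value for key, value in config.items() if key in accepted}


class PatternLikeRecognizer:
    """Toy recognizer that knows nothing about ML-specific fields."""

    def __init__(self, supported_language: str = "en"):
        self.supported_language = supported_language


class ModelLikeRecognizer:
    """Toy recognizer that accepts ML-specific fields."""

    def __init__(
        self, model_name: str, device: str = "cpu", supported_language: str = "en"
    ):
        self.model_name = model_name
        self.device = device
        self.supported_language = supported_language


config = {"model_name": "some/model", "device": "cpu", "supported_language": "ko"}
pattern = PatternLikeRecognizer(**filter_kwargs_for(PatternLikeRecognizer, config))
model = ModelLikeRecognizer(**filter_kwargs_for(ModelLikeRecognizer, config))
```

Filtering by signature lets one YAML schema carry fields such as `model_name` and `device` without breaking recognizers whose constructors never declared them, which is the backward-compatibility concern raised in the review.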
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py | New recognizer implementation with HuggingFace pipeline integration, device handling, label mapping, and chunking support |
| presidio-analyzer/tests/test_huggingface_ner_recognizer.py | Comprehensive unit tests covering initialization, analysis, chunking, device parsing, error handling, and edge cases |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py | Modified _prepare_recognizer_kwargs to use inspect for smart parameter filtering, enabling dynamic kwargs passing |
| presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py | Added ML model configuration fields to PredefinedRecognizerConfig for HuggingFace and similar recognizers |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/__init__.py | Exported HuggingFaceNerRecognizer |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Exported HuggingFaceNerRecognizer |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Added HuggingFaceNerRecognizer to default configuration (disabled by default) |
| docs/analyzer/recognizer_registry_provider.md | Added configuration example and tip for handling agglutinative languages |
| CHANGELOG.md | Documented the new feature |
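
The label mapping mentioned in the table for `huggingface_ner_recognizer.py` might look something like the sketch below. It assumes predictions shaped like the aggregated output of a Transformers token-classification pipeline (dicts with `entity_group`, `score`, `start`, `end`). The mapping table, the threshold, the function name, and the choice to drop unmapped labels are all hypothetical and not taken from the PR.

```python
from typing import Any, Dict, List, Optional

# Hypothetical mapping from model labels to Presidio-style entity names.
DEFAULT_LABEL_MAPPING = {"PER": "PERSON", "LOC": "LOCATION", "ORG": "ORGANIZATION"}


def map_predictions(
    predictions: List[Dict[str, Any]],
    label_mapping: Optional[Dict[str, str]] = None,
    threshold: float = 0.5,
) -> List[Dict[str, Any]]:
    """Convert pipeline-style predictions into entity dicts, dropping weak ones."""
    mapping = DEFAULT_LABEL_MAPPING if label_mapping is None else label_mapping
    results = []
    for pred in predictions:
        # Aggregated output uses "entity_group"; raw output uses "entity".
        label = pred.get("entity_group") or pred.get("entity", "")
        # Strip BIO prefixes such as "B-PER" or "I-PER" if they are present.
        if label[:2] in ("B-", "I-"):
            label = label[2:]
        entity_type = mapping.get(label)
        # Assumption for this sketch: unmapped labels are skipped.
        if entity_type is None or pred["score"] < threshold:
            continue
        results.append(
            {
                "entity_type": entity_type,
                "start": pred["start"],
                "end": pred["end"],
                "score": float(pred["score"]),
            }
        )
    return results


sample = [
    {"entity_group": "PER", "score": 0.98, "word": "Kim", "start": 0, "end": 3},
    {"entity_group": "MISC", "score": 0.91, "word": "Presidio", "start": 10, "end": 18},
    {"entity": "B-LOC", "score": 0.30, "word": "Seoul", "start": 22, "end": 27},
]
mapped = map_predictions(sample)
```

Working from the character offsets returned by the model, rather than realigning to spaCy tokens, is what sidesteps the tokenizer alignment problem the PR describes for agglutinative languages.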
5e67135 to 0399913
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@omri374
…om/ultramancode/presidio into feature/huggingface-ner-recognizer
omri374
left a comment
Great progress! Here is another round of feedback. To emphasize its importance: we see this addition as strategic for the next steps of Presidio, hence the detailed review. Thanks again!
@omri374
Hi @ultramancode, there are some linting issues and one comment I left on the recognizer loading utils (making sure this didn't change any previously working behavior). Other than that, looks great! Thanks
Hi @omri374, thanks for the feedback.
…ognizer - address Copilot review comments
omri374
left a comment
Thanks! Sorry it took so long to finalize the review
Thanks @omri374! Your feedback helped me a lot. I really appreciate you taking the time to review this so carefully.
Thanks @ultramancode! This is a great addition to Presidio, and we'd love for you to continue helping us shape the tool!
* Samples: add telemetry redaction sample (microsoft#1824) * Samples: add otel redaction sample * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * sample-otel-redaction - fix pii dashboard query (re-commit) --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Sharon Hart <sharonh.dev@gmail.com> * Feat: add class_name to allow multiple recognizers from same class (microsoft#1819) * fix: Rename method to get_recognizer_class_name for clarity and update usage * fix: Clarify comments regarding excluded recognizer attributes in RecognizerListLoader * feat: Add class_name parameter to BaseRecognizerConfig for improved recognizer identification * fix: Include 'class_name' in custom recognizers exclusion list for improved configuration handling * feat: Enhance Ollama recognizer to support custom instance names and update configuration handling * Enhance recognizers to accept additional keyword arguments - Updated various recognizers across different countries (India, Italy, Korea, Poland, Singapore, Spain, Thailand, UK, US) to accept **kwargs in their constructors. - This change allows for more flexible configuration of recognizers without modifying their signatures. - Adjusted the recognizer loading mechanism to handle the new **kwargs parameter appropriately. * refactor: Simplify Ollama recognizer loading verification and assertions * test: Update Ollama recognizer loading verification to ensure single instance retrieval * feat: Enhance recognizer class name logic in RecognizerListLoader * Refactor recognizers to explicitly handle 'name' parameter in __init__ methods - Updated various recognizers across different countries (Italy, Korea, Poland, Singapore, Spain, Thailand, UK, US) to include an optional 'name' parameter in their constructors. - Adjusted super() calls to pass the 'name' parameter appropriately. - Ensured that the 'Optional' type is imported where necessary. 
- Added a script to automate the updates for recognizers that were missing the 'name' parameter. * fix: Update Stanza and Transformers recognizers to handle additional kwargs in __init__ methods * fix: Correct the import order for constants in methods.py * refactor: Remove update_recognizers_name.py script as its functionality is no longer needed * check * fix: Remove unnecessary comments and clean up recognizer configuration code * Refactor recognizer constructors to remove unused **kwargs parameter - Updated multiple recognizer classes across various countries (Australia, Finland, India, Italy, Korea, Poland, Singapore, Spain, Thailand, UK, US) to remove the **kwargs parameter from their constructors. - Simplified constructor signatures for better clarity and maintainability. * refactor: Remove unused **kwargs parameter from recognizer initializers * refactor: Remove unused **kwargs parameter from recognizer constructors * fix ci * refactor: format parameters in recognizer constructors for consistency * refactor: format parameters in recognizer constructors for consistency * Docs/gpu acceleration guide (microsoft#1826) * docs: Add GPU acceleration documentation for transformer models Addresses microsoft#1790 - Added comprehensive documentation for using GPU acceleration with spaCy transformer models and other NLP engines. - New GPU usage guide with examples for spaCy and Hugging Face transformers - Covers automatic GPU detection, prerequisites, and troubleshooting - Added cross-references from existing NLP engine documentation - Updated CHANGELOG and mkdocs navigation * chore: revert changes to CHANGELOG.md * chore: revert optional cross-reference links * docs: refine gpu installation instructions and add warnings * docs: streamline gpu docs based on review feedback * fix: restore accidentally deleted telemetry doc link * docs: remove apple silicon bash snippet per review * fixed the trailing whitespace. 
--------- Co-authored-by: dilshad <dilshad@dilshads-MacBook-Air.local> Co-authored-by: dilshad-aee <dilshad-aee@users.noreply.github.com> * [Feature] add korean business registration number recognizer (microsoft#1822) * add kr brn recognizer * add brn to docs * add reference in docstring * Refactor: lazy initialization for device_detector singleton (microsoft#1831) * Refactor: implement lazy initialization for device_detector instance * Fix: improve device detection logic in device_detector and update TransformersRecognizer to use it * doube lock added * [Feature] add Korean Foreigner Registration Number recognizer (microsoft#1825) * add frn recognizer * make copilot happy --------- Co-authored-by: Sharon Hart <sharonh.dev@gmail.com> * feat: Add MacAddressRecognizer (microsoft#1829) * Add MAC address recognizer Support colon, hyphen, and Cisco dot-separated MAC formats with validation and comprehensive tests * fix: linting errors in mac address recognizer * chore: add mac address recognizer to supported_entities, ahds_surrogate * fix: MacAddressRecognizer orde * add: MAC address reference * chore:Add MacAddressRecognizer to default recognizers * Update MAC address regex and validation logic * Refactor MAC address recognizer patterns Updated MAC address patterns and validation logic. 
* Add additional invalid MAC address test cases * Add test cases for lowercase and mixedcase MAC addresses * chore: fix lint error * chore: fix typo * fix: Add optional name parameter to MAC recognizer * fix: update expected count of recognizers in test --------- Co-authored-by: Ron Shakutai <58519179+RonShakutai@users.noreply.github.com> Co-authored-by: Sharon Hart <sharonh.dev@gmail.com> * Fix gliner truncates text (microsoft#1805) * Add failing test for - gliner truncates text and misses names (PII) * Update gliner recognizer to implement basic chunking * Add changes for chunking capabilities including local chuking and call to chunking from gliner recognizer * Remove gliner image redaction test - not required * Rename local text chunker to character based text chunker * Fix rename leftovers * Update doc string * Add test for text without spaces and unicodes * Resove linting - format code * Add logging to character based text chunker * Update to remove redundent chunk_overlap parameter * Remove chunk size and chunk overlap from GlinerRecognizer constructor * Updated the utilities to use RecognizerResult * Update so that utils methods are part of base chunker * Add chunker factory * Create Lang chain text chunker * Remove Character based inhouse chunker * Fixed - deterministic offset tracking, fail-fast on misalignment * Resolve merge issue * Add chunk parameter validation * Fix chunk size tests * Fix liniting * Make langchain splitter mandetory * Add clearer type error - review comment * Fix langchain installtion - review comment * Add conditional import of lang chain * Revert to use in-house chunker * Fix line too long (lint) * Fix trailing whitespace lint error * Revemo not required comment * Remove gliner extras from e2e tests to fix CI disk space issue * Remove trailing comma in pyproject.toml to match main --------- Co-authored-by: AJ (Ashitosh Jedhe) <ajedhe@microsoft.com> Co-authored-by: Ron Shakutai <58519179+ShakutaiGit@users.noreply.github.com> 
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com> Co-authored-by: Ron Shakutai <58519179+RonShakutai@users.noreply.github.com> * Migrate short-running workflows to ubuntu-slim runners (microsoft#1840) * Initial plan * Migrate eligible workflows to ubuntu-slim runners for cost efficiency Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> * Document ubuntu-slim migration in CHANGELOG Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> * Revert CodeQL to ubuntu-latest - CPU-intensive analysis requires 2+ cores Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> * Fix CHANGELOG to use correct job name (github-pages-release) Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> * Remove CHANGELOG modifications for ubuntu-slim migration Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> * feat(recognizers): add UsMbiRecognizer for US Medicare Beneficiary ID (microsoft#1821) * Fix language in pattern recognizer example (microsoft#1835) * Update cryptography dependency to >=46.0.4 for CVE-2025-15467 (microsoft#1841) * Initial plan * Update cryptography dependency to >=46.0.4 to address CVE-2025-15467 Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> Co-authored-by: Sharon Hart <sharonh.dev@gmail.com> * Add a configurable LangExtract recognizer for use with any provider. (microsoft#1815) * Add a basic, configurable LangExtract-based recognizer class for use with any provider. * Add a basic, configurable LangExtract-based recognizer class for use with any provider. 
* Address comments (#4) * Address comments * Replace ollama_langextract_recognizer with basic_langextract_recognizer. * Replace ollama_langextract_recognizer with basic_langextract_recognizer. * Replace ollama_langextract_recognizer with basic_langextract_recognizer. * Working so far * Working so far * Working so far * remove dead code * bad comment --------- Co-authored-by: Kassymkhan Bekbolatov <kbekbolatov@solidcore.ai> * Address comment in telackey/lellm (#5) * Replace ollama_langextract_recognizer with basic_langextract_recognizer. * Fix LangExtract error --------- Co-authored-by: Kassymkhan Bekbolatov <kbekbolatov@solidcore.ai> * Remove changes not required * docstring * update docs * ruff --------- Co-authored-by: Kassymkhan Bekbolatov <kasymhan007@gmail.com> Co-authored-by: Kassymkhan Bekbolatov <kbekbolatov@solidcore.ai> * Support batch processing over the REST API. (microsoft#1806) * Support batch processing over the REST API. * Partially fix e2e tests * Fix e2e tests * ruff * ruff * consistent use of strings * Update API docs * Support batch processing over the REST API. 
* Partially fix e2e tests * Fix e2e tests * ruff * ruff * consistent use of strings * Update API docs * Fix Analyzer build on 3.10 (microsoft#1848) * Update README.MD * Update pyproject.toml * Update pyproject.toml * Update pyproject.toml * Update pyproject.toml * Update pyproject.toml * Add salted hashing to hash operator to prevent brute-force attacks (microsoft#1846) * Initial plan * Implement salted hashing in Hash operator to prevent brute-force attacks Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Fix linting issues in hash operator implementation Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Address code review comments - improve salt precedence logic and fix type hint Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Pass hash_salt only to hash operator, not all operators Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Move salt generation into Hash operator, remove engine dependency Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Simplify hash operator: remove statefulness, use random salt per entity Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Add explicit breaking change disclaimer for hash operator Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Update version number to 2.2.361 in breaking change warning Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Convert Privacy Note to mkdocs note format Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Address s-zanella's security review: add salt validation and known-answer tests Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Remove unused operator_kwargs parameter from engine_base Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Update salt validation error message to clarify auto-generation Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * 
Address review comments: restore operator_kwargs, move imports to top, reject empty salt Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> * Remove tests that relied on empty salt (now rejected) Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * restore tests with salt addition --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com> Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> Co-authored-by: Omri Mendels <omri374@users.noreply.github.com> Co-authored-by: Sharon Hart <shhart@microsoft.com> * Prepare release 2.2.361: bump versions and finalize changelog (microsoft#1851) * Initial plan * Update CHANGELOG.md with 30 PRs since v2.2.360 Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Prepare release 2.2.361: bump versions and update CHANGELOG.md Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Feature - GPU Device Control via Environment Variable (microsoft#1844) * init commit * gpu on windows * delete benchmark script and results file for Presidio Analyzer * ruff fix --------- Co-authored-by: Omri Mendels <omri374@users.noreply.github.com> * feat: Add HuggingFaceNerRecognizer for direct NER model inference (microsoft#1834) * Fix Broken Links (microsoft#1856) This commit fixes broken links within the presidio documentation. 
* Pin dependencies to mitigate supply chain attacks (microsoft#1861) * Initial plan * Pin GitHub Actions and Docker images to specific hashes for security Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Pin pip, poetry, and other build tool versions for security Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Add comprehensive dependency pinning security documentation Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Fix CI: Replace incorrect setup-python SHA with v6 tag Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Update dependency pinning docs to reflect pragmatic approach Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Fix CI: Replace invalid codeql-action and security-devops-action SHAs with version tags Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Update docs to list all actions using version tags Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Fix CI: Replace invalid py-cov-action SHA with version tag Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Update docs to include py-cov-action in version tag list Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Remove DEPENDENCY_PINNING.md as requested by maintainer Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Pin actions/setup-python to commit SHA e797f83 (v6.0.0) Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Pin codeql-action, security-devops-action, and py-cov-action to commit SHAs Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> * Remove all pip upgrade commands as requested Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> --------- Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com> Co-authored-by: SharonHart 
<15013757+SharonHart@users.noreply.github.com> Co-authored-by: Sharon Hart <sharonh.dev@gmail.com> Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com> * Fixing CVE-2024-47874 and CVE-2025-54121 (microsoft#1860) * Fixing CVE-2024-47874 and CVE-2025-54121 Fixing CVE-2024-47874 and CVE-2025-54121 by bumping fastapi in samples * Update FastAPI version to 0.129.0 * Fixing CVE-2025-2953 and CVE-2025-3730 (microsoft#1859) * fix: Fixed context enhancement substring matching bug (microsoft#1827) * Fix _process_names unconditionally treating all metadata as PHI (microsoft#1855) * feat: Add UK Postcode (UK_POSTCODE) recognizer (microsoft#1858) Add a pattern-based recognizer for UK postcodes covering all six standard formats (A9, A99, A9A, AA9, AA99, AA9A) plus GIR 0AA. The regex enforces position-specific letter restrictions per Royal Mail rules. Base score is 0.1 due to the short length of postcodes, with context words (postcode, address, delivery, etc.) boosting confidence. Disabled by default per country-specific convention. Co-authored-by: Sharon Hart <sharonh.dev@gmail.com> * Pin ruff and build pip installs by hash for OSSF scorecard compliance (microsoft#1864) * Initial plan * Pin pip commands with hashes for OSSF scorecard compliance Add --require-hashes to all pip install commands across Dockerfiles, CI workflows, and shell scripts. Generate locked requirements files with hashes for e2e tests, docs, and sample deployments. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Revert requirements-locked.txt changes, keep only direct pip hash pinning Remove all changes that involve requirements-locked.txt files per user request. Retain direct pip install hash pinning for poetry, ruff, build, and other version-pinned packages. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * fix: Remove --require-hashes/--hash from pip install commands pip 26.0 removed the --hash CLI option. 
Hash verification via command-line arguments is no longer supported; hashes can only be specified in requirements files. Revert all pip install commands to use simple version pinning (package==version) which is compatible with all pip versions. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Pin pip commands by hash via requirements files for OSSF compliance Use pip install --require-hashes with dedicated requirements files instead of bare pip install commands. This satisfies the OSSF scorecard pinned-dependencies check. - Create requirements-poetry.txt, requirements-ruff.txt, and requirements-build.txt with SHA256 hashes - Update all Dockerfiles to use hash-pinned poetry install - Update ci.yml to use hash-pinned ruff, poetry, and build installs - Update release.yml to use hash-pinned build install - Update spark startup.sh to use hash-pinned requirements file Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Plan: inline hashes via /dev/stdin instead of separate files Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Inline pip hashes via printf|stdin, remove separate requirements files Replace requirements-poetry.txt, requirements-ruff.txt, and requirements-build.txt with inline printf|pip install -r /dev/stdin. pip --hash only works inside requirements files, not as a CLI flag, so piping through /dev/stdin is the most inline approach possible. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Simplify ruff hash: use single linux x86_64 hash instead of all 17 platform hashes The lint job runs only on ubuntu-slim (x86_64), so only the manylinux_2_17_x86_64 wheel hash is needed. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Fix Windows Dockerfiles: use PowerShell for hash-pinned pip install The printf|/dev/stdin pattern is Unix-only and doesn't work on Windows Server Core. 
Use PowerShell Set-Content to write a temp requirements file instead. The hash is correct — poetry is a pure Python wheel (py3-none-any), same on all platforms. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Fix CI: remove --no-deps from poetry and build pip installs poetry and build have transitive dependencies (cleo, pyproject_hooks, etc.) that must be installed. --no-deps prevented their installation, causing ModuleNotFoundError: No module named 'cleo'. ruff keeps --no-deps --require-hashes since it has zero Python deps. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Plan: move requirements files into .github/pipelines/ Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Move CI requirements files to .github/pipelines/ directory Organize all pip-compile generated requirements files under .github/pipelines/ for ruff, poetry, and build tools. Replaces inline printf hash approach with proper requirements files. Component-level requirements-poetry.txt copies remain for Docker build context (COPY requires files within build context). Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Remove unnecessary .in files from .github/pipelines/ The .in files are just pip-compile input files containing the direct dependency name. This info is already in the header comment of each generated .txt file, making the .in files redundant. Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com> * Share single requirements-poetry.txt via Docker --build-context Replace 4 identical copies of requirements-poetry.txt (713 lines each) with a single file at the repo root. Dockerfiles use COPY --from=pipelines to access it from a named build context, eliminating duplication. Also consolidate requirements-ruff.txt and requirements-build.txt at the repo root. 
Fixes #1833
Description
This PR adds HuggingFaceNerRecognizer, a recognizer that uses the HuggingFace Transformers pipeline directly for NER, bypassing spaCy tokenizer alignment issues.

Why is this needed?

The standard approach, which pairs the spaCy tokenizer with TransformersNlpEngine, has alignment issues for agglutinative languages (Korean, Japanese, Turkish, etc.):

- spaCy token: "김태웅이고" (name + particle)
- NER model entity: "김태웅" (name only)
- char_span() alignment fails, causing entities to be skipped or incorrectly bounded

Solution

HuggingFaceNerRecognizer bypasses spaCy alignment by using the HuggingFace pipeline directly.

Key Features

- HuggingFaceRecognizerConfig to strictly validate recognizer-specific parameters (e.g., model_name, device, label_mapping) directly from YAML, ensuring type safety and clear error messages.
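The alignment failure described under "Why is this needed?" can be illustrated without spaCy. The sketch below is a simplification under stated assumptions: whitespace splitting stands in for the spaCy tokenizer (real Korean tokenization may differ), the sample sentence is made up for illustration, and the helper names `strict_align` and `expand_align` are hypothetical. They loosely mirror the strict and expand alignment modes of spaCy's `Doc.char_span`, and are not code from this PR.

```python
def whitespace_token_spans(text):
    """Return (start, end) character offsets of whitespace-delimited tokens."""
    spans, pos = [], 0
    for token in text.split():
        start = text.index(token, pos)
        end = start + len(token)
        spans.append((start, end))
        pos = end
    return spans


def strict_align(span, token_spans):
    """Return the span only if both edges sit on token boundaries, else None."""
    starts = {s for s, _ in token_spans}
    ends = {e for _, e in token_spans}
    start, end = span
    return span if start in starts and end in ends else None


def expand_align(span, token_spans):
    """Grow the span outwards so it covers every token it touches."""
    start, end = span
    touched = [(s, e) for s, e in token_spans if s < end and e > start]
    return (min(s for s, _ in touched), max(e for _, e in touched))


# Assumed sample sentence; "김태웅이고" is the name plus a particle.
text = "제 이름은 김태웅이고 서울에 살고 있습니다"
tokens = whitespace_token_spans(text)

# Character span a NER model would report for the bare name.
name_start = text.index("김태웅")
model_span = (name_start, name_start + len("김태웅"))

strict = strict_align(model_span, tokens)
expanded = expand_align(model_span, tokens)

print(strict)                           # None: the entity is skipped
print(text[expanded[0]:expanded[1]])    # 김태웅이고: the entity is over-extended
print(text[model_span[0]:model_span[1]])  # 김태웅: what the model actually found
```

Both outcomes match the symptoms listed above: strict alignment drops the entity, while expanding to token boundaries attaches the particle to the name. Keeping the model's own character offsets avoids both.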
Changes
- presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py
- tests/test_huggingface_ner_recognizer.py
- presidio_analyzer/input_validation/yaml_recognizer_models.py (Allow ML fields in config)
- presidio_analyzer/recognizer_registry/recognizers_loader_utils.py (Pass custom args to recognizers)
- presidio_analyzer/predefined_recognizers/ner/__init__.py (export)
- presidio_analyzer/predefined_recognizers/__init__.py (export)
- presidio_analyzer/conf/default_recognizers.yaml (config, disabled by default)
- docs/analyzer/recognizer_registry_provider.md (Added configuration example)
- CHANGELOG.md

Verification Example: Side-by-Side Comparison
To demonstrate the need for this feature, I configured a test environment where both the new HuggingFaceNerRecognizer and the default SpacyRecognizer are enabled for Korean. This allows their output to be compared directly on the same text.

Test Configuration:
Request:
Result Analysis:
The response contains three entities. The first two are correct detections enabled by this PR, while the third is the incorrect "noise" produced by the default system.
- HuggingFaceNerRecognizer correctly splits and identifies "김태웅" (Kim Taewoong) as PERSON and "서울" (Seoul) as LOCATION.
- SpacyRecognizer fails to handle Korean agglutination, capturing the phrase "이름은 김태웅이고..." (My name is Kim Taewoong and...) as a single PERSON entity.

[
  {
    "analysis_explanation": {
      "original_score": 0.9791115522384644,
      "pattern": null,
      "pattern_name": null,
      "recognizer": "HuggingFace NER KR",
      "regex_flags": null,
      "score": 0.9791115522384644,
      "score_context_improvement": 0,
      "supportive_context_word": "",
      "textual_explanation": "Identified as PERSON by Leo97/KoELECTRA-small-v3-modu-ner (original label: PS)",
      "validation_result": null
    },
    "end": 9,
    "entity_type": "PERSON",
    "score": 0.9791115522384644,
    "start": 6
  },
  {
    "analysis_explanation": {
      "original_score": 0.9564878344535828,
      "pattern": null,
      "pattern_name": null,
      "recognizer": "HuggingFace NER KR",
      "regex_flags": null,
      "score": 0.9564878344535828,
      "score_context_improvement": 0,
      "supportive_context_word": "",
      "textual_explanation": "Identified as LOCATION by Leo97/KoELECTRA-small-v3-modu-ner (original label: LC)",
      "validation_result": null
    },
    "end": 14,
    "entity_type": "LOCATION",
    "score": 0.9564878344535828,
    "start": 12
  },
  {
    "analysis_explanation": {
      "original_score": 0.85,
      "pattern": null,
      "pattern_name": null,
      "recognizer": "SpacyRecognizer",
      "regex_flags": null,
      "score": 0.85,
      "score_context_improvement": 0,
      "supportive_context_word": "",
      "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition",
      "validation_result": null
    },
    "end": 23,
    "entity_type": "PERSON",
    "score": 0.85,
    "start": 2
  }
]

Production Configuration Tip
Although the HuggingFace recognizer functions independently, the Presidio Analyzer Platform requires a default NLP engine declaration for startup. For production environments where spaCy NER is not needed, I recommend explicitly disabling the default SpacyRecognizer to avoid duplicate detections and unnecessary inference overhead.
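To sanity-check the offsets in the response, the sketch below replays the three results against an input sentence and shows what remains once SpacyRecognizer output is excluded. Two assumptions apply. The request body is not shown here, so the sentence is a guess chosen only because its character offsets are consistent with the response. The helper `without_recognizer` is a hypothetical post-hoc filter used for illustration, whereas the tip itself recommends disabling the recognizer in configuration.

```python
# Assumed input sentence (not taken from the PR); its offsets match the response.
text = "제 이름은 김태웅이고 서울에 살고 있습니다"

# Subset of fields copied from the analyzer response.
results = [
    {"recognizer": "HuggingFace NER KR", "entity_type": "PERSON",
     "start": 6, "end": 9, "score": 0.9791115522384644},
    {"recognizer": "HuggingFace NER KR", "entity_type": "LOCATION",
     "start": 12, "end": 14, "score": 0.9564878344535828},
    {"recognizer": "SpacyRecognizer", "entity_type": "PERSON",
     "start": 2, "end": 23, "score": 0.85},
]


def covered_text(result, source):
    """Return the substring that a result's start/end offsets select."""
    return source[result["start"]:result["end"]]


def without_recognizer(items, name):
    """Hypothetical helper: drop every result produced by the named recognizer."""
    return [r for r in items if r["recognizer"] != name]


for r in results:
    print(r["recognizer"], r["entity_type"], repr(covered_text(r, text)))

kept = without_recognizer(results, "SpacyRecognizer")
print([covered_text(r, text) for r in kept])  # ['김태웅', '서울']
```

Under this assumed input, the SpacyRecognizer span runs from "이름은" to the end of the sentence, which is the over-wide PERSON entity described in the result analysis.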
Testing
Added unit tests covering:
- English person/location/organization detection
- Korean text with particles (agglutinative language demo)
- Multiple entity types
- Low confidence filtering
- Custom label mapping
- Empty text handling
- Model name validation
- Long text chunking & truncation safety
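Two of the behaviours listed above, custom label mapping and low-confidence filtering, can be sketched in a few lines. This is an illustration under assumptions, not the recognizer's actual code: `Detection`, `map_and_filter`, and the `threshold` parameter are hypothetical names, the prediction dicts are hand-written stand-ins shaped like the output of a HuggingFace token-classification pipeline with an aggregation strategy (keys `entity_group`, `score`, `word`, `start`, `end`), and dropping unmapped labels is an assumed policy. The PS and LC labels come from the response shown earlier; XX is a made-up label.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """Hypothetical container for one mapped, accepted entity."""
    entity_type: str
    start: int
    end: int
    score: float


def map_and_filter(predictions, label_mapping, threshold):
    """Translate model labels to target entity names; drop weak or unmapped hits."""
    detections = []
    for pred in predictions:
        entity_type = label_mapping.get(pred["entity_group"])
        if entity_type is None:
            continue  # assumed policy: labels without a mapping are ignored
        if pred["score"] < threshold:
            continue  # low-confidence filtering
        detections.append(
            Detection(entity_type, pred["start"], pred["end"], float(pred["score"]))
        )
    return detections


# Illustrative values only; these are not real model output.
predictions = [
    {"entity_group": "PS", "score": 0.98, "word": "김태웅", "start": 6, "end": 9},
    {"entity_group": "LC", "score": 0.96, "word": "서울", "start": 12, "end": 14},
    {"entity_group": "LC", "score": 0.20, "word": "살고", "start": 16, "end": 18},
    {"entity_group": "XX", "score": 0.90, "word": "이름", "start": 2, "end": 4},
]
label_mapping = {"PS": "PERSON", "LC": "LOCATION"}

detections = map_and_filter(predictions, label_mapping, threshold=0.5)
for d in detections:
    print(d.entity_type, d.start, d.end, d.score)
```

Only the first two predictions survive: the third falls below the threshold and the fourth has no entry in the mapping.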
I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests
All unit tests and lint checks pass locally
My PR contains documentation updates / additions if required