feat: Add HuggingFaceNerRecognizer for direct NER model inference#1834

Merged
omri374 merged 27 commits into microsoft:main from ultramancode:feature/huggingface-ner-recognizer
Feb 13, 2026

@ultramancode ultramancode commented Jan 17, 2026

Fixes #1833

Description

This PR adds HuggingFaceNerRecognizer, a recognizer that runs NER through the HuggingFace Transformers pipeline directly, bypassing spaCy tokenizer alignment issues.

Why is this needed?

The standard approach, a spaCy tokenizer combined with TransformersNlpEngine, has alignment issues for agglutinative languages (Korean, Japanese, Turkish, etc.):

  • Particles/postpositions attach directly to nouns
  • The spaCy tokenizer keeps the particle inside the token: "김태웅이고" (name + particle)
  • The NER model returns only the entity: "김태웅" (name only)
  • char_span() alignment fails because the entity ends inside a token, so entities are skipped or incorrectly bounded
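
The mismatch described in the list above can be reproduced without spaCy. The sketch below is a simplified stand-in, not spaCy's implementation: it assumes plain whitespace tokenization and mimics the strict behavior of Doc.char_span, which returns None when a character span does not start and end on token boundaries. The helper names (token_spans, strict_char_span) are hypothetical.

```python
import re
from typing import List, Optional, Tuple


def token_spans(text: str) -> List[Tuple[int, int]]:
    """Return (start, end) character offsets of whitespace-delimited tokens."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def strict_char_span(text: str, start: int, end: int) -> Optional[str]:
    """Return text[start:end] only if both offsets sit on token boundaries."""
    spans = token_spans(text)
    starts = {s for s, _ in spans}
    ends = {e for _, e in spans}
    if start in starts and end in ends:
        return text[start:end]
    return None  # misaligned span: the entity would be dropped


text = "제 이름은 김태웅이고 서울에 살고 있습니다."

# The model's span for the name ends in the middle of the token "김태웅이고".
name_only = strict_char_span(text, 6, 9)

# Only the full token (name + particle) aligns with token boundaries.
name_with_particle = strict_char_span(text, 6, 11)

print(name_only, name_with_particle)
```

Under these assumptions the name-only span is rejected, while the span that includes the particle is accepted, which matches the two failure modes named above (skipped or incorrectly bounded entities).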

Solution

HuggingFaceNerRecognizer bypasses spaCy alignment by calling the HuggingFace pipeline directly, so entity boundaries do not depend on spaCy tokens.

Key Features

  • Language-agnostic: Works with any HuggingFace NER model
  • Direct inference: No spaCy tokenizer dependency for entity boundaries
  • Extensible YAML configuration: a dedicated Pydantic model, HuggingFaceRecognizerConfig, strictly validates recognizer-specific parameters (e.g., model_name, device, label_mapping) directly from YAML, ensuring type safety and clear error messages.
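
To make the label_mapping parameter more concrete, here is a minimal sketch of how model labels might be translated into Presidio entity types. It is not the recognizer's actual code. It assumes predictions shaped like the aggregated output of a Transformers token-classification pipeline (dicts with entity_group, score, start, end). The PS and LC labels come from the sample response later in this description; the DT label, the scores, and the map_predictions helper with its min_score parameter are hypothetical.

```python
from typing import Any, Dict, List

# PS -> PERSON and LC -> LOCATION follow the "original label" notes
# in the sample response of this PR description.
LABEL_MAPPING: Dict[str, str] = {"PS": "PERSON", "LC": "LOCATION"}


def map_predictions(
    predictions: List[Dict[str, Any]],
    label_mapping: Dict[str, str],
    min_score: float = 0.5,
) -> List[Dict[str, Any]]:
    """Keep mapped, sufficiently confident predictions and rename their labels."""
    results = []
    for pred in predictions:
        entity_type = label_mapping.get(pred["entity_group"])
        if entity_type is None or pred["score"] < min_score:
            continue  # unmapped label or low confidence
        results.append(
            {
                "entity_type": entity_type,
                "start": pred["start"],
                "end": pred["end"],
                "score": pred["score"],
            }
        )
    return results


# Hand-written sample predictions (illustrative values, not real model output).
sample = [
    {"entity_group": "PS", "score": 0.98, "start": 6, "end": 9},
    {"entity_group": "LC", "score": 0.96, "start": 12, "end": 14},
    {"entity_group": "DT", "score": 0.99, "start": 0, "end": 1},    # unmapped label
    {"entity_group": "PS", "score": 0.20, "start": 16, "end": 18},  # low confidence
]
mapped = map_predictions(sample, LABEL_MAPPING)
print(mapped)
```

Because the character offsets are passed through unchanged, no tokenizer alignment step is involved in this path.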

Changes

  • NEW: presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py
  • NEW: tests/test_huggingface_ner_recognizer.py
  • MODIFY: presidio_analyzer/input_validation/yaml_recognizer_models.py (Allow ML fields in config)
  • MODIFY: presidio_analyzer/recognizer_registry/recognizers_loader_utils.py (Pass custom args to recognizers)
    • Updated argument preparation to correctly pass list-based arguments like supported_entities to the new recognizer.
  • MODIFY: presidio_analyzer/predefined_recognizers/ner/__init__.py (export)
  • MODIFY: presidio_analyzer/predefined_recognizers/__init__.py (export)
  • MODIFY: presidio_analyzer/conf/default_recognizers.yaml (config, disabled by default)
  • MODIFY: docs/analyzer/recognizer_registry_provider.md (Added configuration example)
  • MODIFY: CHANGELOG.md

Verification Example: Side-by-Side Comparison

To demonstrate the need for this feature, I configured a test environment in which both the new HuggingFaceNerRecognizer and the default SpacyRecognizer are enabled for Korean, so their output can be compared directly on the same text.

Test Configuration:

recognizers:
  - name: "HuggingFace NER KR"
    class_name: "HuggingFaceNerRecognizer"
    model_name: "Leo97/KoELECTRA-small-v3-modu-ner"
    supported_languages: ["ko"]
  
  - name: "SpacyRecognizer" # Intentionally enabled for comparison
    class_name: "SpacyRecognizer"
    supported_languages: ["ko"]
    enabled: true

Request:

curl --location 'http://localhost:3000/analyze' \
--header 'Content-Type: application/json' \
--data '{
    "text": "제 이름은 김태웅이고 서울에 살고 있습니다.",
    "language": "ko",
    "return_decision_process": true
}'

Result Analysis:
The response contains three entities. The first two are correct detections from the new recognizer; the third is noise produced by the default SpacyRecognizer.

  • Results 1 & 2 (success): HuggingFaceNerRecognizer correctly isolates "김태웅" (Kim Taewoong, offsets 6-9) as PERSON and "서울" (Seoul, offsets 12-14) as LOCATION, leaving out the attached particles.
  • Result 3 (failure): SpacyRecognizer fails to handle Korean agglutination and tags offsets 2-23, the phrase "이름은 김태웅이고..." (My name is Kim Taewoong and...), as a single PERSON entity covering nearly the whole sentence.
[
    {
        "analysis_explanation": {
            "original_score": 0.9791115522384644,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "HuggingFace NER KR",
            "regex_flags": null,
            "score": 0.9791115522384644,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as PERSON by Leo97/KoELECTRA-small-v3-modu-ner (original label: PS)",
            "validation_result": null
        },
        "end": 9,
        "entity_type": "PERSON",
        "score": 0.9791115522384644,
        "start": 6
    },
    {
        "analysis_explanation": {
            "original_score": 0.9564878344535828,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "HuggingFace NER KR",
            "regex_flags": null,
            "score": 0.9564878344535828,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as LOCATION by Leo97/KoELECTRA-small-v3-modu-ner (original label: LC)",
            "validation_result": null
        },
        "end": 14,
        "entity_type": "LOCATION",
        "score": 0.9564878344535828,
        "start": 12
    },
    {
        "analysis_explanation": {
            "original_score": 0.85,
            "pattern": null,
            "pattern_name": null,
            "recognizer": "SpacyRecognizer",
            "regex_flags": null,
            "score": 0.85,
            "score_context_improvement": 0,
            "supportive_context_word": "",
            "textual_explanation": "Identified as PERSON by Spacy's Named Entity Recognition",
            "validation_result": null
        },
        "end": 23,
        "entity_type": "PERSON",
        "score": 0.85,
        "start": 2
    }
]
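
The offsets in the response can be checked by slicing the request text. This is a small verification sketch, assuming the offsets are Python string (code point) indices and the text uses precomposed Hangul syllables; only the start, end, entity_type, and recognizer fields are copied from the response above.

```python
text = "제 이름은 김태웅이고 서울에 살고 있습니다."

# start/end/entity_type/recognizer copied from the three results above.
results = [
    {"recognizer": "HuggingFace NER KR", "entity_type": "PERSON", "start": 6, "end": 9},
    {"recognizer": "HuggingFace NER KR", "entity_type": "LOCATION", "start": 12, "end": 14},
    {"recognizer": "SpacyRecognizer", "entity_type": "PERSON", "start": 2, "end": 23},
]

# Slice the original text with each result's offsets.
spans = [text[r["start"]:r["end"]] for r in results]

for result, span in zip(results, spans):
    print(f'{result["recognizer"]}: {result["entity_type"]} -> "{span}"')
```

The first two slices contain only the name and the city, while the third runs from "이름은" to the end of the sentence (excluding the final period).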

Production Configuration Tip

Although the HuggingFace recognizer functions independently, the Presidio Analyzer still requires a default NLP engine to be declared at startup. For production environments that do not need spaCy NER, I recommend explicitly disabling the default SpacyRecognizer to avoid duplicate results and unnecessary performance overhead.

- name: "SpacyRecognizer"
  type: "predefined"
  supported_languages: ["ko"]
  enabled: false

Testing

Added unit tests covering:

  • English person/location/organization detection

  • Korean text with particles (agglutinative language demo)

  • Multiple entity types

  • Low confidence filtering

  • Custom label mapping

  • Empty text handling

  • Model name validation

  • Long text chunking & truncation safety

Checklist

  • I have reviewed the contribution guidelines

  • I have signed the CLA (if required)

  • My code includes unit tests

  • All unit tests and lint checks pass locally

  • My PR contains documentation updates / additions if required

- Bypass spaCy tokenizer alignment issues with agglutinative languages
- Support any HuggingFace token-classification model
@ultramancode
Contributor Author

@microsoft-github-policy-service agree

- Text chunking with configurable overlap

- Batch inference with fallback for compatibility

- Deduplication keeping highest confidence scores
- Allow ML-specific fields (model_name, device, etc.) in PredefinedRecognizerConfig.
- Implement dynamic argument filtering in loader to match recognizer signatures.
- Enable YAML support for HuggingFaceNerRecognizer while preserving backward compatibility.
- Document HuggingFace NER YAML configuration standard
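
The chunking and deduplication bullets above can be sketched as follows. This is an illustrative stand-in, not the PR's implementation (which later adopted the BaseTextChunker logic from #1805): chunk_text and deduplicate are hypothetical helpers, and the chunk size and overlap are arbitrary example values.

```python
from typing import Any, Dict, List, Tuple


def chunk_text(text: str, chunk_size: int, overlap: int) -> List[Tuple[int, str]]:
    """Split text into (offset, chunk) pairs whose neighbors share `overlap` chars."""
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("require chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append((start, text[start:start + chunk_size]))
        if start + chunk_size >= len(text):
            break  # the last chunk already reaches the end of the text
    return chunks


def deduplicate(entities: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """For identical (start, end, entity_type) keys, keep the highest score."""
    best: Dict[Tuple[int, int, str], Dict[str, Any]] = {}
    for ent in entities:
        key = (ent["start"], ent["end"], ent["entity_type"])
        if key not in best or ent["score"] > best[key]["score"]:
            best[key] = ent
    return sorted(best.values(), key=lambda e: (e["start"], e["end"]))


chunks = chunk_text("abcdefghij", chunk_size=4, overlap=1)

# The same entity found in two overlapping chunks, already shifted to
# document-level offsets by adding each chunk's start offset.
found = [
    {"start": 3, "end": 4, "entity_type": "PERSON", "score": 0.71},
    {"start": 3, "end": 4, "entity_type": "PERSON", "score": 0.93},
]
unique = deduplicate(found)
print(chunks, unique)
```

The overlap exists so that an entity straddling a chunk boundary appears whole in at least one chunk; deduplication then removes the copies that the overlap introduces.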
@ultramancode
Contributor Author

I’ll reuse PR #1805’s chunking logic once it’s merged to keep things consistent. Any other feedback is welcome.

@ultramancode
Contributor Author

Hi @omri374, I’ve updated the PR to adopt the BaseTextChunker chunking logic from #1805. When you have time, could you please review it?

@omri374
Collaborator

omri374 commented Jan 31, 2026

Thanks! Sure, will do.

@ultramancode
Contributor Author

@omri374
I added a few more tests so the CI checks pass. Thanks!

Collaborator

@omri374 omri374 left a comment

Thanks! This is a great addition! I tested it on some of the newer OpenMed transformer models (OpenMed/OpenMed-PII-SuperClinical-Base-184M-v1) and it worked well. I did leave some comments on making it more robust to different uses. Happy to help with anything needed!

Contributor

Copilot AI left a comment

Pull request overview

This pull request adds HuggingFaceNerRecognizer, a new recognizer that uses HuggingFace Transformers pipeline directly for NER inference, solving spaCy tokenizer alignment issues with agglutinative languages (Korean, Japanese, Turkish, etc.). The PR also implements an inspect-based dynamic parameter passing mechanism in the recognizer loader to support extensible YAML configuration.

Changes:

  • Adds HuggingFaceNerRecognizer with direct HuggingFace pipeline integration and character-based text chunking
  • Implements inspect-based parameter filtering in RecognizerListLoader._prepare_recognizer_kwargs for dynamic argument passing while maintaining backward compatibility
  • Adds ML model configuration fields to PredefinedRecognizerConfig YAML validation model
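
The inspect-based parameter filtering mentioned in the list above can be illustrated with a short sketch. It is not the code in RecognizerListLoader._prepare_recognizer_kwargs: filter_kwargs_for and DemoRecognizer are hypothetical names, and the pass-through rule for constructors that declare **kwargs is an assumption made for this example.

```python
import inspect
from typing import Any, Dict


def filter_kwargs_for(cls: type, config: Dict[str, Any]) -> Dict[str, Any]:
    """Keep only the config keys that cls.__init__ can accept."""
    params = inspect.signature(cls.__init__).parameters
    # A constructor that declares **kwargs can take every key.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(config)
    return {k: v for k, v in config.items() if k in params and k != "self"}


class DemoRecognizer:
    """Hypothetical recognizer with ML-specific constructor arguments."""

    def __init__(self, model_name: str, device: str = "cpu"):
        self.model_name = model_name
        self.device = device


# Keys such as "enabled" are loader-level settings the constructor does not take.
yaml_config = {"model_name": "some/model", "device": "cpu", "enabled": True}
kwargs = filter_kwargs_for(DemoRecognizer, yaml_config)
recognizer = DemoRecognizer(**kwargs)
print(kwargs)
```

Filtering against the signature lets new recognizers receive extra YAML fields while existing recognizers, whose constructors do not know those fields, keep working unchanged.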

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/huggingface_ner_recognizer.py | New recognizer implementation with HuggingFace pipeline integration, device handling, label mapping, and chunking support |
| presidio-analyzer/tests/test_huggingface_ner_recognizer.py | Comprehensive unit tests covering initialization, analysis, chunking, device parsing, error handling, and edge cases |
| presidio-analyzer/presidio_analyzer/recognizer_registry/recognizers_loader_utils.py | Modified _prepare_recognizer_kwargs to use inspect for smart parameter filtering, enabling dynamic kwargs passing |
| presidio-analyzer/presidio_analyzer/input_validation/yaml_recognizer_models.py | Added ML model configuration fields to PredefinedRecognizerConfig for HuggingFace and similar recognizers |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/ner/__init__.py | Exported HuggingFaceNerRecognizer |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/__init__.py | Exported HuggingFaceNerRecognizer |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Added HuggingFaceNerRecognizer to default configuration (disabled by default) |
| docs/analyzer/recognizer_registry_provider.md | Added configuration example and tip for handling agglutinative languages |
| CHANGELOG.md | Documented the new feature |

@ultramancode ultramancode force-pushed the feature/huggingface-ner-recognizer branch from 5e67135 to 0399913 on February 3, 2026 15:51
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
@ultramancode
Contributor Author

@omri374
Thank you for the detailed feedback — it was really helpful!
I’ve addressed all the comments and pushed the updates. Could you please take another look when you have a moment?

Collaborator

@omri374 omri374 left a comment

Great progress! Here is another round of feedback. To emphasize its importance: we see this addition as strategic for the next steps of Presidio, hence the detailed review. Thanks again!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 3 comments.

@ultramancode
Contributor Author

@omri374
Thanks again for the thorough review! It really helped clarify my understanding of Presidio’s internals.
I’ve addressed the latest round of comments and pushed the updates. Could you please take another look when you have a moment?

@omri374
Collaborator

omri374 commented Feb 8, 2026

Hi @ultramancode, there are some linting issues and one comment I left on the recognizer loading utils (making sure this didn't change any previously working behavior). Other than that, looks great! Thanks

@ultramancode
Contributor Author

Hi @omri374, thanks for the feedback.
I’ve replied to your comments—appreciate you taking the time!

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

omri374 previously approved these changes Feb 12, 2026
Collaborator

@omri374 omri374 left a comment

Thanks! Sorry it took so long to finalize the review

@ultramancode
Contributor Author

Thanks @omri374! Your feedback helped me a lot. I really appreciate you taking the time to review this so carefully.
I learned a lot from your review.
Also, it looks like the approval may have been dismissed after I resolved the merge conflict in CHANGELOG.md.

@omri374
Collaborator

omri374 commented Feb 12, 2026

Thanks @ultramancode! This is a great addition to Presidio, and we'd love for you to continue helping us shape the tool!

@omri374 omri374 merged commit 4a2672d into microsoft:main Feb 13, 2026
1 check passed
AyushAggarwal1 added a commit to AyushAggarwal1/presidio that referenced this pull request Mar 2, 2026
* Samples: add telemetry redaction sample (microsoft#1824)

* Samples: add otel redaction sample

* Apply suggestions from code review

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* sample-otel-redaction - fix pii dashboard query (re-commit)

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Feat: add class_name to allow multiple recognizers from same class (microsoft#1819)

* fix: Rename method to get_recognizer_class_name for clarity and update usage

* fix: Clarify comments regarding excluded recognizer attributes in RecognizerListLoader

* feat: Add class_name parameter to BaseRecognizerConfig for improved recognizer identification

* fix: Include 'class_name' in custom recognizers exclusion list for improved configuration handling

* feat: Enhance Ollama recognizer to support custom instance names and update configuration handling

* Enhance recognizers to accept additional keyword arguments

- Updated various recognizers across different countries (India, Italy, Korea, Poland, Singapore, Spain, Thailand, UK, US) to accept **kwargs in their constructors.
- This change allows for more flexible configuration of recognizers without modifying their signatures.
- Adjusted the recognizer loading mechanism to handle the new **kwargs parameter appropriately.

* refactor: Simplify Ollama recognizer loading verification and assertions

* test: Update Ollama recognizer loading verification to ensure single instance retrieval

* feat: Enhance recognizer class name logic in RecognizerListLoader

* Refactor recognizers to explicitly handle 'name' parameter in __init__ methods

- Updated various recognizers across different countries (Italy, Korea, Poland, Singapore, Spain, Thailand, UK, US) to include an optional 'name' parameter in their constructors.
- Adjusted super() calls to pass the 'name' parameter appropriately.
- Ensured that the 'Optional' type is imported where necessary.
- Added a script to automate the updates for recognizers that were missing the 'name' parameter.

* fix: Update Stanza and Transformers recognizers to handle additional kwargs in __init__ methods

* fix: Correct the import order for constants in methods.py

* refactor: Remove update_recognizers_name.py script as its functionality is no longer needed

* check

* fix: Remove unnecessary comments and clean up recognizer configuration code

* Refactor recognizer constructors to remove unused **kwargs parameter

- Updated multiple recognizer classes across various countries (Australia, Finland, India, Italy, Korea, Poland, Singapore, Spain, Thailand, UK, US) to remove the **kwargs parameter from their constructors.
- Simplified constructor signatures for better clarity and maintainability.

* refactor: Remove unused **kwargs parameter from recognizer initializers

* refactor: Remove unused **kwargs parameter from recognizer constructors

* fix ci

* refactor: format parameters in recognizer constructors for consistency

* refactor: format parameters in recognizer constructors for consistency

* Docs/gpu acceleration guide (microsoft#1826)

* docs: Add GPU acceleration documentation for transformer models

Addresses microsoft#1790 - Added comprehensive documentation for using GPU
acceleration with spaCy transformer models and other NLP engines.

- New GPU usage guide with examples for spaCy and Hugging Face transformers
- Covers automatic GPU detection, prerequisites, and troubleshooting
- Added cross-references from existing NLP engine documentation
- Updated CHANGELOG and mkdocs navigation

* chore: revert changes to CHANGELOG.md

* chore: revert optional cross-reference links

* docs: refine gpu installation instructions and add warnings

* docs: streamline gpu docs based on review feedback

* fix: restore accidentally deleted telemetry doc link

* docs: remove apple silicon bash snippet per review

* fixed the trailing whitespace.

---------

Co-authored-by: dilshad <dilshad@dilshads-MacBook-Air.local>
Co-authored-by: dilshad-aee <dilshad-aee@users.noreply.github.com>

* [Feature] add korean business registration number recognizer (microsoft#1822)

* add kr brn recognizer

* add brn to docs

* add reference in docstring

* Refactor: lazy initialization for device_detector singleton (microsoft#1831)

* Refactor: implement lazy initialization for device_detector instance

* Fix: improve device detection logic in device_detector and update TransformersRecognizer to use it

* doube lock added

* [Feature] add Korean Foreigner Registration Number recognizer (microsoft#1825)

* add frn recognizer

* make copilot happy

---------

Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* feat: Add MacAddressRecognizer  (microsoft#1829)

* Add MAC address recognizer

Support colon, hyphen, and Cisco dot-separated MAC formats with validation and comprehensive tests

* fix: linting errors in mac address recognizer

* chore: add mac address recognizer to supported_entities, ahds_surrogate

* fix: MacAddressRecognizer orde

* add: MAC address reference

* chore:Add MacAddressRecognizer to default recognizers

* Update MAC address regex and validation logic

* Refactor MAC address recognizer patterns

Updated MAC address patterns and validation logic.

* Add additional invalid MAC address test cases

* Add test cases for lowercase and mixedcase MAC addresses

* chore: fix lint error

* chore: fix typo

* fix: Add optional name parameter to MAC recognizer

* fix: update expected count of recognizers in test

---------

Co-authored-by: Ron Shakutai <58519179+RonShakutai@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Fix gliner truncates text (microsoft#1805)

* Add failing test for - gliner truncates text and misses names (PII)

* Update gliner recognizer to implement basic chunking

* Add changes for chunking capabilities including local chuking and call to chunking from gliner recognizer

* Remove gliner image redaction test - not required

* Rename local text chunker to character based text chunker

* Fix rename leftovers

* Update doc string

* Add test for text without spaces and unicodes

* Resove linting - format code

* Add logging to character based text chunker

* Update to remove redundent chunk_overlap parameter

* Remove chunk size and chunk overlap from GlinerRecognizer constructor

* Updated the utilities to use RecognizerResult

* Update so that utils methods are part of base chunker

* Add chunker factory

* Create Lang chain text chunker

* Remove Character based inhouse chunker

* Fixed - deterministic offset tracking, fail-fast on misalignment

* Resolve merge issue

* Add chunk parameter validation

* Fix chunk size tests

* Fix liniting

* Make langchain splitter mandetory

* Add clearer type error - review comment

* Fix langchain installtion - review comment

* Add conditional import of lang chain

* Revert to use in-house chunker

* Fix line too long (lint)

* Fix trailing whitespace lint error

* Revemo not required comment

* Remove gliner extras from e2e tests to fix CI disk space issue

* Remove trailing comma in pyproject.toml to match main

---------

Co-authored-by: AJ (Ashitosh Jedhe) <ajedhe@microsoft.com>
Co-authored-by: Ron Shakutai <58519179+ShakutaiGit@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>
Co-authored-by: Ron Shakutai <58519179+RonShakutai@users.noreply.github.com>

* Migrate short-running workflows to ubuntu-slim runners (microsoft#1840)

* Initial plan

* Migrate eligible workflows to ubuntu-slim runners for cost efficiency

Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

* Document ubuntu-slim migration in CHANGELOG

Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

* Revert CodeQL to ubuntu-latest - CPU-intensive analysis requires 2+ cores

Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

* Fix CHANGELOG to use correct job name (github-pages-release)

Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

* Remove CHANGELOG modifications for ubuntu-slim migration

Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

* feat(recognizers): add UsMbiRecognizer for US Medicare Beneficiary ID (microsoft#1821)

* Fix language in pattern recognizer example (microsoft#1835)

* Update cryptography dependency to >=46.0.4 for CVE-2025-15467 (microsoft#1841)

* Initial plan

* Update cryptography dependency to >=46.0.4 to address CVE-2025-15467

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Add a configurable LangExtract recognizer for use with any provider. (microsoft#1815)

* Add a basic, configurable LangExtract-based recognizer class for use with any provider.

* Add a basic, configurable LangExtract-based recognizer class for use with any provider.

* Address comments (#4)

* Address comments

* Replace ollama_langextract_recognizer with basic_langextract_recognizer.

* Replace ollama_langextract_recognizer with basic_langextract_recognizer.

* Replace ollama_langextract_recognizer with basic_langextract_recognizer.

* Working so far

* Working so far

* Working so far

* remove dead code

* bad comment

---------

Co-authored-by: Kassymkhan Bekbolatov <kbekbolatov@solidcore.ai>

* Address comment in telackey/lellm (#5)

* Replace ollama_langextract_recognizer with basic_langextract_recognizer.

* Fix LangExtract error

---------

Co-authored-by: Kassymkhan Bekbolatov <kbekbolatov@solidcore.ai>

* Remove changes not required

* docstring

* update docs

* ruff

---------

Co-authored-by: Kassymkhan Bekbolatov <kasymhan007@gmail.com>
Co-authored-by: Kassymkhan Bekbolatov <kbekbolatov@solidcore.ai>

* Support batch processing over the REST API. (microsoft#1806)

* Support batch processing over the REST API.

* Partially fix e2e tests

* Fix e2e tests

* ruff

* ruff

* consistent use of strings

* Update API docs

* Support batch processing over the REST API.

* Partially fix e2e tests

* Fix e2e tests

* ruff

* ruff

* consistent use of strings

* Update API docs

* Fix Analyzer build on 3.10 (microsoft#1848)

* Update README.MD

* Update pyproject.toml

* Update pyproject.toml

* Update pyproject.toml

* Update pyproject.toml

* Update pyproject.toml

* Add salted hashing to hash operator to prevent brute-force attacks (microsoft#1846)

* Initial plan

* Implement salted hashing in Hash operator to prevent brute-force attacks

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Fix linting issues in hash operator implementation

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Address code review comments - improve salt precedence logic and fix type hint

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Pass hash_salt only to hash operator, not all operators

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Move salt generation into Hash operator, remove engine dependency

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Simplify hash operator: remove statefulness, use random salt per entity

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Add explicit breaking change disclaimer for hash operator

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Update version number to 2.2.361 in breaking change warning

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Convert Privacy Note to mkdocs note format

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Address s-zanella's security review: add salt validation and known-answer tests

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Remove unused operator_kwargs parameter from engine_base

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Update salt validation error message to clarify auto-generation

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Address review comments: restore operator_kwargs, move imports to top, reject empty salt

Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>

* Remove tests that relied on empty salt (now rejected)

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* restore tests with salt addition

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: Omri Mendels <omri374@users.noreply.github.com>
Co-authored-by: Sharon Hart <shhart@microsoft.com>

* Prepare release 2.2.361: bump versions and finalize changelog (microsoft#1851)

* Initial plan

* Update CHANGELOG.md with 30 PRs since v2.2.360

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Prepare release 2.2.361: bump versions and update CHANGELOG.md

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Feature - GPU Device Control via Environment Variable (microsoft#1844)

* init commit

* gpu on windows

* delete benchmark script and results file for Presidio Analyzer

* ruff fix

---------

Co-authored-by: Omri Mendels <omri374@users.noreply.github.com>

* feat: Add HuggingFaceNerRecognizer for direct NER model inference (microsoft#1834)

* Fix Broken Links (microsoft#1856)

This commit fixes broken links within the presidio documentation.

* Pin dependencies to mitigate supply chain attacks (microsoft#1861)

* Initial plan

* Pin GitHub Actions and Docker images to specific hashes for security

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Pin pip, poetry, and other build tool versions for security

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Add comprehensive dependency pinning security documentation

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix CI: Replace incorrect setup-python SHA with v6 tag

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Update dependency pinning docs to reflect pragmatic approach

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix CI: Replace invalid codeql-action and security-devops-action SHAs with version tags

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Update docs to list all actions using version tags

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix CI: Replace invalid py-cov-action SHA with version tag

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Update docs to include py-cov-action in version tag list

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Remove DEPENDENCY_PINNING.md as requested by maintainer

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Pin actions/setup-python to commit SHA e797f83 (v6.0.0)

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Pin codeql-action, security-devops-action, and py-cov-action to commit SHAs

Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

* Remove all pip upgrade commands as requested

Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>
Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>

* Fixing CVE-2024-47874 and CVE-2025-54121 (microsoft#1860)

* Fixing CVE-2024-47874 and CVE-2025-54121

Fixing CVE-2024-47874 and CVE-2025-54121 by bumping fastapi in samples

* Update FastAPI version to 0.129.0

* Fixing CVE-2025-2953 and CVE-2025-3730 (microsoft#1859)

* fix: Fixed context enhancement substring matching bug  (microsoft#1827)

* Fix _process_names unconditionally treating all metadata as PHI (microsoft#1855)

* feat: Add UK Postcode (UK_POSTCODE) recognizer (microsoft#1858)

Add a pattern-based recognizer for UK postcodes covering all six
standard formats (A9, A99, A9A, AA9, AA99, AA9A) plus GIR 0AA.
The regex enforces position-specific letter restrictions per Royal Mail
rules. Base score is 0.1 due to the short length of postcodes, with
context words (postcode, address, delivery, etc.) boosting confidence.
Disabled by default per country-specific convention.

Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Pin ruff and build pip installs by hash for OSSF scorecard compliance (microsoft#1864)

* Initial plan

* Pin pip commands with hashes for OSSF scorecard compliance

Add --require-hashes to all pip install commands across Dockerfiles,
CI workflows, and shell scripts. Generate locked requirements files
with hashes for e2e tests, docs, and sample deployments.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Revert requirements-locked.txt changes, keep only direct pip hash pinning

Remove all changes that involve requirements-locked.txt files per
user request. Retain direct pip install hash pinning for poetry,
ruff, build, and other version-pinned packages.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* fix: Remove --require-hashes/--hash from pip install commands

pip 26.0 removed the --hash CLI option. Hash verification via
command-line arguments is no longer supported; hashes can only be
specified in requirements files. Revert all pip install commands
to use simple version pinning (package==version) which is compatible
with all pip versions.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Pin pip commands by hash via requirements files for OSSF compliance

Use pip install --require-hashes with dedicated requirements files
instead of bare pip install commands. This satisfies the OSSF
scorecard pinned-dependencies check.

- Create requirements-poetry.txt, requirements-ruff.txt, and
  requirements-build.txt with SHA256 hashes
- Update all Dockerfiles to use hash-pinned poetry install
- Update ci.yml to use hash-pinned ruff, poetry, and build installs
- Update release.yml to use hash-pinned build install
- Update spark startup.sh to use hash-pinned requirements file

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Plan: inline hashes via /dev/stdin instead of separate files

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Inline pip hashes via printf|stdin, remove separate requirements files

Replace requirements-poetry.txt, requirements-ruff.txt, and
requirements-build.txt with inline printf|pip install -r /dev/stdin.
pip --hash only works inside requirements files, not as a CLI flag,
so piping through /dev/stdin is the most inline approach possible.
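
The requirements-file line format that carries a hash can be sketched as below. This is a minimal illustration, not the script used in this change; the function name `pinned_requirement`, the package name, the version, and the artifact bytes are all hypothetical.

```python
import hashlib


def pinned_requirement(name: str, version: str, artifact: bytes) -> str:
    """Build one requirements-file line pinning a package by SHA256."""
    digest = hashlib.sha256(artifact).hexdigest()
    return f"{name}=={version} --hash=sha256:{digest}"


# Hypothetical package name, version, and artifact bytes, for illustration only.
line = pinned_requirement("example-tool", "1.0.0", b"wheel bytes go here")
print(line)
```

A line in this form is what `pip install --require-hashes -r <file>` reads, whether the file is on disk or piped through `/dev/stdin`.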

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Simplify ruff hash: use single linux x86_64 hash instead of all 17 platform hashes

The lint job runs only on ubuntu-slim (x86_64), so only the
manylinux_2_17_x86_64 wheel hash is needed.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix Windows Dockerfiles: use PowerShell for hash-pinned pip install

The printf|/dev/stdin pattern is Unix-only and doesn't work on
Windows Server Core. Use PowerShell Set-Content to write a temp
requirements file instead. The hash is correct — poetry is a pure
Python wheel (py3-none-any), same on all platforms.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix CI: remove --no-deps from poetry and build pip installs

poetry and build have transitive dependencies (cleo, pyproject_hooks,
etc.) that must be installed. --no-deps prevented their installation,
causing ModuleNotFoundError: No module named 'cleo'.

ruff keeps --no-deps --require-hashes since it has zero Python deps.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Plan: move requirements files into .github/pipelines/

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Move CI requirements files to .github/pipelines/ directory

Organize all pip-compile generated requirements files under
.github/pipelines/ for ruff, poetry, and build tools. Replaces
inline printf hash approach with proper requirements files.

Component-level requirements-poetry.txt copies remain for Docker
build context (COPY requires files within build context).

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Remove unnecessary .in files from .github/pipelines/

The .in files are just pip-compile input files containing the direct
dependency name. This info is already in the header comment of each
generated .txt file, making the .in files redundant.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Share single requirements-poetry.txt via Docker --build-context

Replace 4 identical copies of requirements-poetry.txt (713 lines each)
with a single file at the repo root. Dockerfiles use COPY --from=pipelines
to access it from a named build context, eliminating duplication.

Also consolidate requirements-ruff.txt and requirements-build.txt
at the repo root.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Rename Docker build context from 'pipelines' to 'root'

The requirements files are at the repo root, not in a pipelines
directory. Rename the named build context to 'root' for clarity.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Scope down to CI-only ruff and build hash pinning

Revert all Docker changes (Dockerfiles, docker-compose.yml, build
context). Keep hash-pinned pip installs only for ruff and build in
CI workflow. Requirements files live in .github/pipelines/.

Reverted: all 10 Dockerfiles, docker-compose.yml, release.yml,
poetry install in ci.yml, spark startup.sh.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Add 0BSD to allowed licenses in dependency-review

Ruff's SPDX license expression is '0BSD AND Apache-2.0 AND
BSD-3-Clause AND MIT'. The 0BSD component was not in the allow
list, causing the dependency review to fail. 0BSD is a permissive
public-domain-equivalent license.

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

* Fix dependency-review: allow ruff's full compound SPDX license

The dependency-review-action doesn't decompose compound SPDX
expressions. Ruff's license '0BSD AND Apache-2.0 AND BSD-3-Clause
AND MIT' must be listed as a complete expression in allow-licenses.
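
The difference between whole-expression matching and component-wise matching can be sketched as follows. This only illustrates the behavior described above and is not the dependency-review-action's actual implementation; both function names are hypothetical.

```python
# Ruff's compound expression, as quoted in the commit message above.
RUFF_LICENSE = "0BSD AND Apache-2.0 AND BSD-3-Clause AND MIT"


def allowed_by_exact_match(expression: str, allow_list: set[str]) -> bool:
    """Whole-string check: the expression itself must be in the allow list."""
    return expression in allow_list


def allowed_by_components(expression: str, allow_list: set[str]) -> bool:
    """Hypothetical decomposing check: every AND-ed component must be allowed."""
    return all(part in allow_list for part in expression.split(" AND "))


components_only = {"0BSD", "Apache-2.0", "BSD-3-Clause", "MIT"}
with_compound = components_only | {RUFF_LICENSE}
```

Under whole-string matching, `components_only` rejects the compound expression even though each component is allowed, while `with_compound` accepts it.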

Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Add US NPI (National Provider Identifier) recognizer (microsoft#1847)

* Add transformer-based MedicalNERRecognizer for clinical entity detection (microsoft#1853)

* feat: Add Nigeria recognizers (National Identity Number and Vehicle Registration) (microsoft#1863)

* fix validation_result type in api docs and type hint (microsoft#1869)

* Bump actions/setup-python from 6.0.0 to 6.2.0 (microsoft#1879)

Bumps [actions/setup-python](https://github.com/actions/setup-python) from 6.0.0 to 6.2.0.
- [Release notes](https://github.com/actions/setup-python/releases)
- [Commits](actions/setup-python@e797f83...a309ff8)

---
updated-dependencies:
- dependency-name: actions/setup-python
  dependency-version: 6.2.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump github/codeql-action from 3.32.3 to 4.32.4 (microsoft#1878)

Bumps [github/codeql-action](https://github.com/github/codeql-action) from 3.32.3 to 4.32.4.
- [Release notes](https://github.com/github/codeql-action/releases)
- [Changelog](https://github.com/github/codeql-action/blob/main/CHANGELOG.md)
- [Commits](github/codeql-action@f5c2471...89a39a4)

---
updated-dependencies:
- dependency-name: github/codeql-action
  dependency-version: 4.32.4
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump actions/dependency-review-action from 3.1.5 to 4.8.3 (microsoft#1877)

Bumps [actions/dependency-review-action](https://github.com/actions/dependency-review-action) from 3.1.5 to 4.8.3.
- [Release notes](https://github.com/actions/dependency-review-action/releases)
- [Commits](actions/dependency-review-action@c74b580...05fe457)

---
updated-dependencies:
- dependency-name: actions/dependency-review-action
  dependency-version: 4.8.3
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump microsoft/security-devops-action from 1.11.0 to 1.12.0 (microsoft#1876)

Bumps [microsoft/security-devops-action](https://github.com/microsoft/security-devops-action) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/microsoft/security-devops-action/releases)
- [Commits](microsoft/security-devops-action@cc007d0...08976cb)

---
updated-dependencies:
- dependency-name: microsoft/security-devops-action
  dependency-version: 1.12.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump actions/github-script from 7.0.1 to 8.0.0 (microsoft#1875)

Bumps [actions/github-script](https://github.com/actions/github-script) from 7.0.1 to 8.0.0.
- [Release notes](https://github.com/actions/github-script/releases)
- [Commits](actions/github-script@60a0d83...ed59741)

---
updated-dependencies:
- dependency-name: actions/github-script
  dependency-version: 8.0.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump azure/login from 2.1.1 to 2.3.0 (microsoft#1874)

Bumps [azure/login](https://github.com/azure/login) from 2.1.1 to 2.3.0.
- [Release notes](https://github.com/azure/login/releases)
- [Commits](Azure/login@6c25186...a457da9)

---
updated-dependencies:
- dependency-name: azure/login
  dependency-version: 2.3.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump docker/setup-buildx-action from 3.7.1 to 3.12.0 (microsoft#1873)

Bumps [docker/setup-buildx-action](https://github.com/docker/setup-buildx-action) from 3.7.1 to 3.12.0.
- [Release notes](https://github.com/docker/setup-buildx-action/releases)
- [Commits](docker/setup-buildx-action@c47758b...8d2750c)

---
updated-dependencies:
- dependency-name: docker/setup-buildx-action
  dependency-version: 3.12.0
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump actions/cache from 4.2.0 to 5.0.3 (microsoft#1872)

Bumps [actions/cache](https://github.com/actions/cache) from 4.2.0 to 5.0.3.
- [Release notes](https://github.com/actions/cache/releases)
- [Changelog](https://github.com/actions/cache/blob/main/RELEASES.md)
- [Commits](actions/cache@1bd1e32...cdf6c1f)

---
updated-dependencies:
- dependency-name: actions/cache
  dependency-version: 5.0.3
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump actions/checkout from 4.2.2 to 6.0.2 (microsoft#1871)

Bumps [actions/checkout](https://github.com/actions/checkout) from 4.2.2 to 6.0.2.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](actions/checkout@11bd719...de0fac2)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-version: 6.0.2
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Bump actions/setup-dotnet from 4.0.1 to 5.1.0 (microsoft#1870)

Bumps [actions/setup-dotnet](https://github.com/actions/setup-dotnet) from 4.0.1 to 5.1.0.
- [Release notes](https://github.com/actions/setup-dotnet/releases)
- [Commits](actions/setup-dotnet@6bd8b7f...baa11fb)

---
updated-dependencies:
- dependency-name: actions/setup-dotnet
  dependency-version: 5.1.0
  dependency-type: direct:production
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump python from `9e01bf1` to `f3fa41d` in /presidio-analyzer (microsoft#1887)

Bumps python from `9e01bf1` to `f3fa41d`.

---
updated-dependencies:
- dependency-name: python
  dependency-version: 3.12-windowsservercore
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

* Bump python from `3de9a8d` to `f50f56f` in /presidio-anonymizer (microsoft#1886)

Bumps python from `3de9a8d` to `f50f56f`.

---
updated-dependencies:
- dependency-name: python
  dependency-version: 3.13-windowsservercore
  dependency-type: direct:production
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>

* Merge branch 'main' of https://github.com/microsoft/presidio

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: Jakob Serlier <37184788+Jakob-98@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Sharon Hart <sharonh.dev@gmail.com>
Co-authored-by: Ron Shakutai <58519179+RonShakutai@users.noreply.github.com>
Co-authored-by: Dilshad <124334195+dilshad-aee@users.noreply.github.com>
Co-authored-by: dilshad <dilshad@dilshads-MacBook-Air.local>
Co-authored-by: dilshad-aee <dilshad-aee@users.noreply.github.com>
Co-authored-by: RektPunk <rektpunk@gmail.com>
Co-authored-by: kim <83156897+kyoungbinkim@users.noreply.github.com>
Co-authored-by: jedheaj314 <51018779+jedheaj314@users.noreply.github.com>
Co-authored-by: AJ (Ashitosh Jedhe) <ajedhe@microsoft.com>
Co-authored-by: Ron Shakutai <58519179+ShakutaiGit@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
Co-authored-by: tamirkamara <26870601+tamirkamara@users.noreply.github.com>
Co-authored-by: Chris von Csefalvay <chris@chrisvoncsefalvay.com>
Co-authored-by: andyjessen <62343929+andyjessen@users.noreply.github.com>
Co-authored-by: SharonHart <15013757+SharonHart@users.noreply.github.com>
Co-authored-by: Thomas E Lackey <telackey@redbudcomputer.com>
Co-authored-by: Kassymkhan Bekbolatov <kasymhan007@gmail.com>
Co-authored-by: Kassymkhan Bekbolatov <kbekbolatov@solidcore.ai>
Co-authored-by: omri374 <3776619+omri374@users.noreply.github.com>
Co-authored-by: Omri Mendels <omri374@users.noreply.github.com>
Co-authored-by: Sharon Hart <shhart@microsoft.com>
Co-authored-by: taewoong Kim <116135174+ultramancode@users.noreply.github.com>
Co-authored-by: ravi-jindal <ravi.23189@gmail.com>
Co-authored-by: Harikrishna KP <harikp2002@gmail.com>
Co-authored-by: Tolulope Jegede <49379077+tee-jagz@users.noreply.github.com>
Co-authored-by: Steven Elliott <srichardelliottjr@gmail.com>
Co-authored-by: AKIOS <hello@akios.ai>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

Development

Successfully merging this pull request may close these issues.

Add HuggingFaceNerRecognizer for direct NER model inference

3 participants