feat(ingestion): add India PII patterns for Aadhaar, PAN, UPI#27237
feat(ingestion): add India PII patterns for Aadhaar, PAN, UPI#27237azaanaliraza wants to merge 6 commits intoopen-metadata:mainfrom
Conversation
Adds locale-aware PII detection for Indian datasets to support DPDP Act 2023 compliance. - Add regex patterns for aadhaar, pan, upi column names - Add Verhoeff checksum validation for Aadhaar to reduce false positives - Add PAN format validation - Include unit tests Patterns require both name match and format validation. Signed-off-by: Azaan Ali Raza <azaanalirazavi@gmail.com>
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
Signed-off-by: Azaan Ali Raza <azaanalirazavi@gmail.com>
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
|
Hi there 👋 Thanks for your contribution! The OpenMetadata team will review the PR shortly! Once it has been labeled as Let us know if you need any help! |
Code Review ✅ Approved 5 resolved / 5 findingsAdds comprehensive India PII patterns for Aadhaar, PAN, and UPI with proper validation logic, including fixed column regex, leading digit validation, type code checking, and batch row validation. All identified issues have been resolved and the module is now integrated into the PII scanner. ✅ 5 resolved✅ Bug: PAN column regex
|
| Compact |
|
Was this helpful? React with 👍 / 👎 | Gitar
There was a problem hiding this comment.
Pull request overview
Adds India-specific PII detection helpers (Aadhaar/PAN/UPI) to reduce false positives via basic format/checksum validation, aiming to improve auto-tagging for Indian datasets.
Changes:
- Added
india_patterns.pywith column-name regexes and Aadhaar/PAN validators. - Added unit tests for Aadhaar/PAN validation and column-name matching.
- Hooked India PII checks into the (legacy)
PIIProcessorpath before running the existing classifier.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
ingestion/src/metadata/pii/india_patterns.py |
New India PII column-name patterns + Aadhaar (Verhoeff) and PAN validation helpers. |
ingestion/src/metadata/pii/processor.py |
Adds India-specific detection/validation logic before invoking the existing PIISensitive classifier. |
ingestion/tests/unit/pii/test_india_patterns.py |
New unit tests for Aadhaar/PAN validation and column-name matching. |
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")] | ||
|
|
||
| if india_pii == "PAN" and sample_values: | ||
| valid_count = sum(1 for v in sample_values if validate_pan(v)) | ||
| if valid_count > len(sample_values) * 0.5: | ||
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")] |
There was a problem hiding this comment.
The new tag reasons (e.g., "India Aadhaar detected") diverge from the existing PIIProcessor reason format produced by explain_recognition_results(...). There are unit tests (ingestion/tests/unit/pii/test_processor.py) asserting the reason matches "Detected by ...Recognizer" for all produced tags; if this India path triggers it will break those expectations (and potentially any downstream consumers relying on the format). Consider reusing the existing reason builder / include consistent metadata, and add/adjust unit tests accordingly.
| valid_types = {'A', 'B', 'C', 'F', 'G', 'H', 'J', 'L', 'P', 'T'} | ||
| return number[3] in valid_types | ||
|
|
||
| def is_india_pii_column(column_name: str) -> str | None: |
There was a problem hiding this comment.
is_india_pii_column uses PEP 604 union syntax (str | None), which is invalid on Python 3.9 (the ingestion package declares requires-python >=3.9). Please switch to Optional[str] / Union[str, None] (or otherwise ensure 3.9 compatibility).
| INDIA_COLUMN_PATTERNS = { | ||
| "aadhaar": re.compile(r".*aadhaar.*|.*aadhar.*|.*uidai.*", re.IGNORECASE), | ||
| "pan": re.compile( | ||
| r".*\bpan_?(card|number|no|num)\b.*|.*permanent_account.*", |
There was a problem hiding this comment.
The PAN column-name regex uses \b word boundaries, which don’t treat _ as a boundary. As a result, common snake_case names like customer_pan won’t match (and the added unit test for customer_pan will fail). Adjust the pattern/normalization to handle underscores (and consider matching bare pan tokens too).
| r".*\bpan_?(card|number|no|num)\b.*|.*permanent_account.*", | |
| r".*(?:^|[^a-z0-9])pan(?:_?(?:card|number|no|num))?(?:$|[^a-z0-9]).*|.*permanent_account.*", |
| re.IGNORECASE, | ||
| ), | ||
| "upi": re.compile( | ||
| r".*\bupi_?(id|address|vpa)\b.*|.*\bvpa\b.*", |
There was a problem hiding this comment.
The UPI column-name regex also relies on \b word boundaries, so snake_case names like customer_upi_id won’t match because _ is a word character. Consider normalizing separators (e.g., _ -> space) before applying \b, or using lookarounds/separator-aware matching instead of \b.
| r".*\bupi_?(id|address|vpa)\b.*|.*\bvpa\b.*", | |
| r".*(?<![a-z0-9])upi_?(id|address|vpa)(?![a-z0-9]).*|.*(?<![a-z0-9])vpa(?![a-z0-9]).*", |
| @@ -0,0 +1,28 @@ | |||
| import pytest | |||
There was a problem hiding this comment.
pytest is imported but never used in this test module (all assertions are plain assert). This will be flagged by cleanup/lint tooling (e.g., pycln). Remove the unused import.
| import pytest |
| assert validate_pan("abcdE1234f") is False # must be uppercase | ||
| assert validate_pan("ABCD1234F") is False # too short | ||
| assert validate_pan("ABCDE12345") is False # wrong format |
There was a problem hiding this comment.
This test file doesn’t appear black-formatted (e.g., inline comments need two spaces before #). Please run the standard ingestion formatter (make py_format / black) so formatting checks pass.
| assert validate_pan("abcdE1234f") is False # must be uppercase | |
| assert validate_pan("ABCD1234F") is False # too short | |
| assert validate_pan("ABCDE12345") is False # wrong format | |
| assert validate_pan("abcdE1234f") is False # must be uppercase | |
| assert validate_pan("ABCD1234F") is False # too short | |
| assert validate_pan("ABCDE12345") is False # wrong format |
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")] | ||
|
|
||
| if india_pii == "PAN" and sample_values: | ||
| valid_count = sum(1 for v in sample_values if validate_pan(v)) | ||
| if valid_count > len(sample_values) * 0.5: | ||
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")] | ||
|
|
||
| if india_pii == "UPI" and sample_values: | ||
| # UPI validation is simpler - check for @ symbol in majority | ||
| valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5) | ||
| if valid_count > len(sample_values) * 0.5: | ||
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India UPI detected")] |
There was a problem hiding this comment.
These newly added return statements are not black-formatted (the build_tag_label(..., "India Aadhaar detected") line exceeds typical line length and black will reflow it). Please run make py_format/black on this file to avoid CI formatting failures.
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")] | |
| if india_pii == "PAN" and sample_values: | |
| valid_count = sum(1 for v in sample_values if validate_pan(v)) | |
| if valid_count > len(sample_values) * 0.5: | |
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")] | |
| if india_pii == "UPI" and sample_values: | |
| # UPI validation is simpler - check for @ symbol in majority | |
| valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5) | |
| if valid_count > len(sample_values) * 0.5: | |
| return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India UPI detected")] | |
| return [ | |
| self.build_tag_label( | |
| PIISensitivityTag.SENSITIVE, "India Aadhaar detected" | |
| ) | |
| ] | |
| if india_pii == "PAN" and sample_values: | |
| valid_count = sum(1 for v in sample_values if validate_pan(v)) | |
| if valid_count > len(sample_values) * 0.5: | |
| return [ | |
| self.build_tag_label( | |
| PIISensitivityTag.SENSITIVE, "India PAN detected" | |
| ) | |
| ] | |
| if india_pii == "UPI" and sample_values: | |
| # UPI validation is simpler - check for @ symbol in majority | |
| valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5) | |
| if valid_count > len(sample_values) * 0.5: | |
| return [ | |
| self.build_tag_label( | |
| PIISensitivityTag.SENSITIVE, "India UPI detected" | |
| ) | |
| ] |
| _VERHOEFF_D = [ | ||
| [0,1,2,3,4,5,6,7,8,9], | ||
| [1,2,3,4,0,6,7,8,9,5], | ||
| [2,3,4,0,1,7,8,9,5,6], | ||
| [3,4,0,1,2,8,9,5,6,7], |
There was a problem hiding this comment.
This module appears to need black formatting (e.g., the Verhoeff tables aren’t spaced/indented per project formatting), which will likely cause CI formatting checks to fail. Please run the ingestion formatter (make py_format / black) on this file before merging.
| # Check India PII patterns | ||
| india_pii = is_india_pii_column(column.name.root) | ||
| # India PII validation - check majority of samples like existing classifier | ||
| sample_values = [str(v) for v in sample_data if v] if sample_data else [] |
There was a problem hiding this comment.
The new India PII detection is added to PIIProcessor, but PIIProcessor is marked deprecated and the workflow defaults to tag-pii-processor (TagProcessor) in processor_factory. As written, most users won’t hit this new logic unless they explicitly opt into the legacy processor type. To meet the stated goal (Column Name Scanner / default auto-tagging), this should be integrated into the TagProcessor path (e.g., via TagAnalyzer/ColumnNameScanner or a custom recognizer) or the factory default needs to change.
Description
Adds locale-aware PII detection for Indian datasets to support DPDP Act 2023 compliance.
OpenMetadata currently auto-tags common English patterns like email and SSN through the Column Name Scanner. Indian companies need detection for Aadhaar, PAN, and UPI, but a simple 12-digit regex creates false positives on order IDs and phone numbers.
This PR adds a new validation module with format checking to reduce false positives.
Changes Made
New module
ingestion/src/metadata/pii/india_patterns.py:aadhaar,pan,upicolumn name detectionUnit tests
ingestion/tests/unit/pii/test_india_patterns.py:Why This Matters
India is a major user base for OpenMetadata. The DPDP Act 2023 requires companies to identify and protect personal data. Without locale-specific patterns, Indian PII remains untagged in the catalog.
The Verhoeff validation is critical because Aadhaar uses this checksum algorithm. A column named
order_idwith value123456789012will not be tagged, butaadhaar_numberwith a valid checksum will be.How I Tested
cd ingestion python -m pytest tests/unit/pii/test_india_patterns.py -vResults:
test_aadhaar_validation: PASSED (4 assertions)test_pan_validation: PASSED (4 assertions)test_column_name_matching: PASSED (4 assertions)Also ran full ingestion test suite:
No existing tests broken.