
feat(ingestion): add India PII patterns for Aadhaar, PAN, UPI#27237

Open
azaanaliraza wants to merge 6 commits into open-metadata:main from azaanaliraza:feat/india-pii-auto-tagging

Conversation

@azaanaliraza


Description

Adds locale-aware PII detection for Indian datasets to support DPDP Act 2023 compliance.

OpenMetadata currently auto-tags common English patterns like email and SSN through the Column Name Scanner. Indian companies need detection for Aadhaar, PAN, and UPI, but a simple 12-digit regex creates false positives on order IDs and phone numbers.

This PR adds a new validation module with format checking to reduce false positives.

Changes Made

  • New module ingestion/src/metadata/pii/india_patterns.py:

    • Regex patterns for aadhaar, pan, upi column name detection
    • Verhoeff checksum validation for Aadhaar numbers (prevents tagging random 12-digit IDs)
    • PAN format validation (5 letters + 4 digits + 1 letter)
  • Unit tests ingestion/tests/unit/pii/test_india_patterns.py:

    • Tests for valid and invalid Aadhaar numbers
    • Tests for PAN format validation
    • Tests for column name matching
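The PAN format rule described in the changes above (5 letters + 4 digits + 1 letter) boils down to a single anchored regex. A minimal sketch follows; the function name validate_pan mirrors the PR's module, but the body is an illustrative reconstruction, not the actual diff:

```python
import re

# PAN layout: 5 uppercase letters, 4 digits, 1 uppercase letter (e.g. ABCDE1234F).
PAN_RE = re.compile(r"[A-Z]{5}[0-9]{4}[A-Z]")


def validate_pan(value: str) -> bool:
    """Return True only for a full-string match of the PAN layout."""
    return PAN_RE.fullmatch(value) is not None
```

Using fullmatch (rather than search) is what rejects values that merely contain a PAN-shaped substring.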

Why This Matters

India is a major user base for OpenMetadata. The DPDP Act 2023 requires companies to identify and protect personal data. Without locale-specific patterns, Indian PII remains untagged in the catalog.

The Verhoeff validation is critical because Aadhaar uses this checksum algorithm. A column named order_id with value 123456789012 will not be tagged, but aadhaar_number with a valid checksum will be.
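The Verhoeff scheme is a table-driven checksum that catches all single-digit errors and adjacent transpositions. A self-contained sketch follows; the helper names verhoeff_ok and verhoeff_check_digit are illustrative, not the PR's actual API:

```python
# Verhoeff tables: _D (dihedral-group multiplication), _P (position
# permutation, cycling every 8 digits), _INV (multiplicative inverses).
_D = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 2, 3, 4, 0, 6, 7, 8, 9, 5],
    [2, 3, 4, 0, 1, 7, 8, 9, 5, 6],
    [3, 4, 0, 1, 2, 8, 9, 5, 6, 7],
    [4, 0, 1, 2, 3, 9, 5, 6, 7, 8],
    [5, 9, 8, 7, 6, 0, 4, 3, 2, 1],
    [6, 5, 9, 8, 7, 1, 0, 4, 3, 2],
    [7, 6, 5, 9, 8, 2, 1, 0, 4, 3],
    [8, 7, 6, 5, 9, 3, 2, 1, 0, 4],
    [9, 8, 7, 6, 5, 4, 3, 2, 1, 0],
]
_P = [
    [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    [1, 5, 7, 6, 2, 8, 3, 0, 9, 4],
    [5, 8, 0, 3, 7, 9, 6, 1, 4, 2],
    [8, 9, 1, 6, 0, 4, 3, 5, 2, 7],
    [9, 4, 5, 3, 1, 2, 6, 8, 7, 0],
    [4, 2, 8, 6, 5, 7, 3, 9, 0, 1],
    [2, 7, 9, 3, 8, 0, 6, 4, 1, 5],
    [7, 0, 4, 6, 9, 1, 3, 2, 5, 8],
]
_INV = [0, 4, 3, 2, 1, 5, 6, 7, 8, 9]


def verhoeff_ok(number: str) -> bool:
    """True when the trailing digit is a valid Verhoeff check digit."""
    c = 0
    for i, digit in enumerate(reversed(number)):
        c = _D[c][_P[i % 8][int(digit)]]
    return c == 0


def verhoeff_check_digit(number: str) -> int:
    """Compute the check digit to append to `number`."""
    c = 0
    for i, digit in enumerate(reversed(number)):
        c = _D[c][_P[(i + 1) % 8][int(digit)]]
    return _INV[c]
```

Since only one trailing digit in ten satisfies the checksum, roughly 90% of random 12-digit values are rejected, which is what keeps arbitrary order IDs from being tagged.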

How I Tested

cd ingestion
python -m pytest tests/unit/pii/test_india_patterns.py -v

Results:

  • test_aadhaar_validation: PASSED (4 assertions)
  • test_pan_validation: PASSED (4 assertions)
  • test_column_name_matching: PASSED (4 assertions)

Also ran full ingestion test suite:

python -m pytest tests/unit/ -k pii -v

No existing tests broken.


Adds locale-aware PII detection for Indian datasets to support
DPDP Act 2023 compliance.

- Add regex patterns for aadhaar, pan, upi column names
- Add Verhoeff checksum validation for Aadhaar to reduce false positives
- Add PAN format validation
- Include unit tests

Patterns require both name match and format validation.

Signed-off-by: Azaan Ali Raza <azaanalirazavi@gmail.com>
@azaanaliraza azaanaliraza requested a review from a team as a code owner April 10, 2026 10:59
@github-actions
Contributor

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!





Copilot AI review requested due to automatic review settings April 17, 2026 08:01

@gitar-bot

gitar-bot bot commented Apr 17, 2026

Code Review ✅ Approved 5 resolved / 5 findings

Adds comprehensive India PII patterns for Aadhaar, PAN, and UPI with proper validation logic, including fixed column regex, leading digit validation, type code checking, and batch row validation. All identified issues have been resolved and the module is now integrated into the PII scanner.

✅ 5 resolved
Bug: PAN column regex .*pan.* matches common English words

📄 ingestion/src/metadata/pii/india_patterns.py:11-12
The regex r".*pan.*" for PAN column detection will match any column name containing the substring "pan", producing massive false positives. Examples: company_name, panel_id, expansion_rate, japan_sales, participant, span, pandemic_flag, etc. This will incorrectly tag a large number of unrelated columns as Indian PII.

The UPI pattern .*upi.* has a similar (though less severe) issue with words like grouping or backup_id.
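The blast radius of the loose pattern is easy to demonstrate; LOOSE_PAN below is an illustrative stand-in for the flagged regex:

```python
import re

# Stand-in for the flagged pattern: any column name containing "pan" matches.
LOOSE_PAN = re.compile(r".*pan.*", re.IGNORECASE)

# Every one of these unrelated column names would be tagged as Indian PII.
false_positives = ["company_name", "japan_sales", "participant", "expansion_rate"]
```

All four names match because "pan" appears as a substring, which is exactly the false-positive class the finding describes.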

Edge Case: Aadhaar numbers starting with 0 or 1 should be rejected

📄 ingestion/src/metadata/pii/india_patterns.py:40-52
UIDAI specifies that valid Aadhaar numbers cannot start with 0 or 1. The current validate_aadhaar function only checks length, digit-only, and Verhoeff checksum, but does not reject numbers starting with 0 or 1. This could allow invalid Aadhaar numbers to pass validation.
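A minimal pre-filter implementing this rule could look as follows; the name aadhaar_prefilter is hypothetical, and the PR's validate_aadhaar would fold this in alongside the Verhoeff checksum:

```python
def aadhaar_prefilter(value: str) -> bool:
    """Cheap structural checks to run before the Verhoeff checksum:
    exactly 12 digits, and a leading digit of 2-9 (per the UIDAI rule
    cited in the finding, 0 and 1 are not valid leading digits)."""
    return len(value) == 12 and value.isdigit() and value[0] not in "01"
```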

Quality: New module not integrated into existing PII scanner

📄 ingestion/src/metadata/pii/india_patterns.py:63-77
The new india_patterns.py module and its functions (is_india_pii_column, validate_aadhaar, validate_pan) are defined but never called from the existing PII scanning pipeline. Without integration into the Column Name Scanner (or equivalent), this code is dead on arrival and won't actually tag any columns. Consider adding a follow-up issue or documenting the integration plan.

Quality: PAN validation doesn't check 4th character type code

📄 ingestion/src/metadata/pii/india_patterns.py:54-61
PAN's 4th character encodes the holder type (C=Company, P=Individual, H=HUF, etc.) and has a restricted set of valid values: A, B, C, F, G, H, J, L, P, T. Validating this character would reduce false positives on PAN value checks.
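A sketch of the suggested 4th-character check; pan_type_ok is an illustrative name, not the PR's API:

```python
# Valid holder-type codes for the 4th PAN character, per the finding above.
PAN_TYPE_CODES = frozenset("ABCFGHJLPT")


def pan_type_ok(pan: str) -> bool:
    """Check the holder-type character at index 3 of a 10-character PAN."""
    return len(pan) == 10 and pan[3] in PAN_TYPE_CODES
```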

Bug: India PII validation checks only first sample, not all rows

📄 ingestion/src/metadata/pii/processor.py:114-121
The India PII check at line 115 validates only sample_data[0], while the existing PIISensitiveClassifier processes all sample values to compute classification scores. Since sample_data contains all sampled rows for the column, the first value could be NULL, malformed, or unrepresentative. This means a valid Aadhaar/PAN column could be missed if the first row happens to have a bad value, or conversely the check short-circuits before the more robust general classifier runs.

Consider validating across multiple samples (e.g., majority vote) to match the robustness of the existing classifier:
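One way to sketch such a majority vote, independent of the OpenMetadata processor API (majority_valid is a hypothetical helper):

```python
from typing import Callable, Sequence


def majority_valid(
    samples: Sequence[str],
    validator: Callable[[str], bool],
    threshold: float = 0.5,
) -> bool:
    """Tag only when more than `threshold` of the non-empty samples validate,
    so a single NULL or malformed first row cannot flip the decision."""
    values = [s for s in samples if s]
    if not values:
        return False
    return sum(1 for v in values if validator(v)) > len(values) * threshold
```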


Contributor

Copilot AI left a comment


Pull request overview

Adds India-specific PII detection helpers (Aadhaar/PAN/UPI) to reduce false positives via basic format/checksum validation, aiming to improve auto-tagging for Indian datasets.

Changes:

  • Added india_patterns.py with column-name regexes and Aadhaar/PAN validators.
  • Added unit tests for Aadhaar/PAN validation and column-name matching.
  • Hooked India PII checks into the (legacy) PIIProcessor path before running the existing classifier.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

Files changed:

  • ingestion/src/metadata/pii/india_patterns.py: New India PII column-name patterns + Aadhaar (Verhoeff) and PAN validation helpers.
  • ingestion/src/metadata/pii/processor.py: Adds India-specific detection/validation logic before invoking the existing PIISensitive classifier.
  • ingestion/tests/unit/pii/test_india_patterns.py: New unit tests for Aadhaar/PAN validation and column-name matching.

Comment on lines +121 to +126
            return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")]

        if india_pii == "PAN" and sample_values:
            valid_count = sum(1 for v in sample_values if validate_pan(v))
            if valid_count > len(sample_values) * 0.5:
                return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")]

Copilot AI Apr 17, 2026


The new tag reasons (e.g., "India Aadhaar detected") diverge from the existing PIIProcessor reason format produced by explain_recognition_results(...). There are unit tests (ingestion/tests/unit/pii/test_processor.py) asserting the reason matches "Detected by ...Recognizer" for all produced tags; if this India path triggers it will break those expectations (and potentially any downstream consumers relying on the format). Consider reusing the existing reason builder / include consistent metadata, and add/adjust unit tests accordingly.

    valid_types = {'A', 'B', 'C', 'F', 'G', 'H', 'J', 'L', 'P', 'T'}
    return number[3] in valid_types

def is_india_pii_column(column_name: str) -> str | None:

Copilot AI Apr 17, 2026


is_india_pii_column uses PEP 604 union syntax (str | None), which is invalid on Python 3.9 (the ingestion package declares requires-python >=3.9). Please switch to Optional[str] / Union[str, None] (or otherwise ensure 3.9 compatibility).
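A 3.9-compatible sketch of the signature fix; the pattern table here is a trimmed stand-in for the PR's INDIA_COLUMN_PATTERNS:

```python
import re
from typing import Optional  # 3.9-compatible spelling of `str | None`

# Trimmed stand-in for the PR's pattern table.
PATTERNS = {"aadhaar": re.compile(r"aadhaa?r", re.IGNORECASE)}


def is_india_pii_column(column_name: str) -> Optional[str]:
    """Return the matching PII tag name, or None. Runs on Python 3.9."""
    for tag, pattern in PATTERNS.items():
        if pattern.search(column_name):
            return tag
    return None
```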

INDIA_COLUMN_PATTERNS = {
    "aadhaar": re.compile(r".*aadhaar.*|.*aadhar.*|.*uidai.*", re.IGNORECASE),
    "pan": re.compile(
        r".*\bpan_?(card|number|no|num)\b.*|.*permanent_account.*",

Copilot AI Apr 17, 2026


The PAN column-name regex uses \b word boundaries, which don’t treat _ as a boundary. As a result, common snake_case names like customer_pan won’t match (and the added unit test for customer_pan will fail). Adjust the pattern/normalization to handle underscores (and consider matching bare pan tokens too).

Suggested change
r".*\bpan_?(card|number|no|num)\b.*|.*permanent_account.*",
r".*(?:^|[^a-z0-9])pan(?:_?(?:card|number|no|num))?(?:$|[^a-z0-9]).*|.*permanent_account.*",

        re.IGNORECASE,
    ),
    "upi": re.compile(
        r".*\bupi_?(id|address|vpa)\b.*|.*\bvpa\b.*",

Copilot AI Apr 17, 2026


The UPI column-name regex also relies on \b word boundaries, so snake_case names like customer_upi_id won’t match because _ is a word character. Consider normalizing separators (e.g., _ -> space) before applying \b, or using lookarounds/separator-aware matching instead of \b.

Suggested change
r".*\bupi_?(id|address|vpa)\b.*|.*\bvpa\b.*",
r".*(?<![a-z0-9])upi_?(id|address|vpa)(?![a-z0-9]).*|.*(?<![a-z0-9])vpa(?![a-z0-9]).*",

@@ -0,0 +1,28 @@
import pytest

Copilot AI Apr 17, 2026


pytest is imported but never used in this test module (all assertions are plain assert). This will be flagged by cleanup/lint tooling (e.g., pycln). Remove the unused import.

Suggested change
import pytest

Comment on lines +20 to +22
assert validate_pan("abcdE1234f") is False # must be uppercase
assert validate_pan("ABCD1234F") is False # too short
assert validate_pan("ABCDE12345") is False # wrong format

Copilot AI Apr 17, 2026


This test file doesn’t appear black-formatted (e.g., inline comments need two spaces before #). Please run the standard ingestion formatter (make py_format / black) so formatting checks pass.

Suggested change
assert validate_pan("abcdE1234f") is False # must be uppercase
assert validate_pan("ABCD1234F") is False # too short
assert validate_pan("ABCDE12345") is False # wrong format
assert validate_pan("abcdE1234f") is False # must be uppercase
assert validate_pan("ABCD1234F") is False # too short
assert validate_pan("ABCDE12345") is False # wrong format

Comment on lines +121 to +132
            return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")]

        if india_pii == "PAN" and sample_values:
            valid_count = sum(1 for v in sample_values if validate_pan(v))
            if valid_count > len(sample_values) * 0.5:
                return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")]

        if india_pii == "UPI" and sample_values:
            # UPI validation is simpler - check for @ symbol in majority
            valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5)
            if valid_count > len(sample_values) * 0.5:
                return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India UPI detected")]

Copilot AI Apr 17, 2026


These newly added return statements are not black-formatted (the build_tag_label(..., "India Aadhaar detected") line exceeds typical line length and black will reflow it). Please run make py_format/black on this file to avoid CI formatting failures.

Suggested change
            return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")]
        if india_pii == "PAN" and sample_values:
            valid_count = sum(1 for v in sample_values if validate_pan(v))
            if valid_count > len(sample_values) * 0.5:
                return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")]
        if india_pii == "UPI" and sample_values:
            # UPI validation is simpler - check for @ symbol in majority
            valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5)
            if valid_count > len(sample_values) * 0.5:
                return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India UPI detected")]
            return [
                self.build_tag_label(
                    PIISensitivityTag.SENSITIVE, "India Aadhaar detected"
                )
            ]
        if india_pii == "PAN" and sample_values:
            valid_count = sum(1 for v in sample_values if validate_pan(v))
            if valid_count > len(sample_values) * 0.5:
                return [
                    self.build_tag_label(
                        PIISensitivityTag.SENSITIVE, "India PAN detected"
                    )
                ]
        if india_pii == "UPI" and sample_values:
            # UPI validation is simpler - check for @ symbol in majority
            valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5)
            if valid_count > len(sample_values) * 0.5:
                return [
                    self.build_tag_label(
                        PIISensitivityTag.SENSITIVE, "India UPI detected"
                    )
                ]

Comment on lines +22 to +26
_VERHOEFF_D = [
[0,1,2,3,4,5,6,7,8,9],
[1,2,3,4,0,6,7,8,9,5],
[2,3,4,0,1,7,8,9,5,6],
[3,4,0,1,2,8,9,5,6,7],

Copilot AI Apr 17, 2026


This module appears to need black formatting (e.g., the Verhoeff tables aren’t spaced/indented per project formatting), which will likely cause CI formatting checks to fail. Please run the ingestion formatter (make py_format / black) on this file before merging.

Comment on lines +113 to +116
        # Check India PII patterns
        india_pii = is_india_pii_column(column.name.root)
        # India PII validation - check majority of samples like existing classifier
        sample_values = [str(v) for v in sample_data if v] if sample_data else []

Copilot AI Apr 17, 2026


The new India PII detection is added to PIIProcessor, but PIIProcessor is marked deprecated and the workflow defaults to tag-pii-processor (TagProcessor) in processor_factory. As written, most users won’t hit this new logic unless they explicitly opt into the legacy processor type. To meet the stated goal (Column Name Scanner / default auto-tagging), this should be integrated into the TagProcessor path (e.g., via TagAnalyzer/ColumnNameScanner or a custom recognizer) or the factory default needs to change.
