feat(ingestion): add India PII patterns for Aadhaar, PAN, UPI#27237

Open

azaanaliraza wants to merge 6 commits intoopen-metadata:mainfrom

azaanaliraza:feat/india-pii-auto-tagging

azaanaliraza commented Apr 10, 2026

Description

Adds locale-aware PII detection for Indian datasets to support DPDP Act 2023 compliance.

OpenMetadata currently auto-tags common English patterns like email and SSN through the Column Name Scanner. Indian companies need detection for Aadhaar, PAN, and UPI, but a simple 12-digit regex creates false positives on order IDs and phone numbers.

This PR adds a new validation module with format checking to reduce false positives.

Changes Made

New module ingestion/src/metadata/pii/india_patterns.py:
- Regex patterns for aadhaar, pan, upi column name detection
- Verhoeff checksum validation for Aadhaar numbers (prevents tagging random 12-digit IDs)
- PAN format validation (5 letters + 4 digits + 1 letter)
Unit tests ingestion/tests/unit/pii/test_india_patterns.py:
- Tests for valid and invalid Aadhaar numbers
- Tests for PAN format validation
- Tests for column name matching

Why This Matters

India is a major user base for OpenMetadata. The DPDP Act 2023 requires companies to identify and protect personal data. Without locale-specific patterns, Indian PII remains untagged in the catalog.

The Verhoeff validation is critical because Aadhaar uses this checksum algorithm. A column named order_id with value 123456789012 will not be tagged, but aadhaar_number with a valid checksum will be.

How I Tested

cd ingestion
python -m pytest tests/unit/pii/test_india_patterns.py -v

Results:

test_aadhaar_validation: PASSED (4 assertions)
test_pan_validation: PASSED (4 assertions)
test_column_name_matching: PASSED (4 assertions)

Also ran full ingestion test suite:

python -m pytest tests/unit/ -k pii -v

No existing tests broken.


          feat(ingestion): add India PII patterns for Aadhaar, PAN, UPI

Adds locale-aware PII detection for Indian datasets to support
DPDP Act 2023 compliance.

- Add regex patterns for aadhaar, pan, upi column names
- Add Verhoeff checksum validation for Aadhaar to reduce false positives
- Add PAN format validation
- Include unit tests

Patterns require both name match and format validation.

Signed-off-by: Azaan Ali Raza <azaanalirazavi@gmail.com>

azaanaliraza requested a review from a team as a code owner

April 10, 2026 10:59

Contributor

github-actions bot commented Apr 10, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

gitar-bot bot reviewed

View reviewed changes

ingestion/src/metadata/pii/india_patterns.py Outdated

gitar-bot bot reviewed

View reviewed changes

ingestion/src/metadata/pii/india_patterns.py

gitar-bot bot reviewed

View reviewed changes

ingestion/src/metadata/pii/india_patterns.py

gitar-bot bot reviewed

View reviewed changes

ingestion/src/metadata/pii/india_patterns.py Outdated


          fix: address review comments - tighten regex, add UIDAI checks, valid…

c3b60a5

…ate PAN type

Contributor

github-actions bot commented Apr 10, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

gitar-bot bot reviewed

View reviewed changes

ingestion/src/metadata/pii/processor.py Outdated


          fix: validate India PII across all samples using majority vote

9f6c619

Signed-off-by: Azaan Ali Raza <azaanalirazavi@gmail.com>

Contributor

github-actions bot commented Apr 10, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!


          Merge branch 'main' into feat/india-pii-auto-tagging

82cec35

Contributor

github-actions bot commented Apr 10, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!


          Merge branch 'main' into feat/india-pii-auto-tagging

57f6053

Contributor

github-actions bot commented Apr 10, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!


          Merge branch 'open-metadata:main' into feat/india-pii-auto-tagging

0c8018b

Copilot AI review requested due to automatic review settings

April 17, 2026 08:01

Contributor

github-actions bot commented Apr 17, 2026

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

Copilot started reviewing on behalf of azaanaliraza

April 17, 2026 08:01

gitar-bot bot commented Apr 17, 2026 •

edited

Loading

Code Review ✅ Approved 5 resolved / 5 findings

Adds comprehensive India PII patterns for Aadhaar, PAN, and UPI with proper validation logic, including fixed column regex, leading digit validation, type code checking, and batch row validation. All identified issues have been resolved and the module is now integrated into the PII scanner.

✅ 5 resolved

✅ Bug: PAN column regex .pan. matches common English words

📄 ingestion/src/metadata/pii/india_patterns.py:11-12
The regex r".*pan.*" for PAN column detection will match any column name containing the substring "pan", producing massive false positives. Examples: company_name, panel_id, expansion_rate, japan_sales, participant, span, pandemic_flag, etc. This will incorrectly tag a large number of unrelated columns as Indian PII.

The UPI pattern .*upi.* has a similar (though less severe) issue with words like grouping or backup_id.

✅ Edge Case: Aadhaar numbers starting with 0 or 1 should be rejected

📄 ingestion/src/metadata/pii/india_patterns.py:40-52
UIDAI specifies that valid Aadhaar numbers cannot start with 0 or 1. The current validate_aadhaar function only checks length, digit-only, and Verhoeff checksum, but does not reject numbers starting with 0 or 1. This could allow invalid Aadhaar numbers to pass validation.

✅ Quality: New module not integrated into existing PII scanner

📄 ingestion/src/metadata/pii/india_patterns.py:63-77
The new india_patterns.py module and its functions (is_india_pii_column, validate_aadhaar, validate_pan) are defined but never called from the existing PII scanning pipeline. Without integration into the Column Name Scanner (or equivalent), this code is dead on arrival and won't actually tag any columns. Consider adding a follow-up issue or documenting the integration plan.

✅ Quality: PAN validation doesn't check 4th character type code

📄 ingestion/src/metadata/pii/india_patterns.py:54-61
PAN's 4th character encodes the holder type (C=Company, P=Individual, H=HUF, etc.) and has a restricted set of valid values: A, B, C, F, G, H, J, L, P, T. Validating this character would reduce false positives on PAN value checks.

✅ Bug: India PII validation checks only first sample, not all rows

📄 ingestion/src/metadata/pii/processor.py:114-121
The India PII check at line 115 validates only sample_data[0], while the existing PIISensitiveClassifier processes all sample values to compute classification scores. Since sample_data contains all sampled rows for the column, the first value could be NULL, malformed, or unrepresentative. This means a valid Aadhaar/PAN column could be missed if the first row happens to have a bad value, or conversely the check short-circuits before the more robust general classifier runs.

Consider validating across multiple samples (e.g., majority vote) to match the robustness of the existing classifier:

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

Copilot AI reviewed

View reviewed changes

Contributor

Copilot AI left a comment

Pull request overview

Adds India-specific PII detection helpers (Aadhaar/PAN/UPI) to reduce false positives via basic format/checksum validation, aiming to improve auto-tagging for Indian datasets.

Changes:

Added india_patterns.py with column-name regexes and Aadhaar/PAN validators.
Added unit tests for Aadhaar/PAN validation and column-name matching.
Hooked India PII checks into the (legacy) PIIProcessor path before running the existing classifier.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 9 comments.

File	Description
`ingestion/src/metadata/pii/india_patterns.py`	New India PII column-name patterns + Aadhaar (Verhoeff) and PAN validation helpers.
`ingestion/src/metadata/pii/processor.py`	Adds India-specific detection/validation logic before invoking the existing PIISensitive classifier.
`ingestion/tests/unit/pii/test_india_patterns.py`	New unit tests for Aadhaar/PAN validation and column-name matching.

ingestion/src/metadata/pii/processor.py

Comment on lines +121 to +126

+                              return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")]
+                      if india_pii == "PAN" and sample_values:
+                          valid_count = sum(1 for v in sample_values if validate_pan(v))
+                          if valid_count > len(sample_values) * 0.5:
+                              return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")]

Copilot AI Apr 17, 2026

The new tag reasons (e.g., "India Aadhaar detected") diverge from the existing PIIProcessor reason format produced by explain_recognition_results(...). There are unit tests (ingestion/tests/unit/pii/test_processor.py) asserting the reason matches "Detected by ...Recognizer" for all produced tags; if this India path triggers it will break those expectations (and potentially any downstream consumers relying on the format). Consider reusing the existing reason builder / include consistent metadata, and add/adjust unit tests accordingly.

Copilot uses AI. Check for mistakes.

ingestion/src/metadata/pii/india_patterns.py

+                  valid_types = {'A', 'B', 'C', 'F', 'G', 'H', 'J', 'L', 'P', 'T'}
+                  return number[3] in valid_types
+              def is_india_pii_column(column_name: str) -> str | None:

Copilot AI Apr 17, 2026

is_india_pii_column uses PEP 604 union syntax (str | None), which is invalid on Python 3.9 (the ingestion package declares requires-python >=3.9). Please switch to Optional[str] / Union[str, None] (or otherwise ensure 3.9 compatibility).

Copilot uses AI. Check for mistakes.

ingestion/src/metadata/pii/india_patterns.py

+              INDIA_COLUMN_PATTERNS = {
+                  "aadhaar": re.compile(r".*aadhaar.*|.*aadhar.*|.*uidai.*", re.IGNORECASE),
+                  "pan": re.compile(
+                      r".*\bpan_?(card|number|no|num)\b.*|.*permanent_account.*",

Copilot AI Apr 17, 2026

The PAN column-name regex uses \b word boundaries, which don’t treat _ as a boundary. As a result, common snake_case names like customer_pan won’t match (and the added unit test for customer_pan will fail). Adjust the pattern/normalization to handle underscores (and consider matching bare pan tokens too).

Suggested change

      
                    r".*\bpan_?(card|number|no|num)\b.*|.*permanent_account.*",
          
                    r".*(?:^|[^a-z0-9])pan(?:_?(?:card|number|no|num))?(?:$|[^a-z0-9]).*|.*permanent_account.*",

Copilot uses AI. Check for mistakes.

ingestion/src/metadata/pii/india_patterns.py

+                      re.IGNORECASE,
+                  ),
+                  "upi": re.compile(
+                      r".*\bupi_?(id|address|vpa)\b.*|.*\bvpa\b.*",

Copilot AI Apr 17, 2026

The UPI column-name regex also relies on \b word boundaries, so snake_case names like customer_upi_id won’t match because _ is a word character. Consider normalizing separators (e.g., _ -> space) before applying \b, or using lookarounds/separator-aware matching instead of \b.

Suggested change

      
                    r".*\bupi_?(id|address|vpa)\b.*|.*\bvpa\b.*",
          
                    r".*(?<![a-z0-9])upi_?(id|address|vpa)(?![a-z0-9]).*|.*(?<![a-z0-9])vpa(?![a-z0-9]).*",

Copilot uses AI. Check for mistakes.

ingestion/tests/unit/pii/test_india_patterns.py

		@@ -0,0 +1,28 @@
		import pytest

Copilot AI Apr 17, 2026

pytest is imported but never used in this test module (all assertions are plain assert). This will be flagged by cleanup/lint tooling (e.g., pycln). Remove the unused import.

Suggested change

import pytest

Copilot uses AI. Check for mistakes.

ingestion/tests/unit/pii/test_india_patterns.py

Comment on lines +20 to +22

+                  assert validate_pan("abcdE1234f") is False # must be uppercase
+                  assert validate_pan("ABCD1234F") is False # too short
+                  assert validate_pan("ABCDE12345") is False # wrong format

Copilot AI Apr 17, 2026

This test file doesn’t appear black-formatted (e.g., inline comments need two spaces before #). Please run the standard ingestion formatter (make py_format / black) so formatting checks pass.

Suggested change

      
                assert validate_pan("abcdE1234f") is False # must be uppercase
          
                assert validate_pan("ABCD1234F") is False # too short
          
                assert validate_pan("ABCDE12345") is False # wrong format
          
                assert validate_pan("abcdE1234f") is False  # must be uppercase
          
                assert validate_pan("ABCD1234F") is False  # too short
          
                assert validate_pan("ABCDE12345") is False  # wrong format

Copilot uses AI. Check for mistakes.

ingestion/src/metadata/pii/processor.py

Comment on lines +121 to +132

+                              return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")]
+                      if india_pii == "PAN" and sample_values:
+                          valid_count = sum(1 for v in sample_values if validate_pan(v))
+                          if valid_count > len(sample_values) * 0.5:
+                              return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")]
+                      if india_pii == "UPI" and sample_values:
+                          # UPI validation is simpler - check for @ symbol in majority
+                          valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5)
+                          if valid_count > len(sample_values) * 0.5:
+                              return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India UPI detected")]

Copilot AI Apr 17, 2026

These newly added return statements are not black-formatted (the build_tag_label(..., "India Aadhaar detected") line exceeds typical line length and black will reflow it). Please run make py_format/black on this file to avoid CI formatting failures.

Suggested change

      
                            return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India Aadhaar detected")]
          
                    if india_pii == "PAN" and sample_values:
          
                        valid_count = sum(1 for v in sample_values if validate_pan(v))
          
                        if valid_count > len(sample_values) * 0.5:
          
                            return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India PAN detected")]
          
                    if india_pii == "UPI" and sample_values:
          
                        # UPI validation is simpler - check for @ symbol in majority
          
                        valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5)
          
                        if valid_count > len(sample_values) * 0.5:
          
                            return [self.build_tag_label(PIISensitivityTag.SENSITIVE, "India UPI detected")]
          
                            return [
          
                                self.build_tag_label(
          
                                    PIISensitivityTag.SENSITIVE, "India Aadhaar detected"
          
                                )
          
                            ]
          
                    if india_pii == "PAN" and sample_values:
          
                        valid_count = sum(1 for v in sample_values if validate_pan(v))
          
                        if valid_count > len(sample_values) * 0.5:
          
                            return [
          
                                self.build_tag_label(
          
                                    PIISensitivityTag.SENSITIVE, "India PAN detected"
          
                                )
          
                            ]
          
                    if india_pii == "UPI" and sample_values:
          
                        # UPI validation is simpler - check for @ symbol in majority
          
                        valid_count = sum(1 for v in sample_values if "@" in v and len(v) > 5)
          
                        if valid_count > len(sample_values) * 0.5:
          
                            return [
          
                                self.build_tag_label(
          
                                    PIISensitivityTag.SENSITIVE, "India UPI detected"
          
                                )
          
                            ]

Copilot uses AI. Check for mistakes.

ingestion/src/metadata/pii/india_patterns.py

Comment on lines +22 to +26

+              _VERHOEFF_D = [
+               [0,1,2,3,4,5,6,7,8,9],
+               [1,2,3,4,0,6,7,8,9,5],
+               [2,3,4,0,1,7,8,9,5,6],
+               [3,4,0,1,2,8,9,5,6,7],

Copilot AI Apr 17, 2026

This module appears to need black formatting (e.g., the Verhoeff tables aren’t spaced/indented per project formatting), which will likely cause CI formatting checks to fail. Please run the ingestion formatter (make py_format / black) on this file before merging.

Copilot uses AI. Check for mistakes.

ingestion/src/metadata/pii/processor.py

Comment on lines +113 to +116

+                      # Check India PII patterns
+                      india_pii = is_india_pii_column(column.name.root)
+                      # India PII validation - check majority of samples like existing classifier
+                      sample_values = [str(v) for v in sample_data if v] if sample_data else []

Copilot AI Apr 17, 2026

The new India PII detection is added to PIIProcessor, but PIIProcessor is marked deprecated and the workflow defaults to tag-pii-processor (TagProcessor) in processor_factory. As written, most users won’t hit this new logic unless they explicitly opt into the legacy processor type. To meet the stated goal (Column Name Scanner / default auto-tagging), this should be integrated into the TagProcessor path (e.g., via TagAnalyzer/ColumnNameScanner or a custom recognizer) or the factory default needs to change.

Copilot uses AI. Check for mistakes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet