Skip to content

Fix incorrect PESEL checksum validation in PlPeselRecognizer#1520

Closed
BlaiseCz wants to merge 7 commits intomicrosoft:mainfrom
BlaiseCz:fix-pesel-checksum
Closed

Fix incorrect PESEL checksum validation in PlPeselRecognizer#1520
BlaiseCz wants to merge 7 commits intomicrosoft:mainfrom
BlaiseCz:fix-pesel-checksum

Conversation

@BlaiseCz
Copy link

@BlaiseCz BlaiseCz commented Jan 31, 2025

Bug Description

The PESEL checksum validation in PlPeselRecognizer.validate_result() is incorrect. The current implementation does not correctly compute the control digit, leading to false negatives, where valid PESEL numbers are incorrectly rejected.

This affects Presidio's ability to correctly recognize and validate PESEL numbers, impacting anonymization and sensitive data detection.


To Reproduce

Run the following test:

from presidio_analyzer.predefined_recognizers import PlPeselRecognizer

pesel_recognizer = PlPeselRecognizer()

valid_pesel = "44051401359"  # This is a valid PESEL
print(pesel_recognizer.validate_result(valid_pesel))  # Expected: True, Actual: False

**Note if unsure, check this: https://kalkulatory.gofin.pl/kalkulatory/sprawdzanie-pesel-weryfikacja-pesel

Observed Behavior

  • The function returns False for a valid PESEL due to incorrect checksum computation.

Expected Behavior

  • A valid PESEL (with the correct checksum) should return True.

Root Cause: Incorrect Checksum Calculation

The issue lies in the final checksum validation step. The existing code:

checksum = sum(digit * weight for digit, weight in zip(digits[:10], weights))
checksum %= 10

return checksum == digits[10]  # ❌ Incorrect final checksum check!

This incorrectly compares checksum directly to the last digit of PESEL instead of computing the correct control digit.


Proposed Fix

The correct formula to compute the PESEL checksum is:

def validate_result(self, pattern_text: str) -> bool:  # noqa D102
    if len(pattern_text) != 11 or not pattern_text.isdigit():
        return False  # Ensure the input is a valid 11-digit number

    digits = [int(digit) for digit in pattern_text]
    weights = [1, 3, 7, 9, 1, 3, 7, 9, 1, 3]  # Correct weights

    checksum = sum(digit * weight for digit, weight in zip(digits[:10], weights)) % 10
    check_digit = (10 - checksum) % 10  # ✅ Corrected final checksum computation

    return check_digit == digits[10]  # ✅ Now correctly compares with the last digit

Why This Fix Works

  • Ensures the checksum modulo 10 logic is correctly applied.
  • Guarantees that only valid PESELs pass the validation.
  • Fixes the false-negative issue without introducing false positives.

Additional Context

  • This issue impacts Polish users relying on PESEL validation in Presidio.
  • The bug affects data masking and validation accuracy.
  • Fixing this ensures compliance with official PESEL formatting rules.

@omri374
Copy link
Collaborator

omri374 commented Feb 4, 2025

Thanks!

@omri374
Copy link
Collaborator

omri374 commented Feb 4, 2025

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Collaborator

omri374 commented Feb 5, 2025

note that we have another issue which affects the CI (to be fixed in #1522), however the pesel recognizer tests are failing too. Would you mind taking a look?

@omri374
Copy link
Collaborator

omri374 commented Feb 6, 2025

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374
Copy link
Collaborator

omri374 commented Mar 31, 2025

/azp run

@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@omri374 omri374 requested a review from Copilot March 31, 2025 18:34
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@omri374 omri374 requested a review from Copilot March 31, 2025 18:44
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

The PR fixes an incorrect PESEL checksum validation in the PlPeselRecognizer that was causing valid PESEL numbers to be falsely rejected.

  • Added input validation to check for an 11-digit number.
  • Updated the checksum calculation to correctly compute the control digit.
  • Modified the final check to compare the computed control digit with the last digit of the PESEL.

@omri374
Copy link
Collaborator

omri374 commented Apr 6, 2025

@BlaiseCz this is a great addition. Would you be interested in fixing the tests and merging this into the package?

@omri374
Copy link
Collaborator

omri374 commented Jun 1, 2025

Closing due to missing tests, feel free to re-open if you're interested in completing the tests or get any support from the maintenance team.

@omri374 omri374 closed this Jun 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants