Fix incorrect PESEL checksum validation in PlPeselRecognizer #1998

Merged
omri374 merged 3 commits into microsoft:main from sienioApius:fix/pl-pesel-checksum on Apr 28, 2026

Conversation

@sienioApius
Contributor

@sienioApius sienioApius commented Apr 23, 2026

Change Description

The Polish PESEL check-digit algorithm is check = (10 - weighted_sum % 10) % 10,
where weighted_sum is the sum of the first ten digits multiplied by the weights
1, 3, 7, 9, 1, 3, 7, 9, 1, 3 (see https://en.wikipedia.org/wiki/PESEL#Check_digit).
PlPeselRecognizer.validate_result currently compares the raw
weighted_sum % 10 to the check digit instead, which rejects most real, valid
PESEL numbers.

Example: 44051401458 is the canonical example used in official Polish
documentation and is algorithmically valid, but the current code rejects it:

>>> from presidio_analyzer.predefined_recognizers import PlPeselRecognizer
>>> PlPeselRecognizer().validate_result("44051401458")
False   # expected True

This PR:

  • Corrects the check-digit formula.
  • Guards against inputs that are not exactly 11 digits or that contain
    non-numeric characters (previously, a regex match shorter than 11
    characters raised IndexError).
  • Replaces the existing test fixtures, which were generated using the buggy
    formula and would not pass real-world PESEL validators, with PESEL numbers
    that are valid under the true algorithm, plus negative cases covering wrong
    check digits, wrong length, and non-digit input.
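
The guard and the corrected formula listed above can be sketched as a standalone function. This is a minimal illustration, not the code merged in this PR: `is_valid_pesel` and `PESEL_WEIGHTS` are hypothetical names, and the weights are taken from the PESEL check-digit definition linked above rather than from the diff.

```python
# Hypothetical standalone sketch; not the actual PlPeselRecognizer code.
PESEL_WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def is_valid_pesel(text: str) -> bool:
    """Return True if `text` is 11 ASCII digits with a correct check digit."""
    # Guard first, so short or non-numeric input never reaches the
    # digit arithmetic below (and can never raise IndexError).
    if len(text) != 11 or not (text.isascii() and text.isdigit()):
        return False

    digits = [int(ch) for ch in text]
    # Weighted sum over the first ten digits only; the 11th is the check digit.
    weighted_sum = sum(d * w for d, w in zip(digits[:10], PESEL_WEIGHTS))
    # Corrected formula: the outer "% 10" maps a result of 10 back to 0.
    expected_check = (10 - weighted_sum % 10) % 10
    return digits[10] == expected_check
```

Returning False for malformed input, rather than raising, keeps the sketch usable as a validation callback that only has to answer yes or no.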

The previous "valid" fixture 11111111114 is not a real PESEL: the correct
check digit for the prefix 1111111111 is 6, not 4. The updated fixtures
(44051401458, 02070803628, 11111111116) all verify against the actual PESEL
standard.
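
The fixture arithmetic can be checked by hand. For the prefix 1111111111 the weighted sum is 1 + 3 + 7 + 9 + 1 + 3 + 7 + 9 + 1 + 3 = 44, so the old comparison expected 44 % 10 = 4, while the correct check digit is (10 - 4) % 10 = 6. The short sketch below reproduces this for all three updated fixtures; the helper names are hypothetical and exist only for this illustration.

```python
# Hypothetical helpers comparing the old (buggy) and corrected formulas.
WEIGHTS = (1, 3, 7, 9, 1, 3, 7, 9, 1, 3)


def weighted_sum(first_ten: str) -> int:
    """Weighted sum of the first ten PESEL digits."""
    return sum(int(ch) * w for ch, w in zip(first_ten, WEIGHTS))


def old_check_digit(first_ten: str) -> int:
    """Value the previous implementation compared against the 11th digit."""
    return weighted_sum(first_ten) % 10


def correct_check_digit(first_ten: str) -> int:
    """Check digit under the actual PESEL algorithm."""
    return (10 - weighted_sum(first_ten) % 10) % 10


for pesel in ("11111111116", "44051401458", "02070803628"):
    prefix = pesel[:10]
    print(pesel, weighted_sum(prefix), old_check_digit(prefix), correct_check_digit(prefix))
# 11111111116 44 4 6
# 44051401458 102 2 8
# 02070803628 132 2 8
```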

Issue reference

This completes the work started in #1520 (closed on 2025-06-01 due to missing
tests). Credit goes to the original author, @BlaiseCz, for the analysis.

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required) (will sign when CLA bot prompts)
  • My code includes unit tests
  • All unit tests and lint checks pass locally (pytest tests/test_pl_pesel_recognizer.py → 18 passed)
  • My PR contains documentation updates / additions if required

The PESEL check-digit algorithm is `check = (10 - weighted_sum % 10) % 10`
(https://en.wikipedia.org/wiki/PESEL#Check_digit). The previous
implementation compared the raw `weighted_sum % 10` to the check digit,
which incorrectly rejects valid PESEL numbers such as 44051401458
(the canonical example cited in official Polish documentation) and
accepts a real PESEL only when its check digit happens to be 0 or 5,
the two values for which both formulas agree.

This completes microsoft#1520, which was closed due to missing test coverage.

Changes:
- Correct the check-digit formula in `validate_result`.
- Guard against non-11-digit / non-numeric inputs.
- Replace the previous test fixtures (which relied on the buggy
  formula) with PESELs that are valid under the real algorithm, plus
  negative cases covering bad check digits, wrong length, and
  non-digit characters.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@sienioApius
Contributor Author

@microsoft-github-policy-service agree company="APIUS Technologies S.A."

@omri374 omri374 merged commit 453bebc into microsoft:main Apr 28, 2026
34 checks passed