refactor(de-recognizers): complete Prüfziffer validation per primary sources (fixes #1972, supersedes #1974) by MvdB · Pull Request #1990 · microsoft/presidio

MvdB · 2026-04-19T18:44:58Z

Change Description

Extends the pattern-focused work of #1909 by completing the Prüfziffer validation layer of the German recognizers. #1909 brought in all 13 DE recognizers with correct regex patterns and context words, but the checksum / structural-validation side was not comprehensively handled: several algorithms were implemented incorrectly and several were missing entirely. Issue #1972 was the visible symptom of this broader gap. During the follow-up audit I verified each recognizer against its primary source and corrected or added the missing Prüfziffer logic; each new test fixture is derived from a primary-source worked example to prevent the spec-vs-implementation drift from silently recurring.

Corrected checksum algorithms

DE_HEALTH_INSURANCE — factors per § 290 SGB V Anlage 1 ([1,2,1,2,…], Quersumme on products ≥ 10, sum mod 10). Verified against Anlage 1 worked examples A000500015 and C000500021.
DE_SOCIAL_SECURITY (RVNR) — weights per VKVV § 4 ([2,1,2,5,7,1,2,1,2,1,2,1]). Verified against DRV canonical example 15070649C103.
DE_LANR — weights per KBV Arztnummern-Richtlinie ([4,9,4,9,4,9]), without the spurious Quersumme step, using the complement-to-10 formula (10 − sum mod 10) mod 10. Verified against physician digits 123456 → 6.

Added missing structural checksums

DE_VAT_ID — ISO 7064 Mod 11,10, identical to DE_TAX_ID. Widely used by community validators (python-stdnum, VIES-adjacent); BZSt does not publish it in a Merkblatt but it is the in-practice-authoritative algorithm.
DE_PASSPORT, DE_ID_CARD — ICAO Doc 9303 MRZ check digit (weights 7, 3, 1 repeating; letters A=10 … Z=35; sum mod 10).
DE_ID_CARD regex last character tightened to [0-9] (the ICAO check digit is always numeric).
DE_BSNR — no public Prüfziffer exists (confirmed against KBV Arztnummern-Richtlinie and HL7-Wiki), so added a KV-regional-code whitelist per Arztnummern-Richtlinie Anlage 1 as defense-in-depth.

Minor

DE_TAX_ID — enforce the post-2016 BZSt repetition rule (max 3 repetitions in positions 1–10), replacing the lenient "all-10 must not be identical" check.
Registered DeLanrRecognizer, DeBsnrRecognizer, DeVatIdRecognizer and DeFuehrerscheinRecognizer in default_recognizers.yaml — these four were imported in code but missing from the registry configuration, so they were unreachable via the default registry.
Recipe docs/recipes/german-language-support/README.md sample data updated to identifiers that pass the corrected validators.

Issue reference

Fixes #1972. Supersedes #1974.

Checklist

I have reviewed the contribution guidelines
I have signed the CLA (if required)
My code includes unit tests (each recognizer's fixtures are derived directly from a primary-source worked example cited in the test docstring)
All unit tests and lint checks pass locally (2239 passed, 12 skipped on the full analyzer suite; ruff clean)
My PR contains documentation updates / additions if required (CHANGELOG + recipe sample)

The previous implementation applied weights [2,1,2,1,2,1,2,1,2,1,2,1], which is a plain Luhn-like pattern and does not match VKVV § 4 (Verordnung über die Vergabe der Versicherungsnummer). The correct weights per VKVV § 4 and DRV technical documentation are [2,1,2,5,7,1,2,1,2,1,2,1], with a cross-sum (Quersumme) applied to every product and the final check digit = sum mod 10. Verified against the canonical DRV example 15070649C103: letter C=03, effective 150706490310, products 2,5,0,35,0,6,8,9,0,3,2,0, cross-sums 2,5,0,8,0,6,8,9,0,3,2,0 = 43, 43 mod 10 = 3 matches the check digit. Test fixtures regenerated from the corrected algorithm. The previous "valid" fixture 65070803A018 was self-consistent with the wrong weights only; under VKVV § 4 the same prefix yields check digit 9, so the fixture is replaced with 65070803A019 and the incorrect original is moved to the negative-case set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous implementation had three compounding errors: - Wrong weights [4,9,2,10,5,3] instead of alternating [4,9,4,9,4,9] - Spurious Quersumme step on products > 9 that the spec does not define - Wrong formula (total % 10) instead of (10 - total % 10) % 10 Per the KBV Arztnummern-Richtlinie ("Die ersten sechs Ziffern werden von links beginnend abwechselnd mit 4 und 9 multipliziert. Die Summe dieser Produkte wird durch Modulo 10 dividiert. Die Differenz dieser Zahl zu 10 ergibt die Prüfziffer (ist die Differenz 10, so ist die Prüfziffer 0)."), the correct algorithm uses alternating weights 4 and 9, no cross-sum, and the complement-to-10 formula. Verified against the KBV / Wikipedia canonical example: physician digits 123456 produce products 4, 18, 12, 36, 20, 54 with sum 144; (10 − 144 mod 10) mod 10 = 6, giving LANR 123456601. The three previous "valid" fixtures (123456901, 234567601, 100000401) were all self-consistent with the buggy code only and would be rejected by a KBV-conformant validator; they are replaced with the recomputed 123456601, 234567701, 100000601 and moved to the negative-case set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous recognizer was pure-regex and would match any DE + 9 digits sequence. The 9th digit of a German USt-IdNr. is in practice a check digit derived via ISO 7064 Mod 11,10 — identical to the algorithm used for the Steuer-IdNr. While BZSt does not publish this in an official Merkblatt, it is the algorithm used by every community implementation (python-stdnum, VIES-adjacent validators). Validating it here cuts the false-positive rate substantially at the cost of rejecting the rare theoretical USt-IdNr. whose 9th digit does not match. The docstring is explicit that this is a structural check, not a VIES existence confirmation. Verified against python-stdnum test vectors DE136695976 and DE129273398. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Neither recognizer had any structural check digit validation despite both document types following ICAO Doc 9303 Part 3 (Machine Readable Travel Documents). The 9th character in the document number is the ICAO check digit over the first 8 characters: weights 7, 3, 1 repeating, letters mapped A=10 … Z=35, sum modulo 10. For DE_ID_CARD the strict nPA pattern also tightened the last-character class from [CFGHJKLMNPRTVWXYZ0-9] to [0-9] — the 10th (here 9th, since the overall MRZ form is different) character is always the numeric ICAO check digit, never a letter. The overly-permissive "Relaxed" pattern is dropped to avoid overlap. Legacy pre-2010 Personalausweis numbers (format "T" + 8 digits) predate ICAO and carry no check digit; validate_result returns None for these so they remain matchable at pattern confidence without checksum. Test fixtures generated with the ICAO algorithm directly (worked examples: C01234565, F12345671, L01X00T44, CZ6311T03, G00000002). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

BSNR has no publicly documented Prüfziffer algorithm (confirmed against KBV Arztnummern-Richtlinie and HL7-Wiki), so structural validation is necessarily limited to the 2-digit KV Bereichskennzeichen at positions 1-2. A whitelist of known KV region codes per Arztnummern-Richtlinie Anlage 1 is now applied in validate_result: - True if the prefix matches a documented KV region (19 codes including the 17 standard KVs, KBV itself, and the Anlage 8 BMV-Ä hospital code 35). - None if the prefix is unknown — could still be valid historic/special cases, so the match is kept at pattern score rather than dropped. - False for clearly malformed input (wrong length, non-digit, all-zero). This cuts the false-positive rate on generic 9-digit sequences without rejecting legitimate BSNRs that sit outside the whitelist. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous check only rejected Steuer-IdNrs whose first 10 digits were all identical (a case the pre-2016 rule forbade). The current BZSt rule, in force since the 2016 format revision, is stricter: no digit may appear more than three times in positions 1-10. The new check replaces `len(set(digits[:10])) == 1` with `max(Counter(digits[:10]).values()) > 3`, correctly rejecting 11112345678 (four 1s) and 12222234567 (five 2s). Checksum algorithm (ISO 7064 Mod 11,10) is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The recipe README sample text used identifiers that were generated against the buggy algorithms (A123456787, 123456901, DE123456789, C01X00T47) and therefore would not pass the corrected validators. Replaced with identifiers that satisfy the corrected specifications: KVNR A123456780 (§ 290 SGB V Anlage 1) Steuer-IdNr 86095742719 (unchanged — was already correct) LANR 123456601 (KBV weights [4,9,4,9,4,9]) BSNR 021234568 (unchanged — KV prefix 02 is whitelisted) USt-IdNr DE136695976 (python-stdnum test vector) Reisepass C01234565 (ICAO check computed) CHANGELOG lists all seven DE-recognizer corrections under [unreleased] → Analyzer → Fixed/Added so the upstream review sees the full scope at a glance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e KVNR sample 1) default_recognizers.yaml was missing DeLanrRecognizer, DeBsnrRecognizer, DeVatIdRecognizer, and DeFuehrerscheinRecognizer — only 9 of the 13 German recognizers were actually wired into the registry. Added the four missing entries inside the existing DE block, same style (supported_languages: [de], type: predefined, enabled: false). 2) Recipe sample text referenced KVNR A123456780 which fails the KVNR checksum on current main (that value only validates under the corrected § 290 SGB V Anlage 1 algorithm from PR #1974). Reverted to A123456787 which matches the current test fixture, so the sample actually validates with the main-branch recognizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…age 1 The previous implementation applied weights [2,9,8,7,6,5,4,3,2,1] to the letter-expanded KVNR, which is not the algorithm defined in § 290 SGB V Anlage 1 (GKV-Spitzenverband, Stand 02.01.2023). The spec uses alternating factors [1,2,1,2,1,2,1,2,1,2] on the 10 effective digits (2-digit letter ordinal + 8 data digits), takes the Quersumme of any product ≥ 10, and checks sum mod 10 against the check digit at position 10. Verified against the two worked examples in Anlage 1: A000500015 → sum 5, PZ 5 ✓ C000500021 → sum 11, PZ 1 ✓ The previous "valid" fixtures (A123456787, M123456789, B123456787) were generated with the wrong algorithm and never satisfied the spec; they are replaced with the two Anlage 1 samples plus three recomputed fixtures (A123456780, B123456782, M123456785) and moved into the negative-case set. Closes #1972 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-review - de_social_security: "Ergänzungsmerkmal" was mis-associated in two places. The +50 rule on the birth day (Pos 3-4) is gender-agnostic — it disambiguates collisions between people sharing birth date, surname initial, and Bereichsnummer. The serial at Pos 10-11 is the Geschlechtskennung (00-49 male, 50-99 female), not an Ergänzungsmerkmal. - de_bsnr: the docstring listed "06 Nordrhein, 14 Berlin" as examples of KV regional codes — but those codes are NOT in the module's own VALID_KV_CODES whitelist (correct codes per KBV Arztnummern-Richtlinie Anlage 1 are 38 Nordrhein, 72 Berlin). Fictitious example BSNRs updated accordingly to use whitelisted prefixes. - de_passport: the worked example for C01X00T41 had arithmetically wrong products (listed 99 for X×7 and 29 for T×7; should be 231 and 203 respectively) and a truncated sum. The actual implementation is correct — only the docstring math was broken. Replaced with the real numbers: products 84+0+1+231+0+0+203+12 = 531 → 531 mod 10 = 1. No behavioral changes; docstring corrections only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Anchors C01X00T41 (the worked example in de_passport_recognizer.py's docstring) as a positive fixture in both the analyze() and validate_result() parametrize lists. Keeps the docstring math and the test suite in sync — any future drift in the worked example will now surface as a test failure rather than silently sitting in documentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ecognizers DeLanrRecognizer, DeBsnrRecognizer, DeVatIdRecognizer and DeFuehrerscheinRecognizer were imported in predefined_recognizers/ __init__.py but missing from conf/default_recognizers.yaml, which means users relying on the default registry never got them instantiated. The registry entry was added in a prior commit on this branch; CHANGELOG now surfaces the user-visible effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…orld normalization The BZSt does not publish the USt-IdNr Prüfziffer algorithm in any Merkblatt. The previous implementation applied ISO 7064 Mod 11,10 and returned False on mismatch — which Presidio treats as "drop the match" and is a silent false-negative risk if the community-consensus algorithm ever diverges from what BZSt actually uses for a given ID. Changes: - validate_result is now tri-state. Structural failure (wrong prefix, wrong length after normalisation, non-digit body) still returns False. Checksum pass returns True. Checksum fail returns None in the default mode — the match is kept at pattern score instead of being dropped. - Opt-in strict mode via strict_checksum=True constructor parameter for callers who prefer precision over recall (e.g. compliance-adjacent scans). - Real-world input normalisation: the second lenient pattern allows whitespace, dashes and dots between digit groups; validate_result strips them before the structural check. Handles "DE 123 456 789", "DE-123-456-789", "DE.123.456.789", lowercase and mixed forms commonly seen on invoices and Impressum pages. - The heuristic nature of the checksum is made explicit in the class docstring and inline so readers of the code see the rejection policy rationale immediately. No other DE checksums changed — KVNR (§ 290 SGB V Anlage 1), RVNR (VKVV § 4), LANR (KBV Arztnummern-Richtlinie), Steuer-IdNr (BZSt Merkblatt) and the ICAO Doc 9303 MRZ checks are all officially documented and continue to reject on mismatch. For legally authoritative DE VAT existence verification, callers must still cross-check VIES — this recognizer performs detection, not compliance validation. Test suite extended from 25 to 56 DE_VAT_ID tests: default-mode tri-state coverage, strict-mode enforcement, and real-world formatting variants (space/dash/dot/mixed separators, lowercase). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MvdB · 2026-04-21T20:32:48Z

Follow-up after an external review round focused on enterprise-grade correctness, specifically for DE_VAT_ID where the BZSt does not publish the Prüfziffer algorithm.

Problem: the previous validate_result returned False on checksum mismatch, which Presidio treats as "drop the match". Given that the ISO 7064 Mod 11,10 we run is community consensus (python-stdnum, VIES-adjacent) rather than a BZSt-normative rule, a real USt-IdNr that happens to diverge from the heuristic was silently lost.

Changes in 072e8b8:

validate_result is now tri-state. Structural failures (wrong prefix, wrong length after normalisation, non-digit body) still return False. Checksum pass returns True. Checksum fail returns None in the default mode — the match keeps its pattern score instead of being dropped.
Opt-in strict mode via strict_checksum=True constructor parameter for callers who prefer precision over recall (e.g. compliance-adjacent scans).
Real-world input normalisation: a second lenient pattern allows whitespace, dashes and dots between digit groups; validate_result strips them before structural validation. Handles DE 123 456 789, DE-123-456-789, DE.123.456.789, lowercase and mixed forms.
The heuristic nature of the checksum is made explicit in the class docstring and inline so readers of the code see the rejection policy rationale immediately.

No other DE checksums changed. KVNR (§ 290 SGB V Anlage 1), RVNR (VKVV § 4), LANR (KBV Arztnummern-Richtlinie), Steuer-IdNr and the ICAO Doc 9303 MRZ checks all have officially documented algorithms and continue to reject on mismatch.

For legally authoritative DE VAT existence verification, callers must still cross-check VIES — this recognizer performs detection, not compliance validation. The class docstring says so explicitly.

Test suite extended from 25 to 56 DE_VAT_ID tests covering default-mode tri-state behaviour, strict-mode enforcement, and the real-world formatting variants. Full DE suite: 303 passing. Ruff clean.

Copilot

Pull request overview

Completes and corrects Prüfziffer (checksum) / structural validation for German (DE_*) pattern recognizers in Presidio Analyzer, aligning implementations and fixtures with primary sources and improving real-world robustness (e.g., MRZ/ICAO digits, VAT ID normalization), while also making previously unreachable recognizers available via the default registry.

Changes:

Corrected checksum algorithms for DE_HEALTH_INSURANCE, DE_SOCIAL_SECURITY, DE_LANR, and tightened DE_TAX_ID structural rules.
Added missing checksum/structural validation for DE_VAT_ID, DE_PASSPORT, DE_ID_CARD, and added KV-prefix whitelist validation for DE_BSNR.
Updated unit tests, registry configuration, changelog, and German-language recipe sample identifiers to match the corrected validators.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Show a summary per file

File	Description
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_health_insurance_recognizer.py	Fix KVNR checksum factors to match §290 SGB V Anlage 1.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_social_security_recognizer.py	Fix RVNR checksum weights to VKVV §4 and update documentation.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_lanr_recognizer.py	Fix LANR checksum algorithm per KBV Arztnummern-Richtlinie.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_tax_id_recognizer.py	Enforce post-2016 BZSt repetition constraint for DE_TAX_ID.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_vat_id_recognizer.py	Add VAT ID normalization + ISO 7064 Mod 11,10 validation with heuristic/strict mode.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_passport_recognizer.py	Add ICAO Doc 9303 check digit validation for DE passports.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_id_card_recognizer.py	Tighten nPA regex and add ICAO check digit validation; keep legacy T-format as non-validated.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_bsnr_recognizer.py	Add KV-prefix whitelist-based structural validation for BSNR.
presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml	Register additional German recognizers in the default registry (disabled by default).
presidio-analyzer/tests/test_de_health_insurance_recognizer.py	Update KVNR fixtures and add edge-case coverage for corrected checksum.
presidio-analyzer/tests/test_de_social_security_recognizer.py	Replace fixtures with VKVV §4-aligned examples and negative cases.
presidio-analyzer/tests/test_de_lanr_recognizer.py	Replace fixtures to match corrected LANR checksum.
presidio-analyzer/tests/test_de_tax_id_recognizer.py	Add tests for the post-2016 repetition rule.
presidio-analyzer/tests/test_de_vat_id_recognizer.py	Add heuristic-vs-strict behavior tests plus normalization variants and span assertions.
presidio-analyzer/tests/test_de_passport_recognizer.py	Add ICAO checksum validation tests plus span/score assertions.
presidio-analyzer/tests/test_de_id_card_recognizer.py	Update fixtures for ICAO-valid nPA and legacy acceptance; add validate_result assertions.
presidio-analyzer/tests/test_de_bsnr_recognizer.py	Update fixtures for KV-prefix whitelist behavior and validate_result assertions.
docs/recipes/german-language-support/README.md	Update sample identifiers to values which pass corrected validators.
CHANGELOG.md	Document fixes/additions under `[unreleased]` for the German recognizers.

…k digit Addresses Copilot review comment on PR #1990: the docstring previously said the letter-ordinal expansion yields an "effective 13-digit string" while the implementation correctly builds a 12-digit `effective` buffer and applies 12 weights. Reworded to make clear that the 12 data digits (positions 1–8 + 2 letter-ordinal digits + positions 10–11) are what gets weighted, and the check digit at position 12 stands apart. No behavioural change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses Copilot review comments on PR #1990 (validator + test assertions). The previous validate_result returned True for any 9-digit input whose 2-digit prefix was in VALID_KV_CODES, which Presidio upgrades to MAX_SCORE (1.0). On a base pattern as broad as \b\d{9}\b this is too strong — a random 9-digit sequence that happens to start with one of the 19 whitelisted KV codes would match at max confidence without any context. New behaviour: - structurally invalid (wrong length, non-digit, all-zero) → False (drop) - everything else → None (keep pattern score 0.2, let the ContextAwareEnhancer use "Betriebsstättennummer"/"BSNR"/"Praxis"/… to drive final confidence) VALID_KV_CODES is retained as documentation of the expected KV range per KBV Arztnummern-Richtlinie Anlage 1; it just no longer acts as a confidence gate. Tests regenerated to parametrize expected_score per case instead of relying on a blanket max_score fixture — the previous test file declared expected_positions in its signature but never actually asserted spans or scores, so scoring regressions could not be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…etters Addresses Copilot review comment on PR #1990. The relaxed pattern \b[A-Z][A-Z0-9]{7}[0-9]\b accepted A, B, D, E, I, O, Q, S, U as first letter — all excluded from the ICAO Doc 9303 travel-document charset. validate_result would then still compute a MRZ-style check digit for these characters (A=10, B=11, …), so a non-German string whose check digit happened to match could be upgraded to MAX_SCORE. The strict pattern already covers every legitimate German passport number, so the relaxed pattern was pure false-positive surface without recall benefit. Additionally, validate_result now rejects ICAO-forbidden letters explicitly as a defense-in-depth guard against any future pattern regression. Two negative test cases added for coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Addresses Copilot review comment on PR #1990. The parametrized test previously accepted expected_positions and max_score in its signature but only asserted len(results), so scoring regressions (ICAO-valid nPA should score MAX vs legacy T-format at pattern score 0.5) could not be caught. Parametrize now carries an explicit expected_score per case: 1.0 for ICAO-validated nPA entries 0.5 for legacy "T + 8 digits" entries (pattern score, no checksum) None for dropped matches (no assertion runs) Full assert_result call is now executed in the loop so span and score are both verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Minor follow-up to 9051ffb — the updated docstring contains escaped regex characters (\b\d{9}\b), which ruff flags as D301 unless the docstring is marked raw. Switch to r""" and use the literal form, no content change.

MvdB · 2026-04-22T06:41:04Z

Addressed all five Copilot comments — commit mapping for the record:

Comment 1 (test_de_id_card_recognizer.py): 31341e7 — added per-case expected_score to the parametrize and now actually calls assert_result() in the loop. nPA with valid ICAO → MAX_SCORE; legacy "T + 8 digits" → pattern score 0.5.
Comment 2 (test_de_bsnr_recognizer.py): 9051ffb — same parametrize shape; matches the new BSNR scoring behaviour from comment 4.
Comment 3 (de_social_security_recognizer.py docstring): 2854583 — clarified that the letter expansion yields 12 data digits weighted, separate from the check digit at position 12.
Comment 4 (de_bsnr_recognizer.py): 9051ffb — validate_result no longer promotes whitelisted-prefix matches to MAX_SCORE. On a \b\d{9}\b base pattern that upgrade was too strong for a 2-digit-prefix check. Returns None for every structurally-valid input; context words drive final confidence. VALID_KV_CODES is retained as documentation of the expected KV range per KBV Arztnummern-Richtlinie Anlage 1.
Comment 5 (de_passport_recognizer.py): def5267 — removed the relaxed \b[A-Z][A-Z0-9]{7}[0-9]\b pattern entirely (the strict ICAO charset covers every legitimate German passport) and added a defense-in-depth check in validate_result that rejects ICAO-forbidden letters (A, B, D, E, I, O, Q, S, U) outright. Two negative tests added.

Plus 7f9526b — minor ruff D301 follow-up after the BSNR docstring update (raw docstring since it contains \b\d{9}\b).

Full DE suite: 306 passing. Ruff clean.

Copilot

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

…ntation-only Fixes Copilot review comment #3126014708. The original wording claimed a defense-in-depth check but validate_result does not reject unknown prefixes (that was intentional per commit 9051ffb). Clarify accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes Copilot review comment #3126014784. The relaxed pattern `\b\d{8}[A-Z]\d{3}\b` matches strings with impossible birth dates; without a date check, such a match could pass the VKVV checksum by chance and get promoted to MAX_SCORE. Enforce day (01-31 or 51-81 with Ergaenzungsmerkmal) and month (01-12) structurally before the checksum to close this false-positive vector. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

MvdB · 2026-04-22T18:25:57Z

Second Copilot review round addressed — commit mapping:

CHANGELOG.md (#3126014708): ad16bca — reworded the DeBsnrRecognizer entry to match the actual behaviour. The KV regional-code list is retained for documentation / defense-in-depth purposes per KBV Arztnummern-Richtlinie Anlage 1, but validate_result does not reject unknown prefixes (9051ffb from the previous round made that deliberate — BSNR has no public checksum, so a 2-digit-prefix drop on a \b\d{9}\b base pattern would be too aggressive).
de_social_security_recognizer.py (#3126014784): 6a9ba8b — validate_result now enforces the VKVV § 4 structural invariants on the birth-day (01–31, or 51–81 with Ergänzungsmerkmal) and month (01–12) before running the checksum. Previously, the relaxed \b\d{8}[A-Z]\d{3}\b pattern could surface strings with impossible dates and, on a ~10% random chance, have them pass the checksum and be promoted to MAX_SCORE. The date check closes that false-positive vector. Four explicit impossible-date cases added to the validate_result parametrize.

Full DE suite: 310 passing. Ruff clean.

SharonHart

Amazing contribution, thanks @MvdB!

MvdB and others added 9 commits April 19, 2026 20:38

MvdB mentioned this pull request Apr 19, 2026

Fix/kvnr checksum 1972 #1974

Closed

4 tasks

github-actions Bot added the external label Apr 19, 2026

MvdB mentioned this pull request Apr 19, 2026

Bug: DE_HEALTH_INSURANCE (KVNR) checksum rejects valid values (factor sequence mismatch vs §290 SGB V Anlage 1) #1972

Closed

SharonHart and others added 6 commits April 20, 2026 08:46

Merge branch 'main' into fix/de-recognizers-spec-compliance

c02d257

Merge branch 'main' into fix/de-recognizers-spec-compliance

00acdb4

SharonHart requested a review from Copilot April 22, 2026 05:00

Copilot started reviewing on behalf of SharonHart April 22, 2026 05:01 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

MvdB and others added 5 commits April 22, 2026 08:20

style(de_bsnr): use raw docstring to silence ruff D301

7f9526b

Minor follow-up to 9051ffb — the updated docstring contains escaped regex characters (\b\d{9}\b), which ruff flags as D301 unless the docstring is marked raw. Switch to r""" and use the literal form, no content change.

Merge branch 'main' into fix/de-recognizers-spec-compliance

e71ed5d

omri374 requested a review from Copilot April 22, 2026 18:07

Copilot started reviewing on behalf of omri374 April 22, 2026 18:07 View session

Copilot AI reviewed Apr 22, 2026

View reviewed changes

Comment thread CHANGELOG.md Outdated

Comment thread ...io_analyzer/predefined_recognizers/country_specific/germany/de_social_security_recognizer.py

MvdB and others added 2 commits April 22, 2026 20:24

SharonHart previously approved these changes Apr 23, 2026

View reviewed changes

Merge branch 'main' into fix/de-recognizers-spec-compliance

83a7ed8

SharonHart dismissed their stale review via 83a7ed8 April 23, 2026 11:27

SharonHart approved these changes Apr 23, 2026

View reviewed changes

omri374 approved these changes Apr 23, 2026

View reviewed changes

omri374 merged commit 98fb8de into microsoft:main Apr 23, 2026
34 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(de-recognizers): complete Prüfziffer validation per primary sources (fixes #1972, supersedes #1974)#1990

refactor(de-recognizers): complete Prüfziffer validation per primary sources (fixes #1972, supersedes #1974)#1990
omri374 merged 24 commits into
microsoft:mainfrom
MvdB:fix/de-recognizers-spec-compliance

MvdB commented Apr 19, 2026

Uh oh!

MvdB commented Apr 21, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MvdB commented Apr 22, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

Uh oh!

MvdB commented Apr 22, 2026 •

edited

Loading

Uh oh!

SharonHart left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

MvdB commented Apr 19, 2026

Change Description

Corrected checksum algorithms

Added missing structural checksums

Minor

Issue reference

Checklist

Uh oh!

MvdB commented Apr 21, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

MvdB commented Apr 22, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Uh oh!

MvdB commented Apr 22, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

SharonHart left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

MvdB commented Apr 22, 2026 •

edited

Loading