refactor(de-recognizers): complete Prüfziffer validation per primary sources (fixes #1972, supersedes #1974)#1990
Conversation
The previous implementation applied weights [2,1,2,1,2,1,2,1,2,1,2,1], which is a plain Luhn-like pattern and does not match VKVV § 4 (Verordnung über die Vergabe der Versicherungsnummer). The correct weights per VKVV § 4 and DRV technical documentation are [2,1,2,5,7,1,2,1,2,1,2,1], with a cross-sum (Quersumme) applied to every product and the final check digit = sum mod 10. Verified against the canonical DRV example 15070649C103: letter C=03, effective 150706490310, products 2,5,0,35,0,6,8,9,0,3,2,0, cross-sums 2,5,0,8,0,6,8,9,0,3,2,0 = 43, 43 mod 10 = 3 matches the check digit. Test fixtures regenerated from the corrected algorithm. The previous "valid" fixture 65070803A018 was self-consistent with the wrong weights only; under VKVV § 4 the same prefix yields check digit 9, so the fixture is replaced with 65070803A019 and the incorrect original is moved to the negative-case set. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementation had three compounding errors:
- Wrong weights [4,9,2,10,5,3] instead of alternating [4,9,4,9,4,9]
- Spurious Quersumme step on products > 9 that the spec does not define
- Wrong formula (total % 10) instead of (10 - total % 10) % 10
Per the KBV Arztnummern-Richtlinie ("Die ersten sechs Ziffern werden von
links beginnend abwechselnd mit 4 und 9 multipliziert. Die Summe dieser
Produkte wird durch Modulo 10 dividiert. Die Differenz dieser Zahl zu
10 ergibt die Prüfziffer (ist die Differenz 10, so ist die Prüfziffer
0)."), the correct algorithm uses alternating weights 4 and 9, no
cross-sum, and the complement-to-10 formula.
Verified against the KBV / Wikipedia canonical example: physician
digits 123456 produce products 4, 18, 12, 36, 20, 54 with sum 144;
(10 − 144 mod 10) mod 10 = 6, giving LANR 123456601.
The three previous "valid" fixtures (123456901, 234567601, 100000401)
were all self-consistent with the buggy code only and would be rejected
by a KBV-conformant validator; they are replaced with the recomputed
123456601, 234567701, 100000601 and moved to the negative-case set.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous recognizer was pure-regex and would match any DE + 9 digits sequence. The 9th digit of a German USt-IdNr. is in practice a check digit derived via ISO 7064 Mod 11,10 — identical to the algorithm used for the Steuer-IdNr. While BZSt does not publish this in an official Merkblatt, it is the algorithm used by every community implementation (python-stdnum, VIES-adjacent validators). Validating it here cuts the false-positive rate substantially at the cost of rejecting the rare theoretical USt-IdNr. whose 9th digit does not match. The docstring is explicit that this is a structural check, not a VIES existence confirmation. Verified against python-stdnum test vectors DE136695976 and DE129273398. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Neither recognizer had any structural check digit validation despite both document types following ICAO Doc 9303 Part 3 (Machine Readable Travel Documents). The 9th character in the document number is the ICAO check digit over the first 8 characters: weights 7, 3, 1 repeating, letters mapped A=10 … Z=35, sum modulo 10. For DE_ID_CARD the strict nPA pattern also tightened the last-character class from [CFGHJKLMNPRTVWXYZ0-9] to [0-9] — the 10th (here 9th, since the overall MRZ form is different) character is always the numeric ICAO check digit, never a letter. The overly-permissive "Relaxed" pattern is dropped to avoid overlap. Legacy pre-2010 Personalausweis numbers (format "T" + 8 digits) predate ICAO and carry no check digit; validate_result returns None for these so they remain matchable at pattern confidence without checksum. Test fixtures generated with the ICAO algorithm directly (worked examples: C01234565, F12345671, L01X00T44, CZ6311T03, G00000002). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BSNR has no publicly documented Prüfziffer algorithm (confirmed against KBV Arztnummern-Richtlinie and HL7-Wiki), so structural validation is necessarily limited to the 2-digit KV Bereichskennzeichen at positions 1-2. A whitelist of known KV region codes per Arztnummern-Richtlinie Anlage 1 is now applied in validate_result: - True if the prefix matches a documented KV region (19 codes including the 17 standard KVs, KBV itself, and the Anlage 8 BMV-Ä hospital code 35). - None if the prefix is unknown — could still be valid historic/special cases, so the match is kept at pattern score rather than dropped. - False for clearly malformed input (wrong length, non-digit, all-zero). This cuts the false-positive rate on generic 9-digit sequences without rejecting legitimate BSNRs that sit outside the whitelist. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous check only rejected Steuer-IdNrs whose first 10 digits were all identical (a case the pre-2016 rule forbade). The current BZSt rule, in force since the 2016 format revision, is stricter: no digit may appear more than three times in positions 1-10. The new check replaces `len(set(digits[:10])) == 1` with `max(Counter(digits[:10]).values()) > 3`, correctly rejecting 11112345678 (four 1s) and 12222234567 (five 2s). Checksum algorithm (ISO 7064 Mod 11,10) is unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe README sample text used identifiers that were generated against the buggy algorithms (A123456787, 123456901, DE123456789, C01X00T47) and therefore would not pass the corrected validators. Replaced with identifiers that satisfy the corrected specifications: KVNR A123456780 (§ 290 SGB V Anlage 1) Steuer-IdNr 86095742719 (unchanged — was already correct) LANR 123456601 (KBV weights [4,9,4,9,4,9]) BSNR 021234568 (unchanged — KV prefix 02 is whitelisted) USt-IdNr DE136695976 (python-stdnum test vector) Reisepass C01234565 (ICAO check computed) CHANGELOG lists all seven DE-recognizer corrections under [unreleased] → Analyzer → Fixed/Added so the upstream review sees the full scope at a glance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e KVNR sample 1) default_recognizers.yaml was missing DeLanrRecognizer, DeBsnrRecognizer, DeVatIdRecognizer, and DeFuehrerscheinRecognizer — only 9 of the 13 German recognizers were actually wired into the registry. Added the four missing entries inside the existing DE block, same style (supported_languages: [de], type: predefined, enabled: false). 2) Recipe sample text referenced KVNR A123456780 which fails the KVNR checksum on current main (that value only validates under the corrected § 290 SGB V Anlage 1 algorithm from PR #1974). Reverted to A123456787 which matches the current test fixture, so the sample actually validates with the main-branch recognizer. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…age 1 The previous implementation applied weights [2,9,8,7,6,5,4,3,2,1] to the letter-expanded KVNR, which is not the algorithm defined in § 290 SGB V Anlage 1 (GKV-Spitzenverband, Stand 02.01.2023). The spec uses alternating factors [1,2,1,2,1,2,1,2,1,2] on the 10 effective digits (2-digit letter ordinal + 8 data digits), takes the Quersumme of any product ≥ 10, and checks sum mod 10 against the check digit at position 10. Verified against the two worked examples in Anlage 1: A000500015 → sum 5, PZ 5 ✓ C000500021 → sum 11, PZ 1 ✓ The previous "valid" fixtures (A123456787, M123456789, B123456787) were generated with the wrong algorithm and never satisfied the spec; they are replaced with the two Anlage 1 samples plus three recomputed fixtures (A123456780, B123456782, M123456785) and moved into the negative-case set. Closes #1972 Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-review - de_social_security: "Ergänzungsmerkmal" was mis-associated in two places. The +50 rule on the birth day (Pos 3-4) is gender-agnostic — it disambiguates collisions between people sharing birth date, surname initial, and Bereichsnummer. The serial at Pos 10-11 is the Geschlechtskennung (00-49 male, 50-99 female), not an Ergänzungsmerkmal. - de_bsnr: the docstring listed "06 Nordrhein, 14 Berlin" as examples of KV regional codes — but those codes are NOT in the module's own VALID_KV_CODES whitelist (correct codes per KBV Arztnummern-Richtlinie Anlage 1 are 38 Nordrhein, 72 Berlin). Fictitious example BSNRs updated accordingly to use whitelisted prefixes. - de_passport: the worked example for C01X00T41 had arithmetically wrong products (listed 99 for X×7 and 29 for T×7; should be 231 and 203 respectively) and a truncated sum. The actual implementation is correct — only the docstring math was broken. Replaced with the real numbers: products 84+0+1+231+0+0+203+12 = 531 → 531 mod 10 = 1. No behavioral changes; docstring corrections only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anchors C01X00T41 (the worked example in de_passport_recognizer.py's docstring) as a positive fixture in both the analyze() and validate_result() parametrize lists. Keeps the docstring math and the test suite in sync — any future drift in the worked example will now surface as a test failure rather than silently sitting in documentation. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecognizers DeLanrRecognizer, DeBsnrRecognizer, DeVatIdRecognizer and DeFuehrerscheinRecognizer were imported in predefined_recognizers/ __init__.py but missing from conf/default_recognizers.yaml, which means users relying on the default registry never got them instantiated. The registry entry was added in a prior commit on this branch; CHANGELOG now surfaces the user-visible effect. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orld normalization The BZSt does not publish the USt-IdNr Prüfziffer algorithm in any Merkblatt. The previous implementation applied ISO 7064 Mod 11,10 and returned False on mismatch — which Presidio treats as "drop the match" and is a silent false-negative risk if the community-consensus algorithm ever diverges from what BZSt actually uses for a given ID. Changes: - validate_result is now tri-state. Structural failure (wrong prefix, wrong length after normalisation, non-digit body) still returns False. Checksum pass returns True. Checksum fail returns None in the default mode — the match is kept at pattern score instead of being dropped. - Opt-in strict mode via strict_checksum=True constructor parameter for callers who prefer precision over recall (e.g. compliance-adjacent scans). - Real-world input normalisation: the second lenient pattern allows whitespace, dashes and dots between digit groups; validate_result strips them before the structural check. Handles "DE 123 456 789", "DE-123-456-789", "DE.123.456.789", lowercase and mixed forms commonly seen on invoices and Impressum pages. - The heuristic nature of the checksum is made explicit in the class docstring and inline so readers of the code see the rejection policy rationale immediately. No other DE checksums changed — KVNR (§ 290 SGB V Anlage 1), RVNR (VKVV § 4), LANR (KBV Arztnummern-Richtlinie), Steuer-IdNr (BZSt Merkblatt) and the ICAO Doc 9303 MRZ checks are all officially documented and continue to reject on mismatch. For legally authoritative DE VAT existence verification, callers must still cross-check VIES — this recognizer performs detection, not compliance validation. Test suite extended from 25 to 56 DE_VAT_ID tests: default-mode tri-state coverage, strict-mode enforcement, and real-world formatting variants (space/dash/dot/mixed separators, lowercase). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Follow-up after an external review round focused on enterprise-grade correctness, specifically for Problem: the previous Changes in
No other DE checksums changed. KVNR (§ 290 SGB V Anlage 1), RVNR (VKVV § 4), LANR (KBV Arztnummern-Richtlinie), Steuer-IdNr and the ICAO Doc 9303 MRZ checks all have officially documented algorithms and continue to reject on mismatch. For legally authoritative DE VAT existence verification, callers must still cross-check VIES — this recognizer performs detection, not compliance validation. The class docstring says so explicitly. Test suite extended from 25 to 56 |
There was a problem hiding this comment.
Pull request overview
Completes and corrects Prüfziffer (checksum) / structural validation for German (DE_*) pattern recognizers in Presidio Analyzer, aligning implementations and fixtures with primary sources and improving real-world robustness (e.g., MRZ/ICAO digits, VAT ID normalization), while also making previously unreachable recognizers available via the default registry.
Changes:
- Corrected checksum algorithms for
DE_HEALTH_INSURANCE,DE_SOCIAL_SECURITY,DE_LANR, and tightenedDE_TAX_IDstructural rules. - Added missing checksum/structural validation for
DE_VAT_ID,DE_PASSPORT,DE_ID_CARD, and added KV-prefix whitelist validation forDE_BSNR. - Updated unit tests, registry configuration, changelog, and German-language recipe sample identifiers to match the corrected validators.
Reviewed changes
Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_health_insurance_recognizer.py | Fix KVNR checksum factors to match §290 SGB V Anlage 1. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_social_security_recognizer.py | Fix RVNR checksum weights to VKVV §4 and update documentation. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_lanr_recognizer.py | Fix LANR checksum algorithm per KBV Arztnummern-Richtlinie. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_tax_id_recognizer.py | Enforce post-2016 BZSt repetition constraint for DE_TAX_ID. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_vat_id_recognizer.py | Add VAT ID normalization + ISO 7064 Mod 11,10 validation with heuristic/strict mode. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_passport_recognizer.py | Add ICAO Doc 9303 check digit validation for DE passports. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_id_card_recognizer.py | Tighten nPA regex and add ICAO check digit validation; keep legacy T-format as non-validated. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_bsnr_recognizer.py | Add KV-prefix whitelist-based structural validation for BSNR. |
| presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml | Register additional German recognizers in the default registry (disabled by default). |
| presidio-analyzer/tests/test_de_health_insurance_recognizer.py | Update KVNR fixtures and add edge-case coverage for corrected checksum. |
| presidio-analyzer/tests/test_de_social_security_recognizer.py | Replace fixtures with VKVV §4-aligned examples and negative cases. |
| presidio-analyzer/tests/test_de_lanr_recognizer.py | Replace fixtures to match corrected LANR checksum. |
| presidio-analyzer/tests/test_de_tax_id_recognizer.py | Add tests for the post-2016 repetition rule. |
| presidio-analyzer/tests/test_de_vat_id_recognizer.py | Add heuristic-vs-strict behavior tests plus normalization variants and span assertions. |
| presidio-analyzer/tests/test_de_passport_recognizer.py | Add ICAO checksum validation tests plus span/score assertions. |
| presidio-analyzer/tests/test_de_id_card_recognizer.py | Update fixtures for ICAO-valid nPA and legacy acceptance; add validate_result assertions. |
| presidio-analyzer/tests/test_de_bsnr_recognizer.py | Update fixtures for KV-prefix whitelist behavior and validate_result assertions. |
| docs/recipes/german-language-support/README.md | Update sample identifiers to values which pass corrected validators. |
| CHANGELOG.md | Document fixes/additions under [unreleased] for the German recognizers. |
…k digit Addresses Copilot review comment on PR #1990: the docstring previously said the letter-ordinal expansion yields an "effective 13-digit string" while the implementation correctly builds a 12-digit `effective` buffer and applies 12 weights. Reworded to make clear that the 12 data digits (positions 1–8 + 2 letter-ordinal digits + positions 10–11) are what gets weighted, and the check digit at position 12 stands apart. No behavioural change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses Copilot review comments on PR #1990 (validator + test assertions). The previous validate_result returned True for any 9-digit input whose 2-digit prefix was in VALID_KV_CODES, which Presidio upgrades to MAX_SCORE (1.0). On a base pattern as broad as \b\d{9}\b this is too strong — a random 9-digit sequence that happens to start with one of the 19 whitelisted KV codes would match at max confidence without any context. New behaviour: - structurally invalid (wrong length, non-digit, all-zero) → False (drop) - everything else → None (keep pattern score 0.2, let the ContextAwareEnhancer use "Betriebsstättennummer"/"BSNR"/"Praxis"/… to drive final confidence) VALID_KV_CODES is retained as documentation of the expected KV range per KBV Arztnummern-Richtlinie Anlage 1; it just no longer acts as a confidence gate. Tests regenerated to parametrize expected_score per case instead of relying on a blanket max_score fixture — the previous test file declared expected_positions in its signature but never actually asserted spans or scores, so scoring regressions could not be caught. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etters Addresses Copilot review comment on PR #1990. The relaxed pattern \b[A-Z][A-Z0-9]{7}[0-9]\b accepted A, B, D, E, I, O, Q, S, U as first letter — all excluded from the ICAO Doc 9303 travel-document charset. validate_result would then still compute a MRZ-style check digit for these characters (A=10, B=11, …), so a non-German string whose check digit happened to match could be upgraded to MAX_SCORE. The strict pattern already covers every legitimate German passport number, so the relaxed pattern was pure false-positive surface without recall benefit. Additionally, validate_result now rejects ICAO-forbidden letters explicitly as a defense-in-depth guard against any future pattern regression. Two negative test cases added for coverage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses Copilot review comment on PR #1990. The parametrized test previously accepted expected_positions and max_score in its signature but only asserted len(results), so scoring regressions (ICAO-valid nPA should score MAX vs legacy T-format at pattern score 0.5) could not be caught. Parametrize now carries an explicit expected_score per case: 1.0 for ICAO-validated nPA entries 0.5 for legacy "T + 8 digits" entries (pattern score, no checksum) None for dropped matches (no assertion runs) Full assert_result call is now executed in the loop so span and score are both verified. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor follow-up to 9051ffb — the updated docstring contains escaped regex characters (\b\d{9}\b), which ruff flags as D301 unless the docstring is marked raw. Switch to r""" and use the literal form, no content change.
|
Addressed all five Copilot comments — commit mapping for the record:
Plus Full DE suite: 306 passing. Ruff clean. |
…ntation-only Fixes Copilot review comment #3126014708. The original wording claimed a defense-in-depth check but validate_result does not reject unknown prefixes (that was intentional per commit 9051ffb). Clarify accordingly. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes Copilot review comment #3126014784. The relaxed pattern
`\b\d{8}[A-Z]\d{3}\b` matches strings with impossible birth dates;
without a date check, such a match could pass the VKVV checksum by
chance and get promoted to MAX_SCORE. Enforce day (01-31 or 51-81
with Ergaenzungsmerkmal) and month (01-12) structurally before the
checksum to close this false-positive vector.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Second Copilot review round addressed — commit mapping:
Full DE suite: 310 passing. Ruff clean. |
SharonHart
left a comment
There was a problem hiding this comment.
Amazing contribution, thanks @MvdB!
Change Description
Extends the pattern-focused work of #1909 by completing the Prüfziffer validation layer of the German recognizers. #1909 brought in all 13 DE recognizers with correct regex patterns and context words, but the checksum / structural-validation side was not comprehensively handled: several algorithms were implemented incorrectly and several were missing entirely. Issue #1972 was the visible symptom of this broader gap. During the follow-up audit I verified each recognizer against its primary source and corrected or added the missing Prüfziffer logic; each new test fixture is derived from a primary-source worked example to prevent the spec-vs-implementation drift from silently recurring.
Corrected checksum algorithms
DE_HEALTH_INSURANCE— factors per § 290 SGB V Anlage 1 ([1,2,1,2,…], Quersumme on products ≥ 10, sum mod 10). Verified against Anlage 1 worked examplesA000500015andC000500021.DE_SOCIAL_SECURITY(RVNR) — weights per VKVV § 4 ([2,1,2,5,7,1,2,1,2,1,2,1]). Verified against DRV canonical example15070649C103.DE_LANR— weights per KBV Arztnummern-Richtlinie ([4,9,4,9,4,9]), without the spurious Quersumme step, using the complement-to-10 formula(10 − sum mod 10) mod 10. Verified against physician digits123456 → 6.Added missing structural checksums
DE_VAT_ID— ISO 7064 Mod 11,10, identical toDE_TAX_ID. Widely used by community validators (python-stdnum, VIES-adjacent); BZSt does not publish it in a Merkblatt but it is the in-practice-authoritative algorithm.DE_PASSPORT,DE_ID_CARD— ICAO Doc 9303 MRZ check digit (weights 7, 3, 1 repeating; letters A=10 … Z=35; sum mod 10).DE_ID_CARDregex last character tightened to[0-9](the ICAO check digit is always numeric).DE_BSNR— no public Prüfziffer exists (confirmed against KBV Arztnummern-Richtlinie and HL7-Wiki), so added a KV-regional-code whitelist per Arztnummern-Richtlinie Anlage 1 as defense-in-depth.Minor
DE_TAX_ID— enforce the post-2016 BZSt repetition rule (max 3 repetitions in positions 1–10), replacing the lenient "all-10 must not be identical" check.DeLanrRecognizer,DeBsnrRecognizer,DeVatIdRecognizerandDeFuehrerscheinRecognizerindefault_recognizers.yaml— these four were imported in code but missing from the registry configuration, so they were unreachable via the default registry.docs/recipes/german-language-support/README.mdsample data updated to identifiers that pass the corrected validators.Issue reference
Fixes #1972. Supersedes #1974.
Checklist