Skip to content

refactor(de-recognizers): complete Prüfziffer validation per primary sources (fixes #1972, supersedes #1974)#1990

Merged
omri374 merged 24 commits into
microsoft:mainfrom
MvdB:fix/de-recognizers-spec-compliance
Apr 23, 2026
Merged

refactor(de-recognizers): complete Prüfziffer validation per primary sources (fixes #1972, supersedes #1974)#1990
omri374 merged 24 commits into
microsoft:mainfrom
MvdB:fix/de-recognizers-spec-compliance

Conversation

@MvdB
Copy link
Copy Markdown
Contributor

@MvdB MvdB commented Apr 19, 2026

Change Description

Extends the pattern-focused work of #1909 by completing the Prüfziffer validation layer of the German recognizers. #1909 brought in all 13 DE recognizers with correct regex patterns and context words, but the checksum / structural-validation side was not comprehensively handled: several algorithms were implemented incorrectly and several were missing entirely. Issue #1972 was the visible symptom of this broader gap. During the follow-up audit I verified each recognizer against its primary source and corrected or added the missing Prüfziffer logic; each new test fixture is derived from a primary-source worked example to prevent the spec-vs-implementation drift from silently recurring.

Corrected checksum algorithms

  • DE_HEALTH_INSURANCE — factors per § 290 SGB V Anlage 1 ([1,2,1,2,…], Quersumme on products ≥ 10, sum mod 10). Verified against Anlage 1 worked examples A000500015 and C000500021.
  • DE_SOCIAL_SECURITY (RVNR) — weights per VKVV § 4 ([2,1,2,5,7,1,2,1,2,1,2,1]). Verified against DRV canonical example 15070649C103.
  • DE_LANR — weights per KBV Arztnummern-Richtlinie ([4,9,4,9,4,9]), without the spurious Quersumme step, using the complement-to-10 formula (10 − sum mod 10) mod 10. Verified against physician digits 123456 → 6.

Added missing structural checksums

  • DE_VAT_ID — ISO 7064 Mod 11,10, identical to DE_TAX_ID. Widely used by community validators (python-stdnum, VIES-adjacent); BZSt does not publish it in a Merkblatt but it is the in-practice-authoritative algorithm.
  • DE_PASSPORT, DE_ID_CARD — ICAO Doc 9303 MRZ check digit (weights 7, 3, 1 repeating; letters A=10 … Z=35; sum mod 10).
  • DE_ID_CARD regex last character tightened to [0-9] (the ICAO check digit is always numeric).
  • DE_BSNR — no public Prüfziffer exists (confirmed against KBV Arztnummern-Richtlinie and HL7-Wiki), so added a KV-regional-code whitelist per Arztnummern-Richtlinie Anlage 1 as defense-in-depth.

Minor

  • DE_TAX_ID — enforce the post-2016 BZSt repetition rule (max 3 repetitions in positions 1–10), replacing the lenient "all-10 must not be identical" check.
  • Registered DeLanrRecognizer, DeBsnrRecognizer, DeVatIdRecognizer and DeFuehrerscheinRecognizer in default_recognizers.yaml — these four were imported in code but missing from the registry configuration, so they were unreachable via the default registry.
  • Recipe docs/recipes/german-language-support/README.md sample data updated to identifiers that pass the corrected validators.

Issue reference

Fixes #1972. Supersedes #1974.

Checklist

  • I have reviewed the contribution guidelines
  • I have signed the CLA (if required)
  • My code includes unit tests (each recognizer's fixtures are derived directly from a primary-source worked example cited in the test docstring)
  • All unit tests and lint checks pass locally (2239 passed, 12 skipped on the full analyzer suite; ruff clean)
  • My PR contains documentation updates / additions if required (CHANGELOG + recipe sample)

MvdB and others added 9 commits April 19, 2026 20:38
The previous implementation applied weights [2,1,2,1,2,1,2,1,2,1,2,1],
which is a plain Luhn-like pattern and does not match VKVV § 4
(Verordnung über die Vergabe der Versicherungsnummer). The correct
weights per VKVV § 4 and DRV technical documentation are
[2,1,2,5,7,1,2,1,2,1,2,1], with a cross-sum (Quersumme) applied to
every product and the final check digit = sum mod 10.

Verified against the canonical DRV example 15070649C103: letter C=03,
effective 150706490310, products 2,5,0,35,0,6,8,9,0,3,2,0, cross-sums
2,5,0,8,0,6,8,9,0,3,2,0 = 43, 43 mod 10 = 3 matches the check digit.

Test fixtures regenerated from the corrected algorithm. The previous
"valid" fixture 65070803A018 was self-consistent with the wrong weights
only; under VKVV § 4 the same prefix yields check digit 9, so the
fixture is replaced with 65070803A019 and the incorrect original is
moved to the negative-case set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous implementation had three compounding errors:
- Wrong weights [4,9,2,10,5,3] instead of alternating [4,9,4,9,4,9]
- Spurious Quersumme step on products > 9 that the spec does not define
- Wrong formula (total % 10) instead of (10 - total % 10) % 10

Per the KBV Arztnummern-Richtlinie ("Die ersten sechs Ziffern werden von
links beginnend abwechselnd mit 4 und 9 multipliziert. Die Summe dieser
Produkte wird durch Modulo 10 dividiert. Die Differenz dieser Zahl zu
10 ergibt die Prüfziffer (ist die Differenz 10, so ist die Prüfziffer
0)."), the correct algorithm uses alternating weights 4 and 9, no
cross-sum, and the complement-to-10 formula.

Verified against the KBV / Wikipedia canonical example: physician
digits 123456 produce products 4, 18, 12, 36, 20, 54 with sum 144;
(10 − 144 mod 10) mod 10 = 6, giving LANR 123456601.

The three previous "valid" fixtures (123456901, 234567601, 100000401)
were all self-consistent with the buggy code only and would be rejected
by a KBV-conformant validator; they are replaced with the recomputed
123456601, 234567701, 100000601 and moved to the negative-case set.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous recognizer was pure-regex and would match any DE + 9 digits
sequence. The 9th digit of a German USt-IdNr. is in practice a check
digit derived via ISO 7064 Mod 11,10 — identical to the algorithm used
for the Steuer-IdNr. While BZSt does not publish this in an official
Merkblatt, it is the algorithm used by every community implementation
(python-stdnum, VIES-adjacent validators).

Validating it here cuts the false-positive rate substantially at the
cost of rejecting the rare theoretical USt-IdNr. whose 9th digit does
not match. The docstring is explicit that this is a structural check,
not a VIES existence confirmation.

Verified against python-stdnum test vectors DE136695976 and DE129273398.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Neither recognizer had any structural check digit validation despite both
document types following ICAO Doc 9303 Part 3 (Machine Readable Travel
Documents). The 9th character in the document number is the ICAO check
digit over the first 8 characters: weights 7, 3, 1 repeating, letters
mapped A=10 … Z=35, sum modulo 10.

For DE_ID_CARD the strict nPA pattern also tightened the last-character
class from [CFGHJKLMNPRTVWXYZ0-9] to [0-9] — the 10th (here 9th, since
the overall MRZ form is different) character is always the numeric ICAO
check digit, never a letter. The overly-permissive "Relaxed" pattern is
dropped to avoid overlap.

Legacy pre-2010 Personalausweis numbers (format "T" + 8 digits) predate
ICAO and carry no check digit; validate_result returns None for these so
they remain matchable at pattern confidence without checksum.

Test fixtures generated with the ICAO algorithm directly (worked
examples: C01234565, F12345671, L01X00T44, CZ6311T03, G00000002).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
BSNR has no publicly documented Prüfziffer algorithm (confirmed against
KBV Arztnummern-Richtlinie and HL7-Wiki), so structural validation is
necessarily limited to the 2-digit KV Bereichskennzeichen at positions
1-2. A whitelist of known KV region codes per Arztnummern-Richtlinie
Anlage 1 is now applied in validate_result:

- True if the prefix matches a documented KV region (19 codes including
  the 17 standard KVs, KBV itself, and the Anlage 8 BMV-Ä hospital code
  35).
- None if the prefix is unknown — could still be valid historic/special
  cases, so the match is kept at pattern score rather than dropped.
- False for clearly malformed input (wrong length, non-digit, all-zero).

This cuts the false-positive rate on generic 9-digit sequences without
rejecting legitimate BSNRs that sit outside the whitelist.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous check only rejected Steuer-IdNrs whose first 10 digits were
all identical (a case the pre-2016 rule forbade). The current BZSt rule,
in force since the 2016 format revision, is stricter: no digit may
appear more than three times in positions 1-10. The new check replaces
`len(set(digits[:10])) == 1` with
`max(Counter(digits[:10]).values()) > 3`, correctly rejecting
11112345678 (four 1s) and 12222234567 (five 2s).

Checksum algorithm (ISO 7064 Mod 11,10) is unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The recipe README sample text used identifiers that were generated
against the buggy algorithms (A123456787, 123456901, DE123456789,
C01X00T47) and therefore would not pass the corrected validators.
Replaced with identifiers that satisfy the corrected specifications:

  KVNR       A123456780   (§ 290 SGB V Anlage 1)
  Steuer-IdNr 86095742719 (unchanged — was already correct)
  LANR       123456601    (KBV weights [4,9,4,9,4,9])
  BSNR       021234568    (unchanged — KV prefix 02 is whitelisted)
  USt-IdNr   DE136695976  (python-stdnum test vector)
  Reisepass  C01234565    (ICAO check computed)

CHANGELOG lists all seven DE-recognizer corrections under [unreleased]
→ Analyzer → Fixed/Added so the upstream review sees the full scope at
a glance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e KVNR sample

1) default_recognizers.yaml was missing DeLanrRecognizer, DeBsnrRecognizer,
   DeVatIdRecognizer, and DeFuehrerscheinRecognizer — only 9 of the 13
   German recognizers were actually wired into the registry. Added the
   four missing entries inside the existing DE block, same style
   (supported_languages: [de], type: predefined, enabled: false).

2) Recipe sample text referenced KVNR A123456780 which fails the KVNR
   checksum on current main (that value only validates under the
   corrected § 290 SGB V Anlage 1 algorithm from PR #1974). Reverted to
   A123456787 which matches the current test fixture, so the sample
   actually validates with the main-branch recognizer.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…age 1

The previous implementation applied weights [2,9,8,7,6,5,4,3,2,1] to the
letter-expanded KVNR, which is not the algorithm defined in § 290 SGB V
Anlage 1 (GKV-Spitzenverband, Stand 02.01.2023). The spec uses alternating
factors [1,2,1,2,1,2,1,2,1,2] on the 10 effective digits (2-digit letter
ordinal + 8 data digits), takes the Quersumme of any product ≥ 10, and
checks sum mod 10 against the check digit at position 10.

Verified against the two worked examples in Anlage 1:
  A000500015 → sum 5,  PZ 5  ✓
  C000500021 → sum 11, PZ 1  ✓

The previous "valid" fixtures (A123456787, M123456789, B123456787) were
generated with the wrong algorithm and never satisfied the spec; they are
replaced with the two Anlage 1 samples plus three recomputed fixtures
(A123456780, B123456782, M123456785) and moved into the negative-case set.

Closes #1972

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SharonHart and others added 6 commits April 20, 2026 08:46
…-review

- de_social_security: "Ergänzungsmerkmal" was mis-associated in two places.
  The +50 rule on the birth day (Pos 3-4) is gender-agnostic — it
  disambiguates collisions between people sharing birth date, surname
  initial, and Bereichsnummer. The serial at Pos 10-11 is the
  Geschlechtskennung (00-49 male, 50-99 female), not an Ergänzungsmerkmal.

- de_bsnr: the docstring listed "06 Nordrhein, 14 Berlin" as examples of
  KV regional codes — but those codes are NOT in the module's own
  VALID_KV_CODES whitelist (correct codes per KBV Arztnummern-Richtlinie
  Anlage 1 are 38 Nordrhein, 72 Berlin). Fictitious example BSNRs updated
  accordingly to use whitelisted prefixes.

- de_passport: the worked example for C01X00T41 had arithmetically wrong
  products (listed 99 for X×7 and 29 for T×7; should be 231 and 203
  respectively) and a truncated sum. The actual implementation is
  correct — only the docstring math was broken. Replaced with the real
  numbers: products 84+0+1+231+0+0+203+12 = 531 → 531 mod 10 = 1.

No behavioral changes; docstring corrections only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Anchors C01X00T41 (the worked example in de_passport_recognizer.py's
docstring) as a positive fixture in both the analyze() and
validate_result() parametrize lists. Keeps the docstring math and the
test suite in sync — any future drift in the worked example will now
surface as a test failure rather than silently sitting in documentation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ecognizers

DeLanrRecognizer, DeBsnrRecognizer, DeVatIdRecognizer and
DeFuehrerscheinRecognizer were imported in predefined_recognizers/
__init__.py but missing from conf/default_recognizers.yaml, which means
users relying on the default registry never got them instantiated. The
registry entry was added in a prior commit on this branch; CHANGELOG now
surfaces the user-visible effect.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…orld normalization

The BZSt does not publish the USt-IdNr Prüfziffer algorithm in any
Merkblatt. The previous implementation applied ISO 7064 Mod 11,10 and
returned False on mismatch — which Presidio treats as "drop the match"
and is a silent false-negative risk if the community-consensus algorithm
ever diverges from what BZSt actually uses for a given ID.

Changes:

- validate_result is now tri-state. Structural failure (wrong prefix,
  wrong length after normalisation, non-digit body) still returns False.
  Checksum pass returns True. Checksum fail returns None in the default
  mode — the match is kept at pattern score instead of being dropped.

- Opt-in strict mode via strict_checksum=True constructor parameter for
  callers who prefer precision over recall (e.g. compliance-adjacent
  scans).

- Real-world input normalisation: the second lenient pattern allows
  whitespace, dashes and dots between digit groups; validate_result
  strips them before the structural check. Handles "DE 123 456 789",
  "DE-123-456-789", "DE.123.456.789", lowercase and mixed forms commonly
  seen on invoices and Impressum pages.

- The heuristic nature of the checksum is made explicit in the class
  docstring and inline so readers of the code see the rejection policy
  rationale immediately.

No other DE checksums changed — KVNR (§ 290 SGB V Anlage 1), RVNR
(VKVV § 4), LANR (KBV Arztnummern-Richtlinie), Steuer-IdNr (BZSt
Merkblatt) and the ICAO Doc 9303 MRZ checks are all officially
documented and continue to reject on mismatch.

For legally authoritative DE VAT existence verification, callers must
still cross-check VIES — this recognizer performs detection, not
compliance validation.

Test suite extended from 25 to 56 DE_VAT_ID tests: default-mode
tri-state coverage, strict-mode enforcement, and real-world formatting
variants (space/dash/dot/mixed separators, lowercase).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MvdB
Copy link
Copy Markdown
Contributor Author

MvdB commented Apr 21, 2026

Follow-up after an external review round focused on enterprise-grade correctness, specifically for DE_VAT_ID where the BZSt does not publish the Prüfziffer algorithm.

Problem: the previous validate_result returned False on checksum mismatch, which Presidio treats as "drop the match". Given that the ISO 7064 Mod 11,10 we run is community consensus (python-stdnum, VIES-adjacent) rather than a BZSt-normative rule, a real USt-IdNr that happens to diverge from the heuristic was silently lost.

Changes in 072e8b8:

  1. validate_result is now tri-state. Structural failures (wrong prefix, wrong length after normalisation, non-digit body) still return False. Checksum pass returns True. Checksum fail returns None in the default mode — the match keeps its pattern score instead of being dropped.
  2. Opt-in strict mode via strict_checksum=True constructor parameter for callers who prefer precision over recall (e.g. compliance-adjacent scans).
  3. Real-world input normalisation: a second lenient pattern allows whitespace, dashes and dots between digit groups; validate_result strips them before structural validation. Handles DE 123 456 789, DE-123-456-789, DE.123.456.789, lowercase and mixed forms.
  4. The heuristic nature of the checksum is made explicit in the class docstring and inline so readers of the code see the rejection policy rationale immediately.

No other DE checksums changed. KVNR (§ 290 SGB V Anlage 1), RVNR (VKVV § 4), LANR (KBV Arztnummern-Richtlinie), Steuer-IdNr and the ICAO Doc 9303 MRZ checks all have officially documented algorithms and continue to reject on mismatch.

For legally authoritative DE VAT existence verification, callers must still cross-check VIES — this recognizer performs detection, not compliance validation. The class docstring says so explicitly.

Test suite extended from 25 to 56 DE_VAT_ID tests covering default-mode tri-state behaviour, strict-mode enforcement, and the real-world formatting variants. Full DE suite: 303 passing. Ruff clean.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Completes and corrects Prüfziffer (checksum) / structural validation for German (DE_*) pattern recognizers in Presidio Analyzer, aligning implementations and fixtures with primary sources and improving real-world robustness (e.g., MRZ/ICAO digits, VAT ID normalization), while also making previously unreachable recognizers available via the default registry.

Changes:

  • Corrected checksum algorithms for DE_HEALTH_INSURANCE, DE_SOCIAL_SECURITY, DE_LANR, and tightened DE_TAX_ID structural rules.
  • Added missing checksum/structural validation for DE_VAT_ID, DE_PASSPORT, DE_ID_CARD, and added KV-prefix whitelist validation for DE_BSNR.
  • Updated unit tests, registry configuration, changelog, and German-language recipe sample identifiers to match the corrected validators.

Reviewed changes

Copilot reviewed 19 out of 19 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_health_insurance_recognizer.py Fix KVNR checksum factors to match §290 SGB V Anlage 1.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_social_security_recognizer.py Fix RVNR checksum weights to VKVV §4 and update documentation.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_lanr_recognizer.py Fix LANR checksum algorithm per KBV Arztnummern-Richtlinie.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_tax_id_recognizer.py Enforce post-2016 BZSt repetition constraint for DE_TAX_ID.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_vat_id_recognizer.py Add VAT ID normalization + ISO 7064 Mod 11,10 validation with heuristic/strict mode.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_passport_recognizer.py Add ICAO Doc 9303 check digit validation for DE passports.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_id_card_recognizer.py Tighten nPA regex and add ICAO check digit validation; keep legacy T-format as non-validated.
presidio-analyzer/presidio_analyzer/predefined_recognizers/country_specific/germany/de_bsnr_recognizer.py Add KV-prefix whitelist-based structural validation for BSNR.
presidio-analyzer/presidio_analyzer/conf/default_recognizers.yaml Register additional German recognizers in the default registry (disabled by default).
presidio-analyzer/tests/test_de_health_insurance_recognizer.py Update KVNR fixtures and add edge-case coverage for corrected checksum.
presidio-analyzer/tests/test_de_social_security_recognizer.py Replace fixtures with VKVV §4-aligned examples and negative cases.
presidio-analyzer/tests/test_de_lanr_recognizer.py Replace fixtures to match corrected LANR checksum.
presidio-analyzer/tests/test_de_tax_id_recognizer.py Add tests for the post-2016 repetition rule.
presidio-analyzer/tests/test_de_vat_id_recognizer.py Add heuristic-vs-strict behavior tests plus normalization variants and span assertions.
presidio-analyzer/tests/test_de_passport_recognizer.py Add ICAO checksum validation tests plus span/score assertions.
presidio-analyzer/tests/test_de_id_card_recognizer.py Update fixtures for ICAO-valid nPA and legacy acceptance; add validate_result assertions.
presidio-analyzer/tests/test_de_bsnr_recognizer.py Update fixtures for KV-prefix whitelist behavior and validate_result assertions.
docs/recipes/german-language-support/README.md Update sample identifiers to values which pass corrected validators.
CHANGELOG.md Document fixes/additions under [unreleased] for the German recognizers.

Comment thread presidio-analyzer/tests/test_de_id_card_recognizer.py
Comment thread presidio-analyzer/tests/test_de_bsnr_recognizer.py
MvdB and others added 5 commits April 22, 2026 08:20
…k digit

Addresses Copilot review comment on PR #1990: the docstring previously
said the letter-ordinal expansion yields an "effective 13-digit string"
while the implementation correctly builds a 12-digit `effective` buffer
and applies 12 weights. Reworded to make clear that the 12 data digits
(positions 1–8 + 2 letter-ordinal digits + positions 10–11) are what
gets weighted, and the check digit at position 12 stands apart.

No behavioural change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses Copilot review comments on PR #1990 (validator + test
assertions). The previous validate_result returned True for any 9-digit
input whose 2-digit prefix was in VALID_KV_CODES, which Presidio
upgrades to MAX_SCORE (1.0). On a base pattern as broad as \b\d{9}\b
this is too strong — a random 9-digit sequence that happens to start
with one of the 19 whitelisted KV codes would match at max confidence
without any context.

New behaviour:
- structurally invalid (wrong length, non-digit, all-zero) → False (drop)
- everything else                                          → None
  (keep pattern score 0.2, let the ContextAwareEnhancer use
  "Betriebsstättennummer"/"BSNR"/"Praxis"/… to drive final confidence)

VALID_KV_CODES is retained as documentation of the expected KV range
per KBV Arztnummern-Richtlinie Anlage 1; it just no longer acts as a
confidence gate.

Tests regenerated to parametrize expected_score per case instead of
relying on a blanket max_score fixture — the previous test file
declared expected_positions in its signature but never actually
asserted spans or scores, so scoring regressions could not be caught.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…etters

Addresses Copilot review comment on PR #1990. The relaxed pattern
\b[A-Z][A-Z0-9]{7}[0-9]\b accepted A, B, D, E, I, O, Q, S, U as first
letter — all excluded from the ICAO Doc 9303 travel-document charset.
validate_result would then still compute a MRZ-style check digit for
these characters (A=10, B=11, …), so a non-German string whose check
digit happened to match could be upgraded to MAX_SCORE.

The strict pattern already covers every legitimate German passport
number, so the relaxed pattern was pure false-positive surface without
recall benefit.

Additionally, validate_result now rejects ICAO-forbidden letters
explicitly as a defense-in-depth guard against any future pattern
regression. Two negative test cases added for coverage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Addresses Copilot review comment on PR #1990. The parametrized test
previously accepted expected_positions and max_score in its signature
but only asserted len(results), so scoring regressions (ICAO-valid nPA
should score MAX vs legacy T-format at pattern score 0.5) could not be
caught.

Parametrize now carries an explicit expected_score per case:
  1.0  for ICAO-validated nPA entries
  0.5  for legacy "T + 8 digits" entries (pattern score, no checksum)
  None for dropped matches (no assertion runs)

Full assert_result call is now executed in the loop so span and score
are both verified.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Minor follow-up to 9051ffb — the updated docstring contains escaped
regex characters (\b\d{9}\b), which ruff flags as D301 unless the
docstring is marked raw. Switch to r""" and use the literal form,
no content change.
@MvdB
Copy link
Copy Markdown
Contributor Author

MvdB commented Apr 22, 2026

Addressed all five Copilot comments — commit mapping for the record:

  • Comment 1 (test_de_id_card_recognizer.py): 31341e7 — added per-case expected_score to the parametrize and now actually calls assert_result() in the loop. nPA with valid ICAO → MAX_SCORE; legacy "T + 8 digits" → pattern score 0.5.
  • Comment 2 (test_de_bsnr_recognizer.py): 9051ffb — same parametrize shape; matches the new BSNR scoring behaviour from comment 4.
  • Comment 3 (de_social_security_recognizer.py docstring): 2854583 — clarified that the letter expansion yields 12 data digits weighted, separate from the check digit at position 12.
  • Comment 4 (de_bsnr_recognizer.py): 9051ffbvalidate_result no longer promotes whitelisted-prefix matches to MAX_SCORE. On a \b\d{9}\b base pattern that upgrade was too strong for a 2-digit-prefix check. Returns None for every structurally-valid input; context words drive final confidence. VALID_KV_CODES is retained as documentation of the expected KV range per KBV Arztnummern-Richtlinie Anlage 1.
  • Comment 5 (de_passport_recognizer.py): def5267 — removed the relaxed \b[A-Z][A-Z0-9]{7}[0-9]\b pattern entirely (the strict ICAO charset covers every legitimate German passport) and added a defense-in-depth check in validate_result that rejects ICAO-forbidden letters (A, B, D, E, I, O, Q, S, U) outright. Two negative tests added.

Plus 7f9526b — minor ruff D301 follow-up after the BSNR docstring update (raw docstring since it contains \b\d{9}\b).

Full DE suite: 306 passing. Ruff clean.

@omri374 omri374 requested a review from Copilot April 22, 2026 18:07
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 2 comments.

Comment thread CHANGELOG.md Outdated
MvdB and others added 2 commits April 22, 2026 20:24
…ntation-only

Fixes Copilot review comment #3126014708. The original wording claimed
a defense-in-depth check but validate_result does not reject unknown
prefixes (that was intentional per commit 9051ffb). Clarify accordingly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes Copilot review comment #3126014784. The relaxed pattern
`\b\d{8}[A-Z]\d{3}\b` matches strings with impossible birth dates;
without a date check, such a match could pass the VKVV checksum by
chance and get promoted to MAX_SCORE. Enforce day (01-31 or 51-81
with Ergaenzungsmerkmal) and month (01-12) structurally before the
checksum to close this false-positive vector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MvdB
Copy link
Copy Markdown
Contributor Author

MvdB commented Apr 22, 2026

Second Copilot review round addressed — commit mapping:

  • CHANGELOG.md (#3126014708): ad16bca — reworded the DeBsnrRecognizer entry to match the actual behaviour. The KV regional-code list is retained for documentation / defense-in-depth purposes per KBV Arztnummern-Richtlinie Anlage 1, but validate_result does not reject unknown prefixes (9051ffb from the previous round made that deliberate — BSNR has no public checksum, so a 2-digit-prefix drop on a \b\d{9}\b base pattern would be too aggressive).
  • de_social_security_recognizer.py (#3126014784): 6a9ba8bvalidate_result now enforces the VKVV § 4 structural invariants on the birth-day (01–31, or 51–81 with Ergänzungsmerkmal) and month (01–12) before running the checksum. Previously, the relaxed \b\d{8}[A-Z]\d{3}\b pattern could surface strings with impossible dates and, on a ~10% random chance, have them pass the checksum and be promoted to MAX_SCORE. The date check closes that false-positive vector. Four explicit impossible-date cases added to the validate_result parametrize.

Full DE suite: 310 passing. Ruff clean.

SharonHart
SharonHart previously approved these changes Apr 23, 2026
Copy link
Copy Markdown
Contributor

@SharonHart SharonHart left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing contribution, thanks @MvdB!

@omri374 omri374 merged commit 98fb8de into microsoft:main Apr 23, 2026
34 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug: DE_HEALTH_INSURANCE (KVNR) checksum rejects valid values (factor sequence mismatch vs §290 SGB V Anlage 1)

4 participants