Add Finnish recognizers: FI_HETU (henkilötunnus) and FI_BUSINESS_ID (Y-tunnus)#4

Merged
gregmos merged 1 commit into gregmos:main from akunikkola:feat/finnish-recognizers
May 28, 2026

@akunikkola
Contributor

What

Adds two Finnish national-identifier recognizers. Finland had no pattern recognizers, so Finnish PII (personal identity code, business ID) slipped through.

  • FI_HETU (henkilötunnus, personal identity code): DDMMYY + century marker (+, -, A–F, U–Y) + 3‑digit individual number + control character. The control char is restricted to 0123456789ABCDEFHJKLMNPRSTUVWXY (no G/I/O/Q) and the regex validates the day and month ranges, so the shape is highly specific; base score 0.85.
  • FI_BUSINESS_ID — Y‑tunnus: 7 digits + hyphen + check digit. The bare shape (\d{7}-\d) is generic, so base score 0.4 leans on the +0.35 context boost — same approach as DE_TAX_ID, FR_CNI, etc.
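
For readers who want to see the two shapes concretely, here is a minimal sketch of patterns matching the descriptions above. It is not the PR's actual source: the constant names, the context word list, and the `scoreBusinessId` helper are hypothetical, and the real recognizers may differ in detail (for example in how the context window is applied).

```typescript
// Illustrative sketch only: names and structure are assumptions, not the PR's source.

// DDMMYY with day 01-31 and month 01-12, a century marker (+, -, A-F, U-Y),
// a 3-digit individual number, and a control character (no G, I, O, Q, or Z).
const FI_HETU_PATTERN =
  /\b(0[1-9]|[12]\d|3[01])(0[1-9]|1[0-2])\d{2}[+\-A-FU-Y]\d{3}[0-9A-FHJ-NPR-Y]\b/;

// Seven digits, a hyphen, and a single check digit.
const FI_BUSINESS_ID_PATTERN = /\b\d{7}-\d\b/;

// Hypothetical context words; the PR's tests mention these two.
const BUSINESS_ID_CONTEXT = ["y-tunnus", "yritystunnus"];

// Base score 0.4, plus a 0.35 boost when a context word appears in the text.
// Returns null when the text contains no Y-tunnus-shaped token.
function scoreBusinessId(text: string): number | null {
  if (!FI_BUSINESS_ID_PATTERN.test(text)) return null;
  const lower = text.toLowerCase();
  const hasContext = BUSINESS_ID_CONTEXT.some((word) => lower.includes(word));
  return hasContext ? 0.4 + 0.35 : 0.4;
}
```

The HETU pattern only checks day and month ranges, so impossible dates such as 31 February still match; that is the usual trade-off of validating dates in a regex alone.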

Design

Follows the existing regex + context + score recognizer design exactly — no engine changes. Registered in entity-types.ts (SUPPORTED_ENTITIES + TAG_NAMES: FI_HETU, FI_YID).

Tests

nodejs-v2/tests/finnish-recognizers-test.ts (11 assertions, same standalone style as the other tests):

  • valid HETU across 1800s/1900s/2000s century markers
  • rejection of invalid day (32), invalid month (13), forbidden control char (G)
  • Y‑tunnus detection with y-tunnus / yritystunnus context
  • rejection of wrong shapes (6 digits)

Run: npx tsx tests/finnish-recognizers-test.ts (11 passed, 0 failed).

Note: I verified the recognizers via the repo's test convention but did not run the full ONNX‑dependent build locally; CI (tsc --noEmit + smoke) covers integration. Happy to adjust scores, tag names, or add checksum validation (HETU control char / Y‑tunnus mod‑11) if you'd prefer stricter matching.
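
As a rough illustration of the stricter matching offered above, the sketch below implements the two standard checks: the HETU control character is the nine digits (DDMMYY plus individual number) taken modulo 31 and looked up in the control alphabet, and the Y‑tunnus check digit is a weighted mod‑11 sum. The function names are hypothetical and this code is not part of the PR.

```typescript
// Illustrative sketch only: function names are assumptions, not part of the PR.

const HETU_CONTROL_CHARS = "0123456789ABCDEFHJKLMNPRSTUVWXY";

// HETU: the 9 digits (DDMMYY + individual number) mod 31 index the control alphabet.
function hetuChecksumOk(hetu: string): boolean {
  const m = /^(\d{6})[+\-A-FU-Y](\d{3})([0-9A-Y])$/.exec(hetu);
  if (!m) return false;
  const value = Number(m[1] + m[2]);
  return HETU_CONTROL_CHARS[value % 31] === m[3];
}

// Y-tunnus: weights applied to the first seven digits, left to right.
const Y_TUNNUS_WEIGHTS = [7, 9, 10, 5, 8, 4, 2];

function yTunnusChecksumOk(id: string): boolean {
  const m = /^(\d{7})-(\d)$/.exec(id);
  if (!m) return false;
  const sum = m[1]
    .split("")
    .reduce((acc, digit, i) => acc + Number(digit) * Y_TUNNUS_WEIGHTS[i], 0);
  const remainder = sum % 11;
  if (remainder === 1) return false; // no valid check digit exists for remainder 1
  const expected = remainder === 0 ? 0 : 11 - remainder;
  return expected === Number(m[2]);
}
```

With these helpers, `hetuChecksumOk("131052-308T")` returns true (131052308 mod 31 is 25, which maps to T), and `yTunnusChecksumOk("0112038-9")` returns true (the weighted sum is 57, remainder 2, check digit 9). A recognizer could use such checks to raise the score of a match or discard it, at the cost of missing identifiers that were mistyped in the source text.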

Context: contributed while building an open-source Finnish legal toolkit for Claude that recommends PII Shield for anonymisation — Finnish identifier coverage was the one gap.

…Y-tunnus)

Finland had no pattern recognizers, so Finnish national identifiers slipped
through. This adds two, following the existing regex + context + score design
(no engine changes):

- FI_HETU: DDMMYY + century marker (+,-,A–F,U–Y) + 3-digit individual number +
  control character. The control char is restricted to
  "0123456789ABCDEFHJKLMNPRSTUVWXY" (no G/I/O/Q) and the date is validated by the
  regex, so the shape is highly specific — base score 0.85.
- FI_BUSINESS_ID (Y-tunnus): 7 digits + hyphen + check digit. The bare shape is
  generic, so base score 0.4 leans on the +0.35 context boost (mirrors DE_TAX_ID,
  FR_CNI).

Registered in entity-types.ts (SUPPORTED_ENTITIES + TAG_NAMES: FI_HETU, FI_YID)
and README coverage updated.

Tests: tests/finnish-recognizers-test.ts (11 assertions) — valid HETU across
1800s/1900s/2000s century markers, rejection of invalid day/month and forbidden
control chars, Y-tunnus detection with context, rejection of wrong shapes. Run
with `npx tsx tests/finnish-recognizers-test.ts`; all pass. Did not run the full
ONNX-dependent build locally — CI (tsc --noEmit + smoke) covers integration.
akunikkola added a commit to akunikkola/claude-for-legal-finland that referenced this pull request May 27, 2026
…path

Wire PII Shield (local PII anonymization — "PII never enters the API") into the
toolkit as a recommended companion:

- tietosuoja/CLAUDE.md: new "Anonymisointi ennen analyysiä" guardrail
- tietosuoja/README.md: install paths (official .mcpb/CLI) + how to get Finnish
  HETU/Y-tunnus recognizers immediately by building our fork
  (akunikkola/PII-Shield feat/finnish-recognizers), pending upstream
  gregmos/PII-Shield#4
- juristi/CLAUDE.md: confidentiality guardrail now points to PII Shield

Not added to .mcp.json: PII Shield ships as a .mcpb extension / CLI, not a
portable npx MCP server, and the FI recognizers live in our fork until the PR
merges — so it's a local, machine-bound stdio setup documented in the README
rather than a portable plugin connector.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@gregmos
Owner

gregmos commented May 28, 2026

Thanks for the contribution! Clean PR — recognizers follow the existing regex + context + score pattern exactly, tests match the standalone style, and the checksum-valid test value for HETU (131052-308T) is a nice touch. CI is green across the full matrix (3 OS × 2 Node).

Merging via squash. Will ship in v2.2.0 shortly.

@gregmos gregmos merged commit 5804fa8 into gregmos:main May 28, 2026
6 checks passed
gregmos added a commit that referenced this pull request May 28, 2026
Bump package, manifest, VERSION constant, README asset links, and
smoke-banner from 2.1.0 to 2.2.0. New in this release: Finnish
recognizers FI_HETU and FI_BUSINESS_ID (PR #4 by @akunikkola).

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
akunikkola added a commit to akunikkola/claude-for-legal-finland that referenced this pull request May 28, 2026
gregmos/PII-Shield#4 (Finnish HETU + Y-tunnus recognizers) was merged and
released as v2.2.0 on 2026-05-28. Official npm `pii-shield@2.2.0` and the
v2.2.0 .mcpb releases now ship FI_HETU and FI_BUSINESS_ID natively — no fork
needed.

- tietosuoja/README.md: replace the "build from fork" stopgap with three
  official install paths: A) .mcpb in Claude Desktop (recommended),
  B) npm install -g pii-shield (CLI), C) build server bundle for a local
  stdio MCP entry. Cite PR #4 as the merged origin of FI support.
- tietosuoja/CLAUDE.md: update Finnish-identifiers note — v2.2.0 supports
  FI_HETU and FI_BUSINESS_ID natively; still verify results.

Local PII-Shield checkout updated to upstream main (v2.2.0) and rebuilt;
the user-scope pii-shield MCP keeps working at the same path.

3 participants