
SP6 — Corpus-grounded text taxonomy + PII-safe placeholder grammar#202

Merged: mivertowski merged 216 commits into main from sp6-text-taxonomy on May 20, 2026

Conversation

@mivertowski
Owner

@mivertowski mivertowski commented May 18, 2026

Summary

  • Replaces the SP4.4 TextTemplate* path with a structured, PII-safe text-taxonomy pipeline keyed by (source × ISO-21378-account-class) — every synthetic header / line / CoA description is now coherent with the account class it posts to and carries zero residual corpus PII.
  • Two privacy gates protect the public bundles: a build-time residual-PII audit on every regen, and a CI bundle_pii_audit test over the committed .dsf bundles.
  • All 5 industry priors regenerated through the new pipeline. SP6.1 follow-up rewrote PiiDenylist literal matching to single-pass Aho-Corasick (per-client extraction: ~9 h → ~30–50 s on a 27 k-entry denylist), making iterative denylist curation viable.

Implements docs/superpowers/specs/2026-05-14-sp6-text-taxonomy-design.md; tracked in docs/superpowers/plans/2026-05-14-sp6-text-taxonomy.md. CHANGELOG entry as v5.27.

What ships

| Component | What |
|---|---|
| datasynth-core::distributions::text_taxonomy | PlaceholderGrammar (tokenize Phase A / fill / residual_pii_scan) + types |
| datasynth-fingerprint::extraction::pii_denylist | PiiDenylist::{load,apply} — Aho-Corasick literals + regex sweep |
| datasynth-fingerprint::extraction | extract_text_taxonomy + aggregate_text_taxonomy |
| datasynth-generators::je_generator | MasterDataResolver + (source, account_class) template lookup |
| datasynth-generators::coa_generator | CoA description fill once per account, overlay_coa_taxonomy orchestration |
| datasynth-runtime/tests/bundle_pii_audit.rs | CI gate over committed bundles |
| scripts/regenerate-industry-priors.sh | --pii-denylist flag + post-regen audit gate |

Bundle deltas

| Bundle | line pools | header pools | CoA pools | clients |
|---|---|---|---|---|
| health | 794 | 57 | 3 123 | 8 |
| life_sciences | 319 | 34 | 1 362 | 4 |
| pharmaceutical | 0 | 8 | 76 | 1 |
| technology | 101 | 14 | 541 | 1 |
| power_and_utilities | 192 | 39 | 552 | 1 |

What's removed

BehavioralPriors.text_templates and the entire TextTemplate* / fill_text_template_with_rng / extract_text_templates / aggregate_text_templates SP4.4 path. All text now flows through text_taxonomy.

BF metrics

Intentionally not re-baselined here — SP6 only touches text fields, not the temporal / clustering / amount signals the composite measures. Next baseline picks up incidental drift only.

Test plan

  • cargo test -p datasynth-runtime --test sp6_text_taxonomy_smoke (covers placeholder fill + zero residual PII over generated JEs)
  • cargo test -p datasynth-runtime --test bundle_pii_audit (scans every committed .dsf for residual fuzzy PII)
  • cargo test -p datasynth-generators --lib synthetic_patient_pool_entries_pass_residual_scan (regression guard on the synthetic-fallback shape invariant)
  • cargo test -p datasynth-fingerprint --lib pii_denylist::tests (6/6 including apply_scales_with_aho_corasick_not_with_denylist_size perf invariant)
  • scripts/regenerate-industry-priors.sh runs the audit gate as a build-time check; full regen against the private corpus completes ~clean.
  • CI on this PR

🤖 Generated with Claude Code

mivertowski and others added 30 commits May 11, 2026 22:33
SP1 of the broader Behavioral-Fidelity initiative (SP1 evaluation → SP2 prior extraction → SP3 entity-aware generation → SP4 showcase release). Adds a new datasynth-eval/behavioral_fidelity/ submodule with the Sajja (2026) P1–P4 metrics adapted to GL semantics: Source as primary entity, Trading Partner as secondary, EntryDate as day-resolution timestamp, plus synth-only intraday metrics and a 10-rule canonical GL velocity rule set. Anchors every metric to a 50/50-split noise floor via the degradation-ratio normalizer, ships a `datasynth-data behavioral score` CLI with CI-gate semantics, and stays clean of real client data in the repo.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23-task TDD plan covering: scaffolding (T1) → types + entity profile + math helpers (T2-T5) → parquet/csv loader (T6) → P1 IETD (T7) → P2 active lifetime + burst length + JE-line-burst (T8-T10) → P3 fan-out + clustering + triangles (T11-T12) → P4 canonical R1-R10 + trigger gap (T13) → degradation ratio + 50/50 split (T14) → intraday metrics (T15) → BehavioralFidelityReport + JSON/MD/CSV writers (T16-T17) → compute_report orchestrator (T18) → datasynth-data behavioral score CLI (T19) → integration smoke + noise-floor sanity tests (T20-T21) → CI wiring (T22) → user docs + README + CLAUDE.md (T23). Each task has bite-sized TDD steps with complete code blocks. Sequenced into 10 parallel-execution waves for subagent dispatch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the new submodule directory + Cargo.toml dependencies (arrow, parquet, petgraph) under datasynth-eval. Empty module stubs land in subsequent tasks. Module hierarchy mirrors the spec at docs/superpowers/specs/2026-05-11-sp1-behavioral-fidelity-design.md.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Record (one JE line, normalised), EntityProfile, RuleSet + VelocityRuleSpec enum, GateThresholds, and BehavioralFidelityConfig::gl_default() returning the source-tp profile with seed 42 and DR-gate thresholds 2.0 / 1.5.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gl_source_tp() returns the Source + Trading Partner profile. Real-corpus alias map preserves the typo 'Tarding Partner' from the source data exports. Synthetic alias map points at DataSynth's journal_entries output column names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wasserstein-1 with fast equal-length path and quantile-grid fallback for unequal lengths. Pearson lag-1 autocorrelation returning None for too-short or zero-variance series. Empirical percentile, day-difference and weekend predicates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
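The two math helpers named in the commit above (Wasserstein-1 on the equal-length fast path, and Pearson lag-1 autocorrelation) can be sketched as follows. This is a minimal illustration using only the standard library, not the code in `datasynth-eval`: the function names are hypothetical, the quantile-grid fallback for unequal lengths is left out, and the "too short" cutoff of three points is an assumption.

```rust
/// Wasserstein-1 distance between two empirical samples of equal length:
/// the mean absolute difference between the sorted values.
/// Returns None for empty or unequal-length inputs (the unequal-length
/// quantile-grid fallback is intentionally not sketched here).
fn wasserstein1_equal_len(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.is_empty() || a.len() != b.len() {
        return None;
    }
    let mut xs = a.to_vec();
    let mut ys = b.to_vec();
    xs.sort_by(|p, q| p.total_cmp(q));
    ys.sort_by(|p, q| p.total_cmp(q));
    let total: f64 = xs.iter().zip(&ys).map(|(x, y)| (x - y).abs()).sum();
    Some(total / xs.len() as f64)
}

/// Pearson correlation between a series and itself shifted by one step.
/// Returns None when the series has fewer than 3 points (assumed cutoff)
/// or when either shifted window has zero variance.
fn lag1_autocorrelation(series: &[f64]) -> Option<f64> {
    if series.len() < 3 {
        return None;
    }
    let x = &series[..series.len() - 1];
    let y = &series[1..];
    let n = x.len() as f64;
    let mean_x = x.iter().sum::<f64>() / n;
    let mean_y = y.iter().sum::<f64>() / n;
    let mut cov = 0.0;
    let mut var_x = 0.0;
    let mut var_y = 0.0;
    for (a, b) in x.iter().zip(y) {
        cov += (a - mean_x) * (b - mean_y);
        var_x += (a - mean_x).powi(2);
        var_y += (b - mean_y).powi(2);
    }
    if var_x == 0.0 || var_y == 0.0 {
        return None;
    }
    Some(cov / (var_x * var_y).sqrt())
}
```

Returning `Option` rather than `NaN` keeps degenerate series out of pooled averages without a separate validity flag.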
Auto-detects real-corpus vs synthetic schema (typo 'Tarding Partner' is the giveaway), maps to canonical column names, parses dates as either YYYY-MM-DD strings or Date32, supports millisecond/microsecond/nanosecond/RFC3339 timestamps for CreatedAt.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compute_p1 computes pooled inter-event-time W1 and pooled within-entity lag-1 autocorrelation gap at day resolution. Includes source_of / trading_partner_of projector helpers for the two entity profiles.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gap

Builds the entity-projection graph via petgraph::UnGraph; clustering coefficient via neighbour-pair triangle enumeration; triangle log-ratio gap = |log((t_real+1)/(t_syn+1))|.
Canonical GL velocity rules: R1 count, R2 distinct accounts, R3 sum>p90, R4 dormant-account wake, R5 distinct TPs, R6 amount spike, R7 off-hours, R8 post-close, R9 round-dollar share, R10 backdating. Per-rule trigger-rate gap + composite mean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
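For the P3 commit above, neighbour-pair triangle enumeration and the triangle log-ratio gap `|log((t_real+1)/(t_syn+1))|` might look like the sketch below. It uses plain adjacency sets instead of `petgraph::UnGraph` so it stays self-contained, it assumes the natural logarithm, and every function name is hypothetical.

```rust
use std::collections::BTreeSet;

/// Builds undirected adjacency sets for `n` nodes, ignoring self-loops.
fn adjacency(n: usize, edges: &[(usize, usize)]) -> Vec<BTreeSet<usize>> {
    let mut adj = vec![BTreeSet::new(); n];
    for &(a, b) in edges {
        if a != b {
            adj[a].insert(b);
            adj[b].insert(a);
        }
    }
    adj
}

/// Counts triangles by enumerating neighbour pairs (v, w) of each node u
/// with u < v < w, so every triangle is counted exactly once.
fn count_triangles(adj: &[BTreeSet<usize>]) -> u64 {
    let mut triangles = 0;
    for (u, neighbours) in adj.iter().enumerate() {
        let higher: Vec<usize> = neighbours.iter().copied().filter(|&v| v > u).collect();
        for (i, &v) in higher.iter().enumerate() {
            for &w in &higher[i + 1..] {
                if adj[v].contains(&w) {
                    triangles += 1;
                }
            }
        }
    }
    triangles
}

/// |ln((t_real + 1) / (t_syn + 1))|; the +1 keeps the ratio finite
/// when either graph has no triangles.
fn triangle_log_ratio_gap(t_real: u64, t_syn: u64) -> f64 {
    ((t_real as f64 + 1.0) / (t_syn as f64 + 1.0)).ln().abs()
}
```

The gap is symmetric in its arguments, so over- and under-producing triangles by the same factor are penalised equally.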
…ormaliser

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…edian seconds, off-hours rate)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(BTreeMap-ordered)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eport

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…raday + gates

Single entry point that produces a fully populated BehavioralFidelityReport with per-entity metrics (primary + optional secondary), baseline values from a deterministic 50/50 split, composite BF score (equal-weighted mean of all sub-metric DRs), and a gate result driven by GateThresholds.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires BehavioralScoreArgs to behavioral_fidelity::compute_report_from_paths,
writes report.json + report.md + metrics.csv to --out, exits 0/2 based on
gate result. SP1 ships the gl-source-tp profile only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-synthetic data

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes inner-doc comment in behavioral_smoke.rs so include! works from
the noise-floor test module. When real equals syn, all per-entity DRs
must stay below 2.0 (numerator ≈ 0, denominator = split baseline > 0).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
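The noise-floor reasoning above (numerator ≈ 0 when real equals syn, denominator = split baseline > 0) and the equal-weighted composite can be illustrated with a small sketch. The function names are hypothetical and the handling of a zero baseline is an assumption; the actual report code may treat that case differently.

```rust
/// Degradation ratio: the real-vs-synthetic gap divided by the gap measured
/// between two halves of the reference data (the noise floor).
/// Returns None when the baseline is not strictly positive (assumed policy).
fn degradation_ratio(gap_real_vs_syn: f64, gap_split_baseline: f64) -> Option<f64> {
    if gap_split_baseline > 0.0 {
        Some(gap_real_vs_syn / gap_split_baseline)
    } else {
        None
    }
}

/// Composite score as the equal-weighted mean of the sub-metric ratios.
fn composite_bf(drs: &[f64]) -> Option<f64> {
    if drs.is_empty() {
        None
    } else {
        Some(drs.iter().sum::<f64>() / drs.len() as f64)
    }
}
```

With identical inputs on both sides the numerator is zero, so the ratio is zero regardless of the baseline, which is why the noise-floor test can assert a bound well below 2.0.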
Adds two named integration test invocations to the eval workflow. Both run with --test-threads=4 per workspace policy.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/behavioral-fidelity.md documents the P1-P4 metrics, the R1-R10 velocity rule set, the day-granularity limitation, and CLI usage.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oise floor

First measurement of DataSynth against the real-corpus Swiss-healthcare GL data (JE_3, 1.4M lines, 2 years). Composite BF 59× lands mid-pack vs Sajja's tabular generators (CTGAN 32, TVAE 24, ARGN 36, Copula 39 on their dataset).

Key finding: P1 autocorrelation DR = 0.90 — at the real-data noise floor for the primary Source entity. Matches Sajja Proposition 2: row-independent generators are structurally incapable of producing positive within-entity IET autocorrelation (their best is TVAE at 5.9×). DataSynth's rule-based architecture preserves the burst-regularity fingerprint those generators destroy.

Where the DR mass concentrates (SP2/SP3 input):
- P2 JE-line-burst W1: 452.80× — biggest single fix target (lines-per-JE distribution)
- P3 triangle/clustering: 36-345× — bipartite Source-attribute graph structure
- P1 IETD W1: 60× — within-Source posting cadence
- P2 active lifetime: 23× — Sources active for different windows
- TradingPartner column gap: synthetic journal_entries has no trading_partner → degenerate TP scores

Also fixes synthetic_aliases to match DataSynth's actual journal_entries.csv schema (source / gl_account / document_id / line_number / posting_date / document_date / local_amount) — earlier aliases were guesses that didn't match the real output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SP2 of the broader Behavioral-Fidelity initiative (SP1 ✅ shipped → SP2 here → SP3 next → SP4 last). Mines the 45-client real-corpus for the five priors the SP1 baseline identified as concrete fix targets (composite BF 59.0× on JE_3): source-mix, per-Source IET, lines-per-JE (the 452.8× target), active lifetime, bipartite fan-out — plus a bonus posting-lag prior.

Extends datasynth-fingerprint with new models/behavioral.rs + extraction/behavioral_extractor.rs + aggregation/industry_aggregator.rs. Bumps .dsf schema_version with an additive optional `behavioral` field (old readers still work). Reuses datasynth-eval's Record + loader + entity_profile so schema-mapping stays in one place.

Five industry bundles ship under crates/datasynth-generators/resources/priors/: health, life_sciences, pharma, power_utilities, technology (industries with ≥3 clients in the 45-client corpus; smaller industries deferred). Three new CLI subcommands: `fingerprint extract --behavioral`, `fingerprint aggregate-industry`, `fingerprint inspect --behavioral`.

trading_partner column plumbing in journal_entries.csv is explicitly added to the SP3 target list per user direction; SP2 still extracts the TP fan-out prior from real data.

~2 weeks, 6 phases, mirrors SP1 dispatch pattern.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23-task TDD plan covering: scaffolding + BehavioralPriors model types (T1) → LineCountHistogram helpers (T2) → 5 prior extractors + bonus posting-lag (T3-T8) → extract_behavioral_priors orchestrator (T9) → aggregation scaffolding + 4 aggregators (T10-T14) → aggregate_industry_priors orchestrator (T15) → 3 CLI surfaces: fingerprint extract --behavioral / aggregate-industry / info --behavioral (T16-T18) → integration smoke + backward-compat test (T19-T20) → CI workflow (T21) → regenerate script (T22) → docs (T23). Each task has bite-sized TDD steps with complete code blocks. Final committed-bundle generation happens manually post-T23 via scripts/regenerate-industry-priors.sh against the real corpus (not a plan task).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds BehavioralPriors struct hierarchy (SourceMix, PerSourceIet, LinesPerJe, ActiveLifetime, Fanout, PostingLag) with empty-default round-trip test, an optional behavioral field on Fingerprint (additive via serde(default)), and stub modules for extraction/aggregation. Individual prior extractors land in tasks 2-9; aggregators in 10-15.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, unwrap Option wrapper

T1 subagent wrapped IetSummary.empirical_cdf_days / LagSummary.empirical_cdf_days in Option<EmpiricalCdf> because EmpiricalCdf lacked Default + PartialEq. Restores the spec's plain-EmpiricalCdf shape by deriving both on EmpiricalCdf — one-line change — and restoring PartialEq derives on the containing structs (BehavioralPriors, PerSourceIetPrior, IetSummary, PostingLagPrior, LagSummary).

This means downstream T2-T9 extractor code can use `empirical_cdf_days: EmpiricalCdf::from_sorted_values(...)` directly without the Option wrapper, matching the plan.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n helpers + bucket grids

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
mivertowski and others added 12 commits May 18, 2026 23:09
The integration smoke `extract_aggregate_inspect_roundtrip` was failing
on CI's OS-matrix Test jobs (ubuntu/macos/windows). Root cause: SP3.8b
set `DEFAULT_MIN_SOURCE_OBSERVATIONS = 1000` to drop the long tail of
low-volume sources from `source_mix`. The test spreads draws across 6
codes, so at the prior n=3000 each code only got ~500 observations and
all six were filtered → `source_mix.probabilities.is_empty()`.

The earlier F2 fix (9d4caa5) caught the *unit* test that uses the same
threshold but missed this integration test. Bump n=3000 → 9000 (~1500
per code, comfortably above 1000 at 2σ across seed variance) and update
the `n_rows_aggregated` assertion accordingly.

Pre-existing failure latent since fcacb0b, surfaced now by CI's OS-matrix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
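The failure mode in the commit above is easy to reproduce in isolation. The sketch below is not the extractor's code: `source_mix` and `even_counts` are hypothetical helpers, and whether the real threshold comparison is `>=` or `>` is an assumption.

```rust
use std::collections::BTreeMap;

/// Threshold value quoted in the commit message above.
const DEFAULT_MIN_SOURCE_OBSERVATIONS: u64 = 1000;

/// Drops sources with fewer than `min_obs` observations, then normalises
/// the remaining counts into probabilities.
fn source_mix(counts: &BTreeMap<String, u64>, min_obs: u64) -> BTreeMap<String, f64> {
    let kept: Vec<(&String, u64)> = counts
        .iter()
        .map(|(code, &n)| (code, n))
        .filter(|&(_, n)| n >= min_obs)
        .collect();
    let total: u64 = kept.iter().map(|&(_, n)| n).sum();
    kept.into_iter()
        .map(|(code, n)| (code.clone(), n as f64 / total as f64))
        .collect()
}

/// Builds an observation table with `n_rows` draws spread evenly over `n_codes`.
fn even_counts(n_rows: u64, n_codes: u64) -> BTreeMap<String, u64> {
    (0..n_codes).map(|i| (format!("S{i}"), n_rows / n_codes)).collect()
}
```

With 3000 rows over 6 codes every source lands at 500 and is filtered out, leaving an empty mix; at 9000 rows each source reaches 1500 and all six survive.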
The codebase had `#![deny(clippy::unwrap_used)]` crate-wide on 11 lib.rs
files and worked around it with 672 per-mod `#[allow(clippy::unwrap_used)]`
annotations on test modules. CI's clippy step ran `cargo clippy
--all-targets` against the workspace root virtual manifest — which only
exercises the bench harness, not the members — so this hybrid policy was
never actually enforced end-to-end. Locally `cargo clippy --workspace
--tests` produced 66+ errors.

Cleanup:

- Switch each affected lib.rs from `#![deny(clippy::unwrap_used)]` to
  `#![cfg_attr(not(test), deny(clippy::unwrap_used))]`. Production-code
  policy unchanged (strict deny); tests get the natural latitude they
  need (unwrap is the intended assertion mechanism in test fixtures).
  Matches the industry-standard Rust pattern.
- Remove the now-redundant per-mod `#[allow(clippy::unwrap_used)]`
  annotations from 672 test modules across all crates. Smaller diff,
  single canonical policy expressed in lib.rs.
- Fix the 4 remaining non-unwrap clippy errors that surfaced once the
  workspace --tests scope was actually exercised:
    * needless_borrow_for_generic_args in sp3_priors_smoke.rs:530
    * field_reassign_with_default in sp3_priors_smoke.rs:566
    * unnecessary_unwrap in sp3_priors_smoke.rs:1448 (use `if let Some`)
    * useless_vec in velocity_rules.rs:234
- Tighten CI workflow: `cargo clippy --all-targets` →
  `cargo clippy --workspace --all-targets`. Now the policy actually
  fires across every member crate's lib + tests + benches + examples.

Verified: `cargo clippy --workspace --all-targets -- -D warnings` clean;
all workspace lib tests pass (4761+ tests across 17 crates); sp6
integration smokes + bundle audit + behavioral priors smoke all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
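A small illustration of the `cfg_attr(not(test), ...)` policy described above. The commit applies it as an inner attribute at the top of each `lib.rs`; the sketch below uses the outer-attribute form on a single module so it can stand alone, and `parsing` / `parse_port` are hypothetical names.

```rust
// In lib.rs the crate-wide form is the inner attribute
// `#![cfg_attr(not(test), deny(clippy::unwrap_used))]`.
// The outer-attribute form below scopes the same policy to one module.
#[cfg_attr(not(test), deny(clippy::unwrap_used))]
mod parsing {
    /// Production code: no `unwrap`, a bad input surfaces as `None`.
    pub fn parse_port(raw: &str) -> Option<u16> {
        raw.trim().parse().ok()
    }

    #[cfg(test)]
    mod tests {
        use super::parse_port;

        #[test]
        fn parses_a_valid_port() {
            // Under `cfg(test)` the deny is not applied, so `unwrap`
            // needs no per-module `#[allow]`.
            assert_eq!(parse_port(" 8080 ").unwrap(), 8080);
        }
    }
}
```

Because the lint level is decided once per crate by the build configuration, test modules no longer need their own allow annotations.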
Add entries for artifacts that accumulated during SP6 development and
that don't belong in the repo:

- `.claude/` — per-project Claude Code state (settings, transcripts)
- `.playwright-mcp/` — Playwright MCP capture artifacts
- `hf_*_staging/` / `hf_*_output/` — Hugging Face dataset staging/output
- `spaces/*/preview-live.png` — browser-captured preview screenshots

All regenerable; none are source. Cuts the recurring noise in `git status`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three TODO comments were really "this is a known scope limitation,
here's why we chose to limit scope, here's the workaround" — not action
items. They already carried explicit "for X use-case the approximation
is acceptable" caveats. Switching `TODO:` → `LIMITATION:` preserves the
content and rationale while removing the false to-do signal that shows
up in grep sweeps.

- copula::incomplete_beta — continued-fraction convergence on extreme
  tails (intentional for audit-simulation copula sampling)
- holidays::approximate_chinese_new_year — lunar approximation
  (intentional for activity-pattern simulation, off by ±1-2 days)
- period_close::ThirteenPeriod — 28-day fiscal period (intentional for
  simulation; not appropriate for production period-close)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`cargo doc --workspace --no-deps` was emitting 80 warnings split into
two categories:

- Math notation rustdoc misreads as links: `[0,1]`, `[i]`, `[0]`,
  `[:100]`, `[DE]`. Fix: escape brackets — `[0,1]` → `\[0,1\]`.
- Real broken refs to renamed/moved methods/types (`[generate]`,
  `[run_shard]`, `[CountryPack]`, `[ConditionalSampler::new]`, etc.).
  Fix: drop the link form, keep the code-formatted name —
  ``[`NAME`]`` → `` `NAME` ``.

Both are pure-cosmetic fixes. Rendered docs look identical to a reader
(math notation is unchanged; broken links were rendering as raw text
anyway). Warnings now zero across the workspace. Unblocks adding
`-D warnings` to `cargo doc` in CI in a future hardening pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
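Both fix shapes from the commit above, shown on a doc comment. `clamp01` is a hypothetical function used only to carry the comment; the escaping and the plain code span are the two patterns the commit names.

```rust
/// Clamps `x` into the unit interval \[0,1\].
///
/// The brackets are escaped so rustdoc does not try to resolve the
/// interval as an intra-doc link. A name that no longer resolves is
/// written as a plain code span, for example `run_shard`, rather than
/// in link form.
pub fn clamp01(x: f64) -> f64 {
    x.clamp(0.0, 1.0)
}
```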
`bundled_priors_path()` returns a `PathBuf` whose string form uses the
host's native separator. On Windows the previous assertion
`s.contains("priors/industry_priors_health.dsf")` always failed because
the rendered path uses `\`. Split into two substring checks
(`contains("priors")` AND `contains("industry_priors_health.dsf")`) so
the test holds on both platforms.

Pre-existing failure on `Test (windows-latest)` since the bundle
relocation; surfaced now by CI's OS-matrix after we tightened other
checks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
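The commit above fixes the Windows failure with two substring checks. A variant that avoids string rendering altogether is to compare path components, sketched below under the assumption that only the directory name and file name matter; `bundle_path` is a hypothetical stand-in for `bundled_priors_path()`, and `path_has_components` is not part of the codebase.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical stand-in for the bundle path helper: joins with the
/// host's native separator, as `PathBuf::join` always does.
fn bundle_path(root: &Path) -> PathBuf {
    root.join("priors").join("industry_priors_health.dsf")
}

/// Separator-independent check: some component equals `dir`
/// and the final component equals `file`.
fn path_has_components(path: &Path, dir: &str, file: &str) -> bool {
    let in_dir = path.components().any(|c| c.as_os_str() == dir);
    let is_file = path.file_name().map_or(false, |f| f == file);
    in_dir && is_file
}
```

Comparing components never depends on whether the rendered string uses `/` or `\`, so the same assertion holds on every runner in the OS matrix.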
cargo-machete (validated by grep + cfg-gated probe) flagged ~33
dependencies declared in Cargo.toml but never actually used. After
audit, all 32 below have ZERO source uses (no `use X`, no `X::`, no
cfg-gated import, no `#[derive(X)]`) and were just dead declarations:

  datasynth-ocpm       -- thiserror
  datasynth-eval       -- datasynth-config, datasynth-generators, rayon
  datasynth-fingerprint-- anyhow, datasynth-config
  datasynth-banking    -- rand_distr, thiserror
  datasynth-runtime    -- crossbeam-channel, tokio
  datasynth-generators -- rayon
  datasynth-output     -- thiserror, tokio, uuid, crossbeam-channel
  datasynth-server     -- anyhow, async-trait, datasynth-generators,
                          datasynth-output, subtle, thiserror
  datasynth-cli        -- datasynth-graph, indicatif, itoa, ryu, serde,
                          tokio
  datasynth-test-utils -- rand, rand_chacha, serde_json, tempfile
  datasynth-group      -- datasynth-audit-fsm

Notable wins: dropping `datasynth-generators` + `datasynth-output` from
`datasynth-server` and `datasynth-generators` + `datasynth-config` from
`datasynth-eval` removes large transitive build paths.

Four deps that machete reported but ARE used (via feature wiring rather
than direct `use`) are kept and pinned in `[package.metadata.cargo-machete]
ignored` blocks with explanatory comments:
  - datasynth-core/safetensors  (public optional dep, neural feature)
  - datasynth-cli/reqwest       (public optional dep, streaming feature)
  - fuzz/arbitrary              (libfuzzer harness macro consumption)
  - attic/datasynth-graph-export/datasynth-banking (excluded crate)

Verified: cargo check --workspace --tests green; cargo machete clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs `cargo machete` on every PR. Currently clean across the workspace;
new unused dep declarations will be caught at PR time. Pinned ignore
entries (safetensors, reqwest, arbitrary, datasynth-banking-in-attic)
document the legitimate exceptions inline.

Mirrors the shape of the existing Security Audit job (cargo-deny +
cargo-audit) — same `cargo install --locked` pattern, same exclusion
of the private graph-export sibling.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Cartesian default produced O(debits × credits) edges per JE.
A 50-debit / 50-credit period-close consolidation alone yielded 2 500
edges; a typical HF-scale 1 M-line config blew up to 200 M+ edges
(noted in the original doc-comment). On 14–16 GB CI runners, the small-
complexity CLI smoke `test_generate_from_config_file` hit a 20 GB
allocation request inside `RawVec::grow_one` and OOM'd on Windows
(immediate `memory allocation failed`) / hung-then-killed on macOS
(slower allocator path).

Surface evidence: main alternates pass/fail on this same test for
exactly this reason — the OS-matrix Test job is the only one running
`--all-targets` (Linux skips integration tests for cumulative-memory
reasons per workflow comment).

Flip the default to Method A: one edge per 2-line JE, skip multi-line.
Bounded edge count (≈ 60 % of entries per Ivertowski 2024), exactness-
preserving (confidence = 1.0 on every emitted edge), the recommended
shape for published reference datasets. Users relying on the previous
Cartesian shape can opt back in with `je_network.method: cartesian`
explicitly — `JeNetworkConfig` already plumbs the choice through.

Local verification: `test_generate_from_config_file` previously ran for
629 s before being interrupted on macOS / OOM'd at 361 s on Windows.
With Method A default it finishes in **47 s** locally. All 4
`je_network` unit tests (including the explicit `Cartesian` cases) keep
passing — only the *default* changed.

CHANGELOG: noted as a v5.27 breaking change with migration steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
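The edge-count difference between the two methods in the commit above can be checked with a few lines. This sketch reduces a journal entry to its debit and credit line counts; `JeShape` is a hypothetical type, and reading "2-line JE" as exactly one debit plus one credit is an assumption.

```rust
/// One journal entry reduced to its debit and credit line counts.
#[derive(Clone, Copy)]
struct JeShape {
    debits: usize,
    credits: usize,
}

/// Cartesian method: every debit line is linked to every credit line,
/// so each entry contributes debits * credits edges.
fn cartesian_edges(jes: &[JeShape]) -> usize {
    jes.iter().map(|je| je.debits * je.credits).sum()
}

/// Method A: one edge per 2-line entry (one debit, one credit);
/// multi-line entries are skipped entirely.
fn method_a_edges(jes: &[JeShape]) -> usize {
    jes.iter()
        .filter(|je| je.debits == 1 && je.credits == 1)
        .count()
}
```

For a batch containing one 2-line entry and one 50-debit / 50-credit consolidation, the Cartesian count is 2 501 edges while Method A emits a single edge, so Method A can never exceed the number of entries.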
Legal guidance: no real company names in the codebase, even as test
fixtures. The group-audit consolidation fixture was named after a real
multinational ("Mini-Nestlé", entity codes NESTLE_SA / NESTLE_USA /
NESTLE_DE / NESTLE_BR / NESTLE_JV / NESTLE_EU / NESTLE_FR). The fixture
is entirely synthetic — hand-built TBs and a fabricated group structure
— but the *name* is a real company and must go.

Rename throughout (58 files): `NESTLE_*` → `ACME_*`, `Mini-Nestlé` →
`Mini-Acme`, `mini_nestle` → `mini_acme`. Physical renames via `git mv`:
  - configs/examples/group/mini_nestle.yaml → mini_acme.yaml
  - tests/fixtures/mini_nestle{,_minimal}.yaml → mini_acme*.yaml
  - tests/golden/mini_nestle_manifest.json → mini_acme_manifest.json
  - tests/golden/mini_nestle/ → tests/golden/mini_acme/

The manifest golden was regenerated: `entity_seed` hashes and `ICR_*`
intercompany-relationship IDs are FNV-derived from the entity codes, so
renaming the codes deterministically changes those derived values. Diff
confirmed to be exactly the rename + its hash consequences, nothing else.

Verified: datasynth-group lib (107) + manifest_golden + golden_archive +
manifest_builder + config_parse + aggregate_e2e + standalone_e2e +
balance_property + ic_matcher all pass; datasynth-cli group_cli (8) and
datasynth-runtime output_root_routing (3) pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
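Why renaming entity codes forces a golden regeneration: any value hashed from the code changes deterministically with it. The commit only says the values are FNV-derived, so the sketch below assumes 64-bit FNV-1a with the standard published constants; the variant actually used, the `entity_seed` helper, and the example codes are all assumptions.

```rust
/// Standard 64-bit FNV offset basis and prime.
const FNV_OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
const FNV_PRIME: u64 = 0x0000_0100_0000_01b3;

/// 64-bit FNV-1a: XOR each byte into the hash, then multiply by the prime.
fn fnv1a_64(bytes: &[u8]) -> u64 {
    bytes.iter().fold(FNV_OFFSET_BASIS, |hash, &b| {
        (hash ^ u64::from(b)).wrapping_mul(FNV_PRIME)
    })
}

/// Hypothetical seed derivation: hash the entity code's UTF-8 bytes.
fn entity_seed(code: &str) -> u64 {
    fnv1a_64(code.as_bytes())
}
```

The same code always yields the same seed, and a renamed code such as `ACME_SA` in place of a hypothetical `OLDCO_SA` yields a different one, which is the only kind of change the regenerated golden should contain.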
Legal guidance tightened: refer to the corpus only as "corpus" / "corpus
data" — never with a "real" / "real-world" / "real client" qualifier that
hints it is real client data, and never name a client.

Scrub across the repo (current-facing + historical docs):
- "real-corpus" / "the real corpus" / "real corpus" / "held-out real
  corpus" / "real-world (enterprise) (GL) corpus" → "corpus" forms (~340
  occurrences). Code doc-comments, CHANGELOG, README, baseline SUMMARYs,
  superpowers specs/plans.
- Corpus-source references "real client COA/TB/GL exports/cubes" (the
  fingerprint extractor's input) → "corpus …".
- `text_taxonomy` legal-entity-suffix test fixtures used real company
  names (Nespresso, Roche, Sintetica) — replaced with fictional
  placeholders (Acme / Globex / Initech). The test only checks the
  `Word + S.A./B.V.` shape, so the names are arbitrary.
- Integration-spec product concept "real client data" (the auditor's own
  data, distinct from our corpus) → "client data" — drops the "real"
  qualifier while keeping the real-vs-synthetic meaning.

Deliberately left unchanged (legitimate, not corpus references):
- Generic fingerprint-capability docs ("extract from real data") — a
  general feature description, not our corpus.
- EU AI Act Article 10 compliance statements ("no real data used") — a
  privacy *feature* claim.
- Internal "real GL balance" in audit generators — the simulation's own
  GL, not external data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`NESTLE_*` → `ACME_*` shortened several call-argument lines below
rustfmt's wrap threshold, so the formatter re-collapses multi-line
expressions onto single lines. Pure formatting, no logic change.
Caught by the Format CI check on the rename commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mivertowski mivertowski changed the title from "SP6 — Real-corpus text taxonomy + PII-safe placeholder grammar" to "SP6 — Corpus-grounded text taxonomy + PII-safe placeholder grammar" May 20, 2026
The v5.27 default flip (JeNetworkMethod Cartesian → A) broke this e2e
test on the OS-matrix Test jobs (macOS/Windows). The test's assertion 2
validates the Cartesian edge formula — edge count = Σ (n_debit × n_credit)
per JE — but it called `write_all_output`, which now defaults to Method A
(one edge per 2-line JE, skip multi-line). Actual < expected → fail.

Switch to `write_all_output_with_layout(..., JeNetworkMethod::Cartesian)`
so the test keeps exercising the Cartesian math it was written for. This
restores the exact path the test passed on for its entire pre-flip
history. Method A's bounded edge count is the new default for generation;
this test deliberately opts into Cartesian to validate that code path.

Not run locally — the full-orchestrator run + Cartesian edge product
exhausts this dev box's memory. Validated by CI on the OS-matrix runners
(14-16 GB), which is where the regression surfaced.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mivertowski
Owner Author

🔴 Merge held. A review of the shipped text-taxonomy bundles surfaced ~285 templates carrying real Firstname Lastname person names that escaped placeholder substitution (the audit gate only flagged Initial. Surname/title/patient shapes, not plain two-word names). Fixing via a given-name gazetteer tokenizer in Phase A + residual_pii_scan, then regenerating + re-auditing all 5 bundles. Will mark ready once the audit is clean against the new detector.

mivertowski and others added 5 commits May 20, 2026 16:34
A review of the shipped text-taxonomy bundles found 285 distinct
templates carrying real `Firstname Lastname` / `Lastname Firstname`
person names that escaped placeholder substitution (e.g. consulting
fees, rent paid to individuals, debtor late-fees). The SP6 PII model
missed this class on all three layers:
  - Phase A (structural) handles dates/digits, not names;
  - the Phase-B denylist is occurrence-thresholded, so the long tail of
    names (each appearing a few times) was never added;
  - `residual_pii_scan` / `bundle_pii_audit` only flagged
    `Initial. Surname`, `Surname Initial.`, titles, patient and star
    records — there was NO plain two-word-name detector, so the audit
    was green while the leak shipped.

Fix — a given-name gazetteer anchor:
  - resources/given_names.txt: 959 generic given names (country-pack
    union + Swiss/DE/FR/IT supplement). NOT PII — a name dictionary like
    city names; covers the Swiss-corpus long tail.
  - Phase A rule 3.5: a capitalised run (>=2 tokens) containing a known
    given name collapses to `{person}`. German capitalises all nouns, so
    case can't separate a surname from a description noun — we redact the
    whole run (safe over-redaction) rather than risk leaking the surname.
    Lowercase words / punctuation terminate the run, preserving
    surrounding description.
  - residual_pii_scan gains a `given_name` check using the same anchor,
    so bundle_pii_audit now catches this class and can't regress.

Surname-only references with no given name (e.g. "Darlehen <Surname>")
still rely on the denylist — a separate, smaller class.

Tests: 3 new (scan flags name runs; tokenize collapses runs + stays
PII-clean; no false positives on bank names / generic terms /
placeholdered text), using fictional surnames + generic given names so
no real corpus name enters the test source. 21/21 text_taxonomy tests
pass. Bundle regeneration + re-audit follow in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
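The Phase A run-collapse rule from the commit above (a capitalised run of at least two tokens containing a known given name becomes `{person}`) can be sketched as follows. This is a simplified illustration, not the `text_taxonomy` implementation: tokens are split on whitespace only, any token containing a non-letter ends the run, whitespace is normalised on output, and both function names are hypothetical.

```rust
use std::collections::HashSet;

/// A run token: starts with an uppercase letter and contains only letters.
fn is_capitalised_word(token: &str) -> bool {
    let mut chars = token.chars();
    match chars.next() {
        Some(first) => first.is_uppercase() && chars.all(char::is_alphabetic),
        None => false,
    }
}

/// Collapses each capitalised run of >= 2 tokens that contains a known
/// given name into `{person}`. `given_names` must hold lowercase entries.
fn collapse_name_runs(text: &str, given_names: &HashSet<String>) -> String {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        // Extend j to the end of the capitalised run starting at i.
        let mut j = i;
        while j < tokens.len() && is_capitalised_word(tokens[j]) {
            j += 1;
        }
        let run = &tokens[i..j];
        let has_given_name = run.iter().any(|t| given_names.contains(&t.to_lowercase()));
        if run.len() >= 2 && has_given_name {
            // Redact the whole run: case cannot separate surname from noun.
            out.push("{person}".to_string());
            i = j;
        } else if run.is_empty() {
            // Lowercase word or punctuation-bearing token: copy through.
            out.push(tokens[i].to_string());
            i += 1;
        } else {
            out.extend(run.iter().map(|t| t.to_string()));
            i = j;
        }
    }
    out.join(" ")
}
```

With a gazetteer containing `hans`, the input `Miete Hans Muster bezahlt` becomes `{person} bezahlt`: the noun `Miete` is swallowed along with the name, which is the safe over-redaction the commit describes.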
The given-name regen surfaced a gap in the prior commit: tokenize's
patient (Rule 1) and star-person (Rule 2) branches `return`ed early, so
a `Firstname Lastname` in the SUFFIX of such a record (e.g. a trailing
person name after a `G:` patient marker) never reached the name-run
collapse — it leaked to the bundle and the `given_name` scan then
correctly rejected it at extraction (regen halt on a `Robert Hoe`-style
line).

Restructure tokenize so Rules 1/2 produce a staged string that falls
through to the common tail (street + name-run + structural), so every
path gets name + structural redaction. Regression test added with a
fictional trailing name after patient/star records.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterative regen surfaced two more leak vectors the plain
whitespace+lowercase lookup missed:

1. The corpus strips umlauts entirely (Jürg→Jrg, Löhne→Lhne), so a
   gazetteer entry with proper umlauts never matched the corpus form.
   `normalize()` now drops ä/ö/ü and maps accented latin to its base
   letter (Régis→Regis), applied to both the gazetteer (at load) and the
   token at lookup.
2. Compound / prefix-joined tokens (Hans-Rudolf, ESD-Roger, CS/Rolf)
   are now split on `-` `/` `.` `,` `_` and matched part-by-part, so a
   known given name inside a joined token is recognized.

Also expands the gazetteer 1025→1096 with the confirmed stragglers
(Rolf/Walter/Roman from the prior pass + Jörg/Guido/Nina/Jana/Inge/…)
and an international supplement (Balkan/Turkish/Iberian/Italian given
names — the corpus carries immigrant names like Dejan/Dursun/Ilija).

Tests: +1 (compound + umlaut-stripped forms collapse + stay clean).
23/23 text_taxonomy tests pass. Re-regen + final probe next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
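The two lookup changes in the commit above, umlaut-dropping normalisation and part-by-part matching of joined tokens, might be sketched like this. It is an illustration only: the accent table is a partial, assumed subset, and `normalize`, `load_gazetteer`, and `contains_given_name` are written here from the description rather than copied from the codebase.

```rust
use std::collections::HashSet;

/// Lowercases, drops ä/ö/ü outright (mirroring the corpus: Jürg -> Jrg),
/// and maps a few accented Latin letters to their base letter.
fn normalize(token: &str) -> String {
    token
        .to_lowercase()
        .chars()
        .filter_map(|c| match c {
            'ä' | 'ö' | 'ü' => None,
            'à' | 'á' | 'â' => Some('a'),
            'é' | 'è' | 'ê' | 'ë' => Some('e'),
            'í' | 'ì' | 'î' | 'ï' => Some('i'),
            'ó' | 'ò' | 'ô' => Some('o'),
            'ú' | 'ù' | 'û' => Some('u'),
            'ç' => Some('c'),
            'ñ' => Some('n'),
            other => Some(other),
        })
        .collect()
}

/// Normalises every gazetteer entry once at load time.
fn load_gazetteer(names: &[&str]) -> HashSet<String> {
    names.iter().map(|n| normalize(n)).collect()
}

/// Splits a token on `-` `/` `.` `,` `_` and checks each part, so a
/// given name inside a joined token is still recognised.
fn contains_given_name(token: &str, gazetteer: &HashSet<String>) -> bool {
    token
        .split(|c: char| matches!(c, '-' | '/' | '.' | ',' | '_'))
        .filter(|part| !part.is_empty())
        .any(|part| gazetteer.contains(&normalize(part)))
}
```

Applying the same `normalize` to both sides is what makes a gazetteer entry `Jürg` match the corpus form `Jrg`.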
Regenerated all 5 industry bundles with the given-name gazetteer +
umlaut/hyphen-robust tokenizer. Closes the 285-template real-name leak.

Verification on the regenerated bundles:
- bundle_pii_audit (now with the `given_name` check): PASS on all 5.
- broad common-given-name probe: 285 → 0 (Marc/Hans/Rolf/Walter/Roman/
  Jörg/… all collapsed, incl. umlaut-stripped + hyphen-compound forms).
- tight full-name-shape survivors after a generic filter are all
  institutions / company counterparties (Eli Lilly, Universitätsspital
  Basel, Wolters Kluwer) — the keep-category, same as the bank names.

Over-redaction cost (German noun-capitalisation forces collapsing a
capitalised run that contains a name): of 34,181 templates, 2,328
(6.8%) carry a `{person}` placeholder and only 190 (0.6%) collapsed to
a bare `{person}` — 99.4% retain their description content.
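
For reference, the percentages follow directly from the three counts quoted above; the helper name `pct` is only for illustration.

```rust
/// Share of `part` in `total`, as a percentage.
fn pct(part: u32, total: u32) -> f64 {
    f64::from(part) * 100.0 / f64::from(total)
}

/// Format the three figures from the commit message to one decimal place.
fn over_redaction_summary() -> (String, String, String) {
    let total = 34_181; // templates in the regenerated bundles
    let with_person = 2_328; // templates carrying a `{person}` placeholder
    let bare_person = 190; // templates that collapsed to a bare `{person}`
    (
        format!("{:.1}", pct(with_person, total)),
        format!("{:.1}", pct(bare_person, total)),
        format!("{:.1}", 100.0 - pct(bare_person, total)),
    )
}
```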

Residual risk: a rare given name outside the 1,096-entry gazetteer in a
full-name run could still pass (both tokenize and the audit share the
gazetteer). The Phase-B denylist is the surname backstop. A comprehensive
given-name lexicon is a possible future hardening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mivertowski mivertowski marked this pull request as ready for review May 20, 2026 16:41
@mivertowski
Owner Author

Leak closed — merge un-held. Root-caused the Firstname Lastname exposure (285 templates) and fixed via a 1,096-entry given-name gazetteer anchoring a Phase-A run-collapse rule + a residual_pii_scan given_name check (robust to the corpus's umlaut-stripping and hyphen/slash-joined tokens). All 5 bundles regenerated; bundle_pii_audit passes with the new check; broad common-name probe 285 → 0; over-redaction only 0.6% of templates. Residual: a rare given name outside the gazetteer (denylist is the surname backstop) — documented in the CHANGELOG. Commits f62f8f0..530e86d.

mivertowski and others added 4 commits May 20, 2026 18:46
Pure formatting (RE_NAME_RUN one-liner + test array wrapping). Caught by
the Format CI check.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…are names)

The new `given_name` residual-PII pattern is a TEMPLATE-scan signal for
un-tokenized corpus names (gated by bundle_pii_audit at the template
level). But two tests scan generated OUTPUT, where `{person}`/`{patient}`
placeholders have been filled with synthetic names that are name-shaped
by design — so `given_name` false-positives on legitimate output:

- je_generator::synthetic_patient_pool_entries_pass_residual_scan
- sp6_text_taxonomy_smoke (header_text / line_text scan)

Both now filter out `given_name` hits and keep asserting the structural
shapes (initial_surname, patient_record, title, …). The template-level
audit remains the authoritative gate for un-tokenized corpus names.
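
A rough sketch of the filtering these two tests now apply. The `PiiHit` struct and its fields are hypothetical stand-ins; the real return type of `residual_pii_scan` is not shown in this PR text.

```rust
/// Hypothetical shape of one scan hit: which pattern fired and on what text.
#[derive(Debug, Clone, PartialEq)]
struct PiiHit {
    pattern: String,
    matched: String,
}

/// Drop `given_name` hits, which synthetic fills trigger by design, and keep
/// the structural shapes (initial_surname, patient_record, title, ...).
fn structural_hits(hits: Vec<PiiHit>) -> Vec<PiiHit> {
    hits.into_iter()
        .filter(|h| h.pattern != "given_name")
        .collect()
}
```

A test on generated output would then assert that `structural_hits(...)` is empty, while `given_name` stays enforced at the template level.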

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The macOS/Windows OS-matrix surfaced a synthetic `{company}` fill
("S.Merchandise Holdings LLC") tripping `initial_surname` in generated
output — the same false-positive class as `given_name` on person fills.
Scanning FILLED output with `residual_pii_scan` is conceptually wrong:
the scan detects PII *shapes*, which synthetic fills legitimately match.

Drop the output residual-PII assertion from the smoke; it now checks the
SP6 wiring it's meant to (no leftover `{…}` placeholders + line_text
coverage). The authoritative PII gate is `bundle_pii_audit`, which scans
the committed TEMPLATES (where a PII shape = an un-tokenized corpus name).
Removes the now-unused PlaceholderGrammar import.
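
The remaining wiring check could look roughly like this. It is a sketch under one assumption: that placeholders are `{name}` with a lowercase ASCII / underscore name, as in the `{person}`, `{patient}` and `{company}` examples; `leftover_placeholders` is a hypothetical helper, not a function from the crate.

```rust
/// Return every `{name}` span still present in filled output.
/// Assumed grammar: `name` is non-empty, lowercase ASCII letters or `_`.
fn leftover_placeholders(text: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut rest = text;
    while let Some(open) = rest.find('{') {
        let after_open = &rest[open + 1..];
        match after_open.find('}') {
            Some(close) => {
                let name = &after_open[..close];
                let looks_like_placeholder = !name.is_empty()
                    && name.chars().all(|c| c.is_ascii_lowercase() || c == '_');
                if looks_like_placeholder {
                    // Span runs from `{` through the matching `}` inclusive.
                    found.push(&rest[open..open + close + 2]);
                    rest = &after_open[close + 1..];
                } else {
                    // Not a placeholder; resume just after this `{`.
                    rest = after_open;
                }
            }
            None => break, // unmatched `{`: nothing further to report
        }
    }
    found
}
```

Unlike a PII-shape scan, this check cannot false-positive on a synthetic fill such as "S.Merchandise Holdings LLC", since it only looks for unfilled grammar tokens.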

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Windows was getting cancelled at exactly 60m on the full `--all-targets`
integration suite — a timeout, not a test failure (macOS ran the
identical suite green in ~40 min; Windows runners are ~50% slower for
the orchestrator + CLI-subprocess tests, and the larger regenerated SP6
bundles added load). 90 min gives Windows headroom; Linux is unaffected
(`--lib --bins`, ~10 min) and macOS has margin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@mivertowski mivertowski merged commit cfdbb08 into main May 20, 2026
17 checks passed
@mivertowski mivertowski deleted the sp6-text-taxonomy branch May 20, 2026 19:43