SP6 — Corpus-grounded text taxonomy + PII-safe placeholder grammar#202
Conversation
SP1 of the broader Behavioral-Fidelity initiative (SP1 evaluation → SP2 prior extraction → SP3 entity-aware generation → SP4 showcase release). Adds a new datasynth-eval/behavioral_fidelity/ submodule with the Sajja (2026) P1–P4 metrics adapted to GL semantics: Source as primary entity, Trading Partner as secondary, EntryDate as day-resolution timestamp, plus synth-only intraday metrics and a 10-rule canonical GL velocity rule set. Anchors every metric to a 50/50-split noise floor via the degradation-ratio normalizer, ships a `datasynth-data behavioral score` CLI with CI-gate semantics, and stays clean of real client data in the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23-task TDD plan covering: scaffolding (T1) → types + entity profile + math helpers (T2-T5) → parquet/csv loader (T6) → P1 IETD (T7) → P2 active lifetime + burst length + JE-line-burst (T8-T10) → P3 fan-out + clustering + triangles (T11-T12) → P4 canonical R1-R10 + trigger gap (T13) → degradation ratio + 50/50 split (T14) → intraday metrics (T15) → BehavioralFidelityReport + JSON/MD/CSV writers (T16-T17) → compute_report orchestrator (T18) → datasynth-data behavioral score CLI (T19) → integration smoke + noise-floor sanity tests (T20-T21) → CI wiring (T22) → user docs + README + CLAUDE.md (T23). Each task has bite-sized TDD steps with complete code blocks. Sequenced into 10 parallel-execution waves for subagent dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds the new submodule directory + Cargo.toml dependencies (arrow, parquet, petgraph) under datasynth-eval. Empty module stubs land in subsequent tasks. Module hierarchy mirrors the spec at docs/superpowers/specs/2026-05-11-sp1-behavioral-fidelity-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds Record (one JE line, normalised), EntityProfile, RuleSet + VelocityRuleSpec enum, GateThresholds, and BehavioralFidelityConfig::gl_default() returning the source-tp profile with seed 42 and DR-gate thresholds 2.0 / 1.5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
gl_source_tp() returns the Source + Trading Partner profile. Real-corpus alias map preserves the typo 'Tarding Partner' from the source data exports. Synthetic alias map points at DataSynth's journal_entries output column names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wasserstein-1 with fast equal-length path and quantile-grid fallback for unequal lengths. Pearson lag-1 autocorrelation returning None for too-short or zero-variance series. Empirical percentile, day-difference and weekend predicates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Auto-detects real-corpus vs synthetic schema (typo 'Tarding Partner' is the giveaway), maps to canonical column names, parses dates as either YYYY-MM-DD strings or Date32, supports millisecond/microsecond/nanosecond/RFC3339 timestamps for CreatedAt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
compute_p1 computes pooled inter-event-time W1 and pooled within-entity lag-1 autocorrelation gap at day resolution. Includes source_of / trading_partner_of projector helpers for the two entity profiles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… gap Builds the entity-projection graph via petgraph::UnGraph; clustering coefficient via neighbour-pair triangle enumeration; triangle log-ratio gap = |log((t_real+1)/(t_syn+1))|.
Canonical GL velocity rules: R1 count, R2 distinct accounts, R3 sum>p90, R4 dormant-account wake, R5 distinct TPs, R6 amount spike, R7 off-hours, R8 post-close, R9 round-dollar share, R10 backdating. Per-rule trigger-rate gap + composite mean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ormaliser Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…edian seconds, off-hours rate) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(BTreeMap-ordered) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eport Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…raday + gates Single entry point that produces a fully populated BehavioralFidelityReport with per-entity metrics (primary + optional secondary), baseline values from a deterministic 50/50 split, composite BF score (equal-weighted mean of all sub-metric DRs), and a gate result driven by GateThresholds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires BehavioralScoreArgs to behavioral_fidelity::compute_report_from_paths, writes report.json + report.md + metrics.csv to --out, exits 0/2 based on gate result. SP1 ships the gl-source-tp profile only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…-synthetic data Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes inner-doc comment in behavioral_smoke.rs so include! works from the noise-floor test module. When real equals syn, all per-entity DRs must stay below 2.0 (numerator ≈ 0, denominator = split baseline > 0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds two named integration test invocations to the eval workflow. Both run with --test-threads=4 per workspace policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
docs/behavioral-fidelity.md documents the P1-P4 metrics, the R1-R10 velocity rule set, the day-granularity limitation, and CLI usage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…oise floor First measurement of DataSynth against the real-corpus Swiss-healthcare GL data (JE_3, 1.4M lines, 2 years). Composite BF 59× lands mid-pack vs Sajja's tabular generators (CTGAN 32, TVAE 24, ARGN 36, Copula 39 on their dataset). Key finding: P1 autocorrelation DR = 0.90 — at the real-data noise floor for the primary Source entity. Matches Sajja Proposition 2: row-independent generators are structurally incapable of producing positive within-entity IET autocorrelation (their best is TVAE at 5.9×). DataSynth's rule-based architecture preserves the burst-regularity fingerprint those generators destroy. Where the DR mass concentrates (SP2/SP3 input): - P2 JE-line-burst W1: 452.80× — biggest single fix target (lines-per-JE distribution) - P3 triangle/clustering: 36-345× — bipartite Source-attribute graph structure - P1 IETD W1: 60× — within-Source posting cadence - P2 active lifetime: 23× — Sources active for different windows - TradingPartner column gap: synthetic journal_entries has no trading_partner → degenerate TP scores Also fixes synthetic_aliases to match DataSynth's actual journal_entries.csv schema (source / gl_account / document_id / line_number / posting_date / document_date / local_amount) — earlier aliases were guesses that didn't match the real output. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SP2 of the broader Behavioral-Fidelity initiative (SP1 ✅ shipped → SP2 here → SP3 next → SP4 last). Mines the 45-client real-corpus for the five priors the SP1 baseline identified as concrete fix targets (composite BF 59.0× on JE_3): source-mix, per-Source IET, lines-per-JE (the 452.8× target), active lifetime, bipartite fan-out — plus a bonus posting-lag prior. Extends datasynth-fingerprint with new models/behavioral.rs + extraction/behavioral_extractor.rs + aggregation/industry_aggregator.rs. Bumps .dsf schema_version with an additive optional `behavioral` field (old readers still work). Reuses datasynth-eval's Record + loader + entity_profile so schema-mapping stays in one place. Five industry bundles ship under crates/datasynth-generators/resources/priors/: health, life_sciences, pharma, power_utilities, technology (industries with ≥3 clients in the 45-client corpus; smaller industries deferred). Three new CLI subcommands: `fingerprint extract --behavioral`, `fingerprint aggregate-industry`, `fingerprint inspect --behavioral`. trading_partner column plumbing in journal_entries.csv is explicitly added to the SP3 target list per user direction; SP2 still extracts the TP fan-out prior from real data. ~2 weeks, 6 phases, mirrors SP1 dispatch pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
23-task TDD plan covering: scaffolding + BehavioralPriors model types (T1) → LineCountHistogram helpers (T2) → 5 prior extractors + bonus posting-lag (T3-T8) → extract_behavioral_priors orchestrator (T9) → aggregation scaffolding + 4 aggregators (T10-T14) → aggregate_industry_priors orchestrator (T15) → 3 CLI surfaces: fingerprint extract --behavioral / aggregate-industry / info --behavioral (T16-T18) → integration smoke + backward-compat test (T19-T20) → CI workflow (T21) → regenerate script (T22) → docs (T23). Each task has bite-sized TDD steps with complete code blocks. Final committed-bundle generation happens manually post-T23 via scripts/regenerate-industry-priors.sh against the real corpus (not a plan task). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds BehavioralPriors struct hierarchy (SourceMix, PerSourceIet, LinesPerJe, ActiveLifetime, Fanout, PostingLag) with empty-default round-trip test, an optional behavioral field on Fingerprint (additive via serde(default)), and stub modules for extraction/aggregation. Individual prior extractors land in tasks 2-9; aggregators in 10-15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…, unwrap Option wrapper T1 subagent wrapped IetSummary.empirical_cdf_days / LagSummary.empirical_cdf_days in Option<EmpiricalCdf> because EmpiricalCdf lacked Default + PartialEq. Restores the spec's plain-EmpiricalCdf shape by deriving both on EmpiricalCdf — one-line change — and restoring PartialEq derives on the containing structs (BehavioralPriors, PerSourceIetPrior, IetSummary, PostingLagPrior, LagSummary). This means downstream T2-T9 extractor code can use `empirical_cdf_days: EmpiricalCdf::from_sorted_values(...)` directly without the Option wrapper, matching the plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…n helpers + bucket grids Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The integration smoke `extract_aggregate_inspect_roundtrip` was failing on CI's OS-matrix Test jobs (ubuntu/macos/windows). Root cause: SP3.8b set `DEFAULT_MIN_SOURCE_OBSERVATIONS = 1000` to drop the long tail of low-volume sources from `source_mix`. The test spreads draws across 6 codes, so at the prior n=3000 each code only got ~500 observations and all six were filtered → `source_mix.probabilities.is_empty()`. The earlier F2 fix (9d4caa5) caught the *unit* test that uses the same threshold but missed this integration test. Bump n=3000 → 9000 (~1500 per code, comfortably above 1000 at 2σ across seed variance) and update the `n_rows_aggregated` assertion accordingly. Pre-existing failure latent since fcacb0b, surfaced now by CI's OS-matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The codebase had `#![deny(clippy::unwrap_used)]` crate-wide on 11 lib.rs
files and worked around it with 672 per-mod `#[allow(clippy::unwrap_used)]`
annotations on test modules. CI's clippy step ran `cargo clippy
--all-targets` against the workspace root virtual manifest — which only
exercises the bench harness, not the members — so this hybrid policy was
never actually enforced end-to-end. Locally `cargo clippy --workspace
--tests` produced 66+ errors.
Cleanup:
- Switch each affected lib.rs from `#![deny(clippy::unwrap_used)]` to
`#![cfg_attr(not(test), deny(clippy::unwrap_used))]`. Production-code
policy unchanged (strict deny); tests get the natural latitude they
need (unwrap is the intended assertion mechanism in test fixtures).
Matches the industry-standard Rust pattern.
- Remove the now-redundant per-mod `#[allow(clippy::unwrap_used)]`
annotations from 672 test modules across all crates. Smaller diff,
single canonical policy expressed in lib.rs.
- Fix the 4 remaining non-unwrap clippy errors that surfaced once the
workspace --tests scope was actually exercised:
* needless_borrow_for_generic_args in sp3_priors_smoke.rs:530
* field_reassign_with_default in sp3_priors_smoke.rs:566
* unnecessary_unwrap in sp3_priors_smoke.rs:1448 (use `if let Some`)
* useless_vec in velocity_rules.rs:234
- Tighten CI workflow: `cargo clippy --all-targets` →
`cargo clippy --workspace --all-targets`. Now the policy actually
fires across every member crate's lib + tests + benches + examples.
Verified: `cargo clippy --workspace --all-targets -- -D warnings` clean;
all workspace lib tests pass (4761+ tests across 17 crates); sp6
integration smokes + bundle audit + behavioral priors smoke all green.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Add entries for artifacts that accumulated during SP6 development and that don't belong in the repo: - `.claude/` — per-project Claude Code state (settings, transcripts) - `.playwright-mcp/` — Playwright MCP capture artifacts - `hf_*_staging/` / `hf_*_output/` — Hugging Face dataset staging/output - `spaces/*/preview-live.png` — browser-captured preview screenshots All regenerable; none are source. Cuts the recurring noise in `git status`. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
All three TODO comments were really "this is a known scope limitation, here's why we chose to limit scope, here's the workaround" — not action items. They already carried explicit "for X use-case the approximation is acceptable" caveats. Switching `TODO:` → `LIMITATION:` preserves the content and rationale while removing the false to-do signal that shows up in grep sweeps. - copula::incomplete_beta — continued-fraction convergence on extreme tails (intentional for audit-simulation copula sampling) - holidays::approximate_chinese_new_year — lunar approximation (intentional for activity-pattern simulation, off by ±1-2 days) - period_close::ThirteenPeriod — 28-day fiscal period (intentional for simulation; not appropriate for production period-close) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`cargo doc --workspace --no-deps` was emitting 80 warnings split into two categories: - Math notation rustdoc misreads as links: `[0,1]`, `[i]`, `[0]`, `[:100]`, `[DE]`. Fix: escape brackets — `[0,1]` → `\[0,1\]`. - Real broken refs to renamed/moved methods/types (`[generate]`, `[run_shard]`, `[CountryPack]`, `[ConditionalSampler::new]`, etc.). Fix: drop the link form, keep the code-formatted name — ``[`NAME`]`` → `` `NAME` ``. Both are pure-cosmetic fixes. Rendered docs look identical to a reader (math notation is unchanged; broken links were rendering as raw text anyway). Warnings now zero across the workspace. Unblocks adding `-D warnings` to `cargo doc` in CI in a future hardening pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`bundled_priors_path()` returns a `PathBuf` whose string form uses the
host's native separator. On Windows the previous assertion
`s.contains("priors/industry_priors_health.dsf")` always failed because
the rendered path uses `\`. Split into two substring checks
(`contains("priors")` AND `contains("industry_priors_health.dsf")`) so
the test holds on both platforms.
Pre-existing failure on `Test (windows-latest)` since the bundle
relocation; surfaced now by CI's OS-matrix after we tightened other
checks.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
cargo-machete (validated by grep + cfg-gated probe) flagged ~33
dependencies declared in Cargo.toml but never actually used. After
audit, all 31 below have ZERO source uses (no `use X`, no `X::`, no
cfg-gated import, no `#[derive(X)]`) and were just dead declarations:
datasynth-ocpm -- thiserror
datasynth-eval -- datasynth-config, datasynth-generators, rayon
datasynth-fingerprint-- anyhow, datasynth-config
datasynth-banking -- rand_distr, thiserror
datasynth-runtime -- crossbeam-channel, tokio
datasynth-generators -- rayon
datasynth-output -- thiserror, tokio, uuid, crossbeam-channel
datasynth-server -- anyhow, async-trait, datasynth-generators,
datasynth-output, subtle, thiserror
datasynth-cli -- datasynth-graph, indicatif, itoa, ryu, serde,
tokio
datasynth-test-utils -- rand, rand_chacha, serde_json, tempfile
datasynth-group -- datasynth-audit-fsm
Notable wins: dropping `datasynth-generators` + `datasynth-output` from
`datasynth-server` and `datasynth-generators` + `datasynth-config` from
`datasynth-eval` removes large transitive build paths.
Three deps that machete reported but ARE used (via feature wiring rather
than direct `use`) are kept and pinned in `[package.metadata.cargo-machete]
ignored` blocks with explanatory comments:
- datasynth-core/safetensors (public optional dep, neural feature)
- datasynth-cli/reqwest (public optional dep, streaming feature)
- fuzz/arbitrary (libfuzzer harness macro consumption)
- attic/datasynth-graph-export/datasynth-banking (excluded crate)
Verified: cargo check --workspace --tests green; cargo machete clean.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Runs `cargo machete` on every PR. Currently clean across the workspace; new unused dep declarations will be caught at PR time. Pinned ignore entries (safetensors, reqwest, arbitrary, datasynth-banking-in-attic) document the legitimate exceptions inline. Mirrors the shape of the existing Security Audit job (cargo-deny + cargo-audit) — same `cargo install --locked` pattern, same exclusion of the private graph-export sibling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The Cartesian default produced O(debits × credits) edges per JE. A 50-debit / 50-credit period-close consolidation alone yielded 2 500 edges; a typical HF-scale 1 M-line config blew up to 200 M+ edges (noted in the original doc-comment). On 14–16 GB CI runners, the small- complexity CLI smoke `test_generate_from_config_file` hit a 20 GB allocation request inside `RawVec::grow_one` and OOM'd on Windows (immediate `memory allocation failed`) / hung-then-killed on macOS (slower allocator path). Surface evidence: main alternates pass/fail on this same test for exactly this reason — the OS-matrix Test job is the only one running `--all-targets` (Linux skips integration tests for cumulative-memory reasons per workflow comment). Flip the default to Method A: one edge per 2-line JE, skip multi-line. Bounded edge count (≈ 60 % of entries per Ivertowski 2024), exactness- preserving (confidence = 1.0 on every emitted edge), the recommended shape for published reference datasets. Users relying on the previous Cartesian shape can opt back in with `je_network.method: cartesian` explicitly — `JeNetworkConfig` already plumbs the choice through. Local verification: `test_generate_from_config_file` previously ran for 629 s before being interrupted on macOS / OOM'd at 361 s on Windows. With Method A default it finishes in **47 s** locally. All 4 `je_network` unit tests (including the explicit `Cartesian` cases) keep passing — only the *default* changed. CHANGELOG: noted as a v5.27 breaking change with migration steps. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Legal guidance: no real company names in the codebase, even as test
fixtures. The group-audit consolidation fixture was named after a real
multinational ("Mini-Nestlé", entity codes NESTLE_SA / NESTLE_USA /
NESTLE_DE / NESTLE_BR / NESTLE_JV / NESTLE_EU / NESTLE_FR). The fixture
is entirely synthetic — hand-built TBs and a fabricated group structure
— but the *name* is a real company and must go.
Rename throughout (58 files): `NESTLE_*` → `ACME_*`, `Mini-Nestlé` →
`Mini-Acme`, `mini_nestle` → `mini_acme`. Physical renames via `git mv`:
- configs/examples/group/mini_nestle.yaml → mini_acme.yaml
- tests/fixtures/mini_nestle{,_minimal}.yaml → mini_acme*.yaml
- tests/golden/mini_nestle_manifest.json → mini_acme_manifest.json
- tests/golden/mini_nestle/ → tests/golden/mini_acme/
The manifest golden was regenerated: `entity_seed` hashes and `ICR_*`
intercompany-relationship IDs are FNV-derived from the entity codes, so
renaming the codes deterministically changes those derived values. Diff
confirmed to be exactly the rename + its hash consequences, nothing else.
Verified: datasynth-group lib (107) + manifest_golden + golden_archive +
manifest_builder + config_parse + aggregate_e2e + standalone_e2e +
balance_property + ic_matcher all pass; datasynth-cli group_cli (8) and
datasynth-runtime output_root_routing (3) pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Legal guidance tightened: refer to the corpus only as "corpus" / "corpus
data" — never with a "real" / "real-world" / "real client" qualifier that
hints it is real client data, and never name a client.
Scrub across the repo (current-facing + historical docs):
- "real-corpus" / "the real corpus" / "real corpus" / "held-out real
corpus" / "real-world (enterprise) (GL) corpus" → "corpus" forms (~340
occurrences). Code doc-comments, CHANGELOG, README, baseline SUMMARYs,
superpowers specs/plans.
- Corpus-source references "real client COA/TB/GL exports/cubes" (the
fingerprint extractor's input) → "corpus …".
- `text_taxonomy` legal-entity-suffix test fixtures used real company
names (Nespresso, Roche, Sintetica) — replaced with fictional
placeholders (Acme / Globex / Initech). The test only checks the
`Word + S.A./B.V.` shape, so the names are arbitrary.
- Integration-spec product concept "real client data" (the auditor's own
data, distinct from our corpus) → "client data" — drops the "real"
qualifier while keeping the real-vs-synthetic meaning.
Deliberately left unchanged (legitimate, not corpus references):
- Generic fingerprint-capability docs ("extract from real data") — a
general feature description, not our corpus.
- EU AI Act Article 10 compliance statements ("no real data used") — a
privacy *feature* claim.
- Internal "real GL balance" in audit generators — the simulation's own
GL, not external data.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
`NESTLE_*` → `ACME_*` shortened several call-argument lines below rustfmt's wrap threshold, so the formatter re-collapses multi-line expressions onto single lines. Pure formatting, no logic change. Caught by the Format CI check on the rename commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The v5.27 default flip (JeNetworkMethod Cartesian → A) broke this e2e test on the OS-matrix Test jobs (macOS/Windows). The test's assertion 2 validates the Cartesian edge formula — edge count = Σ (n_debit × n_credit) per JE — but it called `write_all_output`, which now defaults to Method A (one edge per 2-line JE, skip multi-line). Actual < expected → fail. Switch to `write_all_output_with_layout(..., JeNetworkMethod::Cartesian)` so the test keeps exercising the Cartesian math it was written for. This restores the exact path the test passed on for its entire pre-flip history. Method A's bounded edge count is the new default for generation; this test deliberately opts into Cartesian to validate that code path. Not run locally — the full-orchestrator run + Cartesian edge product exhausts this dev box's memory. Validated by CI on the OS-matrix runners (14-16 GB), which is where the regression surfaced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔴 Merge held. A review of the shipped text-taxonomy bundles surfaced ~285 templates carrying real |
A review of the shipped text-taxonomy bundles found 285 distinct
templates carrying real `Firstname Lastname` / `Lastname Firstname`
person names that escaped placeholder substitution (e.g. consulting
fees, rent paid to individuals, debtor late-fees). The SP6 PII model
missed this class on all three layers:
- Phase A (structural) handles dates/digits, not names;
- the Phase-B denylist is occurrence-thresholded, so the long tail of
names (each appearing a few times) was never added;
- `residual_pii_scan` / `bundle_pii_audit` only flagged
`Initial. Surname`, `Surname Initial.`, titles, patient and star
records — there was NO plain two-word-name detector, so the audit
was green while the leak shipped.
Fix — a given-name gazetteer anchor:
- resources/given_names.txt: 959 generic given names (country-pack
union + Swiss/DE/FR/IT supplement). NOT PII — a name dictionary like
city names; covers the Swiss-corpus long tail.
- Phase A rule 3.5: a capitalised run (>=2 tokens) containing a known
given name collapses to `{person}`. German capitalises all nouns, so
case can't separate a surname from a description noun — we redact the
whole run (safe over-redaction) rather than risk leaking the surname.
Lowercase words / punctuation terminate the run, preserving
surrounding description.
- residual_pii_scan gains a `given_name` check using the same anchor,
so bundle_pii_audit now catches this class and can't regress.
Surname-only references with no given name (e.g. "Darlehen <Surname>")
still rely on the denylist — a separate, smaller class.
Tests: 3 new (scan flags name runs; tokenize collapses runs + stays
PII-clean; no false positives on bank names / generic terms /
placeholdered text), using fictional surnames + generic given names so
no real corpus name enters the test source. 21/21 text_taxonomy tests
pass. Bundle regeneration + re-audit follow in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The given-name regen surfaced a gap in the prior commit: tokenize's patient (Rule 1) and star-person (Rule 2) branches `return`ed early, so a `Firstname Lastname` in the SUFFIX of such a record (e.g. a trailing person name after a `G:` patient marker) never reached the name-run collapse — it leaked to the bundle and the `given_name` scan then correctly rejected it at extraction (regen halt on a `Robert Hoe`-style line). Restructure tokenize so Rules 1/2 produce a staged string that falls through to the common tail (street + name-run + structural), so every path gets name + structural redaction. Regression test added with a fictional trailing name after patient/star records. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Iterative regen surfaced two more leak vectors the plain whitespace+lowercase lookup missed: 1. The corpus strips umlauts entirely (Jürg→Jrg, Löhne→Lhne), so a gazetteer entry with proper umlauts never matched the corpus form. `normalize()` now drops ä/ö/ü and maps accented latin to its base letter (Régis→Regis), applied to both the gazetteer (at load) and the token at lookup. 2. Compound / prefix-joined tokens (Hans-Rudolf, ESD-Roger, CS/Rolf) are now split on `-` `/` `.` `,` `_` and matched part-by-part, so a known given name inside a joined token is recognized. Also expands the gazetteer 1025→1096 with the confirmed stragglers (Rolf/Walter/Roman from the prior pass + Jörg/Guido/Nina/Jana/Inge/…) and an international supplement (Balkan/Turkish/Iberian/Italian given names — the corpus carries immigrant names like Dejan/Dursun/Ilija). Tests: +1 (compound + umlaut-stripped forms collapse + stay clean). 23/23 text_taxonomy tests pass. Re-regen + final probe next. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Regenerated all 5 industry bundles with the given-name gazetteer +
umlaut/hyphen-robust tokenizer. Closes the 285-template real-name leak.
Verification on the regenerated bundles:
- bundle_pii_audit (now with the `given_name` check): PASS on all 5.
- broad common-given-name probe: 285 → 0 (Marc/Hans/Rolf/Walter/Roman/
Jörg/… all collapsed, incl. umlaut-stripped + hyphen-compound forms).
- tight full-name-shape survivors after a generic filter are all
institutions / company counterparties (Eli Lilly, Universitätsspital
Basel, Wolters Kluwer) — the keep-category, same as the bank names.
Over-redaction cost (German noun-capitalisation forces collapsing a
capitalised run that contains a name): of 34,181 templates, 2,328
(6.8%) carry a `{person}` placeholder and only 190 (0.6%) collapsed to
a bare `{person}` — 99.4% retain their description content.
Residual risk: a rare given name outside the 1,096-entry gazetteer in a
full-name run could still pass (both tokenize and the audit share the
gazetteer). The Phase-B denylist is the surname backstop. A comprehensive
given-name lexicon is a possible future hardening.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
✅ Leak closed — merge un-held. Root-caused the |
Pure formatting (RE_NAME_RUN one-liner + test array wrapping). Caught by the Format CI check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…are names)
The new `given_name` residual-PII pattern is a TEMPLATE-scan signal for
un-tokenized corpus names (gated by bundle_pii_audit at the template
level). But two tests scan generated OUTPUT, where `{person}`/`{patient}`
placeholders have been filled with synthetic names that are name-shaped
by design — so `given_name` false-positives on legitimate output:
- je_generator::synthetic_patient_pool_entries_pass_residual_scan
- sp6_text_taxonomy_smoke (header_text / line_text scan)
Both now filter out `given_name` hits and keep asserting the structural
shapes (initial_surname, patient_record, title, …). The template-level
audit remains the authoritative gate for un-tokenized corpus names.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The macOS/Windows OS-matrix surfaced a synthetic `{company}` fill
("S.Merchandise Holdings LLC") tripping `initial_surname` in generated
output — the same false-positive class as `given_name` on person fills.
Scanning FILLED output with `residual_pii_scan` is conceptually wrong:
the scan detects PII *shapes*, which synthetic fills legitimately match.
Drop the output residual-PII assertion from the smoke; it now checks the
SP6 wiring it's meant to (no leftover `{…}` placeholders + line_text
coverage). The authoritative PII gate is `bundle_pii_audit`, which scans
the committed TEMPLATES (where a PII shape = an un-tokenized corpus name).
Removes the now-unused PlaceholderGrammar import.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Windows was getting cancelled at exactly 60m on the full `--all-targets` integration suite — a timeout, not a test failure (macOS ran the identical suite green in ~40 min; Windows runners are ~50% slower for the orchestrator + CLI-subprocess tests, and the larger regenerated SP6 bundles added load). 90 min gives Windows headroom; Linux is unaffected (`--lib --bins`, ~10 min) and macOS has margin. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
TextTemplate*path with a structured, PII-safe text-taxonomy pipeline keyed by(source × ISO-21378-account-class)— every synthetic header / line / CoA description is now coherent with the account class it posts to and carries zero residual corpus PII.bundle_pii_audittest over the committed.dsfbundles.PiiDenylistliteral matching to single-pass Aho-Corasick (per-client extraction: ~9 h → ~30–50 s on a 27 k-entry denylist), making iterative denylist curation viable.Implements
docs/superpowers/specs/2026-05-14-sp6-text-taxonomy-design.md; tracked indocs/superpowers/plans/2026-05-14-sp6-text-taxonomy.md. CHANGELOG entry asv5.27.What ships
datasynth-core::distributions::text_taxonomyPlaceholderGrammar(tokenize Phase A / fill / residual_pii_scan) + typesdatasynth-fingerprint::extraction::pii_denylistPiiDenylist::{load,apply}— Aho-Corasick literals + regex sweepdatasynth-fingerprint::extractionextract_text_taxonomy+aggregate_text_taxonomydatasynth-generators::je_generatorMasterDataResolver+(source, account_class)template lookupdatasynth-generators::coa_generatoroverlay_coa_taxonomyorchestrationdatasynth-runtime/tests/bundle_pii_audit.rsscripts/regenerate-industry-priors.sh--pii-denylistflag + post-regen audit gateBundle deltas
What's removed
BehavioralPriors.text_templatesand the entireTextTemplate*/fill_text_template_with_rng/extract_text_templates/aggregate_text_templatesSP4.4 path. All text now flows throughtext_taxonomy.BF metrics
Intentionally not re-baselined here — SP6 only touches text fields, not the temporal / clustering / amount signals the composite measures. Next baseline picks up incidental drift only.
Test plan
cargo test -p datasynth-runtime --test sp6_text_taxonomy_smoke(covers placeholder fill + zero residual PII over generated JEs)cargo test -p datasynth-runtime --test bundle_pii_audit(scans every committed.dsffor residual fuzzy PII)cargo test -p datasynth-generators --lib synthetic_patient_pool_entries_pass_residual_scan(regression guard on the synthetic-fallback shape invariant)cargo test -p datasynth-fingerprint --lib pii_denylist::tests(6/6 includingapply_scales_with_aho_corasick_not_with_denylist_sizeperf invariant)scripts/regenerate-industry-priors.shruns the audit gate as a build-time check; full regen against the private corpus completes ~clean.🤖 Generated with Claude Code