SP6 — Corpus-grounded text taxonomy + PII-safe placeholder grammar by mivertowski · Pull Request #202 · mivertowski/SyntheticData

mivertowski · 2026-05-18T14:16:15Z

Summary

Replaces the SP4.4 TextTemplate* path with a structured, PII-safe text-taxonomy pipeline keyed by (source × ISO-21378-account-class) — every synthetic header / line / CoA description is now coherent with the account class it posts to and carries zero residual corpus PII.
Two privacy gates protect the public bundles: a build-time residual-PII audit on every regen, and a CI bundle_pii_audit test over the committed .dsf bundles.
All 5 industry priors regenerated through the new pipeline. SP6.1 follow-up rewrote PiiDenylist literal matching to single-pass Aho-Corasick (per-client extraction: ~9 h → ~30–50 s on a 27 k-entry denylist), making iterative denylist curation viable.
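
To make the "single-pass" claim concrete, here is a minimal, std-only sketch of Aho-Corasick literal matching. It is not the `PiiDenylist` code (the PR text does not show it); the `Automaton` type and its methods are hypothetical names for illustration. The point is that the text is scanned once regardless of how many literals the denylist holds.

```rust
use std::collections::{HashMap, VecDeque};

/// Minimal byte-level Aho-Corasick automaton (illustrative sketch only).
pub struct Automaton {
    goto_: Vec<HashMap<u8, usize>>, // trie edges per state
    fail: Vec<usize>,               // failure link per state
    out: Vec<Vec<usize>>,           // ids of patterns ending at each state
    lens: Vec<usize>,               // byte length of each pattern
}

impl Automaton {
    pub fn new(patterns: &[&str]) -> Self {
        let mut a = Automaton {
            goto_: vec![HashMap::new()],
            fail: vec![0],
            out: vec![Vec::new()],
            lens: Vec::new(),
        };
        // Phase 1: insert every pattern into the trie.
        for (id, p) in patterns.iter().enumerate() {
            let mut s = 0;
            for &b in p.as_bytes() {
                s = match a.goto_[s].get(&b).copied() {
                    Some(next) => next,
                    None => {
                        a.goto_.push(HashMap::new());
                        a.fail.push(0);
                        a.out.push(Vec::new());
                        let next = a.goto_.len() - 1;
                        a.goto_[s].insert(b, next);
                        next
                    }
                };
            }
            a.out[s].push(id);
            a.lens.push(p.len());
        }
        // Phase 2: breadth-first pass to compute failure links.
        let mut queue: VecDeque<usize> = a.goto_[0].values().copied().collect();
        while let Some(s) = queue.pop_front() {
            let edges: Vec<(u8, usize)> = a.goto_[s].iter().map(|(&b, &t)| (b, t)).collect();
            for (b, t) in edges {
                let mut f = a.fail[s];
                while f != 0 && !a.goto_[f].contains_key(&b) {
                    f = a.fail[f];
                }
                let target = a.goto_[f].get(&b).copied().unwrap_or(0);
                a.fail[t] = target;
                // Inherit matches that end at the failure target.
                let inherited = a.out[target].clone();
                a.out[t].extend(inherited);
                queue.push_back(t);
            }
        }
        a
    }

    /// Scans `text` once; returns (pattern id, start byte, end byte) per hit.
    pub fn find_all(&self, text: &str) -> Vec<(usize, usize, usize)> {
        let mut s = 0;
        let mut hits = Vec::new();
        for (i, &b) in text.as_bytes().iter().enumerate() {
            while s != 0 && !self.goto_[s].contains_key(&b) {
                s = self.fail[s];
            }
            s = self.goto_[s].get(&b).copied().unwrap_or(0);
            for &id in &self.out[s] {
                hits.push((id, i + 1 - self.lens[id], i + 1));
            }
        }
        hits
    }
}
```

Build the automaton once per denylist load, then call `find_all` per description. A production matcher would more likely use a maintained crate and add case folding and word-boundary checks, which this sketch omits.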

Implements docs/superpowers/specs/2026-05-14-sp6-text-taxonomy-design.md; tracked in docs/superpowers/plans/2026-05-14-sp6-text-taxonomy.md. CHANGELOG entry as v5.27.

What ships

| Component | What |
| --- | --- |
| `datasynth-core::distributions::text_taxonomy` | `PlaceholderGrammar` (tokenize Phase A / fill / residual_pii_scan) + types |
| `datasynth-fingerprint::extraction::pii_denylist` | `PiiDenylist::{load,apply}`: Aho-Corasick literals + regex sweep |
| `datasynth-fingerprint::extraction` | `extract_text_taxonomy` + `aggregate_text_taxonomy` |
| `datasynth-generators::je_generator` | `MasterDataResolver` + `(source, account_class)` template lookup |
| `datasynth-generators::coa_generator` | CoA description fill once per account, `overlay_coa_taxonomy` orchestration |
| `datasynth-runtime/tests/bundle_pii_audit.rs` | CI gate over committed bundles |
| `scripts/regenerate-industry-priors.sh` | `--pii-denylist` flag + post-regen audit gate |

Bundle deltas

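| Bundle | Line pools | Header pools | CoA pools | Clients |
| --- | ---: | ---: | ---: | ---: |
| health | 794 | 57 | 3 123 | 8 |
| life_sciences | 319 | 34 | 1 362 | 4 |
| pharmaceutical | 0 | 8 | 76 | 1 |
| technology | 101 | 14 | 541 | 1 |
| power_and_utilities | 192 | 39 | 552 | 1 |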

What's removed

BehavioralPriors.text_templates and the entire TextTemplate* / fill_text_template_with_rng / extract_text_templates / aggregate_text_templates SP4.4 path. All text now flows through text_taxonomy.

BF metrics

Intentionally not re-baselined here — SP6 only touches text fields, not the temporal / clustering / amount signals the composite measures. Next baseline picks up incidental drift only.

Test plan

- `cargo test -p datasynth-runtime --test sp6_text_taxonomy_smoke` (covers placeholder fill + zero residual PII over generated JEs)
- `cargo test -p datasynth-runtime --test bundle_pii_audit` (scans every committed .dsf for residual fuzzy PII)
- `cargo test -p datasynth-generators --lib synthetic_patient_pool_entries_pass_residual_scan` (regression guard on the synthetic-fallback shape invariant)
- `cargo test -p datasynth-fingerprint --lib pii_denylist::tests` (6/6 including `apply_scales_with_aho_corasick_not_with_denylist_size` perf invariant)
- `scripts/regenerate-industry-priors.sh` runs the audit gate as a build-time check; full regen against the private corpus completes ~clean.
- CI on this PR


SP1 of the broader Behavioral-Fidelity initiative (SP1 evaluation → SP2 prior extraction → SP3 entity-aware generation → SP4 showcase release). Adds a new datasynth-eval/behavioral_fidelity/ submodule with the Sajja (2026) P1–P4 metrics adapted to GL semantics: Source as primary entity, Trading Partner as secondary, EntryDate as day-resolution timestamp, plus synth-only intraday metrics and a 10-rule canonical GL velocity rule set. Anchors every metric to a 50/50-split noise floor via the degradation-ratio normalizer, ships a `datasynth-data behavioral score` CLI with CI-gate semantics, and stays clean of real client data in the repo. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

23-task TDD plan covering: scaffolding (T1) → types + entity profile + math helpers (T2-T5) → parquet/csv loader (T6) → P1 IETD (T7) → P2 active lifetime + burst length + JE-line-burst (T8-T10) → P3 fan-out + clustering + triangles (T11-T12) → P4 canonical R1-R10 + trigger gap (T13) → degradation ratio + 50/50 split (T14) → intraday metrics (T15) → BehavioralFidelityReport + JSON/MD/CSV writers (T16-T17) → compute_report orchestrator (T18) → datasynth-data behavioral score CLI (T19) → integration smoke + noise-floor sanity tests (T20-T21) → CI wiring (T22) → user docs + README + CLAUDE.md (T23). Each task has bite-sized TDD steps with complete code blocks. Sequenced into 10 parallel-execution waves for subagent dispatch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the new submodule directory + Cargo.toml dependencies (arrow, parquet, petgraph) under datasynth-eval. Empty module stubs land in subsequent tasks. Module hierarchy mirrors the spec at docs/superpowers/specs/2026-05-11-sp1-behavioral-fidelity-design.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds Record (one JE line, normalised), EntityProfile, RuleSet + VelocityRuleSpec enum, GateThresholds, and BehavioralFidelityConfig::gl_default() returning the source-tp profile with seed 42 and DR-gate thresholds 2.0 / 1.5. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

gl_source_tp() returns the Source + Trading Partner profile. Real-corpus alias map preserves the typo 'Tarding Partner' from the source data exports. Synthetic alias map points at DataSynth's journal_entries output column names. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wasserstein-1 with fast equal-length path and quantile-grid fallback for unequal lengths. Pearson lag-1 autocorrelation returning None for too-short or zero-variance series. Empirical percentile, day-difference and weekend predicates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
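
A minimal sketch of the two helpers described above, under stated assumptions: the function names, the `grid` parameter, the midpoint quantile grid, and the "fewer than 3 points" cut-off are illustrative choices, not the crate's actual signatures.

```rust
/// Wasserstein-1 distance between two empirical samples.
/// Equal lengths: mean absolute difference of the sorted values (exact).
/// Unequal lengths: compare quantiles on a fixed grid (approximation).
pub fn wasserstein_1(a: &[f64], b: &[f64], grid: usize) -> Option<f64> {
    if a.is_empty() || b.is_empty() || grid == 0 {
        return None;
    }
    let mut a = a.to_vec();
    let mut b = b.to_vec();
    a.sort_by(|x, y| x.total_cmp(y));
    b.sort_by(|x, y| x.total_cmp(y));
    if a.len() == b.len() {
        let sum: f64 = a.iter().zip(&b).map(|(x, y)| (x - y).abs()).sum();
        return Some(sum / a.len() as f64);
    }
    let sum: f64 = (0..grid)
        .map(|i| {
            let q = (i as f64 + 0.5) / grid as f64; // midpoint of each grid cell
            (quantile(&a, q) - quantile(&b, q)).abs()
        })
        .sum();
    Some(sum / grid as f64)
}

/// Linearly interpolated quantile of a sorted, non-empty slice, q in [0, 1).
fn quantile(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64;
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    sorted[lo] + (sorted[hi] - sorted[lo]) * (pos - lo as f64)
}

/// Pearson correlation between x[t] and x[t+1].
/// Returns None for too-short or zero-variance series.
pub fn lag1_autocorr(xs: &[f64]) -> Option<f64> {
    if xs.len() < 3 {
        return None;
    }
    let (x, y) = (&xs[..xs.len() - 1], &xs[1..]);
    let n = x.len() as f64;
    let mx = x.iter().sum::<f64>() / n;
    let my = y.iter().sum::<f64>() / n;
    let cov: f64 = x.iter().zip(y).map(|(a, b)| (a - mx) * (b - my)).sum();
    let vx: f64 = x.iter().map(|a| (a - mx).powi(2)).sum();
    let vy: f64 = y.iter().map(|b| (b - my).powi(2)).sum();
    if vx == 0.0 || vy == 0.0 {
        return None;
    }
    Some(cov / (vx * vy).sqrt())
}
```

Returning `Option` rather than `NaN` lets callers skip entities with too little history instead of propagating an undefined value into a pooled metric.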

Auto-detects real-corpus vs synthetic schema (typo 'Tarding Partner' is the giveaway), maps to canonical column names, parses dates as either YYYY-MM-DD strings or Date32, supports millisecond/microsecond/nanosecond/RFC3339 timestamps for CreatedAt. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

compute_p1 computes pooled inter-event-time W1 and pooled within-entity lag-1 autocorrelation gap at day resolution. Includes source_of / trading_partner_of projector helpers for the two entity profiles. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>


… gap Builds the entity-projection graph via petgraph::UnGraph; clustering coefficient via neighbour-pair triangle enumeration; triangle log-ratio gap = |log((t_real+1)/(t_syn+1))|.
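
The neighbour-pair enumeration and the log-ratio gap can be sketched without `petgraph`. This version uses plain adjacency sets, and it assumes the global (transitivity) form of the clustering coefficient, which the commit message does not specify. All names are illustrative.

```rust
use std::collections::{BTreeMap, BTreeSet};

type Adjacency = BTreeMap<u32, BTreeSet<u32>>;

/// Builds an undirected simple graph (self-loops dropped, duplicates merged).
pub fn build_graph(edges: &[(u32, u32)]) -> Adjacency {
    let mut adj = Adjacency::new();
    for &(u, v) in edges {
        if u == v {
            continue;
        }
        adj.entry(u).or_default().insert(v);
        adj.entry(v).or_default().insert(u);
    }
    adj
}

/// Returns (triangle count, global clustering coefficient).
/// For every node, each pair of its neighbours is checked for a closing edge.
pub fn triangles_and_clustering(adj: &Adjacency) -> (u64, f64) {
    let mut closed = 0u64; // closed neighbour pairs; each triangle is seen 3 times
    let mut pairs = 0u64; // all neighbour pairs (connected triples)
    for nbrs in adj.values() {
        let ns: Vec<u32> = nbrs.iter().copied().collect();
        for i in 0..ns.len() {
            for j in (i + 1)..ns.len() {
                pairs += 1;
                if adj[&ns[i]].contains(&ns[j]) {
                    closed += 1;
                }
            }
        }
    }
    let clustering = if pairs == 0 { 0.0 } else { closed as f64 / pairs as f64 };
    (closed / 3, clustering)
}

/// |log((t_real + 1) / (t_syn + 1))|; the +1 keeps zero counts finite.
pub fn triangle_log_ratio_gap(t_real: u64, t_syn: u64) -> f64 {
    (((t_real + 1) as f64) / ((t_syn + 1) as f64)).ln().abs()
}
```

The gap is symmetric in its arguments and zero when both graphs have the same triangle count, so over- and under-production of triangles are penalised equally.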

Canonical GL velocity rules: R1 count, R2 distinct accounts, R3 sum>p90, R4 dormant-account wake, R5 distinct TPs, R6 amount spike, R7 off-hours, R8 post-close, R9 round-dollar share, R10 backdating. Per-rule trigger-rate gap + composite mean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ormaliser Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…edian seconds, off-hours rate) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…(BTreeMap-ordered) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eport Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…raday + gates Single entry point that produces a fully populated BehavioralFidelityReport with per-entity metrics (primary + optional secondary), baseline values from a deterministic 50/50 split, composite BF score (equal-weighted mean of all sub-metric DRs), and a gate result driven by GateThresholds. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
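
As a rough illustration of how the pieces fit together: a degradation ratio divides the real-vs-synthetic gap by the gap between the two halves of the 50/50 split, the composite is the equal-weighted mean of the ratios, and the gate maps to exit code 0 or 2. The `Gate` struct and its field names are hypothetical, and so is the reading that one threshold bounds each metric and the other bounds the composite.

```rust
/// Hypothetical gate shape; the real GateThresholds fields are not shown in this PR.
pub struct Gate {
    pub max_metric_dr: f64,
    pub max_composite_dr: f64,
}

/// Degradation ratio: gap(real, synthetic) / gap(split half A, split half B).
/// None when the baseline gap is not strictly positive.
pub fn degradation_ratio(gap_real_vs_syn: f64, gap_split_baseline: f64) -> Option<f64> {
    if gap_split_baseline > 0.0 {
        Some(gap_real_vs_syn / gap_split_baseline)
    } else {
        None
    }
}

/// Equal-weighted mean of all sub-metric degradation ratios.
pub fn composite(drs: &[f64]) -> Option<f64> {
    if drs.is_empty() {
        None
    } else {
        Some(drs.iter().sum::<f64>() / drs.len() as f64)
    }
}

/// 0 when the gate passes, 2 when it fails (or when there is nothing to score).
pub fn exit_code(drs: &[f64], gate: &Gate) -> i32 {
    let ok = match composite(drs) {
        Some(c) => c <= gate.max_composite_dr && drs.iter().all(|d| *d <= gate.max_metric_dr),
        None => false,
    };
    if ok { 0 } else { 2 }
}
```

A ratio near 1.0 means the synthetic data is about as far from the real data as one half of the real data is from the other, which is why the split baseline serves as the noise floor.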

Wires BehavioralScoreArgs to behavioral_fidelity::compute_report_from_paths, writes report.json + report.md + metrics.csv to --out, exits 0/2 based on gate result. SP1 ships the gl-source-tp profile only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…-synthetic data Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Fixes inner-doc comment in behavioral_smoke.rs so include! works from the noise-floor test module. When real equals syn, all per-entity DRs must stay below 2.0 (numerator ≈ 0, denominator = split baseline > 0). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds two named integration test invocations to the eval workflow. Both run with --test-threads=4 per workspace policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs/behavioral-fidelity.md documents the P1-P4 metrics, the R1-R10 velocity rule set, the day-granularity limitation, and CLI usage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…oise floor

First measurement of DataSynth against the real-corpus Swiss-healthcare GL data (JE_3, 1.4M lines, 2 years). Composite BF 59× lands mid-pack vs Sajja's tabular generators (CTGAN 32, TVAE 24, ARGN 36, Copula 39 on their dataset).

Key finding: P1 autocorrelation DR = 0.90, at the real-data noise floor for the primary Source entity. Matches Sajja Proposition 2: row-independent generators are structurally incapable of producing positive within-entity IET autocorrelation (their best is TVAE at 5.9×). DataSynth's rule-based architecture preserves the burst-regularity fingerprint those generators destroy.

Where the DR mass concentrates (SP2/SP3 input):
- P2 JE-line-burst W1: 452.80×, the biggest single fix target (lines-per-JE distribution)
- P3 triangle/clustering: 36-345× (bipartite Source-attribute graph structure)
- P1 IETD W1: 60× (within-Source posting cadence)
- P2 active lifetime: 23× (Sources active for different windows)
- TradingPartner column gap: synthetic journal_entries has no trading_partner → degenerate TP scores

Also fixes synthetic_aliases to match DataSynth's actual journal_entries.csv schema (source / gl_account / document_id / line_number / posting_date / document_date / local_amount); earlier aliases were guesses that didn't match the real output.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

SP2 of the broader Behavioral-Fidelity initiative (SP1 ✅ shipped → SP2 here → SP3 next → SP4 last). Mines the 45-client real-corpus for the five priors the SP1 baseline identified as concrete fix targets (composite BF 59.0× on JE_3): source-mix, per-Source IET, lines-per-JE (the 452.8× target), active lifetime, bipartite fan-out — plus a bonus posting-lag prior. Extends datasynth-fingerprint with new models/behavioral.rs + extraction/behavioral_extractor.rs + aggregation/industry_aggregator.rs. Bumps .dsf schema_version with an additive optional `behavioral` field (old readers still work). Reuses datasynth-eval's Record + loader + entity_profile so schema-mapping stays in one place. Five industry bundles ship under crates/datasynth-generators/resources/priors/: health, life_sciences, pharma, power_utilities, technology (industries with ≥3 clients in the 45-client corpus; smaller industries deferred). Three new CLI subcommands: `fingerprint extract --behavioral`, `fingerprint aggregate-industry`, `fingerprint inspect --behavioral`. trading_partner column plumbing in journal_entries.csv is explicitly added to the SP3 target list per user direction; SP2 still extracts the TP fan-out prior from real data. ~2 weeks, 6 phases, mirrors SP1 dispatch pattern. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

23-task TDD plan covering: scaffolding + BehavioralPriors model types (T1) → LineCountHistogram helpers (T2) → 5 prior extractors + bonus posting-lag (T3-T8) → extract_behavioral_priors orchestrator (T9) → aggregation scaffolding + 4 aggregators (T10-T14) → aggregate_industry_priors orchestrator (T15) → 3 CLI surfaces: fingerprint extract --behavioral / aggregate-industry / info --behavioral (T16-T18) → integration smoke + backward-compat test (T19-T20) → CI workflow (T21) → regenerate script (T22) → docs (T23). Each task has bite-sized TDD steps with complete code blocks. Final committed-bundle generation happens manually post-T23 via scripts/regenerate-industry-priors.sh against the real corpus (not a plan task). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds BehavioralPriors struct hierarchy (SourceMix, PerSourceIet, LinesPerJe, ActiveLifetime, Fanout, PostingLag) with empty-default round-trip test, an optional behavioral field on Fingerprint (additive via serde(default)), and stub modules for extraction/aggregation. Individual prior extractors land in tasks 2-9; aggregators in 10-15. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…, unwrap Option wrapper T1 subagent wrapped IetSummary.empirical_cdf_days / LagSummary.empirical_cdf_days in Option<EmpiricalCdf> because EmpiricalCdf lacked Default + PartialEq. Restores the spec's plain-EmpiricalCdf shape by deriving both on EmpiricalCdf — one-line change — and restoring PartialEq derives on the containing structs (BehavioralPriors, PerSourceIetPrior, IetSummary, PostingLagPrior, LagSummary). This means downstream T2-T9 extractor code can use `empirical_cdf_days: EmpiricalCdf::from_sorted_values(...)` directly without the Option wrapper, matching the plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n helpers + bucket grids Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The integration smoke `extract_aggregate_inspect_roundtrip` was failing on CI's OS-matrix Test jobs (ubuntu/macos/windows). Root cause: SP3.8b set `DEFAULT_MIN_SOURCE_OBSERVATIONS = 1000` to drop the long tail of low-volume sources from `source_mix`. The test spreads draws across 6 codes, so at the prior n=3000 each code only got ~500 observations and all six were filtered → `source_mix.probabilities.is_empty()`. The earlier F2 fix (9d4caa5) caught the *unit* test that uses the same threshold but missed this integration test. Bump n=3000 → 9000 (~1500 per code, comfortably above 1000 at 2σ across seed variance) and update the `n_rows_aggregated` assertion accordingly. Pre-existing failure latent since fcacb0b, surfaced now by CI's OS-matrix. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The codebase had `#![deny(clippy::unwrap_used)]` crate-wide on 11 lib.rs files and worked around it with 672 per-mod `#[allow(clippy::unwrap_used)]` annotations on test modules. CI's clippy step ran `cargo clippy --all-targets` against the workspace root virtual manifest, which only exercises the bench harness, not the members, so this hybrid policy was never actually enforced end-to-end. Locally `cargo clippy --workspace --tests` produced 66+ errors.

Cleanup:
- Switch each affected lib.rs from `#![deny(clippy::unwrap_used)]` to `#![cfg_attr(not(test), deny(clippy::unwrap_used))]`. Production-code policy unchanged (strict deny); tests get the natural latitude they need (unwrap is the intended assertion mechanism in test fixtures). Matches the industry-standard Rust pattern.
- Remove the now-redundant per-mod `#[allow(clippy::unwrap_used)]` annotations from 672 test modules across all crates. Smaller diff, single canonical policy expressed in lib.rs.
- Fix the 4 remaining non-unwrap clippy errors that surfaced once the workspace --tests scope was actually exercised:
  - needless_borrow_for_generic_args in sp3_priors_smoke.rs:530
  - field_reassign_with_default in sp3_priors_smoke.rs:566
  - unnecessary_unwrap in sp3_priors_smoke.rs:1448 (use `if let Some`)
  - useless_vec in velocity_rules.rs:234
- Tighten CI workflow: `cargo clippy --all-targets` → `cargo clippy --workspace --all-targets`. Now the policy actually fires across every member crate's lib + tests + benches + examples.

Verified: `cargo clippy --workspace --all-targets -- -D warnings` clean; all workspace lib tests pass (4761+ tests across 17 crates); sp6 integration smokes + bundle audit + behavioral priors smoke all green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add entries for artifacts that accumulated during SP6 development and that don't belong in the repo:
- `.claude/`: per-project Claude Code state (settings, transcripts)
- `.playwright-mcp/`: Playwright MCP capture artifacts
- `hf_*_staging/` / `hf_*_output/`: Hugging Face dataset staging/output
- `spaces/*/preview-live.png`: browser-captured preview screenshots

All regenerable; none are source. Cuts the recurring noise in `git status`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

All three TODO comments were really "this is a known scope limitation, here's why we chose to limit scope, here's the workaround", not action items. They already carried explicit "for X use-case the approximation is acceptable" caveats. Switching `TODO:` → `LIMITATION:` preserves the content and rationale while removing the false to-do signal that shows up in grep sweeps.

- copula::incomplete_beta: continued-fraction convergence on extreme tails (intentional for audit-simulation copula sampling)
- holidays::approximate_chinese_new_year: lunar approximation (intentional for activity-pattern simulation, off by ±1-2 days)
- period_close::ThirteenPeriod: 28-day fiscal period (intentional for simulation; not appropriate for production period-close)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`cargo doc --workspace --no-deps` was emitting 80 warnings split into two categories:
- Math notation rustdoc misreads as links: `[0,1]`, `[i]`, `[0]`, `[:100]`, `[DE]`. Fix: escape brackets, `[0,1]` → `\[0,1\]`.
- Real broken refs to renamed/moved methods/types (`[generate]`, `[run_shard]`, `[CountryPack]`, `[ConditionalSampler::new]`, etc.). Fix: drop the link form, keep the code-formatted name, ``[`NAME`]`` → `` `NAME` ``.

Both are pure-cosmetic fixes. Rendered docs look identical to a reader (math notation is unchanged; broken links were rendering as raw text anyway). Warnings now zero across the workspace. Unblocks adding `-D warnings` to `cargo doc` in CI in a future hardening pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`bundled_priors_path()` returns a `PathBuf` whose string form uses the host's native separator. On Windows the previous assertion `s.contains("priors/industry_priors_health.dsf")` always failed because the rendered path uses `\`. Split into two substring checks (`contains("priors")` AND `contains("industry_priors_health.dsf")`) so the test holds on both platforms. Pre-existing failure on `Test (windows-latest)` since the bundle relocation; surfaced now by CI's OS-matrix after we tightened other checks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
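
A short sketch of the separator-agnostic check described above, plus a component-wise alternative using `Path::ends_with`. The function names are hypothetical; only the two substrings come from the commit message.

```rust
use std::path::{Path, PathBuf};

/// Two substring checks that do not depend on the host's path separator.
pub fn is_health_bundle_path(path: &Path) -> bool {
    let s = path.to_string_lossy();
    s.contains("priors") && s.contains("industry_priors_health.dsf")
}

/// Alternative: compare whole trailing path components instead of raw text.
pub fn ends_with_components(path: &Path, tail: &[&str]) -> bool {
    // Collecting into a PathBuf joins the parts with the native separator.
    let tail_path: PathBuf = tail.iter().collect();
    path.ends_with(&tail_path)
}
```

The component-wise form is stricter: it also rejects a path where `priors` merely appears somewhere earlier in the string, at the cost of pinning the exact directory layout.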

cargo-machete (validated by grep + cfg-gated probe) flagged ~33 dependencies declared in Cargo.toml but never actually used. After audit, all 31 below have ZERO source uses (no `use X`, no `X::`, no cfg-gated import, no `#[derive(X)]`) and were just dead declarations:

- datasynth-ocpm: thiserror
- datasynth-eval: datasynth-config, datasynth-generators, rayon
- datasynth-fingerprint: anyhow, datasynth-config
- datasynth-banking: rand_distr, thiserror
- datasynth-runtime: crossbeam-channel, tokio
- datasynth-generators: rayon
- datasynth-output: thiserror, tokio, uuid, crossbeam-channel
- datasynth-server: anyhow, async-trait, datasynth-generators, datasynth-output, subtle, thiserror
- datasynth-cli: datasynth-graph, indicatif, itoa, ryu, serde, tokio
- datasynth-test-utils: rand, rand_chacha, serde_json, tempfile
- datasynth-group: datasynth-audit-fsm

Notable wins: dropping `datasynth-generators` + `datasynth-output` from `datasynth-server` and `datasynth-generators` + `datasynth-config` from `datasynth-eval` removes large transitive build paths.

Three deps that machete reported but ARE used (via feature wiring rather than direct `use`) are kept and pinned in `[package.metadata.cargo-machete] ignored` blocks with explanatory comments:
- datasynth-core/safetensors (public optional dep, neural feature)
- datasynth-cli/reqwest (public optional dep, streaming feature)
- fuzz/arbitrary (libfuzzer harness macro consumption)
- attic/datasynth-graph-export/datasynth-banking (excluded crate)

Verified: cargo check --workspace --tests green; cargo machete clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runs `cargo machete` on every PR. Currently clean across the workspace; new unused dep declarations will be caught at PR time. Pinned ignore entries (safetensors, reqwest, arbitrary, datasynth-banking-in-attic) document the legitimate exceptions inline. Mirrors the shape of the existing Security Audit job (cargo-deny + cargo-audit) — same `cargo install --locked` pattern, same exclusion of the private graph-export sibling. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The Cartesian default produced O(debits × credits) edges per JE. A 50-debit / 50-credit period-close consolidation alone yielded 2 500 edges; a typical HF-scale 1 M-line config blew up to 200 M+ edges (noted in the original doc-comment). On 14–16 GB CI runners, the small-complexity CLI smoke `test_generate_from_config_file` hit a 20 GB allocation request inside `RawVec::grow_one` and OOM'd on Windows (immediate `memory allocation failed`) / hung-then-killed on macOS (slower allocator path).

Surface evidence: main alternates pass/fail on this same test for exactly this reason. The OS-matrix Test job is the only one running `--all-targets` (Linux skips integration tests for cumulative-memory reasons per workflow comment).

Flip the default to Method A: one edge per 2-line JE, skip multi-line. Bounded edge count (≈ 60 % of entries per Ivertowski 2024), exactness-preserving (confidence = 1.0 on every emitted edge), the recommended shape for published reference datasets. Users relying on the previous Cartesian shape can opt back in with `je_network.method: cartesian` explicitly; `JeNetworkConfig` already plumbs the choice through.

Local verification: `test_generate_from_config_file` previously ran for 629 s before being interrupted on macOS / OOM'd at 361 s on Windows. With Method A default it finishes in **47 s** locally. All 4 `je_network` unit tests (including the explicit `Cartesian` cases) keep passing; only the *default* changed.

CHANGELOG: noted as a v5.27 breaking change with migration steps.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
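
The difference in edge growth is easy to see in a small sketch. The `JournalEntry` struct and both function names are illustrative, and "2-line JE" is assumed here to mean exactly one debit line and one credit line.

```rust
/// Illustrative JE shape: only the per-side line counts matter for edge counting.
pub struct JournalEntry {
    pub n_debit: u64,
    pub n_credit: u64,
}

/// Cartesian: every debit line is linked to every credit line of the same JE.
pub fn cartesian_edges(entries: &[JournalEntry]) -> u64 {
    entries.iter().map(|je| je.n_debit * je.n_credit).sum()
}

/// Method A: one edge per 2-line JE; multi-line JEs emit nothing.
pub fn method_a_edges(entries: &[JournalEntry]) -> u64 {
    entries
        .iter()
        .filter(|je| je.n_debit == 1 && je.n_credit == 1)
        .count() as u64
}
```

Under Method A the edge count can never exceed the number of entries, whereas a single 50 × 50 entry already contributes 2 500 Cartesian edges.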

Legal guidance: no real company names in the codebase, even as test fixtures. The group-audit consolidation fixture was named after a real multinational ("Mini-Nestlé", entity codes NESTLE_SA / NESTLE_USA / NESTLE_DE / NESTLE_BR / NESTLE_JV / NESTLE_EU / NESTLE_FR). The fixture is entirely synthetic (hand-built TBs and a fabricated group structure), but the *name* is a real company and must go.

Rename throughout (58 files): `NESTLE_*` → `ACME_*`, `Mini-Nestlé` → `Mini-Acme`, `mini_nestle` → `mini_acme`. Physical renames via `git mv`:
- configs/examples/group/mini_nestle.yaml → mini_acme.yaml
- tests/fixtures/mini_nestle{,_minimal}.yaml → mini_acme*.yaml
- tests/golden/mini_nestle_manifest.json → mini_acme_manifest.json
- tests/golden/mini_nestle/ → tests/golden/mini_acme/

The manifest golden was regenerated: `entity_seed` hashes and `ICR_*` intercompany-relationship IDs are FNV-derived from the entity codes, so renaming the codes deterministically changes those derived values. Diff confirmed to be exactly the rename + its hash consequences, nothing else.

Verified: datasynth-group lib (107) + manifest_golden + golden_archive + manifest_builder + config_parse + aggregate_e2e + standalone_e2e + balance_property + ic_matcher all pass; datasynth-cli group_cli (8) and datasynth-runtime output_root_routing (3) pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
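
Why a rename forces a golden regeneration can be shown with a small hash sketch. The commit says only "FNV-derived"; the 64-bit FNV-1a variant and the `entity_seed` signature below are assumptions for illustration, and `FOO_SA` is a made-up code.

```rust
/// 64-bit FNV-1a (assumed variant; the repo may use a different FNV flavour).
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    bytes
        .iter()
        .fold(OFFSET_BASIS, |h, &b| (h ^ u64::from(b)).wrapping_mul(PRIME))
}

/// Hypothetical seed derivation: the seed is a pure function of the entity code.
pub fn entity_seed(code: &str) -> u64 {
    fnv1a_64(code.as_bytes())
}
```

Because the seed depends only on the code's bytes, the same code always yields the same seed, and changing `FOO_SA` to `ACME_SA` changes it deterministically. That is why the golden diff is expected to contain the rename plus its hash consequences and nothing else.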

Legal guidance tightened: refer to the corpus only as "corpus" / "corpus data", never with a "real" / "real-world" / "real client" qualifier that hints it is real client data, and never name a client.

Scrub across the repo (current-facing + historical docs):
- "real-corpus" / "the real corpus" / "real corpus" / "held-out real corpus" / "real-world (enterprise) (GL) corpus" → "corpus" forms (~340 occurrences). Code doc-comments, CHANGELOG, README, baseline SUMMARYs, superpowers specs/plans.
- Corpus-source references "real client COA/TB/GL exports/cubes" (the fingerprint extractor's input) → "corpus …".
- `text_taxonomy` legal-entity-suffix test fixtures used real company names (Nespresso, Roche, Sintetica); replaced with fictional placeholders (Acme / Globex / Initech). The test only checks the `Word + S.A./B.V.` shape, so the names are arbitrary.
- Integration-spec product concept "real client data" (the auditor's own data, distinct from our corpus) → "client data". Drops the "real" qualifier while keeping the real-vs-synthetic meaning.

Deliberately left unchanged (legitimate, not corpus references):
- Generic fingerprint-capability docs ("extract from real data"): a general feature description, not our corpus.
- EU AI Act Article 10 compliance statements ("no real data used"): a privacy *feature* claim.
- Internal "real GL balance" in audit generators: the simulation's own GL, not external data.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

`NESTLE_*` → `ACME_*` shortened several call-argument lines below rustfmt's wrap threshold, so the formatter re-collapses multi-line expressions onto single lines. Pure formatting, no logic change. Caught by the Format CI check on the rename commit. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The v5.27 default flip (JeNetworkMethod Cartesian → A) broke this e2e test on the OS-matrix Test jobs (macOS/Windows). The test's assertion 2 validates the Cartesian edge formula — edge count = Σ (n_debit × n_credit) per JE — but it called `write_all_output`, which now defaults to Method A (one edge per 2-line JE, skip multi-line). Actual < expected → fail. Switch to `write_all_output_with_layout(..., JeNetworkMethod::Cartesian)` so the test keeps exercising the Cartesian math it was written for. This restores the exact path the test passed on for its entire pre-flip history. Method A's bounded edge count is the new default for generation; this test deliberately opts into Cartesian to validate that code path. Not run locally — the full-orchestrator run + Cartesian edge product exhausts this dev box's memory. Validated by CI on the OS-matrix runners (14-16 GB), which is where the regression surfaced. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski · 2026-05-20T14:24:24Z

🔴 Merge held. A review of the shipped text-taxonomy bundles surfaced ~285 templates carrying real Firstname Lastname person names that escaped placeholder substitution (the audit gate only flagged Initial. Surname/title/patient shapes, not plain two-word names). Fixing via a given-name gazetteer tokenizer in Phase A + residual_pii_scan, then regenerating + re-auditing all 5 bundles. Will mark ready once the audit is clean against the new detector.

A review of the shipped text-taxonomy bundles found 285 distinct templates carrying real `Firstname Lastname` / `Lastname Firstname` person names that escaped placeholder substitution (e.g. consulting fees, rent paid to individuals, debtor late-fees). The SP6 PII model missed this class on all three layers:
- Phase A (structural) handles dates/digits, not names;
- the Phase-B denylist is occurrence-thresholded, so the long tail of names (each appearing a few times) was never added;
- `residual_pii_scan` / `bundle_pii_audit` only flagged `Initial. Surname`, `Surname Initial.`, titles, patient and star records. There was NO plain two-word-name detector, so the audit was green while the leak shipped.

Fix: a given-name gazetteer anchor.
- resources/given_names.txt: 959 generic given names (country-pack union + Swiss/DE/FR/IT supplement). NOT PII; it is a name dictionary like city names, and covers the Swiss-corpus long tail.
- Phase A rule 3.5: a capitalised run (>=2 tokens) containing a known given name collapses to `{person}`. German capitalises all nouns, so case can't separate a surname from a description noun; we redact the whole run (safe over-redaction) rather than risk leaking the surname. Lowercase words / punctuation terminate the run, preserving surrounding description.
- residual_pii_scan gains a `given_name` check using the same anchor, so bundle_pii_audit now catches this class and can't regress.

Surname-only references with no given name (e.g. "Darlehen <Surname>") still rely on the denylist, a separate, smaller class.

Tests: 3 new (scan flags name runs; tokenize collapses runs + stays PII-clean; no false positives on bank names / generic terms / placeholdered text), using fictional surnames + generic given names so no real corpus name enters the test source. 21/21 text_taxonomy tests pass. Bundle regeneration + re-audit follow in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
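
The run-collapse rule can be sketched roughly as follows. This is a simplified illustration rather than the Phase A tokenizer: it splits on whitespace only, treats any token containing a non-letter as a run terminator, and uses a tiny in-memory gazetteer; `collapse_name_runs` and `is_capitalised` are hypothetical names.

```rust
use std::collections::HashSet;

/// True for tokens like "Honorar" or "Anna": leading uppercase, letters only.
fn is_capitalised(tok: &str) -> bool {
    let mut chars = tok.chars();
    matches!(chars.next(), Some(c) if c.is_uppercase()) && chars.all(|c| c.is_alphabetic())
}

/// Collapses each capitalised run of >= 2 tokens that contains a known
/// given name into a single `{person}` placeholder.
pub fn collapse_name_runs(text: &str, given_names: &HashSet<&str>) -> String {
    let tokens: Vec<&str> = text.split_whitespace().collect();
    let mut out: Vec<String> = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        // Extend the run while tokens stay capitalised.
        let mut j = i;
        while j < tokens.len() && is_capitalised(tokens[j]) {
            j += 1;
        }
        let run = &tokens[i..j];
        let has_name = run
            .iter()
            .any(|t| given_names.contains(t.to_lowercase().as_str()));
        if run.len() >= 2 && has_name {
            out.push("{person}".to_string()); // redact the whole run
            i = j;
        } else if j > i {
            out.extend(run.iter().map(|t| t.to_string())); // keep run as-is
            i = j;
        } else {
            out.push(tokens[i].to_string()); // lowercase / punctuated token
            i += 1;
        }
    }
    out.join(" ")
}
```

With a gazetteer containing `anna`, the input `Honorar Anna Muster fuer beratung` becomes `{person} fuer beratung`: the capitalised noun `Honorar` is swallowed along with the name, which is the safe over-redaction trade-off the commit describes.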

The given-name regen surfaced a gap in the prior commit: tokenize's patient (Rule 1) and star-person (Rule 2) branches `return`ed early, so a `Firstname Lastname` in the SUFFIX of such a record (e.g. a trailing person name after a `G:` patient marker) never reached the name-run collapse — it leaked to the bundle and the `given_name` scan then correctly rejected it at extraction (regen halt on a `Robert Hoe`-style line). Restructure tokenize so Rules 1/2 produce a staged string that falls through to the common tail (street + name-run + structural), so every path gets name + structural redaction. Regression test added with a fictional trailing name after patient/star records. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Iterative regen surfaced two more leak vectors the plain whitespace+lowercase lookup missed:

1. The corpus strips umlauts entirely (Jürg→Jrg, Löhne→Lhne), so a gazetteer entry with proper umlauts never matched the corpus form. `normalize()` now drops ä/ö/ü and maps accented latin to its base letter (Régis→Regis), applied to both the gazetteer (at load) and the token at lookup.
2. Compound / prefix-joined tokens (Hans-Rudolf, ESD-Roger, CS/Rolf) are now split on `-` `/` `.` `,` `_` and matched part-by-part, so a known given name inside a joined token is recognized.

Also expands the gazetteer 1025→1096 with the confirmed stragglers (Rolf/Walter/Roman from the prior pass + Jörg/Guido/Nina/Jana/Inge/…) and an international supplement (Balkan/Turkish/Iberian/Italian given names; the corpus carries immigrant names like Dejan/Dursun/Ilija).

Tests: +1 (compound + umlaut-stripped forms collapse + stay clean). 23/23 text_taxonomy tests pass. Re-regen + final probe next.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
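
Both fixes can be sketched in a few lines. The accent table below is a small illustrative subset, and `load_gazetteer` / `contains_given_name` are hypothetical names; only `normalize()`, the dropped umlauts, and the five split characters come from the commit message.

```rust
use std::collections::HashSet;

/// Lowercases, drops ä/ö/ü entirely (mirroring the corpus), and maps a few
/// accented Latin letters to their base letter.
pub fn normalize(token: &str) -> String {
    token
        .to_lowercase()
        .chars()
        .filter_map(|c| match c {
            'ä' | 'ö' | 'ü' => None,
            'à' | 'á' | 'â' | 'ã' => Some('a'),
            'è' | 'é' | 'ê' | 'ë' => Some('e'),
            'ì' | 'í' | 'î' | 'ï' => Some('i'),
            'ò' | 'ó' | 'ô' | 'õ' => Some('o'),
            'ù' | 'ú' | 'û' => Some('u'),
            'ç' => Some('c'),
            'ñ' => Some('n'),
            other => Some(other),
        })
        .collect()
}

/// Normalises every gazetteer entry once, at load time.
pub fn load_gazetteer(names: &[&str]) -> HashSet<String> {
    names.iter().map(|n| normalize(n)).collect()
}

/// Splits joined tokens on `-` `/` `.` `,` `_` and looks up each part.
pub fn contains_given_name(token: &str, gazetteer: &HashSet<String>) -> bool {
    token
        .split(|c: char| matches!(c, '-' | '/' | '.' | ',' | '_'))
        .filter(|part| !part.is_empty())
        .any(|part| gazetteer.contains(&normalize(part)))
}
```

Applying the same `normalize` on both sides is what makes the lookup symmetric: the gazetteer entry `Jürg` and the corpus form `Jrg` both reduce to `jrg`.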

Regenerated all 5 industry bundles with the given-name gazetteer + umlaut/hyphen-robust tokenizer. Closes the 285-template real-name leak.

Verification on the regenerated bundles:
- bundle_pii_audit (now with the `given_name` check): PASS on all 5.
- broad common-given-name probe: 285 → 0 (Marc/Hans/Rolf/Walter/Roman/Jörg/… all collapsed, incl. umlaut-stripped + hyphen-compound forms).
- tight full-name-shape survivors after a generic filter are all institutions / company counterparties (Eli Lilly, Universitätsspital Basel, Wolters Kluwer): the keep-category, same as the bank names.

Over-redaction cost (German noun-capitalisation forces collapsing a capitalised run that contains a name): of 34,181 templates, 2,328 (6.8%) carry a `{person}` placeholder and only 190 (0.6%) collapsed to a bare `{person}`; 99.4% retain their description content.

Residual risk: a rare given name outside the 1,096-entry gazetteer in a full-name run could still pass (both tokenize and the audit share the gazetteer). The Phase-B denylist is the surname backstop. A comprehensive given-name lexicon is a possible future hardening.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski · 2026-05-20T16:41:53Z

✅ Leak closed — merge un-held. Root-caused the Firstname Lastname exposure (285 templates) and fixed via a 1,096-entry given-name gazetteer anchoring a Phase-A run-collapse rule + a residual_pii_scan given_name check (robust to the corpus's umlaut-stripping and hyphen/slash-joined tokens). All 5 bundles regenerated; bundle_pii_audit passes with the new check; broad common-name probe 285 → 0; over-redaction only 0.6% of templates. Residual: a rare given name outside the gazetteer (denylist is the surname backstop) — documented in the CHANGELOG. Commits f62f8f0..530e86d.

Pure formatting (RE_NAME_RUN one-liner + test array wrapping). Caught by the Format CI check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…are names)

The new `given_name` residual-PII pattern is a TEMPLATE-scan signal for un-tokenized corpus names (gated by bundle_pii_audit at the template level). But two tests scan generated OUTPUT, where `{person}`/`{patient}` placeholders have been filled with synthetic names that are name-shaped by design, so `given_name` false-positives on legitimate output:
- je_generator::synthetic_patient_pool_entries_pass_residual_scan
- sp6_text_taxonomy_smoke (header_text / line_text scan)

Both now filter out `given_name` hits and keep asserting the structural shapes (initial_surname, patient_record, title, …). The template-level audit remains the authoritative gate for un-tokenized corpus names.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The macOS/Windows OS-matrix surfaced a synthetic `{company}` fill ("S.Merchandise Holdings LLC") tripping `initial_surname` in generated output, the same false-positive class as `given_name` on person fills.

Scanning FILLED output with `residual_pii_scan` is conceptually wrong: the scan detects PII *shapes*, which synthetic fills legitimately match. Drop the output residual-PII assertion from the smoke; it now checks the SP6 wiring it's meant to (no leftover `{…}` placeholders + line_text coverage). The authoritative PII gate is `bundle_pii_audit`, which scans the committed TEMPLATES (where a PII shape = an un-tokenized corpus name).

Removes the now-unused PlaceholderGrammar import.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
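A "no leftover `{…}` placeholders" check on filled output could look roughly like the sketch below. The function name `leftover_placeholders` is hypothetical, and the assumption that placeholder names are lowercase ASCII letters and underscores is inferred only from the examples in this thread (`{person}`, `{patient}`, `{company}`); the real smoke test may check this differently.

```rust
/// Return the names of any `{name}` placeholders still present in `text`.
/// A placeholder name is assumed to be non-empty and made of lowercase
/// ASCII letters or underscores.
fn leftover_placeholders(text: &str) -> Vec<&str> {
    let mut found = Vec::new();
    let mut rest = text;
    while let Some(open) = rest.find('{') {
        // `{` is a single byte, so `open + 1` is a valid char boundary.
        let after = &rest[open + 1..];
        match after.find('}') {
            Some(close) => {
                let name = &after[..close];
                let valid = !name.is_empty()
                    && name.chars().all(|c| c.is_ascii_lowercase() || c == '_');
                if valid {
                    found.push(name);
                    rest = &after[close + 1..];
                } else {
                    // Not a placeholder: resume just past this `{`.
                    rest = after;
                }
            }
            None => break,
        }
    }
    found
}
```

Unlike a PII-shape scan, this check cannot false-positive on a synthetic fill such as "S.Merchandise Holdings LLC", because it looks only for unfilled template syntax and never inspects the shape of the filled value.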

Windows was getting cancelled at exactly 60m on the full `--all-targets` integration suite: a timeout, not a test failure (macOS ran the identical suite green in ~40 min; Windows runners are ~50% slower for the orchestrator + CLI-subprocess tests, and the larger regenerated SP6 bundles added load). 90 min gives Windows headroom; Linux is unaffected (`--lib --bins`, ~10 min) and macOS has margin.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski and others added 30 commits May 11, 2026 22:33

feat(eval/behavioral): P2 active lifetime W1

bf17717

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval/behavioral): P2 burst-length W1 at configurable gap thresholds

041d653

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval/behavioral): P2 JE-line-burst length W1 (structural)

079438b

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval/behavioral): P3 fan-out distribution W1

3033151

feat(eval/behavioral): P3 clustering coefficient + triangle log-ratio…

58cf783

… gap Builds the entity-projection graph via petgraph::UnGraph; clustering coefficient via neighbour-pair triangle enumeration; triangle log-ratio gap = |log((t_real+1)/(t_syn+1))|.
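As a small illustration of the metric named in this commit, the sketch below counts triangles by neighbour-pair enumeration and computes the gap |log((t_real+1)/(t_syn+1))|. It swaps in plain `BTreeSet` adjacency lists instead of `petgraph::UnGraph` to stay dependency-free, assumes the natural logarithm (the commit message only says "log"), and all function names are hypothetical.

```rust
use std::collections::BTreeSet;

/// Build undirected adjacency sets for `n` nodes; self-loops are ignored.
fn adjacency(n: usize, edges: &[(usize, usize)]) -> Vec<BTreeSet<usize>> {
    let mut adj = vec![BTreeSet::new(); n];
    for &(a, b) in edges {
        if a != b {
            adj[a].insert(b);
            adj[b].insert(a);
        }
    }
    adj
}

/// Count triangles by checking, for every node, which pairs of its
/// neighbours are themselves connected. Each triangle is seen once per
/// corner, hence the division by 3.
fn count_triangles(adj: &[BTreeSet<usize>]) -> u64 {
    let mut closed_pairs = 0u64;
    for nbrs in adj {
        let ns: Vec<usize> = nbrs.iter().copied().collect();
        for (i, &a) in ns.iter().enumerate() {
            for &b in &ns[i + 1..] {
                if adj[a].contains(&b) {
                    closed_pairs += 1;
                }
            }
        }
    }
    closed_pairs / 3
}

/// Triangle log-ratio gap: |ln((t_real + 1) / (t_syn + 1))|.
/// The +1 smoothing keeps the ratio defined when either count is zero.
fn triangle_log_ratio_gap(t_real: u64, t_syn: u64) -> f64 {
    ((t_real as f64 + 1.0) / (t_syn as f64 + 1.0)).ln().abs()
}
```

The gap is zero when both graphs have the same triangle count and is symmetric in its arguments, so it measures how far apart the counts are on a log scale regardless of which side is larger.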

feat(eval/behavioral): 50/50 split (JE-grouped, deterministic) + DR n…

631e4de

…ormaliser Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval/behavioral): synth-only intraday structural metrics (IETD m…

8e44c8f

…edian seconds, off-hours rate) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval/behavioral): BehavioralFidelityReport struct + JSON writer …

3618537

…(BTreeMap-ordered) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(eval/behavioral): Markdown + CSV writers for BehavioralFidelityR…

2846625

…eport Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

test(eval/behavioral): smoke test wires full pipeline on synthetic-vs…

7218f5b

…-synthetic data Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ci: run behavioral_fidelity smoke + noise-floor tests on PR

236d482

Adds two named integration test invocations to the eval workflow. Both run with --test-threads=4 per workspace policy. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(behavioral): user guide + README + CLAUDE.md entry

8a21ffb

docs/behavioral-fidelity.md documents the P1-P4 metrics, the R1-R10 velocity rule set, the day-granularity limitation, and CLI usage. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(fingerprint/behavioral): LineCountHistogram build / pool / media…

9eb1e3f

…n helpers + bucket grids Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski and others added 12 commits May 18, 2026 23:09

mivertowski changed the title ~~SP6 — Real-corpus text taxonomy + PII-safe placeholder grammar~~ SP6 — Corpus-grounded text taxonomy + PII-safe placeholder grammar May 20, 2026

mivertowski mentioned this pull request May 20, 2026

ML experiments: scaffold 5 neuro-symbolic + inverse-SBI tracks (A100-ready) #203

Open

3 tasks

mivertowski marked this pull request as draft May 20, 2026 14:24

mivertowski and others added 5 commits May 20, 2026 16:34

docs(sp6): CHANGELOG note for the given-name PII hardening

530e86d

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski marked this pull request as ready for review May 20, 2026 16:41

mivertowski and others added 4 commits May 20, 2026 18:46

style(sp6): rustfmt the given-name gazetteer module

15d7af4

Pure formatting (RE_NAME_RUN one-liner + test array wrapping). Caught by the Format CI check. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

mivertowski merged commit cfdbb08 into main May 20, 2026
17 checks passed

mivertowski deleted the sp6-text-taxonomy branch May 20, 2026 19:43

mivertowski merged 216 commits into main from sp6-text-taxonomy
