feat(export): stream large-footprint exports and batch anonymize #162

Open: jaylann wants to merge 2 commits into stage from feat/a3-export-memory-safety

jaylann (Owner) commented Jun 20, 2026

Item A3 — export/erasure memory safety at scale (MINOR, additive).

What

A large-footprint subject no longer materializes its whole export (or all anonymize PKs) in memory.

  • Exporter gains a streaming/iterator path that yields ExportRecords incrementally; the existing materialized export output is unchanged.
  • effaced-s3 export_collection streams object bodies (respecting max_object_bytes) instead of accumulating every body in RAM.
  • erasure_executor._anonymize fetches and updates in bounded batches instead of all-PKs-then-row-by-row.
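
The relationship between the streaming path and the unchanged materialized output can be sketched as follows. This is a minimal illustration and not the PR's code: it uses the standard-library `sqlite3` module in place of the SQLAlchemy adapter, and the `ExportRecord` tuple, the table names, and the `batch` parameter are all hypothetical stand-ins.

```python
import sqlite3
from typing import Iterator, NamedTuple


class ExportRecord(NamedTuple):
    # Hypothetical stand-in for the project's ExportRecord.
    table: str
    row: tuple


def iter_subject_records(
    conn: sqlite3.Connection, tables: list[str], subject_id: int, batch: int = 2
) -> Iterator[ExportRecord]:
    """Yield records table by table, fetching at most `batch` rows at a time."""
    for table in tables:
        # Table names come from a trusted list in this sketch, never user input.
        cur = conn.execute(
            f"SELECT * FROM {table} WHERE subject_id = ? ORDER BY id", (subject_id,)
        )
        while rows := cur.fetchmany(batch):
            for row in rows:
                yield ExportRecord(table, row)


def export_subject(
    conn: sqlite3.Connection, tables: list[str], subject_id: int
) -> tuple[ExportRecord, ...]:
    """Materialized path: drains the same lazy core, so set and order match."""
    return tuple(iter_subject_records(conn, tables, subject_id))


conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, subject_id INTEGER, item TEXT);
    CREATE TABLE notes  (id INTEGER PRIMARY KEY, subject_id INTEGER, body TEXT);
    INSERT INTO orders VALUES (1, 7, 'a'), (2, 7, 'b'), (3, 8, 'c'), (4, 7, 'd');
    INSERT INTO notes  VALUES (1, 7, 'x'), (2, 8, 'y');
    """
)
streamed = list(iter_subject_records(conn, ["orders", "notes"], 7))
materialized = export_subject(conn, ["orders", "notes"], 7)
```

Because the materialized function only wraps the generator in `tuple(...)`, equivalence between the two paths holds by construction in this sketch; a caller that consumes the generator directly never holds more than one fetched batch.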

Erasure/export semantics

No output change for any input. The materialized export bundle and the anonymized rows are byte-identical to before — this is a memory/throughput change only. MINOR. The streaming path is proven equivalent to the materialized path (test_exporter_streaming.py), and the batched anonymize reuses the bleed/idempotency harness.

Checks

just check + just test green locally (1014 passed).

Add memory-bounded, byte-identical streaming companions alongside the
existing materializing paths so a large-footprint subject no longer OOMs:

- Exporter.iter_subject_records yields ExportRecords table by table off
  the cursor (yield_per) instead of building the whole ExportBundle tuple;
  export_subject is unchanged and now delegates through the same lazy core.
- effaced-s3 iter_object_records streams one object body at a time
  (respecting max_object_bytes); collect_object_records drains it.
- ErasureExecutor anonymize fetches PKs in bounded ordered+offset pages
  rather than all at once, with per-row surrogates (ADR 0007) intact.

No export or erasure output changes for any input: the streamed records
equal the materialized bundle (same set and order) and the anonymized
rows are identical; same EXPORT_REQUESTED/EXPORT_COMPLETED trail. Memory
and throughput only.

Signed-off-by: Justin Lanfermann <Justin@Lanfermann.dev>
jaylann added the labels type:feat (New capability), area:export (Art. 15 export engine), area:adapters (ORM/storage adapters), and area:s3 (effaced-s3 package) on Jun 20, 2026

jaylann (Owner, Author) left a comment

BLOCKER (would be REQUEST_CHANGES, but GitHub forbids that on one's own PR): the batched-anonymize paging silently changes what gets erased when an ANONYMIZE column overlaps the primary key, which is a representable manifest. That makes the "byte-identical for any input / MINOR" claim untrue for that case; see the inline comment on erasure_executor.py. The streaming Exporter and S3 iter_object_records paths look output-equivalent and well-tested. There is also one non-blocking docstring nit on the resolver memory bound.

Comment thread packages/effaced/src/effaced/adapters/sqlalchemy/erasure_executor.py Outdated
Comment thread packages/effaced/src/effaced/export/exporter.py
The batched ANONYMIZE paged matched PKs with select().order_by(pk).offset(done):
safe only while the ordering key is stable. But a PK column is a legal ANONYMIZE
target (_table_steps emits it), and a String/Uuid PK draws a fresh unique
surrogate, mutating the ordering key mid-walk so an OFFSET window skips rows and
PII survives erasure (reviewer-caught HIGH on #162).
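
The skip described above can be reproduced in a few lines. This is a hypothetical reconstruction using the standard-library `sqlite3` module, not the executor's actual code: the `users` table, the batch size, and the counter-based surrogates are assumptions. The surrogates are chosen to sort after the original keys so the outcome is deterministic; real random surrogates land anywhere in the ordering, so which rows are skipped would depend on the data.

```python
import sqlite3
from itertools import count


def make_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, subject_id INTEGER)")
    conn.executemany(
        "INSERT INTO users VALUES (?, 1)",
        [(f"user{i}@example.com",) for i in range(6)],
    )
    return conn


def anonymize_with_offset(conn: sqlite3.Connection, batch: int = 2) -> None:
    """Flawed walk: ORDER BY pk + OFFSET while the pk itself is being rewritten."""
    surrogates = count()
    done = 0
    while True:
        keys = [
            r[0]
            for r in conn.execute(
                "SELECT email FROM users WHERE subject_id = 1 "
                "ORDER BY email LIMIT ? OFFSET ?",
                (batch, done),
            )
        ]
        if not keys:
            break
        for key in keys:
            # The surrogate replaces the very column the walk is ordered by,
            # so the row moves to a new position in the ordering.
            conn.execute(
                "UPDATE users SET email = ? WHERE email = ?",
                (f"zz-{next(surrogates)}", key),
            )
        done += len(keys)


conn = make_db()
anonymize_with_offset(conn)
survivors = [
    r[0]
    for r in conn.execute(
        "SELECT email FROM users WHERE email LIKE 'user%' ORDER BY email"
    )
]
```

After the first batch rewrites `user0` and `user1`, the remaining originals shift to the front of the ordering, so `OFFSET 2` lands on `user4` and `user5`. In this setup `user2@example.com` and `user3@example.com` are never visited and keep their original values.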

Page by a keyset cursor (where(pk > last)) when the PK is a single,
non-anonymized column; otherwise capture every matching key in one up-front
select and rewrite row by row (the prior all-keys-first behaviour, scoped to
composite or self-anonymized PKs). Erased output stays byte-identical to the
materializing path; this remains a memory/throughput change (MINOR).
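
A rough sketch of that selection logic, again with the standard-library `sqlite3` module rather than the SQLAlchemy adapter. The function names, the `users` table, and the counter-based surrogates are hypothetical, the composite-PK branch is left out, and the subject filter is omitted so every row in the table is rewritten.

```python
import sqlite3
from itertools import count
from typing import Iterator


def iter_key_batches(
    conn: sqlite3.Connection, table: str, pk: str, pk_is_anonymized: bool, batch: int
) -> Iterator[list]:
    """Yield batches of primary keys that are safe to rewrite."""
    if pk_is_anonymized:
        # Fallback: the ordering key will mutate, so capture every key up front.
        keys = [r[0] for r in conn.execute(f"SELECT {pk} FROM {table} ORDER BY {pk}")]
        if keys:
            yield keys
        return
    last = None
    while True:
        # Keyset cursor: resume strictly after the last key already handled.
        if last is None:
            cur = conn.execute(
                f"SELECT {pk} FROM {table} ORDER BY {pk} LIMIT ?", (batch,)
            )
        else:
            cur = conn.execute(
                f"SELECT {pk} FROM {table} WHERE {pk} > ? ORDER BY {pk} LIMIT ?",
                (last, batch),
            )
        keys = [r[0] for r in cur]
        if not keys:
            return
        yield keys
        last = keys[-1]


def anonymize(
    conn: sqlite3.Connection, table: str, pk: str, column: str, batch: int = 2
) -> int:
    """Rewrite `column` with a fresh per-row surrogate; return rows touched."""
    surrogates = count()
    rewritten = 0
    for keys in iter_key_batches(conn, table, pk, pk == column, batch):
        for key in keys:
            conn.execute(
                f"UPDATE {table} SET {column} = ? WHERE {pk} = ?",
                (f"zz-{next(surrogates)}", key),
            )
            rewritten += 1
    return rewritten


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [(f"user{i}@example.com", f"name{i}") for i in range(5)],
)
keyset_count = anonymize(conn, "users", pk="email", column="name")     # keyset path
fallback_count = anonymize(conn, "users", pk="email", column="email")  # up-front path
remaining = conn.execute(
    "SELECT COUNT(*) FROM users WHERE email LIKE 'user%' OR name LIKE 'name%'"
).fetchone()[0]
```

The keyset branch stays memory-bounded because it only ever holds one batch of keys, and it is safe because the column it orders by is not among those being rewritten. The fallback branch gives up the memory bound in exchange for a key list that cannot be disturbed by the updates.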

Adds test_anonymizing_a_string_primary_key_skips_no_rows_across_batches; aligns
the exporter docstring (external memory bound), PROOFS, and the CLAUDE.md A3
learning.

Signed-off-by: Justin Lanfermann <Justin@Lanfermann.dev>