Remove redundant per-file is_source_ingested check from email generator#268

Open
KRRT7 wants to merge 5 commits into microsoft:main from KRRT7:refactor/email-dedup-consolidation

Conversation


@KRRT7 KRRT7 commented May 4, 2026

Summary

Upfront bulk dedup (commit 1):

  • Replaces N per-file is_source_ingested DB queries with a single upfront are_sources_ingested bulk query + in-memory set lookup
  • _email_generator now takes a pre-collected email_entries list and already_ingested: set[str], skipping known files before parsing (avoids wasted parse cost)
  • Single _iter_emails call feeds both the bulk pre-filter and the generator
  • Removed dead messages_skipped tracking from the batch callback (always 0 since the framework-level filter was removed in #271)
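The bullets above can be sketched as a small, self-contained example. All names below (bulk_prefilter, email_generator, the dict-shaped entries) are hypothetical stand-ins for the project's are_sources_ingested and _email_generator, not its actual API:

```python
# Minimal sketch of the upfront bulk-dedup pattern: one bulk query plus an
# in-memory set lookup replaces N per-file DB queries. Names are illustrative.

def bulk_prefilter(entries, bulk_query):
    """Run the ingested-check once for all files, returning known source ids."""
    return bulk_query([e["source_id"] for e in entries])

def email_generator(entries, already_ingested):
    """Yield only not-yet-ingested entries, skipping before any parse work."""
    for entry in entries:
        if entry["source_id"] in already_ingested:
            continue  # known file: no parse cost spent on it
        yield entry

# Demo with an in-memory "DB" standing in for the real bulk query:
entries = [{"source_id": "a.eml"}, {"source_id": "b.eml"}, {"source_id": "c.eml"}]
fake_db = {"a.eml", "c.eml"}
already = bulk_prefilter(entries, lambda ids: fake_db.intersection(ids))
remaining = [e["source_id"] for e in email_generator(entries, already)]
print(remaining)  # prints ['b.eml']
```

The same iterable of entries can feed both the pre-filter and the generator, matching the single _iter_emails call described above.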

Cleanup:

  • Removes _flush_skipped, skip tracking variables, unused IStorageProvider import
  • Net -49 lines

Stack

This PR is stacked on #267. Merge order: #271 → #267 → #268

| # | PR | Branch | Description |
|---|----|--------|-------------|
| 1 | #271 | fix/remove-filter-ingested | Remove framework dedup |
| 2 | #267 | feat/vtt-streaming-ingestion | VTT streaming migration |
| 3 | #268 (this) | refactor/email-dedup-consolidation | Email bulk dedup |

Reproducible test

First ingest (populates DB):

mkdir -p /tmp/test_eml
# create 2 .eml files (any valid RFC 5322 .eml will do)
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v

Re-ingest (should skip all):

uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v

Expected output on re-ingest:

Pre-filter: 2 of 2 already ingested

Parsing and importing emails...

Successfully ingested 0 email(s)
Ingested 0 chunk(s)
Skipped 2 already-ingested email(s)
Extracted 0 semantic references
Total time: 0.0s

Key behavior: the bulk query fires once before the generator runs. No emails are parsed on re-ingest (0.0s total time).

Test plan

  • make format check test passes (696 tests, 0 pyright errors)
  • Manual test: re-ingest same .eml files, verify all skipped via upfront bulk dedup
  • Manual test: ingest with --verbose, verify skip reporting in summary

@KRRT7 KRRT7 force-pushed the refactor/email-dedup-consolidation branch 2 times, most recently from 6ff2f29 to 2733996 Compare May 4, 2026 07:24
@KRRT7 KRRT7 marked this pull request as ready for review May 4, 2026 07:50
KRRT7 added 5 commits May 5, 2026 02:24
The framework no longer does per-batch dedup queries inside
add_messages_streaming. Callers are responsible for filtering
duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips
(one per batch) that always returned empty sets.

Closes microsoft#269
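The caller-side filtering this commit requires can be sketched as an async generator wrapper (hypothetical names; the real callers use their own filtering, e.g. the bulk pre-filter in ingest_email.py):

```python
# Hypothetical sketch of caller-side dedup: with the framework check removed,
# the caller filters duplicates before yielding messages into the stream.
import asyncio

async def yield_new_messages(messages, already_ingested):
    """Only messages whose source_id is unseen reach the streaming pipeline."""
    for msg in messages:
        if msg["source_id"] in already_ingested:
            continue  # duplicate: never enters the stream
        yield msg

async def demo():
    msgs = [{"source_id": "a"}, {"source_id": "b"}]
    return [m["source_id"] async for m in yield_new_messages(msgs, {"a"})]

print(asyncio.run(demo()))  # prints ['b']
```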

Replace manual add_messages_with_indexing batch loop with
add_messages_streaming, matching the pattern already used by
ingest_email.py and podcast_ingest.py. This pipelines LLM
extraction with DB commits for ~2x throughput on multi-batch
ingestions.

- Add source_id per message for restartability/dedup
- Add --batch-size and --concurrency CLI arguments
- Graceful ^C handling (committed batches survive)
- Replace msg_count = [0] list trick with nonlocal (per Guido's review)
- Remove verbose prints from inside the async generator that raced with
  on_batch_committed callback output during concurrent processing
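The nonlocal change mentioned above, shown in isolation (illustrative names only, not the project's actual callback signature):

```python
# The msg_count = [0] mutable-list trick replaced by a plain int with
# `nonlocal`, as referenced in the commit message above. Names illustrative.

def make_batch_callback():
    msg_count = 0  # previously: msg_count = [0]

    def on_batch_committed(batch_size):
        nonlocal msg_count  # previously: msg_count[0] += batch_size
        msg_count += batch_size

    def total():
        return msg_count

    return on_batch_committed, total

on_batch_committed, total = make_batch_callback()
on_batch_committed(8)
on_batch_committed(4)
print(total())  # prints 12
```

A closed-over int with nonlocal reads the same as the list trick but avoids the misleading single-element container.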
The framework no longer populates messages_skipped (removed in microsoft#271),
so the skipped counter and conditional output are dead code.

The streaming pipeline's _filter_ingested() already batch-checks
are_sources_ingested() once per batch. The per-file check was N
individual DB queries doing the same work. Removing it simplifies
the generator and consolidates dedup in one place.
