Remove redundant per-file is_source_ingested check from email generator #268
Open
KRRT7 wants to merge 5 commits into microsoft:main
Force-pushed from 6ff2f29 to 2733996
The framework no longer does per-batch dedup queries inside add_messages_streaming. Callers are responsible for filtering duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips (one per batch) that always returned empty sets. Closes microsoft#269
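A minimal sketch of what "callers filter before yielding" can look like. The Message class, new_messages_only, and collect are hypothetical names made up for illustration; they are not the framework's API, and the real ingestion scripts may structure this differently.

```python
import asyncio
from collections.abc import AsyncIterator, Iterable
from dataclasses import dataclass


@dataclass
class Message:
    """Hypothetical stand-in for a framework message carrying a source_id."""
    source_id: str
    text: str


async def new_messages_only(
    messages: Iterable[Message], already_ingested: set[str]
) -> AsyncIterator[Message]:
    """Yield only messages whose source_id has not been seen.

    The caller does the filtering here, so whatever consumes this stream
    never needs to issue its own dedup queries.
    """
    seen = set(already_ingested)  # copy so the caller's set is not mutated
    for msg in messages:
        if msg.source_id in seen:
            continue  # duplicate: never enters the stream
        seen.add(msg.source_id)  # also guards against repeats within this run
        yield msg


async def collect(stream: AsyncIterator[Message]) -> list[Message]:
    """Drain an async stream into a list (stand-in for the real consumer)."""
    return [msg async for msg in stream]
```

In this sketch the consumer sees only new source_ids, which is the property the commit relies on when it drops the per-batch queries.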
Replace manual add_messages_with_indexing batch loop with add_messages_streaming, matching the pattern already used by ingest_email.py and podcast_ingest.py. This pipelines LLM extraction with DB commits for ~2x throughput on multi-batch ingestions.

- Add source_id per message for restartability/dedup
- Add --batch-size and --concurrency CLI arguments
- Graceful ^C handling (committed batches survive)
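To illustrate the idea of pipelining extraction with commits, here is a rough sketch in plain asyncio. It is not the add_messages_streaming implementation; extract, commit, and ingest_pipelined are hypothetical stand-ins, and the sleeps merely simulate LLM and DB latency.

```python
import asyncio


async def extract(batch: list[str]) -> list[str]:
    """Stand-in for LLM extraction (assumed to be slow and I/O bound)."""
    await asyncio.sleep(0.01)
    return [text.upper() for text in batch]


async def commit(db: list[str], extracted: list[str]) -> None:
    """Stand-in for a DB commit (also assumed to be I/O bound)."""
    await asyncio.sleep(0.01)
    db.extend(extracted)


async def ingest_pipelined(batches: list[list[str]], db: list[str]) -> None:
    """Overlap extraction of batch k+1 with the commit of batch k."""
    commit_task: asyncio.Task | None = None
    for batch in batches:
        # While this await is suspended, the previous commit task (if any)
        # makes progress on the same event loop.
        extracted = await extract(batch)
        if commit_task is not None:
            await commit_task  # keep commits strictly ordered
        commit_task = asyncio.create_task(commit(db, extracted))
    if commit_task is not None:
        await commit_task  # flush the final batch
```

With extraction and commit taking similar time, overlapping the two stages is where a speedup approaching 2x on multi-batch runs could come from; the actual gain depends on the real latencies.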
- Replace msg_count = [0] list trick with nonlocal (per Guido's review)
- Remove verbose prints from inside the async generator that raced with on_batch_committed callback output during concurrent processing
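For readers unfamiliar with the list trick, the following sketch shows the nonlocal form inside a nested async generator. The function names are hypothetical and the surrounding ingestion code is omitted; only the counter pattern is the point.

```python
import asyncio
from collections.abc import AsyncIterator, Callable


def make_message_stream(
    texts: list[str],
) -> tuple[Callable[[], AsyncIterator[str]], Callable[[], int]]:
    """Return an async generator factory plus a reader for its counter."""
    msg_count = 0  # plain int; the older workaround was msg_count = [0]

    async def generate() -> AsyncIterator[str]:
        nonlocal msg_count  # rebind the enclosing variable directly
        for text in texts:
            msg_count += 1
            yield text

    def yielded_so_far() -> int:
        return msg_count

    return generate, yielded_so_far


async def drain(stream: AsyncIterator[str]) -> list[str]:
    """Consume the whole stream and return its items."""
    return [item async for item in stream]
```

The nonlocal declaration lets the inner function assign to the enclosing variable, so the one-element list is no longer needed as a mutable container.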
The framework no longer populates messages_skipped (removed in microsoft#271), so the skipped counter and conditional output are dead code.
The streaming pipeline's _filter_ingested() already batch-checks are_sources_ingested() once per batch. The per-file check was N individual DB queries doing the same work. Removing it simplifies the generator and consolidates dedup in one place.
Force-pushed from 2733996 to 5ec2a50
This was referenced May 5, 2026
gvanrossum approved these changes on May 5, 2026
Summary
Upfront bulk dedup (commit 1):
- Replace per-file is_source_ingested DB queries with a single upfront are_sources_ingested bulk query + in-memory set lookup
- _email_generator now takes a pre-collected email_entries list and already_ingested: set[str], skipping known files before parsing (avoids wasted parse cost)
- The _iter_emails call feeds both the bulk pre-filter and the generator
- Remove messages_skipped tracking from batch callback (always 0 since #271, "Remove redundant _filter_ingested from streaming pipeline")

Cleanup:
- Remove _flush_skipped, skip tracking variables, unused IStorageProvider import

Stack
This PR is stacked on #267. Merge order: #271 → #267 → #268
fix/remove-filter-ingested
feat/vtt-streaming-ingestion
refactor/email-dedup-consolidation

Reproducible test
First ingest (populates DB):
Re-ingest (should skip all):
Expected output on re-ingest:
Key behavior: the bulk query fires once before the generator runs. No emails are parsed on re-ingest (0.0s total time).
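The key behavior can be sketched as follows. This is an illustration under assumptions, not the PR's code: FakeStore, parse_email, email_generator, and ingest are hypothetical, and the are_sources_ingested signature used here (list of ids in, set of already-ingested ids out) is assumed rather than taken from the framework.

```python
import asyncio
from collections.abc import AsyncIterator


class FakeStore:
    """In-memory stand-in for the storage provider; counts bulk queries."""

    def __init__(self, ingested: set[str]) -> None:
        self._ingested = set(ingested)
        self.query_count = 0

    async def are_sources_ingested(self, source_ids: list[str]) -> set[str]:
        # Assumed signature: returns the subset of ids already in the DB.
        self.query_count += 1
        return self._ingested.intersection(source_ids)


def parse_email(path: str, parse_log: list[str]) -> str:
    """Stand-in for the (expensive) parse step; records what was parsed."""
    parse_log.append(path)
    return f"parsed:{path}"


async def email_generator(
    email_entries: list[str], already_ingested: set[str], parse_log: list[str]
) -> AsyncIterator[str]:
    for path in email_entries:
        if path in already_ingested:
            continue  # in-memory set lookup, skipped before any parsing
        yield parse_email(path, parse_log)


async def ingest(
    store: FakeStore, email_entries: list[str], parse_log: list[str]
) -> list[str]:
    # One bulk query, issued before the generator starts running.
    already = await store.are_sources_ingested(email_entries)
    return [m async for m in email_generator(email_entries, already, parse_log)]
```

On a re-ingest where every entry is already known, this sketch issues one query and parses nothing, which mirrors the behavior described above.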
Test plan
- make format check test passes (696 tests, 0 pyright errors)
- Run with --verbose, verify skip reporting in summary