Remove redundant per-file is_source_ingested check from email generator#268

Open
KRRT7 wants to merge 5 commits into microsoft:main from KRRT7:refactor/email-dedup-consolidation

Conversation


@KRRT7 KRRT7 commented May 4, 2026

Summary

Upfront bulk dedup (commit 1):

  • Replaces N per-file is_source_ingested DB queries with a single upfront are_sources_ingested bulk query + in-memory set lookup
  • _email_generator now takes a pre-collected email_entries list and already_ingested: set[str], skipping known files before parsing (avoids wasted parse cost)
  • Single _iter_emails call feeds both the bulk pre-filter and the generator
  • Removed dead messages_skipped tracking from the batch callback (always 0 since the framework-level filter was removed in #271)
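The bullets above can be sketched as a small, self-contained example. All names below (bulk_prefilter, email_generator, the dict-shaped entries) are hypothetical stand-ins for the project's are_sources_ingested and _email_generator, not its actual API:

```python
# Minimal sketch of the upfront bulk-dedup pattern: one bulk query plus an
# in-memory set lookup replaces N per-file DB queries. Names are illustrative.

def bulk_prefilter(entries, bulk_query):
    """Run the ingested-check once for all files, returning known source ids."""
    return bulk_query([e["source_id"] for e in entries])

def email_generator(entries, already_ingested):
    """Yield only not-yet-ingested entries, skipping before any parse work."""
    for entry in entries:
        if entry["source_id"] in already_ingested:
            continue  # known file: no parse cost spent on it
        yield entry

# Demo with an in-memory "DB" standing in for the real bulk query:
entries = [{"source_id": "a.eml"}, {"source_id": "b.eml"}, {"source_id": "c.eml"}]
fake_db = {"a.eml", "c.eml"}
already = bulk_prefilter(entries, lambda ids: fake_db.intersection(ids))
remaining = [e["source_id"] for e in email_generator(entries, already)]
print(remaining)  # prints ['b.eml']
```

The same iterable of entries can feed both the pre-filter and the generator, matching the single _iter_emails call described above.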

Cleanup:

  • Removes _flush_skipped, skip tracking variables, unused IStorageProvider import
  • Net -49 lines

Stack

This PR is stacked on #267. Merge order: #271 → #267 → #268

| # | PR | Branch | Description |
|---|----|--------|-------------|
| 1 | #271 | fix/remove-filter-ingested | Remove framework dedup |
| 2 | #267 | feat/vtt-streaming-ingestion | VTT streaming migration |
| 3 | #268 (this) | refactor/email-dedup-consolidation | Email bulk dedup |

Reproducible test

First ingest (populates DB):

mkdir -p /tmp/test_eml
# create 2 .eml files (any valid RFC 5322 .eml will do)
uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v

Re-ingest (should skip all):

uv run python tools/ingest_email.py /tmp/test_eml/ -d /tmp/dedup_test.db -v

Expected output on re-ingest:

Pre-filter: 2 of 2 already ingested

Parsing and importing emails...

Successfully ingested 0 email(s)
Ingested 0 chunk(s)
Skipped 2 already-ingested email(s)
Extracted 0 semantic references
Total time: 0.0s

Key behavior: the bulk query fires once before the generator runs. No emails are parsed on re-ingest (0.0s total time).

Test plan

  • make format check test passes (696 tests, 0 pyright errors)
  • Manual test: re-ingest same .eml files, verify all skipped via upfront bulk dedup
  • Manual test: ingest with --verbose, verify skip reporting in summary

@KRRT7 KRRT7 force-pushed the refactor/email-dedup-consolidation branch 2 times, most recently from 6ff2f29 to 2733996 Compare May 4, 2026 07:24
@KRRT7 KRRT7 marked this pull request as ready for review May 4, 2026 07:50
KRRT7 added 5 commits May 5, 2026 02:24
The framework no longer does per-batch dedup queries inside
add_messages_streaming. Callers are responsible for filtering
duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips
(one per batch) that always returned empty sets.

Closes microsoft#269
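The caller-side filtering this commit requires can be sketched as an async generator wrapper (hypothetical names; the real callers use their own filtering, e.g. the bulk pre-filter in ingest_email.py):

```python
# Hypothetical sketch of caller-side dedup: with the framework check removed,
# the caller filters duplicates before yielding messages into the stream.
import asyncio

async def yield_new_messages(messages, already_ingested):
    """Only messages whose source_id is unseen reach the streaming pipeline."""
    for msg in messages:
        if msg["source_id"] in already_ingested:
            continue  # duplicate: never enters the stream
        yield msg

async def demo():
    msgs = [{"source_id": "a"}, {"source_id": "b"}]
    return [m["source_id"] async for m in yield_new_messages(msgs, {"a"})]

print(asyncio.run(demo()))  # prints ['b']
```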

Replace manual add_messages_with_indexing batch loop with
add_messages_streaming, matching the pattern already used by
ingest_email.py and podcast_ingest.py. This pipelines LLM
extraction with DB commits for ~2x throughput on multi-batch
ingestions.

- Add source_id per message for restartability/dedup
- Add --batch-size and --concurrency CLI arguments
- Graceful ^C handling (committed batches survive)
- Replace msg_count = [0] list trick with nonlocal (per Guido's review)
- Remove verbose prints from inside the async generator that raced with
  on_batch_committed callback output during concurrent processing
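The nonlocal change mentioned above, shown in isolation (illustrative names only, not the project's actual callback signature):

```python
# The msg_count = [0] mutable-list trick replaced by a plain int with
# `nonlocal`, as referenced in the commit message above. Names illustrative.

def make_batch_callback():
    msg_count = 0  # previously: msg_count = [0]

    def on_batch_committed(batch_size):
        nonlocal msg_count  # previously: msg_count[0] += batch_size
        msg_count += batch_size

    def total():
        return msg_count

    return on_batch_committed, total

on_batch_committed, total = make_batch_callback()
on_batch_committed(8)
on_batch_committed(4)
print(total())  # prints 12
```

A closed-over int with nonlocal reads the same as the list trick but avoids the misleading single-element container.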
The framework no longer populates messages_skipped (removed in microsoft#271),
so the skipped counter and conditional output are dead code.

The streaming pipeline's _filter_ingested() already batch-checks
are_sources_ingested() once per batch. The per-file check was N
individual DB queries doing the same work. Removing it simplifies
the generator and consolidates dedup in one place.
