Skip to content

Remove redundant _filter_ingested from streaming pipeline#271

Merged
gvanrossum merged 1 commit intomicrosoft:mainfrom
KRRT7:fix/remove-filter-ingested
May 5, 2026
Merged

Remove redundant _filter_ingested from streaming pipeline#271
gvanrossum merged 1 commit intomicrosoft:mainfrom
KRRT7:fix/remove-filter-ingested

Conversation

@KRRT7
Copy link
Copy Markdown
Contributor

@KRRT7 KRRT7 commented May 5, 2026

Summary

  • Removes _filter_ingested method and all per-batch are_sources_ingested DB queries from add_messages_streaming
  • The streaming API is now a dumb pipe — callers handle dedup in their generators
  • ingest_email.py already pre-filters via is_source_ingested before yielding
  • ingest_vtt.py enforces a fresh DB (refuses to run if it exists)
  • ingest_podcast.py uses unique source_ids by construction
  • Removes dead batch_skipped counter from ingest_email.py
  • Net -213 lines

Why: on a 10k-email re-ingest with batch_size=100, the framework was issuing ~100 are_sources_ingested queries that always returned empty sets (because the caller already filtered). Pure waste.

Stack

This is the base of a 3-PR stack. Merge order: #271#267#268

# PR Branch Description
1 #271 (this) fix/remove-filter-ingested Remove framework dedup
2 #267 feat/vtt-streaming-ingestion VTT streaming migration
3 #268 refactor/email-dedup-consolidation Email bulk dedup

Test plan

  • make check passes (pyright 0 errors)
  • pytest tests/ passes (696 tests — 5 removed that tested framework-level dedup)
  • Removed tests verified as testing only the deleted behavior (pre-marking source_ids then expecting framework to skip them)

Closes #269

The framework no longer does per-batch dedup queries inside
add_messages_streaming. Callers are responsible for filtering
duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips
(one per batch) that always returned empty sets.

Closes microsoft#269
@KRRT7 KRRT7 force-pushed the fix/remove-filter-ingested branch from 1c71230 to 3875789 Compare May 5, 2026 07:24
KRRT7 added a commit to KRRT7/typeagent-py that referenced this pull request May 5, 2026
The framework no longer populates messages_skipped (removed in microsoft#271),
so the skipped counter and conditional output are dead code.
@gvanrossum gvanrossum merged commit a058b2f into microsoft:main May 5, 2026
16 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

_filter_ingested runs redundant DB queries when the caller already pre-filters

2 participants