`add_messages_streaming` calls `_filter_ingested` on every batch (`conversation_base.py` lines 318, 325, 331), which issues an `are_sources_ingested` query per batch. This is the right default — it catches within-run duplicates from earlier batches.
But when the caller already pre-filters (e.g. `ingest_email.py` does a single upfront `are_sources_ingested` call on all source_ids and skips known files before yielding), every batch's `_filter_ingested` query returns an empty set. On a fresh 10k-email run with `batch_size=100`, that's ~100 unnecessary DB round-trips.
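A minimal sketch of that caller-side pre-filter pattern. The stand-in `FakeStore`, the set-returning shape of `are_sources_ingested`, and the generator name are assumptions for illustration, not the real API:

```python
# Sketch of the upfront pre-filter pattern described above.
# FakeStore and the set-returning are_sources_ingested are assumptions.
from typing import Iterable, Iterator


class FakeStore:
    """Stand-in for the real DB-backed store (illustration only)."""

    def __init__(self, ingested: set):
        self._ingested = ingested
        self.query_count = 0

    def are_sources_ingested(self, source_ids) -> set:
        # One round-trip: which of these source_ids are already in the DB?
        self.query_count += 1
        return self._ingested.intersection(source_ids)


def prefiltered(source_ids: Iterable[str], store: FakeStore) -> Iterator[str]:
    ids = list(source_ids)
    known = store.are_sources_ingested(ids)  # single upfront query
    for sid in ids:
        if sid not in known:
            yield sid  # only unseen sources are ever yielded downstream
```

With a generator like this feeding the pipeline, every per-batch `_filter_ingested` query downstream is guaranteed to come back empty — which is exactly the redundancy described above.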
The VTT tool doesn't pre-filter at all, so it relies entirely on `_filter_ingested` — the current behavior is correct there.
A `skip_source_filter: bool = False` kwarg on `add_messages_streaming` would let callers that handle dedup themselves opt out. The default stays safe. Within-run duplicates aren't a concern for email/VTT ingestion because each `source_id` is a unique resolved file path — the generator never yields the same one twice.
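Sketched against a toy class (the real `add_messages_streaming` in `conversation_base.py` differs; the batching loop, `_add_batch` helper, and counter here are assumptions), the opt-out kwarg would look like:

```python
# Toy stand-in showing only the control flow of the proposed kwarg;
# not the real conversation_base.py implementation.
class ConversationStub:
    def __init__(self):
        self.filter_calls = 0  # counts simulated per-batch DB round-trips
        self.added = []

    def _filter_ingested(self, batch):
        self.filter_calls += 1  # one are_sources_ingested query per call
        return batch  # pretend nothing was previously ingested

    def add_messages_streaming(self, messages, *, batch_size=100,
                               skip_source_filter: bool = False):
        batch = []
        for msg in messages:
            batch.append(msg)
            if len(batch) >= batch_size:
                self._add_batch(batch, skip_source_filter)
                batch = []
        if batch:
            self._add_batch(batch, skip_source_filter)

    def _add_batch(self, batch, skip_source_filter):
        if not skip_source_filter:
            # Default path: per-batch dedup query, safe for all callers.
            batch = self._filter_ingested(batch)
        self.added.extend(batch)
```

A pre-filtering caller like `ingest_email.py` would pass `skip_source_filter=True` and skip the per-batch queries entirely; the VTT tool keeps the default and loses nothing.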