`add_messages_streaming` calls `_filter_ingested` on every batch (`conversation_base.py` lines 318, 325, 331), which issues an `are_sources_ingested` query per batch. This is the right default — it catches within-run duplicates from earlier batches.
But when the caller already pre-filters (e.g. `ingest_email.py` does a single upfront `are_sources_ingested` call on all source_ids and skips known files before yielding), every batch's `_filter_ingested` query returns an empty set. On a fresh 10k-email run with `batch_size=100`, that's ~100 unnecessary DB round-trips.
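A minimal sketch of that caller-side pre-filter pattern. The stand-in `FakeStore`, the set-returning shape of `are_sources_ingested`, and the generator name are assumptions for illustration, not the real API:

```python
# Sketch of the upfront pre-filter pattern described above.
# FakeStore and the set-returning are_sources_ingested are assumptions.
from typing import Iterable, Iterator


class FakeStore:
    """Stand-in for the real DB-backed store (illustration only)."""

    def __init__(self, ingested: set):
        self._ingested = ingested
        self.query_count = 0

    def are_sources_ingested(self, source_ids) -> set:
        # One round-trip: which of these source_ids are already in the DB?
        self.query_count += 1
        return self._ingested.intersection(source_ids)


def prefiltered(source_ids: Iterable[str], store: FakeStore) -> Iterator[str]:
    ids = list(source_ids)
    known = store.are_sources_ingested(ids)  # single upfront query
    for sid in ids:
        if sid not in known:
            yield sid  # only unseen sources are ever yielded downstream
```

With a generator like this feeding the pipeline, every per-batch `_filter_ingested` query downstream is guaranteed to come back empty — which is exactly the redundancy described above.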
The VTT tool doesn't pre-filter at all, so it relies entirely on `_filter_ingested` — the current behavior is correct there.
A `skip_source_filter: bool = False` kwarg on `add_messages_streaming` would let callers that handle dedup themselves opt out. The default stays safe. Within-run duplicates aren't a concern for email/VTT ingestion because each `source_id` is a unique resolved file path — the generator never yields the same one twice.
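Sketched against a toy class (the real `add_messages_streaming` in `conversation_base.py` differs; the batching loop, `_add_batch` helper, and counter here are assumptions), the opt-out kwarg would look like:

```python
# Toy stand-in showing only the control flow of the proposed kwarg;
# not the real conversation_base.py implementation.
class ConversationStub:
    def __init__(self):
        self.filter_calls = 0  # counts simulated per-batch DB round-trips
        self.added = []

    def _filter_ingested(self, batch):
        self.filter_calls += 1  # one are_sources_ingested query per call
        return batch  # pretend nothing was previously ingested

    def add_messages_streaming(self, messages, *, batch_size=100,
                               skip_source_filter: bool = False):
        batch = []
        for msg in messages:
            batch.append(msg)
            if len(batch) >= batch_size:
                self._add_batch(batch, skip_source_filter)
                batch = []
        if batch:
            self._add_batch(batch, skip_source_filter)

    def _add_batch(self, batch, skip_source_filter):
        if not skip_source_filter:
            # Default path: per-batch dedup query, safe for all callers.
            batch = self._filter_ingested(batch)
        self.added.extend(batch)
```

A pre-filtering caller like `ingest_email.py` would pass `skip_source_filter=True` and skip the per-batch queries entirely; the VTT tool keeps the default and loses nothing.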