Migrate ingest_vtt.py to streaming ingestion API #267

gvanrossum merged 4 commits into microsoft:main from feat/vtt-streaming-ingestion
Conversation
Force-pushed from a7dfb7c to 817df7a
gvanrossum left a comment:
I ran this on tests/testdata/PythonDocumentary.vtt and it did its job in 1m30s, which is pretty good (though I have a hard time coming up with queries it can answer).
However, I noticed some odd interleaving in the verbose output; I assume due to concurrency:
```
$ time uv run tools/ingest_vtt.py tests/testdata/PythonDocumentary.vtt -d docu-new.db --conc 20 -v
Ingesting 1 VTT file(s):
- tests/testdata/PythonDocumentary.vtt
Target database: docu-new.db
Analyzing VTT files...
tests/testdata/PythonDocumentary.vtt:
Duration: 5037.52 seconds (83.96 minutes)
Speakers: None detected
Total duration: 5037.52 seconds (83.96 minutes)
All speakers: 0 (None detected)
Loading environment...
Setting up conversation settings...
Settings and storage provider configured
Parsing VTT files and creating messages...
auto_extract_knowledge = True
concurrency = 20
Processing messages in batches of 100 (concurrency=20)...
Processing tests/testdata/PythonDocumentary.vtt...
100 messages | +100 chunks | +501 semrefs | 12.4s (0.12s/chunk) | 12.4s elapsed
200 messages | +100 chunks | +646 semrefs | 9.2s (0.09s/chunk) | 21.6s elapsed
300 messages | +100 chunks | +620 semrefs | 15.1s (0.15s/chunk) | 36.7s elapsed
400 messages | +100 chunks | +651 semrefs | 12.0s (0.12s/chunk) | 48.7s elapsed
500 messages | +100 chunks | +631 semrefs | 10.1s (0.10s/chunk) | 58.8s elapsed
600 messages | +100 chunks | +631 semrefs | 10.3s (0.10s/chunk) | 69.1s elapsed
Extracted 769 messages so far
File time range: 0.00s to 5037.52s (with offset: 0.00s to 5037.52s)
700 messages | +100 chunks | +593 semrefs | 10.3s (0.10s/chunk) | 79.3s elapsed
769 messages | +69 chunks | +412 semrefs | 8.6s (0.12s/chunk) | 87.9s elapsed
Successfully added 769 messages
Ingested 769 chunk(s)
Extracted 4685 semantic references
Total time: 87.9s
Overall time per chunk: 0.11s/chunk
All indexes built successfully
To query the transcript, use:
python tools/query.py --database 'docu-new.db' --query 'Your question here'
real 1m29.432s
user 0m6.711s
sys 0m0.567s
$
```
Note how `Extracted 769 messages so far` and `File time range: 0.00s to 5037.52s (with offset: 0.00s to 5037.52s)` are followed by two more per-batch log lines.
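The structural cause is two independent print sites: the old code logged progress from inside the async generator, while the `on_batch_committed` callback logged per-batch stats. A toy reproduction (names invented; no real concurrency is needed to show the effect):

```python
import asyncio
from collections.abc import AsyncIterator


async def message_stream(total: int) -> AsyncIterator[int]:
    """Yield messages while also logging progress -- the pattern commit 2 removes."""
    for i in range(1, total + 1):
        if i % 3 == 0:
            print(f"generator: extracted {i} messages so far")
        yield i


async def ingest(total: int, batch_size: int) -> None:
    batch: list[int] = []
    async for msg in message_stream(total):
        batch.append(msg)
        if len(batch) == batch_size:
            print(f"callback: committed batch of {len(batch)}")
            batch.clear()
    if batch:
        print(f"callback: committed batch of {len(batch)}")


asyncio.run(ingest(10, batch_size=4))
# generator: extracted 3 messages so far
# callback: committed batch of 4
# generator: extracted 6 messages so far
# callback: committed batch of 4
# generator: extracted 9 messages so far
# callback: committed batch of 2
```

With `--conc 20` the real pipeline adds genuine nondeterminism on top, but removing the generator-side prints (commit 2 below) is what makes the output strictly per-batch.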
```python
# Process all VTT files and collect messages
all_messages: list[TranscriptMessage] = []
time_offset = 0.0  # Cumulative time offset for multiple files
msg_count = [0]  # mutable counter shared with the generator
```
Since the introduction of 'nonlocal' this is an anti-pattern.
taking a look, thanks!
Fixed both in fb29108:
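A minimal self-contained sketch of the `nonlocal` pattern the fix adopts (illustrative names, not the actual diff):

```python
import asyncio
from collections.abc import AsyncIterator


async def ingest(items: list[str]) -> int:
    msg_count = 0  # a plain int in the enclosing scope, no [0] indirection

    async def stream() -> AsyncIterator[str]:
        nonlocal msg_count  # rebind the enclosing variable directly
        for item in items:
            msg_count += 1
            yield item

    async for _ in stream():
        pass
    return msg_count


print(asyncio.run(ingest(["a", "b", "c"])))  # 3
```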
The framework no longer does per-batch dedup queries inside add_messages_streaming. Callers are responsible for filtering duplicates before yielding messages into the stream.

- ingest_email.py already pre-filters via is_source_ingested
- ingest_vtt.py enforces a fresh DB (refuses existing)
- podcast_ingest.py uses unique source_ids by construction

This eliminates ~N unnecessary are_sources_ingested DB round-trips (one per batch) that always returned empty sets.

Closes microsoft#269
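For callers that do need dedup, a pre-filter in the spirit of `ingest_email.py` can sit in front of the stream. `is_source_ingested` is named in the commit message, but the protocol and call shape below are assumptions:

```python
from collections.abc import AsyncIterator, Iterable
from typing import Protocol


class Storage(Protocol):
    """Stand-in for the storage provider; method name taken from the commit message."""

    async def is_source_ingested(self, source_id: str) -> bool: ...


class Message(Protocol):
    source_id: str


async def stream_new_messages(
    storage: Storage, messages: Iterable[Message]
) -> AsyncIterator[Message]:
    for msg in messages:
        # The framework no longer checks per batch, so the caller skips
        # anything committed on a previous run before yielding it.
        if await storage.is_source_ingested(msg.source_id):
            continue
        yield msg
```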
Replace manual add_messages_with_indexing batch loop with add_messages_streaming, matching the pattern already used by ingest_email.py and podcast_ingest.py. This pipelines LLM extraction with DB commits for ~2x throughput on multi-batch ingestions.

- Add source_id per message for restartability/dedup
- Add --batch-size and --concurrency CLI arguments
- Graceful ^C handling (committed batches survive)
- Replace msg_count = [0] list trick with nonlocal (per Guido's review)
- Remove verbose prints from inside the async generator that raced with on_batch_committed callback output during concurrent processing
The framework no longer populates messages_skipped (removed in microsoft#271), so the skipped counter and conditional output are dead code.
Force-pushed from fb29108 to 2ec72fd
Heads up — I've stacked these three PRs so they can merge cleanly in sequence: #271 → #267 → #268.

Each PR is independently green (pyright + 696 tests), and they touch different files, so conflicts won't appear if merged in order. If you want to review them independently, #271 is the architectural change — the other two are tool-level improvements that benefit from it.
Summary
Streaming migration (commit 1):

- Replace the manual `add_messages_with_indexing` batch loop with `add_messages_streaming`, matching the pattern already used by `ingest_email.py` and `podcast_ingest.py`
- Yield `TranscriptMessage` directly from the VTT parsing loop — no intermediate list allocation
- Add a `source_id` per message (`{vtt_file}#{index}`) for restartability and dedup
- Add `--batch-size` (default 100) and `--concurrency` CLI arguments
- `on_batch_committed` callback reports per-batch chunks, semrefs, and timing
- Graceful `^C` handling — committed batches survive interruption

Review feedback (commit 2):

- Replace the `msg_count = [0]` list trick with `nonlocal` (per Guido's review)

Dead code cleanup (commit 3):

- Remove `messages_skipped` tracking (always 0 since "Remove redundant _filter_ingested from streaming pipeline" #271 removed framework dedup)

All three ingest tools now use the same streaming pattern (sketched below).
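A compressed sketch of that shared pattern, assuming (not quoting) the framework's actual `add_messages_streaming` signature; the message shape and call site below are illustrative:

```python
from collections.abc import AsyncIterator
from typing import TypedDict


class Message(TypedDict):  # stand-in for TranscriptMessage
    text: str
    source_id: str


async def stream_vtt(vtt_file: str, captions: list[str]) -> AsyncIterator[Message]:
    # Yield straight out of the parsing loop -- no intermediate list.
    for index, text in enumerate(captions):
        yield {"text": text, "source_id": f"{vtt_file}#{index}"}


def on_batch_committed(chunks: int, semrefs: int, seconds: float) -> None:
    # All progress output lives in the callback, never in the generator.
    print(f"+{chunks} chunks | +{semrefs} semrefs | {seconds:.1f}s")


# Hypothetical call site; the real signature may differ:
#
#     await conversation.add_messages_streaming(
#         stream_vtt(path, captions),
#         batch_size=args.batch_size,    # --batch-size (default 100)
#         concurrency=args.concurrency,  # --concurrency
#         on_batch_committed=on_batch_committed,
#     )
```

Keeping all printing in the callback is also what prevents the interleaving noted in review.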
Stack
This PR is stacked on #271. Merge order: #271 → #267 → #268
- #271: `fix/remove-filter-ingested`
- #267: `feat/vtt-streaming-ingestion` (this PR)
- #268: `refactor/email-dedup-consolidation`

Reproducible test (Parrot Sketch VTT, ships in repo)
^C interrupt test
Output after ^C:
DB verification: 20 messages, 188 semrefs, 20 ingested sources — exactly the 2 committed batches.
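A quick way to rerun that verification, assuming the `.db` file is SQLite; the filename and table names here (`Messages`, `SemanticRefs`, `IngestedSources`) are guesses at the schema, not its actual layout:

```python
import sqlite3

# "parrot.db" is the hypothetical DB left behind by the interrupted run.
con = sqlite3.connect("parrot.db")
for table in ("Messages", "SemanticRefs", "IngestedSources"):
    (count,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(f"{table}: {count}")  # expected: 20 / 188 / 20 after two committed batches
```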
Test plan
- `make format check test` passes (696 tests, 0 pyright errors)