
perf: Knowledge extraction concurrency bottleneck on large ingestion jobs #250

@KRRT7


I'm looking to scale ingestion to millions of messages and am hitting a throughput bottleneck in extract_knowledge_from_text_batch.

The call site in semrefindex.py passes len(text_batch) as the concurrency, so extraction within a batch is parallel. But the batches themselves run sequentially in _add_llm_knowledge_incremental:

```python
for text_location_batch in batches:
    await semrefindex.add_batch_to_semantic_ref_index_from_list(...)
```

With batch_size=50, that's 50 concurrent extraction calls, wait for all, then next 50. A single 15-chunk message (88K chars) took 317 seconds and produced 793 semrefs — quality is great, throughput is rough.
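One direction would be to overlap batches while still capping total in-flight LLM calls. A minimal sketch with asyncio.Semaphore plus gather, assuming the per-batch call can be wrapped (process_batch here is a stand-in for add_batch_to_semantic_ref_index_from_list, and the concurrency limit is hypothetical, not an existing setting):

```python
import asyncio

async def process_batch(batch):
    # Stand-in for the real per-batch extraction call, which would
    # fan out LLM requests for each item in the batch.
    await asyncio.sleep(0.01)
    return len(batch)

async def run_batches(batches, max_concurrent_batches=4):
    # Bound the number of batches in flight instead of awaiting
    # each batch to completion before starting the next.
    sem = asyncio.Semaphore(max_concurrent_batches)

    async def guarded(batch):
        async with sem:
            return await process_batch(batch)

    # gather preserves input order, so results line up with batches.
    return await asyncio.gather(*(guarded(b) for b in batches))

results = asyncio.run(run_batches([[1, 2], [3], [4, 5, 6]]))
```

This keeps the effective concurrency at roughly max_concurrent_batches × batch_size, so the cap would need tuning against provider rate limits.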

Some initial ideas, but wanted to get the team's input on the right approach:

  • Exposing concurrency as a configurable setting rather than hardcoding it
  • Pipelining the embedding step with extraction — right now embedding finishes completely before extraction starts
  • Wiring up max_chars_per_chunk — it's defined on KnowledgeExtractor (with a TODO on line 27 of convknowledge.py) but isn't read anywhere yet. Without it, large messages exceed the embedding model's 8K token limit, and I had to add chunking on the ingestion side
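For the third point, the ingestion-side workaround I used amounts to something like the following. This is a naive character-count split, not the repo's API (chunk_text is a hypothetical helper); a real implementation would want to split on sentence or paragraph boundaries:

```python
def chunk_text(text: str, max_chars_per_chunk: int = 2048) -> list[str]:
    """Split text into fixed-size character chunks.

    Naive sketch: cuts at exact character offsets, so it can split
    mid-word. Good enough to stay under the embedding model's input
    limit, but boundary-aware splitting would preserve semantics better.
    """
    return [
        text[i : i + max_chars_per_chunk]
        for i in range(0, len(text), max_chars_per_chunk)
    ]
```

If KnowledgeExtractor honored its own max_chars_per_chunk, this kind of splitting could live inside the library instead of in every caller.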

Open to guidance on what would be most useful to tackle first.
