
perf: Knowledge extraction concurrency bottleneck on large ingestion jobs #250

@KRRT7


I'm looking to scale ingestion to millions of messages and am hitting a throughput bottleneck in extract_knowledge_from_text_batch.

The call site in semrefindex.py passes len(text_batch) as the concurrency, so extraction within a batch is parallel. But the batches themselves run sequentially in _add_llm_knowledge_incremental:

```python
for text_location_batch in batches:
    await semrefindex.add_batch_to_semantic_ref_index_from_list(...)
```

With batch_size=50, that's 50 concurrent extraction calls, wait for all, then next 50. A single 15-chunk message (88K chars) took 317 seconds and produced 793 semrefs — quality is great, throughput is rough.
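One direction would be to overlap batches while still capping total in-flight LLM calls. A minimal sketch with asyncio.Semaphore plus gather, assuming the per-batch call can be wrapped (process_batch here is a stand-in for add_batch_to_semantic_ref_index_from_list, and the concurrency limit is hypothetical, not an existing setting):

```python
import asyncio

async def process_batch(batch):
    # Stand-in for the real per-batch extraction call, which would
    # fan out LLM requests for each item in the batch.
    await asyncio.sleep(0.01)
    return len(batch)

async def run_batches(batches, max_concurrent_batches=4):
    # Bound the number of batches in flight instead of awaiting
    # each batch to completion before starting the next.
    sem = asyncio.Semaphore(max_concurrent_batches)

    async def guarded(batch):
        async with sem:
            return await process_batch(batch)

    # gather preserves input order, so results line up with batches.
    return await asyncio.gather(*(guarded(b) for b in batches))

results = asyncio.run(run_batches([[1, 2], [3], [4, 5, 6]]))
```

This keeps the effective concurrency at roughly max_concurrent_batches × batch_size, so the cap would need tuning against provider rate limits.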

Some initial ideas, but wanted to get the team's input on the right approach:

  • Exposing concurrency as a configurable setting rather than hardcoding it
  • Pipelining the embedding step with extraction — right now embedding finishes completely before extraction starts
  • Wiring up max_chars_per_chunk — it's defined on KnowledgeExtractor (with a TODO on line 27 of convknowledge.py) but isn't read anywhere yet. Without it, large messages exceed the embedding model's 8K token limit, and I had to add chunking on the ingestion side
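For the third point, the ingestion-side workaround I used amounts to something like the following. This is a naive character-count split, not the repo's API (chunk_text is a hypothetical helper); a real implementation would want to split on sentence or paragraph boundaries:

```python
def chunk_text(text: str, max_chars_per_chunk: int = 2048) -> list[str]:
    """Split text into fixed-size character chunks.

    Naive sketch: cuts at exact character offsets, so it can split
    mid-word. Good enough to stay under the embedding model's input
    limit, but boundary-aware splitting would preserve semantics better.
    """
    return [
        text[i : i + max_chars_per_chunk]
        for i in range(0, len(text), max_chars_per_chunk)
    ]
```

If KnowledgeExtractor honored its own max_chars_per_chunk, this kind of splitting could live inside the library instead of in every caller.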

Open to guidance on what would be most useful to tackle first.
