Native Litert-lm integration with local request queuing and KV cache management #5575

@mlnomadpy

Description

Is your feature request related to a specific problem?

Yes. When running multi-agent systems locally with Litert-lm, concurrent agents compete for limited local inference resources. The standard workaround today is to stand up an external serving layer (such as the lit serve method) and point the standard Gemini or LiteLlm classes at it. This adds unnecessary overhead and removes fine-grained control over local resources.

Describe the Solution You'd Like

A native LitertModel integration that extends the BaseLLM class and is optimized specifically for local orchestration. The two critical requirements are:

  1. Built-in Request Queuing: A mechanism to queue requests at the model-instance level, so that multiple local agents don't hit the local inference engine simultaneously and cause resource starvation.
  2. KV Cache Management: Exposed controls to manage the KV cache specifically for instruction-based agents to optimize context switching and memory usage.
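To make requirement 1 concrete, here is a minimal sketch of the queuing idea: a single worker thread drains a queue so that concurrent agents are serialized in front of the inference engine rather than contending for it. The `run_inference` callable is a hypothetical stand-in for the actual Litert-lm call, not a real API.

```python
import queue
import threading

class SerializedEngine:
    """Serializes concurrent generate() calls through one worker thread."""

    def __init__(self, run_inference):
        # run_inference is a placeholder for the real local inference call.
        self._run_inference = run_inference
        self._queue = queue.Queue()
        self._worker = threading.Thread(target=self._drain, daemon=True)
        self._worker.start()

    def _drain(self):
        # Exactly one request reaches the engine at a time.
        while True:
            prompt, done = self._queue.get()
            try:
                done["result"] = self._run_inference(prompt)
            finally:
                done["event"].set()
                self._queue.task_done()

    def generate(self, prompt: str) -> str:
        # Blocks the calling agent until its request has been served.
        done = {"event": threading.Event(), "result": None}
        self._queue.put((prompt, done))
        done["event"].wait()
        return done["result"]
```

Agents from many threads can call `generate()` safely; the queue guarantees the engine never sees overlapping requests, which is the property the feature request asks for at the model-instantiation level.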

Impact on your work

This is critical for building robust, offline multi-agent setups. It lets developers run complex agent hierarchies locally without hitting out-of-memory (OOM) errors or maintaining a separate, heavy serving infrastructure alongside ADK.

Willingness to contribute

Yes. I already have a working implementation utilizing a custom BaseLLM subclass to handle the queueing and cache management, and I would be happy to submit a PR.


Describe Alternatives You've Considered

  1. Using the lit serve method to create a local endpoint and pointing a standard ADK cloud model class at it. This works but adds unnecessary latency, architectural complexity, and abstracts away direct cache control.
  2. Relying on the standard LiteLlm wrapper, which does not inherently solve the multi-agent concurrent queuing problem for constrained local resources.

Proposed API / Implementation

from google.adk.models import BaseLLM
import queue

class LiteRTModel(BaseLLM):
    def __init__(self, model_path: str, manage_kv_cache: bool = True):
        super().__init__()
        self.model_path = model_path
        self.manage_kv_cache = manage_kv_cache
        # Internal queue that serializes concurrent agent requests.
        self._request_queue = queue.Queue()
        # ... Litert-lm initialization ...

    # The generation override would pull requests off the queue one at a
    # time and swap KV cache contexts between differently instructed agents.
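For the KV cache side of the proposal, a hypothetical sketch of the bookkeeping follows: caches are keyed by each agent's system instruction and evicted LRU-style to bound memory. The `KVCacheManager` name and the placeholder cache payload are illustrative assumptions, not part of any existing Litert-lm or ADK API.

```python
from collections import OrderedDict

class KVCacheManager:
    """Keeps one KV cache per agent instruction, bounded by LRU eviction."""

    def __init__(self, max_entries: int = 4):
        self._caches = OrderedDict()
        self._max_entries = max_entries

    def acquire(self, instruction: str):
        # Reuse the cache for this instruction if one exists; otherwise
        # start fresh, evicting the least recently used entry if full.
        if instruction in self._caches:
            self._caches.move_to_end(instruction)
            return self._caches[instruction]
        if len(self._caches) >= self._max_entries:
            self._caches.popitem(last=False)  # evict LRU entry
        cache = {"instruction": instruction, "tokens": []}  # placeholder state
        self._caches[instruction] = cache
        return cache
```

Bounding the number of live caches is one way to achieve the context-switching and memory-usage control the request describes, without ever letting per-agent state grow unboundedly.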

Additional Context

This aligns with ADK's goal of being deployment-agnostic by making local-first, offline execution a first-class citizen for complex multi-agent workflows.
