### Is your feature request related to a specific problem?
Yes. When running multi-agent systems locally with Litert-lm, concurrent agents compete for limited local inference resources. The standard workaround today is to stand up an external serving layer (e.g., via the `lit serve` method) and talk to it through the existing `Gemini` or `LiteLlm` classes. This adds unnecessary overhead and removes fine-grained control over local resources.
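For reference, the workaround looks roughly like the sketch below (the endpoint URL and model name are placeholders, not part of this proposal):

```python
# Today's workaround (sketch): serve the local model behind an OpenAI-compatible
# endpoint and point ADK's LiteLlm wrapper at it. Endpoint URL and model name
# are placeholders.
from google.adk.agents import Agent
from google.adk.models.lite_llm import LiteLlm

local_model = LiteLlm(
    model="openai/local-gemma",            # placeholder model identifier
    api_base="http://localhost:8000/v1",   # the external serving layer we want to avoid
)

agent = Agent(
    name="local_agent",
    model=local_model,
    instruction="You run fully offline.",
)
```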
### Describe the Solution You'd Like
A native `LitertModel` integration that extends the `BaseLLM` class and is optimized specifically for local orchestration. The two critical requirements are:
- Built-in Request Queuing: a mechanism to queue requests at the model-instance level so that multiple local agents do not hit the local inference engine simultaneously and cause resource starvation (see the sketch after this list).
- KV Cache Management: exposed controls to manage the KV cache, specifically for instruction-based agents, to optimize context switching and memory usage.
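To make the queuing requirement concrete, here is a minimal, illustrative sketch of the intended behavior: a single worker drains an instance-level queue so the local engine only ever runs one inference at a time. `run_local_inference` is a stand-in for the actual Litert-lm call, not a real API.

```python
# Illustrative only: instance-level request queuing that serializes concurrent
# agent calls so the local inference engine handles one request at a time.
import queue
import threading
from concurrent.futures import Future


def run_local_inference(prompt: str) -> str:
    # Placeholder for the actual Litert-lm inference call.
    return f"echo: {prompt}"


class SerializedInference:
    def __init__(self):
        self._queue: queue.Queue = queue.Queue()
        threading.Thread(target=self._worker, daemon=True).start()

    def _worker(self):
        # Single consumer: requests from all agents are processed strictly in order.
        while True:
            prompt, future = self._queue.get()
            try:
                future.set_result(run_local_inference(prompt))
            except Exception as exc:
                future.set_exception(exc)
            finally:
                self._queue.task_done()

    def generate(self, prompt: str) -> str:
        future: Future = Future()
        self._queue.put((prompt, future))
        return future.result()  # blocks until the worker reaches this request
```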
### Impact on your work
This is critical for building robust, offline multi-agent setups. It allows developers to run complex agent hierarchies locally without hitting OOM errors or having to maintain a separate, heavy serving infrastructure alongside ADK.
### Willingness to contribute
Yes. I already have a working implementation that uses a custom `BaseLLM` subclass to handle the queuing and cache management, and I would be happy to submit a PR.
🟡 Recommended Information
### Describe Alternatives You've Considered
- Using the `lit serve` method to create a local endpoint and pointing a standard ADK cloud model class at it. This works but adds unnecessary latency and architectural complexity, and it abstracts away direct cache control.
- Relying on the standard `LiteLlm` wrapper, which does not inherently solve the multi-agent concurrent-queuing problem for constrained local resources.
### Proposed API / Implementation
```python
from google.adk.models import BaseLLM
import queue


class LiteRTModel(BaseLLM):
    def __init__(self, model_path: str, manage_kv_cache: bool = True):
        super().__init__()
        # Internal queue to manage concurrent agent requests
        self._request_queue = queue.Queue()
        self.manage_kv_cache = manage_kv_cache
        # ... Litert-lm initialization ...

    # Override the standard generation method to route requests through the queue
    # and handle context swapping via the KV cache for different instruction agents.
```
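A hedged sketch of how this would be used: several sub-agents share one `LiteRTModel` instance, so their requests are serialized through its internal queue. Agent names, the model path, and the hierarchy below are illustrative only.

```python
# Hypothetical usage of the proposed LiteRTModel: one shared instance across
# a small agent hierarchy running fully offline.
from google.adk.agents import Agent

local_llm = LiteRTModel(model_path="/models/gemma-local.task", manage_kv_cache=True)

researcher = Agent(name="researcher", model=local_llm, instruction="Research the topic.")
writer = Agent(name="writer", model=local_llm, instruction="Write a short summary.")

coordinator = Agent(
    name="coordinator",
    model=local_llm,
    instruction="Delegate research, then writing.",
    sub_agents=[researcher, writer],
)
```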
### Additional Context
This aligns with ADK's goal of being deployment-agnostic by making local-first, offline execution a first-class citizen for complex multi-agent workflows.