Intelligent LLM API Throttling, Monitoring, and Optimization
agentic-api-optimizer is a Python library that provides transparent, runtime control over LLM API usage through configurable rate limiting and comprehensive usage analytics. It automatically intercepts API calls from popular LLM libraries such as LangChain without requiring any changes to your code.
- Multi-Level Throttling: Enforce rate limits at provider-level (e.g., all Google API calls) and model-level (e.g., Gemini-specific limits)
- Token-Aware Control: Throttle based on input tokens, output tokens, or total request counts
- Transparent Interception: Works with existing code - no modifications needed to your LLM calls
- Comprehensive Analytics: Track all API requests, token usage, and throttling events with detailed statistics
- Rich CLI Analytics: Beautiful terminal-based analytics with percentiles, breakdowns, and time-series data
- Thread-Safe: Built with concurrent request handling in mind using file locks
- Flexible Configuration: JSON/dict-based config with Pydantic validation
# Clone the repository
git clone https://github.com/kfhfardin/agentic-api-optimizer.git
cd agentic-api-optimizer
# Install dependencies
poetry install
# Activate virtual environment
poetry shell

from src.api_interceptors.api_interceptor_manager import InterceptorManager
from langchain_google_genai import ChatGoogleGenerativeAI
import time
# 1. Define throttling configuration
config = {
"api_provider": {
"langchain-google-genai": {
"provider_throttle": {
"limit": 5, # Max 5 requests
"interval": 60, # Per 60 seconds
"type": "requests"
},
"models": {
"models/gemini-2.5-flash-lite": {
"model_throttle": {
"limit": 100,
"interval": 60,
"type": "requests"
},
"input_throttle": {
"limit": 10000, # Max 10K input tokens
"interval": 60, # Per minute
"type": "input_tokens"
},
"output_throttle": {
"limit": 10000, # Max 10K output tokens
"interval": 60,
"type": "output_tokens"
}
}
}
}
}
}
# 2. Initialize and enable interceptor
interceptor = InterceptorManager()
interceptor.update_api_limit_config(config)
interceptor.enable()
# 3. Use your LLM normally - throttling happens automatically!
model = "models/gemini-2.5-flash-lite"
llm = ChatGoogleGenerativeAI(model=model)
messages = [
("system", "You are a helpful assistant."),
("human", "Say hello in one word."),
]
# Make multiple requests - 6th request will be throttled
for i in range(6):
    try:
        ai_msg = llm.invoke(messages)
        print(f"Request {i}: Success - {ai_msg.content[:50]}")
        time.sleep(2)
    except Exception as e:
        print(f"Request {i} failed: {e}")
# 4. Disable when done
interceptor.disable()

Once enabled, the interceptor automatically handles rate limiting and tracks all API usage without any code changes required.
- langchain-google-genai - Production-ready with full throttling and analytics
- models/gemini-2.5-flash-lite (Google Gemini)
- models/gemini-2.0-flash-exp (Google Gemini)
- Any Google GenAI model accessible via LangChain
| Library | Status | Throttling | Analytics |
|---|---|---|---|
| gRPC (sync + async) | ✅ Production | ✅ Yes | ✅ Yes |
| requests | 🚧 In Progress | 🚧 In Progress | |
| httpx (sync + async) | 🚧 In Progress | 🚧 In Progress | |
| urllib3 | 🚧 In Progress | 🚧 In Progress | |
Configure throttling at provider and model levels. Each throttle requires:
- limit: Maximum allowed in the time window
- interval: Time window in seconds
- type: One of "requests", "input_tokens", or "output_tokens"
Required:
- provider_throttle: Provider-level request throttling (required)
- At least one model-level throttle: model_throttle, input_throttle, or output_throttle
Example Configuration:
config = {
"api_provider": {
"langchain-google-genai": {
"provider_throttle": {
"limit": 10,
"interval": 60,
"type": "requests"
},
"models": {
"models/gemini-2.5-flash-lite": {
"model_throttle": {"limit": 100, "interval": 60, "type": "requests"},
"input_throttle": {"limit": 50000, "interval": 60, "type": "input_tokens"},
"output_throttle": {"limit": 10000, "interval": 60, "type": "output_tokens"}
}
}
}
}
}

# From dictionary
interceptor.update_api_limit_config(config_dict)
# From JSON file
interceptor.update_api_limit_config("path/to/config.json")The library uses windowed rate limiting that resets after the configured interval. Limits are checked in this order:
- Provider request limit
- Model request limit
- Model input token limit
- Model output token limit
When a limit is exceeded, the request is blocked and a throttle event is logged.
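For intuition, here is a minimal, purely illustrative sketch of fixed-window limiting with the same check order. It is not the library's internal implementation; `WindowedLimit`, `check_request`, and the `expected_output_tokens` estimate are hypothetical names introduced only for this example (in practice, output tokens are only known after a response arrives).

```python
import time

# Purely illustrative fixed-window counter; not the library's internals.
class WindowedLimit:
    def __init__(self, limit, interval):
        self.limit = limit
        self.interval = interval
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self, amount=1):
        now = time.monotonic()
        if now - self.window_start >= self.interval:
            # Window expired: reset the counter for the new window.
            self.window_start = now
            self.used = 0
        if self.used + amount > self.limit:
            return False
        self.used += amount
        return True

# One window per limit, mirroring the quick-start configuration above.
provider_requests = WindowedLimit(limit=5, interval=60)
model_requests = WindowedLimit(limit=100, interval=60)
model_input_tokens = WindowedLimit(limit=10_000, interval=60)
model_output_tokens = WindowedLimit(limit=10_000, interval=60)

def check_request(input_tokens, expected_output_tokens=0):
    """Consult each window in the documented order; the first exceeded limit blocks."""
    checks = [
        ("provider_throttle", provider_requests, 1),
        ("model_throttle", model_requests, 1),
        ("input_throttle", model_input_tokens, input_tokens),
        ("output_throttle", model_output_tokens, expected_output_tokens),
    ]
    for name, window, amount in checks:
        if not window.allow(amount):
            return f"blocked by {name}"  # the library would record a throttle event here
    return "allowed"

print(check_request(input_tokens=250))
```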
The Analytics CLI provides beautiful, terminal-based analytics with comprehensive statistics including percentiles, breakdowns, and success rates.
After running your API requests, analyze the results using:
poetry run python -m src.analytics.cli \
--requests record/langchain-google-genai/models/gemini-2.5-flash-lite.json \
--throttled record/langchain-google-genai/models/gemini-2.5-flash-lite_throttle.json \
--start "2025-10-01 12:00:00" \
--end "2025-12-01 12:10:00"The CLI displays rich, formatted tables with:
Summary Statistics:
- Total requests vs successful vs throttled
- Success rate percentage
- Throttled rate percentage
Token Usage (per request):
- Min/Max/Avg values
- Percentiles: P50 (median), P90, P99
Time-Series Metrics:
- Requests per minute (all, successful, throttled)
- Statistical distributions with percentiles
Throttling Breakdown:
- Events grouped by throttle type
- Count and distribution for each type
- Helps identify which limits are being hit most often
Example Output:
──────────────────────────────────────────────────────────
langchain-google-genai | models/gemini-2.5-flash-lite
──────────────────────────────────────────────────────────
Total Requests: 100
Successful Requests: 85
% of Successful Requests: 85.0%
Throttled Requests: 15
% of Throttled Requests: 15.0%
┏━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┓
┃ Metric       ┃ Min  ┃ Max  ┃ Avg  ┃ P50  ┃ P90  ┃ P99  ┃
┡━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━┩
│ Input Tokens │ 120  │ 450  │ 280  │ 275  │ 420  │ 445  │
└──────────────┴──────┴──────┴──────┴──────┴──────┴──────┘
All metrics are stored as JSON files and can be easily read for custom analysis:
import json
# Read request history
with open("record/langchain-google-genai/models/gemini-2.5-flash-lite.json") as f:
request_history = json.load(f)
# Read throttle events
with open("record/langchain-google-genai/models/gemini-2.5-flash-lite_throttle.json") as f:
throttle_events = json.load(f)# Run all tests
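As an example of custom analysis, the snippet below computes percentile-style statistics, similar to what the CLI reports, from the request_history loaded above. It is only a sketch under assumptions: the field name "input_tokens" is a guess at the record schema, so adjust it to whatever keys the JSON files actually contain.

```python
from statistics import median, quantiles

# Assumption: each record carries an "input_tokens" field; adjust to the real schema.
input_tokens = [r["input_tokens"] for r in request_history if "input_tokens" in r]

if len(input_tokens) >= 2:
    cuts = quantiles(input_tokens, n=100)  # 99 cut points: cuts[89] ~ P90, cuts[98] ~ P99
    print(f"requests analysed: {len(input_tokens)}")
    print(f"min/max/avg: {min(input_tokens)}/{max(input_tokens)}/"
          f"{sum(input_tokens) / len(input_tokens):.1f}")
    print(f"P50/P90/P99: {median(input_tokens)}/{cuts[89]}/{cuts[98]}")
```

# Run all tests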
poetry run pytest tests/ -v
# Run integration tests (requires GOOGLE_API_KEY)
export GOOGLE_API_KEY="your-api-key"
poetry run pytest integ/ -v

Coming Soon:
- langchain-anthropic: Full support for Claude models (Sonnet, Opus, Haiku)
- langchain-openai: GPT-4, GPT-4o, GPT-3.5, and other OpenAI models
- langchain-cohere: Cohere Command models
- langchain-bedrock: AWS Bedrock models
Automatically fall back to alternative models when throttled (see the sketch after this list), supporting:
- Same provider, different model (e.g., Gemini Flash → Gemini Pro)
- Cross-provider fallback (e.g., Gemini → Claude)
- Cost-optimized and latency-optimized cascading
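As a preview of the intended behaviour, here is a minimal sketch of what a fallback cascade could look like today using plain LangChain calls. Nothing here is part of the library yet: FALLBACK_ORDER and invoke_with_fallback are hypothetical names, and the choice and ordering of models is only an example.

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# Hypothetical cascade: cheapest model first, stronger fallbacks afterwards.
FALLBACK_ORDER = [
    "models/gemini-2.5-flash-lite",
    "models/gemini-2.0-flash-exp",
]

def invoke_with_fallback(messages):
    """Try each model in order; fall back to the next if a call is throttled or fails."""
    last_error = None
    for model_name in FALLBACK_ORDER:
        try:
            return ChatGoogleGenerativeAI(model=model_name).invoke(messages)
        except Exception as exc:  # e.g. a throttle error raised by the interceptor
            last_error = exc
    raise last_error
```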
- Real-time dashboard with web UI
- Cost estimation across providers
- Performance metrics and export formats (Prometheus, Grafana, CSV)
- Full throttling for requests, httpx, and urllib3
- Streaming response and WebSocket support
- Dynamic rate limits based on API response headers
- Exponential backoff with jitter
- Circuit breaker pattern
- Request prioritization and queue management
Contributions are welcome!
git clone https://github.com/kfhfardin/agentic-api-optimizer.git
cd agentic-api-optimizer
poetry install
poetry run pytest

This project is licensed under the MIT License - see the LICENSE file for details.
Fardin Hoque | Email: kfhfardin@gmail.com | GitHub: @kfhfardin | LinkedIn: @kfhfardin
Dhruvil Bhatt | Email: dhruvilbhattlm10@gmail.com | GitHub: @Dhruvilbhatt | LinkedIn: @Dhruvilbhatt