Intelligent LLM API Throttling, Monitoring, and Optimization
agentic-api-optimizer is a Python library that provides transparent, runtime control over LLM API usage through configurable rate limiting and comprehensive usage analytics. It automatically intercepts API calls from popular LLM libraries such as LangChain without requiring any changes to your code.
- Multi-Level Throttling: Enforce rate limits at provider-level (e.g., all Google API calls) and model-level (e.g., Gemini-specific limits)
- Token-Aware Control: Throttle based on input tokens, output tokens, or total request counts
- Transparent Interception: Works with existing code - no modifications needed to your LLM calls
- Comprehensive Analytics: Track all API requests, token usage, and throttling events with detailed statistics
- Rich CLI Analytics: Beautiful terminal-based analytics with percentiles, breakdowns, and time-series data
- Thread-Safe: Built with concurrent request handling in mind using file locks
- Flexible Configuration: JSON/dict-based config with Pydantic validation
# Clone the repository
git clone https://github.com/kfhfardin/agentic-api-optimizer.git
cd agentic-api-optimizer
# Install dependencies
poetry install
# Activate virtual environment
poetry shell

from src.api_interceptors.api_interceptor_manager import InterceptorManager
from langchain_google_genai import ChatGoogleGenerativeAI
import time
# 1. Define throttling configuration
config = {
"api_provider": {
"langchain-google-genai": {
"provider_throttle": {
"limit": 5, # Max 5 requests
"interval": 60, # Per 60 seconds
"type": "requests"
},
"models": {
"models/gemini-2.5-flash-lite": {
"model_throttle": {
"limit": 100,
"interval": 60,
"type": "requests"
},
"input_throttle": {
"limit": 10000, # Max 10K input tokens
"interval": 60, # Per minute
"type": "input_tokens"
},
"output_throttle": {
"limit": 10000, # Max 10K output tokens
"interval": 60,
"type": "output_tokens"
}
}
}
}
}
}
# 2. Initialize and enable interceptor
interceptor = InterceptorManager()
interceptor.update_api_limit_config(config)
interceptor.enable()
# 3. Use your LLM normally - throttling happens automatically!
model = "models/gemini-2.5-flash-lite"
llm = ChatGoogleGenerativeAI(model=model)
messages = [
("system", "You are a helpful assistant."),
("human", "Say hello in one word."),
]
# Make multiple requests - 6th request will be throttled
for i in range(6):
    try:
        ai_msg = llm.invoke(messages)
        print(f"Request {i}: Success - {ai_msg.content[:50]}")
        time.sleep(2)
    except Exception as e:
        print(f"Request {i} failed: {e}")
# 4. Disable when done
interceptor.disable()

Once enabled, the interceptor automatically handles rate limiting and tracks all API usage without any code changes required.
- langchain-google-genai - Production-ready with full throttling and analytics
- models/gemini-2.5-flash-lite (Google Gemini)
- models/gemini-2.0-flash-exp (Google Gemini)
- Any Google GenAI model accessible via LangChain
| Library | Status | Throttling | Analytics |
|---|---|---|---|
| gRPC (sync + async) | ✅ Production | ✅ Yes | ✅ Yes |
| requests | 🚧 In Progress | 🚧 In Progress | |
| httpx (sync + async) | 🚧 In Progress | 🚧 In Progress | |
| urllib3 | 🚧 In Progress | 🚧 In Progress | |
Configure throttling at provider and model levels. Each throttle requires:
- limit: Maximum allowed in the time window
- interval: Time window in seconds
- type: One of "requests", "input_tokens", or "output_tokens"
Required:
- provider_throttle: Provider-level request throttling (required)
- At least one model-level throttle: model_throttle, input_throttle, or output_throttle
Example Configuration:
config = {
"api_provider": {
"langchain-google-genai": {
"provider_throttle": {
"limit": 10,
"interval": 60,
"type": "requests"
},
"models": {
"models/gemini-2.5-flash-lite": {
"model_throttle": {"limit": 100, "interval": 60, "type": "requests"},
"input_throttle": {"limit": 50000, "interval": 60, "type": "input_tokens"},
"output_throttle": {"limit": 10000, "interval": 60, "type": "output_tokens"}
}
}
}
}
}

# From dictionary
interceptor.update_api_limit_config(config_dict)
# From JSON file
interceptor.update_api_limit_config("path/to/config.json")The library uses windowed rate limiting that resets after the configured interval. Limits are checked in this order:
- Provider request limit
- Model request limit
- Model input token limit
- Model output token limit
When a limit is exceeded, the request is blocked and a throttle event is logged.
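For intuition, here is a minimal, purely illustrative sketch of fixed-window limiting with the same check order. It is not the library's internal implementation; `WindowedLimit`, `check_request`, and the `expected_output_tokens` estimate are hypothetical names introduced only for this example (in practice, output tokens are only known after a response arrives).

```python
import time

# Purely illustrative fixed-window counter; not the library's internals.
class WindowedLimit:
    def __init__(self, limit, interval):
        self.limit = limit
        self.interval = interval
        self.window_start = time.monotonic()
        self.used = 0

    def allow(self, amount=1):
        now = time.monotonic()
        if now - self.window_start >= self.interval:
            # Window expired: reset the counter for the new window.
            self.window_start = now
            self.used = 0
        if self.used + amount > self.limit:
            return False
        self.used += amount
        return True

# One window per limit, mirroring the quick-start configuration above.
provider_requests = WindowedLimit(limit=5, interval=60)
model_requests = WindowedLimit(limit=100, interval=60)
model_input_tokens = WindowedLimit(limit=10_000, interval=60)
model_output_tokens = WindowedLimit(limit=10_000, interval=60)

def check_request(input_tokens, expected_output_tokens=0):
    """Consult each window in the documented order; the first exceeded limit blocks."""
    checks = [
        ("provider_throttle", provider_requests, 1),
        ("model_throttle", model_requests, 1),
        ("input_throttle", model_input_tokens, input_tokens),
        ("output_throttle", model_output_tokens, expected_output_tokens),
    ]
    for name, window, amount in checks:
        if not window.allow(amount):
            return f"blocked by {name}"  # the library would record a throttle event here
    return "allowed"

print(check_request(input_tokens=250))
```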
The Analytics CLI provides beautiful, terminal-based analytics with comprehensive statistics including percentiles, breakdowns, and success rates.
After running your API requests, analyze the results using:
poetry run python -m src.analytics.cli \
--requests record/langchain-google-genai/models/gemini-2.5-flash-lite.json \
--throttled record/langchain-google-genai/models/gemini-2.5-flash-lite_throttle.json \
--start "2025-10-01 12:00:00" \
--end "2025-12-01 12:10:00"The CLI displays rich, formatted tables with:
Summary Statistics:
- Total requests vs successful vs throttled
- Success rate percentage
- Throttled rate percentage
Token Usage (per request):
- Min/Max/Avg values
- Percentiles: P50 (median), P90, P99
Time-Series Metrics:
- Requests per minute (all, successful, throttled)
- Statistical distributions with percentiles
Throttling Breakdown:
- Events grouped by throttle type
- Count and distribution for each type
- Helps identify which limits are being hit most often
Example Output:
──────────────────────────────────────────────────────────
langchain-google-genai | models/gemini-2.5-flash-lite
──────────────────────────────────────────────────────────
Total Requests: 100
Successful Requests: 85
% of Successful Requests: 85.0%
Throttled Requests: 15
% of Throttled Requests: 15.0%
┏━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┓
┃ Metric       ┃ Min  ┃ Max  ┃ Avg  ┃ P50  ┃ P90  ┃ P99  ┃
┡━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━┩
│ Input Tokens │ 120  │ 450  │ 280  │ 275  │ 420  │ 445  │
└──────────────┴──────┴──────┴──────┴──────┴──────┴──────┘
All metrics are stored as JSON files and can be easily read for custom analysis:
import json
# Read request history
with open("record/langchain-google-genai/models/gemini-2.5-flash-lite.json") as f:
request_history = json.load(f)
# Read throttle events
with open("record/langchain-google-genai/models/gemini-2.5-flash-lite_throttle.json") as f:
throttle_events = json.load(f)# Run all tests
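As an example of custom analysis, the snippet below computes percentile-style statistics, similar to what the CLI reports, from the request_history loaded above. It is only a sketch under assumptions: the field name "input_tokens" is a guess at the record schema, so adjust it to whatever keys the JSON files actually contain.

```python
from statistics import median, quantiles

# Assumption: each record carries an "input_tokens" field; adjust to the real schema.
input_tokens = [r["input_tokens"] for r in request_history if "input_tokens" in r]

if len(input_tokens) >= 2:
    cuts = quantiles(input_tokens, n=100)  # 99 cut points: cuts[89] ~ P90, cuts[98] ~ P99
    print(f"requests analysed: {len(input_tokens)}")
    print(f"min/max/avg: {min(input_tokens)}/{max(input_tokens)}/"
          f"{sum(input_tokens) / len(input_tokens):.1f}")
    print(f"P50/P90/P99: {median(input_tokens)}/{cuts[89]}/{cuts[98]}")
```

# Run all tests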
poetry run pytest tests/ -v
# Run integration tests (requires GOOGLE_API_KEY)
export GOOGLE_API_KEY="your-api-key"
poetry run pytest integ/ -v

Coming Soon:
- langchain-anthropic: Full support for Claude models (Sonnet, Opus, Haiku)
- langchain-openai: GPT-4, GPT-4o, GPT-3.5, and other OpenAI models
- langchain-cohere: Cohere Command models
- langchain-bedrock: AWS Bedrock models
Automatically fall back to alternative models when throttled (see the sketch after this list), supporting:
- Same provider, different model (e.g., Gemini Flash → Gemini Pro)
- Cross-provider fallback (e.g., Gemini → Claude)
- Cost-optimized and latency-optimized cascading
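As a preview of the intended behaviour, here is a minimal sketch of what a fallback cascade could look like today using plain LangChain calls. Nothing here is part of the library yet: FALLBACK_ORDER and invoke_with_fallback are hypothetical names, and the choice and ordering of models is only an example.

```python
from langchain_google_genai import ChatGoogleGenerativeAI

# Hypothetical cascade: cheapest model first, stronger fallbacks afterwards.
FALLBACK_ORDER = [
    "models/gemini-2.5-flash-lite",
    "models/gemini-2.0-flash-exp",
]

def invoke_with_fallback(messages):
    """Try each model in order; fall back to the next if a call is throttled or fails."""
    last_error = None
    for model_name in FALLBACK_ORDER:
        try:
            return ChatGoogleGenerativeAI(model=model_name).invoke(messages)
        except Exception as exc:  # e.g. a throttle error raised by the interceptor
            last_error = exc
    raise last_error
```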
- Real-time dashboard with web UI
- Cost estimation across providers
- Performance metrics and export formats (Prometheus, Grafana, CSV)
- Full throttling for requests, httpx, and urllib3
- Streaming response and WebSocket support
- Dynamic rate limits based on API response headers
- Exponential backoff with jitter
- Circuit breaker pattern
- Request prioritization and queue management
Contributions are welcome!
git clone https://github.com/kfhfardin/agentic-api-optimizer.git
cd agentic-api-optimizer
poetry install
poetry run pytest

This project is licensed under the MIT License - see the LICENSE file for details.
Fardin Hoque | Email: kfhfardin@gmail.com | GitHub: @kfhfardin | LinkedIn: @kfhfardin
Dhruvil Bhatt | Email: dhruvilbhattlm10@gmail.com | GitHub: @Dhruvilbhatt | LinkedIn: @Dhruvilbhatt