
agentic-api-optimizer

Intelligent LLM API Throttling, Monitoring, and Optimization

agentic-api-optimizer is a Python library that provides transparent, runtime control over LLM API usage through configurable rate limiting and comprehensive usage analytics. It automatically intercepts API calls from popular LLM libraries such as LangChain, with no changes to your existing code.

License: MIT | Python 3.12+


πŸš€ Features

  • Multi-Level Throttling: Enforce rate limits at provider-level (e.g., all Google API calls) and model-level (e.g., Gemini-specific limits)
  • Token-Aware Control: Throttle based on input tokens, output tokens, or total request counts
  • Transparent Interception: Works with existing code - no modifications needed to your LLM calls
  • Comprehensive Analytics: Track all API requests, token usage, and throttling events with detailed statistics
  • Rich CLI Analytics: Beautiful terminal-based analytics with percentiles, breakdowns, and time-series data
  • Thread-Safe: Built with concurrent request handling in mind using file locks
  • Flexible Configuration: JSON/dict-based config with Pydantic validation

πŸ“¦ Installation

Using Poetry (Recommended)

# Clone the repository
git clone https://github.com/kfhfardin/agentic-api-optimizer.git
cd agentic-api-optimizer

# Install dependencies
poetry install

# Activate virtual environment
poetry shell

🎯 Quick Start

Basic Usage Example

from src.api_interceptors.api_interceptor_manager import InterceptorManager
from langchain_google_genai import ChatGoogleGenerativeAI
import time

# 1. Define throttling configuration
config = {
    "api_provider": {
        "langchain-google-genai": {
            "provider_throttle": {
                "limit": 5,           # Max 5 requests
                "interval": 60,       # Per 60 seconds
                "type": "requests"
            },
            "models": {
                "models/gemini-2.5-flash-lite": {
                    "model_throttle": {
                        "limit": 100,
                        "interval": 60,
                        "type": "requests"
                    },
                    "input_throttle": {
                        "limit": 10000,   # Max 10K input tokens
                        "interval": 60,   # Per minute
                        "type": "input_tokens"
                    },
                    "output_throttle": {
                        "limit": 10000,   # Max 10K output tokens
                        "interval": 60,
                        "type": "output_tokens"
                    }
                }
            }
        }
    }
}

# 2. Initialize and enable interceptor
interceptor = InterceptorManager()
interceptor.update_api_limit_config(config)
interceptor.enable()

# 3. Use your LLM normally - throttling happens automatically!
model = "models/gemini-2.5-flash-lite"
llm = ChatGoogleGenerativeAI(model=model)
messages = [
    ("system", "You are a helpful assistant."),
    ("human", "Say hello in one word."),
]

# Make multiple requests - 6th request will be throttled
for i in range(6):
    try:
        ai_msg = llm.invoke(messages)
        print(f"Request {i}: Success - {ai_msg.content[:50]}")
        time.sleep(2)
    except Exception as e:
        print(f"Request {i} failed: {e}")

# 4. Disable when done
interceptor.disable()

How It Works

Once enabled, the interceptor hooks into the transport layer used by the provider library, so every outgoing LLM call is checked against your configured limits and recorded for analytics. Your application code and LLM calls remain unchanged.
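
A rough sketch of the idea (illustrative only, using a hypothetical FakeClient; the real library patches the transport calls listed under Transport Libraries below):

import types

class FakeClient:
    """Hypothetical stand-in for a provider's transport client."""
    def send(self, payload):
        return {"echo": payload}

def enable_interception(client):
    """Replace client.send with a wrapper that inspects each call before delegating."""
    original_send = client.send

    def intercepted(self, payload):
        # Here the real interceptor would check throttle limits and record the request
        print(f"[interceptor] outgoing request: {payload!r}")
        return original_send(payload)

    client.send = types.MethodType(intercepted, client)

client = FakeClient()
enable_interception(client)
client.send("hello")  # call site is unchanged, but the request is now intercepted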


🌐 Supported Providers and Models

Currently Supported

βœ… Fully Supported API Providers

  • langchain-google-genai - Production-ready with full throttling and analytics

πŸ§ͺ Tested Models

  • models/gemini-2.5-flash-lite (Google Gemini)
  • models/gemini-2.0-flash-exp (Google Gemini)
  • Any Google GenAI model accessible via LangChain

πŸ”Œ Transport Libraries

Library                 Status          Throttling        Analytics
gRPC (sync + async)     βœ… Production    βœ… Yes             βœ… Yes
requests                ⚠️ Partial       🚧 In Progress    🚧 In Progress
httpx (sync + async)    ⚠️ Partial       🚧 In Progress    🚧 In Progress
urllib3                 ⚠️ Partial       🚧 In Progress    🚧 In Progress

βš™οΈ Configuration

Configuration Structure

Configure throttling at provider and model levels. Each throttle requires:

  • limit: Maximum allowed in the time window
  • interval: Time window in seconds
  • type: One of "requests", "input_tokens", or "output_tokens"

Required:

  • provider_throttle: Provider-level request throttling (required)
  • At least one model-level throttle: model_throttle, input_throttle, or output_throttle

Example Configuration:

config = {
    "api_provider": {
        "langchain-google-genai": {
            "provider_throttle": {
                "limit": 10,
                "interval": 60,
                "type": "requests"
            },
            "models": {
                "models/gemini-2.5-flash-lite": {
                    "model_throttle": {"limit": 100, "interval": 60, "type": "requests"},
                    "input_throttle": {"limit": 50000, "interval": 60, "type": "input_tokens"},
                    "output_throttle": {"limit": 10000, "interval": 60, "type": "output_tokens"}
                }
            }
        }
    }
}

Loading Configuration

# From dictionary
interceptor.update_api_limit_config(config_dict)

# From JSON file
interceptor.update_api_limit_config("path/to/config.json")
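
A JSON file passed this way mirrors the dictionary structure shown above; for example (the path and values are placeholders):

{
  "api_provider": {
    "langchain-google-genai": {
      "provider_throttle": {"limit": 10, "interval": 60, "type": "requests"},
      "models": {
        "models/gemini-2.5-flash-lite": {
          "model_throttle": {"limit": 100, "interval": 60, "type": "requests"}
        }
      }
    }
  }
}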

How Throttling Works

The library uses windowed rate limiting that resets after the configured interval. Limits are checked in this order:

  1. Provider request limit
  2. Model request limit
  3. Model input token limit
  4. Model output token limit

When a limit is exceeded, the request is blocked and a throttle event is logged.
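
For intuition, here is a minimal fixed-window sketch of that check order (the names and structure are illustrative assumptions, not the library's internals):

import time

class WindowCounter:
    """Fixed-window counter that resets once the configured interval has elapsed."""
    def __init__(self, limit: int, interval: int):
        self.limit, self.interval = limit, interval
        self.window_start, self.used = time.monotonic(), 0

    def allow(self, amount: int = 1) -> bool:
        now = time.monotonic()
        if now - self.window_start >= self.interval:  # window expired: start a new one
            self.window_start, self.used = now, 0
        if self.used + amount > self.limit:
            return False  # over the limit: the request would be blocked and logged
        self.used += amount
        return True

# One counter per configured limit, consulted in the documented order
provider_requests = WindowCounter(limit=10, interval=60)
model_requests = WindowCounter(limit=100, interval=60)
model_input_tokens = WindowCounter(limit=50_000, interval=60)
model_output_tokens = WindowCounter(limit=10_000, interval=60)

def allow_request(input_tokens: int, output_tokens: int) -> bool:
    # In practice output tokens are only known from the response; this just shows the ordering
    return (provider_requests.allow()
            and model_requests.allow()
            and model_input_tokens.allow(input_tokens)
            and model_output_tokens.allow(output_tokens))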


πŸ“ˆ Analytics CLI Tool

The Analytics CLI provides beautiful, terminal-based analytics with comprehensive statistics including percentiles, breakdowns, and success rates.

Usage

After running your API requests, analyze the results using:

poetry run python -m src.analytics.cli \
  --requests record/langchain-google-genai/models/gemini-2.5-flash-lite.json \
  --throttled record/langchain-google-genai/models/gemini-2.5-flash-lite_throttle.json \
  --start "2025-10-01 12:00:00" \
  --end "2025-12-01 12:10:00"

Output Features

The CLI displays rich, formatted tables with:

Summary Statistics:

  • Total requests vs successful vs throttled
  • Success rate percentage
  • Throttled rate percentage

Token Usage (per request):

  • Min/Max/Avg values
  • Percentiles: P50 (median), P90, P99

Time-Series Metrics:

  • Requests per minute (all, successful, throttled)
  • Statistical distributions with percentiles

Throttling Breakdown:

  • Events grouped by throttle type
  • Count and distribution for each type
  • Helps identify which limits are being hit most often

Example Output:

──────────────────────────────────────────────────────────
langchain-google-genai | models/gemini-2.5-flash-lite
──────────────────────────────────────────────────────────
Total Requests: 100
Successful Requests: 85
% of Successful Requests: 85.0%
Throttled Requests: 15
% of Throttled Requests: 15.0%

┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┳━━━━━━┓
┃ Metric                    ┃  Min ┃  Max ┃  Avg ┃  P50 ┃  P90 ┃  P99 ┃
┑━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━╇━━━━━━┩
β”‚ Input Tokens              β”‚  120 β”‚  450 β”‚  280 β”‚  275 β”‚  420 β”‚  445 β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”΄β”€β”€β”€β”€β”€β”€β”˜

Accessing Statistics Programmatically

All metrics are stored as JSON files and can be easily read for custom analysis:

import json

# Read request history
with open("record/langchain-google-genai/models/gemini-2.5-flash-lite.json") as f:
    request_history = json.load(f)

# Read throttle events
with open("record/langchain-google-genai/models/gemini-2.5-flash-lite_throttle.json") as f:
    throttle_events = json.load(f)
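
From there, basic aggregates are straightforward (this assumes each file contains a list of per-event records; adjust to the actual schema):

print(f"Recorded requests: {len(request_history)}")
print(f"Throttle events:   {len(throttle_events)}")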

πŸ§ͺ Testing

# Run all tests
poetry run pytest tests/ -v

# Run integration tests (requires GOOGLE_API_KEY)
export GOOGLE_API_KEY="your-api-key"
poetry run pytest integ/ -v

πŸ—ΊοΈ Roadmap

πŸ”œ Planned Features

1. Additional Provider Support

Coming Soon:

  • langchain-anthropic: Full support for Claude models (Sonnet, Opus, Haiku)
  • langchain-openai: GPT-4, GPT-4o, GPT-3.5, and other OpenAI models
  • langchain-cohere: Cohere Command models
  • langchain-bedrock: AWS Bedrock models

2. Automatic Model Switching

Automatically fallback to alternative models when throttled, supporting:

  • Same provider, different model (e.g., Gemini Flash β†’ Gemini Pro)
  • Cross-provider fallback (e.g., Gemini β†’ Claude)
  • Cost-optimized and latency-optimized cascading

3. Enhanced Analytics

  • Real-time dashboard with web UI
  • Cost estimation across providers
  • Performance metrics and export formats (Prometheus, Grafana, CSV)

4. Complete HTTP Interceptor Support

  • Full throttling for requests, httpx, urllib3
  • Streaming response and WebSocket support

5. Advanced Throttling Features

  • Dynamic rate limits based on API response headers
  • Exponential backoff with jitter (see the sketch after this list)
  • Circuit breaker pattern
  • Request prioritization and queue management
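
For reference, exponential backoff with jitter is a well-known pattern; here is a generic sketch of it (not part of this library yet):

import random
import time

def backoff_delay(attempt: int, base: float = 1.0, cap: float = 60.0) -> float:
    """Full-jitter backoff: a random delay in [0, min(cap, base * 2**attempt)]."""
    return random.uniform(0, min(cap, base * (2 ** attempt)))

def call_with_retries(func, max_attempts: int = 5):
    """Retry a callable, sleeping with jittered exponential backoff between attempts."""
    for attempt in range(max_attempts):
        try:
            return func()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(backoff_delay(attempt))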

🀝 Contributing

Contributions are welcome!

git clone https://github.com/kfhfardin/agentic-api-optimizer.git
cd agentic-api-optimizer
poetry install
poetry run pytest

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.


πŸ‘₯ Authors

Fardin Hoque | Email: kfhfardin@gmail.com | GitHub: @kfhfardin | LinkedIn: @kfhfardin

Dhruvil Bhatt | Email: dhruvilbhattlm10@gmail.com | GitHub: @Dhruvilbhatt | LinkedIn: @Dhruvilbhatt

