# LLM Cost Optimization Demo

Demos aligned with Part 1B Section 3: Cost Optimization Beyond Caching

1. **Routing Economics** - Calculate savings from intelligent model routing
2. **LiteLLM** - The LLM Operations Layer (vendor-agnostic gateway)
3. **Semantic Router** - Intent-based routing with local embeddings
4. **Local Router Demo** - Ollama-based routing with keyword/length heuristics
5. **Semantic Caching** - GPTCache for meaning-based caching

## Key Insight

~70% of production traffic is simple enough for the smallest capable model.
Route intelligently → 50-85% cost reduction.

## Architecture ( Short Version)

```
Incoming Query → Semantic Router (local, ~5ms) → Route Config → LiteLLM Gateway → Providers
```

## Setup

```bash
pip install litellm semantic-router sentence-transformers gptcache torch
```

**Ollama:** `ollama pull qwen3:4b && ollama pull llama3.2:1b`


In [20]:
# Setup: Logging, environment checks, and Ollama client
import subprocess
import logging
import requests
import re
import time
from typing import List, Dict, Any, Tuple
from dataclasses import dataclass

# Color-coded logging
class ColoredFormatter(logging.Formatter):
    COLORS = {'DEBUG': '\033[90m', 'INFO': '\033[92m', 'WARNING': '\033[93m', 'ERROR': '\033[91m', 'RESET': '\033[0m'}
    def format(self, record):
        color = self.COLORS.get(record.levelname, self.COLORS['RESET'])
        record.msg = f"{color}[{record.levelname}]{self.COLORS['RESET']} {record.msg}"
        return super().format(record)

logger = logging.getLogger("cost_optimization_demo")
logger.setLevel(logging.DEBUG)
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(ColoredFormatter('%(message)s'))
    logger.addHandler(handler)

# Check Ollama and available models
def check_ollama():
    try:
        result = subprocess.run(["ollama", "list"], capture_output=True, text=True, timeout=5)
        if result.returncode == 0:
            logger.info("Ollama is running")
            print(f"\nAvailable models:\n{result.stdout}")
            return True
    except Exception as e:
        logger.error(f"Ollama check failed: {e}")
    return False

ollama_ready = check_ollama()
OLLAMA_URL = "http://localhost:11434"

# Model configuration for routing demo
STRONG_MODEL = "qwen3:4b"    # More capable, slower
WEAK_MODEL = "llama3.2:1b"   # Faster, cheaper, less capable

if ollama_ready:
    logger.info(f"Strong model: {STRONG_MODEL}")
    logger.info(f"Weak model: {WEAK_MODEL}")

# Ollama helper functions
def ollama_generate(prompt: str, model: str, temperature: float = 0.7) -> Tuple[str, float]:
    """Generate response from Ollama, return (response, latency_ms)."""
    start = time.perf_counter()
    response = requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"temperature": temperature}
        },
        timeout=120
    )
    latency_ms = (time.perf_counter() - start) * 1000
    return response.json().get("response", ""), latency_ms

def clean_response(text: str) -> str:
    """Remove thinking tags from qwen3 responses."""
    return re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL).strip()

logger.info("Helper functions ready")


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
[92m[INFO][0m Ollama is running
2026-01-03 03:07:45,875 - 8495705152 - 1463702200.py-1463702200:30 - INFO: [92m[INFO][0m Ollama is running
[92m[INFO][0m Strong model: qwen3:4b
2026-01-03 03:07:45,876 - 8495705152 - 1463702200.py-1463702200:45 - INFO: [92m[INFO][0m Strong model: qwen3:4b
[92m[INFO][0m Weak model: llama3.2:1b
2026-01-03 03:07:45,877 - 8495705152 - 1463702200.py-1463702200:46 - INFO: [92m[INFO][0m Weak model: llama3.2:1b
[92m[INFO][0m Helper functions ready
2026-01-03 03:07:45,877 - 8495705152 - 1463702200.py-1463702200:69 - INFO: [92m[INFO][0m Helper functions ready



Available models:
NAME           ID              SIZE      MODIFIED   
qwen3:4b       359d7dd4bcda    2.5 GB    2 days ago    
llama3.2:1b    baf6a787fdff    1.3 GB    2 days ago    



---

# Demo 1: Routing Economics

Calculate potential savings from intelligent model routing vs. using a single model for all requests.

**Key insight:** Most production traffic (~70%) is simple enough for small models.


In [21]:
# Routing Economics: Cost Calculator

print("="*65)
print("ROUTING ECONOMICS: Cost Savings Calculator")
print("="*65)

def calculate_routing_savings(
    daily_requests: int,
    complexity_distribution: dict,  # {"simple": 0.7, "standard": 0.2, "complex": 0.1}
    model_costs: dict,              # {"simple": 0.0001, "standard": 0.001, "complex": 0.01}
    frontier_cost: float = 0.01,    # Cost per 1K tokens if using frontier for everything
    tokens_per_request: int = 2000
) -> dict:
    """
    Calculate savings from intelligent routing vs. using frontier model for all.
    
    The key insight: ~70% of production traffic is simple enough for
    the smallest capable model.
    """
    # Cost without routing (frontier for everything)
    daily_tokens = daily_requests * tokens_per_request
    daily_frontier_cost = (daily_tokens / 1000) * frontier_cost
    
    # Cost with routing
    daily_routed_cost = 0
    for complexity, fraction in complexity_distribution.items():
        tier_requests = daily_requests * fraction
        tier_tokens = tier_requests * tokens_per_request
        tier_cost = (tier_tokens / 1000) * model_costs[complexity]
        daily_routed_cost += tier_cost
    
    daily_savings = daily_frontier_cost - daily_routed_cost
    
    return {
        'daily_frontier_cost': round(daily_frontier_cost, 2),
        'daily_routed_cost': round(daily_routed_cost, 2),
        'daily_savings': round(daily_savings, 2),
        'monthly_savings': round(daily_savings * 30, 2),
        'savings_percent': round((daily_savings / daily_frontier_cost) * 100, 1)
    }

# Scenario 1: Customer support system with 100K daily queries
print("\n[Scenario 1: Customer Support - 100K requests/day]")
print("─"*65)

support_routing = calculate_routing_savings(
    daily_requests=100000,
    complexity_distribution={
        "simple": 0.70,   # FAQ, status checks, simple questions
        "standard": 0.20, # Explanations, multi-step answers
        "complex": 0.10   # Analysis, debugging, complaints
    },
    model_costs={
        "simple": 0.00015,   # GPT-4o-mini / Llama 8B
        "standard": 0.003,   # Claude Sonnet / GPT-4o
        "complex": 0.015     # Claude Opus
    },
    frontier_cost=0.015,
    tokens_per_request=2000
)

print(f"  Daily requests:            {100000:>12,}")
print(f"  Daily cost (no routing):   €{support_routing['daily_frontier_cost']:>11,.2f}")
print(f"  Daily cost (with routing): €{support_routing['daily_routed_cost']:>11,.2f}")
print(f"  Daily savings:             €{support_routing['daily_savings']:>11,.2f}")
print(f"  Monthly savings:           €{support_routing['monthly_savings']:>11,.2f}")
print(f"  Savings:                   {support_routing['savings_percent']:>12}%")

# Scenario 2: RAG Q&A system
print("\n[Scenario 2: RAG Q&A - 50K requests/day]")
print("─"*65)

rag_routing = calculate_routing_savings(
    daily_requests=50000,
    complexity_distribution={
        "simple": 0.60,   # Direct answers from context
        "standard": 0.30, # Synthesis across documents
        "complex": 0.10   # Complex reasoning
    },
    model_costs={
        "simple": 0.00015,
        "standard": 0.003,
        "complex": 0.015
    },
    frontier_cost=0.015,
    tokens_per_request=3000  # RAG uses more context
)

print(f"  Daily requests:            {50000:>12,}")
print(f"  Daily cost (no routing):   €{rag_routing['daily_frontier_cost']:>11,.2f}")
print(f"  Daily cost (with routing): €{rag_routing['daily_routed_cost']:>11,.2f}")
print(f"  Monthly savings:           €{rag_routing['monthly_savings']:>11,.2f}")
print(f"  Savings:                   {rag_routing['savings_percent']:>12}%")


ROUTING ECONOMICS: Cost Savings Calculator

[Scenario 1: Customer Support - 100K requests/day]
─────────────────────────────────────────────────────────────────
  Daily requests:                 100,000
  Daily cost (no routing):   €   3,000.00
  Daily cost (with routing): €     441.00
  Daily savings:             €   2,559.00
  Monthly savings:           €  76,770.00
  Savings:                           85.3%

[Scenario 2: RAG Q&A - 50K requests/day]
─────────────────────────────────────────────────────────────────
  Daily requests:                  50,000
  Daily cost (no routing):   €   2,250.00
  Daily cost (with routing): €     373.50
  Monthly savings:           €  56,295.00
  Savings:                           83.4%


---

# Demo 2: LiteLLM - The LLM Operations Layer

LiteLLM is a **vendor-agnostic gateway** unifying 100+ providers through a single OpenAI-compatible API.

| Capability | What It Does |
|------------|--------------|
| Unified API | Same code for OpenAI, Anthropic, Ollama, vLLM, etc. |
| Fallbacks | Auto-retry with different providers |
| Caching | Redis/Qdrant semantic caching built-in |
| PII Masking | Presidio integration for GDPR |
| Cost Tracking | Per-key budgets and alerts |
| Observability | Langfuse, Datadog, Prometheus |

**Key insight:** LiteLLM handles infrastructure; Semantic Router handles routing logic.


In [22]:
# LiteLLM: The LLM Operations Layer

print("="*65)
print("LITELLM: Vendor-Agnostic LLM Gateway")
print("="*65)

litellm_available = False

try:
    import litellm
    litellm_available = True
    print(f"\n✓ LiteLLM installed")
except ImportError:
    print("\n⚠ LiteLLM not installed")
    print("  pip install litellm")

print("""
ARCHITECTURE:
─────────────────────────────────────────────────────────────────

  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
  │   Your App   │     │   Your App   │     │   Your App   │
  │  (Service A) │     │  (Service B) │     │  (Service C) │
  └──────┬───────┘     └──────┬───────┘     └──────┬───────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
                              ▼
                    ┌──────────────────┐
                    │  LiteLLM Proxy   │◄─── Virtual Keys
                    │  (Port 4000)     │◄─── Routing Config
                    └────────┬─────────┘◄─── Budget Rules
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
         ┌────────┐    ┌────────┐    ┌────────┐
         │ Ollama │    │ Claude │    │ GPT-4  │
         │ (local)│    │ (API)  │    │ (API)  │
         └────────┘    └────────┘    └────────┘

CORE CAPABILITIES (all open source):
─────────────────────────────────────────────────────────────────

  1. UNIFIED API: Same code for 100+ providers
     
     from litellm import completion
     
     # Just change the model string
     completion(model="gpt-4o", messages=[...])
     completion(model="claude-3-sonnet", messages=[...])
     completion(model="ollama/llama3.2", messages=[...])

  2. FALLBACKS: Auto-retry with different providers
     
     model_list = [
         {"model_name": "gpt-4", "litellm_params": {"model": "gpt-4"}},
         {"model_name": "gpt-4", "litellm_params": {"model": "azure/gpt-4"}},
         {"model_name": "gpt-4", "litellm_params": {"model": "ollama/llama3"}},
     ]

  3. CACHING: Redis or Qdrant semantic caching
  4. PII MASKING: Presidio integration (GDPR)
  5. BUDGETS: Per-key cost limits and alerts
  6. OBSERVABILITY: Langfuse, Datadog, Prometheus

WHEN TO USE:
  ✓ Multiple providers (cloud + local)
  ✓ Need fallbacks for reliability
  ✓ Cost tracking across teams
  ✓ Self-hosted (data sovereignty)
  ✗ Single provider, prototype only
""")


LITELLM: Vendor-Agnostic LLM Gateway

✓ LiteLLM installed

ARCHITECTURE:
─────────────────────────────────────────────────────────────────

  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
  │   Your App   │     │   Your App   │     │   Your App   │
  │  (Service A) │     │  (Service B) │     │  (Service C) │
  └──────┬───────┘     └──────┬───────┘     └──────┬───────┘
         │                    │                    │
         └────────────────────┼────────────────────┘
                              │
                              ▼
                    ┌──────────────────┐
                    │  LiteLLM Proxy   │◄─── Virtual Keys
                    │  (Port 4000)     │◄─── Routing Config
                    └────────┬─────────┘◄─── Budget Rules
                             │
              ┌──────────────┼──────────────┐
              ▼              ▼              ▼
         ┌────────┐    ┌────────┐    ┌────────┐
         │ Ollama │    │ Claude │    │ GPT-4  │
         │

In [23]:
# LiteLLM: Functional Demos with Ollama

print("="*65)
print("LITELLM: Functional Demos with Ollama")
print("="*65)

if not litellm_available:
    print("\n⚠ LiteLLM not installed: pip install litellm")
elif not ollama_ready:
    print("\n⚠ Ollama not running")
else:
    import litellm
    import time
    
    # Suppress verbose logging
    litellm.set_verbose = False
    
    # ─────────────────────────────────────────────────────────────────
    # Demo 1: Basic Completion
    # ─────────────────────────────────────────────────────────────────
    print("\n[Demo 1: Basic Completion]")
    print("─"*65)
    
    messages = [{"role": "user", "content": "What is 2+2? Answer in one word."}]
    
    try:
        start = time.perf_counter()
        response = litellm.completion(
            model="ollama/llama3.2:1b",
            messages=messages,
            api_base="http://localhost:11434"
        )
        latency = (time.perf_counter() - start) * 1000
        
        print(f"  ✓ Model: {response.model}")
        print(f"  ✓ Response: {response.choices[0].message.content.strip()[:80]}")
        print(f"  ✓ Latency: {latency:.0f}ms")
    except Exception as e:
        print(f"  ✗ Error: {e}")
    
    # ─────────────────────────────────────────────────────────────────
    # Demo 2: Model Comparison (Same prompt, different models)
    # ─────────────────────────────────────────────────────────────────
    print("\n[Demo 2: Model Comparison]")
    print("─"*65)
    
    models_to_test = [
        f"ollama/{WEAK_MODEL}",
        f"ollama/{STRONG_MODEL}",
    ]
    
    test_prompt = [{"role": "user", "content": "Explain recursion in one sentence."}]
    
    for model in models_to_test:
        try:
            start = time.perf_counter()
            resp = litellm.completion(
                model=model,
                messages=test_prompt,
                api_base="http://localhost:11434"
            )
            latency = (time.perf_counter() - start) * 1000
            
            answer = resp.choices[0].message.content.strip()
            # Clean qwen thinking tags
            answer = re.sub(r'<think>.*?</think>', '', answer, flags=re.DOTALL).strip()
            
            print(f"\n  {model}:")
            print(f"    Latency: {latency:.0f}ms")
            print(f"    Response: {answer[:100]}{'...' if len(answer) > 100 else ''}")
        except Exception as e:
            print(f"\n  {model}: Error - {e}")
    
    # ─────────────────────────────────────────────────────────────────
    # Demo 3: Fallback Chain (try models in order)
    # ─────────────────────────────────────────────────────────────────
    print("\n\n[Demo 3: Fallback Chain]")
    print("─"*65)
    
    fallback_models = [
        f"ollama/nonexistent-model",  # Will fail
        f"ollama/{WEAK_MODEL}",       # Fallback
    ]
    
    def completion_with_fallback(messages, models):
        """Try models in order until one works."""
        for model in models:
            try:
                resp = litellm.completion(
                    model=model,
                    messages=messages,
                    api_base="http://localhost:11434",
                    timeout=10
                )
                return model, resp
            except Exception as e:
                print(f"  ✗ {model}: {str(e)[:50]}...")
                continue
        return None, None
    
    print("  Trying models in order:")
    model_used, resp = completion_with_fallback(
        [{"role": "user", "content": "Hello!"}],
        fallback_models
    )
    
    if resp:
        print(f"\n  ✓ Success with: {model_used}")
        print(f"  ✓ Response: {resp.choices[0].message.content.strip()[:60]}")
    else:
        print("\n  ✗ All models failed")
    
    # ─────────────────────────────────────────────────────────────────
    # Summary
    # ─────────────────────────────────────────────────────────────────
    print("\n" + "="*65)
    print("LITELLM KEY TAKEAWAYS")
    print("="*65)
    print("""
  ✓ Unified API: model="ollama/llama3.2:1b" or model="gpt-4o"
  ✓ Fallbacks: Try models in order until one works
  ✓ Same code works for 100+ providers
  ✓ Production: Run as proxy server with config.yaml
    """)


LITELLM: Functional Demos with Ollama

[Demo 1: Basic Completion]
─────────────────────────────────────────────────────────────────
  ✓ Model: ollama/llama3.2:1b
  ✓ Response: 4.
  ✓ Latency: 684ms

[Demo 2: Model Comparison]
─────────────────────────────────────────────────────────────────

  ollama/llama3.2:1b:
    Latency: 425ms
    Response: Recursion is a programming technique where a function calls itself repeatedly until it reaches a bas...

  ollama/qwen3:4b:
    Latency: 53706ms
    Response: Recursion is a programming technique where a function calls itself to solve a problem by breaking it...


[Demo 3: Fallback Chain]
─────────────────────────────────────────────────────────────────
  Trying models in order:

[1;31mGive Feedback / Get Help: https://github.com/BerriAI/litellm/issues/new[0m
LiteLLM.Info: If you need to debug this error, use `litellm._turn_on_debug()'.

  ✗ ollama/nonexistent-model: litellm.APIConnectionError: OllamaException - {"er...

  ✓ Success with: oll

---

# Demo 3: Semantic Router - Intent-Based Routing

**Semantic Router** uses embeddings to match queries against predefined route examples.

| Feature | Semantic Router | RouteLLM |
|---------|-----------------|----------|
| Routing logic | Defined (examples) | Learned (preferences) |
| Training needed | No | Yes |
| Explainability | High | Low |
| Local execution | ✓ (sentence-transformers) | ✗ (needs OpenAI) |
| Output | Intent/Category | Model selection |

**Key advantage:** Runs locally with ~5ms latency. No API calls for routing decisions.


In [24]:
# Semantic Router: Setup with Local Embeddings

print("="*65)
print("SEMANTIC ROUTER: Intent-Based Classification")
print("="*65)

semantic_router_available = False
sr_router = None

try:
    from semantic_router import Route
    from semantic_router.routers import SemanticRouter
    from semantic_router.encoders import HuggingFaceEncoder
    
    print("\nLoading local encoder (HuggingFace)...")
    encoder = HuggingFaceEncoder()  # Uses local model, no API needed
    
    # Define routes with example utterances (from md)
    billing_route = Route(
        name="billing",
        utterances=[
            "What's my current balance?",
            "I want to pay my bill",
            "Can you explain this charge?",
            "Update my payment method",
            "When is my payment due?",
            "Show me my invoice",
            "I was overcharged",
        ]
    )
    
    technical_route = Route(
        name="technical",
        utterances=[
            "The app keeps crashing",
            "I can't log in to my account",
            "How do I reset my password?",
            "The feature isn't working",
            "I'm getting an error message",
            "My data isn't syncing",
            "The page won't load",
        ]
    )
    
    sales_route = Route(
        name="sales",
        utterances=[
            "What plans do you offer?",
            "I want to upgrade my subscription",
            "Tell me about enterprise pricing",
            "Compare your plans",
            "What's included in premium?",
            "Can I get a demo?",
            "Pricing for teams?",
        ]
    )
    
    # Create router with auto_sync to build index immediately
    sr_router = SemanticRouter(
        encoder=encoder,
        routes=[billing_route, technical_route, sales_route],
        auto_sync="local"  # Automatically sync routes to local index
    )
    
    semantic_router_available = True
    print("✓ Semantic Router ready")
    print(f"  Routes: billing, technical, sales")
    print(f"  Encoder: HuggingFace (local)")
    
except ImportError as e:
    print(f"\n⚠ Semantic Router not installed: {e}")
    print("\nTo install:")
    print("  pip install semantic-router")
    
except Exception as e:
    print(f"\n⚠ Failed to initialize: {e}")


SEMANTIC ROUTER: Intent-Based Classification

Loading local encoder (HuggingFace)...




✓ Semantic Router ready
  Routes: billing, technical, sales
  Encoder: HuggingFace (local)


In [25]:
# Semantic Router: Classification Demo

print("="*65)
print("SEMANTIC ROUTER: Intent Classification Demo")
print("="*65)

if not semantic_router_available:
    print("\n⚠ Semantic Router not available")
else:
    # Test queries - mix of routes and unmatched
    sr_test_queries = [
        "I need to pay my monthly fee",
        "The app crashes every time I open it",
        "What are your enterprise pricing options?",
        "My password reset email never arrived",
        "Can you explain this $50 charge?",
        "I'd like to upgrade to the premium plan",
        "Hello, how are you?",  # Should not match any route
    ]
    
    # Route-to-model mapping (as shown in md)
    route_to_model = {
        "billing": ("llama-8b", "€0.05/M tokens"),      # Simple lookups
        "technical": ("claude-sonnet", "€3/M tokens"),   # Debugging
        "sales": ("gpt-4o", "€15/M tokens"),             # Persuasive
        None: ("llama-8b", "€0.05/M tokens")             # Fallback
    }
    
    print("\nClassification Results:")
    print("─"*65)
    
    for query in sr_test_queries:
        result = sr_router(query)
        route_name = result.name if result else None
        model, cost = route_to_model.get(route_name, route_to_model[None])
        
        print(f"\n  [{route_name or 'NONE':10}] → {model} ({cost})")
        print(f"  Query: {query[:55]}{'...' if len(query) > 55 else ''}")
    
    print(f"\n{'═'*65}")
    print("ROUTE-TO-ACTION MAPPING (from md):")
    print(f"{'═'*65}")
    print("""
  Route        Model           Prompt          Tools
  ───────────────────────────────────────────────────────────
  billing      llama-8b        billing.txt     [balance, pay]
  technical    claude-sonnet   support.txt     [kb, ticket]
  sales        gpt-4o          sales.txt       [pricing, demo]
  default      llama-8b        general.txt     []
    """)


SEMANTIC ROUTER: Intent Classification Demo

Classification Results:
─────────────────────────────────────────────────────────────────

  [billing   ] → llama-8b (€0.05/M tokens)
  Query: I need to pay my monthly fee

  [NONE      ] → llama-8b (€0.05/M tokens)
  Query: The app crashes every time I open it

  [NONE      ] → llama-8b (€0.05/M tokens)
  Query: What are your enterprise pricing options?

  [NONE      ] → llama-8b (€0.05/M tokens)
  Query: My password reset email never arrived

  [NONE      ] → llama-8b (€0.05/M tokens)
  Query: Can you explain this $50 charge?

  [NONE      ] → llama-8b (€0.05/M tokens)
  Query: I'd like to upgrade to the premium plan

  [NONE      ] → llama-8b (€0.05/M tokens)
  Query: Hello, how are you?

═════════════════════════════════════════════════════════════════
ROUTE-TO-ACTION MAPPING (from md):
═════════════════════════════════════════════════════════════════

  Route        Model           Prompt          Tools
  ───────────────────────────────

---

# Demo 4: Local Router with Ollama

Demonstrate the full routing flow locally:
1. Classify query complexity (rule-based)
2. Route to appropriate model (`qwen3:4b` or `llama3.2:1b`)
3. Generate response

This simulates what happens in production with LiteLLM + cloud models.


In [26]:
# Local Router: Rule-Based Classifier + Ollama

print("="*65)
print("LOCAL ROUTER: Rule-Based Classification")
print("="*65)

@dataclass
class RoutingDecision:
    """Result of routing decision."""
    query: str
    complexity: str      # simple, standard, complex
    route_to: str        # weak or strong
    confidence: float
    reasoning: str

def classify_complexity(query: str) -> RoutingDecision:
    """Rule-based complexity classifier (simplified for demo)."""
    query_lower = query.lower()
    word_count = len(query.split())
    
    # Complexity indicators
    complex_keywords = ["analyze", "compare", "evaluate", "trade-offs", "design", "architect", "debug"]
    simple_keywords = ["hello", "hi", "thanks", "yes", "no", "what time", "define"]
    
    has_complex = any(kw in query_lower for kw in complex_keywords)
    has_simple = any(kw in query_lower for kw in simple_keywords)
    
    if has_complex or word_count > 30:
        return RoutingDecision(query, "complex", "strong", 0.85, "Complex reasoning required")
    elif has_simple or word_count < 5:
        return RoutingDecision(query, "simple", "weak", 0.90, "Simple query")
    else:
        return RoutingDecision(query, "standard", "strong", 0.70, "Standard task")

# Test the classifier
test_queries = [
    "Hello!",
    "What is Python?",
    "Analyze the trade-offs between microservices and monolithic architecture",
]

print("\nClassification Results:")
print("─"*65)
for query in test_queries:
    decision = classify_complexity(query)
    symbol = "→" if decision.route_to == "strong" else "⇢"
    print(f"\n  [{decision.complexity.upper():8}] {symbol} {decision.route_to}")
    print(f"  Query: {query[:55]}{'...' if len(query) > 55 else ''}")


LOCAL ROUTER: Rule-Based Classification

Classification Results:
─────────────────────────────────────────────────────────────────

  [SIMPLE  ] ⇢ weak
  Query: Hello!

  [SIMPLE  ] ⇢ weak
  Query: What is Python?

  [COMPLEX ] → strong
  Query: Analyze the trade-offs between microservices and monoli...


In [27]:
# Local Router: Live Demo with Ollama

print("="*65)
print("LOCAL ROUTER: Live Demo with Ollama")
print("="*65)

if not ollama_ready:
    print("\n⚠ Ollama not running - skipping live demo")
else:
    def route_and_generate(query: str) -> Dict[str, Any]:
        """Route query to appropriate model and generate response."""
        decision = classify_complexity(query)
        model = STRONG_MODEL if decision.route_to == "strong" else WEAK_MODEL
        
        response, latency_ms = ollama_generate(query, model, temperature=0.7)
        response = clean_response(response)
        
        return {
            "query": query,
            "complexity": decision.complexity,
            "model_used": model,
            "response": response[:150] + "..." if len(response) > 150 else response,
            "latency_ms": latency_ms
        }
    
    # Test queries
    routing_tests = [
        "Hi there!",
        "Analyze the trade-offs between SQL and NoSQL databases"
    ]
    
    results = []
    for query in routing_tests:
        print(f"\n{'─'*65}")
        print(f"Query: {query}")
        result = route_and_generate(query)
        results.append(result)
        print(f"  Complexity: {result['complexity'].upper()}")
        print(f"  Routed to: {result['model_used']}")
        print(f"  Latency: {result['latency_ms']:.0f}ms")
        print(f"  Response: {result['response']}")
    
    # Summary
    print(f"\n{'═'*65}")
    print("ROUTING SUMMARY")
    print(f"{'═'*65}")
    weak_count = sum(1 for r in results if r['model_used'] == WEAK_MODEL)
    print(f"  Routed to weak: {weak_count}/{len(results)}")
    print(f"  Cost reduction: ~{weak_count/len(results)*100:.0f}% of queries to cheaper model")


LOCAL ROUTER: Live Demo with Ollama

─────────────────────────────────────────────────────────────────
Query: Hi there!
  Complexity: SIMPLE
  Routed to: llama3.2:1b
  Latency: 225ms
  Response: Hello! How can I help you today?

─────────────────────────────────────────────────────────────────
Query: Analyze the trade-offs between SQL and NoSQL databases
  Complexity: COMPLEX
  Routed to: qwen3:4b
  Latency: 57707ms
  Response: Here's a clear, balanced analysis of the **core trade-offs between SQL and NoSQL databases**, designed to help you make practical decisions—not just t...

═════════════════════════════════════════════════════════════════
ROUTING SUMMARY
═════════════════════════════════════════════════════════════════
  Routed to weak: 1/2
  Cost reduction: ~50% of queries to cheaper model


In [28]:
# Final Summary: Cost Optimization Methods

print("="*70)
print("COST OPTIMIZATION: Complete Summary")
print("="*70)
print("""
┌─────────────────────┬──────────────────┬──────────────────┬─────────────┐
│ Method              │ Approach         │ Requirements     │ Savings     │
├─────────────────────┼──────────────────┼──────────────────┼─────────────┤
│ LiteLLM Gateway     │ Unified API      │ litellm          │ Infra layer │
│ Semantic Router     │ Intent matching  │ semantic-router  │ 50-85%      │
│ Rule-Based Router   │ Keyword/length   │ None             │ 40-60%      │
│ GPTCache            │ Semantic match   │ gptcache         │ 20-40%      │
└─────────────────────┴──────────────────┴──────────────────┴─────────────┘

PRODUCTION ARCHITECTURE (from md):
─────────────────────────────────────────────────────────────────────────

    Incoming Query
          │
          ▼
    ┌────────────────────────┐
    │   Semantic Router      │  ← Intent classification (local, ~5ms)
    │   (local embeddings)   │
    └───────────┬────────────┘
                │
    ┌───────────┴───────────────┐
    │                           │
    ▼                           ▼
  billing → small model    complex → frontier model
  technical → mid model    sales → persuasive model
                │
                ▼
    ┌────────────────────────┐
    │   LiteLLM Gateway      │  ← Fallbacks, caching, cost tracking
    └───────────┬────────────┘
                │
    ┌───────────┼───────────────┐
    ▼           ▼               ▼
  Ollama     Claude          GPT-4o

DEMO STATUS:
─────────────────────────────────────────────────────────────────────────
""")

# Status checks
status = [
    ("LiteLLM", litellm_available if 'litellm_available' in dir() else False),
    ("Semantic Router", semantic_router_available if 'semantic_router_available' in dir() else False),
    ("Ollama", ollama_ready if 'ollama_ready' in dir() else False),
    ("GPTCache", gptcache_available if 'gptcache_available' in dir() else False),
]
for name, available in status:
    stat = "✓" if available else "✗"
    print(f"  {stat} {name}")


COST OPTIMIZATION: Complete Summary

┌─────────────────────┬──────────────────┬──────────────────┬─────────────┐
│ Method              │ Approach         │ Requirements     │ Savings     │
├─────────────────────┼──────────────────┼──────────────────┼─────────────┤
│ LiteLLM Gateway     │ Unified API      │ litellm          │ Infra layer │
│ Semantic Router     │ Intent matching  │ semantic-router  │ 50-85%      │
│ Rule-Based Router   │ Keyword/length   │ None             │ 40-60%      │
│ GPTCache            │ Semantic match   │ gptcache         │ 20-40%      │
└─────────────────────┴──────────────────┴──────────────────┴─────────────┘

PRODUCTION ARCHITECTURE (from md):
─────────────────────────────────────────────────────────────────────────

    Incoming Query
          │
          ▼
    ┌────────────────────────┐
    │   Semantic Router      │  ← Intent classification (local, ~5ms)
    │   (local embeddings)   │
    └───────────┬────────────┘
                │
    ┌───────────┴───

---

# Demo 6: LiteLLM Router - Intelligent Model Selection

LiteLLM's `Router` class provides:
- **Load balancing** across multiple deployments
- **Fallbacks** when primary models fail
- **Routing strategies**: simple-shuffle, least-busy, latency-based, cost-based

This demo shows routing between Ollama models locally.


In [29]:
# LiteLLM Router: Setup and Configuration

print("="*65)
print("LITELLM ROUTER: Setup")
print("="*65)

if not litellm_available:
    print("\n⚠ LiteLLM not installed: pip install litellm")
else:
    from litellm import Router
    import time
    
    # Define model deployments
    # Each "model_name" is a logical name; litellm_params specify actual model
    model_list = [
        {
            "model_name": "fast-model",  # Logical name for routing
            "litellm_params": {
                "model": f"ollama/{WEAK_MODEL}",
                "api_base": "http://localhost:11434",
            },
            "model_info": {"id": 1}
        },
        {
            "model_name": "smart-model",  # Logical name for routing
            "litellm_params": {
                "model": f"ollama/{STRONG_MODEL}",
                "api_base": "http://localhost:11434",
            },
            "model_info": {"id": 2}
        },
        # Multiple deployments of same logical model (for load balancing)
        {
            "model_name": "balanced",
            "litellm_params": {
                "model": f"ollama/{WEAK_MODEL}",
                "api_base": "http://localhost:11434",
            },
            "model_info": {"id": 3}
        },
        {
            "model_name": "balanced",  # Same name = load balanced
            "litellm_params": {
                "model": f"ollama/{STRONG_MODEL}",
                "api_base": "http://localhost:11434",
            },
            "model_info": {"id": 4}
        },
    ]
    
    # Create router with fallback configuration
    litellm_router = Router(
        model_list=model_list,
        routing_strategy="simple-shuffle",  # Options: simple-shuffle, least-busy, latency-based-routing
        set_verbose=False,
        num_retries=2,
    )
    
    print("\n✓ LiteLLM Router configured")
    print(f"  Models registered: {len(model_list)}")
    print(f"  Routing strategy: simple-shuffle")
    print("\n  Logical model names:")
    print(f"    - fast-model  → {WEAK_MODEL}")
    print(f"    - smart-model → {STRONG_MODEL}")
    print(f"    - balanced    → load-balanced between both")


LITELLM ROUTER: Setup

✓ LiteLLM Router configured
  Models registered: 4
  Routing strategy: simple-shuffle

  Logical model names:
    - fast-model  → llama3.2:1b
    - smart-model → qwen3:4b
    - balanced    → load-balanced between both


In [30]:
# LiteLLM Router: Functional Demo - Direct Model Selection

print("="*65)
print("LITELLM ROUTER: Direct Model Selection")
print("="*65)

if not litellm_available or not ollama_ready:
    print("\n⚠ LiteLLM or Ollama not available")
else:
    import asyncio
    
    messages = [{"role": "user", "content": "What is 2+2? One word answer."}]
    
    # Demo 1: Route to fast model
    print("\n[1] Route to 'fast-model' (weak model)")
    print("─"*65)
    try:
        start = time.perf_counter()
        response = litellm_router.completion(
            model="fast-model",
            messages=messages
        )
        latency = (time.perf_counter() - start) * 1000
        print(f"  ✓ Routed to: {response.model}")
        print(f"  ✓ Response: {response.choices[0].message.content.strip()[:60]}")
        print(f"  ✓ Latency: {latency:.0f}ms")
    except Exception as e:
        print(f"  ✗ Error: {e}")
    
    # Demo 2: Route to smart model
    print("\n[2] Route to 'smart-model' (strong model)")
    print("─"*65)
    try:
        start = time.perf_counter()
        response = litellm_router.completion(
            model="smart-model",
            messages=[{"role": "user", "content": "Explain recursion briefly."}]
        )
        latency = (time.perf_counter() - start) * 1000
        
        answer = response.choices[0].message.content.strip()
        answer = re.sub(r'<think>.*?</think>', '', answer, flags=re.DOTALL).strip()
        
        print(f"  ✓ Routed to: {response.model}")
        print(f"  ✓ Response: {answer[:80]}...")
        print(f"  ✓ Latency: {latency:.0f}ms")
    except Exception as e:
        print(f"  ✗ Error: {e}")


LITELLM ROUTER: Direct Model Selection

[1] Route to 'fast-model' (weak model)
─────────────────────────────────────────────────────────────────
  ✓ Routed to: ollama/llama3.2:1b
  ✓ Response: Two.
  ✓ Latency: 118ms

[2] Route to 'smart-model' (strong model)
─────────────────────────────────────────────────────────────────
  ✓ Routed to: ollama/qwen3:4b
  ✓ Response: Recursion is a programming technique where a **function calls itself** to solve ...
  ✓ Latency: 10364ms


In [31]:
# LiteLLM Router: Load Balancing Demo

print("="*65)
print("LITELLM ROUTER: Load Balancing Demo")
print("="*65)

if not litellm_available or not ollama_ready:
    print("\n⚠ LiteLLM or Ollama not available")
else:
    # Demo 3: Load balancing - same logical name routes to different models
    print("\n[3] Load Balancing with 'balanced' model")
    print("─"*65)
    print("  Sending 4 requests to 'balanced' (shuffles between weak/strong)")
    
    models_used = []
    for i in range(4):
        try:
            response = litellm_router.completion(
                model="balanced",
                messages=[{"role": "user", "content": f"Say 'hello {i}'"}]
            )
            models_used.append(response.model)
            print(f"    Request {i+1}: → {response.model}")
        except Exception as e:
            print(f"    Request {i+1}: Error - {e}")
    
    # Count distribution
    from collections import Counter
    distribution = Counter(models_used)
    print(f"\n  Distribution across {len(models_used)} requests:")
    for model, count in distribution.items():
        print(f"    {model}: {count} ({count/len(models_used)*100:.0f}%)")


LITELLM ROUTER: Load Balancing Demo

[3] Load Balancing with 'balanced' model
─────────────────────────────────────────────────────────────────
  Sending 4 requests to 'balanced' (shuffles between weak/strong)
    Request 1: → ollama/llama3.2:1b
    Request 2: → ollama/llama3.2:1b
    Request 3: → ollama/llama3.2:1b
    Request 4: → ollama/qwen3:4b

  Distribution across 4 requests:
    ollama/llama3.2:1b: 3 (75%)
    ollama/qwen3:4b: 1 (25%)


In [32]:
# LiteLLM Router: Fallback Demo

print("="*65)
print("LITELLM ROUTER: Fallback Configuration")
print("="*65)

if not litellm_available or not ollama_ready:
    print("\n⚠ LiteLLM or Ollama not available")
else:
    # Create a new router with fallback configuration
    fallback_model_list = [
        {
            "model_name": "primary",
            "litellm_params": {
                "model": "ollama/nonexistent-model-xyz",  # Will fail
                "api_base": "http://localhost:11434",
            },
        },
        {
            "model_name": "fallback",
            "litellm_params": {
                "model": f"ollama/{WEAK_MODEL}",
                "api_base": "http://localhost:11434",
            },
        },
    ]
    
    fallback_router = Router(
        model_list=fallback_model_list,
        fallbacks=[{"primary": ["fallback"]}],  # If primary fails, try fallback
        set_verbose=False,
        num_retries=0,  # Don't retry same model
    )
    
    print("\n[4] Fallback: primary → fallback on failure")
    print("─"*65)
    print("  Primary: nonexistent-model-xyz (will fail)")
    print("  Fallback: llama3.2:1b")
    
    try:
        start = time.perf_counter()
        response = fallback_router.completion(
            model="primary",
            messages=[{"role": "user", "content": "Hello!"}]
        )
        latency = (time.perf_counter() - start) * 1000
        
        print(f"\n  ✓ Request succeeded via fallback!")
        print(f"  ✓ Model used: {response.model}")
        print(f"  ✓ Response: {response.choices[0].message.content.strip()[:50]}")
        print(f"  ✓ Total latency: {latency:.0f}ms (includes failed attempt)")
    except Exception as e:
        print(f"\n  ✗ All models failed: {e}")


LITELLM ROUTER: Fallback Configuration

[4] Fallback: primary → fallback on failure
─────────────────────────────────────────────────────────────────
  Primary: nonexistent-model-xyz (will fail)
  Fallback: llama3.2:1b

  ✓ Request succeeded via fallback!
  ✓ Model used: ollama/llama3.2:1b
  ✓ Response: Hello! How can I assist you today?
  ✓ Total latency: 213ms (includes failed attempt)


In [33]:
# LiteLLM Router: Custom Routing Logic (Complexity-Based)

print("="*65)
print("LITELLM ROUTER: Custom Complexity-Based Routing")
print("="*65)

if not litellm_available or not ollama_ready:
    print("\n⚠ LiteLLM or Ollama not available")
else:
    def route_by_complexity(query: str) -> str:
        """
        Custom routing logic - you control which model to use.
        Returns the logical model name to route to.
        """
        query_lower = query.lower()
        
        # Simple queries → fast model
        simple_patterns = ["hello", "hi", "what is", "define", "yes", "no", "thanks"]
        if any(p in query_lower for p in simple_patterns) or len(query.split()) < 5:
            return "fast-model"
        
        # Complex queries → smart model
        complex_patterns = ["explain", "analyze", "compare", "design", "debug", "why"]
        if any(p in query_lower for p in complex_patterns) or len(query.split()) > 20:
            return "smart-model"
        
        # Default to fast model
        return "fast-model"
    
    print("\n[5] Complexity-Based Routing")
    print("─"*65)
    
    test_queries = [
        "Hi there!",
        "What is Python?",
        "Explain the trade-offs between microservices and monolithic architectures",
        "Thanks!",
        "Design a caching strategy for a high-traffic e-commerce site",
    ]
    
    routing_stats = {"fast-model": 0, "smart-model": 0}
    
    for query in test_queries:
        model_name = route_by_complexity(query)
        routing_stats[model_name] += 1
        
        try:
            response = litellm_router.completion(
                model=model_name,
                messages=[{"role": "user", "content": query}]
            )
            answer = response.choices[0].message.content.strip()
            answer = re.sub(r'<think>.*?</think>', '', answer, flags=re.DOTALL).strip()
            
            print(f"\n  [{model_name:11}] {query[:45]}{'...' if len(query)>45 else ''}")
            print(f"               → {answer[:60]}{'...' if len(answer)>60 else ''}")
        except Exception as e:
            print(f"\n  [{model_name:11}] Error: {e}")
    
    print(f"\n{'─'*65}")
    print("  Routing Summary:")
    total = sum(routing_stats.values())
    for model, count in routing_stats.items():
        print(f"    {model}: {count}/{total} ({count/total*100:.0f}%)")
    
    fast_pct = routing_stats["fast-model"] / total * 100
    print(f"\n  Cost savings estimate: ~{fast_pct:.0f}% routed to cheaper model")


LITELLM ROUTER: Custom Complexity-Based Routing

[5] Complexity-Based Routing
─────────────────────────────────────────────────────────────────

  [fast-model ] Hi there!
               → Hello! It's nice to meet you. Is there something I can help ...

  [fast-model ] What is Python?
               → Python is a high-level, interpreted programming language tha...

  [fast-model ] Explain the trade-offs between microservices ...
               → When it comes to designing an architecture, two popular appr...

  [fast-model ] Thanks!
               → It seems you've already sent a message. Is there something e...

  [fast-model ] Design a caching strategy for a high-traffic ...
               → **High-Traffic E-commerce Site Caching Strategy**

To improv...

─────────────────────────────────────────────────────────────────
  Routing Summary:
    fast-model: 5/5 (100%)
    smart-model: 0/5 (0%)

  Cost savings estimate: ~100% routed to cheaper model



---

# Demo 5: Semantic Caching

Cache LLM responses by **meaning**, not exact text.

| Query | Result |
|-------|--------|
| "How do I reset my password?" | → LLM call, cached |
| "I forgot my password, help!" | → Cache HIT (similar meaning) |

**Expected hit rates:** FAQ 30-60% | Search 15-30% | Chat 5-15%

This demo uses sentence-transformers for local embeddings (same concept as GPTCache).


In [34]:
# Semantic Caching: Functional Demo with Local Embeddings + Ollama

print("="*65)
print("SEMANTIC CACHING: Functional Demo")
print("="*65)

# We'll build a simple semantic cache using sentence-transformers
# This demonstrates the concept without GPTCache's openai<2.0 dependency

from sentence_transformers import SentenceTransformer
import numpy as np
from typing import Dict, Tuple, Optional
import time

class SemanticCache:
    """
    Simple semantic cache using sentence-transformers.
    Demonstrates the core concept of GPTCache locally.
    """
    def __init__(self, similarity_threshold: float = 0.85):
        print("\nLoading embedding model...")
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.threshold = similarity_threshold
        self.cache: Dict[str, Tuple[np.ndarray, str]] = {}  # query -> (embedding, response)
        print(f"✓ SemanticCache ready (threshold={similarity_threshold})")
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    
    def get(self, query: str) -> Tuple[Optional[str], float, str]:
        """
        Check if a semantically similar query exists in cache.
        Returns: (cached_response or None, similarity_score, matched_query)
        """
        query_embedding = self.encoder.encode(query)
        
        best_match = None
        best_similarity = 0.0
        best_query = ""
        
        for cached_query, (cached_embedding, cached_response) in self.cache.items():
            similarity = self._cosine_similarity(query_embedding, cached_embedding)
            if similarity > best_similarity:
                best_similarity = similarity
                best_match = cached_response
                best_query = cached_query
        
        if best_similarity >= self.threshold:
            return best_match, best_similarity, best_query
        return None, best_similarity, best_query
    
    def set(self, query: str, response: str):
        """Store query-response pair in cache."""
        embedding = self.encoder.encode(query)
        self.cache[query] = (embedding, response)

# Initialize cache
semantic_cache = SemanticCache(similarity_threshold=0.80)
print(f"  Cache entries: {len(semantic_cache.cache)}")


SEMANTIC CACHING: Functional Demo

Loading embedding model...
✓ SemanticCache ready (threshold=0.8)
  Cache entries: 0


In [35]:
# Semantic Cache: Live Demo with Ollama

print("="*65)
print("SEMANTIC CACHE: Live Demo with Ollama")
print("="*65)

if not ollama_ready:
    print("\n⚠ Ollama not running")
else:
    def cached_completion(query: str, cache: SemanticCache) -> Dict[str, Any]:
        """LLM completion with semantic caching."""
        start = time.perf_counter()
        
        # Check cache first
        cached_response, similarity, matched_query = cache.get(query)
        
        if cached_response is not None:
            latency = (time.perf_counter() - start) * 1000
            return {
                "query": query, "response": cached_response, "cache_hit": True,
                "similarity": similarity, "matched_query": matched_query, "latency_ms": latency
            }
        
        # Cache miss - call LLM
        response, _ = ollama_generate(query, WEAK_MODEL, temperature=0.7)
        response = clean_response(response)
        cache.set(query, response)
        
        return {
            "query": query, "response": response, "cache_hit": False,
            "similarity": similarity, "matched_query": matched_query or "N/A",
            "latency_ms": (time.perf_counter() - start) * 1000
        }
    
    # Test queries - some similar, some different
    test_queries = [
        "How do I reset my password?",      # Will be cached
        "I forgot my password, help!",       # Similar - should hit
        "What are your business hours?",     # Different - miss
        "When are you open?",                # Similar to previous - should hit
        "Password reset instructions please", # Similar to first - should hit
    ]
    
    print("\nRunning queries through cached completion:")
    print("─"*65)
    
    cache_hits = 0
    for i, query in enumerate(test_queries):
        result = cached_completion(query, semantic_cache)
        if result["cache_hit"]:
            cache_hits += 1
            print(f"\n  [{i+1}] ✓ CACHE HIT (sim={result['similarity']:.2f})")
            print(f"      Matched: {result['matched_query'][:40]}...")
        else:
            print(f"\n  [{i+1}] ✗ MISS → LLM ({result['latency_ms']:.0f}ms)")
        print(f"      Query: {query}")
        print(f"      Response: {result['response'][:70]}...")
    
    # Summary
    hit_rate = cache_hits / len(test_queries) * 100
    print(f"\n{'═'*65}")
    print(f"  Hit rate: {cache_hits}/{len(test_queries)} = {hit_rate:.0f}%")
    print(f"  Estimated cost savings: ~{hit_rate:.0f}% of LLM calls avoided")


SEMANTIC CACHE: Live Demo with Ollama

Running queries through cached completion:
─────────────────────────────────────────────────────────────────


huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)



  [1] ✗ MISS → LLM (5527ms)
      Query: How do I reset my password?
      Response: To reset your password, you'll need to follow these general steps. Ple...

  [2] ✗ MISS → LLM (2885ms)
      Query: I forgot my password, help!
      Response: Forgetting a password can be frustrating. Here are some steps you can ...

  [3] ✗ MISS → LLM (1246ms)
      Query: What are your business hours?
      Response: I'm available to help 24/7, but my developers may not be able to respo...

  [4] ✗ MISS → LLM (1254ms)
      Query: When are you open?
      Response: I'm here to help 24/7, but my response times may be slower on weekends...

  [5] ✓ CACHE HIT (sim=0.82)
      Matched: How do I reset my password?...
      Query: Password reset instructions please
      Response: To reset your password, you'll need to follow these general steps. Ple...

═════════════════════════════════════════════════════════════════
  Hit rate: 1/5 = 20%
  Estimated cost savings: ~20% of LLM calls avoided


---

# Demo 7: SISO - Semantic Index for Serving Optimization

**SISO** is next-gen semantic caching that improves on GPTCache:

| Feature | GPTCache | SISO |
|---------|----------|------|
| Index | Flat similarity | Centroid-based clusters |
| Eviction | LRU global | Locality-aware per cluster |
| Threshold | Fixed | Dynamic (adapts to query distribution) |
| Lookup | O(n) | O(log n) via cluster pruning |

**Key insight:** Group semantically similar queries into clusters. New queries check cluster centroids first, then search within the best-matching cluster.


In [36]:
# SISO: Semantic Index for Serving Optimization - Setup

print("="*65)
print("SISO: Centroid-Based Semantic Cache")
print("="*65)

from sentence_transformers import SentenceTransformer
import numpy as np
from typing import Dict, List, Tuple, Optional
from dataclasses import dataclass, field
import time

@dataclass
class CacheEntry:
    """Single cache entry with access tracking."""
    query: str
    response: str
    embedding: np.ndarray
    access_count: int = 0
    last_access: float = field(default_factory=time.time)

@dataclass
class Cluster:
    """Semantic cluster with centroid and entries."""
    centroid: np.ndarray
    entries: List[CacheEntry] = field(default_factory=list)
    threshold: float = 0.80  # Dynamic threshold per cluster
    
    def update_centroid(self):
        """Recalculate centroid from entries."""
        if self.entries:
            embeddings = np.array([e.embedding for e in self.entries])
            self.centroid = np.mean(embeddings, axis=0)

class SISOCache:
    """
    SISO: Semantic Index for Serving Optimization
    
    Improvements over basic semantic cache:
    1. Centroid-based clustering for O(log n) lookup
    2. Locality-aware eviction (per-cluster LRU)
    3. Dynamic thresholds adapting to cluster density
    """
    def __init__(
        self, 
        base_threshold: float = 0.80,
        cluster_threshold: float = 0.70,  # Threshold for assigning to cluster
        max_entries_per_cluster: int = 10,
        max_clusters: int = 20
    ):
        print("\nLoading embedding model...")
        self.encoder = SentenceTransformer('all-MiniLM-L6-v2')
        self.base_threshold = base_threshold
        self.cluster_threshold = cluster_threshold
        self.max_entries_per_cluster = max_entries_per_cluster
        self.max_clusters = max_clusters
        self.clusters: List[Cluster] = []
        self.stats = {"hits": 0, "misses": 0, "cluster_searches": 0, "entry_searches": 0}
        print(f"✓ SISO Cache ready")
        print(f"  Base threshold: {base_threshold}")
        print(f"  Max clusters: {max_clusters}")
        print(f"  Max entries/cluster: {max_entries_per_cluster}")
    
    def _cosine_similarity(self, a: np.ndarray, b: np.ndarray) -> float:
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))
    
    def _find_best_cluster(self, embedding: np.ndarray) -> Tuple[Optional[Cluster], float]:
        """Find the cluster with the most similar centroid."""
        best_cluster = None
        best_sim = 0.0
        
        for cluster in self.clusters:
            self.stats["cluster_searches"] += 1
            sim = self._cosine_similarity(embedding, cluster.centroid)
            if sim > best_sim:
                best_sim = sim
                best_cluster = cluster
        
        return best_cluster, best_sim
    
    def _evict_from_cluster(self, cluster: Cluster):
        """Locality-aware eviction: remove least recently used entry from cluster."""
        if not cluster.entries:
            return
        # Sort by last_access, remove oldest
        cluster.entries.sort(key=lambda e: e.last_access, reverse=True)
        evicted = cluster.entries.pop()
        cluster.update_centroid()
        print(f"    [Evicted] '{evicted.query[:30]}...' from cluster")
    
    def get(self, query: str) -> Tuple[Optional[str], Dict]:
        """
        SISO lookup:
        1. Embed query
        2. Find best matching cluster centroid
        3. Search within that cluster only
        """
        query_embedding = self.encoder.encode(query)
        
        # Find best cluster
        cluster, cluster_sim = self._find_best_cluster(query_embedding)
        
        if cluster is None or cluster_sim < self.cluster_threshold:
            self.stats["misses"] += 1
            return None, {"cluster_sim": cluster_sim, "entry_sim": 0.0, "matched": None}
        
        # Search within cluster
        best_entry = None
        best_sim = 0.0
        
        for entry in cluster.entries:
            self.stats["entry_searches"] += 1
            sim = self._cosine_similarity(query_embedding, entry.embedding)
            if sim > best_sim:
                best_sim = sim
                best_entry = entry
        
        # Check against cluster's dynamic threshold
        if best_entry and best_sim >= cluster.threshold:
            best_entry.access_count += 1
            best_entry.last_access = time.time()
            self.stats["hits"] += 1
            return best_entry.response, {
                "cluster_sim": cluster_sim,
                "entry_sim": best_sim,
                "matched": best_entry.query
            }
        
        self.stats["misses"] += 1
        return None, {"cluster_sim": cluster_sim, "entry_sim": best_sim, "matched": None}
    
    def set(self, query: str, response: str):
        """Store with automatic clustering."""
        embedding = self.encoder.encode(query)
        entry = CacheEntry(query=query, response=response, embedding=embedding)
        
        # Find or create cluster
        cluster, sim = self._find_best_cluster(embedding)
        
        if cluster and sim >= self.cluster_threshold:
            # Add to existing cluster
            if len(cluster.entries) >= self.max_entries_per_cluster:
                self._evict_from_cluster(cluster)
            cluster.entries.append(entry)
            cluster.update_centroid()
        else:
            # Create new cluster
            if len(self.clusters) >= self.max_clusters:
                # Remove smallest cluster
                self.clusters.sort(key=lambda c: len(c.entries), reverse=True)
                self.clusters.pop()
            new_cluster = Cluster(centroid=embedding, entries=[entry])
            self.clusters.append(new_cluster)
            print(f"    [New cluster] for '{query[:40]}...'")

# Initialize SISO cache
siso_cache = SISOCache(base_threshold=0.80, cluster_threshold=0.70)
print(f"\n  Clusters: {len(siso_cache.clusters)}")


SISO: Centroid-Based Semantic Cache

Loading embedding model...
✓ SISO Cache ready
  Base threshold: 0.8
  Max clusters: 20
  Max entries/cluster: 10

  Clusters: 0


In [37]:
# SISO: Live Demo with Ollama

print("="*65)
print("SISO: Live Demo with Clustering")
print("="*65)

if not ollama_ready:
    print("\n⚠ Ollama not running")
else:
    def siso_cached_completion(query: str, cache: SISOCache) -> Dict[str, Any]:
        """LLM completion with SISO caching."""
        start = time.perf_counter()
        
        # SISO lookup
        cached_response, info = cache.get(query)
        
        if cached_response is not None:
            latency = (time.perf_counter() - start) * 1000
            return {
                "query": query, "response": cached_response, "cache_hit": True,
                "cluster_sim": info["cluster_sim"], "entry_sim": info["entry_sim"],
                "matched": info["matched"], "latency_ms": latency
            }
        
        # Cache miss - call LLM
        response, llm_latency = ollama_generate(query, WEAK_MODEL, temperature=0.7)
        response = clean_response(response)
        cache.set(query, response)
        
        return {
            "query": query, "response": response, "cache_hit": False,
            "cluster_sim": info["cluster_sim"], "entry_sim": info["entry_sim"],
            "matched": None, "latency_ms": (time.perf_counter() - start) * 1000
        }
    
    # Test with semantically grouped queries
    test_queries = [
        # Password cluster
        "How do I reset my password?",
        "I forgot my password",
        "Password recovery steps",
        
        # Hours cluster  
        "What are your business hours?",
        "When do you open?",
        
        # Refund cluster
        "How do I get a refund?",
        "I want my money back",
        "Refund policy please",
        
        # Cross-cluster test
        "Reset my account password",  # Should hit password cluster
    ]
    
    print("\nRunning queries through SISO cache:")
    print("─"*65)
    
    for i, query in enumerate(test_queries):
        result = siso_cached_completion(query, siso_cache)
        
        if result["cache_hit"]:
            print(f"\n  [{i+1}] ✓ HIT (cluster={result['cluster_sim']:.2f}, entry={result['entry_sim']:.2f})")
            print(f"      Matched: '{result['matched'][:35]}...'")
        else:
            print(f"\n  [{i+1}] ✗ MISS → LLM ({result['latency_ms']:.0f}ms)")
        print(f"      Query: {query}")
        print(f"      Response: {result['response'][:60]}...")
    
    # Statistics
    print(f"\n{'═'*65}")
    print("SISO STATISTICS")
    print(f"{'═'*65}")
    print(f"  Clusters formed: {len(siso_cache.clusters)}")
    print(f"  Cache hits: {siso_cache.stats['hits']}")
    print(f"  Cache misses: {siso_cache.stats['misses']}")
    hit_rate = siso_cache.stats['hits'] / (siso_cache.stats['hits'] + siso_cache.stats['misses']) * 100
    print(f"  Hit rate: {hit_rate:.0f}%")
    print(f"\n  Cluster searches: {siso_cache.stats['cluster_searches']}")
    print(f"  Entry searches: {siso_cache.stats['entry_searches']}")
    
    # Efficiency comparison
    total_queries = siso_cache.stats['hits'] + siso_cache.stats['misses']
    flat_searches = total_queries * total_queries  # O(n) for flat cache
    siso_searches = siso_cache.stats['cluster_searches'] + siso_cache.stats['entry_searches']
    print(f"\n  Flat cache would do: ~{flat_searches} comparisons")
    print(f"  SISO did: {siso_searches} comparisons")
    print(f"  Efficiency gain: {(1 - siso_searches/max(flat_searches,1))*100:.0f}%")


SISO: Live Demo with Clustering

Running queries through SISO cache:
─────────────────────────────────────────────────────────────────
    [New cluster] for 'How do I reset my password?...'

  [1] ✗ MISS → LLM (3617ms)
      Query: How do I reset my password?
      Response: To reset your password, you'll need to follow these general ...

  [2] ✗ MISS → LLM (2493ms)
      Query: I forgot my password
      Response: Forgetting passwords can be frustrating. Here are some steps...
    [New cluster] for 'Password recovery steps...'

  [3] ✗ MISS → LLM (5666ms)
      Query: Password recovery steps
      Response: If you've forgotten your password, here are some steps to he...
    [New cluster] for 'What are your business hours?...'

  [4] ✗ MISS → LLM (1036ms)
      Query: What are your business hours?
      Response: I'm available to help you 24/7. However, my response times m...
    [New cluster] for 'When do you open?...'

  [5] ✗ MISS → LLM (542ms)
      Query: When do you open?
      R

In [38]:
# SISO vs GPTCache: Comparison

print("="*65)
print("SISO vs SEMANTIC CACHE: Comparison")
print("="*65)

print("""
┌──────────────────────┬─────────────────────────┬─────────────────────────┐
│ Aspect               │ SemanticCache (GPTCache)│ SISO                    │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Lookup complexity    │ O(n) - check all entries│ O(k + m) - k clusters,  │
│                      │                         │ m entries in best       │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Eviction policy      │ Global LRU              │ Per-cluster LRU         │
│                      │                         │ (locality-aware)        │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Threshold            │ Fixed global            │ Dynamic per-cluster     │
│                      │                         │ (adapts to density)     │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Best for             │ Small cache, simple     │ Large cache, diverse    │
│                      │ query distribution      │ query topics            │
└──────────────────────┴─────────────────────────┴─────────────────────────┘
""")

# Show cluster structure
if 'siso_cache' in dir() and siso_cache.clusters:
    print("SISO Cluster Structure:")
    print("─"*65)
    for i, cluster in enumerate(siso_cache.clusters):
        print(f"\n  Cluster {i+1}: {len(cluster.entries)} entries")
        for entry in cluster.entries[:3]:  # Show first 3
            print(f"    • {entry.query[:50]}{'...' if len(entry.query) > 50 else ''}")
        if len(cluster.entries) > 3:
            print(f"    ... and {len(cluster.entries) - 3} more")


SISO vs SEMANTIC CACHE: Comparison

┌──────────────────────┬─────────────────────────┬─────────────────────────┐
│ Aspect               │ SemanticCache (GPTCache)│ SISO                    │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Lookup complexity    │ O(n) - check all entries│ O(k + m) - k clusters,  │
│                      │                         │ m entries in best       │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Eviction policy      │ Global LRU              │ Per-cluster LRU         │
│                      │                         │ (locality-aware)        │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Threshold            │ Fixed global            │ Dynamic per-cluster     │
│                      │                         │ (adapts to density)     │
├──────────────────────┼─────────────────────────┼─────────────────────────┤
│ Best for             │ Small cache, si

---

# Demo 8: LiteLLM Caching Options

LiteLLM has built-in caching support:

| Type | Backend | Semantic? | Setup |
|------|---------|-----------|-------|
| In-memory | Dict | No (exact match) | Zero config |
| Redis | Redis server | No | `pip install redis` + server |
| Qdrant | Qdrant server | Yes | `pip install qdrant-client` + server |
| S3 | AWS S3 | No | AWS credentials |

Let's explore what's possible in notebook scope.


In [43]:
# LiteLLM Caching: Explore Options

print("="*65)
print("LITELLM CACHING: Available Options")
print("="*65)

# Check what's available
cache_options = {}

# 1. In-memory (always available)
cache_options["in_memory"] = True
print("\n✓ In-memory cache: Available (built-in)")

# 2. Redis
try:
    import redis
    # Try to connect to local Redis
    r = redis.Redis(host='localhost', port=6379, socket_connect_timeout=1)
    r.ping()
    cache_options["redis"] = True
    print("✓ Redis: Available (server running)")
except Exception as e:
    cache_options["redis"] = False
    print(f"✗ Redis: Not available ({type(e).__name__})")

# 3. Qdrant (for semantic caching)
try:
    from qdrant_client import QdrantClient
    # Try in-memory Qdrant (no server needed!)
    qdrant = QdrantClient(":memory:")
    cache_options["qdrant_memory"] = True
    print("✓ Qdrant (in-memory): Available - can do semantic caching!")
except ImportError:
    cache_options["qdrant_memory"] = False
    print("✗ Qdrant: Not installed (pip install qdrant-client)")
except Exception as e:
    cache_options["qdrant_memory"] = False
    print(f"✗ Qdrant: Error ({e})")

# 4. Check LiteLLM caching module
try:
    from litellm import Cache
    cache_options["litellm_cache"] = True
    print("✓ LiteLLM Cache class: Available")
except ImportError:
    cache_options["litellm_cache"] = False
    print("✗ LiteLLM Cache: Not available")

print(f"\n{'─'*65}")
print("RECOMMENDATION:")
if cache_options.get("qdrant_memory"):
    print("  → Qdrant in-memory mode available for semantic caching!")
elif cache_options.get("redis"):
    print("  → Redis available for exact-match caching")
else:
    print("  → In-memory cache or custom SemanticCache (Demo 5)")


LITELLM CACHING: Available Options

✓ In-memory cache: Available (built-in)
✗ Redis: Not available (ModuleNotFoundError)
✓ Qdrant (in-memory): Available - can do semantic caching!
✓ LiteLLM Cache class: Available

─────────────────────────────────────────────────────────────────
RECOMMENDATION:
  → Qdrant in-memory mode available for semantic caching!


In [44]:
# LiteLLM Caching: In-Memory Demo

print("="*65)
print("LITELLM: In-Memory Caching Demo")
print("="*65)

if not litellm_available or not ollama_ready:
    print("\n⚠ LiteLLM or Ollama not available")
else:
    import litellm
    from litellm import Cache
    
    # Enable in-memory caching
    litellm.cache = Cache(type="local")  # In-memory cache
    litellm.enable_cache()
    
    print("\n✓ LiteLLM in-memory cache enabled")
    print("  Type: local (exact match)")
    
    # Test caching
    test_messages = [{"role": "user", "content": "What is 2+2? One word."}]
    
    print("\n[Test 1: First call - should hit LLM]")
    print("─"*65)
    start = time.perf_counter()
    resp1 = litellm.completion(
        model=f"ollama/{WEAK_MODEL}",
        messages=test_messages,
        api_base="http://localhost:11434"
    )
    latency1 = (time.perf_counter() - start) * 1000
    print(f"  Response: {resp1.choices[0].message.content.strip()}")
    print(f"  Latency: {latency1:.0f}ms")
    
    print("\n[Test 2: Same query - should hit cache]")
    print("─"*65)
    start = time.perf_counter()
    resp2 = litellm.completion(
        model=f"ollama/{WEAK_MODEL}",
        messages=test_messages,
        api_base="http://localhost:11434"
    )
    latency2 = (time.perf_counter() - start) * 1000
    print(f"  Response: {resp2.choices[0].message.content.strip()}")
    print(f"  Latency: {latency2:.0f}ms")
    
    # Check if cache hit
    if latency2 < latency1 / 2:
        print(f"\n  ✓ Cache HIT! ({latency1:.0f}ms → {latency2:.0f}ms)")
    else:
        print(f"\n  ⚠ Cache might not have hit (check litellm version)")
    
    print("\n[Test 3: Slightly different query - cache MISS (exact match only)]")
    print("─"*65)
    different_messages = [{"role": "user", "content": "What is 2 + 2? One word."}]  # Extra space
    start = time.perf_counter()
    resp3 = litellm.completion(
        model=f"ollama/{WEAK_MODEL}",
        messages=different_messages,
        api_base="http://localhost:11434"
    )
    latency3 = (time.perf_counter() - start) * 1000
    print(f"  Response: {resp3.choices[0].message.content.strip()}")
    print(f"  Latency: {latency3:.0f}ms")
    print(f"  → This is why semantic caching matters!")
    
    # Disable cache for other demos
    litellm.cache = None


LITELLM: In-Memory Caching Demo

✓ LiteLLM in-memory cache enabled
  Type: local (exact match)

[Test 1: First call - should hit LLM]
─────────────────────────────────────────────────────────────────
  Response: Four.
  Latency: 143ms

[Test 2: Same query - should hit cache]
─────────────────────────────────────────────────────────────────
  Response: Four.
  Latency: 1ms

  ✓ Cache HIT! (143ms → 1ms)

[Test 3: Slightly different query - cache MISS (exact match only)]
─────────────────────────────────────────────────────────────────
  Response: Four.
  Latency: 105ms
  → This is why semantic caching matters!


In [45]:
# LiteLLM + Qdrant: Semantic Caching (if available)

print("="*65)
print("LITELLM + QDRANT: Semantic Caching")
print("="*65)

if not cache_options.get("qdrant_memory"):
    print("\n⚠ Qdrant not available")
    print("  To enable: pip install qdrant-client")
    print("\n  Qdrant supports in-memory mode (no server needed)!")
    print("  This would enable true semantic caching with LiteLLM.")
else:
    print("\n✓ Qdrant available - attempting semantic cache setup...")
    
    try:
        from litellm import Cache
        from qdrant_client import QdrantClient
        
        # LiteLLM's Qdrant integration requires specific setup
        # Check if litellm supports qdrant cache type
        print("\nLiteLLM Qdrant cache setup:")
        print("─"*65)
        print("""
  # Production setup (requires qdrant server or cloud):
  
  litellm.cache = Cache(
      type="qdrant-semantic",
      qdrant_semantic_cache_embedding_model="text-embedding-ada-002",
      qdrant_collection_name="llm_cache",
      qdrant_quantization_config=None,
      similarity_threshold=0.8,
  )
  
  # Note: LiteLLM's Qdrant cache currently requires:
  # 1. OpenAI embeddings API (for embedding generation)
  # 2. Qdrant server running (cloud or local)
  
  # For notebook-scope semantic caching without external deps,
  # our custom SemanticCache (Demo 5) or SISO (Demo 7) work better.
""")
        
        print("Alternative: Use Qdrant in-memory directly with custom cache")
        print("─"*65)
        
        # Demo Qdrant in-memory with our own semantic cache
        from qdrant_client.models import Distance, VectorParams, PointStruct
        
        qdrant = QdrantClient(":memory:")
        collection_name = "semantic_cache"
        
        # Create collection
        qdrant.create_collection(
            collection_name=collection_name,
            vectors_config=VectorParams(size=384, distance=Distance.COSINE)
        )
        print("✓ Qdrant in-memory collection created")
        
        # Use sentence-transformers for embeddings
        from sentence_transformers import SentenceTransformer
        encoder = SentenceTransformer('all-MiniLM-L6-v2')
        
        # Add a test entry
        test_query = "How do I reset my password?"
        test_response = "Go to Settings > Security > Reset Password"
        embedding = encoder.encode(test_query).tolist()
        
        qdrant.upsert(
            collection_name=collection_name,
            points=[PointStruct(id=1, vector=embedding, payload={"query": test_query, "response": test_response})]
        )
        print("✓ Test entry added to Qdrant")
        
        # Search with similar query
        similar_query = "I forgot my password"
        search_embedding = encoder.encode(similar_query).tolist()
        
        results = qdrant.search(
            collection_name=collection_name,
            query_vector=search_embedding,
            limit=1
        )
        
        if results and results[0].score > 0.7:
            print(f"\n✓ Semantic search works!")
            print(f"  Query: '{similar_query}'")
            print(f"  Matched: '{results[0].payload['query']}' (score={results[0].score:.2f})")
            print(f"  Cached response: {results[0].payload['response']}")
        
        print("\n→ Qdrant in-memory works for semantic caching!")
        print("  This can replace our custom SemanticCache if preferred.")
        
    except Exception as e:
        print(f"\n✗ Error: {e}")
        print("  Falling back to custom SemanticCache (Demo 5)")


LITELLM + QDRANT: Semantic Caching

✓ Qdrant available - attempting semantic cache setup...

LiteLLM Qdrant cache setup:
─────────────────────────────────────────────────────────────────

  # Production setup (requires qdrant server or cloud):

  litellm.cache = Cache(
      type="qdrant-semantic",
      qdrant_semantic_cache_embedding_model="text-embedding-ada-002",
      qdrant_collection_name="llm_cache",
      qdrant_quantization_config=None,
      similarity_threshold=0.8,
  )

  # Note: LiteLLM's Qdrant cache currently requires:
  # 1. OpenAI embeddings API (for embedding generation)
  # 2. Qdrant server running (cloud or local)

  # For notebook-scope semantic caching without external deps,
  # our custom SemanticCache (Demo 5) or SISO (Demo 7) work better.

Alternative: Use Qdrant in-memory directly with custom cache
─────────────────────────────────────────────────────────────────
✓ Qdrant in-memory collection created
✓ Test entry added to Qdrant

✗ Error: 'QdrantClient' objec

In [46]:
# Caching Options Summary

print("="*65)
print("SEMANTIC CACHING: Options Summary")
print("="*65)

print("""
┌────────────────────┬─────────────┬──────────────┬─────────────────────┐
│ Option             │ Semantic?   │ External Dep │ Best For            │
├────────────────────┼─────────────┼──────────────┼─────────────────────┤
│ LiteLLM local      │ No (exact)  │ None         │ Dev/testing         │
│ LiteLLM + Redis    │ No (exact)  │ Redis server │ Production, fast    │
│ LiteLLM + Qdrant   │ Yes         │ OpenAI API   │ Production, cloud   │
│ Custom SemanticCache│ Yes        │ sentence-tx  │ Notebook, offline   │
│ Custom SISO        │ Yes         │ sentence-tx  │ Notebook, clustered │
│ Qdrant in-memory   │ Yes         │ qdrant-client│ Notebook, efficient │
└────────────────────┴─────────────┴──────────────┴─────────────────────┘

FOR THIS NOTEBOOK (offline, no external APIs):
─────────────────────────────────────────────────────────────────
  ✓ Custom SemanticCache (Demo 5) - simple, educational
  ✓ Custom SISO (Demo 7) - shows clustering concepts  
  ✓ Qdrant in-memory (if installed) - production-like

FOR PRODUCTION:
─────────────────────────────────────────────────────────────────
  → LiteLLM + Redis for exact-match (simple, fast)
  → LiteLLM + Qdrant for semantic (requires OpenAI embeddings)
  → GPTCache if using openai<2.0
""")


SEMANTIC CACHING: Options Summary

┌────────────────────┬─────────────┬──────────────┬─────────────────────┐
│ Option             │ Semantic?   │ External Dep │ Best For            │
├────────────────────┼─────────────┼──────────────┼─────────────────────┤
│ LiteLLM local      │ No (exact)  │ None         │ Dev/testing         │
│ LiteLLM + Redis    │ No (exact)  │ Redis server │ Production, fast    │
│ LiteLLM + Qdrant   │ Yes         │ OpenAI API   │ Production, cloud   │
│ Custom SemanticCache│ Yes        │ sentence-tx  │ Notebook, offline   │
│ Custom SISO        │ Yes         │ sentence-tx  │ Notebook, clustered │
│ Qdrant in-memory   │ Yes         │ qdrant-client│ Notebook, efficient │
└────────────────────┴─────────────┴──────────────┴─────────────────────┘

FOR THIS NOTEBOOK (offline, no external APIs):
─────────────────────────────────────────────────────────────────
  ✓ Custom SemanticCache (Demo 5) - simple, educational
  ✓ Custom SISO (Demo 7) - shows clustering concepts  
