# Optimization: Caching Strategies

----

This notebook focuses on **caching strategies** to reduce latency, cost, and redundant API calls for AI workloads.

You will learn:

- **Prompt Caching**: Azure OpenAI native caching for repeated prompt prefixes
- **Response Caching**: Cache exact responses with Redis
- **Semantic Caching**: Cache similar queries using vector similarity
- **Cost Analysis**: Compare cached vs non-cached costs

**Reference**: [Azure/agent-innovator-lab - Caching](https://github.com/Azure/agent-innovator-lab/tree/main/3_optimization-design-ptn/02_caching)

## Table of Contents

- [Why Caching Matters](#why-caching-matters)
- [Pre-requisites](#pre-requisites)
- [Setup](#setup)
- [Part 1: Azure OpenAI Prompt Caching](#part-1-azure-openai-prompt-caching)
- [Part 2: Redis Response Caching](#part-2-redis-response-caching)
- [Part 3: Semantic Caching](#part-3-semantic-caching)
- [Part 4: Caching Strategy Comparison](#part-4-caching-strategy-comparison)
- [Best Practices Summary](#best-practices-summary)
- [Cleanup Resources](#cleanup-resources)
- [Wrap-up](#wrap-up)

## Why Caching Matters

### The Problem: Repetitive Costs

| Scenario | Without Caching | With Caching |
|----------|----------------|---------------|
| Same system prompt (10K tokens) sent 1,000× | 10M input tokens at the standard rate | 10K tokens at the standard rate + ~9.99M tokens at the discounted cached rate |
| FAQ bot answering "What are your hours?" 500× | 500 API calls | 1 API call + 499 cache hits |
| RAG with similar queries | Full retrieval each time | Cached retrievals |

### Caching Types Overview

```
┌─────────────────────────────────────────────────────────────────────┐
│                        Caching Strategies                           │
├─────────────────────┬─────────────────────┬─────────────────────────┤
│   Prompt Caching    │  Response Caching   │   Semantic Caching      │
│   (Azure OpenAI)    │  (Redis Exact)      │   (Vector Similarity)   │
├─────────────────────┼─────────────────────┼─────────────────────────┤
│ Cache prompt prefix │ Cache exact Q&A     │ Cache similar queries   │
│ 50% cost reduction  │ ~0 latency on hit   │ Fuzzy matching          │
│ Automatic           │ Manual setup        │ Embedding required      │
└─────────────────────┴─────────────────────┴─────────────────────────┘
```

### Cost Impact

| Model | Standard Input | Cached Input | Savings |
|-------|---------------|--------------|----------|
| GPT-4o | $2.50/1M tokens | $1.25/1M tokens | **50%** |
| GPT-4o-mini | $0.15/1M tokens | $0.075/1M tokens | **50%** |
| GPT-4.1 | $2.00/1M tokens | $0.50/1M tokens | **75%** |
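
To make the table concrete, the sketch below estimates input-token cost for the 10K-token system prompt scenario. The prices are copied from the table above; the function name `input_cost_usd` and the assumption that every request after the first is a full cache hit are illustrative only (real hit rates are lower, since cached prefixes expire after a period of inactivity).

```python
# USD per 1M input tokens: (standard, cached), copied from the Cost Impact table.
PRICES = {
    "gpt-4o": (2.50, 1.25),
    "gpt-4o-mini": (0.15, 0.075),
    "gpt-4.1": (2.00, 0.50),
}

def input_cost_usd(model: str, total_input_tokens: int, cached_tokens: int) -> float:
    """Input cost when `cached_tokens` of the input are billed at the cached rate."""
    standard_price, cached_price = PRICES[model]
    uncached_tokens = total_input_tokens - cached_tokens
    return (uncached_tokens * standard_price + cached_tokens * cached_price) / 1_000_000

# Hypothetical workload: a 10K-token system prompt sent 1,000 times.
# Optimistic assumption: every request after the first hits the prompt cache.
total = 10_000 * 1_000
cached = 10_000 * 999

for model in PRICES:
    baseline = input_cost_usd(model, total, 0)
    with_cache = input_cost_usd(model, total, cached)
    print(f"{model}: ${baseline:.2f} without caching, ${with_cache:.2f} with caching")
```

Output tokens are not discounted by prompt caching, so the overall saving on a bill is smaller than the input-token saving shown here.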

## Pre-requisites

Before running this notebook, ensure you have:

1. **Azure account** with an active subscription
2. **Azure OpenAI resource** with a GPT-4-class chat model deployment (for example, GPT-4o or GPT-4.1) and an embedding model deployment
3. **Azure Cache for Redis Enterprise** with **RediSearch** module enabled (for Parts 2-3)

### Creating Azure Redis Enterprise with RediSearch

The following cell creates an **Azure Managed Redis (Enterprise)** instance with the **RediSearch** module enabled, which is required for semantic caching.

> **Note**: Azure Managed Redis Enterprise provisioning can take 15-30 minutes. You can skip this if you already have a Redis instance with RediSearch enabled.

**Pricing**: Enterprise tier is required for RediSearch module. See [Azure Redis pricing](https://azure.microsoft.com/pricing/details/cache/) for details.

In [36]:
# Create Azure Managed Redis Enterprise with RediSearch
# ======================================================
import subprocess
import json
import os
from dotenv import load_dotenv

load_dotenv(override=True)

# Configuration
CREATE_REDIS = False  # Set to True to create Redis instance

# Load config from Foundry setup
config_file = '../0_setup/.foundry_config.json'
try:
    with open(config_file, 'r', encoding='utf-8') as f:
        config = json.load(f)
except FileNotFoundError:
    print(f"⚠️ Could not find '{config_file}'. Run 0_setup/1_setup.ipynb first.")
    config = {}

# Redis configuration
REDIS_RESOURCE_GROUP = config.get('RESOURCE_GROUP', os.environ.get('RESOURCE_GROUP', 'rg-caching-lab'))
REDIS_LOCATION = config.get('LOCATION', os.environ.get('LOCATION', 'eastus'))
REDIS_NAME = os.environ.get('REDIS_NAME', f"redis-cache-{REDIS_LOCATION[:4]}")
AZURE_SUBSCRIPTION_ID = config.get('AZURE_SUBSCRIPTION_ID', os.environ.get('AZURE_SUBSCRIPTION_ID', ''))

# Redis Enterprise SKU options:
# - Enterprise_E1 (1GB, lowest cost for dev/test)
# - Enterprise_E10 (10GB)
# - Enterprise_E20 (20GB)
# - Enterprise_E50 (50GB)
# - Enterprise_E100 (100GB)
REDIS_SKU = os.environ.get('REDIS_SKU', 'Enterprise_E1')
REDIS_CAPACITY = 2  # Number of nodes (minimum 2 for Enterprise)

def run_az(args: list) -> str:
    """Run Azure CLI command and return stdout."""
    cmd = ["az"] + args
    print(f"  $ az {' '.join(args[:8])}{'...' if len(args) > 8 else ''}")
    p = subprocess.run(cmd, capture_output=True, text=True)
    if p.returncode != 0:
        raise RuntimeError((p.stderr or p.stdout).strip())
    return p.stdout.strip()

def maybe_set_subscription():
    if AZURE_SUBSCRIPTION_ID:
        run_az(["account", "set", "--subscription", AZURE_SUBSCRIPTION_ID])

def ensure_resource_group():
    exists = run_az(["group", "exists", "-n", REDIS_RESOURCE_GROUP]).strip().lower() == "true"
    if not exists:
        print(f"🆕 Creating resource group: {REDIS_RESOURCE_GROUP}")
        run_az(["group", "create", "-n", REDIS_RESOURCE_GROUP, "-l", REDIS_LOCATION])
    else:
        print(f"✅ Resource group exists: {REDIS_RESOURCE_GROUP}")

def redis_exists() -> bool:
    try:
        run_az(["redisenterprise", "show", "-g", REDIS_RESOURCE_GROUP, "-n", REDIS_NAME, "-o", "none"])
        return True
    except Exception:
        return False

def database_exists() -> bool:
    """Check if the default database exists on the Redis cluster."""
    try:
        run_az([
            "redisenterprise", "database", "show",
            "-g", REDIS_RESOURCE_GROUP,
            "--cluster-name", REDIS_NAME,
            "-o", "none"
        ])
        return True
    except Exception:
        return False

def ensure_database():
    """Create the database with RediSearch if it doesn't exist."""
    if database_exists():
        print("   ✅ Database already exists.")
        return

    # No database found on the cluster, so create one with the RediSearch module
    print("\n📦 Creating Redis database with RediSearch module...")
    run_az([
        "redisenterprise", "database", "create",
        "-g", REDIS_RESOURCE_GROUP,
        "--cluster-name", REDIS_NAME,
        "--client-protocol", "Encrypted",
        "--clustering-policy", "EnterpriseCluster",
        "--eviction-policy", "NoEviction",
        "--modules", "name=RediSearch",
    ])
    print("   ✅ Database created successfully!")

def create_redis_enterprise():
    """Create Azure Managed Redis Enterprise with RediSearch module."""
    print(f"\n🚀 Creating Azure Managed Redis Enterprise: {REDIS_NAME}")
    print(f"   Location: {REDIS_LOCATION}")
    print(f"   SKU: {REDIS_SKU}")
    print("   Modules: RediSearch (required for semantic caching)")
    print("\n⏳ This may take 15-30 minutes...")
    
    # Create Redis Enterprise cluster
    run_az([
        "redisenterprise", "create",
        "-g", REDIS_RESOURCE_GROUP,
        "-n", REDIS_NAME,
        "-l", REDIS_LOCATION,
        "--sku", REDIS_SKU,
        "--capacity", str(REDIS_CAPACITY),
        "--no-wait",
        "--public-network-access", "Enabled",
        "--access-keys-auth", "Enabled"
    ])
    
    print("\n📝 Redis cluster creation started. Creating database with RediSearch...")
    
    # Wait for cluster to be ready before creating database
    import time
    max_wait = 1800  # 30 minutes
    poll_interval = 30
    waited = 0
    
    while waited < max_wait:
        state = None
        try:
            state = run_az([
                "redisenterprise", "show",
                "-g", REDIS_RESOURCE_GROUP,
                "-n", REDIS_NAME,
                "--query", "provisioningState",
                "-o", "tsv"
            ]).strip()
        except RuntimeError:
            # Right after a --no-wait create the cluster may not be visible yet
            print(f"   Waiting for cluster... ({waited}s)")

        if state is not None:
            print(f"   Provisioning state: {state}")
            if state == "Succeeded":
                break
            if state in ("Failed", "Canceled"):
                # Raised outside the try block so the failure is not swallowed
                raise RuntimeError(f"Redis provisioning failed: {state}")

        time.sleep(poll_interval)
        waited += poll_interval
    else:
        # The loop ended without reaching "Succeeded"
        print("⚠️ Timeout waiting for Redis cluster. Check Azure portal for status.")
        return
    
    # Delete existing default database if present (may have wrong clustering policy)
    print("\n🗑️ Removing existing default database (if any)...")
    try:
        run_az([
            "redisenterprise", "database", "delete",
            "-g", REDIS_RESOURCE_GROUP,
            "--cluster-name", REDIS_NAME,
            "--yes",
        ])
        print("   Deleted existing database.")
        time.sleep(10)  # Wait for deletion to propagate
    except RuntimeError:
        print("   No existing database to delete.")

    # Create database with RediSearch module
    print("\n📦 Creating Redis database with RediSearch module...")
    run_az([
        "redisenterprise", "database", "create",
        "-g", REDIS_RESOURCE_GROUP,
        "--cluster-name", REDIS_NAME,
        "--client-protocol", "Encrypted",
        "--clustering-policy", "EnterpriseCluster",
        "--eviction-policy", "NoEviction",
        "--modules", "name=RediSearch",
    ])
    
    print("\n✅ Redis Enterprise with RediSearch created successfully!")

def get_redis_connection_info() -> dict:
    """Get Redis endpoint and primary access key."""
    try:
        # Get the database port (falls back to 10000 if the query returns nothing)
        port = run_az([
            "redisenterprise", "database", "show",
            "-g", REDIS_RESOURCE_GROUP,
            "--cluster-name", REDIS_NAME,
            "--query", "port",
            "-o", "tsv"
        ]).strip() or "10000"
        
        # Get hostname from cluster
        hostname = run_az([
            "redisenterprise", "show",
            "-g", REDIS_RESOURCE_GROUP,
            "-n", REDIS_NAME,
            "--query", "hostName",
            "-o", "tsv"
        ]).strip()
        
        # Get the primary access key only. Without --query, the tsv output holds
        # both keys separated by a tab, which is not a valid password.
        key = run_az([
            "redisenterprise", "database", "list-keys",
            "-g", REDIS_RESOURCE_GROUP,
            "--cluster-name", REDIS_NAME,
            "--query", "primaryKey",
            "-o", "tsv"
        ]).strip()
        
        return {
            "endpoint": f"{hostname}:{port}",
            "password": key,
            "url": f"rediss://:{key}@{hostname}:{port}"
        }
    except Exception as e:
        print(f"⚠️ Could not retrieve Redis connection info: {e}")
        return {}

# Main execution
print("🔧 Azure Redis Enterprise Configuration")
print("=" * 60)
print(f"Resource Group: {REDIS_RESOURCE_GROUP}")
print(f"Redis Name: {REDIS_NAME}")
print(f"Location: {REDIS_LOCATION}")
print(f"SKU: {REDIS_SKU}")
print(f"CREATE_REDIS: {CREATE_REDIS}")
print("=" * 60)

if CREATE_REDIS:
    maybe_set_subscription()
    ensure_resource_group()
    
    if redis_exists():
        print(f"\n✅ Redis Enterprise cluster exists: {REDIS_NAME}")
        ensure_database()
    else:
        create_redis_enterprise()
    
    # Get connection info
    print("\n🔑 Retrieving connection information...")
    conn_info = get_redis_connection_info()
    if conn_info:
        print(f"\n📌 Redis Endpoint: {conn_info.get('endpoint', 'N/A')}")
        print(f"📌 Redis URL: rediss://:<key>@{conn_info.get('endpoint', 'N/A')}")
        print("\n💡 Add to your .env file:")
        print(f'   REDIS_ENDPOINT="{conn_info.get("endpoint", "")}"')
        print('   REDIS_PASSWORD="Check REDIS_PASSWORD environment variable..."')
        
        # Optionally set environment variables
        os.environ['REDIS_ENDPOINT'] = conn_info.get('endpoint', '')
        os.environ['REDIS_PASSWORD'] = conn_info.get('password', '')
else:
    print("\nℹ️ Set CREATE_REDIS=True to create Azure Redis Enterprise.")
    print("   Or set REDIS_ENDPOINT and REDIS_PASSWORD in .env for existing instance.")

🔧 Azure Redis Enterprise Configuration
Resource Group: foundry-rg
Redis Name: redis-cache-swed
Location: swedencentral
SKU: Enterprise_E1
CREATE_REDIS: False

ℹ️ Set CREATE_REDIS=True to create Azure Redis Enterprise.
   Or set REDIS_ENDPOINT and REDIS_PASSWORD in .env for existing instance.


## Setup

This notebook reuses the configuration file (`.foundry_config.json`) created by `0_setup/1_setup.ipynb`.

- If the file is missing, run the setup notebook first.
- Make sure you can authenticate (e.g., `az login`), so `DefaultAzureCredential` can work.

In [27]:
# Environment setup and PATH configuration
import json
import os
import subprocess
import asyncio
import time
import hashlib
from datetime import datetime
from typing import List, Dict, Any, Optional, Tuple
from dataclasses import dataclass
from dotenv import load_dotenv

load_dotenv(override=True)

# Ensure the notebook kernel can find Azure CLI (`az`) on PATH
possible_paths = [
    '/opt/homebrew/bin',   # macOS (Apple Silicon)
    '/usr/local/bin',      # macOS (Intel) / Linux
    '/usr/bin',            # Linux / Codespaces
    '/home/linuxbrew/.linuxbrew/bin',  # Linux Homebrew
]

az_path = None
try:
    result = subprocess.run(['which', 'az'], capture_output=True, text=True)
    if result.returncode == 0:
        az_path = os.path.dirname(result.stdout.strip())
        print(f'🔍 Azure CLI found: {result.stdout.strip()}')
except Exception:
    pass

paths_to_add: list[str] = []
if az_path and az_path not in os.environ.get('PATH', ''):
    paths_to_add.append(az_path)
else:
    for path in possible_paths:
        if os.path.exists(path) and path not in os.environ.get('PATH', ''):
            paths_to_add.append(path)

if paths_to_add:
    os.environ['PATH'] = ':'.join(paths_to_add) + ':' + os.environ.get('PATH', '')
    print(f"✅ Added to PATH: {', '.join(paths_to_add)}")
else:
    print('✅ PATH looks good already')

print(f"\nPATH (first 150 chars): {os.environ['PATH'][:150]}...")

🔍 Azure CLI found: /anaconda/envs/azureml_py38/bin//az
✅ PATH looks good already

PATH (first 150 chars): /anaconda/envs/azureml_py38/bin/:/afh/code/agent-operator-lab/.venv/bin:/home/azureuser/.vscode-server/cli/servers/Stable-c9d77990917f3102ada88be140d2...


In [32]:
# Load Foundry project settings from .foundry_config.json
from azure.identity import DefaultAzureCredential

config_file = '../0_setup/.foundry_config.json'
try:
    with open(config_file, 'r', encoding='utf-8') as f:
        config = json.load(f)
except FileNotFoundError:
    print(f"⚠️ Could not find '{config_file}'.")
    print('💡 Run 0_setup/1_setup.ipynb first to create it.')
    raise  # Re-raise with the original traceback; later cells need the config

# Project variables from config
FOUNDRY_NAME = config.get('FOUNDRY_NAME')
RESOURCE_GROUP = config.get('RESOURCE_GROUP')
LOCATION = config.get('LOCATION')
AZURE_AI_PROJECT_ENDPOINT = config.get('AZURE_AI_PROJECT_ENDPOINT')

# Azure OpenAI variables from env
AZURE_OPENAI_ENDPOINT = os.environ.get("AZURE_OPENAI_ENDPOINT")
AZURE_OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_API_KEY")
AZURE_OPENAI_CHAT_DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_CHAT_DEPLOYMENT_NAME")
AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME", "text-embedding-3-large")
AZURE_OPENAI_API_VERSION = os.environ.get("AZURE_OPENAI_API_VERSION")

# Redis configuration (optional)
REDIS_ENDPOINT = os.environ.get("REDIS_ENDPOINT", "")
REDIS_PASSWORD = os.environ.get("REDIS_PASSWORD", "")
REDIS_URL = f"rediss://:{REDIS_PASSWORD}@{REDIS_ENDPOINT}" if REDIS_ENDPOINT else ""

os.environ['FOUNDRY_NAME'] = FOUNDRY_NAME or ''
os.environ['LOCATION'] = LOCATION or ''
os.environ['RESOURCE_GROUP'] = RESOURCE_GROUP or ''
os.environ['AZURE_SUBSCRIPTION_ID'] = config.get('AZURE_SUBSCRIPTION_ID', '')

print(f"✅ Loaded settings from '{config_file}'.")
print(f"\n📌 Foundry name: {FOUNDRY_NAME}")
print(f"📌 Resource group: {RESOURCE_GROUP}")
print(f"📌 Location: {LOCATION}")
print(f"📌 Azure OpenAI endpoint: {AZURE_OPENAI_ENDPOINT}")
print(f"📌 Chat deployment: {AZURE_OPENAI_CHAT_DEPLOYMENT_NAME}")
print(f"📌 Embedding deployment: {AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME}")
print(f"📌 Redis configured: {'✅' if REDIS_URL else '❌ (Parts 2-3 will use in-memory cache)'}")

# Initialize credential for Azure services
credential = DefaultAzureCredential()

✅ Loaded settings from '../0_setup/.foundry_config.json'.

📌 Foundry name: foundry-rq90gs
📌 Resource group: foundry-rg
📌 Location: swedencentral
📌 Azure OpenAI endpoint: https://foundry-rq90gs.openai.azure.com
📌 Chat deployment: gpt-4.1
📌 Embedding deployment: text-embedding-3-large
📌 Redis configured: ✅


## Part 1: Azure OpenAI Prompt Caching

Azure OpenAI automatically caches prompt prefixes to reduce costs on repeated requests with identical beginnings.

### How It Works

```
Request 1:  [System Prompt: 1000 tokens] + [User: 50 tokens] → Full processing
Request 2:  [System Prompt: 1000 tokens] + [User: 60 tokens] → Cached prefix + new tokens
Request 3:  [System Prompt: 1000 tokens] + [User: 45 tokens] → Cached prefix + new tokens
```
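
Because only the leading part of a prompt can be reused, it helps to place stable content first and per-request content last. The sketch below illustrates that ordering; `STATIC_INSTRUCTIONS` and `build_messages` are hypothetical names, and the prompt text is a placeholder that is far shorter than a real cacheable prefix.

```python
from typing import Dict, List

# Hypothetical static instructions; a real prefix must be long enough to be cacheable.
STATIC_INSTRUCTIONS = "You are a support assistant. Follow the company style guide."

def build_messages(user_question: str, request_context: str = "") -> List[Dict[str, str]]:
    """Put the stable prefix first and per-request content last."""
    messages = [{"role": "system", "content": STATIC_INSTRUCTIONS}]
    if request_context:
        # Volatile data (timestamps, user IDs, retrieved documents) goes after the prefix
        messages.append({"role": "user", "content": f"Context:\n{request_context}"})
    messages.append({"role": "user", "content": user_question})
    return messages

first = build_messages("How do I reset my password?", "ticket=1234")
second = build_messages("Where is my invoice?", "ticket=5678")

# Both requests begin with an identical system message, so the prefix can match
print(first[0] == second[0])
```

Inserting a timestamp or user ID at the top of the system prompt would change the prefix on every request and defeat the cache.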

### Requirements

- Minimum **1,024 tokens** in the cacheable prefix
- Prefix must be **identical** across requests
- Supported models: GPT-4o, GPT-4o-mini, GPT-4.1 series
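
As a rough aid for reasoning about these requirements, the sketch below estimates how many tokens of a shared prefix could be served from the cache. It assumes the 1,024-token minimum listed above plus a 128-token matching granularity beyond that minimum; the granularity and the helper name `estimate_cached_tokens` are assumptions for illustration, not values confirmed by this notebook.

```python
MIN_PREFIX_TOKENS = 1024  # minimum cacheable prefix, from the Requirements list
CACHE_INCREMENT = 128     # assumption: cache hits grow in 128-token steps

def estimate_cached_tokens(shared_prefix_tokens: int) -> int:
    """Estimate cached tokens for a prefix shared with an earlier request."""
    if shared_prefix_tokens < MIN_PREFIX_TOKENS:
        return 0  # too short to be cached at all
    extra = shared_prefix_tokens - MIN_PREFIX_TOKENS
    # Only whole increments beyond the minimum count toward the cached portion
    return MIN_PREFIX_TOKENS + (extra // CACHE_INCREMENT) * CACHE_INCREMENT

for prefix_tokens in (900, 1024, 1270, 2000):
    print(prefix_tokens, estimate_cached_tokens(prefix_tokens))
```

Under these assumptions, a shared prefix of roughly 1,270 tokens yields 1,152 cached tokens (1,024 + 128), which is consistent with the cached-token count reported in the demo output in this part.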

In [29]:
# Azure OpenAI Prompt Caching demonstration
from openai import AzureOpenAI

# Create a long system prompt (must be >1024 tokens for caching)
LARGE_SYSTEM_PROMPT = """
You are a highly specialized AI assistant for enterprise software development.

## Your Expertise Areas
1. Cloud Architecture: AWS, Azure, GCP best practices
2. Programming Languages: Python, Java, TypeScript, Go, Rust
3. DevOps: CI/CD, Kubernetes, Docker, Terraform
4. Security: OWASP, Zero Trust, IAM best practices
5. Databases: SQL, NoSQL, Graph databases, Time-series DBs

## Response Guidelines
- Always provide code examples when relevant
- Include security considerations in all recommendations
- Suggest monitoring and observability approaches
- Consider scalability and cost implications
- Reference official documentation when possible

## Code Quality Standards
- Follow SOLID principles
- Include comprehensive error handling
- Add meaningful comments and documentation
- Consider edge cases and failure modes
- Suggest appropriate testing strategies

## Architecture Principles
- Design for resilience and fault tolerance
- Implement proper caching strategies
- Use asynchronous processing where appropriate
- Consider data consistency requirements
- Plan for horizontal scalability

## Security Requirements
- Never expose sensitive credentials in code
- Implement proper authentication and authorization
- Use encryption for data at rest and in transit
- Follow least privilege access principles
- Include input validation and sanitization

[Additional context padding to ensure >1024 tokens for cache eligibility]
""" + ("This is additional context. " * 200)  # Pad to ensure >1024 tokens

@dataclass
class CacheResult:
    """Result of a cached API call."""
    response: str
    latency_ms: float
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    cache_hit: bool

def call_with_cache_tracking(
    client: AzureOpenAI,
    deployment: str,
    system_prompt: str,
    user_message: str
) -> CacheResult:
    """
    Make an API call and track caching metrics.
    """
    start = time.perf_counter()
    
    response = client.chat.completions.create(
        model=deployment,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_message}
        ],
        max_completion_tokens=100,
    )
    
    elapsed_ms = (time.perf_counter() - start) * 1000
    
    usage = response.usage
    
    # Extract cached_tokens from prompt_tokens_details (Pydantic object)
    cached_tokens = 0
    if hasattr(usage, 'prompt_tokens_details') and usage.prompt_tokens_details is not None:
        cached_tokens = getattr(usage.prompt_tokens_details, 'cached_tokens', 0) or 0
    
    return CacheResult(
        response=response.choices[0].message.content,
        latency_ms=elapsed_ms,
        input_tokens=usage.prompt_tokens,
        output_tokens=usage.completion_tokens,
        cached_tokens=cached_tokens,
        cache_hit=cached_tokens > 0
    )

# Demonstrate prompt caching
print("📊 Azure OpenAI Prompt Caching Demo")
print("=" * 60)

if AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY:
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    # Series of requests with same system prompt
    user_queries = [
        "How do I set up a Python virtual environment?",
        "What's the best way to handle exceptions in Python?",
        "Explain Python decorators briefly.",
    ]
    
    results = []
    for i, query in enumerate(user_queries, 1):
        result = call_with_cache_tracking(
            client,
            AZURE_OPENAI_CHAT_DEPLOYMENT_NAME,
            LARGE_SYSTEM_PROMPT,
            query
        )
        results.append(result)
        
        cache_status = "✅ CACHE HIT" if result.cache_hit else "❌ CACHE MISS"
        print(f"\n📝 Request {i}: \"{query[:40]}...\"")
        print(f"   ⏱️  Latency: {result.latency_ms:.0f}ms")
        print(f"   📥 Input tokens: {result.input_tokens:,}")
        print(f"   📤 Output tokens: {result.output_tokens}")
        print(f"   💾 Cached tokens: {result.cached_tokens:,}")
        print(f"   {cache_status}")
    
    # Summary
    total_input = sum(r.input_tokens for r in results)
    total_cached = sum(r.cached_tokens for r in results)
    cache_rate = (total_cached / total_input * 100) if total_input > 0 else 0
    
    print(f"\n📈 Summary:")
    print(f"   Total input tokens: {total_input:,}")
    print(f"   Total cached tokens: {total_cached:,}")
    print(f"   Cache rate: {cache_rate:.1f}%")
    # Assumes cached input tokens are billed at a 50% discount (GPT-4o pricing);
    # the figure covers input-token cost only. Adjust the divisor for other models.
    print(f"   💰 Estimated savings: {cache_rate/2:.1f}% cost reduction")
else:
    print("⚠️ AZURE_OPENAI_ENDPOINT or AZURE_OPENAI_API_KEY not set")

📊 Azure OpenAI Prompt Caching Demo



📝 Request 1: "How do I set up a Python virtual environ..."
   ⏱️  Latency: 1257ms
   📥 Input tokens: 1,292
   📤 Output tokens: 100
   💾 Cached tokens: 0
   ❌ CACHE MISS

📝 Request 2: "What's the best way to handle exceptions..."
   ⏱️  Latency: 1228ms
   📥 Input tokens: 1,292
   📤 Output tokens: 100
   💾 Cached tokens: 0
   ❌ CACHE MISS

📝 Request 3: "Explain Python decorators briefly...."
   ⏱️  Latency: 1195ms
   📥 Input tokens: 1,287
   📤 Output tokens: 100
   💾 Cached tokens: 1,152
   ✅ CACHE HIT

📈 Summary:
   Total input tokens: 3,871
   Total cached tokens: 1,152
   Cache rate: 29.8%
   💰 Estimated savings: 14.9% cost reduction


## Part 2: Redis Response Caching

Cache exact responses to avoid redundant API calls for identical queries.

### Use Cases

- FAQ bots with common questions
- Data processing with repeated prompts
- Development/testing environments
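
An exact-match cache is only as safe as its key. The sketch below shows one way to derive a key from everything that affects the response (deployment, messages, and generation parameters) rather than from the prompt text alone. The function name `make_cache_key` and the `llm:response:` prefix are hypothetical choices for illustration.

```python
import hashlib
import json
from typing import Any, Dict, List

def make_cache_key(deployment: str, messages: List[Dict[str, str]], **params: Any) -> str:
    """Build a deterministic cache key from all inputs that influence the response."""
    payload = json.dumps(
        {"deployment": deployment, "messages": messages, "params": params},
        sort_keys=True,      # stable ordering so equal inputs serialize identically
        ensure_ascii=False,
    )
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()
    # A namespace prefix keeps these keys apart from other data in the same Redis
    return f"llm:response:{digest}"

msgs = [{"role": "user", "content": "What is Python?"}]
key_a = make_cache_key("gpt-4.1", msgs, max_tokens=200)
key_b = make_cache_key("gpt-4.1", msgs, max_tokens=200)
key_c = make_cache_key("gpt-4o-mini", msgs, max_tokens=200)

print(key_a == key_b)  # same inputs, same key
print(key_a == key_c)  # different deployment, different key
```

Keying on the full request avoids returning a response generated by a different model or with different settings.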

In [30]:
# Response Caching implementation (in-memory fallback if no Redis)

class ResponseCache:
    """
    Simple response cache with TTL support.
    Uses Redis if available, otherwise falls back to in-memory dict.
    """
    def __init__(self, redis_url: Optional[str] = None, ttl_seconds: int = 3600):
        self.ttl = ttl_seconds
        self.use_redis = False
        self._memory_cache: Dict[str, Tuple[str, float]] = {}  # {key: (value, expiry_time)}
        
        if redis_url:
            try:
                import redis
                self.redis_client = redis.from_url(redis_url)
                self.redis_client.ping()
                self.use_redis = True
                print("✅ Connected to Redis")
            except Exception as e:
                print(f"⚠️ Redis connection failed: {e}")
                print("   Using in-memory cache instead")
        else:
            print("📦 Using in-memory cache (no Redis URL provided)")
    
    def _hash_key(self, prompt: str) -> str:
        """Create a hash key from the prompt."""
        return hashlib.sha256(prompt.encode()).hexdigest()[:16]
    
    def get(self, prompt: str) -> Optional[str]:
        """Get cached response for a prompt."""
        key = self._hash_key(prompt)
        
        if self.use_redis:
            value = self.redis_client.get(key)
            return value.decode() if value else None
        else:
            if key in self._memory_cache:
                value, expiry = self._memory_cache[key]
                if time.time() < expiry:
                    return value
                else:
                    del self._memory_cache[key]
            return None
    
    def set(self, prompt: str, response: str):
        """Cache a response for a prompt."""
        key = self._hash_key(prompt)
        
        if self.use_redis:
            self.redis_client.setex(key, self.ttl, response)
        else:
            self._memory_cache[key] = (response, time.time() + self.ttl)
    
    def stats(self) -> Dict[str, int]:
        """Get cache statistics."""
        if self.use_redis:
            info = self.redis_client.info()
            return {
                "hits": info.get("keyspace_hits", 0),
                "misses": info.get("keyspace_misses", 0),
            }
        else:
            return {"entries": len(self._memory_cache)}

def cached_completion(
    client: AzureOpenAI,
    cache: ResponseCache,
    deployment: str,
    prompt: str,
    force_refresh: bool = False
) -> Tuple[str, bool, float]:
    """
    Get completion with caching.
    Returns: (response, cache_hit, latency_ms)
    """
    start = time.perf_counter()
    
    # Check cache first (compare against None so an empty cached string still counts as a hit)
    if not force_refresh:
        cached = cache.get(prompt)
        if cached is not None:
            latency = (time.perf_counter() - start) * 1000
            return cached, True, latency
    
    # Call API
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    
    result = response.choices[0].message.content
    latency = (time.perf_counter() - start) * 1000
    
    # Store in cache
    cache.set(prompt, result)
    
    return result, False, latency

# Demonstrate response caching
print("📊 Response Caching Demo")
print("=" * 60)

cache = ResponseCache(REDIS_URL if REDIS_URL else None, ttl_seconds=300)

if AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY:
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    # Simulate repeated queries (common in FAQ scenarios)
    queries = [
        "What is Python?",
        "What is Python?",  # Repeated - should hit cache
        "What is JavaScript?",
        "What is Python?",  # Repeated - should hit cache
        "What is JavaScript?",  # Repeated - should hit cache
    ]
    
    total_api_calls = 0
    total_cache_hits = 0
    
    for i, query in enumerate(queries, 1):
        response, cache_hit, latency = cached_completion(
            client, cache, AZURE_OPENAI_CHAT_DEPLOYMENT_NAME, query
        )
        
        if cache_hit:
            total_cache_hits += 1
            status = "✅ CACHE HIT"
        else:
            total_api_calls += 1
            status = "❌ CACHE MISS (API call)"
        
        print(f"\n📝 Query {i}: \"{query}\"")
        print(f"   ⏱️  Latency: {latency:.1f}ms")
        print(f"   {status}")
    
    print(f"\n📈 Summary:")
    print(f"   Total queries: {len(queries)}")
    print(f"   API calls: {total_api_calls}")
    print(f"   Cache hits: {total_cache_hits}")
    print(f"   Cache hit rate: {total_cache_hits/len(queries)*100:.0f}%")
    print(f"   💰 API cost reduction: {total_cache_hits/len(queries)*100:.0f}%")
else:
    print("⚠️ AZURE_OPENAI_ENDPOINT or AZURE_OPENAI_API_KEY not set")

📊 Response Caching Demo
⚠️ Redis connection failed: invalid username-password pair
   Using in-memory cache instead

📝 Query 1: "What is Python?"
   ⏱️  Latency: 1953.3ms
   ❌ CACHE MISS (API call)

📝 Query 2: "What is Python?"
   ⏱️  Latency: 0.0ms
   ✅ CACHE HIT

📝 Query 3: "What is JavaScript?"
   ⏱️  Latency: 2713.8ms
   ❌ CACHE MISS (API call)

📝 Query 4: "What is Python?"
   ⏱️  Latency: 0.0ms
   ✅ CACHE HIT

📝 Query 5: "What is JavaScript?"
   ⏱️  Latency: 0.0ms
   ✅ CACHE HIT

📈 Summary:
   Total queries: 5
   API calls: 2
   Cache hits: 3
   Cache hit rate: 60%
   💰 API cost reduction: 60%


## Part 3: Semantic Caching

Semantic caching uses vector similarity to match queries that are **similar but not identical**.

### How It Works

```
Query: "What is the capital of France?"  → Cache MISS → API call → Store embedding + response
Query: "What's France's capital city?"   → Similarity > 0.9 → Cache HIT → Return cached response
Query: "Tell me about Paris"             → Similarity < 0.7 → Cache MISS → API call
```

### Benefits

- Handles paraphrased queries
- Reduces redundant API calls for semantically similar questions
- Configurable similarity threshold
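
The threshold decision can be illustrated without calling an embedding model. The sketch below uses hand-written 3-dimensional vectors as stand-ins for real embeddings (which have hundreds or thousands of dimensions); the `lookup` helper and the vectors are hypothetical, chosen only to show a hit and a miss at a 0.9 threshold.

```python
import math
from typing import List, Optional, Sequence, Tuple

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def lookup(
    query_vec: Sequence[float],
    entries: List[Tuple[Sequence[float], str]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the best-matching cached response, or None if below the threshold."""
    best_score, best_response = 0.0, None
    for vec, response in entries:
        score = cosine_similarity(query_vec, vec)
        if score > best_score:
            best_score, best_response = score, response
    return best_response if best_score >= threshold else None

# Toy vectors standing in for embeddings of cached prompts
entries = [([1.0, 0.0, 0.0], "Paris is the capital of France.")]

print(lookup([0.95, 0.1, 0.0], entries))  # nearly parallel vector: cache hit
print(lookup([0.0, 1.0, 0.0], entries))   # orthogonal vector: cache miss
```

A higher threshold reduces the risk of returning an answer to a different question at the cost of fewer hits; the right value depends on the embedding model and the workload.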

In [39]:
# Semantic Caching implementation
import numpy as np
from dataclasses import dataclass, field

@dataclass
class SemanticCacheEntry:
    """Entry in the semantic cache."""
    prompt: str
    response: str
    embedding: List[float]
    timestamp: float = field(default_factory=time.time)

class SemanticCache:
    """
    Semantic cache using vector similarity.
    
    Uses embeddings to find similar queries and return cached responses.
    """
    def __init__(
        self,
        embedding_client: AzureOpenAI,
        embedding_model: str,
        similarity_threshold: float = 0.9
    ):
        self.embedding_client = embedding_client
        self.embedding_model = embedding_model
        self.similarity_threshold = similarity_threshold
        self.entries: List[SemanticCacheEntry] = []
        self.stats = {"hits": 0, "misses": 0}
    
    def _get_embedding(self, text: str) -> List[float]:
        """Get embedding for text."""
        response = self.embedding_client.embeddings.create(
            model=self.embedding_model,
            input=text
        )
        return response.data[0].embedding
    
    def _cosine_similarity(self, a: List[float], b: List[float]) -> float:
        """Calculate cosine similarity between two vectors."""
        a_np = np.array(a)
        b_np = np.array(b)
        return float(np.dot(a_np, b_np) / (np.linalg.norm(a_np) * np.linalg.norm(b_np)))
    
    def get(self, prompt: str) -> Tuple[Optional[str], float]:
        """
        Find cached response for semantically similar prompt.
        Returns: (response or None, best_similarity_score)
        """
        if not self.entries:
            return None, 0.0
        
        query_embedding = self._get_embedding(prompt)
        
        best_match = None
        best_similarity = 0.0
        
        for entry in self.entries:
            similarity = self._cosine_similarity(query_embedding, entry.embedding)
            if similarity > best_similarity:
                best_similarity = similarity
                best_match = entry
        
        if best_similarity >= self.similarity_threshold:
            self.stats["hits"] += 1
            return best_match.response, best_similarity
        
        self.stats["misses"] += 1
        return None, best_similarity
    
    def set(self, prompt: str, response: str):
        """Store prompt-response pair in cache."""
        embedding = self._get_embedding(prompt)
        self.entries.append(SemanticCacheEntry(
            prompt=prompt,
            response=response,
            embedding=embedding
        ))

def semantic_cached_completion(
    client: AzureOpenAI,
    cache: SemanticCache,
    deployment: str,
    prompt: str
) -> Tuple[str, bool, float, float]:
    """
    Get completion with semantic caching.
    Returns: (response, cache_hit, latency_ms, similarity)
    """
    start = time.perf_counter()
    
    # Check semantic cache
    cached_response, similarity = cache.get(prompt)
    
    if cached_response is not None:
        latency = (time.perf_counter() - start) * 1000
        return cached_response, True, latency, similarity
    
    # Call API
    response = client.chat.completions.create(
        model=deployment,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=200,
    )
    
    result = response.choices[0].message.content
    latency = (time.perf_counter() - start) * 1000
    
    # Store in cache
    cache.set(prompt, result)
    
    return result, False, latency, similarity

# Demonstrate semantic caching
print("📊 Semantic Caching Demo")
print("=" * 60)

if AZURE_OPENAI_ENDPOINT and AZURE_OPENAI_API_KEY:
    client = AzureOpenAI(
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_key=AZURE_OPENAI_API_KEY,
        api_version=AZURE_OPENAI_API_VERSION,
    )
    
    semantic_cache = SemanticCache(
        embedding_client=client,
        embedding_model=AZURE_OPENAI_EMBEDDING_DEPLOYMENT_NAME,
        similarity_threshold=0.70,  # demo value, below the 0.9 default, so the paraphrases below register as hits
    )
    
    # Queries with semantic variations
    queries = [
        "What is the capital of France?",
        "What's France's capital city?",  # Semantically similar
        "Tell me the capital of France",  # Semantically similar
        "What is the capital of Germany?",  # Different topic
        "Germany's capital is?",  # Semantically similar to #4
    ]
    
    print(f"🎯 Similarity threshold: {semantic_cache.similarity_threshold}")
    
    for i, query in enumerate(queries, 1):
        response, cache_hit, latency, similarity = semantic_cached_completion(
            client, semantic_cache, AZURE_OPENAI_CHAT_DEPLOYMENT_NAME, query
        )
        
        if cache_hit:
            status = f"✅ SEMANTIC HIT (similarity: {similarity:.2f})"
        else:
            status = f"❌ CACHE MISS (best match: {similarity:.2f})"
        
        print(f"\n📝 Query {i}: \"{query}\"")
        print(f"   ⏱️  Latency: {latency:.0f}ms")
        print(f"   {status}")
    
    print(f"\n📈 Summary:")
    print(f"   Total queries: {len(queries)}")
    print(f"   Semantic hits: {semantic_cache.stats['hits']}")
    print(f"   Cache misses: {semantic_cache.stats['misses']}")
    print(f"   Hit rate: {semantic_cache.stats['hits']/len(queries)*100:.0f}%")
else:
    print("⚠️ AZURE_OPENAI_ENDPOINT or AZURE_OPENAI_API_KEY not set")

📊 Semantic Caching Demo
🎯 Similarity threshold: 0.7

📝 Query 1: "What is the capital of France?"
   ⏱️  Latency: 903ms
   ❌ CACHE MISS (best match: 0.00)

📝 Query 2: "What's France's capital city?"
   ⏱️  Latency: 53ms
   ✅ SEMANTIC HIT (similarity: 0.78)

📝 Query 3: "Tell me the capital of France"
   ⏱️  Latency: 76ms
   ✅ SEMANTIC HIT (similarity: 0.73)

📝 Query 4: "What is the capital of Germany?"
   ⏱️  Latency: 919ms
   ❌ CACHE MISS (best match: 0.57)

📝 Query 5: "Germany's capital is?"
   ⏱️  Latency: 52ms
   ✅ SEMANTIC HIT (similarity: 0.84)

📈 Summary:
   Total queries: 5
   Semantic hits: 3
   Cache misses: 1
   Hit rate: 60%
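
The recorded scores above also show how sensitive the hit rate is to the threshold. The following is a minimal sketch, not part of the original notebook: it replays the best-match scores from the output against several thresholds. The `observed` list and the `sweep` helper are hypothetical names introduced here, and the replay is only an approximation, because in a real run every miss adds a new cache entry, which changes the scores of later queries.

```python
from typing import List, Tuple

# Best-match similarity scores copied from the demo output above.
# The bool marks whether the query paraphrases an earlier one, i.e. whether
# a cache hit would have been the correct outcome.
observed: List[Tuple[str, float, bool]] = [
    ("What's France's capital city?", 0.78, True),
    ("Tell me the capital of France", 0.73, True),
    ("What is the capital of Germany?", 0.57, False),
    ("Germany's capital is?", 0.84, True),
]

def sweep(threshold: float) -> Tuple[int, int]:
    """Return (correct_hits, false_hits) for the recorded scores."""
    correct = sum(1 for _, score, is_para in observed
                  if score >= threshold and is_para)
    false = sum(1 for _, score, is_para in observed
                if score >= threshold and not is_para)
    return correct, false

for t in (0.5, 0.7, 0.8, 0.9):
    correct, false = sweep(t)
    print(f"threshold={t:.1f}  correct hits={correct}  false hits={false}")
```

A false hit here means the Germany question would have been answered with the cached France response, which is why lowering the threshold should be checked against known-different queries as well as paraphrases.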


## Cleanup Resources

When you're done with this lab, you can delete the Azure Redis Enterprise instance to avoid ongoing charges.

> **Warning**: This will permanently delete the Redis instance and all cached data.

In [None]:
# Cleanup: Delete Azure Redis Enterprise instance
# ================================================

DELETE_REDIS = False  # Set to True to delete the Redis instance

if DELETE_REDIS:
    print("🗑️ Deleting Azure Redis Enterprise...")
    print(f"   Resource Group: {REDIS_RESOURCE_GROUP}")
    print(f"   Redis Name: {REDIS_NAME}")
    
    confirm = input("\n⚠️ Are you sure? Type 'yes' to confirm: ")
    
    if confirm.lower() == 'yes':
        try:
            # Delete the Redis Enterprise cluster (this also deletes the database)
            run_az([
                "redisenterprise", "delete",
                "-g", REDIS_RESOURCE_GROUP,
                "-n", REDIS_NAME,
                "--yes",
                "--no-wait"
            ])
            print("\n✅ Redis Enterprise deletion initiated.")
            print("   This may take a few minutes to complete.")
        except Exception as e:
            print(f"\n❌ Failed to delete Redis: {e}")
    else:
        print("\n❌ Deletion cancelled.")
else:
    print("ℹ️ Set DELETE_REDIS=True to delete the Azure Redis Enterprise instance.")
    print(f"   Current Redis: {REDIS_NAME} in {REDIS_RESOURCE_GROUP}")

## Part 4: Caching Strategy Comparison

In [None]:
# Caching Strategy Comparison
print("📊 Caching Strategy Comparison")
print("=" * 80)

comparison = [
    {
        "Strategy": "Prompt Caching (Azure OpenAI)",
        "Match Type": "Exact prefix",
        "Setup": "Automatic",
        "Latency Reduction": "Moderate",
        "Cost Reduction": "50-75%",
        "Best For": "Long system prompts, few-shot examples",
    },
    {
        "Strategy": "Response Caching (Redis)",
        "Match Type": "Exact query",
        "Setup": "Redis required",
        "Latency Reduction": "~99%",
        "Cost Reduction": "100% on hits",
        "Best For": "FAQ bots, repeated queries",
    },
    {
        "Strategy": "Semantic Caching",
        "Match Type": "Similar queries",
        "Setup": "Embeddings + Vector DB",
        "Latency Reduction": "~90%",
        "Cost Reduction": "100% on hits",
        "Best For": "Natural language variations, chatbots",
    },
]

# Print comparison table
# The Cost column is 14 wide so that "100% on hits" does not push "Best For" out of alignment
print(f"\n{'Strategy':<30} {'Match':<15} {'Latency':<12} {'Cost':<14} {'Best For'}")
print("-" * 100)
for row in comparison:
    print(f"{row['Strategy']:<30} {row['Match Type']:<15} {row['Latency Reduction']:<12} {row['Cost Reduction']:<14} {row['Best For']}")

print("\n" + "=" * 80)
print("📋 Decision Guide:")
print("\n   Use PROMPT CACHING when:")
print("   • You have long system prompts (>1024 tokens)")
print("   • Your prompts share common prefixes (few-shot examples)")
print("   • You want automatic cost reduction with no setup")

print("\n   Use RESPONSE CACHING when:")
print("   • You have frequently repeated exact queries")
print("   • Latency is critical (need <10ms responses)")
print("   • You're building FAQ or support bots")

print("\n   Use SEMANTIC CACHING when:")
print("   • Users ask the same question in different ways")
print("   • Exact match caching has low hit rates")
print("   • You're building conversational AI with varied inputs")
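
The per-hit latency figures in the comparison only translate into end-to-end gains in proportion to the hit rate. As a rough worked example (a sketch, not part of the original notebook), the helper below blends hit and miss latencies using the numbers recorded in the semantic caching demo. `blended_latency_ms` is a hypothetical name, and the result says nothing about other workloads.

```python
def blended_latency_ms(hit_rate: float, hit_ms: float, miss_ms: float) -> float:
    """Expected per-request latency for a cache with the given hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return hit_rate * hit_ms + (1.0 - hit_rate) * miss_ms

# Latencies taken from the semantic caching demo output in this notebook.
hit_ms = (53 + 76 + 52) / 3   # mean of the three cache hits
miss_ms = (903 + 919) / 2     # mean of the two cache misses

avg = blended_latency_ms(0.60, hit_ms, miss_ms)
reduction = 1.0 - avg / miss_ms
print(f"average latency: {avg:.0f} ms")
print(f"reduction vs. no cache: {reduction:.0%}")
```

Even with hits that are far faster than API calls, the overall reduction stays well below the per-hit figure until the hit rate is high, which is why tracking the hit rate matters as much as the raw cache latency.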

## Best Practices Summary

### Caching Architecture

```
┌─────────────────────────────────────────────────────────────────────┐
│                     Multi-Layer Caching                             │
├─────────────────────────────────────────────────────────────────────┤
│                                                                     │
│  User Query ──► [Semantic Cache] ──► [Response Cache] ──► API       │
│                      │                     │               │        │
│                      │                     │               │        │
│              Similar query?          Exact query?    Prompt Cache   │
│                  ↓ Yes                  ↓ Yes          (automatic)  │
│              Return cached          Return cached                   │
│                                                                     │
└─────────────────────────────────────────────────────────────────────┘
```
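
The flow in the diagram can be sketched as a chain of lookups that falls through to the API on a full miss. This is an illustrative sketch only: `layered_completion`, `semantic_lookup`, `exact_store`, and `fake_api` are hypothetical stand-ins for the `SemanticCache`, the Redis response cache, and the Azure OpenAI client used earlier, and writing new responses back into the caches is left out for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A cache layer is any callable that returns a cached response or None.
Lookup = Callable[[str], Optional[str]]

def layered_completion(
    prompt: str,
    layers: List[Tuple[str, Lookup]],
    call_api: Callable[[str], str],
) -> Tuple[str, str]:
    """Try each cache layer in order; fall back to the API on a full miss.

    Returns (response, source), where source is the layer name or "api".
    """
    for name, lookup in layers:
        cached = lookup(prompt)
        if cached is not None:
            return cached, name
    return call_api(prompt), "api"

# Stand-ins for the real layers (hypothetical, in-memory only).
exact_store: Dict[str, str] = {"What are your hours?": "9am-5pm"}

def semantic_lookup(prompt: str) -> Optional[str]:
    # Placeholder for a semantic cache lookup; matches a single paraphrase.
    return "9am-5pm" if prompt == "When are you open?" else None

def fake_api(prompt: str) -> str:
    # Placeholder for the chat completion call.
    return f"(model answer to: {prompt})"

# Layer order follows the diagram: semantic cache, then exact-match cache.
layers = [("semantic", semantic_lookup), ("exact", exact_store.get)]

for q in ["When are you open?", "What are your hours?", "Do you deliver?"]:
    print(layered_completion(q, layers, fake_api))
```

Because the order is just the order of the list, it is easy to experiment. Putting the exact-match layer first avoids an embedding call whenever a query repeats verbatim.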

### Key Recommendations

| Practice | Recommendation |
|----------|----------------|
| **System prompts** | Keep the shared prefix at least 1,024 tokens long and put static content first, so requests qualify for prompt caching |
| **Cache TTL** | 5-60 minutes for dynamic content, longer for static |
| **Similarity threshold** | Start at 0.9; lower to 0.85 if the hit rate is too low. Tune per embedding model (the demo above needed 0.70) |
| **Cache invalidation** | Implement clear mechanism for content updates |
| **Monitoring** | Track hit rates, latency distribution, cost savings |
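
The TTL, invalidation, and monitoring rows can be illustrated with a small in-memory cache. This is a minimal sketch under stated assumptions, not the notebook's Redis implementation: `TTLCache` and its methods are hypothetical names, keys are assumed to carry a prefix such as `faq:` so that related entries can be invalidated together, and the clock is injectable so that expiry can be exercised without sleeping.

```python
import time
from typing import Callable, Dict, Optional, Tuple

class TTLCache:
    """In-memory exact-match cache with per-entry expiry and invalidation."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested without sleeping
        self._store: Dict[str, Tuple[float, str]] = {}
        self.stats = {"hits": 0, "misses": 0}

    def get(self, key: str) -> Optional[str]:
        entry = self._store.get(key)
        if entry is not None:
            expires_at, value = entry
            if self._clock() < expires_at:
                self.stats["hits"] += 1
                return value
            del self._store[key]  # expired: drop it and count a miss
        self.stats["misses"] += 1
        return None

    def set(self, key: str, value: str) -> None:
        self._store[key] = (self._clock() + self.ttl_seconds, value)

    def invalidate(self, prefix: str = "") -> int:
        """Remove entries whose key starts with prefix (all if empty)."""
        doomed = [k for k in self._store if k.startswith(prefix)]
        for k in doomed:
            del self._store[k]
        return len(doomed)

    def hit_rate(self) -> float:
        total = self.stats["hits"] + self.stats["misses"]
        return self.stats["hits"] / total if total else 0.0

# Usage with a fake clock: a 5-minute TTL, then jump past expiry.
now = [0.0]
cache = TTLCache(ttl_seconds=300, clock=lambda: now[0])
cache.set("faq:hours", "9am-5pm")
print(cache.get("faq:hours"))   # served from cache
now[0] = 301.0
print(cache.get("faq:hours"))   # expired, so this is a miss
print(f"hit rate: {cache.hit_rate():.0%}")
```

With Redis, expiry is usually delegated to the server instead, for example through the `ex` argument of `set` in redis-py, while hit and miss counters still have to be tracked by the application.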

## Wrap-up

### What You Learned

1. **Prompt Caching**: Azure OpenAI automatic caching for repeated prefixes (50-75% cost reduction on cached input tokens)
2. **Response Caching**: Exact-match caching with Redis for FAQ-style applications
3. **Semantic Caching**: Vector similarity caching for natural language variations
4. **Multi-layer strategy**: Combine approaches for optimal performance

### Additional Resources

- [Azure OpenAI Prompt Caching](https://learn.microsoft.com/en-us/azure/ai-services/openai/how-to/prompt-caching)
- [Azure Cache for Redis - Semantic Caching](https://learn.microsoft.com/en-us/azure/redis/tutorial-semantic-cache)
- [Agent Innovator Lab - Caching](https://github.com/Azure/agent-innovator-lab/tree/main/3_optimization-design-ptn/02_caching)

### Next Steps

- **6_cost_analytics.ipynb**: Monitor token usage and analyze costs