# Semantic Caching Lab - Standalone

**Using master-lab resources**

This notebook demonstrates semantic caching in Azure API Management (APIM), using the configuration loaded from `master-lab.env`.

## How it Works

1. **Request arrives** at APIM
2. **APIM creates embedding** of the prompt using text-embedding-3-small
3. **Checks Redis cache** for similar embeddings (>80% match)
4. **If match found**: Returns cached response (~0.1-0.3s)
5. **If no match**: Calls Azure OpenAI, stores in cache (~3-10s)
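
The flow in steps 2-5 can be sketched in plain Python. This is a toy illustration only, not how APIM implements it: `SemanticCache`, `answer`, `fake_embed`, and the three-dimensional vectors are hypothetical stand-ins for the embeddings model and the Redis vector search, and cosine similarity is assumed as the similarity measure.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

class SemanticCache:
    """Hypothetical in-memory stand-in for the APIM + Redis semantic cache."""

    def __init__(self, embed, threshold=0.8):
        self.embed = embed          # function: prompt -> embedding vector
        self.threshold = threshold  # similarity a cached entry must exceed
        self.entries = []           # list of (embedding, response) pairs

    def lookup(self, prompt):
        """Return the most similar cached response, or None on a miss."""
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for embedding, response in self.entries:
            score = cosine_similarity(query, embedding)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score > self.threshold else None

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

def answer(cache, prompt, call_backend):
    """Serve from the cache when possible; otherwise call the backend and store."""
    cached = cache.lookup(prompt)
    if cached is not None:
        return cached, True
    response = call_backend(prompt)
    cache.store(prompt, response)
    return response, False

# Made-up embeddings: the two coffee prompts point in nearly the same direction.
TOY_EMBEDDINGS = {
    "How to brew coffee?": [0.9, 0.1, 0.0],
    "How do I make espresso?": [0.8, 0.3, 0.1],
    "What is the capital of France?": [0.0, 0.1, 0.9],
}

def fake_embed(prompt):
    return TOY_EMBEDDINGS[prompt]

demo_cache = SemanticCache(fake_embed)
for demo_prompt in TOY_EMBEDDINGS:
    _, was_hit = answer(demo_cache, demo_prompt, lambda p: f"backend answer for: {p}")
    print(f"{demo_prompt!r}: {'cache hit' if was_hit else 'backend call'}")
```

In the real service the embedding, the similarity search, and the storage all happen inside the gateway; the client code in this notebook only sees the difference in response time.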

## Expected Results

- First similar request: Slow (~3-10s)
- Subsequent similar requests: **15-100x faster!** (~0.1-0.3s)
- Cache TTL: 20 minutes
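
To see what these figures could mean for the 20-request test in this notebook, the arithmetic below compares total wall-clock time with and without the cache. It is an estimate only: it assumes the midpoints of the ranges above (6.5 s per backend call, 0.2 s per cache hit) and exactly one cache miss, and the names are illustrative.

```python
# Illustrative estimate only: latencies are assumed midpoints of the ranges above.
BACKEND_LATENCY_S = 6.5  # midpoint of ~3-10 s
CACHE_LATENCY_S = 0.2    # midpoint of ~0.1-0.3 s

def total_time(requests: int, misses: int) -> float:
    """Total time when `misses` requests reach the backend and the rest hit the cache."""
    hits = requests - misses
    return misses * BACKEND_LATENCY_S + hits * CACHE_LATENCY_S

uncached = total_time(20, misses=20)  # every request goes to Azure OpenAI
cached = total_time(20, misses=1)     # only the first request misses

print(f"Without cache: {uncached:.1f} s")
print(f"With cache:    {cached:.1f} s")
```

Measured times will differ with network conditions, model load, and how many of the prompts actually clear the similarity threshold.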

---

In [1]:
# Cell 1: Setup and Imports
import os
import sys
from pathlib import Path
from dotenv import load_dotenv
from openai import AzureOpenAI
import time
import random

# Find master-lab.env - check multiple locations
possible_paths = [
    Path('master-lab.env'),  # Same directory as notebook
    Path(__file__).parent / 'master-lab.env' if '__file__' in globals() else None,
    Path.cwd() / 'master-lab.env',  # Current working directory
    Path('/mnt/c/Users/lproux/Documents/GitHub/MCP-servers-internalMSFT-and-external/AI-Gateway/labs/master-lab/master-lab.env'),  # Absolute path
]

env_file = None
for path in possible_paths:
    if path and path.exists():
        env_file = path
        break

if env_file:
    load_dotenv(env_file)
    print(f"✅ Loaded: {env_file}")
else:
    print("❌ master-lab.env not found in any expected location!")
    print("\nTried:")
    for path in possible_paths:
        if path:
            print(f"   - {path}")
    raise FileNotFoundError("Please run this notebook from the master-lab directory")

# Get configuration with validation
apim_gateway_url = os.environ.get('APIM_GATEWAY_URL')
apim_api_key = os.environ.get('APIM_API_KEY')
inference_api_path = os.environ.get('INFERENCE_API_PATH', 'inference')

# Validate required variables
if not apim_gateway_url:
    raise ValueError("APIM_GATEWAY_URL not found in master-lab.env")
if not apim_api_key:
    raise ValueError("APIM_API_KEY not found in master-lab.env")

print(f"\nEndpoint: {apim_gateway_url}/{inference_api_path}")
print(f"API Key: ****{apim_api_key[-4:]}")
print("\n✅ Setup complete - Ready to test semantic caching!")

❌ master-lab.env not found!

Endpoint: None/inference


TypeError: 'NoneType' object is not subscriptable

In [None]:
# Cell 2: Test Semantic Caching with 20 Requests Drawn from 4 Similar Questions

# These questions are semantically similar (>80% match),
# so every request after the first should return a cached response
questions = [
    "How to Brew the Perfect Cup of Coffee?",
    "What are the steps to Craft the Ideal Espresso?",
    "Tell me how to create the best steaming Java?",
    "Explain how to make a caffeinated brewed beverage?"
]

# Initialize OpenAI client
client = AzureOpenAI(
    azure_endpoint=f"{apim_gateway_url}/{inference_api_path}",
    api_key=apim_api_key,
    api_version="2025-03-01-preview"
)

runs = 20
sleep_time_ms = 10  # 10ms between requests
api_runs = []  # Response times

print("=" * 80)
print("🧪 SEMANTIC CACHING TEST")
print("=" * 80)
print(f"\nMaking {runs} requests with similar questions...")
print("Expected: First request slow, subsequent requests FAST\n")

for i in range(runs):
    random_question = random.choice(questions)
    print(f"\n▶️ Run {i+1}/{runs}:")
    print(f"💬  {random_question}")

    start_time = time.perf_counter()  # monotonic clock, better suited to measuring elapsed time
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": "You are a sarcastic, unhelpful assistant."},
                {"role": "user", "content": random_question}
            ]
        )
        response_time = time.perf_counter() - start_time

        status = "🎯 CACHE HIT" if response_time < 1.0 else "🔥 BACKEND CALL"
        print(f"⌚ {response_time:.2f} seconds - {status}")

        # Uncomment to see responses:
        # print(f"💬 {response.choices[0].message.content}\n")

        api_runs.append(response_time)

    except Exception as e:
        print(f"❌ Error: {str(e)[:150]}")
        api_runs.append(None)

    time.sleep(sleep_time_ms / 1000)

# Calculate statistics
valid_runs = [r for r in api_runs if r is not None]

if valid_runs:
    avg_time = sum(valid_runs) / len(valid_runs)
    min_time = min(valid_runs)
    max_time = max(valid_runs)
    cache_hits = sum(1 for r in valid_runs if r < 1.0)

    print(f"\n{'='*80}")
    print("📊 PERFORMANCE SUMMARY")
    print(f"{'='*80}")
    print(f"Total Requests:     {len(api_runs)}")
    print(f"Successful:         {len(valid_runs)}")
    print(f"Average Time:       {avg_time:.2f}s")
    print(f"Fastest Response:   {min_time:.2f}s")
    print(f"Slowest Response:   {max_time:.2f}s")
    print(f"Likely Cache Hits:  {cache_hits}/{len(valid_runs)} ({cache_hits/len(valid_runs)*100:.1f}%)")
    print(f"{'='*80}")

    if max_time > 1.0 and min_time < 1.0:
        speedup = max_time / min_time
        print("\n✅ SEMANTIC CACHING IS WORKING!")
        print(f"   Slowest request: {max_time:.2f}s (backend call)")
        print(f"   Fastest request: {min_time:.2f}s (cache hit)")
        print(f"   Speed improvement: {speedup:.1f}x faster!")
    else:
        print("\n⚠️  Results may vary. Expected: first request slow, subsequent fast.")
else:
    print("\n❌ No successful requests")

print("\n✅ Test complete - See visualization next")

In [None]:
# Cell 3: Visualize Semantic Caching Performance

import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl

if 'api_runs' in globals() and api_runs:
    valid_results = [r for r in api_runs if r is not None]

    if len(valid_results) > 0:
        # Create DataFrame
        mpl.rcParams['figure.figsize'] = [15, 5]
        df = pd.DataFrame(valid_results, columns=['Response Time'])
        df['Run'] = range(1, len(df) + 1)

        # Create bar plot
        ax = df.plot(kind='bar', x='Run', y='Response Time', legend=False, color='steelblue')
        plt.title('Semantic Caching Performance', fontsize=14, fontweight='bold')
        plt.xlabel('Runs', fontsize=12)
        plt.ylabel('Response Time (s)', fontsize=12)
        plt.xticks(rotation=0)

        # Add average line
        average = df['Response Time'].mean()
        plt.axhline(y=average, color='r', linestyle='--', label=f'Average: {average:.2f}s')

        # Add cache hit threshold line
        plt.axhline(y=1.0, color='green', linestyle=':', linewidth=2, label='Cache Hit Threshold (1.0s)')

        plt.legend(loc='upper right')
        plt.grid(axis='y', alpha=0.3)
        plt.tight_layout()
        plt.show()

        print("\n📊 Chart Legend:")
        print("   🔵 Blue bars = Individual response times")
        print("   🔴 Red dashed line = Average response time")
        print("   🟢 Green dotted line = Cache hit threshold (1.0s)")
        print("   Bars below green line = Likely cache hits (fast!)")
        print("\n✅ Visualization complete")
    else:
        print("⚠️  No valid results to visualize")
else:
    print("⚠️  Run Cell 2 first to generate test results")

In [None]:
# Cell 4: View Redis Cache Statistics (Optional)

import redis.asyncio as redis
import pandas as pd
import matplotlib.pyplot as plt

# Get Redis configuration from master-lab.env
redis_host = os.environ.get('REDIS_HOST')
redis_port = int(os.environ.get('REDIS_PORT', 10000))
redis_key = os.environ.get('REDIS_KEY')

async def get_redis_info():
    r = await redis.from_url(
        f"rediss://:{redis_key}@{redis_host}:{redis_port}"
    )

    info = await r.info()

    print("📊 Redis Server Information:")
    print(f"   Used Memory: {info['used_memory_human']}")
    print(f"   Cache Hits: {info['keyspace_hits']}")
    print(f"   Cache Misses: {info['keyspace_misses']}")
    print(f"   Evicted Keys: {info['evicted_keys']}")
    print(f"   Expired Keys: {info['expired_keys']}")

    # Calculate hit rate
    total = info['keyspace_hits'] + info['keyspace_misses']
    if total > 0:
        hit_rate = (info['keyspace_hits'] / total) * 100
        print(f"   Hit Rate: {hit_rate:.1f}%")

    # Create visualization
    redis_info = {
        'Metric': ['Cache Hits', 'Cache Misses', 'Evicted Keys', 'Expired Keys'],
        'Value': [info['keyspace_hits'], info['keyspace_misses'], info['evicted_keys'], info['expired_keys']]
    }

    df_redis_info = pd.DataFrame(redis_info)
    df_redis_info.plot(kind='barh', x='Metric', y='Value', legend=False, color='teal')

    plt.title('Redis Cache Statistics')
    plt.xlabel('Count')
    plt.ylabel('Metric')
    plt.tight_layout()
    plt.show()

    await r.aclose()
    print("\n✅ Redis statistics retrieved successfully")

try:
    await get_redis_info()
except Exception as e:
    print(f"⚠️  Could not connect to Redis: {str(e)[:100]}")
    print("   Make sure Redis is configured in master-lab.env")

# 🎉 Semantic Caching Lab Complete!

## What You Learned

✅ How semantic caching reduces API calls for similar queries  
✅ How to measure caching performance  
✅ How vector embeddings enable semantic similarity matching  

## Key Benefits

💰 **Cost savings**: Reduced Azure OpenAI API calls (up to 90% reduction!)  
⚡ **Performance**: Faster response times (15-100x faster for cached requests)  
📊 **Scalability**: Better handling of repetitive queries  
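
The cost saving tracks the cache hit rate, because only cache misses reach Azure OpenAI. A minimal sketch of that relationship follows; the function name and the example figures are illustrative, not measurements from this lab.

```python
def backend_calls(total_requests: int, hit_rate: float) -> int:
    """Number of requests that still reach the backend for a given cache hit rate."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0 and 1")
    return round(total_requests * (1.0 - hit_rate))

for rate in (0.0, 0.5, 0.9):
    calls = backend_calls(1000, rate)
    print(f"hit rate {rate:.0%}: {calls} backend calls out of 1000")
```

A 90% reduction in backend calls therefore requires a 90% hit rate, which depends on how repetitive the traffic is and on the similarity threshold.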

## Configuration

- **Similarity Threshold**: 0.8 (80% match required)
- **Cache TTL**: 20 minutes (1200 seconds)
- **Embeddings Model**: text-embedding-3-small
- **Cache Storage**: Redis
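
The TTL setting means a cached response is served only while it is younger than 1200 seconds. The sketch below shows that expiry rule in isolation. `TTLEntry` and the `now` values are hypothetical; in the lab, expiry is handled by the gateway and Redis rather than by client code.

```python
from dataclasses import dataclass

CACHE_TTL_SECONDS = 1200  # 20 minutes, as listed above

@dataclass
class TTLEntry:
    """Hypothetical cache entry that records when it was stored."""
    response: str
    stored_at: float  # timestamp in seconds

    def is_fresh(self, now: float, ttl: float = CACHE_TTL_SECONDS) -> bool:
        """True while the entry is younger than the TTL."""
        return (now - self.stored_at) < ttl

entry = TTLEntry(response="cached answer", stored_at=0.0)
print(entry.is_fresh(now=600.0))   # 10 minutes after storing
print(entry.is_fresh(now=1200.0))  # exactly 20 minutes after storing
```

Once an entry expires, the next similar request is treated as a miss and goes to the backend again, so the first request after each 20-minute window is expected to be slow.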

---

**Next Steps**: Integrate semantic caching into your production APIs to reduce costs and improve performance!