# Lab 2: Semantic Caching with Redis

Implement semantic caching to reduce latency and cost for repeated or semantically similar queries.

## Learning Objectives
- Understand semantic caching vs traditional caching
- Configure Redis for LLM response caching
- Measure cache hit/miss performance
- Optimize caching strategies

## Prerequisites
- Azure CLI authenticated (`az login`)
- Resources deployed via main notebook
- Redis Cache instance created
- APIM with semantic caching policy configured

**Duration:** ~15 minutes  
**Difficulty:** Intermediate

---

In [None]:
# Initialize environment
import sys
sys.path.append('..')
from quick_start.shared_init import quick_init, get_azure_openai_client
import time
import json
import os

config = quick_init()

# Get Redis configuration
redis_host = config['env']['REDIS_HOST']
print(f"\n🔗 Redis Host: {redis_host}")

## Configure Redis Connection

Redis is used by APIM's semantic caching policy to store LLM responses.
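To make the difference from traditional exact-match caching concrete, here is a minimal toy sketch of a semantic lookup. It is not how APIM implements its policy: `ToySemanticCache`, `toy_embed`, the five-word vocabulary, and the `0.8` threshold are all hypothetical stand-ins chosen for illustration. A real deployment would use an embedding model and a vector index instead.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def toy_embed(text):
    """Hypothetical stand-in for an embedding model: word counts over a tiny vocabulary."""
    vocab = ["benefits", "azure", "gateway", "redis", "weather"]
    words = text.lower().replace("?", "").split()
    return [float(words.count(w)) for w in vocab]

class ToySemanticCache:
    """Returns a stored response when a new prompt is similar enough to a stored one."""

    def __init__(self, embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold  # assumed convention: higher means more similar
        self.entries = []           # list of (embedding, response) pairs

    def store(self, prompt, response):
        self.entries.append((self.embed(prompt), response))

    def lookup(self, prompt):
        query = self.embed(prompt)
        best_score, best_response = 0.0, None
        for vector, response in self.entries:
            score = cosine_similarity(query, vector)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = ToySemanticCache(toy_embed)
cache.store("What are the key benefits of using Azure AI Gateway?", "cached answer")

# A reworded prompt still matches; an exact-match cache would miss here.
print(cache.lookup("Azure AI Gateway benefits?"))
# An unrelated prompt falls below the threshold and misses.
print(cache.lookup("What is the weather today?"))
```

An exact-match cache keyed on the prompt string would treat the reworded prompt as a new request, which is the gap semantic caching is meant to close.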

In [None]:
# Import the Redis client library, installing it on first use
import subprocess

try:
    import redis
    print("✅ Redis library available")
except ImportError:
    print("⚠️ Installing redis library...")
    subprocess.check_call([sys.executable, "-m", "pip", "install", "redis"])
    import redis
    print("✅ Redis library installed")

# The Redis access key is retrieved from Azure with the Azure CLI below

resource_group = config['resource_group']
redis_name = redis_host.split('.')[0]  # Extract name from hostname

# Get Redis access key
result = subprocess.run(
    ['az', 'redis', 'list-keys', '--name', redis_name, '--resource-group', resource_group, '--query', 'primaryKey', '-o', 'tsv'],
    capture_output=True,
    text=True
)

if result.returncode == 0:
    redis_key = result.stdout.strip()
    print("✅ Retrieved Redis access key")
    
    # Create Redis client
    redis_client = redis.StrictRedis(
        host=redis_host,
        port=6380,
        password=redis_key,
        ssl=True,
        decode_responses=True
    )
    
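    # Test connection; fall back to None so later cells can skip Redis inspection
    try:
        redis_client.ping()
        print("✅ Connected to Redis")
    except redis.RedisError as exc:
        print(f"❌ Could not connect to Redis: {exc}")
        redis_client = None
else:
    print(f"❌ Failed to retrieve Redis key: {result.stderr}")
    redis_client = None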

## Test 1: First Query (Cache Miss)

Send a query and measure the response time. This will be a cache miss.

In [None]:
# Create OpenAI client
client = get_azure_openai_client()

# Define test query
test_query = "What are the key benefits of using Azure AI Gateway?"

print(f"\n🔍 Query: {test_query}")
print("\n⏱️ Sending first request (cache miss expected)...\n")

# Measure elapsed time with a monotonic clock
start_time = time.perf_counter()

response1 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": test_query}
    ],
    max_tokens=150
)

elapsed_time_1 = time.perf_counter() - start_time

print(f"✅ Response received in {elapsed_time_1:.3f} seconds")
print(f"\nResponse:\n{response1.choices[0].message.content}")
print("\n📊 Cache Status: MISS expected (first request)")
print(f"   Response Time: {elapsed_time_1:.3f}s")

## Test 2: Same Query (Cache Hit)

Send the exact same query and compare the response time. This should be a cache hit.

In [None]:
# Wait a moment to ensure the first response has been cached
print("⏳ Waiting 2 seconds for cache to propagate...")
time.sleep(2)

print(f"\n🔍 Query: {test_query}")
print("\n⏱️ Sending second request (cache hit expected)...\n")

# Measure elapsed time with a monotonic clock
start_time = time.perf_counter()

response2 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": test_query}
    ],
    max_tokens=150
)

elapsed_time_2 = time.perf_counter() - start_time

print(f"✅ Response received in {elapsed_time_2:.3f} seconds")
print(f"\nResponse:\n{response2.choices[0].message.content}")

# Compare results
speedup = elapsed_time_1 / elapsed_time_2 if elapsed_time_2 > 0 else 0
time_saved = elapsed_time_1 - elapsed_time_2

print("\n📊 Cache Status: HIT expected (cached response)")
print(f"   Response Time: {elapsed_time_2:.3f}s")
print(f"   Time Saved: {time_saved:.3f}s ({speedup:.1f}x faster)")

# Identical text is a strong sign the second response came from the cache
if response1.choices[0].message.content == response2.choices[0].message.content:
    print("\n✅ Responses match - cache working correctly")
else:
    print("\n⚠️ Responses differ - the second request was likely not served from cache")
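A single timing per test is sensitive to network jitter, so one slow or fast request can distort the comparison. As a hedged sketch, the helper below repeats a call several times and reports the median. `time_calls` and `fake_request` are hypothetical names introduced here; in the lab you would pass a function that wraps `client.chat.completions.create(...)` instead of the sleep-based stand-in.

```python
import statistics
import time

def time_calls(fn, repeats=5):
    """Call fn() `repeats` times and return the elapsed seconds of each call."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return timings

def fake_request():
    """Stand-in for a real API call so the sketch runs without Azure access."""
    time.sleep(0.01)

timings = time_calls(fake_request, repeats=3)
print(f"median: {statistics.median(timings):.3f}s over {len(timings)} calls")
```

The median is used rather than the mean so that one outlier request does not dominate the result.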

## Test 3: Different Query (Cache Miss)

Send a different query to verify cache miss behavior.

In [None]:
# Different query
test_query_2 = "Explain the purpose of API Management in Azure."

print(f"\n🔍 Query: {test_query_2}")
print("\n⏱️ Sending new request (cache miss expected)...\n")

# Measure elapsed time with a monotonic clock
start_time = time.perf_counter()

response3 = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "user", "content": test_query_2}
    ],
    max_tokens=150
)

elapsed_time_3 = time.perf_counter() - start_time

print(f"✅ Response received in {elapsed_time_3:.3f} seconds")
print(f"\nResponse:\n{response3.choices[0].message.content}")
print("\n📊 Cache Status: MISS expected (new query)")
print(f"   Response Time: {elapsed_time_3:.3f}s")

## Verify Cache in Redis (Optional)

Inspect cached keys directly in Redis.

In [None]:
from itertools import islice

if redis_client:
    # Sample up to 10 keys with SCAN, which avoids blocking the server the way KEYS can
    keys = list(islice(redis_client.scan_iter(match='*'), 10))
    
    print(f"\n📦 Cached Keys in Redis ({len(keys)} shown):")
    for i, key in enumerate(keys, 1):
        # TTL is -1 when the key has no expiry and -2 when it no longer exists
        ttl = redis_client.ttl(key)
        ttl_str = f"{ttl}s" if ttl >= 0 else "no expiry"
        
        print(f"   {i}. {key[:80]}... (TTL: {ttl_str})")
    
    # Get cache statistics
    info = redis_client.info('stats')
    print("\n📊 Redis Statistics:")
    print(f"   Total Connections: {info.get('total_connections_received', 'N/A')}")
    print(f"   Total Commands: {info.get('total_commands_processed', 'N/A')}")
    print(f"   Keyspace Hits: {info.get('keyspace_hits', 'N/A')}")
    print(f"   Keyspace Misses: {info.get('keyspace_misses', 'N/A')}")
else:
    print("⚠️ Redis client not available - skip cache inspection")
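The keyspace counters printed above can be combined into a single hit ratio. The sketch below assumes a stats dictionary shaped like the one `redis_client.info('stats')` returns; `keyspace_hit_ratio` and the sample numbers are hypothetical. Whether APIM's semantic lookups are fully reflected in these counters is also an assumption, so treat the ratio as a rough signal, not an exact cache hit rate.

```python
def keyspace_hit_ratio(stats):
    """Return hits / (hits + misses), or None when no lookups have been recorded."""
    hits = int(stats.get("keyspace_hits", 0))
    misses = int(stats.get("keyspace_misses", 0))
    total = hits + misses
    return hits / total if total else None

# Hypothetical sample values; in the lab, pass redis_client.info('stats') instead.
sample_stats = {"keyspace_hits": 30, "keyspace_misses": 10}
ratio = keyspace_hit_ratio(sample_stats)
print(f"Keyspace hit ratio: {ratio:.0%}")
```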

## Performance Summary

Compare all test results.

In [None]:
# Create performance summary table
print("\n" + "="*70)
print("PERFORMANCE SUMMARY")
print("="*70)

print(f"\n{'Test':<40} {'Time (s)':<12} {'Status':<10}")
print("-" * 70)
print(f"{'Test 1: First query (cache miss)':<40} {elapsed_time_1:>8.3f}     {'MISS':<10}")
print(f"{'Test 2: Same query (cache hit)':<40} {elapsed_time_2:>8.3f}     {'HIT':<10}")
print(f"{'Test 3: Different query (cache miss)':<40} {elapsed_time_3:>8.3f}     {'MISS':<10}")
print("-" * 70)

# Calculate metrics
avg_miss_time = (elapsed_time_1 + elapsed_time_3) / 2
cache_speedup = avg_miss_time / elapsed_time_2 if elapsed_time_2 > 0 else 0
time_saved_pct = ((avg_miss_time - elapsed_time_2) / avg_miss_time * 100) if avg_miss_time > 0 else 0

print("\n📊 Key Metrics:")
print(f"   Average cache miss time: {avg_miss_time:.3f}s")
print(f"   Cache hit time: {elapsed_time_2:.3f}s")
print(f"   Cache speedup: {cache_speedup:.1f}x faster")
print(f"   Time saved: {time_saved_pct:.1f}%")

print("\n" + "="*70)

## What You Learned

1. **Semantic caching reduces latency** - Cache hits are significantly faster
2. **Cost optimization** - Cached responses don't consume OpenAI tokens
3. **Semantic matching** - APIM compares an embedding of the prompt against cached entries, so similar prompts can hit the cache, not only identical ones
4. **Redis integration** - APIM policies manage cache storage automatically
5. **Performance gains** - Speedups of roughly 3-10x are typical for cached queries, though results vary with the model and network latency
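How much latency caching saves overall depends on the hit rate, not only on how fast a single hit is. A rough way to reason about it is the weighted average of hit and miss times. The function name and the timings below are hypothetical values for illustration, not measurements from this lab.

```python
def expected_latency(hit_rate, hit_time, miss_time):
    """Average latency per request for a given cache hit rate (0.0 to 1.0)."""
    if not 0.0 <= hit_rate <= 1.0:
        raise ValueError("hit_rate must be between 0.0 and 1.0")
    return hit_rate * hit_time + (1.0 - hit_rate) * miss_time

# Hypothetical timings: 0.2s for a cache hit, 1.5s for a cache miss
for rate in (0.0, 0.3, 0.6, 0.9):
    average = expected_latency(rate, hit_time=0.2, miss_time=1.5)
    print(f"hit rate {rate:.0%}: {average:.2f}s average")
```

Substituting `elapsed_time_2` for the hit time and `avg_miss_time` for the miss time from the summary cell gives an estimate for your own deployment.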

## Best Practices

- **Set appropriate TTL** - Balance freshness vs cache hit rate
- **Monitor cache metrics** - Track hit/miss ratios in Redis
- **Use for stable queries** - FAQs, documentation, common questions
- **Avoid for dynamic content** - Real-time data, personalized responses
- **Configure cache size** - Ensure Redis has sufficient capacity

## Next Steps

- Adjust cache TTL in APIM policy (default: 3600s)
- Configure cache eviction policies in Redis
- Monitor cache hit rates in Azure Monitor
- Implement cache warming for common queries
- Set up cache invalidation strategies

**Next Lab:** `03-message-storing.ipynb` - Store conversations in Cosmos DB