# Anthropic 1-Hour Prompt Caching with ChatBedrockConverse

This notebook demonstrates how to use extended cache TTL (1 hour) with Anthropic models through AWS Bedrock's Converse API using the `ChatBedrockConverse` integration.

## Prerequisites

1. AWS credentials configured
2. Access to Anthropic Claude models in AWS Bedrock
3. `langchain-aws` package installed

In [None]:
# Install required packages if not already installed
# !pip install langchain-aws

In [None]:
from langchain_aws import ChatBedrockConverse
from langchain_core.messages import HumanMessage, SystemMessage, AIMessage
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser

## Step 1: Initialize ChatBedrockConverse with Beta Header

To enable 1-hour prompt caching with Anthropic models, you must include the beta header in `additional_model_request_fields`.

In [None]:
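# Initialize the model with the beta header for extended cache TTL.
# Prompt caching is only available on some Claude models; check the Bedrock
# documentation and substitute a model ID that supports it in your region.
llm = ChatBedrockConverse(
    model="anthropic.claude-3-sonnet-20240229-v1:0",
    region_name="us-west-2",  # Update to your region
    additional_model_request_fields={
        "anthropicBeta": ["extended-cache-ttl-2025-04-11"]
    }
)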

print("✓ ChatBedrockConverse initialized with 1-hour caching support")

## Step 2: Create Cache Points

A cache point marks the end of a cacheable prefix: the content that precedes it in the request is cached for the cache point's TTL.

In [None]:
# Create cache points with different TTLs
cache_1h = ChatBedrockConverse.create_cache_point(ttl="1h")     # 1-hour cache
cache_5m = ChatBedrockConverse.create_cache_point()              # Default 5-minute cache

print("1-hour cache point:", cache_1h)
print("5-minute cache point:", cache_5m)
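
As a rough sketch of what the two `print` calls above are likely to show, a cache point is a small dictionary in the Converse API's content-block format. The helper below is hypothetical (it is not part of `langchain-aws`), and the placement of the `ttl` field is an assumption inferred from the `ttl="1h"` argument used above. Treat the actual output of `create_cache_point` as authoritative.

```python
from typing import Any, Dict, Optional


def sketch_cache_point(ttl: Optional[str] = None) -> Dict[str, Any]:
    """Hypothetical stand-in for ChatBedrockConverse.create_cache_point.

    Assumes a cache point is a {"cachePoint": {...}} content block and that
    a TTL, when given, is stored under a "ttl" key.
    """
    block: Dict[str, Any] = {"type": "default"}
    if ttl is not None:
        block["ttl"] = ttl
    return {"cachePoint": block}


print(sketch_cache_point(ttl="1h"))
print(sketch_cache_point())
```

Because the cache point is an ordinary content item, it can sit in a message's `content` list next to plain strings, which is how the examples that follow use it.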

## Example 1: Basic Usage with Direct Messages

**Important**: 1-hour cache entries must appear before 5-minute cache entries in your message content.

In [None]:
# Create messages with cache points
messages = [
    HumanMessage(content=[
        # Long, stable instructions that don't change often - cache for 1 hour
        """You are an expert AI assistant with extensive knowledge in multiple domains.
        You always provide helpful, accurate, and detailed responses.
        You follow these guidelines:
        1. Be concise but thorough
        2. Use examples when helpful
        3. Admit when you don't know something
        4. Provide sources when possible""",
        cache_1h,  # Cache the above content for 1 hour
        
        # Context that might change more frequently - cache for 5 minutes
        "Today's date is January 17, 2025. The user is located in Seattle.",
        cache_5m,  # Cache this for 5 minutes
        
        # The actual user question (not cached)
        "What's the weather typically like in Seattle in January?"
    ])
]

response = llm.invoke(messages)
print("Response:", response.content[:300] + "...")

# Check cache usage
if hasattr(response, 'usage_metadata') and response.usage_metadata:
    print(f"\nToken usage:")
    print(f"  Input tokens: {response.usage_metadata.get('input_tokens', 0)}")
    print(f"  Output tokens: {response.usage_metadata.get('output_tokens', 0)}")
    
    if 'input_token_details' in response.usage_metadata:
        details = response.usage_metadata['input_token_details']
        print(f"  Cache read tokens: {details.get('cache_read', 0)}")
        print(f"  Cache creation tokens: {details.get('cache_creation', 0)}")
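
The usage check above can be folded into a small helper that labels each call. This is a minimal sketch that assumes the same `input_token_details` keys read above (`cache_read`, `cache_creation`); the function name and the sample dictionaries are hypothetical and are not real API output.

```python
from typing import Any, Mapping


def describe_cache_usage(usage_metadata: Mapping[str, Any]) -> str:
    """Classify a call as a cache hit, a cache write, both, or neither."""
    details = usage_metadata.get("input_token_details") or {}
    read = details.get("cache_read", 0) or 0
    created = details.get("cache_creation", 0) or 0
    if read and created:
        return f"partial hit: {read} tokens read, {created} tokens written"
    if read:
        return f"hit: {read} tokens read from cache"
    if created:
        return f"write: {created} tokens written to cache"
    return "no cache activity"


# Hypothetical usage dictionaries for a first and a repeated call.
first_call = {
    "input_tokens": 1500,
    "output_tokens": 200,
    "input_token_details": {"cache_read": 0, "cache_creation": 1400},
}
second_call = {
    "input_tokens": 1500,
    "output_tokens": 180,
    "input_token_details": {"cache_read": 1400, "cache_creation": 0},
}

print(describe_cache_usage(first_call))
print(describe_cache_usage(second_call))
```

If every call reports "no cache activity", one possible cause is that the cached prefix is shorter than the model's minimum cacheable length; short prompts like the ones in this notebook may need to be longer before a cache entry is created.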

## Example 2: Using ChatPromptTemplate with Caching

This shows how to integrate prompt caching with LangChain's `ChatPromptTemplate`.

In [None]:
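# Create a template with a cached system message.
# Message objects such as SystemMessage are passed through unformatted, so any
# message that contains {variables} must use the ("role", content) tuple form.
# Mixing cache-point dicts into a templated content list assumes a
# langchain-core version that accepts arbitrary dict blocks in templates.
template = ChatPromptTemplate.from_messages([
    SystemMessage(content=[
        """You are a helpful coding assistant specialized in Python.
        You provide clear explanations with practical examples.
        Always follow PEP 8 style guidelines in your code examples.""",
        cache_1h  # Cache system prompt for 1 hour
    ]),
    ("human", [
        "Current context: User is learning {topic}",
        cache_5m,  # Cache context for 5 minutes
        "{question}"  # User's question is not cached
    ])
])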

# Create a chain
chain = template | llm | StrOutputParser()

# First invocation (creates cache)
response1 = chain.invoke({
    "topic": "async programming",
    "question": "How do I use async/await in Python?"
})

print("First response (cache created):")
print(response1[:300] + "...")

In [None]:
# Second invocation (uses cache)
response2 = chain.invoke({
    "topic": "async programming",  # Same topic, will use 5m cache
    "question": "What's the difference between asyncio.run() and asyncio.create_task()?"
})

print("Second response (using cache):")
print(response2[:300] + "...")

## Example 3: Structured Output with Caching

In [None]:
from pydantic import BaseModel, Field
from typing import List

# Define output schema
class CodeAnalysis(BaseModel):
    """Analysis of provided code."""
    language: str = Field(description="Programming language")
    issues: List[str] = Field(description="List of potential issues")
    improvements: List[str] = Field(description="Suggested improvements")
    complexity: str = Field(description="Overall complexity: low, medium, or high")

# Create template with caching
code_template = ChatPromptTemplate.from_messages([
    SystemMessage(content=[
        """You are a code review expert. Analyze the provided code and identify:
        1. The programming language
        2. Potential issues or bugs
        3. Possible improvements
        4. Overall complexity assessment
        
        Be thorough but concise in your analysis.""",
        cache_1h
    ]),
    ("human", "Please analyze this code:\n\n{code}")
])

# Create structured output chain
structured_llm = llm.with_structured_output(CodeAnalysis)
code_chain = code_template | structured_llm

# Example code to analyze
code_sample = """
def fibonacci(n):
    if n <= 1:
        return n
    return fibonacci(n-1) + fibonacci(n-2)
"""

# Get structured analysis
analysis = code_chain.invoke({"code": code_sample})

print(f"Language: {analysis.language}")
print(f"Issues: {', '.join(analysis.issues)}")
print(f"Improvements: {', '.join(analysis.improvements)}")
print(f"Complexity: {analysis.complexity}")

## Example 4: Complex Conversation with Multiple Cache Durations

In [None]:
# Complex template with multiple cached sections
# Remember: 1-hour cache must come before 5-minute cache
# The system message uses the ("system", content) tuple form so that
# {sprint_focus} is substituted; a SystemMessage object would be sent as-is.
advanced_template = ChatPromptTemplate.from_messages([
    ("system", [
        # Long-term stable instructions (1 hour cache)
        """You are an AI assistant for a software development team.
        Core responsibilities:
        - Code review and optimization
        - Architecture design
        - Best practices guidance
        - Performance analysis
        
        Always consider security, scalability, and maintainability.""",
        cache_1h,
        
        # Medium-term context (5 minute cache)
        "Current sprint focus: {sprint_focus}",
        cache_5m
    ]),
    MessagesPlaceholder(variable_name="history", optional=True),
    ("human", "{question}")
])

advanced_chain = advanced_template | llm | StrOutputParser()

# First call
response1 = advanced_chain.invoke({
    "sprint_focus": "API optimization and database query performance",
    "history": [],
    "question": "What are the best practices for optimizing PostgreSQL queries?"
})

print("Advanced example - First response:")
print(response1[:300] + "...")

# Second call with history
response2 = advanced_chain.invoke({
    "sprint_focus": "API optimization and database query performance",
    "history": [
        HumanMessage("What are the best practices for optimizing PostgreSQL queries?"),
        AIMessage(response1)
    ],
    "question": "How do I implement these optimizations in SQLAlchemy?"
})

print("\nAdvanced example - Second response (using cache):")
print(response2[:300] + "...")

## Testing Cache Effectiveness

In [None]:
# Create a consistent template for testing cache effectiveness
test_template = ChatPromptTemplate.from_messages([
    SystemMessage(content=[
        """You are an expert data scientist with deep knowledge of machine learning algorithms.
        Provide detailed explanations with mathematical foundations when appropriate.
        Always include practical implementation tips.""",
        cache_1h
    ]),
    ("human", "{question}")
])

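# Keep the model's AIMessage (no StrOutputParser) so usage metadata stays available
test_chain = test_template | llm

# Multiple questions using the same cached system prompt
questions = [
    "Explain gradient descent in simple terms.",
    "What is the difference between supervised and unsupervised learning?",
    "How does a neural network learn?"
]

responses = []
for i, question in enumerate(questions, 1):
    print(f"\nQuestion {i}: {question}")
    response = test_chain.invoke({"question": question})
    responses.append(response)
    print(f"Response {i}: {response.content[:150]}...")

    # Report cache activity: the first call should write the cache and later
    # calls should read it, provided the prompt is long enough to be cached
    details = (response.usage_metadata or {}).get("input_token_details", {})
    print(f"  Cache read tokens: {details.get('cache_read', 0)}")
    print(f"  Cache creation tokens: {details.get('cache_creation', 0)}")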

## Key Takeaways

### Setup Requirements
1. **Beta Header**: Include `{"anthropicBeta": ["extended-cache-ttl-2025-04-11"]}` in `additional_model_request_fields`
2. **Anthropic Models**: Only works with Anthropic Claude models through AWS Bedrock

### Cache Point Creation
- `ChatBedrockConverse.create_cache_point(ttl="1h")` for 1-hour caching
- `ChatBedrockConverse.create_cache_point()` for default 5-minute caching

### Message Format
Add cache points as list items in message content:
```python
SystemMessage(content=["Your prompt text", cache_point])
```

### Important Rules
1. **Ordering**: 1-hour cache entries must appear before 5-minute cache entries
2. **Best Practices**:
   - Cache stable, reusable content (system prompts, instructions)
   - Use 1-hour cache for content that rarely changes
   - Use 5-minute cache for session-specific context
   - Don't cache user-specific or frequently changing content
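
The ordering rule can be checked locally before a request is sent. The sketch below is illustrative only: the function names are hypothetical, and it assumes cache points are `{"cachePoint": {...}}` dictionaries in which a missing `ttl` key means the 5-minute default.

```python
from typing import Any, List, Sequence

# Assumption: a cache point without a "ttl" key uses the 5-minute default.
DEFAULT_TTL = "5m"


def cache_point_ttls(content: Sequence[Any]) -> List[str]:
    """Return the TTL of each cache point in a content list, in order."""
    ttls = []
    for item in content:
        if isinstance(item, dict) and "cachePoint" in item:
            ttls.append(item["cachePoint"].get("ttl", DEFAULT_TTL))
    return ttls


def is_valid_ttl_order(content: Sequence[Any]) -> bool:
    """Return True if no 1-hour cache point appears after a shorter one."""
    seen_short = False
    for ttl in cache_point_ttls(content):
        if ttl == "1h" and seen_short:
            return False
        if ttl != "1h":
            seen_short = True
    return True


one_hour = {"cachePoint": {"type": "default", "ttl": "1h"}}
five_min = {"cachePoint": {"type": "default"}}

good = ["stable instructions", one_hour, "session context", five_min]
bad = ["session context", five_min, "stable instructions", one_hour]

print(is_valid_ttl_order(good))  # True
print(is_valid_ttl_order(bad))   # False
```

A check like this only covers a single content list; when cache points are spread across several messages, the same rule would need to be applied to the concatenated content in request order.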

### Integration
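- Works with `ChatPromptTemplate`; messages that contain template variables must use the `("role", content)` tuple form, because message objects such as `SystemMessage` are passed through unformatted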
- Compatible with structured output
- Supports conversation history
- Can be used in LangChain chains

### Monitoring
- Check cache usage through `response.usage_metadata`
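- Monitor the `cache_read` and `cache_creation` counts under `usage_metadata["input_token_details"]`
- Calls that reuse a cached prefix within its TTL cost less and return faster; the first call pays for the cache write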
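
To make the cost trade-off concrete, the following back-of-the-envelope sketch compares input cost with and without caching. The two price multipliers are assumed, illustrative values rather than quoted Bedrock prices, and the function name is hypothetical; substitute the current pricing for your model and region.

```python
def relative_input_cost(
    calls: int,
    cached_fraction: float,
    write_multiplier: float = 2.0,  # assumed cost of a 1-hour cache write vs. normal input
    read_multiplier: float = 0.1,   # assumed cost of a cache read vs. normal input
) -> float:
    """Input cost with caching divided by input cost without caching.

    Assumes every call sends the same prompt, the first call writes the
    cache, and every later call hits it before the TTL expires.
    """
    if calls < 1:
        raise ValueError("calls must be at least 1")
    if not 0.0 <= cached_fraction <= 1.0:
        raise ValueError("cached_fraction must be between 0 and 1")
    uncached = 1.0 - cached_fraction
    first_call = uncached + cached_fraction * write_multiplier
    later_call = uncached + cached_fraction * read_multiplier
    with_cache = first_call + (calls - 1) * later_call
    return with_cache / float(calls)


# Ratio below 1.0 means caching is cheaper than sending the prompt uncached.
for n in (1, 2, 10):
    print(n, round(relative_input_cost(n, cached_fraction=0.9), 3))
```

Under these assumptions a single call is more expensive with caching than without, because the write is billed at a premium, and the ratio drops below 1.0 only once the prefix is reused. That is why the 1-hour TTL suits prompts that are called repeatedly over a longer window.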