# Caching Responses and Cost-efficient Embedding Storage with Amazon Bedrock

## Objective Three: Implement Response Caching to Reduce Costs

In this hands-on walkthrough, you’ll use the boto3 Python SDK to interact with Amazon Bedrock. You’ll install the required dependencies and set up the Bedrock client, then craft a prompt that is long enough for cache checkpoints to be created. Finally, you’ll invoke the model twice with that client to confirm that a checkpoint was written and that the second call produces a cache hit, which reduces inference costs.

### 1. Installing Packages for Amazon Bedrock Access

This step upgrades two AWS Python packages so that you have the latest features, bug fixes, and security updates. The `-q` flag keeps pip’s output quiet:

- **botocore**: This is a low-level foundational library used by AWS SDKs for Python to interact with AWS services.
- **boto3**: This is the official AWS SDK for Python, built on top of botocore.

While upgrading, you might see some pip dependency warnings. These can be safely ignored, as they won’t affect the steps in this walkthrough.

In [None]:
%pip install --upgrade -q botocore
%pip install --upgrade -q boto3
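
To confirm that the upgrade took effect, you can print the installed versions. The following is a minimal sketch that uses only the standard library; the helper name `installed_versions` is hypothetical and not part of boto3.

```python
from importlib import metadata


def installed_versions(packages=("boto3", "botocore")):
    """Return {package: version string, or None if the package is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


for name, version in installed_versions().items():
    print(f"{name}: {version or 'not installed'}")
```

If either package shows as not installed, re-run the install cell before continuing.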

### 2. Load Libraries and Restart Kernel for Updates to Take Effect

In this step, you import boto3 to interact with AWS services programmatically, and json to format request and response data. Since boto3 is built on top of botocore, it automatically relies on botocore under the hood when making service calls.

In [None]:
try:
    import boto3
    import json
    print("----------------------------")
    print("✅ Libraries loaded successfully.")
except ImportError as e:
    print("----------------------------")
    print("❌ Failed to load libraries.")
    print(f"Error: {e}")

The following code restarts the Jupyter notebook kernel to clear the current environment. This is useful after installing or upgrading packages to ensure that all changes take effect. Because a restart clears all imported modules and variables, re-run any imports you need afterwards.

In [None]:
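from IPython.display import HTML, display

# Ask the classic Jupyter Notebook front end to restart the kernel.
# display() only sends the JavaScript to the browser, so Python cannot
# confirm that the restart actually happened.
try:
    display(HTML("<script>Jupyter.notebook.kernel.restart()</script>"))
    print("✅ Kernel restart requested.")
except Exception as e:
    print("❌ Failed to request a kernel restart.")
    print(f"Error: {e}")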

### 3. Initialize the Bedrock Client

In this step, you create a Bedrock runtime client using boto3. You specify the service name (bedrock-runtime) and an AWS region that supports it, and the client then lets you invoke models through Amazon Bedrock.

In [None]:
import boto3
import json  # used later to serialize the request; the kernel restart cleared the earlier import

print("----------------------------")
try:
    client = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    print("✅ bedrock-runtime client initialized successfully.")
except Exception as e:
    print("❌ Failed to initialize bedrock-runtime client.")
    print(f"Error: {e}")

### 4. Load Privacy Statement Document from File

In this step, a sample privacy statement is loaded from a local file and stored in a variable. This content is used throughout the exercise to simulate user queries and evaluate how the model responds to questions based on the document.

In [None]:
def read_privacy_statement(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        # Report success only after the read has completed.
        print("✅ Privacy statement file loaded successfully.")
        return content
    except FileNotFoundError:
        print("❌ File not found. Please check the file path.")
    except Exception as e:
        print("❌ Failed to read privacy statement file.")
        print(f"Error: {e}")
    return None  # return None if read fails

privacy_text = read_privacy_statement('privacy_statement.txt')

### 5. Load Instructions from File

To ensure a cache checkpoint is created, the prompt content up to the checkpoint must meet the 1,024-token minimum that applies to the Claude 3.7 Sonnet model used here. Only prompt tokens count toward this minimum; the response does not. To help reach that length, we’ll load clear and detailed instructions from an instructions.txt file, which the next steps use to build the messages block.

In [None]:
def read_instructions(file_path):
    try:
        with open(file_path, 'r', encoding='utf-8') as file:
            content = file.read()
        # Report success only after the read has completed.
        print("✅ Instructions file loaded successfully.")
        return content
    except FileNotFoundError:
        print("❌ File not found. Please check the file path.")
    except Exception as e:
        print("❌ Failed to read instructions file.")
        print(f"Error: {e}")
    return None  # return None if read fails

instructions_text = read_instructions('instructions.txt')
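
Before building the request, it can help to estimate whether the loaded text is long enough to be cached. The sketch below assumes a rough heuristic of about four characters per token; this is not the model’s tokenizer, so treat the result as an approximation. The function names and the sample strings are hypothetical stand-ins for `instructions_text` and `privacy_text`.

```python
def rough_token_estimate(text, chars_per_token=4):
    """Very rough token estimate. chars_per_token=4 is an assumed heuristic, not a tokenizer."""
    return len(text) // chars_per_token


def prefix_estimates(blocks, chars_per_token=4):
    """Return the running (cumulative) token estimate after each text block."""
    totals = []
    running = 0
    for block in blocks:
        running += rough_token_estimate(block, chars_per_token)
        totals.append(running)
    return totals


# Stand-in strings; in the notebook you would pass [instructions_text, privacy_text].
sample_blocks = ["a" * 2000, "b" * 6000]
for position, total in enumerate(prefix_estimates(sample_blocks), start=1):
    status = "likely long enough" if total >= 1024 else "likely too short"
    print(f"after block {position}: ~{total} tokens ({status})")
```

If the estimate falls well below 1,024 tokens, the request should still succeed, but you may see both cache counters reported as zero.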

### 6. Define the Request Body for the Bedrock API

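In this step, we define a user inquiry and build the request body for the Claude 3.7 Sonnet model. The body includes instructions from instructions.txt, the privacy statement from privacy_statement.txt, and the user’s question. The two large text blocks each carry a cache_control marker of type ephemeral, which requests a cache checkpoint at the end of that block. The anthropic_version field selects the Bedrock version of the Anthropic Messages API schema, and max_tokens caps the number of tokens the model may generate.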

In [None]:
user_inquiry = "How do you use my information?"

messages_API_body = {
    "anthropic_version": "bedrock-2023-05-31",
    "max_tokens": 4096,
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": instructions_text,
                    "cache_control": {
                        "type": "ephemeral"
                    }
                },
                {
                    "type": "text",
                    "text": privacy_text,
                    "cache_control": {
                        "type": "ephemeral"
                    }
                },
                {
                    "type": "text",
                    "text": user_inquiry
                },
            ]
        }
    ]
}
print("✅ Message payload constructed successfully.")
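
Because the file-loading helpers return None when a read fails, a quick check of the payload before sending it can catch an empty text block early. The following is a minimal sketch; `find_empty_text_blocks`, `count_cache_markers`, and `sample_body` are hypothetical names used only for illustration.

```python
def find_empty_text_blocks(body):
    """Return (message index, block index) for text blocks whose text is missing or blank."""
    problems = []
    for m_idx, message in enumerate(body.get("messages", [])):
        for b_idx, block in enumerate(message.get("content", [])):
            if block.get("type") != "text":
                continue
            text = block.get("text")
            if not isinstance(text, str) or not text.strip():
                problems.append((m_idx, b_idx))
    return problems


def count_cache_markers(body):
    """Count the content blocks that carry a cache_control marker."""
    return sum(
        1
        for message in body.get("messages", [])
        for block in message.get("content", [])
        if "cache_control" in block
    )


# Stand-in body: the second block mimics a failed file read (text is None).
sample_body = {
    "messages": [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Some instructions", "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": None, "cache_control": {"type": "ephemeral"}},
                {"type": "text", "text": "How do you use my information?"},
            ],
        }
    ]
}

print("Empty text blocks at:", find_empty_text_blocks(sample_body))
print("Cache markers:", count_cache_markers(sample_body))
```

In the notebook, you would pass `messages_API_body` instead of `sample_body` and fix the file paths if any block is reported as empty.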

### 7. Invoke the Model Using the Bedrock Client and Display Token Usage and Response

This step sends a request to Amazon Bedrock to invoke the Claude 3.7 Sonnet model with the request body and required headers. The model is invoked twice with the same body: the first call writes the marked prompt prefix to the cache, and the second call reads that prefix back, resulting in a cache hit (provided the cached entry has not expired in between). Note that prompt caching stores the processed prompt, not the response. The model still generates a new answer on the second call, but the cached input tokens are billed at a reduced rate, which lowers cost. You can see the difference by comparing the cache_creation_input_tokens and cache_read_input_tokens values printed after each call: expect a nonzero creation count on the first call and a nonzero read count on the second.

In [None]:
def invoke_and_print(client, body, model_id, label):
    try:
        response = client.invoke_model(
            body=json.dumps(body),
            modelId=model_id,
            accept="application/json",
            contentType="application/json"
        )
        print(f"✅ {label} - Model invoked successfully.")

        response_body = json.loads(response["body"].read())
        text = response_body["content"][0]["text"]
        # Default to 0 in case the cache counters are missing from the usage block.
        usage = response_body.get("usage", {})
        creation_tokens = usage.get("cache_creation_input_tokens", 0)
        read_tokens = usage.get("cache_read_input_tokens", 0)

        print(text)
        print("-------------------------------------------------")
        print("cache_creation_input_tokens:", creation_tokens)
        print("cache_read_input_tokens:", read_tokens)
        print("-------------------------------------------------")
    except Exception as e:
        print(f"❌ {label} - Failed to invoke model.")
        print(f"Error: {e}")

invoke_and_print(client, messages_API_body, "us.anthropic.claude-3-7-sonnet-20250219-v1:0", "First Execution")
invoke_and_print(client, messages_API_body, "us.anthropic.claude-3-7-sonnet-20250219-v1:0", "Second Execution")
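
To put the two usage reports side by side, you can estimate the billed input cost of each call relative to sending the same tokens without caching. The sketch below is illustrative only: the multipliers (1.25 for cache writes, 0.10 for cache reads) are assumptions that you should check against current Amazon Bedrock pricing, and the usage dictionaries are hypothetical values, not real output from this notebook.

```python
def relative_input_cost(usage, write_multiplier=1.25, read_multiplier=0.10):
    """Billed input cost divided by the cost of the same tokens with no caching.

    The multipliers are assumed values for illustration, not confirmed pricing.
    """
    uncached = usage.get("input_tokens", 0)
    written = usage.get("cache_creation_input_tokens", 0)
    read = usage.get("cache_read_input_tokens", 0)
    total = uncached + written + read
    if total == 0:
        return 0.0
    billed = uncached + written * write_multiplier + read * read_multiplier
    return billed / total


# Hypothetical usage blocks for a first (cache write) and second (cache read) call.
first_call = {"input_tokens": 10, "cache_creation_input_tokens": 2000, "cache_read_input_tokens": 0}
second_call = {"input_tokens": 10, "cache_creation_input_tokens": 0, "cache_read_input_tokens": 2000}

for label, usage in (("First Execution", first_call), ("Second Execution", second_call)):
    print(f"{label}: relative input cost = {relative_input_cost(usage):.3f}")
```

Under these assumptions, a value above 1 means the call cost more than an uncached call (the one-time price of writing the cache), and a value well below 1 reflects the savings from a cache hit. Output tokens are billed separately and are not included here.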