# Caching Responses and Cost-efficient Embedding Storage with Amazon Bedrock

## Objective Two: Implement Response Caching to Reduce Costs

In this hands-on walkthrough, you’ll use the AWS SDK for Python (`boto3`) to interact with `Amazon Bedrock`. You’ll install the required dependencies and set up the `Bedrock` client, then craft a prompt long enough for a cache checkpoint to be created. Finally, you’ll invoke a model with that client and check whether a cache checkpoint was written and whether a later call gets a cache hit, which reduces inference costs.

### 1. Prepare the Environment

This step includes the code to install the required `Python` packages needed for the rest of the exercise and restart the kernel to ensure the packages are properly loaded. While running, you might see some `pip` dependency warnings. These can be safely ignored as they won’t impact the steps we’re performing here.

In [None]:
print("✅ Please wait while the installation completes. This may take a few "
      "minutes. If you encounter any dependency errors, you can ignore them.")
%pip install --upgrade -q botocore
%pip install --upgrade -q boto3
print("✅ Installation completed!")
print("✅ Restarting kernel")

from IPython.core.display import HTML
from IPython.display import display

try:
    display(HTML("<script>Jupyter.notebook.kernel.restart()</script>"))
    print("✅ Kernel restart requested")
except Exception as e:
    print("❌ Failed to restart the kernel")
    print(f"Error: {e}")

In this step, you import `boto3` to interact with AWS services programmatically, `json` to handle `JSON` data formatting for requests and responses, and `time` to measure response latency. You also create a `Bedrock` runtime client using `boto3`, which allows you to invoke models on the `Amazon Bedrock` service.

In [None]:
try:
    import boto3
    import json
    import time
    print("----------------------------")
    print("✅ Libraries loaded successfully.")
except ImportError as e:
    print("----------------------------")
    print("❌ Failed to load libraries.")
    print(f"Error: {e}")
try:
    client = boto3.client(
        service_name="bedrock-runtime",
        region_name="us-east-1"
    )
    print("✅ bedrock-runtime client initialized successfully.")
except Exception as e:
    print("❌ Failed to initialize bedrock-runtime client.")
    print(f"Error: {e}")

### 2. Load Instructions and Privacy Statement Files

In this step, the contents of two local text files — `instructions.txt` and `privacy_statement.txt` — are loaded, normalized and stored in variables. This content will be used throughout the exercise to simulate user queries and test how the model responds to questions about the document.

In [None]:
# Function to read file content and normalize newlines
def read_file_content(filepath):
    try:
        with open(filepath, 'r', encoding='utf-8') as f:
            content = f.read().replace('\r\n', '\n').replace('\r', '\n').strip()
            print("✅ File loaded successfully.")
            return content
    except FileNotFoundError as e:
        print("❌ File not found. Please check the file path.")
        print(f"Error: {e}")
        return None
    except Exception as e:
        print("❌ Failed to read file.")
        print(f"Error: {e}")
        return None

# File paths
instructions_file = "instructions.txt"
privacy_statement_file = "privacy_statement.txt"

# Load content
instructions_content = read_file_content(instructions_file)
privacy_statement_content = read_file_content(privacy_statement_file)
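
Because `read_file_content` returns `None` on failure, a missing file would otherwise be interpolated into the prompt as the literal text `None`. The following is a minimal sketch of a guard that fails early instead. The helper name `require_loaded` is hypothetical and not part of the original notebook.

```python
def require_loaded(**named_contents):
    """Raise if any named content is None or empty; return the names checked.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    missing = [name for name, value in named_contents.items() if not value]
    if missing:
        raise ValueError(f"Content not loaded: {', '.join(missing)}")
    return list(named_contents)


# Example with stand-in strings; in the notebook you would pass
# instructions_content and privacy_statement_content instead.
checked = require_loaded(
    instructions="Be concise.",
    privacy_statement="We store data.",
)
print(f"All content loaded: {checked}")
```

Calling it right after the two `read_file_content` calls would stop the notebook before a prompt is built from incomplete content.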

### 3. Construct Prompt for Model Invocation

In this step, we build the static part of the prompt; the user inquiry is defined later, at invocation time. For a cache checkpoint to be created, the content that precedes the cache point must reach the model's minimum token count, which is 1,000 tokens for `Amazon Nova Lite`. To reach that minimum, we loaded clear and detailed instructions from `instructions.txt`, along with the contents of `privacy_statement.txt`.

In [None]:
# Define the static part of the prompt
static_prompt_content = (
    f"--- BEGIN INSTRUCTIONS ---\n{instructions_content}\n--- END INSTRUCTIONS ---\n\n"
    f"--- BEGIN PRIVACY STATEMENT ---\n{privacy_statement_content}\n--- END PRIVACY STATEMENT ---\n\n"
    "Based on the provided information, please answer the following question:"
).strip()
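
Before invoking the model, it can help to sanity-check whether the static content is likely long enough for a cache checkpoint. The sketch below uses a rough heuristic of about four characters per token. This ratio is an assumption for illustration, not the tokenizer used by `Amazon Nova Lite`, and the helper names are hypothetical.

```python
import math

MIN_CACHE_TOKENS = 1000  # minimum stated in this walkthrough for Amazon Nova Lite


def estimate_tokens(text, chars_per_token=4.0):
    """Rough token estimate.

    chars_per_token is an assumed heuristic, not the model's real tokenizer.
    """
    return math.ceil(len(text) / chars_per_token)


def likely_cacheable(text, min_tokens=MIN_CACHE_TOKENS):
    """Return True if the estimated token count reaches the assumed minimum."""
    return estimate_tokens(text) >= min_tokens


# Stand-in text; in the notebook you would pass static_prompt_content.
sample = "word " * 1200
print(f"Estimated tokens: {estimate_tokens(sample)}")
print(f"Likely cacheable: {likely_cacheable(sample)}")
```

The estimate is only a pre-check. The token counts reported in the `usage` field of the first invocation are the authoritative numbers.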

### 4. Define Function to Invoke a Bedrock Model with Caching and Display Usage Metrics

In this step, a function is defined to send a crafted prompt to the `Amazon Nova Lite` model on `Amazon Bedrock`, combining static and dynamic content with a cache point marker. It records the model's response time, prints usage metrics from the response, and displays the generated output.

In [None]:
model_id = "amazon.nova-lite-v1:0"

def invoke_model_and_print_all_metrics(static_content, dynamic_content, invocation_number=1):
    print(f"\n--- Invoking Bedrock Model: {model_id} (Invocation {invocation_number}) ---")
    
    messages_content = [
        {"text": static_content},
        {"cachePoint": {"type": "default"}}, # Always include cachePoint for caching behavior
        {"text": dynamic_content}
    ]

    messages = [
        {
            "role": "user",
            "content": messages_content
        }
    ]
    
    body = {
        "messages": messages,
        "inferenceConfig": {
            "maxTokens": 2048,
            "temperature": 0.1,
            "topP": 0.9
        }
    }

    try:
        start_time = time.time()
        response = client.converse(
            modelId=model_id,
            messages=body["messages"],
            inferenceConfig=body["inferenceConfig"]
        )
        end_time = time.time()
        latency = (end_time - start_time) * 1000 # in milliseconds

        usage = response.get("usage", {})
        
        print(f"Response Latency: {latency:.2f} ms")
        print("--- Metrics from Usage Field ---")
        if usage:
            for metric_name, metric_value in usage.items():
                print(f"- {metric_name}: {metric_value}")
        else:
            print("No usage metrics found in the response.")
        print("------------------------------------\n")

        # Print model response snippet
        model_response_text = ""
        if 'output' in response and 'message' in response['output'] and 'content' in response['output']['message']:
            for content_block in response['output']['message']['content']:
                if 'text' in content_block:
                    model_response_text += content_block['text']
        print(f"Model response: {model_response_text.strip()[:200]}...")

    except Exception as e:
        print(f"Error invoking model (Invocation {invocation_number}): {e}")
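
The function above prints every entry in `usage`, which leaves the interpretation to the reader. A small sketch of how those metrics could be classified is shown here. The helper name `classify_cache_usage` and the sample values are hypothetical; only the `cacheReadInputTokens` and `cacheWriteInputTokens` keys come from the walkthrough.

```python
def classify_cache_usage(usage):
    """Label a Converse `usage` dict as 'hit', 'write', or 'none'.

    Missing or None values are treated as 0, since the cache fields
    may be absent when caching does not apply.
    """
    read = usage.get("cacheReadInputTokens", 0) or 0
    written = usage.get("cacheWriteInputTokens", 0) or 0
    if read > 0:
        return "hit"    # at least part of the prompt was read from cache
    if written > 0:
        return "write"  # a cache checkpoint was created on this call
    return "none"       # caching did not apply (for example, prompt too short)


# Illustrative values only, not real model output.
first_call = {"inputTokens": 20, "outputTokens": 150, "cacheWriteInputTokens": 1400}
second_call = {"inputTokens": 22, "outputTokens": 140, "cacheReadInputTokens": 1400}
print(classify_cache_usage(first_call))
print(classify_cache_usage(second_call))
```

A result of `none` on the first call usually suggests the static content did not reach the minimum token count for a checkpoint.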

### 5. Invoke the Model and Display Token Usage and Response

This step invokes the model twice. The first call writes the static portion of the prompt (everything before the cache point) to the cache. The second call, which asks a differently worded question, reads that portion from the cache, resulting in a cache hit. The model still runs inference and generates a new response on the second call, but the cached input tokens are billed at a reduced rate and do not need to be reprocessed, which lowers cost and can lower latency. You can see the difference by comparing the `cacheReadInputTokens` and `cacheWriteInputTokens` values printed after each call, along with the model’s responses. The cache TTL is 5 minutes, so if you wait longer than that between runs, the next call writes to the cache again instead of reading from it.

In [None]:
# --- Execute the two model invocations ---
# First invocation: This will attempt to write the static_content to cache.
# Define the dynamic user inquiry
user_inquiry = "How is my information used according to your privacy statement?"
invoke_model_and_print_all_metrics(static_prompt_content, user_inquiry, invocation_number=1)

# Small delay before the second invocation
time.sleep(3) 

# Second invocation: This will attempt to read the static_content from cache.
# Define the dynamic user inquiry
user_inquiry = "How do you handle my information, based on your privacy policy?"
invoke_model_and_print_all_metrics(static_prompt_content, user_inquiry, invocation_number=2)

print("\n--------------------")
print("The output above shows all metrics returned in the 'usage' field for both model invocations.")
print("Observe 'cacheWriteInputTokens' for the first invocation and 'cacheReadInputTokens' for the second.")
print("Also compare the latency in milliseconds for each execution.")
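
Because of the 5-minute TTL, re-running the cell after a long pause will show a cache write rather than a cache read. The following sketch tracks whether the cache is likely still warm. The helper name `cache_likely_warm` is hypothetical, and the check is only a local estimate based on elapsed time, not a query against the service.

```python
import time

CACHE_TTL_SECONDS = 300  # the 5-minute TTL mentioned in this step


def cache_likely_warm(last_use, now=None, ttl_seconds=CACHE_TTL_SECONDS):
    """Return True if fewer than ttl_seconds have elapsed since last_use.

    Both timestamps are expected to come from time.monotonic().
    """
    if now is None:
        now = time.monotonic()
    return (now - last_use) < ttl_seconds


# Record a timestamp right after an invocation, then check before the next one.
last_use = time.monotonic()
print(f"Cache likely warm: {cache_likely_warm(last_use)}")
```

If the check returns `False`, expect the next call to report `cacheWriteInputTokens` instead of `cacheReadInputTokens`.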

Congratulations! You have successfully completed objective two: **Implement Response Caching to Reduce Costs**. You can now close the `objective_two.ipynb` file.