# Environment Setup for Labs

## How to use the notebook
1. Do not run all cells at once by clicking the double-arrow button.
2. Run one cell at a time, in sequential order, by clicking the single arrow.
3. Empty square brackets to the left of a cell indicate that the cell has not been executed.
4. After you click the arrow, a * in the square brackets to the left of the cell indicates that the cell is still executing. Some cells take longer than others. Please wait until the * is replaced by a number.

In this notebook we will prepare the environment for the following labs:
1. Install Python dependencies.
2. Download sample documents and upload them for a knowledge base.
3. Update the knowledge base with the documents.

We will use Amazon's return policies, published on its websites, as sample documents. The original documents are available at:
* https://www.amazon.in/gp/help/customer/display.html?nodeId=202111910 (India)
* https://www.amazon.co.uk/gp/help/customer/display.html?nodeId=GKM69DUUYKQWKWX7 (UK)
* https://www.amazon.com/gp/help/customer/display.html/?nodeId=GKM69DUUYKQWKWX7 (US)

The metadata files for the documents are pre-created in the "metadata" folder.
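
For orientation, here is a minimal sketch of how such a metadata file could be produced. Amazon Bedrock Knowledge Bases read a sidecar file named `<document file name>.metadata.json` whose attributes sit under a `metadataAttributes` key. The helper name `write_metadata_sidecar`, the `.pdf` extension, and the `country` attribute are assumptions for illustration; the pre-created files in this workshop may use different attributes.

```python
import json
from pathlib import Path


def write_metadata_sidecar(doc_path, attributes, out_dir):
    """Write '<document file name>.metadata.json' into out_dir and return its path.

    Hypothetical helper: not part of the workshop's utils package.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    sidecar = out_dir / f"{Path(doc_path).name}.metadata.json"
    # Bedrock Knowledge Bases expect the attributes under "metadataAttributes".
    sidecar.write_text(json.dumps({"metadataAttributes": attributes}, indent=2))
    return sidecar


# Example call (attribute name and value are illustrative only):
# write_metadata_sidecar("kb_docs/Amazon-return-policy-in.pdf", {"country": "India"}, "metadata")
```

Attributes stored this way can later be used to filter retrieval results, for example to restrict answers to one country's policy.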

In [None]:
# This might take a few seconds and may print some errors. You can ignore them.
!pip install -r requirements.txt -Uq

In [None]:
import boto3
from utils import get_param_value

# Get AWS Account ID and Region
session = boto3.Session()

sts = session.client('sts')
identity = sts.get_caller_identity()
account_id = identity['Account']
region = session.region_name or 'us-west-2'  # reuse the session; fall back to us-west-2 if no region is configured

print(f"Account ID: {account_id}")
print(f"Region: {region}")

In [None]:
# Import the process_urls function from the web_scraper module in utils package
from utils.web_scraper import process_urls

# Define a list of tuples containing Amazon return policy URLs and their corresponding output filenames
# Each tuple contains (URL, filename)
# The URLs are for Amazon's return policy pages from different regional sites (India, UK, and US)
urls = [
    ("https://www.amazon.in/gp/help/customer/display.html?nodeId=202111910", "Amazon-return-policy-in"),
    ("https://www.amazon.co.uk/gp/help/customer/display.html?nodeId=GKM69DUUYKQWKWX7", "Amazon-return-policy-uk"),
    ("https://www.amazon.com/gp/help/customer/display.html/?nodeId=GKM69DUUYKQWKWX7", "Amazon-return-policy-us")
]

# Print start message to indicate processing has begun
print("Processing URLs...")

# Process each URL: scrape the content and convert it to PDF
# The PDFs will be saved in the kb_docs directory with the specified filenames
# The call below is commented out; uncomment it to actually scrape the pages
# process_urls(urls)

# Print completion message to indicate all URLs have been processed
print("Done!")

In [None]:
# Copy the documents in ./kb_docs (downloaded in the previous cell) to the knowledge base S3 bucket in this AWS account
!aws s3 sync ./kb_docs s3://{account_id}-{region}-kb-data-bucket
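
The same upload can be sketched in Python with boto3's `upload_file`. This is a simplified stand-in, not an equivalent of `aws s3 sync`: it uploads every file unconditionally, whereas `sync` skips files that are already up to date. The helper name `upload_directory` is hypothetical.

```python
from pathlib import Path


def upload_directory(s3_client, local_dir, bucket):
    """Upload every file under local_dir to bucket, keyed by its relative path.

    Hypothetical helper; s3_client is expected to be a boto3 S3 client.
    """
    local_dir = Path(local_dir)
    uploaded = []
    for path in sorted(local_dir.rglob("*")):
        if path.is_file():
            # Use forward slashes so nested folders become S3 key prefixes.
            key = path.relative_to(local_dir).as_posix()
            s3_client.upload_file(str(path), bucket, key)
            uploaded.append(key)
    return uploaded


# Example call, reusing the names defined earlier in this notebook:
# upload_directory(session.client("s3"), "./kb_docs", f"{account_id}-{region}-kb-data-bucket")
```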

In [None]:
# Retrieve the knowledge base ID and data source ID from AWS Systems Manager Parameter Store

kb_id = get_param_value("/app/workshop/kb/knowledge-base-id")
ds_id = get_param_value("/app/workshop/kb/data-source-id")

print(f"kb_id = {kb_id}")
print(f"ds_id = {ds_id}")
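
The implementation of `get_param_value` lives in the `utils` package and is not shown here. A minimal sketch of what such a helper might look like, assuming it simply wraps the Systems Manager `get_parameter` call, is given below. The name `get_param_value_sketch` and the optional `ssm_client` argument are assumptions for illustration.

```python
def get_param_value_sketch(name, ssm_client=None):
    """Return the string value of a Parameter Store parameter.

    Hypothetical stand-in for utils.get_param_value; the real helper may differ.
    """
    if ssm_client is None:
        # Imported lazily so the function can be exercised with a stub client.
        import boto3
        ssm_client = boto3.client("ssm")
    response = ssm_client.get_parameter(Name=name)
    return response["Parameter"]["Value"]


# Example call:
# get_param_value_sketch("/app/workshop/kb/knowledge-base-id")
```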

In [None]:
%%time

# Amazon Bedrock Knowledge Bases manages the ingestion pipeline: it chunks the data, generates
# vector embeddings, and stores the chunks and their embeddings in a vector store.

from utils.knowledgebase import ingest_documents_to_kb
ingest_documents_to_kb(session, kb_id, ds_id, region)
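
The body of `ingest_documents_to_kb` is not shown in this notebook. As a rough sketch, an ingestion run can be started and polled through the `bedrock-agent` client's `start_ingestion_job` and `get_ingestion_job` calls. The function name `ingest_and_wait` and the polling interval are assumptions, and the real helper may behave differently.

```python
import time

# Statuses after which the ingestion job will not change any more.
TERMINAL_STATUSES = {"COMPLETE", "FAILED", "STOPPED"}


def ingest_and_wait(agent_client, kb_id, ds_id, poll_seconds=10):
    """Start an ingestion job and poll until it reaches a terminal status.

    Hypothetical helper; agent_client is expected to be a boto3 "bedrock-agent" client.
    """
    job = agent_client.start_ingestion_job(
        knowledgeBaseId=kb_id, dataSourceId=ds_id
    )["ingestionJob"]
    while job["status"] not in TERMINAL_STATUSES:
        time.sleep(poll_seconds)
        job = agent_client.get_ingestion_job(
            knowledgeBaseId=kb_id,
            dataSourceId=ds_id,
            ingestionJobId=job["ingestionJobId"],
        )["ingestionJob"]
    return job


# Example call, reusing the names defined earlier in this notebook:
# job = ingest_and_wait(session.client("bedrock-agent", region_name=region), kb_id, ds_id)
# print(job["status"])
```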

### Enable Generative AI Observability

X-Ray Transaction Search is needed for CloudWatch GenAI observability because it provides the distributed tracing capabilities that are essential for monitoring AI applications. <br/>
**Please note that the update takes effect 10-15 minutes after the cell is executed.**


In [None]:
import json

# Step 1: Put resource policy
logs = session.client("logs")

# Resource policy that allows X-Ray to write log events to the span log groups
transaction_search_policy_dict = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "TransactionSearchXRayAccess",
            "Effect": "Allow",
            "Principal": {
                "Service": "xray.amazonaws.com"
            },
            "Action": "logs:PutLogEvents",
            "Resource": [
                f"arn:aws:logs:{region}:{account_id}:log-group:aws/spans:*",
                f"arn:aws:logs:{region}:{account_id}:log-group:/aws/application-signals/data:*"
            ],
            "Condition": {
                "ArnLike": {
                    "aws:SourceArn": f"arn:aws:xray:{region}:{account_id}:*"
                },
                "StringEquals": {
                    "aws:SourceAccount": account_id
                }
            }
        }
    ]
}

logs.put_resource_policy(
    policyName="xray_policy_transaction_search",
    policyDocument=json.dumps(transaction_search_policy_dict)
)

# Step 2: Set the AWS X-Ray trace segment destination to CloudWatch Logs
!aws xray update-trace-segment-destination --destination CloudWatchLogs

# You can ignore the error message "The destination is already set to CloudWatchLogs"

# Step 3: Update the default indexing rule (DesiredSamplingPercentage is the percentage of spans to index)
!aws xray update-indexing-rule --name "Default" --rule '{"Probabilistic": {"DesiredSamplingPercentage": 1.0}}'
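
Because the change takes a while to apply, it can help to check its state afterwards. The sketch below assumes the X-Ray `get_trace_segment_destination` call, which reports the current destination and whether the update is still pending. The helper name `transaction_search_ready` is hypothetical.

```python
def transaction_search_ready(xray_client):
    """Return True once spans are being sent to CloudWatch Logs and the change is active.

    Hypothetical helper; xray_client is expected to be a boto3 "xray" client.
    """
    response = xray_client.get_trace_segment_destination()
    return (
        response.get("Destination") == "CloudWatchLogs"
        and response.get("Status") == "ACTIVE"
    )


# Example call, reusing the session defined earlier in this notebook:
# print(transaction_search_ready(session.client("xray")))
```

If this returns `False` shortly after running the cell above, wait a few minutes and check again.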