# LLM Web Search - Google

> *This notebook should work well with the **`Data Science 3.0`** kernel in SageMaker Studio*

## Introduction

In this notebook we show you how to:
- Define a tool that the LLM can reliably call using JSON output
- Use the googlesearch module to search the internet if the LLM cannot answer a research question itself
- Scrape and process the HTML pages into context for the LLM
- Create a Bedrock Guardrail 
- Use the Guardrail in your calls to the Bedrock API

We will call Anthropic's Claude 3.5 Sonnet model on Amazon Bedrock through the Boto3 API.

**Note:** *This notebook can be used in SageMaker Studio or run locally if you set up your AWS credentials.*

#### Prerequisites
- This notebook requires permissions to access Amazon Bedrock
- Ensure you have gone to the Bedrock model access page and enabled access to Anthropic Claude 3.5 Sonnet and Claude 3 Haiku

If you are running this notebook without an Admin role, make sure that your notebook's role includes the following managed policies:
- AmazonBedrockFullAccess

#### Use case
You are building a research assistant GenAI application. In some cases the user's question may be about an event, product, or service that is more recent than the LLM's training cutoff date, or otherwise outside the model's knowledge. For these cases, we want the LLM to call the internet search tool to gather context relating to the question. We can then supply that context back to the LLM so it can answer the question.

***

## Notebook setup
Before starting, let's install and import the required Python packages. Then we configure the region and modelId variables we need.

In [None]:
!pip3 install -qU pip
!pip3 install -r requirements.txt
!pip3 install -qU boto3

In [None]:
import boto3
import json
import requests
import string
import pprint
#from datetime import date
#from datetime import datetime
from googlesearch import search
from bs4 import BeautifulSoup
from botocore.exceptions import ClientError

session = boto3.Session()
region = session.region_name

#modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
#modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
modelId = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

print(f"Using modelId: {modelId}")
print(f"Using region: {region}")
print('Running boto3 version:', boto3.__version__)

In [None]:
# Create a boto3 runtime client for calling the LLM and create a boto3 admin client for creating our Guardrail
bedrock_runtime_client = boto3.client('bedrock-runtime', region_name=region)
bedrock_admin_client = boto3.client('bedrock', region_name=region)

***

## Web Searching and scraping

In this example we create three functions:
* handle_search
    * This function first calls Google search to get a list of URLs related to the user's question
    * Then it iterates through the list of URLs to compile aggregated text that can be supplied to the LLM as context in the prompt
* search_google
    * This function uses the googlesearch module to obtain URLs related to the user's question
    * Use the num_results parameter to control how many URLs you want returned
* get_page_content
    * This function uses the BeautifulSoup module to parse the html content of each URL
    * Then the text is processed to remove spaces, blank lines, and short lines

In [None]:
def get_page_content(url):
    try:
        # Use the requests module to get the contents of the URL
        headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/111.0.0.0 Safari/537.36'}
        # Check for link to PDF docs
        if ".pdf" in url.split('/')[-1]:
            print(f"Found a PDF file: {url} skipping...")
            return "skip page"
        else:
            response = requests.get(url, headers=headers, timeout=10)
            #print(f"response right after requests get is: \n{response}")
            if response:
                # Parse HTML content
                soup = BeautifulSoup(response.text, 'html.parser')
                # Remove script and style elements
                for script_or_style in soup(["script", "style"]):
                    script_or_style.decompose()
                # Get the text
                text = soup.get_text()
                # Break into lines and remove leading and trailing space on each
                lines = (line.strip() for line in text.splitlines())
                # Break multi-headlines into a line each
                chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
                # Drop blank lines
                no_blank_lines = '\n'.join(chunk for chunk in chunks if chunk)
                # Break into lines again and keep only lines of 20 or more characters
                lines = no_blank_lines.splitlines()
                # Join with spaces so words from adjacent lines do not run together
                cleaned_text = " ".join(line for line in lines if len(line) >= 20)
                return cleaned_text
            else:
                # A requests Response is falsy when the HTTP status is 4xx or 5xx
                raise Exception(f"Request failed with HTTP status {response.status_code}.")
    except requests.exceptions.Timeout:
        print(f"Timeout on this URL: {url} skipping...")
        return "skip page"
    except Exception as e:
        print(f"Error while requesting content from {url} skipping...: {e}")
        return "skip page"

def search_google(query):
    try:
        search_results = []
        # Use the googlesearch module to get URLs related to the user's question
        for url in search(query, sleep_interval=5, num_results=3):
            search_results.append(url)
        return search_results
    except Exception as e:
        print(f"Error during Google search: {e}")
        return []

def handle_search(query):
    # Proceed with Google search
    print("Searching Google...\n")
    # Sometimes Google returns only one URL even when asked for more, so retry a few times
    max_attempts = 3
    for _ in range(max_attempts):
        urls_to_scrape = search_google(query)
        if len(urls_to_scrape) != 1:
            break
    aggregated_content = ""
    for url in urls_to_scrape:
        print(f"Scraping URL: {url}")
        content = get_page_content(url)
        #print(f"\nCONTENT for this url: {url} is: \n{content}")
        if content and content != "skip page":
            aggregated_content += content
        else:
            continue

    return aggregated_content



In [None]:
# Tool definition

provider_websearch_schema = {
      "toolSpec": {
        "name": "google_search",
        "description": "A tool to retrieve up to date information from a Google search.",
        "inputSchema": {
          "json": {
            "type": "object",
            "properties": {
              "question": {
                "type": "string",
                "description": "The users question as-is to be searched by Google"
              }
            },
            "required": ["question"]
          }
        }
      }
    }

# In this example, we save only one tool schema to the configuration, but you could have many tools
toolConfig = {
    "tools": [provider_websearch_schema]
}
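
When the model decides to use a tool, the Converse API response contains a `toolUse` content block instead of (or alongside) plain text. The sketch below shows how the tool input can be pulled out of such a response. The `extract_tool_input` helper and the `sample_response` dictionary are hand-written for illustration; the sample is not real model output.

```python
def extract_tool_input(response, tool_name):
    """Return the input dict of the first matching toolUse block, or None."""
    for block in response["output"]["message"]["content"]:
        tool_use = block.get("toolUse")
        if tool_use and tool_use["name"] == tool_name:
            return tool_use["input"]
    return None


# Hand-written sample that mimics the shape of a tool-use response
sample_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {
                    "toolUse": {
                        "toolUseId": "tooluse_example",
                        "name": "google_search",
                        "input": {"question": "Who won the 2023 Masters golf tournament?"},
                    }
                }
            ],
        }
    },
    "stopReason": "tool_use",
}

print(extract_tool_input(sample_response, "google_search"))
```
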

In [None]:
# The function that answers the user's question based on aggregated Google search content
def answer_question_with_content(question, content):
    query = f"""
    Based solely on this content:
    <content>
    {content}
    </content>
    Answer this question:
    <question>
    {question}
    </question>
    Skip any preamble or references to the tool.
    """

    converse_api_params = {
        "modelId": modelId,
        "messages": [{"role": "user", "content": [{"text": query}]}],
        "system": [{ "text": "You are an expert research assistant." }],
        "inferenceConfig": {
            "maxTokens": 4096,
            "temperature": 0
        }
    }

    response = bedrock_runtime_client.converse(**converse_api_params)
    
    return response

In [None]:
# The function that answers the user's question directly or outputs tool use JSON if an internet search is required
def answer_question(question):
    query = f"""
    <question>
    {question}
    </question>

    You have access to the google_search tool. Only use the google_search tool if you cannot answer the question from your knowledge. 
    For example only use the tool if the subject or event is too new.
    Skip the preamble.
    
    """

    converse_api_params = {
        "modelId": modelId,
        "messages": [{"role": "user", "content": [{"text": query}]}],
        "toolConfig": toolConfig,
        "system": [{ "text": "You are an expert research assistant."}],
        "inferenceConfig": {
            "maxTokens": 4096,
            "temperature": 0
        }
    }

    response = bedrock_runtime_client.converse(**converse_api_params)
    
    # Check the LLM's response to see if it answered the question or needs to use the internet search tool
    google_search = None
    for content in response['output']['message']['content']:
        if isinstance(content, dict) and 'toolUse' in content:
            tool_use = content['toolUse']
            if tool_use['name'] == "google_search":
                google_search = tool_use['input']
                break

    if google_search:
        question = google_search["question"]
        # Call the function to get the content from the internet
        content = handle_search(question)
        if content:
            print("\nGoogle search successful")
            response = answer_question_with_content(question, content)
            print(f"\nFinal answer = {response['output']['message']['content'][-1]['text']}\n")
        else:
            print("No content found from Google search")
    else:
        print("No Google search needed.")
        print(f"\nFinal answer is: {response['output']['message']['content'][-1]['text']}\n")
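
The function above sends the scraped content to the model in a fresh prompt. The Converse API also lets you return the search output to the model as a `toolResult` block that references the original `toolUseId`. The following is a sketch of how those follow-up messages could be assembled; `build_tool_result_messages` and the sample messages are hypothetical and are not used elsewhere in this notebook.

```python
def build_tool_result_messages(user_message, assistant_message, search_text):
    """Build the message list for a follow-up Converse call that returns a tool result."""
    # Find the toolUse block so the result can reference its ID
    tool_use = next(
        block["toolUse"] for block in assistant_message["content"] if "toolUse" in block
    )
    tool_result_message = {
        "role": "user",
        "content": [
            {
                "toolResult": {
                    "toolUseId": tool_use["toolUseId"],
                    "content": [{"text": search_text}],
                }
            }
        ],
    }
    return [user_message, assistant_message, tool_result_message]


# Hand-written sample messages, not real model output
user_message = {"role": "user", "content": [{"text": "Who won the 2023 Masters?"}]}
assistant_message = {
    "role": "assistant",
    "content": [
        {
            "toolUse": {
                "toolUseId": "tooluse_example",
                "name": "google_search",
                "input": {"question": "Who won the 2023 Masters?"},
            }
        }
    ],
}
messages = build_tool_result_messages(user_message, assistant_message, "scraped page text")
print(messages[-1])
```

If you try this approach, the follow-up `converse` call would take this list as `messages` and would still need the same `toolConfig`.
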

In [None]:
answer_question("Who won the 2019 Masters golf tournament?")

In [None]:
answer_question("Who won the 2023 Masters golf tournament?")

In [None]:
answer_question("What is the current weather in Seattle, WA right now?")

In [None]:
answer_question("What is the current time and date in Seattle, WA?")

In [None]:
answer_question("What is the current price on Amazon stock?")

In [None]:
answer_question("Which country won the most gold medals in the 2024 Olympics?")

In [None]:
answer_question("Which country won the most gold medals in the 2020 Olympics?")

In [None]:
answer_question("Who is favored to be the next Prime Minister of Canada?")

In [None]:
answer_question("How many Grizzly bears are living in Washington State?")

***

## Create a Guardrail
Guardrails for Amazon Bedrock has multiple components, including content filters, denied topics, word and phrase filters, and sensitive information (PII and regex) filters. For the full list, see the [documentation](https://docs.aws.amazon.com/bedrock/latest/userguide/guardrails-create.html).

For our research assistant with web access, we want to prevent inappropriate or malicious questions from being sent to the LLM, and to prevent the model from returning inappropriate responses or exposing any PII.
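
The Guardrail created below uses only the managed PII entity types. As a sketch of the regex option mentioned above, a custom pattern can be supplied through `regexesConfig` inside `sensitiveInformationPolicyConfig`. The filter name, description, and pattern here are hypothetical, and the snippet only checks the pattern locally; it does not call Bedrock.

```python
import re

# Hypothetical regex filter: the name, description, and pattern are illustrative only
employee_id_filter = {
    "name": "employee-id",
    "description": "Matches internal employee IDs such as EMP-123456",
    "pattern": r"EMP-\d{6}",
    "action": "ANONYMIZE",
}

# Shape of a policy that could be passed as sensitiveInformationPolicyConfig
sensitive_information_policy = {
    "piiEntitiesConfig": [{"type": "EMAIL", "action": "ANONYMIZE"}],
    "regexesConfig": [employee_id_filter],
}

# Check the pattern locally before creating a Guardrail with it
matches_full_id = bool(re.search(employee_id_filter["pattern"], "Contact EMP-204518 for access"))
matches_short_id = bool(re.search(employee_id_filter["pattern"], "Contact EMP-2045 for access"))
print(matches_full_id, matches_short_id)
```
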

In [None]:
# Use the boto3 bedrock client to create a Bedrock Guardrail based on the specific controls we want to enforce
create_response = bedrock_admin_client.create_guardrail(
    name='research-assistant-guardrail',
    description='Prevents inappropriate or malicious questions and model answers. Also blocks political topics and anonymizes PII data.',
    topicPolicyConfig={
        'topicsConfig': [
            {
                'name': 'Politics',
                'definition': 'Preventing the user from asking questions related to politics for any country.',
                'examples': [
                    'Who is expected to win the next race for Prime Minister of India?',
                    'Which political party is in power in England?',
                    'Which country has had the most impeachments of heads of state?',
                    'Who should I vote for in the next election?',
                    'Which countries have had the most political scandals this year?'
                ],
                'type': 'DENY'
            }
        ]
    },
    contentPolicyConfig={
        'filtersConfig': [
            {
                'type': 'SEXUAL',
                'inputStrength': 'HIGH',
                'outputStrength': 'HIGH'
            },
            {
                'type': 'VIOLENCE',
                'inputStrength': 'HIGH',
                'outputStrength': 'HIGH'
            },
            {
                'type': 'HATE',
                'inputStrength': 'HIGH',
                'outputStrength': 'HIGH'
            },
            {
                'type': 'INSULTS',
                'inputStrength': 'HIGH',
                'outputStrength': 'HIGH'
            },
            {
                'type': 'MISCONDUCT',
                'inputStrength': 'HIGH',
                'outputStrength': 'HIGH'
            }
        ]
    },
    wordPolicyConfig={
        'wordsConfig': [
            {'text': 'political party'},
            {'text': 'voting for'},
            {'text': 'politics'},
            {'text': 'voting advice'},
            {'text': 'vote for President'},
            {'text': 'vote for Prime'},
            {'text': 'vote for Chancellor'},
            {'text': 'King and Queen'},
            {'text': 'Duke and Duchess'},
            {'text': 'Chairman of North'},
            {'text': 'Supreme Leader'}
        ],
        'managedWordListsConfig': [
            {'type': 'PROFANITY'}
        ]
    },
    sensitiveInformationPolicyConfig={
        'piiEntitiesConfig': [
            {'type': 'EMAIL', 'action': 'ANONYMIZE'},
            {'type': 'PHONE', 'action': 'ANONYMIZE'},
            {'type': 'US_SOCIAL_SECURITY_NUMBER', 'action': 'ANONYMIZE'},
            {'type': 'US_BANK_ACCOUNT_NUMBER', 'action': 'ANONYMIZE'},
            {'type': 'CREDIT_DEBIT_CARD_NUMBER', 'action': 'ANONYMIZE'}
        ]
    },
    blockedInputMessaging="""I can provide answers for your research, but I'm not allowed to answer this particular question. Please try a different question. """,
    blockedOutputsMessaging="""I'm not allowed to share the answer to this particular question. Please try a different question.""",
    tags=[
        {'key': 'purpose', 'value': 'fiduciary-advice-prevention'},
        {'key': 'environment', 'value': 'production'}
    ]
)

pprint.pprint(create_response)

In [None]:
# Create a versioned snapshot of our draft Guardrail 
version_response = bedrock_admin_client.create_guardrail_version(
    guardrailIdentifier=create_response['guardrailId'],
    description='Version of research assistant Guardrail'
)
pprint.pprint(version_response)

In [None]:
# Create a Guardrail config that we can pass into the Converse API call
# Use the Guardrail ID and version that we just created above.
# Optionally, enable the Guardrail trace so that we can view the effect it has on questions and answers.
guardrail_config = {
    "guardrailIdentifier": version_response['guardrailId'],
    "guardrailVersion": version_response['version'],
    "trace": "enabled"
}

## Testing our Guardrail

In [None]:
# Modify the function that answers the question based on google search content to use the Guardrail
# Add the Guardrail context to the messages array that we use in the converse API call 
# Add the Guardrail config to the converse API parameters
def answer_question_with_content(question, content):
    query = f"""
    Based solely on this content:
    <content>
    {content}
    </content>
    Answer this question:
    <question>
    {question}
    </question>
    Skip any preamble or references to the tool.
    """

    converse_api_params = {
        "modelId": modelId,
        "messages":[
            {
            "role": "user",
            "content": [{"guardContent": {"text": {"text": query}}}]
            }
        ],
        "system": [{ "text": "You are an expert research assistant." }],
        "inferenceConfig":{
            "maxTokens": 4096,
            "temperature": 0
        },
        "guardrailConfig": guardrail_config,
    }

    response = bedrock_runtime_client.converse(**converse_api_params)
    if response['stopReason'] == "guardrail_intervened":
        trace = response['trace']
        print("Guardrail trace:")
        pprint.pprint(trace['guardrail'])
    
    return response
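
The printed Guardrail trace can be deeply nested. As a rough sketch, the helper below walks any trace-like structure and collects every entry that carries an `action` key, so you can see at a glance which policy fired. The `collect_guardrail_actions` helper and the `sample_trace` dictionary are hand-written assumptions; the exact trace layout may differ from this sample.

```python
def collect_guardrail_actions(node, path=""):
    """Recursively collect every dict in the trace that carries an 'action' key."""
    findings = []
    if isinstance(node, dict):
        if "action" in node:
            findings.append((path, node))
        for key, value in node.items():
            findings.extend(collect_guardrail_actions(value, f"{path}/{key}"))
    elif isinstance(node, list):
        for item in node:
            findings.extend(collect_guardrail_actions(item, path))
    return findings


# Hand-written sample, not a real Guardrail trace
sample_trace = {
    "inputAssessment": {
        "example-guardrail-id": {
            "topicPolicy": {
                "topics": [{"name": "Politics", "type": "DENY", "action": "BLOCKED"}]
            }
        }
    }
}

for path, finding in collect_guardrail_actions(sample_trace):
    print(path, finding)
```

Because the walker makes no assumption about key names other than `action`, it should keep working if the trace gains additional policy sections.
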

In [None]:
# Modify the function that answers the question directly or outputs tool use if an internet search is required
# Add the Guardrail context to the messages array that we use in the converse API call 
# Add the Guardrail config to the converse API parameters
def answer_question(question):
    query = f"""
    <question>
    {question}
    </question>

    You have access to the google_search tool. Only use the google_search tool if you cannot answer the question from your knowledge. 
    For example only use the tool if the subject or event is too new.
    Skip the preamble.
    
    """

    converse_api_params = {
        "modelId": modelId,
        "messages":[
            {
            "role": "user",
            "content": [{"guardContent": {"text": {"text": query}}}]
            }
        ],
        "toolConfig": toolConfig,
        "system": [{ "text": "You are an expert research assistant." }],
        "inferenceConfig": {
            "maxTokens": 4096,
            "temperature": 0
        },
        "guardrailConfig": guardrail_config,
    }

    response = bedrock_runtime_client.converse(**converse_api_params)
    if response['stopReason'] == "guardrail_intervened":
        trace = response['trace']
        print("Guardrail trace:")
        pprint.pprint(trace['guardrail'])

    google_search = None
    for content in response['output']['message']['content']:
        if isinstance(content, dict) and 'toolUse' in content:
            tool_use = content['toolUse']
            if tool_use['name'] == "google_search":
                google_search = tool_use['input']
                break

    if google_search:
        question = google_search["question"]
        content = handle_search(question)
        if content:
            print("\nGoogle search successful")
            response = answer_question_with_content(question, content)
            print(f"\nFinal answer = {response['output']['message']['content'][-1]['text']}\n")
        else:
            print("No content found from Google search")
    else:
        print("No Google search needed.")
        print(f"\nFinal answer is: {response['output']['message']['content'][-1]['text']}\n")

In [None]:
answer_question("Who won the 2023 Masters golf tournament?")

In [None]:
answer_question("Who is favored to win the next election for Prime Minister of Canada?")

In [None]:
answer_question("What is the email address for AWS Support?")

In [None]:
answer_question("Where can I purchase brass knuckles?")

In [None]:
answer_question("How many Grizzly bears are living in Washington State?")