# LLM Web Search

> *This notebook should work well with the **`conda_python3`** kernel in SageMaker Studio*

## Introduction

In this notebook we show you how to:
- Define a tool that the LLM can reliably call that produces JSON output
- Use the `googlesearch` and `wikipedia` Python modules to search the internet when the LLM cannot answer a research question itself
- Rerank the search results from best to worst
- Scrape and process the best option HTML page to create context for the LLM

We will use the `Claude 3 Sonnet`, `Claude 3.5 Sonnet` (default), and `Claude 3 Haiku` base models on Amazon Bedrock through the AWS boto3 SDK.

> **Note:** *This notebook can be used in SageMaker Studio or run locally if you set up your AWS credentials.*

#### Prerequisites
- This notebook requires permissions to access Amazon Bedrock
- Ensure you have gone to the Bedrock model access page in the AWS Console and enabled access to the Anthropic Claude model you plan to use (`Claude 3.5 Sonnet` is the default in this notebook)
- If you are running this notebook without an Admin role, make sure that your notebook's role includes the following managed policy:
> AmazonBedrockFullAccess

#### Use case
You are building a research assistant GenAI application. In some cases the user's question may be about an event, product, or service that is more recent than the LLM's training cutoff date or otherwise outside the model's knowledge. For these cases, we want the LLM to call the internet search tool to gather context relating to the question. Then we can supply that context back to the LLM to answer the question.

***

## Notebook setup

1. If you are attending an instructor-led workshop or deployed the workshop infrastructure using the provided [CloudFormation Template](https://raw.githubusercontent.com/aws-samples/xxx/main/cloudformation/workshop-v1-final-cfn.yml), you can proceed to step 2. Otherwise, you will need to download the workshop [GitHub Repository](https://github.com/aws-samples/xxx) to your local machine.

2. Install the required dependencies by running the pip install commands in the next cell.
 

⚠️ **Please ignore error messages related to pip's dependency resolver.**

💡 **Tip** You can use `Shift + Enter` to execute the cell and move to the next one.

In [None]:
!pip install -qU pip
!pip install -r requirements.txt
!pip install wikipedia

In [75]:
import boto3
import json
import requests
import random
import time
from datetime import datetime as dt
from googlesearch import search
import wikipedia
from bs4 import BeautifulSoup
import pdfplumber
import io
from botocore.exceptions import ClientError
from markdownify import markdownify as md

session = boto3.Session()
region = session.region_name

# Change which line is uncommented below to select the LLM model you want to use
#modelId = 'anthropic.claude-3-sonnet-20240229-v1:0'
#modelId = 'anthropic.claude-3-haiku-20240307-v1:0'
modelId = 'anthropic.claude-3-5-sonnet-20240620-v1:0'

print(f"Using modelId: {modelId}")
print(f"Using region: {region}")
print('Running boto3 version:', boto3.__version__)

Using modelId: anthropic.claude-3-5-sonnet-20240620-v1:0
Using region: us-west-2
Running boto3 version: 1.35.34


The `modelId` and `region` variables defined in the above cell will be used throughout the workshop.

Just make sure to run the cells from top to bottom.

### The Boto3 SDK & the Converse API
We will be using the [Amazon Boto3 SDK](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime.html) and the [Converse API](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/bedrock-runtime/client/converse.html) throughout this workshop. 

In [2]:
# Create a boto3 Bedrock runtime client for calling the LLM
bedrock_runtime_client = boto3.client(service_name='bedrock-runtime', region_name=region)

***

## Asking the LLM questions without an internet search tool

Let's start out by asking some questions to the LLM without supplying the option of an internet search tool.

Create two functions:
* `call_bedrock`
    * This function takes the parameters you set for the Bedrock Converse API and uses the runtime client to call it
    * A retry-with-backoff mechanism handles throttling responses from Bedrock
* `answer_question`
    * This function defines the prompt template and the Converse API parameters
    * It passes them to the `call_bedrock` function and then parses the response for printing

In [54]:
# Function for calling the Bedrock Converse API...
def call_bedrock(messages, system_prompt, tool_config=None, max_retries=3, initial_delay=1):
    converse_api_params = {
        "modelId": modelId,
        "system": [{"text": system_prompt}],
        "messages": messages,
        "inferenceConfig": {
            "maxTokens": 4096,
            "temperature": 0
        }
    }
    
    if tool_config:
        converse_api_params["toolConfig"] = tool_config

    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return bedrock_runtime_client.converse(**converse_api_params)
        except ClientError as err:
            if err.response['Error']['Code'] == 'ThrottlingException' and attempt < max_retries - 1:
                print(f"Throttling Exception Occurred...Retrying... Attempt {attempt + 1}/{max_retries}")
                time.sleep(delay)
                delay *= 2  # Exponential backoff
            else:
                print(f"ClientError while calling the Bedrock API: {err}")
                return None
        except Exception as err:
            print(f"Error while calling the Bedrock API: {err}")
            return None
    print(f"Attempted {max_retries} times but no success.")
    return None
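
The retry logic in `call_bedrock` can also be written as a small standalone helper. The following is a minimal sketch, not part of the workshop code: `retry_with_backoff`, `is_retryable`, and the injectable `sleep` parameter are hypothetical names introduced here, and the 10% jitter is an added assumption that the function above does not use. Unlike `call_bedrock`, this sketch re-raises the last error instead of returning `None`.

```python
import random
import time


def retry_with_backoff(func, is_retryable, max_retries=3, initial_delay=1.0,
                       sleep=time.sleep):
    """Call func(); on a retryable error wait, double the delay, and try again.

    func, is_retryable, and sleep are hypothetical parameters for this sketch.
    """
    delay = initial_delay
    for attempt in range(max_retries):
        try:
            return func()
        except Exception as err:
            # Give up on non-retryable errors or when attempts are exhausted
            if not is_retryable(err) or attempt == max_retries - 1:
                raise
            # Add up to 10% random jitter so parallel callers do not retry in lockstep
            sleep(delay + random.uniform(0, delay * 0.1))
            delay *= 2  # Exponential backoff


# Usage with a fake call that is throttled twice before succeeding
attempts = {"count": 0}
waits = []


def flaky_call():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise RuntimeError("ThrottlingException")
    return "ok"


result = retry_with_backoff(
    flaky_call,
    is_retryable=lambda err: "ThrottlingException" in str(err),
    sleep=waits.append,  # Record the waits instead of actually sleeping
)
print(result, waits)
```

Passing `sleep` as a parameter keeps the example fast to run and makes the delay schedule easy to inspect.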


In [55]:
def answer_question(question):
    query = f"""
    <question>
    {question}
    </question>

    Answer the user's question in complete sentences.
    Skip the preamble.
    
    """
    messages = [{"role": "user", "content": [{"text": query}]}]
    system_prompt = "You are an expert research assistant."
    
    response = call_bedrock(messages, system_prompt)
    if response:
        # Provide the LLM's response
        print(f"\nFinal answer is: {response['output']['message']['content'][-1]['text']}\n")
    else:
        print("Unable to get a response from Bedrock")

In [None]:
# The LLM should be able to answer this question from its own knowledge
answer_question("Which country won the most gold medals in the 2020 olympics?")

In [None]:
# Now try a question where the information is too new and past the LLM's training cutoff date
answer_question("Which country won the most gold medals in the 2024 olympics?")

In [None]:
answer_question("What is the current weather in Seattle, Wa right now?")

***

## Web Searching and scraping with Wikipedia and Google

In this example we create two functions:
* internet_search
    * This function first calls the internet provider to search for pages (Wikipedia) / URLs (Google) related to the user's question
    * Use the `num_results` parameter to control how many pages/URLs you want returned
    * Then it returns the list of pages/URLs
* get_wikipedia_page_content
    * This function uses the `wikipedia` module to get the HTML content of a single Wikipedia page
    * The `markdownify` module converts the page HTML (including tables) to markdown text
    * Then the text is processed line by line to drop edit links and image links, and to cut off the standard info sections at the bottom of Wikipedia pages

In [56]:
# Set the Websearch provider
search_provider = "Wikipedia"

In [68]:
class ToolsList:
    def internet_search(self, question, num_results=5):
        # Proceed with internet search
        print(f"Searching {search_provider}...\n")
        search_results = []
        try:
            if search_provider == "Wikipedia":
                # Use the wikipedia module to get wiki pages related to the user's question
                search_results = wikipedia.search(question, results=num_results)
            elif search_provider == "Google":
                # Use the googlesearch module to get pages related to the user's question
                for page in search(question, sleep_interval=5, num_results=num_results):
                    search_results.append(page)
            else:
                print(f"Unsupported search provider: {search_provider}")
                return None
            # Return the list of pages/URLs returned by the internet search provider
            return search_results
        except Exception as err:
            print(f"Error during internet_search: {err}")
            return None

In [59]:
def get_wikipedia_page_content(page):
    try:
        # use the wikipedia module to return the html content of a specific page
        html_text = wikipedia.page(title=page, auto_suggest=False).html()
        # As a first cleanup step and to retain tabular data, we will use the markdownify module to convert html to markdown text
        markdown_text = md(html_text)
        payload = "{} \n".format(markdown_text)
        # Initialize a variable to hold our cleaned markdown text as we process each line
        cleaned_markdown_text = ""
        if markdown_text:
            lines = payload.splitlines()
            # Do not include the standard info sections at the bottom of Wikipedia pages in the content
            # Do not include the edit links or links to images to reduce token count
            for line in lines:
                if line == "" or "[edit]" in line or "[![]" in line:
                    continue
                elif line == "See also" or line == "References" or line == "External links" or line == "Further reading":
                    break
                else:
                    # Only add lines that contain the main content of the page
                    cleaned_markdown_text += line + '\n'
        else:
            # Handle the situation where no content was found on a page
            print(f"No markdown text found on page: {page}")
            return None
    except Exception as err:
        print(f"Error while requesting content from {page} skipping...: {err}")
        return "skip page"
    
    return cleaned_markdown_text
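
To try the line-filtering step without a network call, it can be isolated and run against a hand-written string. This is a minimal sketch: `clean_wikipedia_markdown`, `STOP_HEADINGS`, and the sample text are hypothetical names and data introduced for illustration, and the exact form of the headings that `markdownify` produces for a real page is an assumption.

```python
STOP_HEADINGS = {"See also", "References", "External links", "Further reading"}


def clean_wikipedia_markdown(markdown_text):
    """Keep main-content lines and stop at the first trailing info section."""
    kept = []
    for line in markdown_text.splitlines():
        stripped = line.strip()
        # Skip blank lines, edit links, and linked images
        if not stripped or "[edit]" in stripped or "[![]" in stripped:
            continue
        # Everything from the first of these headings onward is dropped
        if stripped in STOP_HEADINGS:
            break
        kept.append(line)
    return "\n".join(kept)


# Made-up sample that mimics the shape of converted page markdown
sample = """Example topic

An introductory paragraph about the topic.
[![](thumb.jpg)](/wiki/File:Example.jpg)

See also
Another related page
"""
print(clean_wikipedia_markdown(sample))
```

Only the title and the introductory paragraph survive: the image link is skipped and everything from `See also` onward is cut.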

## Use the Bedrock Converse API for inference and configure 'Tool Use'

* Configure the tool definition
    * This JSON schema defines our internet search tool and how the LLM should output the JSON when calling the tool

In [60]:
# Tool definition
provider_websearch_schema = {
      "toolSpec": {
        "name": "internet_search",
        "description": "A tool to retrieve up to date information from an internet search.",
        "inputSchema": {
          "json": {
            "type": "object",
            "properties": {
              "question": {
                "type": "string",
                "description": "The users question as-is for the internet search."
              }
            },
            "required": ["question"]
          }
        }
      }
    }

# In this example, we save only one tool schema to the configuration, but you could have many tools
tool_config = {
    "tools": [provider_websearch_schema],
    "toolChoice": {"auto": {}}
}
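
When the model decides to call the tool, the Converse API response carries a `toolUse` content block whose `input` follows the schema above. The sketch below shows one way to pull that block out of a response. `find_tool_use` is a hypothetical helper, and `mock_response` is a hand-written dictionary with made-up values, not real model output.

```python
def find_tool_use(response, tool_name):
    """Return the first toolUse block for tool_name, or None if there is none."""
    for block in response["output"]["message"]["content"]:
        tool_use = block.get("toolUse")
        if tool_use and tool_use["name"] == tool_name:
            return tool_use
    return None


# Hand-written response in the shape the Converse API returns (values are made up)
mock_response = {
    "output": {
        "message": {
            "role": "assistant",
            "content": [
                {"text": "I need to look this up."},
                {
                    "toolUse": {
                        "toolUseId": "tooluse_example123",
                        "name": "internet_search",
                        "input": {"question": "Which country won the most gold medals in the 2024 olympics?"},
                    }
                },
            ],
        }
    },
    "stopReason": "tool_use",
}

tool_use = find_tool_use(mock_response, "internet_search")
print(tool_use["input"]["question"])
```

A `None` result means the model answered directly, so no search is needed.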

## Create the answer_question function
This is the main function for orchestrating the entire conversation flow

* This function calls the LLM to answer the user's question directly or outputs 'tool use' JSON if an internet search is required
* Note that the LLM will have a propensity to use the tool, so we must direct it in the prompt to only do so as a last resort
* If the LLM decides it needs to use the tool, it will output the tool name and arguments in JSON format
* Then the tool is invoked and provided the tool arguments which produces a list of Wikipedia pages or Google URLs depending on which search provider is defined
* The list of pages/URLs is sent to the same model to rerank them in the order of best option to worst option
* The reranked options are iterated through until a valid response is returned. We only want one valid response to save on cost and reduce the token count we send the LLM.
* Finally, we send the original user's question along with the content scraped from the Wikipedia page or Google URL to the LLM to arrive at a final answer

Note: As we progress through the requests and responses, we add them to `messages_trace`. The function prints the entire trace after the final answer. If you do not want to see the full conversation, comment out the `print` statements that output the trace, re-run the `answer_question` cell, and ask your questions again.

In [61]:
# Function for orchestrating the conversation flow...

def answer_question(question):
    # Initialize the messages_trace array:
    messages_trace = []
    # Create the initial message including the user's question
    messages = [{"role": "user", "content": [{"text": question}]}]
    # Append this message to the messages_trace
    messages_trace.append(messages)
    
    system_prompt = f"""
    Only search the web for queries that you can not confidently answer.
    Today's date is {dt.now().strftime("%B %d %Y")}
    If you think a user's question involves something in the future that hasn't happened yet, use the search tool.
    """
    
    response = call_bedrock(messages, system_prompt, tool_config)
    if response:
        # Check the LLM's response to see if it answered the question or needs to use the tool
        use_tool = None
        for content in response['output']['message']['content']:
            if isinstance(content, dict) and 'toolUse' in content:
                tool_use = content['toolUse']
                if tool_use['name'] == "internet_search":
                    use_tool = tool_use['input']
                    break

        #Add the intermediate output to the messages_trace array:
        messages_trace.append(response['output']['message'])
        
        if use_tool:            
            # Get the tool name and arguments:
            tool_name = tool_use['name']
            print(f"Calling tool: {tool_name}")
            tool_args = tool_use['input'] or {}
            print(f"Tool args are: {tool_args}")
    
            # Call the tool function:
            tool_response = getattr(ToolsList(), tool_name)(**tool_args) or ""
            if tool_response:
                tool_status = 'success'
            else:
                tool_status = 'error'
            print(f"Tool response is: {tool_response}")
            tool_response = json.dumps(tool_response)
            #Add the tool result to the messages_trace:
            messages_trace.append(
                {
                    "role": "user",
                    "content": [
                        {
                            'toolResult': {
                                'toolUseId':tool_use['toolUseId'],
                                'content': [
                                    {
                                        "text": tool_response
                                    }
                                ],
                                'status': tool_status
                            }
                        }
                    ]
                }
            )
            
            # RERANK
            # We want to avoid having to send the contents for all pages/URLs returned by the internet search tool as that would be more expensive
            # So we will call the same modelId we have specified initially and pass it the list of pages/URLs
            # We will ask the model to rerank the list in the order of best option to worst option
            # Then we will scrape the page of only the best option to provide up-to-date context related to the user's question
            query = f"""
            Given this user's question:
            <question>
            {question}
            </question>

            Rank from best to worst the choices that are provided in the choices tags for searching the internet to provide an answer to the user's question.
            <choices>
            {tool_response}
            </choices>
            Skip the preamble and do not include any reasoning in your output.
            Do not enumerate or add anything to the list.
            Simply return the choices in a JSON list from best to worst choice.
            """
            messages = [{"role": "user", "content": [{"text": query}]}]
            # Append this message to our messages_trace
            messages_trace.append(messages)
            system_prompt = "You are an expert research assistant."
            
            # Call the LLM to rerank the pages/URLs from best to worst based on the user's question
            response = call_bedrock(messages, system_prompt, tool_config={})
            if response:
                reranked_options = response['output']['message']['content'][-1]['text']
                reranked_options = json.loads(reranked_options)
            else:
                print("Unable to get a response from Bedrock at the reranking step")
                return None
            print(f"reranked_options are: {reranked_options}")
            messages_trace.append(response['output']['message'])
            
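            # Scrape the reranked options in order and stop at the first page that returns usable content
            content = None
            for option in reranked_options:
                print(f"\nScraping page: {option}")
                if search_provider == "Wikipedia":
                    content = get_wikipedia_page_content(option)
                elif search_provider == "Google":
                    content = get_google_page_content(option)

                if content and content != "skip page":
                    break
                # Discard unusable results so they are never sent to the LLM
                content = None

            if not content:
                print("Unable to scrape usable content from any of the search results")
                return None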
        
            # FINAL REQUEST
            #Invoke the model one more time and provide it with the content gathered from the internet
            query = f"""
            Based solely on this content:
            <content>
            {content}
            </content>
            Answer this question:
            <question>
            {question}
            </question>
            Skip any preamble or references to the tool.
            """
            messages = [{"role": "user", "content": [{"text": query}]}]
            messages_trace.append(messages)
            system_prompt = "Answer the user's question based on what was returned by the tool"
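            response = call_bedrock(messages, system_prompt, tool_config={})
            if not response:
                print("Unable to get a response from Bedrock at the final answer step")
                return None

            # Add the final response to the messages_trace array:
            messages_trace.append(response['output']['message'])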
            print(f"\nFinal answer:\n{response['output']['message']['content'][-1]['text']}\n")
            print(f"Full trace of all queries and responses:\n{json.dumps(messages_trace, indent=2)}")
        else:
            print("No need to call the internet search tool")
            print(f"\nFinal answer:\n{response['output']['message']['content'][-1]['text']}\n")
            print(f"Full trace of all queries and responses:\n{json.dumps(messages_trace, indent=2)}")
    else:
        print("No response returned from the LLM")
    return
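
The reranking step above passes the model's text straight to `json.loads`, which raises an exception if the model adds any wording around the list. A more forgiving parser might look like the sketch below. `parse_reranked_options` is a hypothetical helper that is not part of the workshop code, and falling back to the original search order is a design assumption.

```python
import json


def parse_reranked_options(model_text, original_options):
    """Parse the model's JSON list; fall back to the original order on failure."""
    # The model may wrap the list in extra text, so isolate the outermost brackets
    start, end = model_text.find("["), model_text.rfind("]")
    if start == -1 or end <= start:
        return list(original_options)
    try:
        parsed = json.loads(model_text[start:end + 1])
    except json.JSONDecodeError:
        return list(original_options)
    # Keep only choices that were actually offered, so invented entries are dropped
    ranked = [option for option in parsed if option in original_options]
    return ranked or list(original_options)


options = ["Page A", "Page B", "Page C"]
print(parse_reranked_options('Here you go: ["Page C", "Page A", "Page B"]', options))
print(parse_reranked_options("Sorry, I cannot rank these.", options))
```

Filtering against the original list also guards against the model returning a page title or URL that the search never produced.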

In [None]:
answer_question("Why is the sky blue?")

In [None]:
answer_question("Which country won the most gold medals in the 2020 olympics?")

In [None]:
answer_question("Which country won the most gold medals in the 2024 olympics?")

In [None]:
answer_question("What is the current weather in Seattle, Wa right now?")

In [None]:
answer_question("How many Grizzly bears are living in Washington State?")

***

## Web Searching and scraping with Google search

In this example we create an additional function to process the web pages based on URLs returned by the Google search:

* get_google_page_content
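    * This function uses the `requests` module to fetch a single URL and the BeautifulSoup module to parse its HTML content
    * If the URL points to a PDF file, the text is extracted with `pdfplumber` instead
    * For HTML pages, the text is then processed to remove extra spaces, blank lines, and short lines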

In [77]:
def get_google_page_content(url):
    try:
        # TODO: remove and will replace with Google API client 
        user_agents = [
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.51 Safari/537.36",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/128.0.0.0 Safari/537.36",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 17_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/128.0.6613.98 Mobile/15E148 Safari/604.1",
        "Mozilla/5.0 (iPad; CPU OS 17_6 like Mac OS X) AppleWebKit/605.1.15 (KHTML, like Gecko) CriOS/128.0.6613.98 Mobile/15E148 Safari/604.1"
        ]
        user_agent = random.choice(user_agents)
        
        # Supply common HTTP request headers sent by Chrome clients
        headers = {
            "User-Agent": user_agent,
            "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
            "Accept-Language": "en-US,en;q=0.5",
            "Accept-Encoding": "gzip, deflate",
            "Connection": "keep-alive",
            "Upgrade-Insecure-Requests": "1",
            "Sec-Fetch-Dest": "document",
            "Sec-Fetch-Mode": "navigate",
            "Sec-Fetch-Site": "none",
            "Sec-Fetch-User": "?1",
            "Cache-Control": "max-age=0",
        }

        # Use the requests module to get the contents of the URL
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()  # Raise an exception for bad status codes
        if response:
            # Check the URL to see if it is a link to a PDF doc
            # If so, extract the text from the PDF instead of parsing HTML
            if ".pdf" in url.split('/')[-1]:
                print(f"Scraping PDF file: {url}")
                try:
                    # Create a file-like object from the content
                    pdf_file = io.BytesIO(response.content)

                    # Open the PDF using pdfplumber
                    with pdfplumber.open(pdf_file) as pdf:
                        text = ""
                        # Iterate through all pages and extract text
                        for page in pdf.pages:
                            text += page.extract_text() or ""
                    return text.strip()
                except Exception as err:
                    print(f"Error processing the PDF: {err}")
                    return None
            else:
                # Parse HTML content of the web page
                soup = BeautifulSoup(response.text, 'html.parser')
                # Remove script and style elements
                for script_or_style in soup(["script", "style"]):
                    script_or_style.decompose()
                # Get the text
                text = soup.get_text()
                # Break into lines and remove leading and trailing space on each
                lines = (line.strip() for line in text.splitlines())
                # Break multi-headlines into a line each
                chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
                # Drop blank lines
                no_blank_lines = '\n'.join(chunk for chunk in chunks if chunk)
                # Break into lines again and keep only lines of at least 20 characters
                lines = no_blank_lines.splitlines()
                cleaned_text = ""
                for line in lines:
                    if len(line) >= 20:
                        # Keep a newline between lines so words from adjacent lines do not run together
                        cleaned_text += line + '\n'
                return cleaned_text
        else:
            raise Exception("No response from the web server.")
    except requests.exceptions.Timeout:
        print(f"Timeout on this URL: {url} skipping...")
        return "skip page"
    except Exception as err:
        print(f"Error while requesting content from {url} skipping...: {err}")
        return "skip page"
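
The text cleanup at the end of the HTML branch can be tried on its own with plain strings, with no web request or HTML parsing involved. This is a minimal sketch: `clean_page_text`, its `min_length` parameter, and the sample text are hypothetical, and the input stands in for what `soup.get_text()` might return.

```python
def clean_page_text(text, min_length=20):
    """Strip whitespace, split on double spaces, and drop blank or short lines."""
    # Remove leading and trailing space on each line
    lines = (line.strip() for line in text.splitlines())
    # Split runs separated by two spaces (typical of menus) into their own chunks
    chunks = (phrase.strip() for line in lines for phrase in line.split("  "))
    # Keep only chunks long enough to be real content
    return "\n".join(chunk for chunk in chunks if len(chunk) >= min_length)


# Made-up text standing in for the output of soup.get_text()
raw = """
   Home  About  Contact

   The committee published its annual report on Tuesday.
   Menu
"""
print(clean_page_text(raw))
```

Short navigation labels such as menu entries fall below the length threshold, which trims the token count sent to the LLM. Short but meaningful lines (a date or a score, for example) are dropped as well, so the threshold is a trade-off.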


In [65]:
# Set the Websearch provider
search_provider = "Google"

In [None]:
answer_question("Who won the 2019 Masters golf tournament?")

In [None]:
answer_question("Which country won the most gold medals in the 2020 olympics?")

In [None]:
answer_question("Which country won the most gold medals in the 2024 olympics?")

In [None]:
answer_question("What is the current weather in Seattle, Wa right now?")

In [None]:
answer_question("What is the current price on Amazon stock?")

In [82]:
answer_question("Who is favored to be the next Prime Minister of Canada?")

Calling tool: internet_search
Tool args are: {'question': 'Who is favored to be the next Prime Minister of Canada?'}
Searching Google...

Tool response is: ['https://www.reuters.com/world/americas/canadas-trudeau-while-more-vulnerable-could-hold-power-into-2025-2024-09-10/', 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_45th_Canadian_federal_election', 'https://abacusdata.ca/canadian-politics-august-2024/', 'https://www.reuters.com/world/americas/canadas-trudeau-far-behind-polls-remains-liberals-best-chance-2023-10-11/', 'https://www.ipsos.com/en-ca/poilievre-top-choice-for-best-pm-in-canada']
reranked_options are: ['https://www.ipsos.com/en-ca/poilievre-top-choice-for-best-pm-in-canada', 'https://abacusdata.ca/canadian-politics-august-2024/', 'https://www.reuters.com/world/americas/canadas-trudeau-far-behind-polls-remains-liberals-best-chance-2023-10-11/', 'https://en.wikipedia.org/wiki/Opinion_polling_for_the_45th_Canadian_federal_election', 'https://www.reuters.com/world/am

In [83]:
answer_question("Filetype:pdf Summarize the most recent Amazon 10K report")

Calling tool: internet_search
Tool args are: {'question': 'Filetype:pdf Most recent Amazon 10K report summary'}
Searching Google...

Tool response is: ['https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Annual-Report.pdf', 'https://ir.aboutamazon.com/sec-filings/sec-filings-details/default.aspx?FilingId=13875159', 'https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/amzn-20221231.htm', 'https://s2.q4cdn.com/299287126/files/doc_financials/2021/ar/Amazon-2020-Annual-Report.pdf', 'https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/default.aspx']
reranked_options are: ['https://s2.q4cdn.com/299287126/files/doc_financials/2024/ar/Amazon-com-Inc-2023-Annual-Report.pdf', 'https://ir.aboutamazon.com/sec-filings/sec-filings-details/default.aspx?FilingId=13875159', 'https://www.sec.gov/Archives/edgar/data/1018724/000101872423000004/amzn-20221231.htm', 'https://ir.aboutamazon.com/annual-reports-proxies-and-shareholder-letters/def