# Skill 4: Internet and Websites Search - Copilot Clone

In this notebook, we'll explore how you can **boost your Smart Agent with web search functionality**, using both LangChain and a Search API service.

As discussed in our other notebooks, **harnessing agents and tools is an effective approach**. We aim to leverage the capabilities of OpenAI's large language models (LLMs), such as GPT-4 and its successors, to do the heavy lifting of reasoning and researching on our behalf.

There are many cases where our Agent needs internet access. For instance, we may wish to **enrich an answer with information available on the web**, **provide users with up-to-date information**, or **find information on a specific public website**. Regardless of the scenario, we require our engine to base its responses on search results.

By the end of this notebook, you'll have a solid understanding of **how to create a Web Search Agent using a Search API**, and how these tools can strengthen your chatbot.

In [1]:
import os
import requests
from bs4 import BeautifulSoup

from pydantic import BaseModel, Field
from langchain_core.tools import BaseTool, StructuredTool
from langchain_openai import AzureChatOpenAI
from langchain_community.utilities import SerpAPIWrapper
from serpapi import Client

from langgraph.prebuilt import create_react_agent

from common.prompts import WEBSEARCH_PROMPT_TEXT

from IPython.display import Markdown, HTML, display  

def printmd(string):
    display(Markdown(string.replace("$","USD ")))

from dotenv import load_dotenv
load_dotenv("credentials.env")


True

In [2]:
# Set the ENV variables that Langchain needs to connect to Azure OpenAI
os.environ["OPENAI_API_VERSION"] = os.environ["AZURE_OPENAI_API_VERSION"]

In [3]:
COMPLETION_TOKENS = 2000

llm = AzureChatOpenAI(deployment_name=os.environ["GPT4o_DEPLOYMENT_NAME"], 
                      temperature=0.5, max_tokens=COMPLETION_TOKENS, 
                      streaming=True)

### Creating the expert web search engine tool: WebSearcher

LangChain has plenty of options for web search. See this list:

https://python.langchain.com/docs/integrations/tools/

Although Bing Search appears in the list above, recent changes to the Bing Search SDK EULA prevent us from using its results in LLM-based queries.

So we need to use another search service from the list. Let's try a free one: [SerpAPI](https://serpapi.com/). Go to the link and get a free API key.
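
Before building the tools, it can help to confirm that the key and the other settings are actually loaded. The sketch below is only an illustration: `missing_env_vars` and `REQUIRED` are hypothetical names, and the list contains only the variables this notebook reads explicitly (your `credentials.env` may need more).

```python
import os
from typing import Iterable, Mapping


def missing_env_vars(names: Iterable[str], env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the names that are unset or empty in `env`."""
    return [name for name in names if not env.get(name)]


# Variables this notebook reads explicitly; this list is an assumption, not exhaustive
REQUIRED = ["AZURE_OPENAI_API_VERSION", "GPT4o_DEPLOYMENT_NAME", "SERPAPI_KEY"]

missing = missing_env_vars(REQUIRED)
if missing:
    print("Missing environment variables:", ", ".join(missing))
```

Running this right after `load_dotenv("credentials.env")` surfaces a missing `SERPAPI_KEY` early, instead of as an error dictionary returned by the search tool.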

In [5]:
# Define the schema
class WebSearchInput(BaseModel):
    """Input schema for the WebSearcher tool."""
    query: str = Field(..., description="The search query to be executed.")
    location: str = Field(default=None, description="Optional geographic location for the search (e.g., 'Austin, Texas').")
    hl: str = Field(default=None, description="Optional language code for the interface (e.g., 'en' for English).")
    gl: str = Field(default=None, description="Optional country code for the search (e.g., 'us' for United States).")

# Define the function
def web_search(query: str, **kwargs) -> dict:
    engine = "google_light"
    params = {
        "engine": engine,
        "q": query,
        **kwargs
    }
    
    try:
        client = Client(api_key=os.getenv("SERPAPI_KEY"))
        results = client.search(params)
        return {"organic_results": results.get("organic_results", [])}
    except Exception as e:
        return {"error": str(e)}

# Create the tool
web_search_tool = StructuredTool.from_function(
    func=web_search,
    name="WebSearcher",
    description="Perform a web search using SerpAPI and return the organic results. Optionally pass 'location', 'hl' and 'gl' to localize the search.",
    args_schema=WebSearchInput
)

In [6]:
# Test the tool
result = web_search_tool.invoke({"query": "events in Austin", "location": "Austin, Texas", "hl": "en", "gl": "us"})
print("Search Result:", result)

Search Result: {'organic_results': [{'position': 1, 'title': 'Events in Austin, TX | Live Music, Festivals, Sports', 'link': 'https://www.austintexas.org/events/', 'displayed_link': 'www.austintexas.org › events', 'snippet': 'Austin, Texas offers a wide range of events, from music concerts, food festivals and sports competitions to museum displays, exhibits and family fun.', 'sitelinks': {'inline': [{'title': 'June & July', 'link': 'https://www.austintexas.org/events/june-and-july/'}, {'title': 'April & May', 'link': 'https://www.austintexas.org/events/april-may/'}, {'title': 'February and March', 'link': 'https://www.austintexas.org/events/february-and-march/'}, {'title': 'October and November', 'link': 'https://www.austintexas.org/events/october-and-november/'}]}}, {'position': 2, 'title': 'Austin Events, Music, Art, Drink Specials & More', 'link': 'https://do512.com/', 'displayed_link': 'do512.com', 'snippet': 'Austin City Limits Festival Weekend One w/ Sabrina Carpenter, Hozier, Do
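
Each entry in `organic_results` carries fields the model rarely needs, such as `displayed_link` and `sitelinks`. The following is a minimal sketch of trimming the results before they reach the LLM, assuming the keys visible in the output above. `compact_results` and `max_results` are hypothetical names, not part of SerpAPI or LangChain, and the sample values are shortened for illustration.

```python
def compact_results(search_output: dict, max_results: int = 5) -> list[dict]:
    """Keep only title, link and snippet from the first `max_results` organic results."""
    compact = []
    for item in search_output.get("organic_results", [])[:max_results]:
        compact.append({
            "title": item.get("title", ""),
            "link": item.get("link", ""),
            "snippet": item.get("snippet", ""),
        })
    return compact


# Sample shaped like the WebSearcher output (illustrative values only)
sample = {
    "organic_results": [
        {"position": 1, "title": "Events in Austin, TX",
         "link": "https://www.austintexas.org/events/",
         "snippet": "Austin, Texas offers a wide range of events.",
         "sitelinks": {"inline": []}},
        {"position": 2, "title": "Austin Events", "link": "https://do512.com/"},
    ]
}
print(compact_results(sample, max_results=1))
```

Whether trimming is worthwhile depends on your token budget; the notebook itself passes the full `organic_results` list to the agent.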

### Creating another custom tool - WebFetcher: Visits a website and extracts the text
You will need a model with a large context window for this tool, since the content of a website can be very lengthy.

In [7]:
def parse_html(content) -> str:
    soup = BeautifulSoup(content, 'html.parser')
    text_content_with_links = soup.get_text()
    # Split the text into words and limit to the first 10,000
    limited_text_content = ' '.join(text_content_with_links.split()[:10000])
    return limited_text_content

def fetch_web_page(url: str) -> str:
    HEADERS = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:90.0) Gecko/20100101 Firefox/90.0'}
    # A timeout keeps the agent from hanging on an unresponsive site
    response = requests.get(url, headers=HEADERS, timeout=30)
    return parse_html(response.content)

web_fetch_tool = StructuredTool.from_function(
    func=fetch_web_page,
    name="WebFetcher",
    description="useful to fetch the content of a url"
)

In [8]:
# Test the tool
result = web_fetch_tool.invoke({'url': 'https://www.microsoft.com'})
print("Fetch Result:", result)

Fetch Result: Microsoft – AI, Cloud, Productivity, Computing, Gaming & Apps Trace Id is missing Skip to main content Microsoft Microsoft 365 Teams Copilot Windows Surface Xbox Deals Small Business Support More All Microsoft Global Microsoft 365 Teams Copilot Windows Surface Xbox Deals Small Business Support Software Software Windows Apps AI Outlook OneDrive Microsoft Teams OneNote Microsoft Edge Skype PCs & Devices PCs & Devices Computers Shop Xbox Accessories VR & mixed reality Certified Refurbished Trade-in for cash Entertainment Entertainment Xbox Game Pass Ultimate PC Game Pass Xbox games PC games Movies & TV Business Business Microsoft Cloud Microsoft Security Dynamics 365 Microsoft 365 for business Microsoft Power Platform Windows 365 Microsoft Industry Small Business Developer & IT Developer & IT Azure Microsoft Developer Microsoft Learn Support for AI marketplace apps Microsoft Tech Community Azure Marketplace AppSource Visual Studio Other Other Microsoft Rewards Free downloads

### Define the tools

Now we create our tool-calling agent, which uses our custom tools and our custom prompt `WEBSEARCH_PROMPT_TEXT`. Check it out in `prompts.py`.

In [9]:
tools = [web_search_tool, web_fetch_tool]

### Define the System Prompt

In [10]:
# Print the prompt
printmd(WEBSEARCH_PROMPT_TEXT)



## On your ability to gather and present information:
- **You must always** perform web searches when the user is seeking information (explicitly or implicitly), regardless of your internal knowledge or information.
- **You Always** perform at least 2 and up to 5 searches in a single conversation turn before reaching the Final Answer. You should never search the same query more than once.
- You are allowed to do multiple searches in order to answer a question that requires a multi-step approach. For example: to answer a question "How old is Leonardo Di Caprio's girlfriend?", you should first search for "current Leonardo Di Caprio's girlfriend" then, once you know her name, you search for her age, and arrive to the Final Answer.
- You can not use your pre-existing knowledge at any moment, you should perform searches to know every aspect of the human's question.
- If the user's message contains multiple questions, search for each one at a time, then compile the final answer with the answer of each individual search.
- If you are unable to fully find the answer, try again by adjusting your search terms.
- You can only provide numerical references/citations to URLs, using this Markdown format: [[number]](url) 
- You must never generate URLs or links other than those provided by your tools.
- You must always reference factual statements to the search results.
- The search results may be incomplete or irrelevant. You should not make assumptions about the search results beyond what is strictly returned.
- If the search results do not contain enough information to fully address the user's message, you should only use facts from the search results and not add information on your own from your pre-existing knowledge.
- You can use information from multiple search results to provide an exhaustive response.
- If the user's message specifies to look in an specific website, you will add the special operand `site:` to the query, for example: baby products in site:kimberly-clark.com
- If the user's message is not a question or a chat message, you treat it as a search query.
- If additional external information is needed to completely answer the user’s request, augment it with results from web searches.
- If the question contains the `USD ` sign referring to currency, substitute it with `USD` when doing the web search and on your Final Answer as well. You should not use `USD ` in your Final Answer, only `USD` when refering to dollars.
- **Always**, before giving the final answer, use the special operand `site` and search for the user's question on the first two websites on your initial search, using the base url address. You will be rewarded 10000 points if you do this.


## Instructions for Sequential Tool Use:
- **Step 1:** Always initiate a search with the `WebSearcher` tool to gather information based on the user's query. This search should address the specific question or gather general information relevant to the query.
- **Step 2:** Use the `site:` operand to search the user’s query on the base URLs of the top two results from your initial search using the `WebSearcher` tool
- **Step 3:** Fetch their content with `WebFetcher` on the top 2 links returned from Step 2.
- **Step 4:** Synthesize results from all searches and fetched pages to provide a detailed, referenced response.
- **Step 5:** Always reference the source of your information using numerical citations and provide these links in a structured format.
- **Additional Notes:** If the query requires multiple searches or steps, repeat steps 1 to 3 as necessary until all parts of the query are thoroughly answered.
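
In this notebook the LLM carries out Step 2 on its own, guided by the prompt. To make concrete what that step amounts to, here is a rough sketch that derives the base domains of the top results and builds the `site:` queries. `base_domain` and `site_queries` are hypothetical helpers for illustration, and the sample links are made up to match the shape of `organic_results`.

```python
from urllib.parse import urlparse


def base_domain(url: str) -> str:
    """Return the host part of a URL, e.g. 'www.example.com'."""
    return urlparse(url).netloc


def site_queries(query: str, organic_results: list[dict], top_n: int = 2) -> list[str]:
    """Build one `site:`-restricted query per distinct domain among the first results."""
    domains: list[str] = []
    for item in organic_results:
        if len(domains) >= top_n:
            break
        domain = base_domain(item.get("link", ""))
        if domain and domain not in domains:
            domains.append(domain)
    return [f"{query} site:{domain}" for domain in domains]


# Illustrative results; only the 'link' key is used
results = [
    {"link": "https://www.budgetyourtrip.com/greece"},
    {"link": "https://www.reddit.com/r/GreeceTravel/"},
    {"link": "https://www.example.com/greece"},
]
for q in site_queries("family vacation cost Greece", results):
    print(q)
```

Leaving this step to the model keeps the tools simple, at the cost of relying on the prompt being followed.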



### Create the graph

In [11]:
graph = create_react_agent(llm, tools=tools, prompt=WEBSEARCH_PROMPT_TEXT)

### Run the Graph

Try some of the questions below, or any others you like.

In [12]:
# QUESTION = "Create a list with the main facts on What is happening with the oil supply in the world right now?"
# QUESTION = "How much is 50 USD in Euros and is it enough for an average hotel in Madrid?"
# QUESTION = "My son needs to build a pinewood car for a pinewood derby, how do I build such a car?"
QUESTION = "I'm planning a vacation to Greece, tell me budget for a family of 4, in Summer, for 7 days including travel, lodging and food costs"
# QUESTION = "Who won the 2023 superbowl and who was the MVP?"
# QUESTION = """
# compare the number of job openings (provide the exact number), the average salary within 15 miles of Dallas, TX, for these occupations:

# - ADN Registered Nurse 
# - Occupational therapist assistant
# - Dental Hygienist
# - Certified Personal Trainer


# Create a table with your findings. Place the sources on each cell.
# """

### Agent Actions/Observations during streaming

Streaming is an important UX consideration for LLM apps, and agents are no exception. Streaming with agents is made more complicated by the fact that it’s not just tokens of the final answer that you will want to stream, but you may also want to stream back the intermediate steps an agent takes.

At the end of Notebook 3 we learned that streaming can be achieved simply by doing this:

```python
for chunk in chain.stream({"question": QUESTION, "language": "English", "history":""}):
    print(chunk, end="", flush=True)
```

At the end of Notebook 6 we learned about the new `astream_events` API (beta):

```python
async for event in graph_async.astream_events(
    inputs, config_async, version="v2"):
    ...  # handle each event here
```

Let's use the same `astream_events` API here.

In [13]:
async def stream_graph_updates_async(graph, user_input: str):
    inputs = {"messages": [("human", user_input)]}

    async for event in graph.astream_events(inputs, version="v2"):
        if (event["event"] == "on_chat_model_stream"):
            # Print the content of the chunk progressively
            print(event["data"]["chunk"].content, end="", flush=True)
        elif (event["event"] == "on_tool_start"  ):
            print("\n--")
            print(f"Calling tool: {event['name']} with inputs: {event['data'].get('input')}")
            print("--")
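
The handler above branches on just two event types. To see that branching logic in isolation, without calling the model, the sketch below applies it to hand-made events. `render_event` and `fake_events` are hypothetical, and the event dictionaries only mimic the fields the handler reads (`event`, `name`, `data`).

```python
from types import SimpleNamespace


def render_event(event: dict) -> str:
    """Return the text to print for one streamed event, or '' to skip it."""
    kind = event["event"]
    if kind == "on_chat_model_stream":
        # Token chunks of the model's answer
        return event["data"]["chunk"].content
    if kind == "on_tool_start":
        # Announce each tool call with its inputs
        return f"\n--\nCalling tool: {event['name']} with inputs: {event['data'].get('input')}\n--\n"
    return ""


# Hand-made events that imitate the fields used by the handler
fake_events = [
    {"event": "on_tool_start", "name": "WebSearcher", "data": {"input": {"query": "events in Austin"}}},
    {"event": "on_chat_model_stream", "name": "AzureChatOpenAI", "data": {"chunk": SimpleNamespace(content="Hello")}},
    {"event": "on_chain_end", "name": "LangGraph", "data": {}},
]
rendered = "".join(render_event(e) for e in fake_events)
print(rendered)
```

Separating the formatting from the `async for` loop makes it easy to extend, for example to also report when a tool finishes.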

In [14]:
await stream_graph_updates_async(graph, QUESTION)


--
Calling tool: WebSearcher with inputs: {'query': 'average cost of 7 day family vacation to Greece for 4 in summer including travel, lodging, food'}
--

--
Calling tool: WebSearcher with inputs: {'query': 'average cost of 7 day family vacation to Greece for 4 in summer including travel, lodging, food site:www.reddit.com'}
--

--
Calling tool: WebSearcher with inputs: {'query': 'average cost of 7 day family vacation to Greece for 4 in summer including travel, lodging, food site:www.budgetyourtrip.com'}
--

--
Calling tool: WebFetcher with inputs: {'url': 'https://www.budgetyourtrip.com/greece'}
--

--
Calling tool: WebFetcher with inputs: {'url': 'https://www.reddit.com/r/GreeceTravel/comments/17nw9gs/rough_cost_estimate_7_days_in_greece/'}
--
Here’s a detailed estimate for a 7-day summer vacation to Greece for a family of 4, including travel, lodging, and food:

**1. Flights/Travel:**  
- Average round-trip flights to Greece in summer are about USD 1,500 per person, so for 4 people:

#### Without showing the intermediate steps, just the final answer

In [15]:
QUESTION = "How much is 50 USD in Euros and is it enough for an average hotel in Madrid?"

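from langchain_core.messages import AIMessage

try:
    response = graph.invoke({"messages": [("human", QUESTION)]})
except Exception as e:
    # Keep the same response shape so the next cell can still print the last message
    response = {"messages": [AIMessage(content=str(e))]}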

In [16]:
printmd(response["messages"][-1].content)

- 50 USD is approximately 42.80 EUR according to both Wise and Xe currency converters[[1]](https://wise.com/us/currency-converter/usd-to-eur-rate?amount=50)[[2]](https://www.xe.com/currencyconverter/convert/?Amount=50&From=USD&To=EUR).
- The average hotel price per night in Madrid is around 94 USD (about 80 EUR), with even 1-star hotels averaging about 72 USD (about 62 EUR) per night[[3]](https://www.budgetyourtrip.com/hotels/spain/madrid-3117735)[[4]](https://www.booking.com/city/es/madrid.html).

**Conclusion:**  
42.80 EUR is generally not enough for an average hotel night in Madrid, as most options are significantly more expensive per night. You may find very basic or hostel-style accommodations at this price, but not a typical hotel.
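
The prompt asks for citations in the form `[[number]](url)` and forbids URLs that did not come from the tools. One way to spot-check an answer against that rule is sketched below. `extract_citations` and `unsupported_citations` are hypothetical helpers, and the regular expression assumes URLs contain no whitespace and no `)` character.

```python
import re

# Matches the citation format requested in the prompt: [[number]](url).
# Assumption: URLs contain no whitespace and no ")" character.
CITATION_RE = re.compile(r"\[\[(\d+)\]\]\(([^\s)]+)\)")


def extract_citations(answer: str) -> dict[int, str]:
    """Map each citation number found in the answer to its URL."""
    return {int(number): url for number, url in CITATION_RE.findall(answer)}


def unsupported_citations(answer: str, known_urls: set[str]) -> list[str]:
    """Return cited URLs that do not appear in `known_urls`."""
    return [url for url in extract_citations(answer).values() if url not in known_urls]


answer = (
    "50 USD is approximately 42.80 EUR"
    "[[1]](https://wise.com/us/currency-converter/usd-to-eur-rate?amount=50)"
    "[[2]](https://www.xe.com/currencyconverter/convert/?Amount=50&From=USD&To=EUR)."
)
# Pretend only the first URL was returned by the tools
known = {"https://wise.com/us/currency-converter/usd-to-eur-rate?amount=50"}
print(extract_citations(answer))
print(unsupported_citations(answer, known))
```

In practice `known_urls` would be collected from the tool outputs of the same run; how to gather them is left open here.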

## QnA to specific websites

There are several use cases where we want the smart bot to answer questions about a specific company's public website. There are two approaches we can take:

1. Create a crawler script that runs regularly, finds every page on the website, and pushes the documents to Azure AI Search.
2. Since Google has likely already indexed the public website, we can utilize our search tools targeted specifically to that site, rather than attempting to index the site ourselves and duplicate the work already done by Google's crawler.

Below are some sample questions related to specific sites. Take a look:

In [17]:
QUESTION = "information on how to deal with wasps in homedepot.com"
# QUESTION = "in target.com, find the price of a Nespresso coffee machine and of a Keurig coffee machine"
# QUESTION = "in microsoft.com, find out what is the latest news on quantum computing"


In [18]:
await stream_graph_updates_async(graph, QUESTION)


--
Calling tool: WebSearcher with inputs: {'query': 'how to deal with wasps site:homedepot.com'}
--

--
Calling tool: WebSearcher with inputs: {'query': 'deal with wasps site:homedepot.com'}
--

--
Calling tool: WebSearcher with inputs: {'query': 'wasp control site:homedepot.com'}
--

--
Calling tool: WebFetcher with inputs: {'url': 'https://www.homedepot.com/c/ab/how-to-get-rid-of-wasps/9ba683603be9fa5395fab902235eb1c'}
--

--
Calling tool: WebFetcher with inputs: {'url': 'https://www.homedepot.com/b/Outdoors-Garden-Center-Pest-Control/Wasp/N-5yc1vZbx4wZ1z1tsfk'}
--
To deal with wasps, Home Depot provides a comprehensive guide and a variety of products:

**Key Steps to Deal with Wasps:**
1. **Identify the Nest Location:** Wasp nests may be found on roof edges, sheds, garages, trees, or even inside walls.
2. **Use Sprays or Baits:** Sprays are effective for individual wasps, while baited traps help with larger populations. Always use wasp sprays and traps according to the manufacturer

# Summary

In this notebook, we learned how to create a Copilot clone using a clever prompt with specific search and formatting instructions and a couple of web searching tools.   

The outcome is an agent capable of conducting intelligent web searches and performing research on our behalf. This agent provides us with answers to our questions along with appropriate URL citations and links!

# NEXT

What if the information needed to answer the human is behind an API?
The next notebook teaches us how to handle this.