## Building a Bring Your Own Browser (BYOB) Tool for Web Browsing and Summarization

**Disclaimer: This cookbook is for educational purposes only. Ensure that you comply with all applicable laws and service terms when using web search and scraping technologies. This cookbook restricts the search to the openai.com domain and retrieves only public information to illustrate the concepts.**

Large Language Models (LLMs) like GPT-4 have a knowledge cutoff date, which means they lack information about events that occurred after that point. In scenarios where the most recent data is essential, it's necessary to provide LLMs with access to current web information to ensure accurate and relevant responses.

In this guide, we will build a Bring Your Own Browser (BYOB) tool using Python to overcome this limitation. Our goal is to create a system that helps us compile a list of the most recent product launches by OpenAI. By integrating web search capabilities with an LLM, we'll enable the model to generate responses based on the latest information available online.

While you can use any publicly available search API, we'll use Google's Custom Search API to perform web searches. The information retrieved from the search results will be processed and passed to the LLM, which generates the final response through Retrieval-Augmented Generation (RAG).

**Bring Your Own Browser (BYOB)** tools allow users to perform web browsing tasks programmatically. In this notebook, we'll create a BYOB tool that:

#1 Performs web searches using a publicly available API such as the Google Custom Search API.  
#2 Retrieves and cleans text content from web pages, and summarizes the content using an LLM such as gpt-4o-mini.  
#3 Performs Retrieval-Augmented Generation (RAG) to generate a final response that cites its sources.

### Setting up a BYOB tool
Before we begin, ensure you have the following: **Python 3.7 or later** installed on your machine, a Google Custom Search API key and Custom Search Engine ID (CSE ID), and the necessary Python packages installed: `requests`, `beautifulsoup4`, `openai`, and `python-dotenv` (used below to load the API key and CSE ID from a `.env` file). Also ensure that `OPENAI_API_KEY` is set as an environment variable.

#### Step 1: Set Up a Search Engine to Provide Web Search Results
You can use any publicly available web search API to perform this task. We will configure a custom search engine using Google's Custom Search API. This engine will fetch a list of relevant web pages based on the user's query, focusing on obtaining the most recent and pertinent results.  

**a. Obtain API Credentials:** Acquire a Google API key and a Custom Search Engine ID (CSE ID) from the Google Developers Console. You can navigate to this [Programmable Search Engine Link](https://developers.google.com/custom-search/v1/overview) to set up an API key as well as Search Engine ID. 

**b. Configure Search Function:** The `search` function below sets up the search based on the search term, the API key and CSE ID, and the number of search results to return. We'll introduce a parameter `site_filter` to restrict the output to only `openai.com`.
  

In [108]:
import requests

def search(search_item, api_key, cse_id, search_depth=10, site_filter=None):
    # Query the Google Custom Search API and return a list of result items
    service_url = 'https://www.googleapis.com/customsearch/v1'

    params = {
        'q': search_item,
        'key': api_key,
        'cx': cse_id,
        'num': search_depth
    }

    try:
        response = requests.get(service_url, params=params, timeout=10)
        response.raise_for_status()
        results = response.json()
    except requests.exceptions.RequestException as e:
        print(f"An error occurred during the search: {e}")
        return []

    # 'items' may be missing from the response when there are no matches
    items = results.get('items', [])
    if not items:
        print("No search results found.")
        return []

    if site_filter is None:
        return items

    # Filter results to include only those with site_filter in the link
    filtered_results = [result for result in items if site_filter in result['link']]
    if not filtered_results:
        print(f"No results with '{site_filter}' found.")
    return filtered_results
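
The `site_filter` check above is a substring test on the link, so a URL that merely contains the filter text somewhere in its host or query string would also pass. As a minimal sketch of a stricter alternative, the snippet below compares the parsed hostname instead. The helper names `link_matches_domain` and `filter_by_domain` are hypothetical (not part of the original notebook), and the helper expects a bare domain such as `openai.com` rather than the full `https://openai.com` prefix used in this guide.

```python
from urllib.parse import urlparse

def link_matches_domain(link, domain):
    """Return True if the link's hostname is `domain` or a subdomain of it."""
    hostname = (urlparse(link).hostname or "").lower()
    domain = domain.lower()
    return hostname == domain or hostname.endswith("." + domain)

def filter_by_domain(items, domain):
    """Keep only result items (dicts with a 'link' key) hosted on `domain`."""
    return [item for item in items if link_matches_domain(item.get("link", ""), domain)]

# Illustrative sample items, not real search output
sample_items = [
    {"link": "https://openai.com/news/"},
    {"link": "https://platform.openai.com/docs"},
    {"link": "https://openai.com.example.org/fake"},
    {"link": "https://example.org/?ref=https://openai.com"},
]
print([item["link"] for item in filter_by_domain(sample_items, "openai.com")])
# ['https://openai.com/news/', 'https://platform.openai.com/docs']
```

Whether subdomains should count as matches is a design choice; drop the `endswith` clause if only the exact host is wanted.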


**c. Identify the search terms for the search engine:** A natural language prompt typically does not produce the desired results when passed directly to a search engine. An effective approach is to use an LLM to produce a suitable search term before invoking the search function.   

In this example, `search_query` expresses the user's request to list OpenAI product launches in reverse chronological order. To retrieve meaningful results, we first ask the LLM to produce a relevant search term. 

In [109]:
from openai import OpenAI

client = OpenAI()

search_query = "List the latest OpenAI product launches in chronological order from latest to oldest in the past 2 years"

search_term = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "Provide a google search term based on search query provided below in 3-4 words"},
        {"role": "user", "content": search_query}]
).choices[0].message.content

print(search_term)

Latest OpenAI product launches
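
Model output is not guaranteed to be a clean search string: it may arrive wrapped in quotes, spread over several lines, or longer than requested. A minimal sketch of a tidy-up step follows, assuming that stripping wrapping quotes and capping the word count is acceptable for your queries. The function name `normalize_search_term` and the `max_words` default are hypothetical choices for illustration.

```python
import re

def normalize_search_term(raw_term, max_words=6):
    """Tidy an LLM-generated search term before sending it to a search API."""
    # Use the first non-empty line in case the model adds commentary
    lines = [line.strip() for line in raw_term.splitlines() if line.strip()]
    if not lines:
        return ""
    # Strip wrapping quotes or backticks
    term = lines[0].strip("\"'`“”‘’ ")
    # Collapse runs of whitespace into single spaces
    term = re.sub(r"\s+", " ", term)
    # Cap the number of words
    return " ".join(term.split(" ")[:max_words])

print(normalize_search_term('"Latest OpenAI  product launches"\n'))
# Latest OpenAI product launches
```

Note that removing quotes also removes exact-phrase matching, so skip that step if you rely on quoted phrases in your searches.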


**d. Invoke the search function:** Given the search term, we invoke the search function to retrieve the results. At this point we have only the link and a snippet for each web page, not the full page content. 

In [112]:
from dotenv import load_dotenv
import os

load_dotenv('.env')

api_key = os.getenv('API_KEY')
cse_id = os.getenv('CSE_ID')

search_items = search(search_item=search_term, api_key=api_key, cse_id=cse_id, search_depth=10, site_filter="https://openai.com")
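
`os.getenv` returns `None` when a variable is missing, which would only surface later as a failed search request. A minimal sketch of failing early instead is shown below; `require_env` is a hypothetical helper and the `BYOB_DEMO_*` variable names exist only so the snippet runs on its own.

```python
import os

def require_env(name):
    """Return the value of environment variable `name`, or raise if it is unset or empty."""
    value = os.getenv(name)
    if not value:
        raise RuntimeError(f"Environment variable {name} is not set; add it to your .env file.")
    return value

# Demo only: one placeholder variable that exists and one that does not
os.environ["BYOB_DEMO_KEY"] = "placeholder-value"
os.environ.pop("BYOB_DEMO_MISSING_KEY", None)

print(require_env("BYOB_DEMO_KEY"))
try:
    require_env("BYOB_DEMO_MISSING_KEY")
except RuntimeError as err:
    print(err)
```

In the notebook, the same helper could be called as `require_env('API_KEY')` and `require_env('CSE_ID')` after `load_dotenv`.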


In [113]:
for item in search_items:
    print(f"Link: {item['link']}")
    print(f"Snippet: {item['snippet']}\n")

Link: https://openai.com/news/
Snippet: Product · Product. Jul 25, 2024. SearchGPT is a prototype of new AI search features. Home > News > Card ... (opens in a new window) (opens in a new window)

Link: https://openai.com/index/new-models-and-developer-products-announced-at-devday/
Snippet: Nov 6, 2023 ... GPT-4 Turbo with 128K context · We released the first version of GPT-4 in March and made GPT-4 generally available to all developers in July.

Link: https://openai.com/
Snippet: We believe our research will eventually lead to artificial general intelligence, a system that can solve human-level problems. Building safe and beneficial ...

Link: https://openai.com/news/product/
Snippet: Discover the latest product advancements from OpenAI and the ways they're being used by individuals and businesses.



#### Step 2: Build a Search Dictionary with Titles, URLs, and Summaries of Web Pages
After obtaining the search results, we'll extract and organize the relevant information.

**a. Scrape Web Page Content:** For each URL in the search results, scrape the web page to extract textual content while filtering out non-relevant data like scripts and advertisements.    

**b. Summarize Content:** Use an AI assistant or LLM to generate concise summaries of the scraped content, focusing on information pertinent to the user's query.  
  
**c. Create a Structured Dictionary:** Organize the data into a dictionary or a DataFrame containing the title, URL, and summary for each web page.  
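
The cleaning idea in step (a) can be tried offline on an inline HTML string before fetching live pages. The sketch below is an illustration only: it uses the standard library's `html.parser` rather than BeautifulSoup so that it has no dependencies, and the class and function names (`VisibleTextExtractor`, `html_to_text`) are hypothetical.

```python
from html.parser import HTMLParser

class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""
    SKIPPED_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        # Keep non-empty text that is not inside a skipped element
        if text and self._skip_depth == 0:
            self._chunks.append(text)

    def get_text(self):
        return " ".join(self._chunks)

def html_to_text(html):
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.get_text()

sample_html = """
<html><head><style>body {color: red;}</style>
<script>console.log("tracking");</script></head>
<body><h1>Product news</h1><p>SearchGPT is a prototype.</p></body></html>
"""
print(html_to_text(sample_html))
# Product news SearchGPT is a prototype.
```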


In [116]:
import requests
from bs4 import BeautifulSoup

TRUNCATE_SCRAPED_TEXT = 50000  # Adjust based on your model's context window
SEARCH_DEPTH = 5

def retrieve_content(url, max_tokens=TRUNCATE_SCRAPED_TEXT):
    # Fetch a page and return its visible text, truncated to roughly max_tokens tokens
    try:
        headers = {'User-Agent': 'Mozilla/5.0'}
        response = requests.get(url, headers=headers, timeout=10)
        response.raise_for_status()

        soup = BeautifulSoup(response.content, 'html.parser')
        # Remove script and style elements so only visible text remains
        for script_or_style in soup(['script', 'style']):
            script_or_style.decompose()

        text = soup.get_text(separator=' ', strip=True)
        characters = max_tokens * 4  # Approximate conversion: ~4 characters per token
        text = text[:characters]
        return text
    except requests.exceptions.RequestException as e:
        print(f"Failed to retrieve {url}: {e}")
        return None

def summarize_content(content, search_term, character_limit=500):
    # Ask the LLM for a short summary of the page, focused on the search term
    prompt = (
        f"You are an AI assistant tasked with summarizing content relevant to '{search_term}'. "
        f"Please provide a concise summary in {character_limit} characters or less."
    )
    try:
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": prompt},
                {"role": "user", "content": content}]
        )
        summary = response.choices[0].message.content
        return summary
    except Exception as e:
        print(f"An error occurred during summarization: {e}")
        return None

def get_search_results(search_items, character_limit=500):
    # Generate a summary of search results for the given search term
    results_list = []
    for idx, item in enumerate(search_items, start=1):
        url = item.get('link')
        
        snippet = item.get('snippet', '')
        web_content = retrieve_content(url, TRUNCATE_SCRAPED_TEXT)
        
        if web_content is None:
            print(f"Error: skipped URL: {url}")
        else:
            summary = summarize_content(web_content, search_term, character_limit)
            result_dict = {
                'order': idx,
                'URL': url,
                'title': snippet,
                'Summary': summary
            }
            results_list.append(result_dict)
    return results_list
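
`retrieve_content` converts a token budget to characters with a rough factor of 4 and slices the text at that point, which can cut a word in half. A minimal sketch of the same heuristic that backs up to the previous word boundary is shown below; `truncate_to_token_budget` is a hypothetical helper and the 4-characters-per-token ratio remains an approximation, not a tokenizer count.

```python
CHARS_PER_TOKEN = 4  # rough heuristic, not an exact tokenizer count

def truncate_to_token_budget(text, max_tokens, chars_per_token=CHARS_PER_TOKEN):
    """Cut text to roughly max_tokens tokens, ending on a word boundary when possible."""
    max_chars = max_tokens * chars_per_token
    if len(text) <= max_chars:
        return text
    clipped = text[:max_chars]
    # If the cut landed inside a word, drop the partial word at the end
    if not text[max_chars].isspace() and " " in clipped:
        clipped = clipped.rsplit(" ", 1)[0]
    return clipped.rstrip()

print(truncate_to_token_budget("OpenAI announced new developer products", max_tokens=3))
# OpenAI
```

For an exact budget, a real tokenizer would be needed; the heuristic is only meant to keep requests comfortably within the context window.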

In [117]:
results = get_search_results(search_items)

for result in results:
    print(f"Search order: {result['order']}")
    print(f"Link: {result['URL']}")
    print(f"Snippet: {result['title']}")
    print(f"Summary: {result['Summary']}")
    print('-' * 80)

Search order: 1
Link: https://openai.com/news/
Snippet: Product · Product. Jul 25, 2024. SearchGPT is a prototype of new AI search features. Home > News > Card ... (opens in a new window) (opens in a new window)
Summary: OpenAI has launched several new products recently, including **SearchGPT** (July 2024), a prototype enhancing AI search capabilities, and **GPT-4o mini** (July 2024), aimed at providing cost-efficient AI solutions. Other notable releases include **ChatGPT Edu** (May 2024) for educational applications, and improved data analysis features in ChatGPT. In May 2024, OpenAI also introduced **OpenAI for Nonprofits**, expanding access to AI for non-profit organizations. Continuous improvements to the fine-tuning API and custom model capabilities were also emphasized.
--------------------------------------------------------------------------------
Search order: 2
Link: https://openai.com/index/new-models-and-developer-products-announced-at-devday/
Snippet: Nov 6, 2023 ... GPT-4

#### Step 3: Generate a RAG Response to the User Query
With the structured data, we'll enhance the LLM's response by incorporating the most recent information.

**a. Combine Summaries:** Aggregate the summaries from all relevant web pages into a single context.

**b. Perform Retrieval-Augmented Generation:** Feed the combined summaries into the LLM as additional context to generate a comprehensive and up-to-date response to the user's query.

**c. Deliver the Final Response:** Present the LLM's output, which now includes information beyond its original knowledge cutoff, providing accurate and current insights.
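
One way to combine the summaries is sketched below: each page becomes a numbered section with its URL, and pages whose summary is missing are skipped. This is an illustration under assumptions, not the notebook's method (the notebook passes the results as JSON). The helper name `build_context` and the `max_chars` budget are hypothetical, the dictionary keys follow the result dictionaries built in Step 2, and the sample data is made up for the demo.

```python
def build_context(results, max_chars=4000):
    """Join per-page summaries into one numbered context block, within a character budget."""
    sections = []
    used = 0
    for result in results:
        summary = result.get("Summary")
        if not summary:
            continue  # skip pages whose summarization failed
        section = f"[{result['order']}] {result['URL']}\n{summary}"
        if used + len(section) > max_chars:
            break  # stop once the budget would be exceeded
        sections.append(section)
        used += len(section)
    return "\n\n".join(sections)

# Illustrative sample data, not real output
sample_results = [
    {"order": 1, "URL": "https://openai.com/news/", "title": "sample snippet",
     "Summary": "SearchGPT prototype announced."},
    {"order": 2, "URL": "https://openai.com/news/product/", "title": "sample snippet",
     "Summary": None},
]
print(build_context(sample_results))
# [1] https://openai.com/news/
# SearchGPT prototype announced.
```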


In [119]:
import json 

final_prompt = (
    f"The user will provide a dictionary of search results in JSON format for the search query '{search_term}'. Based on the search results provided by the user, provide a detailed response to this query: **'{search_query}'**. Make sure to cite all the sources at the end of your answer."
)

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": final_prompt},
        {"role": "user", "content": json.dumps(results)}],
    temperature=0

)
summary = response.choices[0].message.content
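
Because the prompt asks the model to cite its sources, it can be useful to check that every cited URL was actually among the retrieved pages. A minimal sketch follows, assuming the model cites sources as markdown links of the form `[Source](url)`; the helper name `check_citations` is hypothetical, and the sample answer is made up for the demo.

```python
import re

def check_citations(answer, results):
    """Compare URLs in markdown links of `answer` against the retrieved result URLs.

    Returns (unknown, unused): cited URLs that were never retrieved, and
    retrieved URLs that the answer does not cite.
    """
    cited = set(re.findall(r"\]\((https?://[^)\s]+)\)", answer))
    retrieved = {result["URL"] for result in results}
    return sorted(cited - retrieved), sorted(retrieved - cited)

# Illustrative sample data, not real model output
sample_answer = (
    "1. **SearchGPT**\n   - [Source](https://openai.com/news/)\n"
    "2. **Other**\n   - [Source](https://example.com/made-up)"
)
sample_results = [
    {"URL": "https://openai.com/news/"},
    {"URL": "https://openai.com/news/product/"},
]
unknown, unused = check_citations(sample_answer, sample_results)
print("Cited but not retrieved:", unknown)
print("Retrieved but not cited:", unused)
```

In the notebook this could be called as `check_citations(summary, results)`. The comparison is an exact string match, so differences such as a trailing slash would need normalizing first.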



In [120]:
print(summary)

Based on the provided search results, here is a chronological list of the latest OpenAI product launches from the past two years, starting from the most recent:

1. **SearchGPT** - July 25, 2024
   - A prototype enhancing AI search capabilities.
   - [Source](https://openai.com/news/)

2. **GPT-4o mini** - July 18, 2024
   - Aimed at providing cost-efficient AI solutions.
   - [Source](https://openai.com/news/product/)

3. **ChatGPT Edu** - May 30, 2024
   - Designed for educational applications.
   - [Source](https://openai.com/news/product/)

4. **OpenAI for Nonprofits** - May 2024
   - Expanding access to AI for non-profit organizations.
   - [Source](https://openai.com/news/)

5. **GPT-4 Turbo** - November 6, 2023
   - Features a 128K context window and reduced pricing.
   - [Source](https://openai.com/index/new-models-and-developer-products-announced-at-devday/)

6. **Assistants API** - November 6, 2023
   - For building custom AI applications.
   - [Source](https://openai.com/ind