# AI-based web research assistant

Researchers, analysts, and other professionals often need to extract insights quickly from large amounts of online information. A manual web search scatters that information across many sources, each of which takes time to read and synthesize. This notebook automates part of that process by combining web search with language-model summarization.

The system searches the web for relevant content and generates concise summaries. Instead of browsing dozens of search results by hand, users get key points from the top few results in a structured format with source attribution. Because the summaries are built from search-result snippets rather than full pages, they are best treated as a starting point for deeper reading. The tool uses DuckDuckGo search (through LangChain's tool wrapper) for web queries and OpenAI's language models for summarization.


In [1]:
import os
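from langchain_community.tools import DuckDuckGoSearchResults  # Current home of the DuckDuckGo tool
from langchain_openai import ChatOpenAI
from langchain_core.prompts import PromptTemplate  # Replaces the deprecated top-level import
from pydantic import BaseModel, Field  # Replaces the deprecated langchain_core.pydantic_v1 shim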
from typing import List, Dict, Any, Tuple, Optional
import re
import nltk
from dotenv import load_dotenv

# Download necessary NLTK data for text processing
nltk.download('punkt', quiet=True)  # Sentence tokenizer
nltk.download('stopwords', quiet=True)  # Common stopwords

# Load environment variables
load_dotenv()

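# Fail early if the OpenAI API key is missing (load_dotenv() has already exported it if it is in .env)
if not os.getenv("OPENAI_API_KEY"):
    raise EnvironmentError("OPENAI_API_KEY is not set; add it to your environment or .env file")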

### Initialize DuckDuckGo search engine
Here we initialize web search using LangChain's DuckDuckGo search tool. The tool requires no API key, which makes it convenient for automated research tasks. It serves as our primary interface to the web for gathering information.

In [2]:
# Initialize DuckDuckGo Search tool
search = DuckDuckGoSearchResults(backend="html")

The `search` object is now our gateway to web information. Through LangChain, it returns the snippet, title, and link of each result packed into a single formatted string, which we parse in subsequent steps.

### Define data models
To ensure consistent data handling throughout our application, we define structured data models using Pydantic. This approach provides type safety, validation, and clear documentation of our data structures, making the code more maintainable and less prone to errors.

In [3]:
class SummarizeText(BaseModel):
    """Pydantic model for text summarization input validation."""
    text: str = Field(..., title="Text to summarize", description="The text to be summarized")

This data model defines a contract for how text should be structured when it is passed to summarization code. The `Field` definition supplies metadata (a title and a description) that can be used for validation and documentation. The functions later in this notebook accept plain strings, so the model serves as an optional validation step for callers.
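
As a minimal sketch of how the model could validate input before summarization, assuming Pydantic is installed: the class is repeated so the example runs on its own, and the sample text is purely illustrative.

```python
from pydantic import BaseModel, Field, ValidationError


class SummarizeText(BaseModel):
    """Pydantic model for text summarization input validation."""
    text: str = Field(..., title="Text to summarize", description="The text to be summarized")


# A valid payload is accepted and exposes the field as an attribute
payload = SummarizeText(text="DuckDuckGo returned three snippets about AI.")
print(payload.text)

# Omitting the required field raises a ValidationError
try:
    SummarizeText()
except ValidationError as exc:
    print(f"Rejected input: {len(exc.errors())} validation error(s)")
```

A caller could wrap incoming text this way and pass `payload.text` on to the summarizer.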


### Search result processing functions
The raw output from search engines requires parsing and structuring to be useful for our summarization pipeline. This section implements functions to perform web searches and transform unstructured search results into clean, organized data that can be easily processed by subsequent components.
- Parsing search results: The output from the DuckDuckGo search API is a long, unstructured string. To make it usable, we break it into structured dictionaries that contain the snippet (a short excerpt from the page), the title, and the URL. This gives us clarity and easy access to the key parts of each result.
- Performing a web search: A function that conducts a search based on a user query. If a specific website is provided, it performs two searches:
  - One focused on the given site (`site:example.com`)
  - Another excluding it (`-site:example.com`) to diversify the perspective

It then combines the two result lists, site-specific results first, and keeps the top three.
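
The two queries described above can be sketched in isolation. `to_bare_domain` and `build_queries` are hypothetical helpers that are not part of the notebook, and stripping the scheme and path is an assumption about what the `site:` operator accepts (a bare domain such as `example.com`).

```python
from typing import List, Optional
from urllib.parse import urlparse


def to_bare_domain(site: str) -> str:
    """Reduce a URL such as 'https://www.example.com/path' to 'www.example.com'.

    Hypothetical helper: assumes the site: operator wants a host name only.
    """
    candidate = site.strip()
    # urlparse only fills netloc when the input contains a scheme separator
    parsed = urlparse(candidate if "://" in candidate else f"//{candidate}")
    return parsed.netloc or candidate


def build_queries(query: str, specific_site: Optional[str] = None) -> List[str]:
    """Return the search strings to issue for a query and an optional site."""
    if not specific_site:
        return [query]
    domain = to_bare_domain(specific_site)
    # First query targets the site, second excludes it for a broader view
    return [f"site:{domain} {query}", f"-site:{domain} {query}"]


for q in build_queries("latest AI advancements", "https://www.example.com"):
    print(q)
```

Keeping query construction separate from the network call makes it possible to check the exact strings sent to the search tool.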

In [4]:
# Parse raw search results string into structured dictionaries with title, snippet, and link
def parse_search_results(results_string: str) -> List[dict]:
    """Parse a string representation of search results into a list of dictionaries."""
    results = []

    # Results arrive concatenated in one string; split on the snippet delimiter to separate them
    entries = results_string.split(', snippet: ')

    # Process each entry
    for entry in entries[1:]:  # Skip the piece that precedes the first ', snippet: ' delimiter
        # Extract snippet and title-link portion
        parts = entry.split(', title: ')
        if len(parts) == 2:
            snippet = parts[0]

            # Further split to separate title and link
            title_link = parts[1].split(', link: ')
            if len(title_link) == 2:
                title, link = title_link

                # Create structured result dictionary
                results.append({
                    'snippet': snippet,
                    'title': title,
                    'link': link
                })
    return results

# Execute web search with optional site-specific filtering.
def perform_web_search(query: str, specific_site: Optional[str] = None) -> Tuple[List[str], List[Tuple[str, str]]]:
    """Perform a web search based on a query, optionally including a specific website."""
    try:
        if specific_site:
            # Perform site-specific search using site: operator
            specific_query = f"site:{specific_site} {query}"
            print(f"Searching for: {specific_query}")
            specific_results = search.run(specific_query)
            print(f"Specific search results: {specific_results}")
            specific_parsed = parse_search_results(specific_results)

            # Complement with general search excluding the specific site - this provides broader context while avoiding duplicate content
            general_query = f"-site:{specific_site} {query}"
            print(f"Searching for: {general_query}")
            general_results = search.run(general_query)
            print(f"General search results: {general_results}")
            general_parsed = parse_search_results(general_results)

            # Combine results, prioritizing site-specific content - limit to top 3 results to maintain focus and processing efficiency
            combined_results = (specific_parsed + general_parsed)[:3]
        else:
            # If no specific site is provided, perform a general open search
            print(f"Searching for: {query}")
            web_results = search.run(query)
            print(f"Web results: {web_results}")
            combined_results = parse_search_results(web_results)[:3]

        # Extract only the textual snippets for summarization
        web_knowledge = [result.get('snippet', '') for result in combined_results]

        # Store the title and link of each result separately for attribution
        sources = [(result.get('title', 'Untitled'), result.get('link', '')) for result in combined_results]

        print(f"Processed web_knowledge: {web_knowledge}")
        print(f"Processed sources: {sources}")
        return web_knowledge, sources
    except Exception as e:
        print(f"Error in perform_web_search: {str(e)}")
        import traceback
        traceback.print_exc()
        return [], []

This section prepares search data for AI summarization in two steps:
- `parse_search_results()` takes a raw search result string (typically formatted with `, snippet:`, `, title:`, and `, link:` delimiters) and splits it into clean dictionaries. Each dictionary contains the key elements needed: a snippet of text, the page title, and the URL.
- `perform_web_search()` makes use of this parser. It uses the DuckDuckGo tool to issue a query. If a `specific_site` is given, it searches within that site and also outside it for comparison. It then combines the two sets of results and extracts the important parts: just the text for summarization and the source info (title + URL) for reference.

By the end of this process, we are left with two clean lists: one with the text to summarize and one with corresponding sources. These are now ready to feed into the language model.
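
To make the delimiter layout concrete, here is a regex-based sketch of the same parsing idea. It is an alternative to the split-based parser, not the notebook's implementation. The sample string is hypothetical and assumes the unbracketed `snippet: ..., title: ..., link: ...` layout; the tool's real output may differ between versions.

```python
import re
from typing import Dict, List

# Hypothetical sample in the "snippet: ..., title: ..., link: ..." layout
SAMPLE = (
    "snippet: AI models keep improving., title: AI News, link: https://example.com/a, "
    "snippet: New chips speed up training., title: Chip Report, link: https://example.com/b"
)

# One match per result; the link ends where the next snippet starts or the string ends
RESULT_PATTERN = re.compile(
    r"snippet: (?P<snippet>.*?), title: (?P<title>.*?), link: (?P<link>\S+?)(?=, snippet: |$)",
    re.DOTALL,
)


def parse_results_sketch(results_string: str) -> List[Dict[str, str]]:
    """Return one dict with snippet, title, and link per matched result."""
    return [match.groupdict() for match in RESULT_PATTERN.finditer(results_string)]


parsed = parse_results_sketch(SAMPLE)
web_knowledge = [item["snippet"] for item in parsed]           # text to summarize
sources = [(item["title"], item["link"]) for item in parsed]   # attribution pairs
print(web_knowledge)
print(sources)
```

The last two lines mirror the split into snippets for summarization and `(title, link)` pairs for attribution.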

### Text summarization function
The summarization component uses an OpenAI chat model to condense each search snippet into one or two bullet points. This section implements the function that turns raw snippets into short summaries with source attribution.

In [5]:
# Generate AI-powered summary of web content with source attribution
def summarize_text(text: str, source: Tuple[str, str]) -> str:
    """Summarize the given text using OpenAI's language model."""
    try:
        # Initialize OpenAI language model - temperature 0.7 provides good balance between consistency and creativity
        llm = ChatOpenAI(temperature=0.7, model="gpt-4o-mini-2024-07-18")

        # Create structured prompt for consistent summarization output
        prompt_template = "Please summarize the following text in 1-2 bullet points:\n\n{text}\n\nSummary:"

        # Build prompt template with input variable definition
        prompt = PromptTemplate(
            template=prompt_template,
            input_variables=["text"],
        )

        # Create processing chain combining prompt and language model
        summary_chain = prompt | llm

        # Prepare input data for the AI model
        input_data = {"text": text}

        # Execute the summarization process
        summary = summary_chain.invoke(input_data)
        # Extract content from the AI response object - handle both direct string responses and object responses
        summary_content = summary.content if hasattr(summary, 'content') else str(summary)

        # Format the final output with source attribution
        formatted_summary = f"Source: {source[0]} ({source[1]})\n{summary_content.strip()}\n"
        return formatted_summary
    except Exception as e:
        # Handle AI model errors
        print(f"Error in summarize_text: {str(e)}")
        return ""

This function is the core of the research assistant. It instantiates the chat model, builds a prompt that asks for one or two bullet points, invokes the chain, and prefixes the response with the source title and URL. On any exception it prints the error and returns an empty string, which the caller uses to skip failed summaries.
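
The prompt construction and source formatting can be exercised without calling the API by passing in any callable in place of the model. This is only a sketch: `summarize_with` and `fake_llm` are hypothetical names, and plain `str.format` stands in for `PromptTemplate`.

```python
from typing import Callable, Tuple

PROMPT_TEMPLATE = "Please summarize the following text in 1-2 bullet points:\n\n{text}\n\nSummary:"


def summarize_with(llm_call: Callable[[str], str], text: str, source: Tuple[str, str]) -> str:
    """Fill the prompt, call the supplied model function, and add source attribution."""
    prompt = PROMPT_TEMPLATE.format(text=text)
    summary = llm_call(prompt)
    return f"Source: {source[0]} ({source[1]})\n{summary.strip()}\n"


def fake_llm(prompt: str) -> str:
    """Stand-in for the real model: always returns the same bullet."""
    return "- stub summary\n"


formatted = summarize_with(fake_llm, "Some snippet text.", ("Example Title", "https://example.com"))
print(formatted)
```

Injecting the model call this way keeps formatting logic testable offline; the real function would pass a wrapper around `summary_chain.invoke`.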


### Main search and summarize function
The final component brings together all previous elements into a cohesive research workflow. This function orchestrates the entire process from query input to formatted output, managing the complexity of coordinating web searches with AI summarization while maintaining source attribution throughout.

In [6]:
# Execute complete research workflow: search, extract, and summarize
def search_summarize(query: str, specific_site: Optional[str] = None) -> str:
    """Perform a web search and summarize the results."""
    # Execute web search and extract structured information (content + sources)
    web_knowledge, sources = perform_web_search(query, specific_site)

    # Validate that search produced usable results
    if not web_knowledge or not sources:
        print("No web knowledge or sources found.")
        return ""

    # Summarize each result exactly once, keeping only non-empty summaries
    summaries = []
    for knowledge, source in zip(web_knowledge, sources):
        summary = summarize_text(knowledge, source)
        if summary:  # Skip results whose summarization failed
            summaries.append(summary)

    # Combine all summaries into a single formatted output - this creates a comprehensive research report from multiple sources
    combined_summary = "\n".join(summaries)
    return combined_summary

This function coordinates all previously defined subcomponents into a linear workflow. It starts by calling the search handler to retrieve snippets and metadata. If nothing is returned, it prints a message and returns an empty string.

Otherwise, it proceeds to run each snippet through the AI summarizer, ensuring that every result is paired with its source title and URL. The function uses a `zip()` to maintain alignment between the content and its source, and filters out any cases where the summarizer returns an empty string (which could happen due to malformed input or transient API issues).

The result is a short summary of key points from several web sources, tailored to a single query. This output is ready to be printed, logged, displayed in a UI, or stored.
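
The zip-and-filter step can be sketched in isolation with a stub summarizer, so the pairing and filtering are visible without any API calls. `summarize_all` and `stub_summarize` are hypothetical names; the stub returns an empty string for empty input to imitate a failed summarization.

```python
from typing import Callable, List, Tuple

Source = Tuple[str, str]


def summarize_all(
    snippets: List[str],
    sources: List[Source],
    summarize: Callable[[str, Source], str],
) -> str:
    """Summarize each snippet once and join the non-empty results."""
    summaries = []
    for snippet, source in zip(snippets, sources):
        summary = summarize(snippet, source)  # exactly one call per result
        if summary:  # an empty string signals a failed summarization
            summaries.append(summary)
    return "\n".join(summaries)


calls: List[str] = []


def stub_summarize(text: str, source: Source) -> str:
    """Hypothetical stand-in for the model: fails on empty text."""
    calls.append(text)
    return f"Source: {source[0]} ({source[1]})\n- {text}\n" if text else ""


report = summarize_all(
    ["first snippet", "", "third snippet"],
    [("A", "https://example.com/a"), ("B", "https://example.com/b"), ("C", "https://example.com/c")],
    stub_summarize,
)
print(report)
```

Storing each summary in a variable before testing it means every result costs one summarizer call, which matters when each call is a billed API request.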


### Example usage
To demonstrate the research assistant's capabilities, we will execute a real research query. This example shows how users can leverage the tool for investigating current developments in artificial intelligence, with optional focus on academic sources.

In [7]:
# Define research parameters
query = "What are the latest advancements in artificial intelligence?" # Choose a topic
specific_site = "https://www.nature.com"  # Optional: specify a site or set to None

# Execute the complete research workflow
result = search_summarize(query, specific_site)

# Display formatted results with clear context
print(f"Summary of latest advancements in AI (including information from {specific_site if specific_site else 'various sources'}):")
print(result)

Searching for: site:https://www.nature.com What are the latest advancements in artificial intelligence?
Specific search results: 
Searching for: -site:https://www.nature.com What are the latest advancements in artificial intelligence?
General search results: 
Processed web_knowledge: []
Processed sources: []
No web knowledge or sources found.
Summary of latest advancements in AI (including information from https://www.nature.com):

In this run both searches returned empty strings, so the workflow exited without producing summaries. One likely factor is the `site:` value: the operator expects a bare domain such as `nature.com`, not a full URL with a scheme. Rate limiting or the search tool's backend setting can also lead to empty results, so it is worth retrying with `specific_site = "nature.com"` or `None`.

