# Adding a component-level eval to the research workflow


We will design a **component-level evaluation** to check the quality of sources returned by the research step.  

The evaluation will compare the URLs retrieved by the agent against a **predefined list of preferred domains** (e.g., `arxiv.org`, `nature.com`, `nasa.gov`).  

This allows you to quantify whether the system is pulling information from trustworthy sources, using an **objective, per-example ground truth evaluation**.


###  overview

The idea is to verify whether the web search tool is returning sources from preferred domains, and to quantify the ratio of preferred vs. total results. This evaluation will be implemented as a single function that performs an **objective, per-example check**. It will:

* Parse the Tavily output (web search tool).  
* Identify which URLs belong to the list of **preferred domains**.  
* Compute the ratio of preferred vs. total retrieved sources.  
* Return both a boolean flag (**PASS/FAIL**) and a Markdown-formatted summary that can be embedded directly into reports.  

In [1]:
# =========================
# Imports
# =========================

# --- Standard library 
from datetime import datetime
import json
import re

# --- Third-party ---
from aisuite import Client

# --- Local / project ---
import research_tools
import utils

client = Client()

In [2]:
from IPython.display import Markdown, display

## Defining Tools

In [3]:
import os
import xml.etree.ElementTree as ET

import requests
from dotenv import load_dotenv
from tavily import TavilyClient
import wikipedia

# Init env
load_dotenv()  # load variables 

False

In [4]:
# Set user-agent for requests to arXiv
session = requests.Session()
session.headers.update({
    "User-Agent": "LF-ADP-Agent/1.0 (mailto:your.email@example.com)"
})

def arxiv_search_tool(query: str, max_results: int = 5) -> list[dict]:
    """
    Searches arXiv for research papers matching the given query.
    """
    url = f"https://export.arxiv.org/api/query?search_query=all:{query}&start=0&max_results={max_results}"

    try:
        response = session.get(url, timeout=60)
        response.raise_for_status()
    except requests.exceptions.RequestException as e:
        return [{"error": str(e)}]

    try:
        root = ET.fromstring(response.content)
        ns = {'atom': 'http://www.w3.org/2005/Atom'}

        results = []
        for entry in root.findall('atom:entry', ns):
            title = entry.find('atom:title', ns).text.strip()
            authors = [author.find('atom:name', ns).text for author in entry.findall('atom:author', ns)]
            published = entry.find('atom:published', ns).text[:10]
            url_abstract = entry.find('atom:id', ns).text
            summary = entry.find('atom:summary', ns).text.strip()

            link_pdf = None
            for link in entry.findall('atom:link', ns):
                if link.attrib.get('title') == 'pdf':
                    link_pdf = link.attrib.get('href')
                    break

            results.append({
                "title": title,
                "authors": authors,
                "published": published,
                "url": url_abstract,
                "summary": summary,
                "link_pdf": link_pdf
            })

        return results
    except Exception as e:
        return [{"error": f"Parsing failed: {str(e)}"}]


arxiv_tool_def = {
    "type": "function",
    "function": {
        "name": "arxiv_search_tool",
        "description": "Searches for research papers on arXiv by query string.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search keywords for research papers."
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return.",
                    "default": 5
                }
            },
            "required": ["query"]
        }
    }
}

In [5]:
def tavily_search_tool(query: str, max_results: int = 5, include_images: bool = False) -> list[dict]:
    """
    Perform a search using the Tavily API.

    Args:
        query (str): The search query.
        max_results (int): Number of results to return (default 5).
        include_images (bool): Whether to include image results.

    Returns:
        list[dict]: A list of dictionaries with keys like 'title', 'content', and 'url'.
    """
    params = {}
    api_key = os.getenv("TAVILY_API_KEY")
    if not api_key:
        raise ValueError("TAVILY_API_KEY not found in environment variables.")
    params['api_key'] = api_key

    #client = TavilyClient(api_key)

    api_base_url = os.getenv("DLAI_TAVILY_BASE_URL")
    if api_base_url:
        params['api_base_url'] = api_base_url

    client = TavilyClient(api_key=api_key, base_url=api_base_url)

    try:
        response = client.search(
            query=query,
            max_results=max_results,
            include_images=include_images
        )

        results = []
        for r in response.get("results", []):
            results.append({
                "title": r.get("title", ""),
                "content": r.get("content", ""),
                "url": r.get("url", "")
            })

        if include_images:
            for img_url in response.get("images", []):
                results.append({"image_url": img_url})

        return results

    except Exception as e:
        return [{"error": str(e)}]  # For LLM-friendly agents
    

tavily_tool_def = {
    "type": "function",
    "function": {
        "name": "tavily_search_tool",
        "description": "Performs a general-purpose web search using the Tavily API.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search keywords for retrieving information from the web."
                },
                "max_results": {
                    "type": "integer",
                    "description": "Maximum number of results to return.",
                    "default": 5
                },
                "include_images": {
                    "type": "boolean",
                    "description": "Whether to include image results.",
                    "default": False
                }
            },
            "required": ["query"]
        }
    }
}

In [6]:
## Wikipedia search tool
def wikipedia_search_tool(query: str, sentences: int = 5) -> list[dict]:
    """
    Searches Wikipedia for a summary of the given query.

    Args:
        query (str): Search query for Wikipedia.
        sentences (int): Number of sentences to include in the summary.

    Returns:
        list[dict]: A list with a single dictionary containing title, summary, and URL.
    """
    try:
        page_title = wikipedia.search(query)[0]
        page = wikipedia.page(page_title)
        summary = wikipedia.summary(page_title, sentences=sentences)

        return [{
            "title": page.title,
            "summary": summary,
            "url": page.url
        }]
    except Exception as e:
        return [{"error": str(e)}]

# Tool definition
wikipedia_tool_def = {
    "type": "function",
    "function": {
        "name": "wikipedia_search_tool",
        "description": "Searches for a Wikipedia article summary by query string.",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search keywords for the Wikipedia article."
                },
                "sentences": {
                    "type": "integer",
                    "description": "Number of sentences in the summary.",
                    "default": 5
                }
            },
            "required": ["query"]
        }
    }
}

In [7]:
# Tool mapping
tool_mapping = {
    "tavily_search_tool": tavily_search_tool,
    "arxiv_search_tool": arxiv_search_tool,
    "wikipedia_search_tool": wikipedia_search_tool
}

## Research Step

Defining a function `find_references` for web search functionality(**gather external information** from tools such as **Arxiv**, **Tavily**, and **Wikipedia**).

In [8]:
def find_references(task: str, model: str = "openai:gpt-4o", return_messages: bool = False):
    """Perform a research task using external tools (arxiv, tavily, wikipedia)."""

    prompt = f"""
    You are a research function with access to:
    - arxiv_tool: academic papers
    - tavily_tool: general web search (return JSON when asked)
    - wikipedia_tool: encyclopedic summaries

    Task:
    {task}

    Today is {datetime.now().strftime('%Y-%m-%d')}.
    """.strip()

    messages = [{"role": "user", "content": prompt}]
    tools = [
        research_tools.arxiv_search_tool,
        research_tools.tavily_search_tool,
        research_tools.wikipedia_search_tool,
    ]

    try:
        response = client.chat.completions.create(
            model=model,
            messages=messages,
            tools=tools,
            tool_choice="auto",
            max_turns=5,
        )
        content = response.choices[0].message.content
        return (content, messages) if return_messages else content
    except Exception as e:
        return f"[Model Error: {e}]"

In [9]:
research_task = "Find 2 recent papers about recent developments in black hole science"
research_result = find_references(research_task)

print(research_result)

I couldn't find recent papers published around your specification, but here are two papers on recent developments in black hole science available via arXiv:

1. **Title**: [Accretion onto Supermassive Black Holes in Quasars: Learning from Optical/UV Observations](http://arxiv.org/abs/astro-ph/0606678v1)  
   **Authors**: Paola Marziani, Deborah Dultzin-Hacyan, Jack W. Sulentic  
   **Published**: 2006-06-28  
   **Summary**: This paper explores the complexities of accretion processes in quasars and active galactic nuclei, which are critical to understanding black hole physics and dynamics. It emphasizes the current research aimed at accurately measuring black hole mass and accretion rates using optical and UV broad emission lines. These measurements are key to understanding the evolution of quasars and related cosmological phenomena.
   [Read PDF](https://arxiv.org/pdf/astro-ph/0606678v1) 

2. **Title**: [Disturbing the Black Hole](http://arxiv.org/abs/gr-qc/9805045v1)  
   **Authors**

## Evaluation Step

Not all sources retrieved by web search are equally reliable.  

- If the problem lies in web search (usually the **first step** in a graded lab workflow), rerunning the *entire* pipeline (search → draft → reflect) every time can be **expensive** and noisy.  
- Small improvements in web search quality may be hidden by randomness introduced by later components.  
- By evaluating the web search *alone*, you get a **clearer signal** of whether that component is improving.  

Component-level evals are also efficient when multiple teams are working on different pieces of a system: each team can optimize its own component using a clear metric, without needing to run or wait for full end-to-end tests.  

In [10]:
# list of preferred domains for Tavily results
TOP_DOMAINS = {
    # General reference / institutions / publishers
    "wikipedia.org", "nature.com", "science.org", "sciencemag.org", "cell.com",
    "mit.edu", "stanford.edu", "harvard.edu", "nasa.gov", "noaa.gov", "europa.eu",

    # CS/AI venues & indexes
    "arxiv.org", "acm.org", "ieee.org", "neurips.cc", "icml.cc", "openreview.net",

    # Other reputable outlets
    "elifesciences.org", "pnas.org", "jmlr.org", "springer.com", "sciencedirect.com",

    # Extra domains (case-specific additions)
    "pbs.org", "nova.edu", "nvcc.edu", "cccco.edu",

    # Well known programming sites
    "codecademy.com", "datacamp.com"
}

In [11]:
def evaluate_tavily_results(TOP_DOMAINS, raw: str, min_ratio=0.4):
    """
    Evaluate whether plain-text research results mostly come from preferred domains.

    Args:
        TOP_DOMAINS (set[str]): Set of preferred domains (e.g., 'arxiv.org', 'nature.com').
        raw (str): Plain text or Markdown containing URLs.
        min_ratio (float): Minimum preferred ratio required to pass (e.g., 0.4 = 40%).

    Returns:
        tuple[bool, str]: (flag, markdown_report)
        flag -> True if PASS, False if FAIL
        markdown_report -> Markdown-formatted summary of the evaluation
    """

    # Extract URLs from the text
    url_pattern = re.compile(r'https?://[^\s\]\)>\}]+', flags=re.IGNORECASE)
    urls = url_pattern.findall(raw)

    if not urls:
        return False, """### Evaluation — Tavily Preferred Domains
        No URLs detected in the provided text. 
        Please include links in your research results.
        """

    # Count preferred vs total
    total = len(urls)
    preferred_count = 0
    details = []

    for url in urls:
        domain = url.split("/")[2]
        preferred = any(td in domain for td in TOP_DOMAINS)
        if preferred:
            preferred_count += 1
        details.append(f"- {url} → {'✅ PREFERRED' if preferred else '❌ NOT PREFERRED'}")

    ratio = preferred_count / total if total > 0 else 0.0
    flag = ratio >= min_ratio

    # Markdown report
    report = f"""
    ### Evaluation — Tavily Preferred Domains
    - Total results: {total}
    - Preferred results: {preferred_count}
    - Ratio: {ratio:.2%}
    - Threshold: {min_ratio:.0%}
    - Status: {"✅ PASS" if flag else "❌ FAIL"}

    **Details:**
    {chr(10).join(details)}
    """
    return flag, report

In [12]:
print(f"Sample Trusted Domains: \n{json.dumps(list(TOP_DOMAINS)[:4], indent=2)}\n{'-'*50}\n")
print(f"Research Results: \n{research_result}\n{'-'*50}\n")

flag, report = evaluate_tavily_results(TOP_DOMAINS, research_result)
print(f"Evaluation Summary: \n{display(Markdown(report))}")

Sample Trusted Domains: 
[
  "noaa.gov",
  "wikipedia.org",
  "codecademy.com",
  "pnas.org"
]
--------------------------------------------------

Research Results: 
I couldn't find recent papers published around your specification, but here are two papers on recent developments in black hole science available via arXiv:

1. **Title**: [Accretion onto Supermassive Black Holes in Quasars: Learning from Optical/UV Observations](http://arxiv.org/abs/astro-ph/0606678v1)  
   **Authors**: Paola Marziani, Deborah Dultzin-Hacyan, Jack W. Sulentic  
   **Published**: 2006-06-28  
   **Summary**: This paper explores the complexities of accretion processes in quasars and active galactic nuclei, which are critical to understanding black hole physics and dynamics. It emphasizes the current research aimed at accurately measuring black hole mass and accretion rates using optical and UV broad emission lines. These measurements are key to understanding the evolution of quasars and related cosmological


    ### Evaluation — Tavily Preferred Domains
    - Total results: 4
    - Preferred results: 4
    - Ratio: 100.00%
    - Threshold: 40%
    - Status: ✅ PASS

    **Details:**
    - http://arxiv.org/abs/astro-ph/0606678v1 → ✅ PREFERRED
- https://arxiv.org/pdf/astro-ph/0606678v1 → ✅ PREFERRED
- http://arxiv.org/abs/gr-qc/9805045v1 → ✅ PREFERRED
- https://arxiv.org/pdf/gr-qc/9805045v1 → ✅ PREFERRED
    

Evaluation Summary: 
None


## Experiments

In [13]:
topic = "recent developments in black hole science" 
min_ratio = 0.4 
run_reflection = True

# Short list of preferred domains
TOP_DOMAINS = {
    "wikipedia.org", "nature.com", "science.org", "arxiv.org",
    "nasa.gov", "mit.edu", "stanford.edu", "harvard.edu"
}

# Show a sample of preferred domains
print(f"Sample Preferred Domains: \n{json.dumps(list(TOP_DOMAINS)[:4], indent=2)}\n{'-'*50}\n")

# 1) Research
research_task = f"Find 2–3 key papers and reliable overviews about {topic}."
research_output = find_references(research_task)
print(f"Research Results: \n{research_result}\n{'-'*50}\n")

# 2) Evaluate sources (preferred domains ratio)
flag, eval_md = evaluate_tavily_results(TOP_DOMAINS, research_output, min_ratio=min_ratio)
print(f"Evaluation Summary: \n{display(Markdown(eval_md))}")

Sample Preferred Domains: 
[
  "stanford.edu",
  "wikipedia.org",
  "nasa.gov",
  "harvard.edu"
]
--------------------------------------------------

Research Results: 
I couldn't find recent papers published around your specification, but here are two papers on recent developments in black hole science available via arXiv:

1. **Title**: [Accretion onto Supermassive Black Holes in Quasars: Learning from Optical/UV Observations](http://arxiv.org/abs/astro-ph/0606678v1)  
   **Authors**: Paola Marziani, Deborah Dultzin-Hacyan, Jack W. Sulentic  
   **Published**: 2006-06-28  
   **Summary**: This paper explores the complexities of accretion processes in quasars and active galactic nuclei, which are critical to understanding black hole physics and dynamics. It emphasizes the current research aimed at accurately measuring black hole mass and accretion rates using optical and UV broad emission lines. These measurements are key to understanding the evolution of quasars and related cosmologi


    ### Evaluation — Tavily Preferred Domains
    - Total results: 7
    - Preferred results: 7
    - Ratio: 100.00%
    - Threshold: 40%
    - Status: ✅ PASS

    **Details:**
    - https://arxiv.org/abs/astro-ph/0606678v1 → ✅ PREFERRED
- https://arxiv.org/pdf/astro-ph/0606678v1 → ✅ PREFERRED
- http://arxiv.org/abs/gr-qc/9805045v1 → ✅ PREFERRED
- https://arxiv.org/pdf/gr-qc/9805045v1 → ✅ PREFERRED
- http://arxiv.org/abs/0805.3007v2 → ✅ PREFERRED
- https://arxiv.org/pdf/0805.3007v2 → ✅ PREFERRED
- https://en.wikipedia.org/wiki/Black_hole_information_paradox → ✅ PREFERRED
    

Evaluation Summary: 
None


## Takeaways

* Component-level evaluation checked whether the retrieved URLs were in a predefined list of **preferred domains**. This is an example of an **objective evaluation** with a clear **per-example ground truth**.  
* To build an evaluation set, design ~10 prompts covering different topics (astronomy, robotics, finance, etc.) and define preferred domains for each. The percentage of retrieved sources that matched the list of preferred domains provides a useful **metric** to guide improvements, such as adjusting the prompt or tool parameters.  
* This approach is **simpler and cheaper** than evaluating full essays with reflection and rewrites, since it only focus on the web search component.  

---