# `webai.tools` — User Guide

Unified web search interface over **Tavily** and **OpenAI** web search, with normalized `SearchResult` output and optional LangChain `Document` conversion.

```
webai/tools.py
│
├── SearchProvider (Enum)
│   ├── TAVILY                       Tavily Search API
│   └── OPENAI                       OpenAI native web search
│
├── SearchResult (dataclass)
│   ├── title: str
│   ├── content: str
│   ├── source: str
│   └── raw_data: dict
│
└── WebSearcher
    ├── __init__(provider, tavily_api_key, openai_api_key,
    │            max_results, include_raw_content, debug)
    │
    ├── search_tavily(query, topic)   → list[SearchResult]   direct Tavily call
    ├── search_openai(query)          → list[SearchResult]   direct OpenAI call
    ├── search(query, provider,       → list[SearchResult]   unified + fallback
    │         topic, fallback)
    │
    ├── format_results(results)       → str    numbered list, title+source+content
    ├── format_minimal(results)       → str    title+content, no source
    ├── format_content_only(results)  → str    content with dividers
    ├── get_content_only(results)     → str    joined plain text
    ├── get_content_list(results)     → list[str]
    ├── get_first_content(results)    → str | None
    └── to_documents(results)         → list[Document]
```

| Section | Needs API keys |
|---|---|
| 1 — Setup | No |
| 2 — Data types | No |
| 3 — Initialization | No |
| 4 — Provider-specific search | Yes (`TAVILY_API_KEY` / `OPENAI_API_KEY`) |
| 5 — Unified `search()` | Yes |
| 6 — Formatting methods | **No** (uses mock results) |
| 7 — End-to-end pattern | Yes |
| 8 — Error handling reference | Partial |

## Sections
1. [Setup](#1-setup)
2. [Data types](#2-data-types)
3. [Initialization](#3-initialization)
4. [Provider-specific search](#4-provider-specific-search)
5. [Unified search() with fallback](#5-unified-search-with-fallback)
6. [Formatting methods](#6-formatting-methods)
7. [End-to-end pattern](#7-end-to-end-pattern)
8. [Error handling reference](#8-error-handling-reference)

## 1 — Setup <a id="1-setup"></a>

In [None]:
import logging
import os

from dotenv import load_dotenv

from webai.tools import SearchProvider, SearchResult, WebSearcher

load_dotenv()

logging.basicConfig(
    level=logging.WARNING,  # flip to DEBUG to see full internal trace
    format="%(asctime)s [%(levelname)s] %(name)s — %(message)s",
    force=True,
)

_HAS_TAVILY = bool(os.environ.get("TAVILY_API_KEY"))
_HAS_OPENAI = bool(os.environ.get("OPENAI_API_KEY"))

print(f"TAVILY_API_KEY : {'set' if _HAS_TAVILY else 'MISSING'}")
print(f"OPENAI_API_KEY : {'set' if _HAS_OPENAI else 'MISSING'}")

## 2 — Data types <a id="2-data-types"></a>

These two types are the lingua franca of the module: a `SearchProvider` selects which backend runs a search, and every search returns a list of `SearchResult` objects.

### 2a — `SearchProvider`

An `Enum` that selects the search backend. Pass it to `WebSearcher(provider=...)` or to a specific `search()` call as an override.

| Member | Value | Behaviour |
|---|---|---|
| `SearchProvider.TAVILY` | `"tavily"` | Returns N structured results with real URLs |
| `SearchProvider.OPENAI` | `"openai"` | Returns 1 synthesized prose answer |

In [None]:
# Enumerate members — no API key needed
for member in SearchProvider:
    print(f"SearchProvider.{member.name:8s}  value={member.value!r}")

# Access by value (useful when deserialising stored config)
provider_from_string = SearchProvider("tavily")
print(f"\nSearchProvider('tavily') -> {provider_from_string}")
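
When the provider name comes from a config file or an environment variable, a lookup by value raises `ValueError` for unknown strings, as with any `Enum`. The sketch below shows one tolerant way to parse such input. `SearchProviderStub` is a stand-in that mirrors the two documented members so the block runs without `webai` installed, and `parse_provider` is a hypothetical helper, not part of the module.

```python
from enum import Enum


class SearchProviderStub(Enum):
    """Stand-in mirroring the documented members of SearchProvider."""
    TAVILY = "tavily"
    OPENAI = "openai"


def parse_provider(raw: str, default: SearchProviderStub) -> SearchProviderStub:
    """Hypothetical helper: map a config string to a member, else use default."""
    try:
        # Enum lookup by value is case-sensitive, so normalise the input first.
        return SearchProviderStub(raw.strip().lower())
    except ValueError:
        # Unknown provider name: fall back instead of crashing at startup.
        return default


print(parse_provider(" Tavily ", SearchProviderStub.OPENAI))
print(parse_provider("bing", SearchProviderStub.OPENAI))
```

Whether to fall back silently or fail fast on an unknown name is a design choice; failing fast is usually safer for production configuration.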

### 2b — `SearchResult`

A `dataclass`, intended to be treated as immutable, that normalises the raw provider response into a consistent shape.

| Field | Type | Notes |
|---|---|---|
| `title` | `str` | Real page title (Tavily) or `"Web search: <query>"` (OpenAI) |
| `content` | `str` | Page snippet or full synthesized prose |
| `source` | `str` | Source URL; may be `""` for OpenAI results |
| `raw_data` | `dict` | Original provider response dict for advanced use |

You can construct `SearchResult` objects directly — useful for testing formatting methods without making live API calls.

In [None]:
# Construct results manually — no API key needed
r = SearchResult(
    title="Nvidia Q4 2025 Earnings Beat",
    content="Nvidia reported Q4 revenue of $39.3B, up 78% year-over-year, driven by data-center GPU demand.",
    source="https://www.reuters.com/technology/nvidia-earnings-2025",
    raw_data={"score": 0.99, "url": "https://www.reuters.com/technology/nvidia-earnings-2025"},
)

print(f"title   : {r.title}")
print(f"content : {r.content}")
print(f"source  : {r.source}")
print(f"raw_data: {r.raw_data}")

## 3 — Initialization <a id="3-initialization"></a>

`WebSearcher.__init__` initializes both provider clients eagerly and **never raises** — missing keys cause the respective provider to be silently skipped. The first call to a missing provider raises `ValueError` at search time.

### Constructor parameters

| Parameter | Default | Effect |
|---|---|---|
| `provider` | `SearchProvider.OPENAI` | Default provider for `search()` calls |
| `tavily_api_key` | `$TAVILY_API_KEY` | Falls back to env var |
| `openai_api_key` | `$OPENAI_API_KEY` | Falls back to env var |
| `max_results` | `5` | Max Tavily results; **ignored** by OpenAI |
| `include_raw_content` | `False` | Full page text from Tavily; **ignored** by OpenAI |
| `debug` | `False` | Sets `webai.tools` logger to `DEBUG` (backwards-compat shortcut) |

> **Both keys in `.env`:** Even if you only use one provider, the other client initializes silently when its key is present. Only the key you need is strictly required.
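
The "never raises at construction, `ValueError` at search time" contract can be illustrated with a small stand-in. This is a sketch of the documented behaviour only, not the actual implementation: `LazyFailSearcher`, its `env` parameter, and the placeholder client object are all assumptions made so the block runs without API keys.

```python
import os
from typing import Mapping, Optional


class LazyFailSearcher:
    """Stand-in illustrating the documented contract, not the real class."""

    def __init__(
        self,
        tavily_api_key: Optional[str] = None,
        env: Optional[Mapping[str, str]] = None,  # hypothetical, eases testing
    ) -> None:
        env = os.environ if env is None else env
        # An explicit argument wins; otherwise fall back to the environment.
        key = tavily_api_key or env.get("TAVILY_API_KEY")
        # A missing key leaves the client unset instead of raising.
        self.tavily_tool = object() if key else None

    def search_tavily(self, query: str) -> list:
        if self.tavily_tool is None:
            # The configuration error surfaces here, at search time.
            raise ValueError("Tavily is not initialized (missing API key)")
        return []  # a real implementation would call the API here


searcher = LazyFailSearcher(env={})  # construction succeeds without a key
try:
    searcher.search_tavily("test")
except ValueError as exc:
    print(f"raised at search time: {exc}")
```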

In [None]:
# Tavily-only searcher — OpenAI key is optional here
tavily_searcher = WebSearcher(
    provider=SearchProvider.TAVILY,
    max_results=5,
    include_raw_content=False,
)
print(f"provider             : {tavily_searcher.provider}")
print(f"max_results          : {tavily_searcher.max_results}")
print(f"include_raw_content  : {tavily_searcher.include_raw_content}")
print(f"tavily_tool ready    : {tavily_searcher.tavily_tool is not None}")
print(f"openai ready         : {tavily_searcher.llm is not None}")

In [None]:
# Full searcher — both providers initialized if both keys are present
full_searcher = WebSearcher(
    provider=SearchProvider.TAVILY,  # default for .search() calls
    max_results=3,
)
print(f"tavily_tool ready : {full_searcher.tavily_tool is not None}")
print(f"openai ready      : {full_searcher.llm is not None}")

## 4 — Provider-specific search <a id="4-provider-specific-search"></a>

Call `search_tavily` or `search_openai` directly when you need explicit control over the provider and don't want the fallback logic of `search()`.

### 4a — `search_tavily(query, topic)`

Returns up to `max_results` structured `SearchResult` objects with real titles, URLs, and page snippets.

| `topic` value | Best for |
|---|---|
| `"general"` | Broad web search (default) |
| `"news"` | Recent news articles |
| `"finance"` | Financial data, earnings, market news |

In [None]:
if not _HAS_TAVILY:
    print("Skipping — requires TAVILY_API_KEY.")
else:
    results = tavily_searcher.search_tavily(
        "NVIDIA earnings 2025", topic="finance"
    )

    print(f"Results returned : {len(results)}")
    print()
    for i, r in enumerate(results, 1):
        print(f"[{i}] {r.title}")
        print(f"    source  : {r.source}")
        print(f"    content : {r.content[:120]}...")
        print()

### 4b — `search_openai(query)`

Returns **one** `SearchResult` whose `content` is a synthesized prose answer from OpenAI's built-in web search tool. `max_results`, `include_raw_content`, and `topic` have no effect here.

The `source` field may be empty if OpenAI's response did not include URL citations in `AIMessage.additional_kwargs["annotations"]`.

In [None]:
if not _HAS_OPENAI:
    print("Skipping — requires OPENAI_API_KEY.")
else:
    openai_searcher = WebSearcher(provider=SearchProvider.OPENAI)
    results_openai = openai_searcher.search_openai("NVIDIA earnings 2025")

    print(f"Results returned : {len(results_openai)}  (always 1 for OpenAI)")
    r = results_openai[0]
    print(f"title   : {r.title}")
    print(f"source  : {r.source!r}  (may be empty string)")
    print(f"content : {r.content[:300]}...")

## 5 — Unified `search()` with fallback <a id="5-unified-search-with-fallback"></a>

`search()` is the main entry point for most use-cases. It dispatches to the configured provider and, if that provider fails and `fallback=True` (the default), retries the query with the other one.

### Parameters

| Parameter | Default | Notes |
|---|---|---|
| `query` | *(required)* | Search query string |
| `provider` | instance default | Per-call override; does not change `self.provider` |
| `topic` | `"general"` | Tavily topic; emits `UserWarning` if passed to OpenAI |
| `fallback` | `True` | Retry with the other provider on failure |

### Fallback logic

```
search(query, provider=TAVILY, fallback=True)
    └─ try search_tavily(query, topic)
           OK  -> return results
           ERR -> try search_openai(query)
                      OK  -> return results
                      ERR -> raise RuntimeError("Both providers failed...")
```
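
The control flow in the diagram above can be sketched with plain callables. This is an illustration under stated assumptions, not the module's code: `search_with_fallback`, `broken`, and `working` are hypothetical names, and catching every `Exception` from the primary is an assumption about which errors trigger the fallback.

```python
from typing import Callable, List


def search_with_fallback(
    primary: Callable[[str], List[str]],
    secondary: Callable[[str], List[str]],
    query: str,
    fallback: bool = True,
) -> List[str]:
    """Try primary; on error try secondary; if both fail, raise RuntimeError."""
    try:
        return primary(query)
    except Exception as primary_err:
        if not fallback:
            raise  # re-raise the primary exception unchanged
        try:
            return secondary(query)
        except Exception as secondary_err:
            raise RuntimeError(
                f"Both providers failed: {primary_err!r}; {secondary_err!r}"
            ) from secondary_err


def broken(query: str) -> List[str]:
    """Simulates a provider whose API call fails."""
    raise RuntimeError("primary is down")


def working(query: str) -> List[str]:
    """Simulates a healthy provider."""
    return [f"result for {query}"]


print(search_with_fallback(broken, working, "test"))  # ['result for test']
```

With `fallback=False` the same call re-raises the primary error, which matches the "re-raises primary exception" row in the error handling reference.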

In [None]:
# Basic search() call using the instance default provider (Tavily)
if not _HAS_TAVILY:
    print("Skipping — requires TAVILY_API_KEY.")
else:
    results = full_searcher.search("AI chip market trends 2025")
    print(f"Default provider : {full_searcher.provider.value}  (fallback may have answered)")
    print(f"Results returned : {len(results)}")
    for i, r in enumerate(results, 1):
        print(f"  [{i}] {r.title[:80]}")

In [None]:
# Per-call provider override — force OpenAI for this one query
# even though the instance default is Tavily
if not _HAS_OPENAI:
    print("Skipping — requires OPENAI_API_KEY.")
else:
    import warnings
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", category=UserWarning)  # suppress only the topic= UserWarning for demo
        results_oa = full_searcher.search(
            "AI chip market trends 2025",
            provider=SearchProvider.OPENAI,
        )
    print(f"Results returned : {len(results_oa)}  (always 1 for OpenAI)")
    print(f"Content preview  : {results_oa[0].content[:300]}...")

## 6 — Formatting methods <a id="6-formatting-methods"></a>

All seven formatters are **pure functions** over `list[SearchResult]` — no network calls. This section uses hand-crafted mock results so it runs without any API keys.

| Method | Returns | Includes source | Best for |
|---|---|---|---|
| `format_results` | `str` | Yes | Human-readable output, debugging |
| `format_minimal` | `str` | No | Clean display without attribution |
| `format_content_only` | `str` | No | Divider-separated snippets |
| `get_content_only` | `str` | No | Newline-joined string for LLM context |
| `get_content_list` | `list[str]` | No | Programmatic content access |
| `get_first_content` | `str\|None` | No | Quick "top result" extraction |
| `to_documents` | `list[Document]` | In metadata | LangChain pipeline integration |
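
To make "pure functions over `list[SearchResult]`" concrete, here is a sketch of how the three content accessors could behave, based only on the table above. `ResultStub` is a stand-in with the four documented fields, the free functions are illustrative rather than the module's methods, and the blank-line separator is an assumption (the guide only says "newline-joined").

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ResultStub:
    """Stand-in with the four documented SearchResult fields."""
    title: str
    content: str
    source: str = ""
    raw_data: dict = field(default_factory=dict)


def get_content_list(results: List[ResultStub]) -> List[str]:
    """One content string per result, in the original order."""
    return [r.content for r in results]


def get_content_only(results: List[ResultStub], sep: str = "\n\n") -> str:
    """Join all content into one string; the separator is an assumption."""
    return sep.join(get_content_list(results))


def get_first_content(results: List[ResultStub]) -> Optional[str]:
    """Content of the top result, or None for an empty list."""
    return results[0].content if results else None


stubs = [
    ResultStub(title="A", content="first snippet"),
    ResultStub(title="B", content="second snippet"),
]
print(get_content_list(stubs))
print(get_first_content([]))
```

None of these touch the network or any instance state, which is why the cells below can run on hand-built results.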

In [None]:
# Shared mock results used in all formatting cells below — no API key needed
MOCK_RESULTS = [
    SearchResult(
        title="NVIDIA Q4 2025 Earnings Beat Expectations",
        content="Nvidia reported Q4 revenue of $39.3B, up 78% YoY, led by data-center GPU demand from hyperscalers.",
        source="https://www.reuters.com/technology/nvidia-q4-2025",
        raw_data={"score": 0.99},
    ),
    SearchResult(
        title="AMD Challenges Nvidia in AI Chip Race",
        content="AMD's MI300X GPU is gaining traction with cloud providers seeking alternatives to Nvidia's H100.",
        source="https://www.wsj.com/technology/amd-mi300x-ai-chips",
        raw_data={"score": 0.97},
    ),
    SearchResult(
        title="Global Semiconductor Outlook 2025",
        content="Analysts forecast 12% global semiconductor revenue growth in 2025, driven by AI training hardware.",
        source="https://www.ft.com/content/semiconductor-outlook-2025",
        raw_data={"score": 0.95},
    ),
]

# Shared formatter instance: __new__ skips __init__, so no clients are created
# and no keys are needed (used for formatting only, not for search)
_fmt = WebSearcher.__new__(WebSearcher)
print(f"Mock results ready: {len(MOCK_RESULTS)}")

In [None]:
# format_results — numbered list with title, source URL, and content
print("=== format_results ===")
print(_fmt.format_results(MOCK_RESULTS))

# Empty-list guard
print("--- empty list ---")
print(_fmt.format_results([]))

In [None]:
# format_minimal — title + content, no source URL
print("=== format_minimal ===")
print(_fmt.format_minimal(MOCK_RESULTS))

In [None]:
# format_content_only — content blocks separated by numbered dividers
print("=== format_content_only ===")
print(_fmt.format_content_only(MOCK_RESULTS))

In [None]:
# get_content_only — single string, newline-joined; ideal as LLM context input
joined = _fmt.get_content_only(MOCK_RESULTS)
print("=== get_content_only ===")
print(joined)

# get_content_list — list[str]; one string per result for programmatic access
content_list = _fmt.get_content_list(MOCK_RESULTS)
print(f"\n=== get_content_list ({len(content_list)} items) ===")
for i, c in enumerate(content_list, 1):
    print(f"  [{i}] {c[:60]}...")

# get_first_content — quick top-result extraction
first = _fmt.get_first_content(MOCK_RESULTS)
print(f"\n=== get_first_content ===\n{first}")

# edge case: empty list returns None
print(f"\nget_first_content([]) -> {_fmt.get_first_content([])!r}")

In [None]:
# to_documents — LangChain Document objects for RAG/retrieval pipelines
docs = _fmt.to_documents(MOCK_RESULTS)

print(f"=== to_documents ({len(docs)} docs) ===")
for i, doc in enumerate(docs, 1):
    print(f"\n[{i}]")
    print(f"  page_content : {doc.page_content[:80]}...")
    print(f"  metadata.title  : {doc.metadata['title']}")
    print(f"  metadata.source : {doc.metadata['source']}")
    print(f"  metadata keys   : {list(doc.metadata.keys())}")

## 7 — End-to-end pattern <a id="7-end-to-end-pattern"></a>

A realistic pipeline: search with the default provider (Tavily for `full_searcher`), fall back to OpenAI if needed, then package the content as context for an LLM chain. The helper is self-contained and is called with the `full_searcher` built in section 3.

In [None]:
def search_and_summarize(
    searcher: WebSearcher,
    query: str,
    topic: str = "general",
    max_chars: int = 2000,
) -> dict:
    """
    Search with the searcher's default provider (falling back to the other
    provider on failure), build a context string from the results, and
    return a structured dict ready to pass to an LLM.

    Args:
        searcher:  Initialized WebSearcher instance.
        query:     Search query.
        topic:     Tavily topic ('general', 'news', 'finance').
        max_chars: Truncate total context to this many characters.

    Returns:
        dict with keys 'query', 'n_results', 'sources', 'context'.
    """
    results = searcher.search(query, topic=topic, fallback=True)
    context = searcher.get_content_only(results)[:max_chars]
    sources = [r.source for r in results if r.source]
    return {
        "query":     query,
        "n_results": len(results),
        "sources":   sources,
        "context":   context,
    }


if not _HAS_TAVILY:
    print("Skipping — requires TAVILY_API_KEY.")
else:
    payload = search_and_summarize(
        full_searcher,
        "semiconductor sector outlook 2025",
        topic="finance",
    )

    print(f"query      : {payload['query']}")
    print(f"n_results  : {payload['n_results']}")
    print(f"sources    : {len(payload['sources'])} URLs")
    for url in payload["sources"]:
        print(f"  {url}")
    print(f"\ncontext ({len(payload['context'])} chars):")
    print(payload["context"][:500], "...")
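
The dict returned by `search_and_summarize` still has to be turned into a prompt before it reaches a model. One possible way to do that is sketched below. `build_prompt` is a hypothetical helper, the prompt wording is arbitrary, and `example_payload` is placeholder data shaped like the documented return value, so the block runs without API keys.

```python
def build_prompt(payload: dict, question: str) -> str:
    """Hypothetical helper: render the payload dict as a grounded LLM prompt."""
    # Number the sources so the model can cite them; handle the empty case,
    # which can occur when OpenAI returns no URL citations.
    source_lines = "\n".join(
        f"[{i}] {url}" for i, url in enumerate(payload["sources"], 1)
    ) or "(no sources returned)"
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{payload['context']}\n\n"
        f"Sources:\n{source_lines}\n\n"
        f"Question: {question}"
    )


example_payload = {
    "query": "semiconductor sector outlook 2025",
    "n_results": 1,
    "sources": ["https://example.com/outlook"],  # placeholder URL
    "context": "Placeholder context text.",
}
print(build_prompt(example_payload, "What is the sector outlook?"))
```

The resulting string can be passed to whichever chat model or chain the surrounding application uses.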

## 8 — Error handling reference <a id="8-error-handling-reference"></a>

| Method | Condition | Exception |
|---|---|---|
| `search_tavily` | Tavily not initialized (missing key) | `ValueError` |
| `search_tavily` | Tavily API call fails | `RuntimeError` |
| `search_openai` | OpenAI not initialized (missing key) | `ValueError` |
| `search_openai` | OpenAI API call fails | `RuntimeError` |
| `search` | Primary fails + `fallback=False` | re-raises primary exception |
| `search` | Both providers fail | `RuntimeError("Both providers failed...")` |

> **`__init__` never raises.** Missing keys are silently ignored at construction time; errors surface at search time.
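
Because the two exception types signal different problems, callers may want to treat them differently: a `ValueError` indicates missing configuration and will not fix itself, while a `RuntimeError` from a failed API call may be transient. The sketch below retries only the latter. `search_with_retry` and `flaky` are hypothetical names, and the retry policy is an assumption rather than something the module provides.

```python
import time
from typing import Callable, List


def search_with_retry(
    do_search: Callable[[str], List[str]],
    query: str,
    attempts: int = 3,
    delay_s: float = 0.0,
) -> List[str]:
    """Retry RuntimeError (API failure); let ValueError (missing key) propagate."""
    for attempt in range(1, attempts + 1):
        try:
            return do_search(query)
        except RuntimeError:
            if attempt == attempts:
                raise  # out of attempts: surface the last API failure
            time.sleep(delay_s)
    raise ValueError("attempts must be at least 1")


calls = {"n": 0}


def flaky(query: str) -> List[str]:
    """Simulates a provider that fails twice, then succeeds."""
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient API failure")
    return [f"result for {query}"]


print(search_with_retry(flaky, "test"))  # ['result for test']
```

In practice `do_search` could be a bound method such as `searcher.search_tavily`, with a non-zero `delay_s` to avoid hammering the API.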

In [None]:
# ValueError: provider-specific search on a searcher with no keys
# (__init__ won't raise; the errors surface at search time)
import os as _os
_saved_tv = _os.environ.pop("TAVILY_API_KEY", None)
_saved_oa = _os.environ.pop("OPENAI_API_KEY", None)

try:
    no_key_searcher = WebSearcher()  # no keys -> both tools are None

    try:
        no_key_searcher.search_tavily("test")
    except ValueError as e:
        print(f"search_tavily (no key) -> ValueError: {e}")

    try:
        no_key_searcher.search_openai("test")
    except ValueError as e:
        print(f"search_openai (no key) -> ValueError: {e}")
finally:
    # Restore keys even if an unexpected exception escaped above
    if _saved_tv is not None:
        _os.environ["TAVILY_API_KEY"] = _saved_tv
    if _saved_oa is not None:
        _os.environ["OPENAI_API_KEY"] = _saved_oa
    print("Keys restored.")

In [None]:
# fallback=False re-raises the primary error immediately
# Demonstrated with a no-key searcher (no live call needed)
_saved_tv = _os.environ.pop("TAVILY_API_KEY", None)
_saved_oa = _os.environ.pop("OPENAI_API_KEY", None)

try:
    no_key_searcher2 = WebSearcher(provider=SearchProvider.TAVILY)

    try:
        no_key_searcher2.search("test query", fallback=False)
    except (ValueError, RuntimeError) as e:
        print(f"search(fallback=False) -> {type(e).__name__}: {e}")
finally:
    # Restore keys even if an unexpected exception escaped above
    if _saved_tv is not None:
        _os.environ["TAVILY_API_KEY"] = _saved_tv
    if _saved_oa is not None:
        _os.environ["OPENAI_API_KEY"] = _saved_oa
    print("Keys restored.")