- **TavilyMap**: Automatically discovers and maps website structure
- **TavilyExtract**: Extracts clean, structured content from webpages

# 0. Setup

In [1]:
# !pip install langchain-tavily certifi
#
# # for pretty printing and visualization
# !pip install rich

In [2]:
import asyncio
import os
import ssl
from typing import Any, Dict, List

import certifi
from langchain_tavily import TavilyExtract, TavilyMap # langchain tools
from rich.console import Console
from rich.panel import Panel

# configure SSL to use certifi's CA bundle, which helps avoid
# certificate errors when making many requests to the Tavily API
ssl_context = ssl.create_default_context(cafile=certifi.where())
os.environ["SSL_CERT_FILE"] = certifi.where()
os.environ["REQUESTS_CA_BUNDLE"] = certifi.where()  # variable name read by the requests library

# initialize rich console for pretty printing
console = Console()

print("✅ All imports successful!")

✅ All imports successful!


In [3]:
from dotenv import load_dotenv
load_dotenv()

True
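
`load_dotenv()` returning `True` only says that a `.env` file was loaded, not that the key the Tavily tools need is present. A minimal sketch of an explicit check follows, assuming the key is stored under `TAVILY_API_KEY`; `require_env` is a hypothetical helper written for this example, not part of any library.

```python
import os


def require_env(name: str) -> str:
    """Return the value of an environment variable, or raise a clear error."""
    value = os.environ.get(name, "").strip()
    if not value:
        raise RuntimeError(
            f"{name} is not set; add it to your .env file or export it first."
        )
    return value


# Usage in the notebook, after load_dotenv():
# require_env("TAVILY_API_KEY")  # assumed variable name for the Tavily key
```

Failing here gives a clearer message than an authentication error surfacing later from the first API call.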

# 1. TavilyMap: Website Structure Discovery

TavilyMap automatically discovers and maps websites by crawling through links. It's perfect for:
- Documentation sites
- Blog archives
- Knowledge bases
- Any structured websites

**KEY PARAMETERS:**
- `max_depth`: how many link levels to follow from the starting URL (default: 1)
- `max_breadth`: how many links to follow per page (default: 20)
- `limit`: maximum total number of pages to discover (default: 50)
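
To make the three limits concrete, here is a minimal sketch of a breadth-first walk over a toy, in-memory link graph. It only illustrates how depth, breadth, and a total page limit interact; it is not Tavily's implementation. `map_links`, `toy_site`, and the choice to count the start page toward `limit` are assumptions made for the example.

```python
from collections import deque
from typing import Dict, List


def map_links(graph: Dict[str, List[str]], start: str,
              max_depth: int, max_breadth: int, limit: int) -> List[str]:
    """Breadth-first walk over a toy link graph with three bounds."""
    discovered = [start]
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth >= max_depth:
            continue  # pages at max_depth are kept, but their links are not followed
        # follow at most max_breadth links from this page
        for link in graph.get(page, [])[:max_breadth]:
            if link in seen:
                continue
            if len(discovered) >= limit:
                return discovered  # total page budget is used up
            seen.add(link)
            discovered.append(link)
            queue.append((link, depth + 1))
    return discovered


# hypothetical site: each key is a page, each value is the links found on it
toy_site = {
    "/": ["/docs", "/blog", "/about"],
    "/docs": ["/docs/a", "/docs/b"],
    "/docs/a": ["/docs/a/deep"],
}

shallow = map_links(toy_site, "/", max_depth=1, max_breadth=10, limit=100)
narrow = map_links(toy_site, "/", max_depth=2, max_breadth=2, limit=100)
capped = map_links(toy_site, "/", max_depth=3, max_breadth=10, limit=3)
```

In this toy model, `shallow` stops at the pages linked from the start page, `narrow` skips `/about` because only two links per page are followed, and `capped` stops as soon as three pages have been collected.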

In [5]:
# initialize TavilyMap with custom settings
tavily_map = TavilyMap(
    max_depth=3,     # crawl up to 3 levels deep
    max_breadth=15,  # follow up to 15 links per page
    limit=50,        # limit to 50 total pages for demo
) # this tool takes a URL as input

print("✅ TavilyMap initialized successfully")

✅ TavilyMap initialized successfully


## 1.1 Demo: Mapping a documentation site

Let's map the structure of a popular documentation site. We'll use the LangChain documentation as an example.

In [10]:
# example website to map
demo_url = "https://python.langchain.com/docs/introduction/"

console.print(f"⚙️ Mapping website structure for: {demo_url}", style="bold blue")
console.print("This may take a while...")

# map the website structure
# tavily_map is a LangChain tool, so it is called with .invoke()
site_map = tavily_map.invoke(demo_url)

# display results
urls = site_map.get('results', [])
console.print(f"\n✅ Successfully mapped {len(urls)} URLs...", style="bold green")

# show first 10 URLs as examples
console.print("\n🔥 First 10 discovered URLs:", style="bold yellow")
for i, url in enumerate(urls[:10], 1):
    console.print(f"  {i:2d}. {url}")

if len(urls) > 10:
    console.print(f"  ... and {len(urls) - 10} more URLs.")

In [26]:
site_map
#urls

{'base_url': 'https://python.langchain.com/docs/introduction/',
 'results': ['https://python.langchain.com/docs/introduction',
  'https://python.langchain.com/api_reference',
  'https://python.langchain.com/docs/contributing',
  'https://python.langchain.com/docs/concepts',
  'https://python.langchain.com/docs/people',
  'https://python.langchain.com/docs/tutorials',
  'https://python.langchain.com/docs/how_to',
  'https://python.langchain.com/docs/security',
  'https://python.langchain.com/docs/how_to/document_loader_office_file',
  'https://python.langchain.com/docs/how_to/chatbots_tools',
  'https://python.langchain.com/docs/concepts/callbacks',
  'https://python.langchain.com/docs/concepts/few_shot_prompting',
  'https://python.langchain.com/docs/how_to/tools_model_specific',
  'https://python.langchain.com/docs/how_to/lcel_cheatsheet',
  'https://python.langchain.com/docs/how_to/sql_csv',
  'https://python.langchain.com/docs/how_to/self_query',
  'https://python.langchain.com/docs
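
The flat `results` list is easier to reason about once it is grouped by site section. Below is a small sketch that groups URLs by their leading path segments using only the standard library; `group_by_section` and its `depth` parameter are hypothetical names introduced here, and the sample URLs are copied from the output above.

```python
from collections import defaultdict
from typing import Dict, List
from urllib.parse import urlparse


def group_by_section(urls: List[str], depth: int = 2) -> Dict[str, List[str]]:
    """Group URLs by their first `depth` path segments."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        # split the path into non-empty segments, e.g. ["docs", "how_to", "sql_csv"]
        segments = [s for s in urlparse(url).path.split("/") if s]
        key = "/" + "/".join(segments[:depth])
        groups[key].append(url)
    return dict(groups)


sample = [
    "https://python.langchain.com/docs/how_to",
    "https://python.langchain.com/docs/how_to/sql_csv",
    "https://python.langchain.com/docs/concepts/callbacks",
    "https://python.langchain.com/api_reference",
]
sections = group_by_section(sample)
for section, members in sections.items():
    print(f"{section}: {len(members)} URL(s)")
```

In the notebook, `group_by_section(urls)` could be used to pick which sections to pass on to extraction.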

# 2. TavilyExtract: Clean Content Extraction

TavilyExtract takes URLs and returns clean, structured content without ads, navigation, or other noise. It's perfect for:
- Documentation processing
- Content analysis
- Research and data collection
- Building knowledge bases

**KEY FEATURES:**
- Removes HTML markup and navigation
- Extracts main content only
- Handles JavaScript-rendered content
- Batch processing support

In [11]:
# initialize TavilyExtract
tavily_extract = TavilyExtract()

print("✅ TavilyExtract initialized successfully")

✅ TavilyExtract initialized successfully


## 2.1 Demo: Extracting content from URLs

Let's extract clean content from some of the URLs we discovered earlier.

In [27]:
urls[:2]

['https://python.langchain.com/docs/introduction',
 'https://python.langchain.com/api_reference']

In [20]:
# select a list of URLs for extraction
sample_urls = urls[:2]
console.print(f"📚 Extracting content from {len(sample_urls)} URLs...", style="bold blue")

# extract content asynchronously (all sample URLs go in a single request)
extraction_result = await tavily_extract.ainvoke(input={"urls": sample_urls})

# display results
extracted_docs = extraction_result.get('results', [])
console.print(f"\n✅ Successfully extracted {len(extracted_docs)} documents!", style="bold green")

# show summary of each extracted document
for i, doc in enumerate(extracted_docs, 1):
    url = doc.get('url', 'Unknown')
    content = doc.get('raw_content', '')

    # create a panel for each document
    panel_content = f"""URL: {url}
Content Length: {len(content):,} characters
Preview: {content[:300]}..."""

    console.print(Panel(panel_content, title=f"Document {i}", border_style="blue"))
    print()  # Add spacing







## 2.2 Batch Processing Demo

For larger datasets, we can process URLs in batches to improve throughput and stay within rate limits.

In [47]:
def chunk_urls(urls: List[str], chunk_size: int = 3) -> List[List[str]]:
    """Split URLs into chunks of specified size."""
    chunks = []
    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        chunks.append(chunk)
    return chunks

async def extract_batch(urls: List[str], # a batch
                        batch_num: int # for logging
                        ) -> List[Dict[str, Any]]:
    """Extract documents from a batch of URLs."""
    try:
        console.print(f"🔄 Processing batch {batch_num} with {len(urls)} URLs", style="blue")
        docs = await tavily_extract.ainvoke(input={"urls": urls})
        results = docs.get('results', [])
        console.print(f"✅ Batch {batch_num} completed - extracted {len(results)} documents", style="green")
        return results
    except Exception as e:
        console.print(f"❌ Batch {batch_num} failed: {e}", style="red")
        return []


# process a larger set of URLs in batches
url_batches = chunk_urls(urls[:9], chunk_size=3) # take first 9 URLs for batch demo, splits into batches of 3

console.print(f"📦 Processing 9 URLs in {len(url_batches)} batches", style="bold yellow")

# Process batches concurrently
tasks = [extract_batch(batch, i + 1) for i, batch in enumerate(url_batches)]
batch_results = await asyncio.gather(*tasks, return_exceptions=True)

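# flatten results, skipping any batch that came back as an exception
# (possible because gather was called with return_exceptions=True)
all_extracted = []
for batch_result in batch_results:
    if isinstance(batch_result, BaseException):
        continue
    all_extracted.extend(batch_result)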

console.print(f"\n🎉 Batch processing complete! Total documents extracted: {len(all_extracted)}", style="bold green")
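
The cell above starts every batch at once with `asyncio.gather`, which is fine for three batches but does nothing to throttle requests. If rate limits become a concern, one option is to cap the number of batches in flight with `asyncio.Semaphore`. The sketch below shows the pattern with a stand-in coroutine instead of a real API call; `gather_limited`, `fake_extract`, and `demo` are hypothetical names used only for illustration.

```python
import asyncio
from typing import Awaitable, Callable, List, TypeVar

T = TypeVar("T")


async def gather_limited(jobs: List[Callable[[], Awaitable[T]]],
                         max_concurrent: int = 2) -> List[T]:
    """Run zero-argument coroutine factories, at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def run(job: Callable[[], Awaitable[T]]) -> T:
        async with semaphore:  # wait here until a slot is free
            return await job()

    # gather preserves the order of the jobs in its result list
    return await asyncio.gather(*(run(job) for job in jobs))


async def fake_extract(batch_num: int) -> int:
    """Stand-in for a network call; just sleeps briefly."""
    await asyncio.sleep(0.01)
    return batch_num


async def demo() -> List[int]:
    # n=n binds the current loop value to each lambda
    jobs = [lambda n=n: fake_extract(n) for n in range(1, 6)]
    return await gather_limited(jobs, max_concurrent=2)
```

Inside a notebook, `await demo()` runs it directly; in a plain script, use `asyncio.run(demo())`. With the functions from the cell above, the jobs could be built as `[lambda b=batch, n=i + 1: extract_batch(b, n) for i, batch in enumerate(url_batches)]`.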

In [32]:
url_batches

[['https://python.langchain.com/docs/introduction',
  'https://python.langchain.com/api_reference',
  'https://python.langchain.com/docs/contributing'],
 ['https://python.langchain.com/docs/concepts',
  'https://python.langchain.com/docs/people',
  'https://python.langchain.com/docs/tutorials'],
 ['https://python.langchain.com/docs/how_to',
  'https://python.langchain.com/docs/security',
  'https://python.langchain.com/docs/how_to/document_loader_office_file']]

In [28]:
tasks

[<coroutine object extract_batch at 0x00000193B98C4480>,
 <coroutine object extract_batch at 0x00000193B98C5360>,
 <coroutine object extract_batch at 0x00000193B98C5470>]

In [46]:
batch_results[2][2]

{'url': 'https://python.langchain.com/docs/how_to/document_loader_office_file',
 'images': []}

In [45]:
all_extracted[8]

{'url': 'https://python.langchain.com/docs/how_to/document_loader_office_file',
 'images': []}
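
As the output above shows, an extracted entry can come back with a `url` but no `raw_content` key. Before using the documents downstream, it may help to separate the usable ones from the empty ones. This is a minimal sketch; `split_by_content` is a hypothetical helper, and the sample documents are made up to mirror the shape seen above.

```python
from typing import Any, Dict, List, Tuple


def split_by_content(
    docs: List[Dict[str, Any]],
) -> Tuple[List[Dict[str, Any]], List[str]]:
    """Return (docs with non-empty raw_content, URLs of docs without it)."""
    usable: List[Dict[str, Any]] = []
    empty_urls: List[str] = []
    for doc in docs:
        # treat a missing key, None, and whitespace-only text the same way
        if (doc.get("raw_content") or "").strip():
            usable.append(doc)
        else:
            empty_urls.append(doc.get("url", "Unknown"))
    return usable, empty_urls


# made-up documents shaped like the extraction results
sample_docs = [
    {"url": "https://example.com/a", "raw_content": "Some page text", "images": []},
    {"url": "https://example.com/b", "images": []},
    {"url": "https://example.com/c", "raw_content": None, "images": []},
]
usable_docs, empty_urls = split_by_content(sample_docs)
print(f"{len(usable_docs)} usable, {len(empty_urls)} without content")
```

In the notebook, `split_by_content(all_extracted)` would give the list of URLs worth retrying or inspecting by hand.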